On 6/26/06, Felipe Ortega <glimmer_phoenix(a)yahoo.es> wrote:
In the past few weeks I've read a bunch of mail
messages talking about what is precisely my first goal: extracting behavioral conclusions
from a quantitative analysis of the Wikipedia database dumps in all languages.
But despite all my efforts, and some mails offering myself as a contributor, I have
received no answer from the Wikipedia community. I only want to contribute to what I
think is a very interesting area, and I hope it could lead me to build an interesting
thesis on this topic. I'm currently developing some scripts in Python that analyze the
database dumps.
I wrote a paper with some preliminary results, if you would like to take a glance at it.
I only ask for some collaboration from anyone involved with the project; otherwise I may
have to conclude that my efforts don't matter to anyone.
Felipe,
I'm a bachelor's student at the University of Texas at Dallas, also
working on what I'm calling fine-grained statistics for Wikipedia,
using Python to interpret text from the database dumps. :)
I've been working on it off and on for almost a year. The big
problems for me have been disk space and wikitext parsing. After
fiddling for quite a while trying to make my own parser, I have
finally broken down and started using the HTML as rendered by MediaWiki,
then using that as the basis for the rest.
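
Concretely, the idea is just to strip MediaWiki's rendered output down to its
visible text and work from that. Here's a very rough sketch of what I mean
(it assumes you already have the rendered HTML for each revision, e.g. from a
local MediaWiki install; nothing here is final):

    # Strip MediaWiki's rendered HTML down to its visible text, which can
    # then be tokenized and compared across revisions.
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collects the text content of a rendered page, ignoring tags."""

        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            text = data.strip()
            if text:
                self.chunks.append(text)

    def visible_text(rendered_html):
        extractor = TextExtractor()
        extractor.feed(rendered_html)
        return " ".join(extractor.chunks)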
My basic goal is to provide statistics for things such as "how many
revisions has this piece of text survived" and similar, then render
that information onto a reader's Wikipedia browser page. As a second
goal, it'd be nice to find some combination of stats that suggests
which bits of a page are more likely to be trustworthy.
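
To give a sense of the first statistic, a crude version looks like this
(just a sketch, assuming each revision has already been reduced to plain
text, oldest first): for every sentence in the newest revision, walk
backwards through the history until the sentence first disappears.

    def sentences(text):
        # Crude sentence split; a real tokenizer would do better.
        return [s.strip() for s in text.split(".") if s.strip()]

    def survival_counts(revisions):
        """revisions: list of plain-text revisions, oldest first."""
        latest = revisions[-1]
        counts = {}
        for sentence in sentences(latest):
            survived = 0
            # Walk backwards from the newest revision until the
            # sentence is no longer present verbatim.
            for text in reversed(revisions):
                if sentence in text:
                    survived += 1
                else:
                    break
            counts[sentence] = survived
        return counts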
How far have you gotten, and does it sound like we're on the same track?
Cheers,
Jeremy