[Foundation-l] Wikipedia meets git

jamesmikedupont at googlemail.com jamesmikedupont at googlemail.com
Sat Oct 17 08:40:59 UTC 2009


I have

On Sat, Oct 17, 2009 at 10:18 AM, Gregory Maxwell <gmaxwell at gmail.com> wrote:
> On Fri, Oct 16, 2009 at 10:31 AM, Anthony <wikimail at inbox.org> wrote:
>> On Fri, Oct 16, 2009 at 12:45 AM, jamesmikedupont at googlemail.com
>>> if you want only the last 3 revisions checked out , it takes about 10
>>> seconds and produces 300k of data.
>>
>> 10 seconds?  That's horrible.  Have you tried using svn?
>

> Still— much of the neat things that can be done by having the article
> in git are only possible if you have the complete history, for
> example: generating a blame map needs the entire history.

Yes, and if you just want to view and edit, then you need only one revision.
If you want to do more, you can pull the history.

>
> It would be nice if the git archival format was more efficient for the
> kinds of changes made in Wikipedia articles: Source code changes tends
> to have short lines and changes tend to change a significant portion
> of the lines, while edits on Wikipedia are far more likely to change
> only part of a very long line (really, a paragraph).... so working
> with line level deltas is efficient for source code while inefficient
> for Wikipedia data.

I have started working on the blame code, to bring it down to the
character level and to learn how it works.
I am willing to invest some time to learn how to make git better for WMF.
It is much more interesting than hacking PHP code.
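
To illustrate the point about long lines, here is a minimal sketch (not
git's actual delta code) that compares how much text a line-level delta
and a word-level delta have to carry for the same edit. It uses only
Python's difflib; the function names and the sample text are made up
for illustration.

```python
import difflib


def changed_chars_by_line(old: str, new: str) -> int:
    """Count characters in every new line that differs from the old text.

    A line-level delta has to carry each changed line in full.
    """
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    total = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            total += sum(len(line) for line in new_lines[j1:j2])
    return total


def changed_chars_by_word(old: str, new: str) -> int:
    """Count characters in only the words that were inserted or replaced."""
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    total = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            total += sum(len(word) for word in new_words[j1:j2])
    return total


# Hypothetical sample: one paragraph per line, one word edited.
OLD = ("Kosovo is a region in the Balkans with a long and disputed history.\n"
       "Its capital is Pristina.")
NEW = ("Kosovo is a territory in the Balkans with a long and disputed history.\n"
       "Its capital is Pristina.")

if __name__ == "__main__":
    print("line-level:", changed_chars_by_line(OLD, NEW))
    print("word-level:", changed_chars_by_word(OLD, NEW))
```

With one paragraph per line, a one-word edit makes the line-level count
equal to the whole paragraph, while the word-level count covers only the
replaced word.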

Also, I have been able to use the mw-render code on the git archive. You
can see the results of the new version of my reader script here (2 hours
of reading the full article):

http://www.archive.org/details/KosovoWikipediaArticlesVideo

I am thinking of storing the Wikipedia articles in the intermediate XML
parse tree format from mw-render, if that would help the diff tools.

Another idea would be to allow editing of the articles with OpenOffice,
for example, and to provide traceability in the document structure
back to the original article. It could be marked up with blame
information. Going further, the blame information could be embedded in
each word as an XML attribute. That would allow exact tracking
of where the edits come from.
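
A rough sketch of what per-word blame in an XML attribute could look
like, assuming the revisions are available oldest first as
(author, text) pairs. The element name "w", the attribute name "author"
and both function names are hypothetical; this is not the mw-render
format or git's blame algorithm, just a word-level diff with difflib.

```python
import difflib
import xml.etree.ElementTree as ET


def word_blame(revisions):
    """Return a list of (word, author) for the latest revision.

    revisions: list of (author, text) pairs, oldest first.
    A word keeps its original author for as long as it survives
    unchanged; inserted or replaced words go to the editing author.
    """
    blamed = []
    for author, text in revisions:
        new_words = text.split()
        old_words = [word for word, _ in blamed]
        matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged words carry their earlier attribution forward.
                updated.extend(blamed[i1:i2])
            else:
                # Inserted or replaced words belong to this revision's author;
                # for a pure deletion the slice is empty and nothing is added.
                updated.extend((word, author) for word in new_words[j1:j2])
        blamed = updated
    return blamed


def blame_to_xml(blamed):
    """Serialize (word, author) pairs as <article><w author="...">word</w>...</article>."""
    root = ET.Element("article")
    for word, author in blamed:
        element = ET.SubElement(root, "w", {"author": author})
        element.text = word
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    history = [
        ("alice", "Kosovo is a region"),
        ("bob", "Kosovo is a disputed region"),
    ]
    print(blame_to_xml(word_blame(history)))
```

Splitting on whitespace loses the original spacing and markup, so a real
implementation would have to tokenize the parse tree instead of the raw
text.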

mike


