[Foundation-l] Wikipedia meets git

John Vandenberg jayvdb at gmail.com
Sat Oct 17 15:04:12 UTC 2009


On Sun, Oct 18, 2009 at 1:05 AM, Anthony <wikimail at inbox.org> wrote:
> On Sat, Oct 17, 2009 at 4:40 AM, jamesmikedupont at googlemail.com
> <jamesmikedupont at googlemail.com> wrote:
>>> It would be nice if the git archival format was more efficient for the
>>> kinds of changes made in Wikipedia articles: Source code tends
>>> to have short lines and changes tend to change a significant portion
>>> of the lines, while edits on Wikipedia are far more likely to change
>>> only part of a very long line (really, a paragraph).... so working
>>> with line level deltas is efficient for source code while inefficient
>>> for Wikipedia data.
>>
>> I have started to work on the blame code
>> to bring it down to the char level and learn about it.
>
> Char level would probably make it too inefficient to merge deltas.
> Treating a period followed by a space as a line separator would
> probably be more efficient.
>
> The key to efficiency is to use skip deltas, though.  You build a
> binary tree so accessing any revision requires the application of only
> log(n) deltas.
>
> I asked whether or not you tried svn, because svn already uses skip deltas.
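
A rough illustration of the "period followed by a space" suggestion in
the quoted text above, as a minimal Python sketch. It assumes a naive
sentence splitter and uses difflib from the standard library; the helper
names split_sentences and changed_units and the sample paragraphs are
hypothetical, not taken from git or any existing tool.

```python
import difflib
import re


def split_sentences(paragraph):
    # Naive split: break after a period that is followed by whitespace.
    # Abbreviations such as "e.g. " would be split too; this is only a sketch.
    return re.split(r"(?<=\.)\s+", paragraph.strip())


def changed_units(old_units, new_units):
    """Return the units of new_units that a delta would have to store."""
    matcher = difflib.SequenceMatcher(a=old_units, b=new_units, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(new_units[j1:j2])
    return changed


old = ("Git stores snapshots. Wikipedia paragraphs are long lines. "
       "Edits are usually small.")
new = ("Git stores snapshots. Wikipedia paragraphs are very long lines. "
       "Edits are usually small.")

# Treating the whole paragraph as one line: the entire paragraph changes.
line_delta = changed_units([old], [new])

# Treating each sentence as a line: only the edited sentence changes.
sentence_delta = changed_units(split_sentences(old), split_sentences(new))
```

With the paragraph as the unit, the delta carries the whole paragraph;
with sentences as the unit, it carries only the one edited sentence.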

svn would be daft, for so many reasons.

> Is the idea that the entire file would need to be transferred over the
> Internet, though?  If so, I guess you wouldn't want to use skip deltas
> - they greatly reduce access time to early revisions, but at a
> slight space penalty.
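
To make the skip-delta idea in the quoted text concrete, here is a
minimal Python sketch. It assumes one common scheme, in which the delta
base of revision n is n with its lowest set bit cleared; this is an
assumption for illustration, not a description of svn's actual storage
code. The function names delta_base, delta_chain and deltas_needed are
hypothetical.

```python
def delta_base(rev):
    # Clear the lowest set bit, e.g. 6 (0b110) -> 4 (0b100) -> 0.
    return rev & (rev - 1)


def delta_chain(rev):
    """Revisions visited when reconstructing rev, ending at revision 0."""
    chain = [rev]
    while rev > 0:
        rev = delta_base(rev)
        chain.append(rev)
    return chain


def deltas_needed(rev):
    # One delta is applied per hop in the chain.
    return len(delta_chain(rev)) - 1


chain_1000 = delta_chain(1000)
hops_1000 = deltas_needed(1000)
```

Under this scheme, revision 1000 is rebuilt through the chain
1000, 992, 960, 896, 768, 512, 0, which is 6 deltas instead of the 1000
a plain linear chain would need. The cost is that each delta is taken
against an older base, so it tends to be somewhat larger than a delta
against the immediately preceding revision.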

With git, parts of the checkout can be shallow clones.

--
John Vandenberg


