On Mon, Nov 17, 2014 at 11:03 AM, James Forrester <jforrester(a)wikimedia.org>
wrote:
> Moving to character-level rather than paragraph-level diffing might help
> here, potentially. I vaguely remember that we attempted that and abandoned
> it because it caused more issues than it solved back in ?2004, though.

A paragraph-level diff means that you only get an edit conflict if two
people change the same paragraph. A character-level diff would mean, then,
that you only get a conflict if they change the same character? That sounds
a bit excessive. (Stupid example: if I change "sixty-three" to "sixty-five"
and someone else changes it to "seventy-three", that should probably be a
conflict, but a character-level diff would happily merge them into
"seventy-five".) A sentence-level diff would make much more sense, except
that breaking text into sentences is a less trivial task than breaking it
into paragraphs (lines). It is a very fundamental step in natural language
processing though, so I am sure mature algorithms and tools exist for it;
we would just have to research them.
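
To make the "seventy-five" example concrete, here is a minimal sketch of a
three-way merge whose granularity depends only on how the text is split
into tokens. It is built on Python's difflib and is not how MediaWiki
merges edits; merge3, edits, overlaps and MergeConflict are hypothetical
names made up for this illustration.

```python
from difflib import SequenceMatcher


class MergeConflict(Exception):
    """Raised when both editors changed overlapping parts of the base."""


def edits(base, other):
    """List the changes from base to other as (start, end, replacement)."""
    sm = SequenceMatcher(None, base, other, autojunk=False)
    return [
        (i1, i2, tuple(other[j1:j2]))
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    ]


def overlaps(x, y):
    """True if two edits touch the same base tokens or the same slot."""
    return x[0] == y[0] or (x[0] < y[1] and y[0] < x[1])


def merge3(base, a, b):
    """Merge token lists a and b, both derived from base, or raise."""
    ea, eb = edits(base, a), edits(base, b)
    for x in ea:
        for y in eb:
            # Identical edits made by both sides are not a conflict.
            if x != y and overlaps(x, y):
                raise MergeConflict(f"{x} overlaps {y}")
    out, pos = [], 0
    for start, end, repl in sorted(set(ea + eb)):
        out.extend(base[pos:start])  # unchanged tokens before the edit
        out.extend(repl)             # the edited tokens
        pos = end
    out.extend(base[pos:])
    return out


base, mine, theirs = "sixty-three", "sixty-five", "seventy-three"

# Character-level tokens: the two edits touch different characters,
# so they are merged without complaint.
merged = "".join(merge3(list(base), list(mine), list(theirs)))

# Whole-word tokens: both editors changed the same token, so
# merge3([base], [mine], [theirs]) raises MergeConflict instead.
```

The only difference between the two calls is the tokenization, so a
sentence-level merge would be the same routine fed with the output of a
sentence splitter.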

Another piece of low-hanging fruit would be to special-case the situation
where editor A adds text to the end of a section but does not start a new
section, while editor B adds a new section at the same place. This is
currently a conflict, as they both try to insert into the same "slot"
between paragraphs, so a generic merge tool cannot figure out whether those
additions conflict and what the right order would be if they don't;
however, knowing the semantics of wikitext, inserting the text from A first
and the one from B after that seems a pretty safe bet. This kind of
conflict is very typical on talk pages where people almost always edit the
end of a section, and the few "hot topic" sections get the majority of the
edits. (Of course, using unstructured wikitext for talk pages is a bad
thing in general, but that's a long-term problem, and this kind of edit
conflict could be prevented quickly.)
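
A rough sketch of that special case, assuming the merge tool can hand over
the two competing insertions as lists of lines. The function names and the
heading regex are hypothetical simplifications for illustration, not
existing MediaWiki code.

```python
import re

# Simplified pattern for a wikitext heading line such as "== Title ==".
HEADING = re.compile(r"^(={1,6}).+\1\s*$")


def has_heading(lines):
    """True if any line of the insertion is a section heading."""
    return any(HEADING.match(line) for line in lines)


def starts_section(lines):
    """True if the first non-blank line of the insertion is a heading."""
    first = next((line for line in lines if line.strip()), "")
    return HEADING.match(first) is not None


def resolve_same_slot(ins_a, ins_b):
    """Order two insertions aimed at the same slot between paragraphs.

    If one insertion opens a new section and the other is plain text
    with no headings, the plain text goes first, so it stays at the end
    of the section it was added to. Returns None when the case does not
    apply and the edit should still be reported as a conflict.
    """
    if starts_section(ins_a) and not has_heading(ins_b):
        return ins_b + ins_a
    if starts_section(ins_b) and not has_heading(ins_a):
        return ins_a + ins_b
    return None


reply = [":I agree. ~~~~"]
new_section = ["", "== Another topic ==", "What about this? ~~~~"]
merged = resolve_same_slot(new_section, reply)
```

Here merged is the reply followed by the new section, whichever editor
saved first. Two replies, or two new sections, still return None and stay
a conflict.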