On that topic, I'll share some of my experience.

First, parsing wikitext is way more difficult than you probably imagine.  People are often tempted to do a poor-man's job of it with regular expressions and the like.  Down that path lies madness.  Don't go there.

There are only two rational ways I know of to parse wikitext.

Parsoid is one.  It's complicated to get your head around, but it is the one true officially supported way.

The other is mwparserfromhell.  It has the advantage of being much simpler to use.  It has the disadvantage of not getting every possible edge case correct.  It is also only usable from Python, which is fine if you're writing Python and a problem otherwise.

In either case, once you've got parsed versions of two revisions, you'll then be faced with the problem of diffing them.  That's going to be non-trivial.
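One crude way to start on the diffing problem is to compare plain-text renderings of the two revisions with the standard library's difflib.  This sidesteps true tree-level diffing (which is the genuinely hard part), so treat it as a rough sketch with invented sample revisions:

```python
# Rough sketch: sentence-level diff of two revisions' plain text using
# stdlib difflib.  The sentences here are made up for illustration.
import difflib

old_sentences = ["The sky is blue.", "Water is wet."]
new_sentences = ["The sky is blue.", "Water is transparent.", "Fire is hot."]

matcher = difflib.SequenceMatcher(a=old_sentences, b=new_sentences)
for op, a0, a1, b0, b1 in matcher.get_opcodes():
    if op != "equal":
        # op is "replace", "delete", or "insert"
        print(op, old_sentences[a0:a1], "->", new_sentences[b0:b1])
```

Anything smarter than this (matching moved paragraphs, diffing inside templates) quickly turns into a research problem, which is why I say it's non-trivial.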


On Jul 8, 2021, at 7:01 PM, David Lynch <dlynch@wikimedia.org> wrote:

The best I can say about this for your purposes is that using the parsoid HTML would relieve you of having to parse wikitext to work out whether the contents of a math tag were what changed. 🤷🏻