> There are a few Python-based things that might be interesting, but I
> think you'll get a lot more love for doing something in PHP or C.
> Since this is a student internship, you shouldn't be bashful about
> using this as a learning opportunity.
> I'd only caution against convincing yourself (and us) that you'll be
> more interested in learning something like PHP than you truly are. It
> might help you land a spot, but it will work against you in having a
> successful project, and this has such high visibility that you'll
> really want to be successful.
What visibility does this have? I thought it was some abandoned corner
of the wiki that nobody has touched in the seven years since it was
first written. What happens if I make a hash of this?
> So, if you find yourself thinking about doing this in PHP and having
> your inner voice say "meh", then I'd recommend sticking to your guns
> and proposing to do this or something else in Python and/or C.
Well, now my inner voice says, "I really don't want to make a hash of
this texvc port!", so let me explain why I want to do it in Python
rather than PHP. I agree that the performance will probably be just
fine, and that it would be a great coup for maintainability and
installation and usage. The problem is, I don't think PHP has a
parser-generator package.
So let me make sure I understand the problem here. You already have a
texvc implementation that has worked just fine for the last seven years.
TeX is pretty stable at this point, so chances are good you'd make it
another seven years without problems. But you're still dissatisfied
because OCaml is a hard language to find programmers for, and the
existing implementation isn't really maintained. You want it ported to a
different language that has more programmers available.
(You also want it as a MediaWiki extension rather than a core feature;
I'm going to do that, but I won't say anything more because it seems
fairly uncontroversial.)
Since the subset of TeX you need parsed is context-free rather than
regular (arbitrarily nested braces, for one), it needs a real parser,
such as an LALR one, not just a bunch of regexes. I know three ways to
get an LALR parser:
(1) write a pushdown automaton manually (i.e., be yacc)
(2) write input for a parser-generator
(3) write a parser-generator, and give it input
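To show what the hand-written route looks like even at toy scale, here
is a minimal Python sketch. The grammar is a hypothetical stand-in
(commands, single characters, and brace groups), not texvc's real one,
and it uses recursive descent rather than a true LALR automaton; the
names tokenize and parse are mine. The regex only splits the input into
tokens; matching the braces needs the call stack.

```python
import re

# Hypothetical toy grammar, NOT texvc's actual grammar:
#   expr := item*
#   item := COMMAND | CHAR | '{' expr '}'
# The regex handles tokens only; whitespace (and a stray backslash that is
# not followed by letters) is silently skipped in this sketch.
TOKEN = re.compile(r"\\[A-Za-z]+|[{}]|[^\\{}\s]")


def tokenize(src):
    """Split a TeX-like string into commands, braces and single characters."""
    return TOKEN.findall(src)


def parse(tokens):
    """Return a nested list mirroring the brace structure of the input."""
    pos = 0

    def expr(depth):
        nonlocal pos
        items = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == "}":
                if depth == 0:
                    raise SyntaxError("unmatched '}'")
                return items  # the caller consumes the closing brace
            pos += 1
            if tok == "{":
                items.append(expr(depth + 1))  # recursion acts as the stack
                if pos >= len(tokens):
                    raise SyntaxError("unclosed '{'")
                pos += 1  # consume the matching '}'
            else:
                items.append(tok)
        return items

    return expr(0)
```

For example, parse(tokenize(r"\frac{a}{b{c}}")) returns
['\\frac', ['a'], ['b', ['c']]], while "{a" and "a}" both raise
SyntaxError. Every new construct in the real grammar would mean more
hand-written control flow of this kind.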
Option (2) is the most maintainable and feasible option, and it's
precisely the one that cannot be done in PHP. As far as I know, PHP has
no parser-generator package. (Please, please let me know if that's
incorrect so I can stop embarrassing myself and get on with writing a
GSoC proposal.)
I could probably do (1), or some hackish kludge at half of it, by
throwing custom control structures into a bucketload of regexes, but I
don't think that's in the project's best interests. As has been pointed
out, the OCaml implementation is really concise and elegant. A large
fraction of that concision and elegance comes from not actually being a
parser but rather only a context-free grammar written in a BNF-like
syntax common to most parser-generators.
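As a rough illustration of "maintain the grammar, not the parser", here
is a Python sketch in which the grammar is plain data and a small
generic routine interprets it. Everything here is an assumption for
illustration: the grammar is a toy, not texvc's; the names GRAMMAR,
kinds, derive and accepts are made up; and the interpreter is a naive
ordered-choice recursive descent, not the LALR tables a real
parser-generator would produce.

```python
import re

# Hypothetical toy grammar written as data; NOT texvc's actual grammar.
# Each nonterminal maps to a list of alternatives, tried in order; each
# alternative is a sequence of nonterminals or terminal token kinds.
GRAMMAR = {
    "expr": [["item", "expr"], []],
    "item": [["CMD"], ["CHAR"], ["LBRACE", "expr", "RBRACE"]],
}

TOKEN = re.compile(r"\\[A-Za-z]+|[{}]|[^\\{}\s]")


def kinds(src):
    """Turn the source string into a list of terminal token kinds."""
    out = []
    for tok in TOKEN.findall(src):
        if tok == "{":
            out.append("LBRACE")
        elif tok == "}":
            out.append("RBRACE")
        elif tok.startswith("\\"):
            out.append("CMD")
        else:
            out.append("CHAR")
    return out


def derive(symbol, toks, pos):
    """Try to match `symbol` at `pos`; return (tree, new_pos) or None."""
    if symbol not in GRAMMAR:  # terminal: compare against the next token
        if pos < len(toks) and toks[pos] == symbol:
            return symbol, pos + 1
        return None
    for alternative in GRAMMAR[symbol]:
        children, cur = [], pos
        for part in alternative:
            result = derive(part, toks, cur)
            if result is None:
                break  # this alternative failed; try the next one
            child, cur = result
            children.append(child)
        else:
            return (symbol, children), cur
    return None


def accepts(src):
    """True if the whole input is derivable from 'expr'."""
    toks = kinds(src)
    result = derive("expr", toks, 0)
    return result is not None and result[1] == len(toks)
```

Here accepts(r"\frac{a}{b{c}}") is True, and accepts("{a") is False.
Supporting a new construct means adding a line to GRAMMAR; derive itself
does not change, which is the property I'd like to keep in a port.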
I think it'd be easier to find a programmer who has worked with a
parser-generator and can learn a little bit of OCaml than it would be
to find a PHP programmer willing to work his way into a manually
implemented parser. After all, how many PHP programmers do you know who
have experience mucking around inside an LALR parser?
So that's why, while I'm happy to take it on in PHP as a learning
experience for myself, I think it'd be better for MediaWiki to port
texvc to Python. That gets us the larger pool of potential maintainers
that comes with using a commonly known language, without sacrificing the
amazing advantage of only needing to maintain a grammar rather than the
parser itself.
And as far as dependencies are concerned, Python is still a much easier
dependency to satisfy, both for programmers working with the code and
for sysadmins installing it.
What do you guys think?
Also, would anyone be interested in mentoring this project?
Yours,
Damon Wang