(Nick Reinking <nick(a)twoevils.org>):
> Couple quick questions:
> When Wikitext is pulled from the database, what are the newlines?
MySQL gives back whatever you give it. We generally give it
Unix-style text with just \n, but a few browsers might add CRs.
> Are they always \n? If so, I can clean up the parsing a bit and
> eke out a bit more performance (not a big deal).
It shouldn't hurt performance to just ignore and skip CRs.
That can be done in the lexer. You should never encounter CR-only
line ends.
> Also, what format is the wikitext stored in the database as?
> UTF-8? UTF-16?
Some of the foreign ones use UTF-8. The English one is ISO-8859-1.
> As far as performance goes, with what I'm handling now, with all
> the .txt data files in the test suite (x256 = 492672 lines), I'm
> seeing parsing speeds of about 86600 lines/sec (in an 18KB
> executable). So on a typical page of, say, 40-50 lines, that makes
> half a millisecond spent in parsing. If PHP were 100 times worse,
> it would account for 1/20th of a second per page fetch. Doesn't
> sound like much of a problem to me, and I doubt it's 1000 times
> worse.
Just curious: what does your parser do with Quotes.txt from
the test suite?
--
Lee Daniel Crocker <lee(a)piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC