Simetrical wrote:
On 10/7/07, Rolf Lampa <rolf.lampa(a)rilnet.com> wrote:
<...> In other words, it seems I have not yet found /all/ the
(undocumented) magic words, after all...
You'll also have to watch out for things wrapped in <pre> and
<nowiki>, of course. Those will ruin your day.
<nowiki> is a bit tricky. Although I have no intention of actually
parsing the text for display to humans (MediaWiki does that best), I
need to be able to expand templates as a preparatory stage for content
analysis and also for producing SQL for import into a Sphinx indexing
DB, etc.
However, I still need to "disable" some syntax, typically
<nowiki>{{</nowiki> etc., which is used for "disarming" template calls,
pipe chars, etc.
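A minimal sketch of that "disarming" step, in Python. This is not the tool's actual code: the marker format and the names strip_protected/unstrip are made up for illustration. The idea is to swap every <nowiki> and <pre> region for an opaque marker before template expansion, then put the original text back afterwards.

```python
import re

# Matches <nowiki>...</nowiki> and <pre>...</pre> regions (non-greedy).
_PROTECTED = re.compile(r"<(nowiki|pre)\b[^>]*>(.*?)</\1\s*>",
                        re.IGNORECASE | re.DOTALL)
# Hypothetical marker format: DEL + "UNIQ" + index + DEL.
_MARKER = re.compile(r"\x7fUNIQ(\d+)\x7f")


def strip_protected(text):
    """Replace protected regions with markers; return (text, stash)."""
    stash = []

    def _stash(match):
        stash.append(match.group(0))
        return "\x7fUNIQ%d\x7f" % (len(stash) - 1)

    return _PROTECTED.sub(_stash, text), stash


def unstrip(text, stash):
    """Put the stashed regions back in place of their markers."""
    return _MARKER.sub(lambda m: stash[int(m.group(1))], text)
```

Between the two calls, the expander only ever sees the markers, so a "{{" inside <nowiki> can no longer be mistaken for the start of a template call.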
Also might (depending on your purpose) need to consider <noinclude>,
<includeonly>, <onlyinclude>.
Yes, these must be handled; they modify content quite significantly. I
do handle these according to MW logic, though. It took me a while to
figure out how, since it gets a bit tricky when they're nested in all
kinds of arbitrary order... but I solved this.
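For the simple, non-nested case, the transclusion view of a template can be sketched roughly like this in Python. The function name transcluded_text is hypothetical, and this deliberately ignores the hard parts mentioned above (arbitrary nesting, unclosed tags, tags inside <nowiki>), so treat it as an approximation of the MW behaviour rather than a faithful port.

```python
import re

_ONLYINCLUDE = re.compile(r"<onlyinclude>(.*?)</onlyinclude>", re.I | re.S)
_NOINCLUDE = re.compile(r"<noinclude>.*?</noinclude>", re.I | re.S)
_INCLUDEONLY_TAGS = re.compile(r"</?includeonly>", re.I)


def transcluded_text(template_source):
    """Return the text a template contributes when it is transcluded."""
    # If any <onlyinclude> sections exist, nothing outside them is used.
    parts = _ONLYINCLUDE.findall(template_source)
    if parts:
        template_source = "".join(parts)
    # <noinclude> content is dropped entirely on transclusion.
    template_source = _NOINCLUDE.sub("", template_source)
    # <includeonly> content is kept; only the tags themselves go away.
    return _INCLUDEONLY_TAGS.sub("", template_source)
```

When the template page itself is being processed (not a transclusion of it), the rules flip: <includeonly> content is dropped and <noinclude> content is kept.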
And of course things like {{CURRENTTIME}} aren't really inclusions at
all, so you'll have to individually special-case those.
Yes, I handle the ones which are/can be used for building param values
and template names, and which thus affect my ExpandTemplate method.
Quite a few also have significance for the content to be indexed. But
apart from that, the rest is "stripped" and reinserted after expanding.
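The special-casing could look something like the Python sketch below: a lookup table of variable-style magic words, consulted before anything is treated as a template call. The names make_variables/expand_variables and the three entries in the table are only illustrative assumptions; a real list is much longer, and unknown names are simply left untouched here.

```python
import re
from datetime import datetime, timezone

# Bare {{NAME}} with an all-caps name and no parameters.
_VARIABLE = re.compile(r"\{\{\s*([A-Z]+)\s*\}\}")


def make_variables(page_title, now):
    """Build the (illustrative, incomplete) magic-word table for one page."""
    return {
        "CURRENTTIME": now.strftime("%H:%M"),
        "CURRENTYEAR": now.strftime("%Y"),
        "PAGENAME": page_title,
    }


def expand_variables(text, variables):
    """Replace known magic words; leave everything else as it was."""
    def _replace(match):
        return variables.get(match.group(1), match.group(0))

    return _VARIABLE.sub(_replace, text)


if __name__ == "__main__":
    table = make_variables("Main Page", datetime.now(timezone.utc))
    print(expand_variables("{{PAGENAME}} rendered at {{CURRENTTIME}}", table))
```

Passing the timestamp in explicitly keeps the expansion reproducible, which matters when the output is fed to an indexer.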
Yeah, we know all this is pretty pathetic.
=) Yes, it is. But soon I will have covered all the essentials needed
to produce text for indexing and for other content analysis.
The tool also produces all the "<name>links" tables (in SQL format)
directly from the dump, with all the significant output options for
the SQL format, like *size* (instead of number of rows) for extended
INSERTs, keys off, IGNORE (duplicates), etc.
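To illustrate the size-based option, here is a rough Python sketch that batches rows into extended INSERTs by byte size instead of by row count. It is not the tool's code: the function names, the default limit, and the very simplistic escaping are assumptions made for the example, and the escaping in particular is not safe for arbitrary data.

```python
def sql_literal(value):
    """Render one value as an SQL literal (simplistic escaping only)."""
    if value is None:
        return "NULL"
    if isinstance(value, int):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def extended_inserts(table, rows, max_bytes=1_000_000, ignore=False):
    """Yield extended INSERT statements, each at most about max_bytes long."""
    head = "INSERT %sINTO `%s` VALUES " % ("IGNORE " if ignore else "", table)
    batch, size = [], len(head)
    for row in rows:
        tup = "(" + ",".join(sql_literal(v) for v in row) + ")"
        tup_size = len(tup.encode("utf-8")) + 1  # +1 for the "," or ";"
        # Flush before the limit is exceeded; a lone oversized row still
        # gets a statement of its own.
        if batch and size + tup_size > max_bytes:
            yield head + ",".join(batch) + ";"
            batch, size = [], len(head)
        batch.append(tup)
        size += tup_size
    if batch:
        yield head + ",".join(batch) + ";"


def dump_table(table, rows, **options):
    """Wrap the INSERTs with keys switched off, then back on."""
    yield "ALTER TABLE `%s` DISABLE KEYS;" % table
    yield from extended_inserts(table, rows, **options)
    yield "ALTER TABLE `%s` ENABLE KEYS;" % table
```

Limiting by size makes it easier to stay under the server's maximum packet size regardless of how wide the individual rows are.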
Also, the UTF-8 --> ANSI conversions can be "shifted" in both
"directions", for VarChars and Blobs independently; that is, an
integer (+/-, full range) determines how many "steps" to convert the
texts...
It's a kind of massive task to make it all neat and standard,
unfortunately.
Would be nice if we could at least provide libraries, though . . .
Challenging yes, but interesting. :)
Regards,
// Rolf Lampa