[Toolserver-l] Troubles with reading Articles

Platonides platonides at gmail.com
Sat Mar 25 21:38:51 UTC 2006


>> Currently, the best way to bulk-process article text is to read from an
>> XML dump. You can adapt the existing importers to fit your purpose; code
>> is available in PHP, Java and C#, I believe.
>
> Well, I think this means that Stefan's team has to recode a lot. Pulling
> the titles and texts out of the XML dump is easy, but you only get a new
> dump every 1 or 2 months. On the other hand, XML is more robust, while the
> database structure changes with every MediaWiki version - for instance, I
> was not aware of the external text before.

XML dumps should be handled by the wiki anyway: not only the monthly dumps, 
but also Special:Export, which uses the same format. Queries done through it 
are supposed to be better for the server load, as only one request is needed 
to get many articles.
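
Something like this is enough to get several articles in one request (only a 
rough Python sketch, untested; the 'pages' and 'curonly' field names are the 
ones I remember from the export form and may differ between MediaWiki 
versions):

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EXPORT_URL = "https://en.wikipedia.org/wiki/Special:Export"

def _local(tag):
    # Strip the XML namespace, which changes with the export schema version.
    return tag.rsplit("}", 1)[-1]

def fetch_articles(titles):
    """Return {title: wikitext} for all requested pages, in one request."""
    data = urllib.parse.urlencode({
        "pages": "\n".join(titles),  # one title per line
        "curonly": "1",              # latest revision only
    }).encode("utf-8")
    with urllib.request.urlopen(EXPORT_URL, data) as resp:
        root = ET.fromstring(resp.read())
    result = {}
    for page in root.iter():
        if _local(page.tag) != "page":
            continue
        title, text = None, ""
        for elem in page.iter():
            if _local(elem.tag) == "title":
                title = elem.text
            elif _local(elem.tag) == "text":
                text = elem.text or ""
        if title is not None:
            result[title] = text
    return result

The nice part is that the parsing code is then the same whether the text 
comes from the monthly dump or from the live wiki.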

Well, you'd also need some kind of guessing about which articles will be 
queried next in order to optimize it. Or you could fetch the requested 
article plus the next X pages in the DB that need an HTTP query.

Leo, you should also look in that direction: it is easier for the programmer 
to state the total set of articles to be queried than to rely on the fetching 
layer guessing the optimizations.
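
I mean something like this on the client side (only a sketch with invented 
names, not the current wikiproxy interface):

class ArticleFetcher:
    """Fetching layer that is told the full list of wanted titles up front."""

    def __init__(self, fetch_batch, batch_size=50):
        self._fetch_batch = fetch_batch   # e.g. fetch_articles() from above
        self._batch_size = batch_size
        self._pending = []                # announced but not yet fetched
        self._cache = {}

    def will_need(self, titles):
        # The programmer declares everything he is going to ask for.
        self._pending.extend(t for t in titles if t not in self._cache)

    def get(self, title):
        if title not in self._cache:
            # Fetch the requested title together with some of the announced
            # ones, so later get() calls are served from the cache.
            batch = [title] + [t for t in self._pending if t != title]
            self._cache.update(self._fetch_batch(batch[:self._batch_size]))
            self._pending = [t for t in self._pending if t not in self._cache]
        return self._cache[title]

The client's loop stays exactly the same; it only adds one will_need() call 
at the start, and the layer can then batch without guessing.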

Maybe you could add another parameter to the wikiproxy for the articles I 
will want next, to make the wikiproxy aware of them?
The most accurate way would be to have the layer work asynchronously: it 
would receive a query but not actually perform it over HTTP unless a) a 
parameter 'notwait' is set; b) the query queue is X entries long; or c) the 
oldest entry is Y seconds old (a wait timeout). Then it resolves all the 
queued queries at the same time. However, that makes the client part more 
difficult, as client programs tend to use an ask-process-ask-process loop.
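
In Python it would be something like this (again only a sketch; MAX_QUEUE, 
MAX_WAIT and BatchingLayer are made-up names, and fetch_batch would be 
whatever does the real HTTP request):

import time

MAX_QUEUE = 20    # "X long"
MAX_WAIT = 5.0    # "Y seconds old"

class BatchingLayer:
    def __init__(self, fetch_batch):
        self._fetch_batch = fetch_batch   # one HTTP request for many titles
        self._queue = []                  # titles waiting to be fetched
        self._oldest = None               # when the oldest entry was queued
        self._results = {}

    def query(self, title, notwait=False):
        """Queue a title; only flush when one of the three conditions holds."""
        if title not in self._results and title not in self._queue:
            if not self._queue:
                self._oldest = time.time()
            self._queue.append(title)
        expired = (self._oldest is not None
                   and time.time() - self._oldest >= MAX_WAIT)
        if notwait or len(self._queue) >= MAX_QUEUE or expired:
            self._flush()

    def result(self, title):
        """Get the text of a queued title, flushing if it is not there yet."""
        if title not in self._results:
            self._flush()
        return self._results[title]

    def _flush(self):
        # Resolve every queued query with a single batched request.
        if self._queue:
            self._results.update(self._fetch_batch(self._queue))
            self._queue = []
            self._oldest = None

An ask-process client calls result() right after each query(), which forces a 
flush of a one-entry queue every time, so it gains nothing from the batching; 
that is exactly the difficulty.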




