Brion Vibber wrote:
On Mar 7, 2004, at 03:41, Rob Hooft wrote:
The pywikipediabot uses Special:Export whenever it can download more
than one page at a time. Sometimes, however, there are characters in
pages or usernames which are not decodable in the right character set;
in such a case the XML produced by this option is unparsable
(according to the standard, any XML compliant parser has to generate a
fatal error upon such an event). Unless that is fixed, the "export"
via the "edit" option, even for a locked page, is necessary for a
complete robot run to succeed.
Rob, I'd like to get this fixed before I push out the final 1.2.0
release of MediaWiki. I remember there were problems with line endings
and perhaps some misencoded Windows-1252 characters... could you point
to a few pages that exhibit problems to test with?
The line endings are not so important; they are invisible. More annoying
are pages made by a user whose name contains non-ASCII characters that
are not encoded as UTF-8:
======================
import wikipedia
ar = [
    wikipedia.PageLink('ca','Lept%C3%B3'),
    ]
ga = wikipedia.GetAll('ca',ar)
ga.run()
for pl in ar:
    try:
        print len(pl._contents)
    except AttributeError:
        print -1
========================
Gives:
Dumped invalid XML to sax_parse_bug.dat
Traceback (most recent call last):
  File "x.py", line 7, in ?
    ga.run()
  File "/usr/local/home/rob/p/pywikipedia/wikipedia.py", line 583, in run
    xml.sax.parseString(data, handler)
  File "/usr/local/lib/python2.3/xml/sax/__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "/usr/local/lib/python2.3/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/local/lib/python2.3/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/local/lib/python2.3/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/local/lib/python2.3/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:7:25: not well-formed (invalid token)
Line 7 of the dumped XML is encoded in ISO-8859-1 and therefore contains
byte sequences that are not valid UTF-8:
<contributor><ip>Plàcid Pérez Bru</ip></contributor>
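The failure can be reproduced without the wiki. The following is only a minimal sketch, written for Python 3 rather than the Python 2.3 used above; the helper names is_valid_utf8 and parses_ok and the one-element documents are made up for illustration and are not part of pywikipediabot or of the real export. It assumes the export carries no encoding declaration, so that the parser falls back to UTF-8.

```python
import xml.sax

# The contributor name from line 7, written with escapes so this file
# does not depend on its own source encoding (\xe0 = a-grave, \xe9 = e-acute).
name = "Pl\xe0cid P\xe9rez Bru"
latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def parses_ok(document):
    """Return True if the expat-based SAX parser accepts the bytes."""
    try:
        xml.sax.parseString(document, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# Hypothetical stand-ins for the export: no XML declaration, so the
# parser treats the bytes as UTF-8.
bad_doc = b"<contributor><ip>" + latin1_bytes + b"</ip></contributor>"
good_doc = b"<contributor><ip>" + utf8_bytes + b"</ip></contributor>"

print(is_valid_utf8(latin1_bytes))
print(parses_ok(bad_doc))
print(parses_ok(good_doc))
```

In ISO-8859-1 the accented letters are the single bytes 0xE0 and 0xE9. In UTF-8 those values announce a multi-byte sequence, but the next byte is a plain ASCII letter, so the sequence is invalid and the parser has to stop. The same name encoded as UTF-8 should parse without complaint.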
I don't know whether this is the only problem that occurs, but it is the
only one I can find now in my last log file.
Rob
--
Rob W.W. Hooft || rob(a)hooft.net ||
http://www.hooft.net/people/rob/