[Labs-l] Scripts which adds template to articles created by ContentTranslation tool do not work on the grid

Martin Urbanec martin.urbanec at wikimedia.cz
Sat Jun 17 09:19:38 UTC 2017


Thank you all! I've added export LC_ALL=en_US.UTF-8 to my launch bash
script and all works correctly.

Best,
Martin

On Sat, 17 Jun 2017 at 11:15, Merlijn van Deen (valhallasw) <
valhallasw at arctus.nl> wrote:

> Hi all,
>
> This is a combination of a Python 3 design choice (PEP 383 [1]) and T60784
> [2]. What happens is the following:
>
> 1) The locale is set to an encoding that cannot decode certain bytes -- for
> example, ASCII, which can only decode bytes < 128.
> 2) Python is started with a command line parameter that contains a byte >=
> 128 (\x80). For example, "ř" is represented by two bytes when UTF-8
> encoded: \xc5\x99. Both of these are >= \x80, and can therefore not be
> interpreted as ASCII.
> 3) Python 3 needs to somehow decode these bytes into a text string, but
> there is no valid way to do so! Instead of complaining loudly with a
> UnicodeDecodeError, Python 3 embeds the bytes as 'fake characters' (lone
> surrogates) in the string -- as described in PEP 383.
> \xc5\x99 is therefore decoded as "\udcc5\udc99" instead of "ř".
> 4) Pywikibot tries to encode these characters using utf-8, but they are
> fake characters, and the .encode step blows up.
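
Steps 1 to 4 can also be reproduced inside a single Python 3 process, without touching the shell locale. This is only a sketch: it assumes that decoding with the 'ascii' codec and the 'surrogateescape' error handler is a fair stand-in for what the interpreter does to sys.argv under LC_ALL=C, and the variable names are made up for the illustration.

```python
# Bytes the shell passes for "ř" (U+0159) when the terminal uses UTF-8.
raw = "\u0159".encode("utf-8")          # b'\xc5\x99'

# Steps 1-3: decode as ASCII with the PEP 383 handler. Each undecodable
# byte 0xNN is smuggled into the string as the lone surrogate U+DCNN.
as_seen_by_python = raw.decode("ascii", "surrogateescape")

# Step 4: a strict UTF-8 encode refuses lone surrogates.
try:
    as_seen_by_python.encode("utf-8")
    failed = False
    reason = None
except UnicodeEncodeError as exc:
    failed = True
    reason = exc.reason

# ascii() is used so that printing itself cannot hit the same error.
print(ascii(as_seen_by_python), failed, reason)
```
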
>
> A simple way to reproduce this is the following:
>
> valhallasw at tools-bastion-03:~/ucm$ cat test.py
> import sys
> encoded = sys.argv[1].encode('utf-8')
>
> valhallasw at tools-bastion-03:~/ucm$ LC_ALL=C python3 test.py řeklad
> Traceback (most recent call last):
>   File "test.py", line 2, in <module>
>     encoded = sys.argv[1].encode('utf-8')
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in
> position 0: surrogates not allowed
>
> This should be fixed in future Python versions (likely 3.7), when PEP 540
> [3] is implemented.
>
> As for the current situation, the simplest solution is to add 'export
> LC_ALL=en_US.UTF-8' to your script, before the 'python ...' line.
>
> Best,
> Merlijn
>
> [1] https://www.python.org/dev/peps/pep-0383/
> [2] https://phabricator.wikimedia.org/T60784
> [3] https://www.python.org/dev/peps/pep-0540/
>
>
> On 16 June 2017 at 23:58, Bryan Davis <bd808 at wikimedia.org> wrote:
>
>> On Fri, Jun 16, 2017 at 10:15 AM, Martin Urbanec
>> <martin.urbanec at wikimedia.cz> wrote:
>> > Hello,
>> >
>> > I have a script which should add a template to articles which are created
>> > by the ContentTranslation tool (the template has parameters which depend
>> > on the language and revision which were used as the source; this is the
>> > reason why I use a separate script). It may be found at
>> > https://github.com/urbanecm/addPrekladCT/blob/master/addmissing.py. The
>> > script works perfectly on my local PC and on the bastion host, but I
>> > can't get it to work on the grid.
>> >
>> > The script itself is run by
>> > python3 addmissing.py -always -file:pages.txt -search:'-insource:/\{\{[Pp]řeklad/'
>> > and requires a pages.txt file and the preklads.txt file at
>> > https://tools.wmflabs.org/urbanecmbot/test/preklads.txt. The first
>> > contains the pages that should be processed and acts as the generator; the
>> > second one is something like a database with the exact templates which
>> > should be inserted. Both files are attached as examples.
>> >
>> > When I try to run it on the toollabs bastion, all works as it should. When
>> > I send the script to the grid, it does not work (see sample output below).
>> > Why? Can somebody help me with it?
>> >
>> > Thank you in advance,
>> > Martin Urbanec / Urbanecm
>> >
>> > ; Output
>> >
>> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ cat test.sh
>> > python3 addmissing.py -always -file:pages.txt -search:'-insource:/\{\{[Pp]řeklad/'
>> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ jsub bash test.sh
>> > Your job 6201363 ("bash") has been submitted
>> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ qstat
>> > job-ID  prior   name user     state submit/start at     queue                             slots ja-task-ID
>> > ----------------------------------------------------------------------------------------------------------
>> > 6201363 0.30000 bash urbanecm r     06/16/2017 18:14:42 task at tools-exec-1404.eqiad.wmf 1
>> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ ls ~/bash.*
>> > /home/urbanecm/bash.err  /home/urbanecm/bash.out
>> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $ cat ~/bash.*
>> > Traceback (most recent call last):
>> >   File "addmissing.py", line 223, in <module>
>> >     main()
>> >   File "addmissing.py", line 183, in main
>> >     local_args = pywikibot.handle_args(args)
>> >   File "/shared/pywikipedia/core/pywikibot/bot.py", line 954, in handle_args
>> >     writeToCommandLogFile()
>> >   File "/shared/pywikipedia/core/pywikibot/bot.py", line 1128, in writeToCommandLogFile
>> >     command_log_file.write(s + os.linesep)
>> >   File "/usr/lib/python3.4/codecs.py", line 711, in write
>> >     return self.writer.write(data)
>> >   File "/usr/lib/python3.4/codecs.py", line 368, in write
>> >     data, consumed = self.encode(object, self.errors)
>> > UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 67: surrogates not allowed
>> > CRITICAL: Closing network session.
>> > <class 'UnicodeEncodeError'>
>> > urbanecm at tools-bastion-02 ~/Documents/cswiki/addPrekladCT
>> > $
>>
>> Zhuyifei1999 saw your email and noted on IRC that it looks to be a
>> case of the known bug that I just retitled as "Shell LOCALE neither
>> consistent nor sane across grid engine nodes"
>> (<https://phabricator.wikimedia.org/T60784>). The current best
>> workaround for that bug is to launch the job as a shell script that sets
>> either LANG=C.UTF-8 or PYTHONIOENCODING=utf-8.
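
Since the locale differs between grid nodes, a job could also check at startup which encoding it actually received, and warn before anything goes wrong. The following is a minimal sketch of such a guard; is_utf8 is a hypothetical helper written for this illustration and is not part of pywikibot or the grid tooling.

```python
import codecs
import locale


def is_utf8(encoding_name):
    """Return True if encoding_name is a known alias of UTF-8."""
    try:
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        # Unknown codec names are treated as "not UTF-8".
        return False


# Encoding Python derives from LANG / LC_ALL for this process.
current = locale.getpreferredencoding(False)
if not is_utf8(current):
    print("warning: locale encoding is %s, not UTF-8" % current)
```

Under LC_ALL=C the reported encoding is typically an ASCII alias, so the warning would fire; with the LANG=C.UTF-8 workaround in place it should stay silent.
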
>>
>> If setting the job to run with the same locale you are using in your
>> interactive tests does not work to fix the problem, you may also be
>> hitting a deeper Python3 unicode issue related to surrogate codepoints
>> (<https://bugs.python.org/issue12892>). This is hinted by the
>> "position 67: surrogates not allowed" error message.
>>
>> I can actually reproduce your error message in an interactive python
>> session on tools-dev from a starting state of LANG=en_US.UTF-8:
>>
>>   $ python3
>>   Python 3.4.0 (default, Jun 19 2015, 14:20:21)
>>   [GCC 4.8.2] on linux
>>   Type "help", "copyright", "credits" or "license" for more information.
>>   >>> print('\udcc5')
>>   Traceback (most recent call last):
>>     File "<stdin>", line 1, in <module>
>>   UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 0: surrogates not allowed
>>   >>>
>>
>> Explicitly encoding using 'surrogateescape' does work:
>>   >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
>>   b'\xc5'
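
Because 'surrogateescape' gives back the original byte, the same call can in principle undo the damage to a whole argument: encode with 'surrogateescape' to recover the raw bytes, then decode those bytes as UTF-8. The sketch below assumes the argument really was UTF-8 on the command line and was decoded under an ASCII (C) locale; recover_text is a hypothetical helper name, not an existing pywikibot function.

```python
def recover_text(arg):
    """Undo surrogateescape decoding of UTF-8 bytes (hypothetical helper).

    Raises UnicodeDecodeError if the recovered bytes are not valid UTF-8.
    """
    return arg.encode("utf-8", "surrogateescape").decode("utf-8")


# What Python 3 ends up with for the UTF-8 bytes of "řeklad" under LC_ALL=C.
mangled = "\udcc5\udc99eklad"
fixed = recover_text(mangled)

# ascii() avoids printing non-ASCII text to a possibly ASCII-only stdout.
print(ascii(fixed))
```

Strings without surrogates pass through unchanged, so the helper can be applied to every element of sys.argv.
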
>>
>> It looks like the error could be dealt with in pywikibot by patching
>> writeToCommandLogFile() to open the codec used for output with any
>> value other than the default errors='strict'
>> (<https://docs.python.org/3/library/codecs.html#error-handlers>).
>>
>>   $ python3
>>   Python 3.4.0 (default, Jun 19 2015, 14:20:21)
>>   [GCC 4.8.2] on linux
>>   Type "help", "copyright", "credits" or "license" for more information.
>>   >>> print('\udcc5'.encode('utf-8', 'ignore'))
>>   b''
>>   >>> print('\udcc5'.encode('utf-8', 'replace'))
>>   b'?'
>>   >>> print('\udcc5'.encode('utf-8', 'xmlcharrefreplace'))
>>   b'&#56517;'
>>   >>> print('\udcc5'.encode('utf-8', 'backslashreplace'))
>>   b'\\udcc5'
>>   >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
>>   b'\xc5'
>>   >>> print('\udcc5'.encode('utf-8', 'surrogatepass'))
>>   b'\xed\xb3\x85'
>>   >>>
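
The suggested patch amounts to opening the log with a non-strict error handler. As a rough sketch only: the file path and command line below are invented for the example, and it uses the built-in open() rather than the codecs writer seen in the traceback, on the assumption that the errors argument has the same meaning in both.

```python
import os
import tempfile

# Hypothetical stand-in for the command log; not pywikibot's real path.
log_path = os.path.join(tempfile.mkdtemp(), "commands.log")

# A command line containing a surrogate-escaped argument.
command_line = "addmissing.py -search:\udcc5\udc99eklad"

# errors='backslashreplace' keeps the write from raising and leaves a
# visible trace of the undecodable bytes in the log.
with open(log_path, "a", encoding="utf-8", errors="backslashreplace") as log_file:
    log_file.write(command_line + "\n")

with open(log_path, "rb") as f:
    written = f.read()

print(written)
```

Any of the handlers listed above would avoid the exception; 'backslashreplace' is chosen here only because it keeps the log readable while still showing that something was off.
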
>>
>>
>> Bryan
>> --
>> Bryan Davis              Wikimedia Foundation    <bd808 at wikimedia.org>
>> [[m:User:BDavis_(WMF)]] Manager, Cloud Services          Boise, ID USA
>> irc: bd808                                        v:415.839.6885 x6855
>>
>> _______________________________________________
>> Labs-l mailing list
>> Labs-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>>