On Tue, May 2, 2017 at 7:10 PM, Mark Clements (HappyDog) <gmane(a)kennel17.co.uk> wrote:
> I seem to recall that a long, long time ago MediaWiki was using UTF-8
> internally but storing the data in 'latin1' fields in MySQL.
Indeed. See $wgLegacyEncoding
<https://www.mediawiki.org/wiki/Manual:$wgLegacyEncoding> (and T128149
<https://phabricator.wikimedia.org/T128149> / T155529
<https://phabricator.wikimedia.org/T155529>).
> I notice that there is now the option to use either 'utf8' or 'binary'
> columns (via the $wgDBmysql5 setting), and the default appears to be
> 'binary'.[1]
>
> I've come across an old project which followed MediaWiki's lead (literally
> - it cites MediaWiki as the reason) and stores its UTF-8 data in latin1
> tables. I need to upgrade it to a more modern data infrastructure, but I'm
> hesitant to simply switch to 'utf8' without understanding the reasons for
> this initial implementation decision.
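To make the migration hazard concrete, here is a minimal sketch (not MediaWiki's or that project's actual code) of why UTF-8 data survives unharmed in a 'latin1' column, and why a naive charset conversion would corrupt it. It relies on latin1 mapping every byte value 0x00–0xFF to a code point, so the column acts as a transparent byte container:

```python
# Real UTF-8 content that we want to store.
text = "naïve – café"
utf8_bytes = text.encode("utf-8")

# What the database "thinks" the latin1 column contains: mojibake text,
# but every original byte is preserved losslessly.
stored = utf8_bytes.decode("latin1")
assert stored != text  # looks wrong when read as latin1...

# Correct migration: reinterpret the raw bytes as UTF-8 (the effect of
# converting the column to BINARY first, then to a UTF-8 charset).
recovered = stored.encode("latin1").decode("utf-8")
assert recovered == text  # ...but round-trips perfectly

# Naive "ALTER ... CONVERT TO utf8" instead transcodes the mojibake
# characters themselves, baking the corruption in permanently
# (double-encoding):
double_encoded = stored.encode("utf-8")
assert double_encoded != utf8_bytes
```

This is why the switch should go through a binary reinterpretation step rather than a direct latin1-to-utf8 conversion.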
MySQL's utf8 charset stores at most three bytes per character (i.e. BMP only), so it's not a good idea to use it. utf8mb4 should work in theory. I think the only reason we don't use it is inertia (compatibility problems with old MySQL versions; lack of testing with MediaWiki; difficulty of migrating huge Wikimedia datasets).