Chad Perrin wrote:
PHP is supposedly planning to incorporate Python's ICU, which has some
reasonable Unicode support for regexen, at some point in the future.
PHP already has Unicode regex support, because PCRE has had it for some
time and PHP just bundles that. In fact, the simplest way to split a
UTF-8 string by character in PHP 4-5 without mbstring is to call
preg_match_all('/./u',...). MediaWiki uses this on occasion.
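
As a rough sketch of that trick (the helper name splitUtf8Chars is made up for illustration and is not a MediaWiki function, and the /s modifier is my addition so that "." also matches newlines):

```php
<?php
// Hypothetical helper, for illustration only.
function splitUtf8Chars($text) {
    // /u makes PCRE treat the pattern and the subject as UTF-8, so "."
    // matches one code point instead of one byte.
    // /s (an addition to the pattern in the text) lets "." match newlines.
    preg_match_all('/./us', $text, $matches);
    return $matches[0];
}

$text = "na\xC3\xAFve";          // "naive" with U+00EF, written as raw UTF-8 bytes
$chars = splitUtf8Chars($text);
echo count($chars), "\n";        // 5 characters
echo strlen($text), "\n";        // 6 bytes, since U+00EF takes two bytes in UTF-8
```

The byte escapes are used only so the example does not depend on the encoding of the source file.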
In PHP 6, they are moving to a 16-bit character type (not sure if it's
UTF-16 or UCS-2), with a distinct binary string type. If "unicode
semantics" are enabled, string literals will be Unicode by default, and
all the usual string operations will be character-wise. I dare say this
will cause some backwards compatibility problems for applications such
as MediaWiki.
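
The UTF-16 versus UCS-2 question matters because code points above U+FFFF need two 16-bit units (a surrogate pair) in UTF-16 and cannot be represented at all in UCS-2. A minimal sketch of the arithmetic, using a made-up helper (utf16UnitCount) that is not part of any PHP 6 API and that assumes valid UTF-8 input:

```php
<?php
// Hypothetical helper: counts the 16-bit code units that a valid UTF-8
// string would occupy if it were stored as UTF-16.
function utf16UnitCount($utf8) {
    preg_match_all('/./us', $utf8, $matches);
    $units = 0;
    foreach ($matches[0] as $char) {
        // Code points above U+FFFF take four bytes in UTF-8 and need a
        // surrogate pair (two code units) in UTF-16.
        $units += (strlen($char) === 4) ? 2 : 1;
    }
    return $units;
}

$bmp    = "\xC3\xAF";            // U+00EF, inside the Basic Multilingual Plane
$astral = "\xF0\x9D\x84\x9E";    // U+1D11E, MUSICAL SYMBOL G CLEF
echo utf16UnitCount($bmp), "\n";     // 1
echo utf16UnitCount($astral), "\n";  // 2
```

Whether a "character-wise" length of such a string would be reported as 1 or 2 depends on which of the two encodings the 16-bit type really is.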
PHP 6 requires ICU for its internal Unicode support, but I'm not sure to
what extent they will be providing interfaces to ICU's more complex
functions. Note that ICU is not "Python's ICU"; it's a library written
by IBM, with native implementations in C, C++ and Java. There is a set
of SWIG wrappers to bind the C++ API to Python.
-- Tim Starling