In the process of writing some standards documents for the Wikipedia
content model (some lower level behind-the-scenes stuff that needs to
be done before working on the syntax and to beef up the test suite),
I've come to the point where I need to decide exactly which characters
are and are not allowed in page titles. I'd like to solicit input on
this. Keep in mind that what I'm specifying is the set of characters
from which a page title can be chosen; that is, which strings
will be allowed between the brackets of a link and displayed at the
top of a page, regardless of whatever URL-encoding tricks we have to
use to make that happen. _After_ we specify that, we can specify
exactly how to construct URLs from them. Here are my current thoughts:
* Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets),
{} (braces), <> (less, greater), + (plus), \ (backslash) because
allowing them would interfere with link syntax and make the
software trickier to write. I can live without these, though
I think + might be handy in some places (like C++), and might be
worth the effort to allow.
* Should allow anything Unicode calls a letter, numeral, syllable,
or ideograph.
* Should not allow Unicode diacriticals, combining forms, display
forms (ligatures), controls, and other specials.
* Should allow most ASCII punctuation that might appear in a name
or title in text, specifically - , . ( ) ' & : ; % ! ? / $ *
(Note that some of these, like *, are not currently allowed,
and that : is a special case that's allowed but only when the
text before it doesn't match a namespace, etc.)
* Should not allow non-ASCII punctuation like em dash, curly
quotes, etc., because they cause problems on machines with
strict ISO character sets.
* Space is allowed. Underscore is allowed, but indistinguishable
from space. No other whitespace or control characters (tab, etc.)
are allowed.
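To make the rules above concrete, here is a rough Python sketch of a
title check. It is only an illustration, not the actual software. The
names (FORBIDDEN, ALLOWED_PUNCT, is_allowed_char, is_valid_title,
normalize_title) are made up for this example. Mapping "letter,
numeral, syllable, or ideograph" to the Unicode general categories L*
and N*, and "display forms" to characters with a compatibility
decomposition, is an assumption on my part, as is rejecting any ASCII
punctuation not listed and rejecting empty titles. The namespace
special case for : is not handled.

```python
import unicodedata

# Excluded because they would interfere with link syntax.
# (+ is listed here as proposed, though it might be worth allowing.)
FORBIDDEN = set('#|"[]{}<>+\\')

# ASCII punctuation that might appear in a name or title.
ALLOWED_PUNCT = set("-,.()'&:;%!?/$*")


def normalize_title(title):
    """Underscore is indistinguishable from space, so fold it to space."""
    return title.replace("_", " ")


def is_allowed_char(ch):
    """Return True if a single character may appear in a page title."""
    if ch in FORBIDDEN:
        return False
    if ch in (" ", "_") or ch in ALLOWED_PUNCT:
        return True
    category = unicodedata.category(ch)
    # Assumption: letters, syllables, and ideographs are categories L*,
    # numerals are categories N*. Combining marks (M*), controls (C*),
    # and all other punctuation and symbols fall through and are rejected.
    if category[0] not in ("L", "N"):
        return False
    # Assumption: "display forms" such as ligatures are approximated by
    # characters with a compatibility decomposition, e.g. "<compat> ...".
    if unicodedata.decomposition(ch).startswith("<"):
        return False
    return True


def is_valid_title(title):
    """Check every character; empty titles are rejected (an assumption)."""
    return bool(title) and all(is_allowed_char(ch) for ch in title)
```

With this sketch, is_valid_title("AC/DC") is accepted and
is_valid_title("C++") is rejected, which shows the cost of excluding +.
A precomposed "é" passes, while "e" followed by a combining acute
accent does not, so any real implementation would probably want to
normalize input before checking it.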
Anyone have other ideas/suggestions?
--
Lee Daniel Crocker <lee(a)piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC