> As I've mentioned before, I'm pretty sure it's the encoding hack
> I set up to keep ampersands in titles _in_ the titles instead of
> as raw ampersands that indicate the beginning of the next variable
> in the query string:
>
> RewriteEngine On
> RewriteMap urlencode prg:/usr/local/bin/urlencode
> RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]
>
> If the hackish little external program should die or get out of
> sync, we end up with the wrong URLs. But this ugliness *shouldn't*
> be needed. We *should* be able to use the internal function that
> Apache provides for this...
You are mistaken that Apache is doing the wrong thing: ampersands
are /not/ supposed to be urlencoded--they are valid and meaningful
characters needed for URLs. But ampersands do need to be messed with
for Wikipedia-specific reasons: since article titles must appear as
values in the query string (which is separated by ampersands), they
must be escaped somehow for that function. Also, the
non-escaped ampersands in the URL must be HTML-escaped when they
appear as attribute values, such as HREFs. These are two entirely
separate issues, and the code formerly dealt with both correctly,
although in a way that you didn't like. We may have to compromise:
accept for ampersands the double-encoding that you removed for other
characters. Either that, or come up with some other escaping
mechanism for ampersands in titles.
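
A minimal sketch of those two separate steps, in Python rather than the
site's PHP. The function names title_to_query_url and href_attribute are
made up for illustration; the /w/wiki.phtml path and the redirect=no
parameter are taken from elsewhere in this thread.

```python
from html import escape
from urllib.parse import quote


def title_to_query_url(title: str, extra: str = "") -> str:
    """Step 1: percent-encode the title so its '&' cannot end the title= value."""
    url = "/w/wiki.phtml?title=" + quote(title, safe="")
    if extra:
        # This '&' is a real separator between variables, so it stays raw.
        url += "&" + extra
    return url


def href_attribute(url: str) -> str:
    """Step 2: HTML-escape the finished URL before it goes inside href="..."."""
    return escape(url, quote=True)


url = title_to_query_url("AT&T", "redirect=no")
print(url)                  # URL as the browser should request it
print(href_attribute(url))  # the same URL as it should appear in HTML source
```

The ampersand inside the title only goes through step 1, and the separator
ampersand only goes through step 2, which is why the two issues can be
handled independently.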
lcrocker(a)nupedia.com wrote:
>>The following URL goes successfully to the desired article:
>>* http://www.wikipedia.org/w/wiki.phtml?title=Insomnia&redirect=no
>>
>>The "random page" error occurs for me only when I follow a link by
>>clicking on it:
>
>>* http://www.wikipedia.org/wiki/Insomnia
>>
>>Maybe this behavior difference of the 2 forms of the same URL will
>>provide a hint of the bug's cause.
>
> Thanks, and yes, that's a clue. I just rebooted the server
> since Jimbo said that fixed it last time, but I'll look into
> it more closely now--maybe mod_redirect is dying for some reason?
As I've mentioned before, I'm pretty sure it's the encoding hack I set
up to keep ampersands in titles _in_ the titles instead of as raw
ampersands that indicate the beginning of the next variable in the query
string:
RewriteEngine On
RewriteMap urlencode prg:/usr/local/bin/urlencode
RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]
If the hackish little external program should die or get out of sync, we
end up with the wrong URLs. But this ugliness *shouldn't* be needed. We
*should* be able to use the internal function that Apache provides for this:
RewriteEngine On
RewriteMap urlencode int:escape
RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]
but that function doesn't encode ampersands (%26), so it misses the
entire point of the exercise.
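
For reference, a rough sketch of what an external prg: map program of
this kind might look like. The real /usr/local/bin/urlencode is not shown
in this thread, so this is a stand-in in Python, not the actual script. It
assumes the usual prg: convention of one lookup key per line on stdin and
one answer per line on stdout, flushed after every answer, and it glosses
over the byte encoding of non-ASCII titles.

```python
import sys
from urllib.parse import quote


def encode_key(key: str) -> str:
    # Percent-encode everything except "/" so that "&" becomes %26
    # and can no longer be mistaken for a query-string separator.
    return quote(key, safe="/")


def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    # One request line in, one reply line out. Each reply is flushed
    # immediately, because the server waits for it before continuing.
    for line in stdin:
        stdout.write(encode_key(line.rstrip("\n")) + "\n")
        stdout.flush()
```

A script built on this would simply end with a call to serve(). The
weakness described above still applies: if the process dies, or its
replies ever fall out of step with the requests, the rewritten URLs are
wrong.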
If anyone knows a cleaner way to do this, do speak up. I've gotten eerie
silence on alt.apache.configuration.
-- brion vibber (brion @ pobox.com)
There are a few feature requests that have come up multiple times, and I'd
like some comments from other developers and power users before I go
trashing the code on my own...
* "Most wanted" and "Most popular" special pages list the _total_ number
of links to a page, regardless of how many links there are _per page_.
Some types of lists can hyperinflate the numbers; a list of video games
might link [[Playstation]] 389 times (once for each Playstation game
listed). The behavior that people seem to expect is a count of _pages_
that link, rather than the raw number of links. I would tend to agree.
This can be switched by the simple addition of "DISTINCT" to a couple of
SQL queries; is there any reason to retain the current behavior?
* On blocking vandals: there's still no interface for blocking by
username, and you can't get the IP address of a logged-in user except by
sifting through the server logs. Should we retain and display IP
addresses/hostnames of logged-in editors (as on UseMod), and/or allow
usernames to be blocked?
* The administrative page rename feature on UseMod could optionally find
and change links to point to the new name in addition to just supplying
a redirect. We still haven't implemented this. Desirable?
* While the "Special:" namespace may be localized ("Spezial", "Speciel"
etc), the names of special pages are hardwired in English (hence
monstrosities like "Spezial:Recentchanges"). While these are mostly
hidden in the interface by descriptive names, the links, URLs, and most
annoyingly the tooltips on the links all show the raw internal English
name of the function which implements the special page. A table of
equivalencies could be set up, allowing more easily recognizable
localized names to be used. Good idea? Bad idea?
* For the French wiki, the Wikipedia: namespace is tentatively set up as
"Wikipédia" (with acute accent on the "e"). The parser doesn't accept
namespaces with non-ASCII chars so this doesn't work, which is a bug I
intend to fix, but additionally one tester asked:
'For the francophone wikipedians without a French keyboard, would it
be possible for the "Wikipedia:*" links to automatically transform
into "Wikipédia:*"? Or, more simply, could the system interpret e/é as
equal in the namespace portion?'
In short, allow aliases for namespaces. Good idea? Bad idea?
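
One way the alias idea could work, as a sketch only: fold case and
accents in the namespace prefix, then look it up in a per-language table.
The table contents, the function names, and the sample titles are all
illustrative assumptions, not anything in the current code.

```python
import unicodedata


def fold(name: str) -> str:
    """Lowercase and strip combining accents, so 'Wikipédia' matches 'wikipedia'."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


# Hypothetical table: canonical namespace name -> extra accepted aliases.
NAMESPACES = {
    "Wikipédia": ["Wikipedia"],  # also covered by accent folding alone
    "Spezial": ["Special"],      # needs an explicit alias
}

LOOKUP = {}
for canonical, aliases in NAMESPACES.items():
    for name in [canonical, *aliases]:
        LOOKUP[fold(name)] = canonical


def canonical_title(title: str) -> str:
    """Rewrite a known namespace prefix to its canonical form."""
    prefix, sep, rest = title.partition(":")
    if sep and fold(prefix) in LOOKUP:
        return LOOKUP[fold(prefix)] + ":" + rest
    return title  # no namespace, or an unknown one: leave untouched


print(canonical_title("Wikipedia:Foo"))
print(canonical_title("Special:Recentchanges"))
```

Treating e/é as equal falls out of the folding step; anything folding
cannot reach would need an explicit alias entry.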
-- brion vibber (brion @ pobox.com)
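
On the first item in the list above (counting linking pages rather than
raw links), a small sketch of the difference DISTINCT makes. It uses an
in-memory SQLite table from Python; the table and column names are
assumptions for the example, not the wiki's actual schema, and the two
extra linking pages are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (l_from TEXT, l_to TEXT)")

# One list page that links [[Playstation]] 389 times, plus two ordinary pages.
rows = [("List of video games", "Playstation")] * 389
rows += [("Sony", "Playstation"), ("Video game console", "Playstation")]
conn.executemany("INSERT INTO links VALUES (?, ?)", rows)

# Current behavior: every link counts.
raw_count = conn.execute(
    "SELECT COUNT(l_from) FROM links WHERE l_to = ?", ("Playstation",)
).fetchone()[0]

# Proposed behavior: each linking page counts once.
page_count = conn.execute(
    "SELECT COUNT(DISTINCT l_from) FROM links WHERE l_to = ?", ("Playstation",)
).fetchone()[0]

print(raw_count, page_count)
```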
Shouldn't wikipedias that have not been started yet be moved to the new
software? There is nothing to convert, so it seems to me that right now
would be the best time to do it: with no content, it'd be easier to start
them on the new software than to start them on the old and convert later.
There seem to be a lot of them; look here:
http://www.wikipedia.org/wiki/Wikipedia:Complete_list_of_language_wikis_available
Also, can someone pass this on to the international list? I suspect they
might have some more views on this, but I really don't need the traffic
from that list in my inbox.
Lightning
Yep.
Some observations on the off chance that this will help:
Almost all of the links on the website don't go where they're supposed to
go, but instead result in a random page. Links that *do* work include:
[[Edit this page]] (supposing you actually do want to edit a page you
randomly hit!)
[[Watch this page]] and [[Stop watching]]
[[Move page]]
[[History]] (of a page)
[[What links here]]
[[Watch links]]
[[Printable version]]
Another weird thing is that you can hit at least certain links several
times (if they appear on several successive pages) and eventually the
correct link *will* come up. For instance, I can hit [[Larry Sanger]]
(which appears in the upper right hand of my screen) and after two hits
usually my user page does come up. Similarly with the [[Recent changes]]
link.
Also, a lot of the "random" links come up over and over, I noticed.
Would be very interesting to learn what the bug was!
--Larry
>My question is, is there a good reason that the database-creation
>scripts are defining the default as 1, or should it just be set to 0?
I don't recall having any specific reason for it. It's just
a simple flag exactly as you describe.
I finally got around to ferreting out the URL-encoding problem, which
could produce some URLs that were encoding correctly, but others that
actually encoded the URL-encoded form. For instance, for a page titled
'Anátomy?' we might see hrefs that are incorrectly double-encoded like
these:
http://www.piclab.com/wikitest/wiki.phtml?title=An%25E1tomy%253F
http://www.piclab.com/wikitest/wiki.phtml?title=An%25E1tomy%253F&amp;action…
http://www.piclab.com/wikitest/wiki.phtml?title=An%25E1tomy%253F&amp;action…
http://www.piclab.com/wikitest/wiki.phtml?title=Talk:An%25E1tomy%253F&amp;a…
as well as correct ones like:
http://www.piclab.com/wikitest/wiki.phtml?title=Special:Userlogin&amp;retur…
http://www.piclab.com/wikitest/wiki.phtml?title=Special:Whatlinkshere&amp;t…
http://www.piclab.com/wikitest/wiki.phtml?title=Special:Recentchangeslinked…
(Note that the &s in the URLs appear here as &amp; because I copied them
right out of the HTML source of the page, where they must appear that
way to be legal HTML.)
A hackish redundant urldecode() in Title::newFromURL() was presumably
added to catch the first case. (PHP decodes URL variables before we get
them, so it's not necessary on correctly-encoded URLs.) I'd prefer to
remove it, but double-encoded URLs have been polluting the search
engines for some time and we have to retain compatibility.
The culprit was wfLocalUrl(), which takes two parameters, a wiki page
title and a section for additional URL bits; it URL-encodes the title,
then tacks both onto the server's hostname... but the first one has
already been encoded by Title::getPrefixedURL(), so we get the fubar'd
double-encoding above.
The encoding stays correct in the target=, returnto=, etc. parameters
because the URL bits aren't encoded a second time (the &s can't be
URL-encoded or they lose their meaning).
I've removed the redundant encoding from wfLocalUrl(); I haven't come
across another mis-encoded URL since, and I've been trying.
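
The double-encoding can be reproduced in a few lines. This is a Python
sketch, not the PHP in question: encode_title stands in for
Title::getPrefixedURL(), local_url_buggy and local_url_fixed stand in for
wfLocalUrl() before and after the change, and the Latin-1 encoding is an
assumption chosen to match the %E1 in the example URLs.

```python
from urllib.parse import quote, unquote

BASE = "/wikitest/wiki.phtml?title="


def encode_title(title: str) -> str:
    # Stand-in for Title::getPrefixedURL(): encode once, keep ':' readable.
    return quote(title, safe=":", encoding="latin-1")


def local_url_buggy(encoded_title: str, extra: str = "") -> str:
    # Old behavior: encodes a title that is already encoded, so % becomes %25.
    url = BASE + quote(encoded_title, safe=":")
    return url + ("&" + extra if extra else "")


def local_url_fixed(encoded_title: str, extra: str = "") -> str:
    # New behavior: trust the caller's encoding and only glue the parts together.
    url = BASE + encoded_title
    return url + ("&" + extra if extra else "")


t = encode_title("Anátomy?")
print(local_url_buggy(t))
print(local_url_fixed(t, "redirect=no"))

# Why the redundant decode stays: old double-encoded links need two passes.
once = unquote("An%25E1tomy%253F", encoding="latin-1")  # what PHP hands us
print(unquote(once, encoding="latin-1"))                # the extra urldecode()
```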
Additionally, I've added a check to Title::newFromURL() that inspects
the character encoding of links coming in from the outside; for a
latin-1 wiki UTF-8 encoded links are detected and converted to latin-1,
and on a UTF-8 wiki latin-1 links are detected and converted to UTF-8.
(The check is done in Language::checkTitleEncoding() and can be
customized by language; I've set up the Polish to detect Latin-2 and the
Esperanto to detect X-surrogates, so they'll be able to retain
compatibility with existing links once converted.)
This is needed for a few reasons:
* Some browsers (notably Internet Explorer) send URLs encoded in UTF-8
if you type them into the URL bar or follow a link that's not
URL-encoded. Thus we were getting mis-encoded titles from time to time
when someone typed a title with accented chars directly into the URL bar
or followed links from differently-encoded external sites.
* This should help with linking between the various language wikis, with
less need for manually adding URL-encoding to interlanguage links that
cross encodings.
* As noted above, compatibility with old URLs on some wikis.
Note that _theoretically_ a legal UTF-8 sequence could also be legal ISO
8859-1. (But not bloody likely -- an uppercase accented letter followed
by a single high punctuation mark or symbol, or a lowercase accented
letter followed by two or three high punctuation marks or symbols.)
Title URLs aren't checked or converted if the referer matches our
server, so one could still work with such a page; just set up a redirect
from the converted form for the benefit of outside links.
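
The detection described above could look roughly like this. It is a
Python sketch under stated assumptions, not the code in
Language::checkTitleEncoding(): the function names are made up, only the
Latin-1 and UTF-8 cases are covered, and the referer exemption is left
out.

```python
def is_ascii(raw: bytes) -> bool:
    return all(b < 0x80 for b in raw)


def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_incoming_title(raw: bytes, wiki_encoding: str) -> bytes:
    """Return the title bytes in the wiki's own encoding."""
    if is_ascii(raw):
        return raw  # plain ASCII is identical in both encodings
    if wiki_encoding == "latin-1":
        if is_valid_utf8(raw):
            # High bytes that form legal UTF-8: assume the link was UTF-8.
            return raw.decode("utf-8").encode("latin-1", errors="replace")
        return raw
    if wiki_encoding == "utf-8":
        if is_valid_utf8(raw):
            return raw
        # Not legal UTF-8: assume Latin-1 and convert.
        return raw.decode("latin-1").encode("utf-8")
    raise ValueError("unsupported wiki encoding: " + wiki_encoding)


print(convert_incoming_title("Anátomy".encode("utf-8"), "latin-1"))
print(convert_incoming_title(b"An\xe1tomy", "utf-8"))

# The theoretical ambiguity: 0xC3 0xA9 is both Latin-1 "Ã©" and UTF-8 "é".
print(convert_incoming_title(b"\xc3\xa9", "latin-1"))
```

On a Latin-1 wiki the last case is converted, so a title that really was
an uppercase accented letter followed by a high symbol would need the
redirect workaround mentioned above.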
-- brion vibber (brion @ pobox.com)
You know that handy little asterisk that notifies you when there's
something new on your user talk page since you last viewed it?
Well, on the English wiki the default value of the user_newtalk field
which controls said notifier is 0. On the other wikis, it's 1 as defined
in buildTables.inc.
A newly created user account has the field set to 0 anyway by
User::addToDatabase(), but in a wiki that's converted up from phase II
(i.e., meta-wikipedia and potentially Enciclopedia Libre) a migrated user
account without an existing user talk page sees the asterisk
continuously until a user talk page is created and viewed, since the
field doesn't get cleared if the talk page doesn't exist...
My question is, is there a good reason that the database-creation
scripts are defining the default as 1, or should it just be set to 0?
-- brion vibber (brion @ pobox.com)