Hello,
I changed TcpQuery to send its output as JSON (without the URL),
and I added Nick Jenkins' great autocomplete feature.
Results for English and French are available at
(includes the heuristic to choose the
correct redirection and the handling of articles with different
capitalization).
The format of fsa.bin and articles.bin is now a little bit
different, so you need to redownload them from :
English :
The latest version of the sources with all these modifications is
available at :
I hope you will enjoy all these modifications and I am open to any kind
of suggestion/modification.
I ran a benchmark of TcpQuery with the MemoryQuery backend and 5
threads on my computer (Pentium D930), using 10 client threads to simulate
queries. It handled 154000 random queries in 24.7 seconds with CPU usage
of 100% (about 6234 queries per second).
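For reference, the throughput figure can be re-derived from the two measured numbers. This is only a small arithmetic sketch; the variable names are illustrative, and the per-thread figure assumes the load was spread evenly over the 5 server threads, which was not measured.

```javascript
// Figures taken from the benchmark described above.
var queries = 154000;      // random queries handled
var seconds = 24.7;        // wall-clock duration
var serverThreads = 5;     // TcpQuery worker threads

// Overall throughput in queries per second.
var qps = queries / seconds;

// Assumption: an even split across the server threads.
var qpsPerThread = qps / serverThreads;

console.log("total: " + qps.toFixed(1) + " queries/s");
console.log("per thread (assuming an even split): " + qpsPerThread.toFixed(1) + " queries/s");
```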
I plan to write an analyzer more dedicated to Wikipedia, but for the moment
I do not know how to get titles/redirections/links.
Do you know how to get the target of redirections from the SQL database?
Do you think taking pages-articles.xml.bz2 and updating the index every month
is acceptable?
Best Regards.
Julien Lemoine
Nick Jenkins wrote:
But the URL needs to be added, since it is different from the title.
Yep, but you can work out the url from the title:
------------------------------
function titleToUrl(title) {
    var chr, url = "";
    for (var i = 0; i < title.length; i++) {
        chr = title.charCodeAt(i);
        // Spaces become underscores; every other character goes through escape().
        url += (chr == 32 ? "_" : escape(String.fromCharCode(chr)));
    }
    return url;
}
// quick test:
var test_data = ["Roman Catholic Church", "cat (disambig)",
    "\"!@$^&*))(_--{}"];
for (var i = 0; i < test_data.length; i++) {
    document.write(test_data[i] + " equals: " + titleToUrl(test_data[i]) +
        " <br>\n");
}
------------------------------
Output is:
------------------------------
Roman Catholic Church equals: Roman_Catholic_Church <br>
cat (disambig) equals: cat_%28disambig%29 <br>
"!@$^&*))(_--{} equals: %22%21@%24%5E%26*%29%29%28_--%7B%7D <br>
------------------------------
(which seems identical to what the Wikipedia gives too).
Don't have to do it this way though, and if you'd prefer to do it on the server
side, then do that.
I just thought that transmitting less data and potentially storing less data might help.
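One caveat with escape() is that it encodes accented characters as Latin-1 (or as %uXXXX beyond that), which may matter for French titles. Below is a rough sketch of a UTF-8 variant built on encodeURIComponent. The name titleToUrlUtf8 is made up for illustration, and I have not checked that the result matches Wikipedia's URLs in every case; unlike escape(), it also encodes "@" and "*".

```javascript
// Hypothetical helper (not part of the original code): UTF-8 variant of titleToUrl.
function titleToUrlUtf8(title) {
    // Spaces become underscores, as in titleToUrl.
    var withUnderscores = title.replace(/ /g, "_");
    // encodeURIComponent percent-encodes as UTF-8 but leaves ! ' ( ) * untouched,
    // so those five characters are encoded by hand afterwards.
    return encodeURIComponent(withUnderscores).replace(/[!'()*]/g, function (c) {
        return "%" + c.charCodeAt(0).toString(16).toUpperCase();
    });
}

console.log(titleToUrlUtf8("Roman Catholic Church")); // Roman_Catholic_Church
console.log(titleToUrlUtf8("cat (disambig)"));        // cat_%28disambig%29
console.log(titleToUrlUtf8("Caf\u00e9"));             // Caf%C3%A9
```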
Did you use JSON in ECMAScript/JavaScript?
Nah, I just make this stuff up as I go along. ;-)
Should work fine though:
------------------------------
// The string passed to eval() must not contain raw newlines,
// so it is built by concatenation here.
var json_data = eval("[\"cat\"," +
    "[\"Catholics\", 7505, \"Roman Catholic Church\"]," +
    "[\"Catholic Archibishop\", 4484, \"Bishop\"]," +
    "[\"Catholic\", 4200, ],[\"Catholic\", 3269, ]," +
    "[\"CATV\", 2347, \"Cable television\"]," +
    "[\"Catalogue astrographique\", 2095, \"Star catalogue\"]," +
    "[\"Catholic Encyclopedia\", 1956, ],[\"Catalonia\", 1740, ]," +
    "[\"Cattle\", 1604, ],[\"Catholicism\", 1527, ]]");
alert("length: " + json_data.length + " data: " + json_data);
------------------------------
I.e. you may have to get rid of the newlines in the data stream.
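Stripping the newlines on the client could look roughly like the sketch below, which uses JSON.parse rather than eval(). The function name parseSuggestions and the field names (title, count, redirect) are my own guesses at what the three values mean. It also assumes the server sends strict JSON, so an entry without a redirect has to omit the third element or send null instead of ending in a trailing comma.

```javascript
// Hypothetical helper: turn the raw server response into a list of suggestions.
function parseSuggestions(raw) {
    // Raw newlines are not allowed inside JSON strings, so replace them with spaces.
    var text = raw.replace(/[\r\n]+/g, " ");
    var data = JSON.parse(text);
    var entries = [];
    // data[0] is the typed prefix; the remaining elements are the suggestions.
    for (var i = 1; i < data.length; i++) {
        var row = data[i];
        entries.push({
            title: row[0],
            count: row[1],                            // assumed meaning of the number
            redirect: row.length > 2 ? row[2] : null  // assumed: redirect target, if any
        });
    }
    return { prefix: data[0], entries: entries };
}

// Sample response with line breaks left in, as they might arrive on the wire.
var sample = "[\"cat\",[\"Catholics\", 7505,\n\"Roman Catholic Church\"]," +
    "[\"Catalogue\nastrographique\", 2095, \"Star catalogue\"]," +
    "[\"Cattle\", 1604]]";
var parsed = parseSuggestions(sample);
console.log(parsed.prefix + ": " + parsed.entries.length + " entries");
```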
All the best,
Nick.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l