Hi Julien,
I totally rewrote the compiler to work with the XML format of
download.wikipedia.org.
Sounds good, and I really like it having the top 10 languages.
I think _maybe_, though, it could be useful to have two variations of the
Analyzer (like you have two variations of TcpQuery: one that uses DiskQuery,
and one that uses MemoryQuery). With Analyzer, it could be good to have one
that connects to MySQL and gets the data directly from the database, and one
that uses the downloaded XML dumps. This way, people can use whichever one is
most appropriate for them. For example, someone running a big MediaWiki site
who wanted to look at the possibility of using suggestion searching probably
wouldn't want to create an XML dump and then run Analyzer on it (this would
be too slow, involve too many steps, and take a lot of disk space). Rather,
in that situation it would be nice to create the compiled files directly
from the database, if possible.
To try and help with this, I've modified a copy of Analyzer.cpp to add basic
importing from MySQL (i.e. it does not use any downloaded files). So far it
imports just the article names, not redirects or article counts. The rough
file (which still needs work for redirects + article counts) is here:
http://files.nickj.org/MediaWiki/MysqlAnalyzerCmd.cpp
Please note that I have not used C or C++ in a _very_ long time, so if it
looks like I have done something silly then that is almost certainly
correct. :-)
To compile and run this on a Debian/Ubuntu system, I did this:
# Install required MySQL libraries
apt-get install libmysqlclient15-dev
cd cmd
# Compile:
g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../expat/lib -g -O2 -O3 \
    -MT MysqlAnalyzerCmd.o -MD -MP -MF ".deps/MysqlAnalyzerCmd.Tpo" \
    -c -o MysqlAnalyzerCmd.o MysqlAnalyzerCmd.cpp
# Link: (Note: needs " -lmysqlclient" parameter)
g++ -g -O2 -O3 -o MysqlAnalyzer -L../tools -L../serialization -L../analyzer \
    MysqlAnalyzerCmd.o -lanalyzer -lserialization -ltools -lexpat -lglib-2.0 \
    -lmysqlclient
# Run (change hostname / username / password / database-name params as required) :
./MysqlAnalyzer localhost wikiuser FakePasswd wikidb
If it is working, it should print out something like this:
-----------------------------
Connection success
Found 12345 articles
-----------------------------
Then use the .bin files with TcpQuery as per usual.
Also, there is a small diff for WSuggest.js to fix a small problem in my
autocomplete stuff. For example, suppose the user typed "Aer", then moved the
text cursor back to be between the 'A' and the 'e', typed 'm' (to make
"Amer"), then typed 'p' (to try and spell 'Amper'). In between typing the
'm' and the 'p', the cursor position would jump to the end of the text box
to try and autocomplete "American", so the result of pressing 'p' would be
'Amerp', not 'Amper'. To prevent this, it will now only try to autocomplete
if the cursor position is at the end of the text field. Diff is here:
http://files.nickj.org/MediaWiki/WSuggest.js-0.4-autocomplete-update.txt
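The idea behind that check can be sketched roughly as below. This is only an illustration, not the code from the diff: the function names caretAtEnd and tryAutocomplete are made up for this example, and it assumes the text field exposes the standard value, selectionStart, selectionEnd and setSelectionRange members. It also assumes the completed part gets selected so that the next keystroke replaces it, which may not match what WSuggest.js does.

```javascript
// Hypothetical helper names; not taken from WSuggest.js.

// True only when the caret sits at the very end of the field
// and no text is selected.
function caretAtEnd(field) {
  return field.selectionStart === field.value.length &&
         field.selectionEnd === field.value.length;
}

// Append the rest of `suggestion` to what the user typed and select the
// appended part. Returns false and leaves the field untouched when the
// caret is not at the end, or when the suggestion does not extend the text.
function tryAutocomplete(field, suggestion) {
  var typed = field.value;
  if (!caretAtEnd(field)) {
    return false; // user is editing mid-text: do not move the caret
  }
  if (suggestion.length <= typed.length) {
    return false; // nothing to add
  }
  if (suggestion.toLowerCase().indexOf(typed.toLowerCase()) !== 0) {
    return false; // suggestion does not start with the typed text
  }
  field.value = typed + suggestion.substring(typed.length);
  field.setSelectionRange(typed.length, field.value.length);
  return true;
}
```

In the "Amer" case above, the caret sits after the 'm' (position 2 of 4), so tryAutocomplete returns false and the following 'p' lands where the user expects.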
All the best,
Nick.