I've been an enthusiastic downloader and user of Domas' wikistats
(page counter) logfiles since they first appeared on December 9,
2007:
http://lists.wikimedia.org/pipermail/wikitech-l/2007-December/035435.html
One problem, however, is that they are plain text files that need
to be loaded into some kind of database system before you can do
any interesting analysis. Whether text files or XML or MySQL
dumps, they all take quite some time to import. It's like
unpacking a huge tar archive, rather than instantaneously mounting
an ISO 9660 image, if you get the analogy. You could probably
load the data into MySQL and then distribute the raw tablespace
files, but I haven't heard of any project that does this. MySQL
wasn't built with this in mind. The data could be loaded into
MySQL at the toolserver (maybe someone did already?), and we could
each run our queries there, but that doesn't scale if many people
want to run heavy queries.
When looking around, I found SQLite (www.sqlite.org), a free-software
(public domain, actually) lightweight SQL database engine. It is
serverless and runs entirely as a single-user application, storing
tables and indexes in a plain file.
As an experiment, I loaded the first two months of page counter
log files for the Swedish Wikipedia into SQLite (version 3). The
resulting database file is 3.1 GB, which bzip2 shrinks to 638 MB.
The idea is that you can download these 638 MB, run bunzip2, and
then start sqlite3 and run SQL queries right away. Getting started
takes only a few minutes.
Now I want to find out whether this is a useful scheme. If it is, I
could set up a process to provide such SQLite dumps for all
languages. But perhaps some parts need adjustment or tuning first.
I need your feedback for this. You'll have to analyze the Swedish
Wikipedia initially, since that's all I provide for now.
Here's what I have done:
First I decode the URL encoding and normalize some page names.
Main_Page, Main+Page and Main%20Page are all converted to
Main_Page, and even translated to Huvudsida for the Swedish
Wikipedia. That means I add up the page counters for these page
names and store them under Huvudsida. All page names are stored in
one table (names) and given an integer primary key (names.id), to
avoid duplicate storage of text strings. The names table now has
1.9 million entries.
Each logfile covers one hour, timestamped in UTC. A table called
"times" uses "unix seconds" as a primary key (times.unix) and
lists the year, month, day-of-month, week, day-of-week, and hour.
Perhaps this was unnecessary given the date and time functions
provided by SQLite, but I still believe it can be helpful. For
these 2 months (62 days), the times table has 1734 entries.
The big "counts" table contains the language ("sv") as a text
field and integer fields for time (references times.unix), name
(references names.id) and count. The counts table has 68.3
million entries.
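For reference, the schema is roughly the following sketch,
reconstructed from the description above (the column names for
language, week, day-of-week and hour are my guesses; only names.id,
names.name, times.unix, times.year/month/mday and
counts.time/name/count are confirmed by the query below):

create table names (
  id    integer primary key,  -- surrogate key, referenced by counts.name
  name  text                  -- normalized page name, e.g. 'Huvudsida'
);

create table times (
  unix  integer primary key,  -- start of the hour, in unix seconds (UTC)
  year  integer, month integer, mday integer,
  week  integer, wday integer, hour integer
);

create table counts (
  lang  text,                 -- language code, 'sv' for now
  time  integer,              -- references times.unix
  name  integer,              -- references names.id
  count integer               -- page views for that page in that hour
);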
A typical query you can run is
select sum(count), year, month, mday
from counts, times, names
where names.name='Huvudsida'
and year=2007
and names.id=counts.name and counts.time=times.unix
group by 2,3,4;
Queries are not necessarily fast, but you can create indexes as
you wish. Are there any indexes you would like me to build and
supply as part of the distributed database file?
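For example, indexes like these (names as in the schema sketch above,
just an illustration) would speed up the per-page lookups in the
query above:

create index names_name_idx on names (name);
create index counts_name_time_idx on counts (name, time);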
The query above returns this result:
32403|2007|12|9
119005|2007|12|10
117551|2007|12|11
107630|2007|12|12
102178|2007|12|13
88766|2007|12|14
65733|2007|12|15
87048|2007|12|16
106643|2007|12|17
96751|2007|12|18
86955|2007|12|19
74297|2007|12|20
63383|2007|12|21
57908|2007|12|22
59360|2007|12|23
45230|2007|12|24
56469|2007|12|25
58494|2007|12|26
66068|2007|12|27
63538|2007|12|28
65137|2007|12|29
68636|2007|12|30
55821|2007|12|31
Currently, you'll find the database file (both compressed and not)
at http://mirabell.runeberg.lysator.liu.se/
Here's what you need to do (UNIX/Linux commands):
sudo apt-get install bzip2
sudo apt-get install sqlite3
wget http://mirabell.runeberg.lysator.liu.se/sv-counts-20080219.db.bz2
bunzip2 sv-counts-20080219.db.bz2
sqlite3 sv-counts-20080219.db
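Once you are at the sqlite3 prompt, a quick sanity check might be
(the row count should come out at the 68.3 million mentioned above):

.tables
select count(*) from counts;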
That URL is not permanent, but only available for the current
test.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
I am trying to create an offline Wikipedia client for the Wikipedia
XML dump. I know there are a lot of programs out there, but they all
seem to render the pages very badly: the wiki markup has evolved
considerably, while these programs are quite outdated and almost
dead.
After scanning the internet since yesterday, I have come up with a
number of libraries and programs, but none of them renders the pages
perfectly. Hence, I was toying with the idea of rendering the pages
using MediaWiki's PHP files, as the people at woc.fslab.de (Offline
Wikipedia Client) have done. I have downloaded Offline Wikipedia
Client, but I haven't yet been able to figure out how to use it.
Anyway, it looks too complicated and overly large. I want an offline
client of my own, served via HTTP (I have tried importing the dump
into a freshly installed MediaWiki, but rebuilding the links takes
forever, and WikiFilter is not for Linux, though I did try it under
Wine).
So, my question is: can anyone please point me to the PHP files of
MediaWiki that I could use with little modification? I intend to
provide all the necessary details as input to the PHP file: the list
of templates and their code to substitute, the categories, the
article markup, etc. I expect to get back the HTML that can be sent
to the user's browser.
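Just to illustrate the kind of thing I am hoping for, something
vaguely like this (a rough, untested sketch; the bootstrap file and
the exact Parser/ParserOptions calls are my guesses):

<?php
// Untested sketch. Assumes MediaWiki's includes/ can be bootstrapped
// outside a normal wiki request; the require below is a guess.
require_once( 'includes/WebStart.php' );

$wikitext = file_get_contents( 'article.wiki' );  // markup from the XML dump
$title    = Title::newFromText( 'Some_article' );
$options  = new ParserOptions();

$parser = new Parser();
$output = $parser->parse( $wikitext, $title, $options );

echo $output->getText();  // HTML to send to the user's browser
?>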
All this may look very pointless, but after battering my brains over
this thing and repeatedly getting disappointing results, my brain has
gone fuzzy and desperate.
May you have peace of mind.
Regards,
Apple Grew
my blog @ http://applegrew.blogspot.com/
Time for a lesson in basic PHP.
A bug was introduced in r25374 by Aaron last September, and despite
half a dozen people editing the few lines around that point, nobody
picked it up. Simetrical eventually fixed it in r29156, blaming a bug
in PHP's array_diff(). It was not a PHP bug.
$permErrors += array_diff(
	$this->mTitle->getUserPermissionsErrors('create', $wgUser),
	$permErrors );
I used to make the same error myself. Although I learnt from my mistakes,
we obviously haven't learnt as a team.
http://www.php.net/manual/en/language.operators.array.php
"The + operator appends elements of remaining keys from the right handed
array to the left handed, whereas duplicated keys are NOT overwritten."
That explains the behaviour of the array plus operator in its entirety. If
you add two arrays, and both have an element with a key of zero, the one
on the left-hand side wins. The elements are NOT renumbered.
For example:
> print_r( array( 'foo' ) + array( 'bar' ) );
Array
(
    [0] => foo
)
> print_r( array( 'foo' ) + array( 'bar', 'baz' ) );
Array
(
    [0] => foo
    [1] => baz
)
If you want the elements to be renumbered, use array_merge().
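For comparison, array_merge() on the second pair of arrays renumbers
as you would expect:

> print_r( array_merge( array( 'foo' ), array( 'bar', 'baz' ) ) );
Array
(
    [0] => foo
    [1] => bar
    [2] => baz
)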
-- Tim Starling
Hi,
I have two extensions (SelectCategoryTagCloud and FCKeditor) that both
use the following hook to access the content of the edit page input
box:
$wgHooks['EditPage::showEditForm:initial']
How can I control which extension gets to do its work first?
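As far as I can tell, both extensions simply append a handler to that
same hook array, something like this (the handler names here are just
placeholders, not the real function names):

$wgHooks['EditPage::showEditForm:initial'][] = 'scTagCloudShowHook'; // SelectCategoryTagCloud
$wgHooks['EditPage::showEditForm:initial'][] = 'fckShowHook';        // FCKeditor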
I would like SelectCategoryTagCloud to strip out all existing
category links in the wikitext (and put them in the second input box
for category assignment) and then have FCKeditor display the rest of
the text in the WYSIWYG editor. When the user saves the page, I would
like the order reversed: first FCKeditor stores the wikitext, then
SelectCategoryTagCloud appends the categories selected by the user at
the bottom of the wikitext. That way the database always stores plain
wikitext.
Any thoughts?
Thanks,
Andi
Hi,
I'm using the wikipdf extension
(http://sourceforge.net/projects/wikipdf/). The translation script is
written in Python, and that's my problem. I don't know how to program
in Python, and the problem is that if I use <math>"here stands a TeX
formula"</math>, the part of the text that is already written in TeX
is also translated. This means that special symbols like \sup_{}
don't stay the way they are... so instead of the rendered formula, my
PDF shows the TeX source of the formula.
Is anybody using this extension, or has anybody already had the same
problem and a solution?
thanx for the help
julia
For your interest...
"The Freebase Wikipedia Extraction (WEX) is a processed dump of the
English language Wikipedia. The wiki markup for each article is
transformed into machine-readable XML, and common relational features
such as templates, infoboxes, categories, article sections, and
redirects are extracted in tabular form.
"Freebase WEX is provided as a set of database tables in TSV format
for PostgreSQL, along with tables providing mappings between Wikipedia
articles and Freebase topics, and corresponding Freebase Types."
<http://download.freebase.com/wex/>
cheers,
Brianna
---------- Forwarded message ----------
From: Georgi Kobilarov <gkob(a)gmx.de>
Date: 20 Feb 2008 07:45
Subject: [Dbpedia-discussion] Freebase provides data dumps
To: dbpedia-discussion(a)lists.sourceforge.net
Hi all,
Freebase now provides dumps of their data extracted from Wikipedia. See
[1] [2]. Interesting stuff. It is nice to see that Metaweb follows the
ideas of DBpedia ;)
@Metaweb: it's time to open source your extraction framework as well. (I
know you read this :)
Cheers,
Georgi
[1] http://blog.freebase.com/?p=108
[2] http://download.freebase.com/wex/
--
They've just been waiting in a mountain for the right moment:
http://modernthings.org/