> ----- Original Message -----
> From: "Brion Vibber" <brion(a)pobox.com>
> To: "Wikimedia developers" <wikitech-l(a)wikimedia.org>
> Subject: Re: [Wikitech-l] Parsing database dumps
> Date: Thu, 28 Sep 2006 13:50:07 -0700
>
>
> Those particular page titles are in the main (article) namespace, which has no
> prefix:
>
> > <namespace key="0" />
> [snip]
> > <title>AaA</title>
> [snip]
> > <title>AlgeriA</title>
OK, thanks, that makes sense. I nearly made that leap myself, but thought I'd better not without really understanding it. So are you saying that if there's no colon in the title, it's safe to assume the page goes in namespace 0? And if there is a colon, I need to check whether the text before it matches a namespace name, and if it doesn't, the colon is just part of the article title?
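That rule can be sketched as a small helper. This is a hypothetical illustration, not part of any MediaWiki library: the `NAMESPACES` dict is built by hand from the `<siteinfo>` block of the English dump, and `split_title` is my own name for the function.

```python
# Hypothetical sketch: resolve a dump <title> string to a namespace key.
# NAMESPACES mirrors the <siteinfo> namespace map from the enwiki dump;
# key 0 (the main/article namespace) has no prefix and is the fallback.
NAMESPACES = {
    "Media": -2, "Special": -1, "Talk": 1, "User": 2, "User talk": 3,
    "Wikipedia": 4, "Wikipedia talk": 5, "Image": 6, "Image talk": 7,
    "MediaWiki": 8, "MediaWiki talk": 9, "Template": 10,
    "Template talk": 11, "Help": 12, "Help talk": 13, "Category": 14,
    "Category talk": 15, "Portal": 100, "Portal talk": 101,
}

def split_title(title):
    """Return (namespace_key, rest_of_title) for a page title."""
    if ":" in title:
        prefix, rest = title.split(":", 1)
        if prefix in NAMESPACES:
            return NAMESPACES[prefix], rest
    # No colon, or the prefix is not a known namespace: the page is in
    # the main namespace (0) and any colon is just part of the title.
    return 0, title
```

Note that a title like "Doctor Who: The Movie" contains a colon but has no namespace prefix, so it stays in namespace 0 with the colon intact.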
Mike O
--
_______________________________________________
Surf the Web in a faster, safer and easier way:
Download Opera 9 at http://www.opera.com
Powered by Outblaze
> The namespace prefix appears at the beginning of the page title, which appears
> as the text contents of the /mediawiki/page/title element,
> separated by a colon
> from the remaining title part.
That's what I gathered from reading the Wiki docs before I started trying to parse the XML. But look at this snippet from enwiki-latest-pages-articles.xml.bz2, taken in late August. I don't see any namespace prefix in the <title> elements. Any ideas why, or am I looking in the wrong spot?
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.8alpha</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2">Media</namespace>
<namespace key="-1">Special</namespace>
<namespace key="0" />
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
<namespace key="3">User talk</namespace>
<namespace key="4">Wikipedia</namespace>
<namespace key="5">Wikipedia talk</namespace>
<namespace key="6">Image</namespace>
<namespace key="7">Image talk</namespace>
<namespace key="8">MediaWiki</namespace>
<namespace key="9">MediaWiki talk</namespace>
<namespace key="10">Template</namespace>
<namespace key="11">Template talk</namespace>
<namespace key="12">Help</namespace>
<namespace key="13">Help talk</namespace>
<namespace key="14">Category</namespace>
<namespace key="15">Category talk</namespace>
<namespace key="100">Portal</namespace>
<namespace key="101">Portal talk</namespace>
</namespaces>
</siteinfo>
<page>
<title>AaA</title>
<id>1</id>
<revision>
<id>46448774</id>
<timestamp>2006-04-01T12:07:25Z</timestamp>
<contributor>
<username>Gurch</username>
<id>241822</id>
</contributor>
<minor />
<comment>{{R from CamelCase}}</comment>
<text xml:space="preserve">#REDIRECT [[AAA]] {{R from CamelCase}} {{R from other capitalisation}}</text>
</revision>
</page>
<page>
<title>AlgeriA</title>
<id>5</id>
<revision>
<id>18063769</id>
<timestamp>2005-07-03T11:13:13Z</timestamp>
<contributor>
<username>Docu</username>
<id>8029</id>
</contributor>
<minor />
<comment>adding cur_id=5: {{R from CamelCase}}</comment>
<text xml:space="preserve">#REDIRECT [[Algeria]]{{R from CamelCase}}</text>
</revision>
</page>
[...]
Mike O
I'm having a bit of trouble figuring out the database dump XML. Looking at the articles dump, I see that page content is wrapped in <page> and </page> elements. What I don't see is how to determine which namespace an article belongs to. I see the namespace elements at the top of the file, but how do I match articles to the right namespace?
Mike O
An automated run of parserTests.php showed the following failures:
Running test TODO: Table security: embedded pipes (http://mail.wikipedia.org/pipermail/wikitech-l/2006-April/034637.html)... FAILED!
Running test TODO: Link containing double-single-quotes '' (bug 4598)... FAILED!
Running test TODO: Template with thumb image (with link in description)... FAILED!
Running test Template infinite loop... FAILED!
Running test TODO: message transform: <noinclude> in transcluded template (bug 4926)... FAILED!
Running test TODO: message transform: <onlyinclude> in transcluded template (bug 4926)... FAILED!
Running test BUG 1887, part 2: A <math> with a thumbnail- math enabled... FAILED!
Running test TODO: HTML bullet list, unclosed tags (bug 5497)... FAILED!
Running test TODO: HTML ordered list, unclosed tags (bug 5497)... FAILED!
Running test TODO: HTML nested bullet list, open tags (bug 5497)... FAILED!
Running test TODO: HTML nested ordered list, open tags (bug 5497)... FAILED!
Running test TODO: Parsing optional HTML elements (Bug 6171)... FAILED!
Running test TODO: Inline HTML vs wiki block nesting... FAILED!
Running test TODO: Mixing markup for italics and bold... FAILED!
Running test TODO: 5 quotes, code coverage +1 line... FAILED!
Running test TODO: HTML Hex character encoding.... FAILED!
Running test TODO: dt/dd/dl test... FAILED!
Passed 412 of 429 tests (96.04%) FAILED!
All,
I am planning to build a computer that holds a copy of en.wikipedia (including portals), en.wiktionary, en.wikibooks, and species.wikimedia, along with the pictures. The computer will be sent to India to be used in educational settings, and will probably be copied and shared.
I have already built a mirror of en.wikipedia at http://freeknowledge.dyndns.org/; however, some of the pages are not rendered properly. For example: http://freeknowledge.dyndns.org/index.php/India
Any suggestions or advice on how to fix this, and on pitfalls I am likely to encounter while adding the other sites (en.wiktionary, en.wikibooks, and probably ta.wik*), would be greatly appreciated.
-Krishna
=====================================
Misinterpreting Copyright by Richard Stallman
"Die Gedanken Sind Frei": Free Software and the Struggle for Free Thought by Eben Moglen mp3 ogg
Free Knowledge blog
Sorry, it took me longer to get around to this than I'd planned. I restored the 6-million-row categorylinks table to a local computer for testing and threw some SQL at it. I got mixed results: in the first pass I used my "count ... group by" approach and ran different queries to get the pages at the intersection of two categories. I used at least semi-meaningful categories, to make the testing somewhat representative of real usage.
I got several sets of results in under 1 second (the lowest time being 0.3 seconds), but one query returned in 8 seconds and another in 36 seconds. I'm going to re-run the same queries with an empty query cache (I still need to learn how to clear it) several times to see how repeatable the timings are, then see what the long query times correlate with. Intuitively I'm guessing they come from intersections of large categories, but I haven't tested that yet, even though it's easy to do. I'll publish detailed results with figures and the actual queries once I've got more data. (I plan to do this tonight or tomorrow night.)
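For readers following along, the "count ... group by" intersection approach described above looks roughly like this. The sketch runs against a toy categorylinks table in SQLite for illustration (the real table is MySQL, and the category names here are made up):

```python
# Toy demonstration of the "count ... group by" category-intersection
# query: a page is in the intersection when it matches as many rows as
# there are requested categories.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "Physics"), (1, "History"),  # page 1 is in both categories
    (2, "Physics"),                  # page 2 is in only one
    (3, "History"),
])

# Group the matching rows by page and keep only pages whose match count
# equals the number of categories asked for (2 here).
rows = conn.execute("""
    SELECT cl_from FROM categorylinks
    WHERE cl_to IN ('Physics', 'History')
    GROUP BY cl_from
    HAVING COUNT(*) = 2
""").fetchall()
```

On the real table the cost presumably depends on how many categorylinks rows the WHERE clause pulls in before grouping, which would fit the guess that large categories are the slow cases.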
Best Regards,
Aerik
Hello
Could somebody please tell me whether texvc is still under development, and who the author(s) are, so that I could contact them?
Thanks and regards
Uwe Brauer
> ----- Original Message -----
> From: "Brion Vibber" <brion(a)pobox.com>
> To: "Wikimedia developers" <wikitech-l(a)wikimedia.org>
> Subject: Re: [Wikitech-l] Finding moved/redirected pages
> Date: Tue, 26 Sep 2006 12:26:00 -0700
>
>
> That is indeed the right record.
Thanks Brion, I see now what is happening: the other records point to history. So essentially it's a chain, where the top (page_latest in the page table) points to the current revision and the rest point to historic revisions.
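So the lookup for the current text is a three-table join: page.page_latest to revision.rev_id, then revision.rev_text_id to text.old_id. A sketch on a toy SQLite copy of the tables (column names follow the MediaWiki schema; the data is made up):

```python
# Toy model of the chain: page -> current revision -> text blob.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER, page_title TEXT, page_latest INTEGER);
    CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER, rev_text_id INTEGER);
    CREATE TABLE text (old_id INTEGER, old_text TEXT);
    INSERT INTO page VALUES (5, 'Algeria', 18063769);
    INSERT INTO revision VALUES (18063769, 5, 901);  -- current revision
    INSERT INTO revision VALUES (12345678, 5, 900);  -- an older revision
    INSERT INTO text VALUES (901, 'current wikitext');
    INSERT INTO text VALUES (900, 'old wikitext');
""")

# Follow page_latest, not just any revision of the page.
row = conn.execute("""
    SELECT old_text FROM page
    JOIN revision ON rev_id = page_latest
    JOIN text ON old_id = rev_text_id
    WHERE page_title = 'Algeria'
""").fetchone()
```

Joining on rev_page alone would pick up historic revisions too; anchoring on page_latest is what selects the current one.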
Can someone explain how to programmatically find the current text of a redirected article? I can't seem to figure this out. Say, for example, 'article A' is created, then moved to 'article B', then moved again to 'article C'. The page table has an entry for 'article C' (and for 'article B', for that matter) with the original page_id assigned when the article was first created as 'article A', along with a page_latest pointer to a record in the revision table (rev_id). That revision record's rev_text_id points to the actual article content in the text table. But since the article was moved twice, the record I reach in the text table isn't the right one: it's the original content, not the current content.
How do I trace this logically to find the current article content?
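One piece of the answer is recognizing when the text you fetched is itself a redirect, so you can repeat the lookup on the target title. A minimal sketch of that step; the regex is a rough approximation of MediaWiki's real redirect parsing, and the function name is mine:

```python
# Detect "#REDIRECT [[Target]]" at the start of a page's wikitext and
# extract the target title (stopping at ], |, or # in the link).
import re

REDIRECT_RE = re.compile(r"^#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def redirect_target(wikitext):
    """Return the redirect target title, or None if not a redirect."""
    m = REDIRECT_RE.match(wikitext)
    return m.group(1).strip() if m else None
```

Following the chain then means: fetch the current text via page_latest, and while `redirect_target` returns a title, look that title up and fetch again (with a loop guard against circular redirects).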