Hi,
I'm looking to download the "wiki-latest-stub-meta-history.xml" dump for smaller
languages and perform some analytics on it. I don't really care about the English
Wikipedia because it's too large to handle. I want a CSV file made out of this
XML so that I can do statistical modelling on it.
The trouble is I've been unable to convert this XML to a CSV so far. If I can
get it into SQL, then phpMyAdmin can spit out a CSV, but mwdumper has failed.
I've gotten the following error (copied below).
Thanks in advance,
Abhishek
Exception in thread "main" java.lang.NullPointerException
    at org.mediawiki.importer.XmlDumpReader.readTitle(Unknown Source)
    at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
    at org.mediawiki.dumper.Dumper.main(Unknown Source)
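(In case it helps: the stub dumps can also be streamed straight to CSV without going
through SQL at all. Below is a minimal, untested sketch using PHP's XMLReader; the
filenames and choice of columns are placeholders, and the dump is assumed to be
decompressed first.)

<?php
// Sketch: stream a stub-meta-history dump straight to CSV with XMLReader,
// skipping the SQL/phpMyAdmin detour. Filenames and columns are placeholders.
$reader = new XMLReader();
$reader->open( 'stub-meta-history.xml' );

$out = fopen( 'revisions.csv', 'w' );
fputcsv( $out, array( 'page_title', 'rev_id', 'timestamp', 'contributor' ) );

$title = '';
while ( $reader->read() ) {
    if ( $reader->nodeType !== XMLReader::ELEMENT ) {
        continue;
    }
    if ( $reader->name === 'title' ) {
        // <title> appears once per <page>, before that page's revisions.
        $title = $reader->readString();
    } elseif ( $reader->name === 'revision' ) {
        // Expand just this <revision> into a small DOM for easy access.
        $doc = new DOMDocument();
        $doc->appendChild( $doc->importNode( $reader->expand(), true ) );
        $rev = simplexml_import_dom( $doc->documentElement );
        $user = isset( $rev->contributor->username )
            ? (string)$rev->contributor->username
            : (string)$rev->contributor->ip;
        fputcsv( $out, array( $title, (string)$rev->id,
            (string)$rev->timestamp, $user ) );
    }
}
fclose( $out );
$reader->close();

Because XMLReader is a streaming parser, memory use stays flat, so this should
scale to full-history stubs even for larger wikis.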
Hi,
I've heard that Wikipedia will be among the first content providers to
support the video and audio tags in HTML5. I'm trying to put together a
presentation about the subject for a FF3.5 release party and I would
like to find out more. Could you point me to some documents or answer
some of the questions below?
1) When will this support appear?
2) Has the code already been modified accordingly?
3) How long will legacy browsers be supported?
4) What prompted this desire to be an early adopter of this technology?
5) Will codecs other than Theora be supported?
I know some of these questions have already been answered, either here
or on the techblog (which BTW is currently down), but I thought
putting them all together would make more sense.
Thanks,
Andrei Cipu [aka Strainu]
Hi, folks.
Recently, the problem of user tracking via third-party companies has
been debated on mailing lists. I wonder whether including the jQuery
library linked directly from Google's servers (!) doesn't qualify as a
bad idea, too… (Even though no user tracking was obviously
intended, and due to caching, the privacy exposure is extremely limited.)
See http://zh.wikipedia.org/wiki/MediaWiki:Common.js (added on
2009-05-22 http://zh.wikipedia.org/w/index.php?diff=10119416&diffonly=1).
-- [[cs:User:Mormegil | Petr Kadlec]]
Just a quick note -- we're experiencing some fun load spikes due to
heavy net usage of people searching for or talking about Michael
Jackson's reported death or near-death.
You may see some intermittent database connection failures on
en.wikipedia.org for a little while as connections back up; we're poking
at things to see if we can reduce this.
Updates at tech blog:
http://techblog.wikimedia.org/2009/06/current-events/
-- brion
Quick note to all -- PHP 5.3.0 final release is scheduled for June 30.
Everybody, don't be shy about testing out your code with the release
candidates! :)
-- brion vibber (brion @ wikimedia.org)
Hi!
There exist different apostrophe signs. Let's consider two of them:
U+0027 and U+2019. They have the same meaning, and both of them are
acceptable apostrophes for the English language, for instance. The
problem is that MediaWiki's internal search distinguishes these two
apostrophes, so words containing U+2019 can't be found with a
query containing U+0027 and vice versa.
MediaWiki uses a search index for the internal search, and the index is
renewed every time an article is saved. I have found that if we
override the function stripForSearch() in the language class with a
new function which replaces U+2019 with U+0027 for the search index,
the internal search begins to work properly, no matter which
apostrophe was provided in the search query, U+0027 or U+2019.
Granted, the context is not highlighted if the apostrophes differ in
the query and in the result, but the search returns what is really
needed.
The question is: if we override the stripForSearch() function in the
language class in such a way, won't this cause any problems?
The code of the override function is the following:
function stripForSearch( $string ) {
    // Replace the UTF-8 byte sequence for U+2019 (right single
    // quotation mark) with the ASCII apostrophe U+0027, then let the
    // parent class do its usual normalization.
    $s = preg_replace( '/\xe2\x80\x99/', "'", $string );
    return parent::stripForSearch( $s );
}
We want to introduce such a change for Belarusian, but I think the
Ukrainian language may experience the same problem with the different
apostrophes, as U+0027 is not the valid apostrophe there either, yet
only U+0027 (the typewriter apostrophe) is available on the majority
of Belarusian and Ukrainian keyboard layouts.
Thanks,
zedlik
I need to integrate Piwigo into an existing MediaWiki master site. I did not find any existing work on this.
I have the following questions:
authentication: I want to be authenticated in Piwigo when logging in to MediaWiki. I have seen some existing auth plugins; what are your recommendations? Use an existing one and adapt it? Which one? (The Piwigo site will be set up on another server, with a remote MySQL database.) A rough sketch of what such an adaptation could look like follows these questions.
searches: I want searches done in MediaWiki forwarded to the Piwigo search, with the result links added to the MediaWiki search results. I am using Lucene search for MediaWiki. Your experience and tips are needed.
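On the authentication question, the adaptation might look roughly like the
following. This is only a hypothetical sketch: it assumes the classic
AuthPlugin interface ($wgAuth), and the Piwigo table name, columns, and MD5
password hashing are guesses to verify against the actual Piwigo schema.

<?php
// Hypothetical sketch: bridge MediaWiki logins to a Piwigo user database
// via the classic AuthPlugin interface. Piwigo table/column names and the
// MD5 hashing are assumptions; host and credentials are placeholders.
require_once( "$IP/includes/AuthPlugin.php" );

class PiwigoAuthPlugin extends AuthPlugin {
    private $db;

    function __construct() {
        // Remote Piwigo MySQL database on the other server.
        $this->db = new mysqli( 'piwigo-host', 'dbuser', 'dbpass', 'piwigo' );
    }

    function userExists( $username ) {
        $stmt = $this->db->prepare(
            'SELECT 1 FROM piwigo_users WHERE username = ?' );
        $stmt->bind_param( 's', $username );
        $stmt->execute();
        $stmt->store_result();
        return $stmt->num_rows > 0;
    }

    function authenticate( $username, $password ) {
        $stmt = $this->db->prepare(
            'SELECT 1 FROM piwigo_users WHERE username = ? AND password = MD5(?)' );
        $stmt->bind_param( 'ss', $username, $password );
        $stmt->execute();
        $stmt->store_result();
        return $stmt->num_rows > 0;
    }

    function strict() {
        return true; // only accounts present in Piwigo may log in
    }
}

$wgAuth = new PiwigoAuthPlugin(); // in LocalSettings.php

Whether to extend one of the existing external-auth plugins instead depends
mostly on how close its target schema is to Piwigo's.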
Best Regards
Stephane ANCELOT
Hello fellow developers,
In Håkon Wium Lie's recent analysis of Wikipedia image markup (
http://www.princexml.com/howcome/2009/wikipedia/image/), he makes a good
point: we include image captions both below images and again in the images'
tooltips. Also, for inline images without explicitly defined tooltips, the
image name is used as the tooltip even though it is also shown in the URL
when mousing over the image. Neither of these automatic tooltips are really
useful, and they slow down page load time on image-heavy pages.
What do you think? Should we keep the redundant tooltips, or start leaving
them out?
--
Remember the dot
http://en.wikipedia.org/wiki/User:Remember_the_dot
Operationally, I've noticed that template changes don't seem to be getting
through the job queue, sometimes for days.
About 11 hours ago, I changed a template with roughly 700 invocations, and it
has only had 17 or so entries refreshed (showing up in new categories).
A couple of weeks ago, I changed a pretty common template {{Cent}} with
several thousand invocations, and it hadn't finished updating after a
weekend! In that case, I ended up commenting out 4 years' worth of old uses in
discussion archives and null-editing the remaining dozen or so.... Hopefully,
that made it better for the future.
Does any edit clear the entry from the job queue? There'd be no need for
it to be refreshed anymore.
Do multiple edits to a template keep the old entries, replace them
with newer entries, or leave multiple sets of entries?
I'm assuming the queue is FIFO, so keeping the older entries would be ideal,
and multiple sets would be bad.
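(For third-party wikis, where you have shell and database access, you can at
least watch the queue drain by counting pending jobs per type; a rough sketch
with a hypothetical script name, placed in the maintenance/ directory:)

<?php
// Hypothetical maintenance/showJobCounts.php: count pending jobs per type
// using the core job table. Not runnable against Wikimedia's servers.
require_once( dirname( __FILE__ ) . '/commandLine.inc' );

$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
    'job',
    array( 'job_cmd', 'COUNT(*) AS n' ),
    array(),
    'showJobCounts',
    array( 'GROUP BY' => 'job_cmd' )
);
while ( $row = $dbr->fetchObject( $res ) ) {
    echo "{$row->job_cmd}: {$row->n}\n";
}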
I've made some fixes to the MySQL search backend for Chinese and other
languages using variants.
Some languages don’t use word spacing, like Chinese and Japanese. To let
the search index know where word boundaries are, we have to internally
insert spaces between some characters:
维基百科 -> 维 基 百 科
Then to add insult to injury, we need to fudge the Unicode characters to
ensure things work reliably with older and newer versions of MySQL:
维 基 百 科 -> u8e7bbb4 u8e59fba u8e799be u8e7a791
For a long time, this word segmentation wasn’t being handled correctly
for Chinese in our default MySQL search backend, so searching for a
multi-character word often gave false matches where the characters were
all present, but not together.
This should now be fixed in r52338: the intermediate query
representation passed to the search backend internally treats your
multi-character Chinese input as a phrase, which will only match actual
adjacent characters:
维基百科 -> +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
Variants for e.g. Serbian also now use parentheses internally, so they
should match more usefully.
Note that Wikimedia’s sites such as Wikipedia run on a fancier, but more
demanding, search backend with a separate Java-based engine built around
Apache Lucene. Sometimes we have to remind ourselves that third-party
users will mostly be using the MySQL-based default, and oh boy it still
needs some lovin’! :)
-- brion