Hi,
I'm looking to download the "wiki-latest-stub-meta-history.xml" dump for smaller
languages and perform some analytics on it. I don't really care about the English
Wikipedia because it's too large to handle. I want a CSV file made out of this
XML so that I can do statistical modelling on it.
The trouble is I've been unable to convert this XML to a CSV so far. If I can
get it into SQL, then phpMyAdmin can spit out a CSV, but mwdumper has failed.
I've gotten the following error (copied below).
Thanks in advance,
Abhishek
Exception in thread "main" java.lang.NullPointerException
    at org.mediawiki.importer.XmlDumpReader.readTitle(Unknown Source)
    at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
    at org.mediawiki.dumper.Dumper.main(Unknown Source)
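(In case it helps: the stub dumps can also be streamed straight to CSV without going
through SQL at all. Below is a minimal, untested sketch using PHP's XMLReader; the
filenames and choice of columns are placeholders, and the dump is assumed to be
decompressed first.)

<?php
// Sketch: stream a stub-meta-history dump straight to CSV with XMLReader,
// skipping the SQL/phpMyAdmin detour. Filenames and columns are placeholders.
$reader = new XMLReader();
$reader->open( 'stub-meta-history.xml' );

$out = fopen( 'revisions.csv', 'w' );
fputcsv( $out, array( 'page_title', 'rev_id', 'timestamp', 'contributor' ) );

$title = '';
while ( $reader->read() ) {
    if ( $reader->nodeType !== XMLReader::ELEMENT ) {
        continue;
    }
    if ( $reader->name === 'title' ) {
        // <title> appears once per <page>, before that page's revisions.
        $title = $reader->readString();
    } elseif ( $reader->name === 'revision' ) {
        // Expand just this <revision> into a small DOM for easy access.
        $doc = new DOMDocument();
        $doc->appendChild( $doc->importNode( $reader->expand(), true ) );
        $rev = simplexml_import_dom( $doc->documentElement );
        $user = isset( $rev->contributor->username )
            ? (string)$rev->contributor->username
            : (string)$rev->contributor->ip;
        fputcsv( $out, array( $title, (string)$rev->id,
            (string)$rev->timestamp, $user ) );
    }
}
fclose( $out );
$reader->close();

Because XMLReader is a streaming parser, memory use stays flat, so this should
scale to full-history stubs even for larger wikis.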
Hi,
I've heard that Wikipedia will be among the first content providers to
support the video and audio tags in HTML5. I'm trying to put together a
presentation about the subject for a FF3.5 release party and I would
like to find out more. Could you point me to some documents or answer
some of the questions below?
1) When will this support appear?
2) Has the code already been modified accordingly?
3) How long will legacy browsers be supported?
4) What prompted this desire to be an early adopter of this technology?
5) Will codecs other than Theora be supported?
I know some of these questions have already been answered, either here
or on the techblog (which BTW is currently down), but I thought
putting them all together would make more sense.
Thanks,
Andrei Cipu [aka Strainu]
Hi, folks.
Recently, the problem of user tracking via third-party companies has
been debated on mailing lists. I wonder whether including the jQuery
library linked directly from Google's servers (!) doesn't qualify as a
bad idea, too… (Even though no user tracking was obviously
intended, and due to caching, the privacy exposure is extremely limited.)
See http://zh.wikipedia.org/wiki/MediaWiki:Common.js (added on
2009-05-22 http://zh.wikipedia.org/w/index.php?diff=10119416&diffonly=1).
-- [[cs:User:Mormegil | Petr Kadlec]]
Just a quick note -- we're experiencing some fun load spikes due to
heavy net usage of people searching for or talking about Michael
Jackson's reported death or near-death.
You may see some intermittent database connection failures on
en.wikipedia.org for a little while as connections back up; we're poking
at things to see if we can reduce this.
Updates at tech blog:
http://techblog.wikimedia.org/2009/06/current-events/
-- brion
Quick note to all -- PHP 5.3.0 final release is scheduled for June 30.
Everybody, don't be shy about testing out your code with the release
candidates! :)
-- brion vibber (brion @ wikimedia.org)
Hi!
There exist different apostrophe signs. Let's consider two of them:
U+0027 and U+2019. They have the same meaning, and both of them are
acceptable apostrophes for the English language, for instance. The
problem is that MediaWiki's internal search distinguishes these two
apostrophes, so words containing U+2019 can't be found with a
query containing U+0027 and vice versa.
MediaWiki uses a search index for the internal search, and the index is
renewed every time an article is saved. I have found that if we
override the function stripForSearch() in the language class with a
new function which replaces U+2019 with U+0027 for the search index,
the internal search begins to work properly, no matter which
apostrophe was provided in the search query, U+0027 or U+2019.
Granted, the context is not highlighted if the apostrophes differ in
the query and in the result, but the search returns what is really
needed.
The question is: if we override the stripForSearch() function in the
language class in such a way, won't this cause any problems?
The code of the override function is the following:
function stripForSearch( $string ) {
    // Replace the UTF-8 byte sequence for U+2019 (right single
    // quotation mark) with the ASCII apostrophe U+0027, then let the
    // parent class do its usual normalization.
    $s = preg_replace( '/\xe2\x80\x99/', "'", $string );
    return parent::stripForSearch( $s );
}
We want to introduce such a change for Belarusian, but I think the
Ukrainian language may experience the same problem with the different
apostrophes, as U+0027 is not the valid apostrophe there either, yet
only U+0027 (the typewriter apostrophe) is available on the majority
of Belarusian and Ukrainian keyboard layouts.
Thanks,
zedlik
I need to integrate Piwigo into an existing MediaWiki master site. I did not find any existing work on this.
I have the following questions:
authentication: I want to be authenticated in Piwigo when logging in to MediaWiki. I have seen some existing auth plugins; what are your recommendations? Use an existing one and adapt it? Which one? (The Piwigo site will be set up on another server, with a remote MySQL database.) A rough sketch of what such an adaptation could look like follows these questions.
searches: I want searches done in MediaWiki forwarded to the Piwigo search, with the result links added to the MediaWiki search results. I am using Lucene search for MediaWiki. Your experience and tips are needed.
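On the authentication question, the adaptation might look roughly like the
following. This is only a hypothetical sketch: it assumes the classic
AuthPlugin interface ($wgAuth), and the Piwigo table name, columns, and MD5
password hashing are guesses to verify against the actual Piwigo schema.

<?php
// Hypothetical sketch: bridge MediaWiki logins to a Piwigo user database
// via the classic AuthPlugin interface. Piwigo table/column names and the
// MD5 hashing are assumptions; host and credentials are placeholders.
require_once( "$IP/includes/AuthPlugin.php" );

class PiwigoAuthPlugin extends AuthPlugin {
    private $db;

    function __construct() {
        // Remote Piwigo MySQL database on the other server.
        $this->db = new mysqli( 'piwigo-host', 'dbuser', 'dbpass', 'piwigo' );
    }

    function userExists( $username ) {
        $stmt = $this->db->prepare(
            'SELECT 1 FROM piwigo_users WHERE username = ?' );
        $stmt->bind_param( 's', $username );
        $stmt->execute();
        $stmt->store_result();
        return $stmt->num_rows > 0;
    }

    function authenticate( $username, $password ) {
        $stmt = $this->db->prepare(
            'SELECT 1 FROM piwigo_users WHERE username = ? AND password = MD5(?)' );
        $stmt->bind_param( 'ss', $username, $password );
        $stmt->execute();
        $stmt->store_result();
        return $stmt->num_rows > 0;
    }

    function strict() {
        return true; // only accounts present in Piwigo may log in
    }
}

$wgAuth = new PiwigoAuthPlugin(); // in LocalSettings.php

Whether to extend one of the existing external-auth plugins instead depends
mostly on how close its target schema is to Piwigo's.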
Best Regards
Stephane ANCELOT
Hello fellow developers,
In Håkon Wium Lie's recent analysis of Wikipedia image markup (
http://www.princexml.com/howcome/2009/wikipedia/image/), he makes a good
point: we include image captions both below images and again in the images'
tooltips. Also, for inline images without explicitly defined tooltips, the
image name is used as the tooltip even though it is also shown in the URL
when mousing over the image. Neither of these automatic tooltips are really
useful, and they slow down page load time on image-heavy pages.
What do you think? Should we keep the redundant tooltips, or start leaving
them out?
--
Remember the dot
http://en.wikipedia.org/wiki/User:Remember_the_dot
Operationally, I've noticed that template changes don't seem to be getting
through the job queue, sometimes for days.
About 11 hours ago, I changed a template with roughly 700 invocations, and it
has only had 17 or so entries refreshed (showing up in new categories).
A couple of weeks ago, I changed a pretty common template {{Cent}} with
several thousand invocations, and it hadn't finished updating after a
weekend! In that case, I ended up commenting out 4 years' worth of old uses in
discussion archives and null-editing the remaining dozen or so.... Hopefully,
that made it better for the future.
Does any edit clear the entry from the job queue? There'd be no need for
it to be refreshed anymore.
Do multiple edits to a template keep the old entries, replace them
with newer entries, or leave multiple sets of entries?
I'm assuming the queue is FIFO, so keeping the older entries would be ideal,
and multiple sets would be bad.
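(For third-party wikis, where you have shell and database access, you can at
least watch the queue drain by counting pending jobs per type; a rough sketch
with a hypothetical script name, placed in the maintenance/ directory:)

<?php
// Hypothetical maintenance/showJobCounts.php: count pending jobs per type
// using the core job table. Not runnable against Wikimedia's servers.
require_once( dirname( __FILE__ ) . '/commandLine.inc' );

$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
    'job',
    array( 'job_cmd', 'COUNT(*) AS n' ),
    array(),
    'showJobCounts',
    array( 'GROUP BY' => 'job_cmd' )
);
while ( $row = $dbr->fetchObject( $res ) ) {
    echo "{$row->job_cmd}: {$row->n}\n";
}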
I've made some fixes to the MySQL search backend for Chinese and other
languages using variants.
Some languages don’t use word spacing, like Chinese and Japanese. To let
the search index know where word boundaries are, we have to internally
insert spaces between some characters:
维基百科 -> 维 基 百 科
Then to add insult to injury, we need to fudge the Unicode characters to
ensure things work reliably with older and newer versions of MySQL:
维 基 百 科 -> u8e7bbb4 u8e59fba u8e799be u8e7a791
For a long time, this word segmentation wasn’t being handled correctly
for Chinese in our default MySQL search backend, so searching for a
multi-character word often gave false matches where the characters were
all present, but not together.
This should now be fixed in r52338: the intermediate query
representation passed to the search backend internally treats your
multi-character Chinese input as a phrase, which will only match actual
adjacent characters:
维基百科 -> +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
Variants for e.g. Serbian also now use parentheses internally, so they
should match more usefully.
Note that Wikimedia’s sites such as Wikipedia run on a fancier, but more
demanding, search backend with a separate Java-based engine built around
Apache Lucene. Sometimes we have to remind ourselves that third-party
users will mostly be using the MySQL-based default, and oh boy it still
needs some lovin’! :)
-- brion