I'd like to get involved in the parsoid effort.
I've been hanging out in the IRC channel on freenode but there's usually
just a dozen lurkers and no action.
On the mailing lists, Parsoid seems to be mentioned about as often in
wikitext-l as in wikitech-l - which one is the better place to ask questions
of this nature?
I'd like to scratch my own itch rather than necessarily go after things on
the todo list and roadmap.
Basically I'm interested in what parsoid can do for parsing wikitext markup
into HTML (or other formats).
I want to use it without a mediawiki install and without an internet
connection. I see there is already some kind of support for reading in
articles from compressed dump files.
Any suggestions on where I should start, or where I can hang out to chat live
with people who could help get me involved?
Andrew Dunbar (hippietrail)
For context: I've been working on replacing the html5 and jsdom modules
(which depend on the native 'contextify' module) with the pure-javascript
'domino' implementation of DOM4. This seems to be faster and cleaner, and it
fixes some bugs caused by jsdom's eccentric DOM handling. Domino is (in my brief
experience) more reliable and standards-compliant.
Here's a list of issues I came across in the process:
* There were 3 new failures in the wt2html tests. (There were also some new
passes, so the number of passing tests increases on net.) They are:
1) "expansion of multi-line templates in attribute values (bug 6255 sanity
check 2)"
For reference, this test looks like:
> !! test
> Expansion of multi-line templates in attribute values (bug 6255 sanity
> check)
> !! input
> <div style="background:
> #00FF00">-</div>
> !! result
> <div style="background: #00FF00">-</div>
> !! end
> !! test
> Expansion of multi-line templates in attribute values (bug 6255 sanity
> check 2)
> !! input
> <div style="background: #00FF00">-</div>
> !! result
> <div style="background: #00FF00">-</div>
> !! end
I'm not sure how this test ever passed in jsdom -- the inputs here are
actually identical to an HTML parser, since hex-escape decoding happens
very early. But apparently the wikitext parser should defer processing of the
escaped newline somehow? On the domino branch our HTML serialization now uses
the upstream standard HTML5 serialization algorithm, which doesn't escape
newlines. (
http://www.whatwg.org/specs/web-apps/current-work/multipage/the-end.html#se…)
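For a concrete illustration, here's roughly what that looks like with domino
directly (a quick untested sketch, not Parsoid code; it just exercises
domino's createDocument/innerHTML):

var domino = require('domino');

// An attribute value containing a literal newline round-trips unescaped:
// for attribute values, the HTML5 "escaping a string" step only escapes
// &, U+00A0 and the double quote -- newlines pass through literally.
var doc = domino.createDocument('<div style="background:\n#00FF00">-</div>');
console.log(doc.body.innerHTML);
// -> <div style="background:
//    #00FF00">-</div>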
Note that the first test also involves whitespace normalization, which the
PHP parser does (see
https://www.mediawiki.org/wiki/Special:Code/MediaWiki/14689) but parsoid
does not. (I've got a patch to do whitespace normalization in parsoid
if there's interest, but it causes other tests to break.)
What's the plan to handle cases like this? Is it really important to
generate the escaped newline in the output?
2) "Play a bit with r67090 and bug 3158"
This is a parsoid-only test which looks like:
> !! test
> Play a bit with r67090 and bug 3158
> !! options
> disabled
> !! input
> <div style="width:50% !important"> </div>
> <div style="width:50% !important"> </div>
> <div style="width:50% !important"> </div>
> <div style="border : solid;"> </div>
> !! result
> <div style="width:50% !important"> </div>
> <div style="width:50% !important"> </div>
> <div style="width:50% !important"> </div>
> <div style="border : solid;"> </div>
> !! end
In standard HTML serialization, a non-breaking space (U+00A0) is encoded
uniformly as &nbsp;, so even if you wanted to be bug-compatible with the
'border :' style, you should be emitting an &nbsp; there, not an &#160;.
The other two cases are whitespace normalization within attributes (again).
I'm guessing jsdom (incorrectly) did this by default whether you wanted it
or not; you need to explicitly add attribute normalization in the domino
case if that's desired. (But there's some other reason why the 'border :'
case is failing now which needs to be chased down, unrelated to the
&nbsp;-vs-&#160; issue.)
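A similar quick check for the non-breaking-space side of this (again an
untested sketch, assuming domino follows the spec's escaping rules here):

var domino = require('domino');

// U+00A0 is serialized uniformly as &nbsp; by the spec's escaping rules,
// whether the source wrote &nbsp;, &#160;, or a raw character -- so there
// is no way to round-trip &#160; specifically.
var doc = domino.createDocument('<div style="width:50%\u00A0!important">\u00A0</div>');
console.log(doc.body.innerHTML);
// -> <div style="width:50%&nbsp;!important">&nbsp;</div>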
3) "Parsoid-only: Table with broken attribute value quoting on consecutive
lines"
> !! test
> Parsoid-only: Table with broken attribute value quoting on consecutive
> lines
> !! options
> disabled
> !! input
> {|
> | title="Hello world|Foo
> | style="color:red|Bar
> |}
> !! result
> <table>
> <tr>
> <td title="Hello world">Foo
> </td><td style="color: red;">Bar
> </td></tr></table>
> !! end
jsdom used to insert the extraneous semicolon at the end of the 'style'
attribute. domino does not. I believe this test case is broken and the
extraneous semicolon should be removed.
* Other observed bugs & failures:
http://parsoid.wmflabs.org/en/Pi gives:
> TypeError: Cannot assign to read only property 'ksrc' of #<KV>
> at AttributeExpander._returnAttributes
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:71:20)
> at AttributeTransformManager.process
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:1017:8)
> at AttributeExpander.onToken
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:46:7)
> at AsyncTokenTransformManager.transformTokens
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:568:17)
> at AsyncTokenTransformManager.onChunk
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:356:17)
> at SyncTokenTransformManager.EventEmitter.emit (events.js:96:17)
> at SyncTokenTransformManager.onChunk
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:904:7)
> at PegTokenizer.EventEmitter.emit (events.js:96:17)
> at PegTokenizer.process
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.tokenizer.peg.js:88:11)
> at ParserPipeline.process
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.parser.js:360:21)
http://localhost:8000/simple/Game gives:
> starting parsing of Game
> *********** ERROR: cs/s mismatch for node: A s: 3808; cs: 3821 ************
> completed parsing of Game in 1491 ms
* [[File:]] tag parsing for images appears to be incomplete:
a) alt= and class= are not parsed
b) 'thumb' and 'right' should result in <img class="thumb tright" /> or
some such, but there doesn't appear to be an indication of either option in
the parsoid output.
* I'd like to see title and revision information in the <head>
* Interwiki links are not converted to relative links when the "interwiki"
is actually the current wiki. (Maybe this isn't really a bug.)
Let's discuss these a bit and I'll file bugzilla tickets for the bits we
can agree are actually bugs. ;)
--scott
--
( http://cscott.net/ )
[Resending without the full list of articles, which caused the message to
be bounced into moderation.]
Here are the results of a quick test I ran over the weekend, comparing a
compressed excerpt from simple.wikipedia.org in mediawiki markup to the
compressed parsoid representation of the same articles. The list of
articles is attached to this message. [Not any more.]
For the base case I used the processing pipeline for the OLPC's "Wikipedia
activity" (source code at github.com/cscott/wikiserver).
It begins with a hand-written "portal page", then grabs all articles within
two links of the portal page. The original markup was taken from
the simplewiki-20130112-pages-articles.xml dump. Templates were then fully
expanded, and just the selected articles were written. Articles are
separated by the character 0x01, a newline, the title of the article, a
newline, the length of the article in bytes, a newline, and the character
0x02 and a newline.
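A reader for that record framing looks roughly like this (untested sketch;
the function name is just illustrative, it assumes every article, including
the first, is preceded by such a header, and it uses node's Buffer#indexOf
for the line scanning):

var fs = require('fs');

// Split a dump using the 0x01 / title / length / 0x02 framing described above.
function readArticles(filename) {
  var buf = fs.readFileSync(filename);
  var articles = [], pos = 0;
  while (pos < buf.length) {
    if (buf[pos] !== 0x01) { throw new Error('bad record at offset ' + pos); }
    pos += 2;                              // skip 0x01 and its newline
    var end = buf.indexOf(0x0A, pos);      // title line
    var title = buf.toString('utf8', pos, end);
    pos = end + 1;
    end = buf.indexOf(0x0A, pos);          // length line (in bytes)
    var length = parseInt(buf.toString('utf8', pos, end), 10);
    pos = end + 1;                         // past the newline after the length
    pos += 2;                              // skip 0x02 and its newline
    articles.push({ title: title, text: buf.toString('utf8', pos, pos + length) });
    pos += length;
  }
  return articles;
}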
For comparison, I took the list of articles included in the dump and wrote
a small script to fetch them from parsoid, using the HEAD of the master
branch from this weekend (2013-02-17, roughly). I wrote the full parsoid
HTML document (including top-level <html> tag, <head>, <base href>, and
<body> but not including a <!DOCTYPE>) to a file, separating articles with
the title of the article, a newline, the length of the article in bytes,
and a newline.
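The fetch script amounted to something like the following (untested sketch;
the helper name is illustrative, and it assumes a local Parsoid web service
answering on localhost:8000 with the 'simple' prefix, like the /simple/Game
URL in the earlier message):

var http = require('http');
var fs = require('fs');

// Fetch one article's Parsoid HTML and append it to the output file as
// title, newline, byte length, newline, then the HTML document itself.
function fetchAndAppend(title, outFile, done) {
  var url = 'http://localhost:8000/simple/' + encodeURIComponent(title);
  http.get(url, function (res) {
    var chunks = [];
    res.on('data', function (c) { chunks.push(c); });
    res.on('end', function () {
      var html = Buffer.concat(chunks);
      fs.appendFileSync(outFile, title + '\n' + html.length + '\n');
      fs.appendFileSync(outFile, html);
      done();
    });
  });
}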
Results, with and without compression:
# of articles: 3640
Mediawiki markup, uncompressed: 18M
Parsoid markup, uncompressed: 199M
Mediawiki markup, gzip -9 compressed: 6.4M
Parsoid markup, gzip -9 compressed: 26M
Mediawiki markup, bzip2 -9 compressed: 4.7M
Parsoid markup, bzip2 -9 compressed: 17M
Mediawiki markup, lzma -9 compressed: 4.4M
Parsoid markup, lzma -9 compressed: 15M
So there's currently a 10x expansion in the uncompressed size, but only
3-4x expansion with compression.
--scott
--
( http://cscott.net/ )
Forwarding from wikitech-l; discussion will be there.
-------- Original Message --------
Subject: [Wikitech-l] Extending Scribunto with new Lua functions
Date: Mon, 4 Feb 2013 14:46:44 +0100
From: Jens Ohlig <jens.ohlig(a)wikimedia.de>
Reply-To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
To: wikitech-l(a)lists.wikimedia.org
Hello,
I guess this can be answered by Tim or Victor, but I'm grateful for any
pointers that can help me with a rather specific problem with Scribunto.
I'm currently working on the Wikidata project to include Lua functions
for templates that access Wikidata entities.
I've toyed around a bit and extended LuaCommon.php with a getEntities
function and a wikibase table to hold that function. Now I wonder if
there are any plans for Lua extensions outside the mw.* namespace.
I've added a wikibase.lua file and a wikibase.* namespace in Lua.
However, the way PHP and Lua play together and how Scribunto can be
extended look a bit like black magic (which is to be expected, given
that Scribunto is far from finished).
Here are my questions:
1. Is there an easy way to add your own Lua functions (that call PHP API
functions) to Scribunto, other than writing them into LuaCommon.php?
2. Is using your own namespace the way to go?
3. Are there any examples of how to wrap PHP functions into the
Lua environment (using the frame, etc.)?
4. Is there any way to introspect or debug such wrapped functions?
Thanks for any suggestions!
Cheers,
Jens
--
Jens Ohlig
Software developer Wikidata project
Wikimedia Deutschland e.V.
Obentrautstr. 72
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as charitable by
the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.