I'd like to get involved in the parsoid effort.
I've been hanging out in the IRC channel on freenode but there's usually
just a dozen lurkers and no action.
On the mailing lists, Parsoid seems to be mentioned about as often in
wikitext-l as in wikitech-l - which one is the better place to ask questions
of this nature?
I'd like to scratch my own itch rather than necessarily go after things on
the todo list and roadmap.
Basically I'm interested in what parsoid can do for parsing wikitext markup
into HTML (or other formats).
I want to use it without a mediawiki install and without an internet
connection. I see there is already some kind of support for reading in
articles from compressed dump files.
Any suggestions on where I should start, or where I can hang out to chat live
with people who could help get me involved?
Andrew Dunbar (hippietrail)
For context: I've been working on replacing the html5 and jsdom modules
(which depend on the native 'contextify' module) with the pure-javascript
'domino' implementation of DOM4. This seems to be faster and cleaner, and it
fixes some bugs caused by jsdom's eccentric DOM handling. Domino is (in my brief
experience) more reliable and standards-compliant.
Here's a list of issues I came across in the process:
* There were 3 new failures in the wt2html tests. (There were also some new
passes, so the number of passing tests increases on net.) They are:
1) "expansion of multi-line templates in attribute values (bug 6255 sanity
check 2)"
For reference, this test looks like:
> !! test
> Expansion of multi-line templates in attribute values (bug 6255 sanity
> check)
> !! input
> <div style="background:
> #00FF00">-</div>
> !! result
> <div style="background: #00FF00">-</div>
> !! end
> !! test
> Expansion of multi-line templates in attribute values (bug 6255 sanity
> check 2)
> !! input
> <div style="background: #00FF00">-</div>
> !! result
> <div style="background: #00FF00">-</div>
> !! end
I'm not sure how this test ever passed in jsdom -- the inputs here are
actually identical to an HTML parser, since hex-escape decoding happens
very early. But apparently the wikitext parser should defer processing of the
escaped newline somehow? On the domino branch our HTML serialization now uses
the upstream standard HTML5 serialization algorithm, which doesn't escape
newlines. (
http://www.whatwg.org/specs/web-apps/current-work/multipage/the-end.html#se…)
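For a concrete illustration, here's roughly what that looks like with domino
directly (a quick untested sketch, not Parsoid code; it just exercises
domino's createDocument/innerHTML):

var domino = require('domino');

// An attribute value containing a literal newline round-trips unescaped:
// for attribute values, the HTML5 "escaping a string" step only escapes
// &, U+00A0 and the double quote -- newlines pass through literally.
var doc = domino.createDocument('<div style="background:\n#00FF00">-</div>');
console.log(doc.body.innerHTML);
// -> <div style="background:
//    #00FF00">-</div>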
Note that the first test also involves whitespace normalization, which the
PHP parser does (see
https://www.mediawiki.org/wiki/Special:Code/MediaWiki/14689) but parsoid
does not. (I've got a patch to do whitespace normalization in parsoid
if there's interest, but it causes other tests to break.)
What's the plan to handle cases like this? Is it really important to
generate the escaped newline in the output?
2) "Play a bit with r67090 and bug 3158"
This is a parsoid-only test which looks like:
> !! test
> Play a bit with r67090 and bug 3158
> !! options
> disabled
> !! input
> <div style="width:50% !important"> </div>
> <div style="width:50% !important"> </div>
> <div style="width:50% !important"> </div>
> <div style="border : solid;"> </div>
> !! result
> <div style="width:50% !important"> </div>
> <div style="width:50% !important"> </div>
> <div style="width:50% !important"> </div>
> <div style="border : solid;"> </div>
> !! end
In standard HTML serialization, a non-breaking space (U+00A0) is encoded
uniformly as &nbsp;, so even if you wanted to be bug-compatible with the
'border :' style, you should be emitting an &nbsp; there, not an &#160;.
The other two cases are whitespace normalization within attributes (again).
I'm guessing jsdom (incorrectly) did this by default whether you wanted it
or not; you need to explicitly add attribute normalization in the domino
case if that's desired. (But there's some other reason why the 'border :'
case is failing now which needs to be chased down, unrelated to the
&nbsp;-vs-&#160; issue.)
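A similar quick check for the non-breaking-space side of this (again an
untested sketch, assuming domino follows the spec's escaping rules here):

var domino = require('domino');

// U+00A0 is serialized uniformly as &nbsp; by the spec's escaping rules,
// whether the source wrote &nbsp;, &#160;, or a raw character -- so there
// is no way to round-trip &#160; specifically.
var doc = domino.createDocument('<div style="width:50%\u00A0!important">\u00A0</div>');
console.log(doc.body.innerHTML);
// -> <div style="width:50%&nbsp;!important">&nbsp;</div>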
3) "Parsoid-only: Table with broken attribute value quoting on consecutive
lines"
> !! test
> Parsoid-only: Table with broken attribute value quoting on consecutive
> lines
> !! options
> disabled
> !! input
> {|
> | title="Hello world|Foo
> | style="color:red|Bar
> |}
> !! result
> <table>
> <tr>
> <td title="Hello world">Foo
> </td><td style="color: red;">Bar
> </td></tr></table>
> !! end
jsdom used to insert the extraneous semicolon at the end of the 'style'
attribute. domino does not. I believe this test case is broken and the
extraneous semicolon should be removed.
* Other observed bugs & failures:
http://parsoid.wmflabs.org/en/Pi gives:
> TypeError: Cannot assign to read only property 'ksrc' of #<KV>
> at AttributeExpander._returnAttributes
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:71:20)
> at AttributeTransformManager.process
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:1017:8)
> at AttributeExpander.onToken
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:46:7)
> at AsyncTokenTransformManager.transformTokens
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:568:17)
> at AsyncTokenTransformManager.onChunk
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:356:17)
> at SyncTokenTransformManager.EventEmitter.emit (events.js:96:17)
> at SyncTokenTransformManager.onChunk
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:904:7)
> at PegTokenizer.EventEmitter.emit (events.js:96:17)
> at PegTokenizer.process
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.tokenizer.peg.js:88:11)
> at ParserPipeline.process
> (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.parser.js:360:21)
http://localhost:8000/simple/Game gives:
> starting parsing of Game
> *********** ERROR: cs/s mismatch for node: A s: 3808; cs: 3821 ************
> completed parsing of Game in 1491 ms
* [[File:]] tag parsing for images appears to be incomplete:
a) alt= and class= are not parsed
b) 'thumb' and 'right' should result in <img class="thumb tright" /> or
some such, but there doesn't appear to be an indication of either option in
the parsoid output.
* I'd like to see title and revision information in the <head>
* Interwiki links are not converted to relative links when the "interwiki"
is actually the current wiki. (Maybe this isn't really a bug.)
Let's discuss these a bit and I'll file bugzilla tickets for the bits we
can agree are actually bugs. ;)
--scott
--
( http://cscott.net/ )
[Resending without the full list of articles, which caused the message to
be bounced into moderation.]
Here are the results of a quick test I ran over the weekend, comparing a
compressed excerpt from simple.wikipedia.org in mediawiki markup to the
compressed parsoid representation of the same articles. The list of
articles is attached to this message. [Not any more.]
For the base case I used the processing pipeline for the OLPC's "Wikipedia
activity" (source code at github.com/cscott/wikiserver).
It begins with a hand-written "portal page", then grabs all articles within
two links of the portal page. The original markup was taken from
the simplewiki-20130112-pages-articles.xml dump. Templates were then fully
expanded, and just the selected articles were written. Articles are
separated by the character 0x01, a newline, the title of the article, a
newline, the length of the article in bytes, a newline, and the character
0x02 and a newline.
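A reader for that record framing looks roughly like this (untested sketch;
the function name is just illustrative, it assumes every article, including
the first, is preceded by such a header, and it uses node's Buffer#indexOf
for the line scanning):

var fs = require('fs');

// Split a dump using the 0x01 / title / length / 0x02 framing described above.
function readArticles(filename) {
  var buf = fs.readFileSync(filename);
  var articles = [], pos = 0;
  while (pos < buf.length) {
    if (buf[pos] !== 0x01) { throw new Error('bad record at offset ' + pos); }
    pos += 2;                              // skip 0x01 and its newline
    var end = buf.indexOf(0x0A, pos);      // title line
    var title = buf.toString('utf8', pos, end);
    pos = end + 1;
    end = buf.indexOf(0x0A, pos);          // length line (in bytes)
    var length = parseInt(buf.toString('utf8', pos, end), 10);
    pos = end + 1;                         // past the newline after the length
    pos += 2;                              // skip 0x02 and its newline
    articles.push({ title: title, text: buf.toString('utf8', pos, pos + length) });
    pos += length;
  }
  return articles;
}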
For comparison, I took the list of articles included in the dump and wrote
a small script to fetch them from parsoid, using the HEAD of the master
branch from this weekend (2013-02-17, roughly). I wrote the full parsoid
HTML document (including top-level <html> tag, <head>, <base href>, and
<body> but not including a <!DOCTYPE>) to a file, separating articles with
the title of the article, a newline, the length of the article in bytes,
and a newline.
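The fetch script amounted to something like the following (untested sketch;
the helper name is illustrative, and it assumes a local Parsoid web service
answering on localhost:8000 with the 'simple' prefix, like the /simple/Game
URL in the earlier message):

var http = require('http');
var fs = require('fs');

// Fetch one article's Parsoid HTML and append it to the output file as
// title, newline, byte length, newline, then the HTML document itself.
function fetchAndAppend(title, outFile, done) {
  var url = 'http://localhost:8000/simple/' + encodeURIComponent(title);
  http.get(url, function (res) {
    var chunks = [];
    res.on('data', function (c) { chunks.push(c); });
    res.on('end', function () {
      var html = Buffer.concat(chunks);
      fs.appendFileSync(outFile, title + '\n' + html.length + '\n');
      fs.appendFileSync(outFile, html);
      done();
    });
  });
}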
Results, with and without compression:
# of articles: 3640
Mediawiki markup, uncompressed: 18M
Parsoid markup, uncompressed: 199M
Mediawiki markup, gzip -9 compressed: 6.4M
Parsoid markup, gzip -9 compressed: 26M
Mediawiki markup, bzip2 -9 compressed: 4.7M
Parsoid markup, bzip2 -9 compressed: 17M
Mediawiki markup, lzma -9 compressed: 4.4M
Parsoid markup, lzma -9 compressed: 15M
So there's currently a 10x expansion in the uncompressed size, but only
3-4x expansion with compression.
--scott
--
( http://cscott.net/ )
Forwarding from wikitech-l; discussion will be there.
-------- Original Message --------
Subject: [Wikitech-l] Extending Scribunto with new Lua functions
Date: Mon, 4 Feb 2013 14:46:44 +0100
From: Jens Ohlig <jens.ohlig(a)wikimedia.de>
Reply-To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
To: wikitech-l(a)lists.wikimedia.org
Hello,
I guess this can be answered by Tim or Victor, but I'm grateful for any
pointers that can help me with a rather specific problem with Scribunto.
I'm currently working on the Wikidata project to include Lua functions
for templates that access Wikidata entities.
I've toyed around a bit and extended LuaCommon.php with a getEntities
function and a wikibase table to hold that function. Now I wonder if
there are any plans for Lua extensions outside the mw.* namespace.
I've added a wikibase.lua file and a wikibase.* namespace in Lua.
However, the way PHP and Lua play together and how Scribunto can be
extended look a bit like black magic (which is to be expected, given
that Scribunto is far from finished).
Here are my questions:
1. Is there an easy way to add your own Lua functions (that call PHP API
functions) to Scribunto, other than writing them into LuaCommon.php?
2. Is using your own namespace the way to go?
3. Are there any examples of how to wrap PHP functions into the
Lua environment (using the frame, etc.)?
4. Is there any way to introspect or debug such wrapped functions?
Thanks for any suggestions!
Cheers,
Jens
--
Jens Ohlig
Software developer Wikidata project
Wikimedia Deutschland e.V.
Obentrautstr. 72
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as charitable by
the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.