On 01/08/2015 01:20, Subramanya Sastry wrote:
On 07/31/2015 12:55 PM, Ricordisamoa wrote:
Hi Subbu,
thank you for this thoughtful insight.
And thank you for starting this thread. :-)
HTML is not a barrier by itself. The problem seems to be Parsoid
being built primarily with VisualEditor in mind.
While we want the DOM to be VE-friendly, we definitely don't want the
DOM to be VE-centric and that has been the intention from the very
beginning. Flow, CX also use the Parsoid DOM for their functionality.
There are other users too [1].
VE, Flow, CX all take advantage of HTML. And I can't make any sense out
of editProtectedHelper.js
<https://en.wikipedia.org/wiki/User:Jackmcbarn/editProtectedHelper.js> :'(
We definitely want Parsoid's output to be useful and usable more
broadly as the canonical output representation of wikitext and are
open to fixing whatever prevents that.
As Scott noted in the other email on the thread, inspired (and maybe
challenged :-) ) by mwparserfromhell's utilities, he has already
whipped up a layer that provides an easier interface for manipulating
the DOM.
It is not clear to me how a single DOM serving both view and edit
modes can avoid redundancy.
You are right that there are some redundancies in information
representation (because of having to serve multiple needs), but as far
as I know, it is mostly around image attributes. If there is anything
else specific (beyond image attributes) that is bothering you, can you
flag that?
https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Transclusion_cont…
All template parameters are in data-mw but not parsed. Parameters ending
up in the 'final' wikitext are parsed separately.
I see huge demand for alternative wikignome-style editors. The more
predictable, concise and documented Parsoid's DOM is, the more users
you get.
I think Parsoid's DOM is predictable :-) but can you say more about
what prompted you to say that?
For example, to find images I have to search elements where typeof is
one of mw:Image, mw:Image/Thumb, mw:Image/Frame, mw:Image/Frameless,
then see if it's a figure or a span, and expect either a <figcaption> or
data-mw accordingly. Add that the img tag's parent can be <a> or
<span>...
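To make the lookup described above concrete, here is a minimal sketch in Python using only the standard library. It scans for elements whose typeof is one of the four values listed and reports which wrapper tag carries it. The SAMPLE markup is a hypothetical, heavily simplified stand-in and not real Parsoid output; the names ImageFinder and find_images are made up for illustration.

```python
from html.parser import HTMLParser

# typeof values listed in the message above.
IMAGE_TYPES = {"mw:Image", "mw:Image/Thumb", "mw:Image/Frame", "mw:Image/Frameless"}


class ImageFinder(HTMLParser):
    """Collect (tag, typeof) for every element typed as an image."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # typeof may hold several space-separated values.
        types = set((dict(attrs).get("typeof") or "").split())
        matched = types & IMAGE_TYPES
        if matched:
            self.found.append((tag, sorted(matched)[0]))


def find_images(html):
    finder = ImageFinder()
    finder.feed(html)
    return finder.found


# Hypothetical, simplified markup (an assumption, not actual Parsoid output).
SAMPLE = (
    '<figure typeof="mw:Image/Thumb"><a href="./File:X.jpg">'
    '<img src="x.jpg"></a><figcaption>caption</figcaption></figure>'
    '<span typeof="mw:Image"><span><img src="y.jpg"></span></span>'
    '<p typeof="mw:Transclusion">text</p>'
)

images = find_images(SAMPLE)
print(images)
```

Even in this toy form, the caller still has to branch on the wrapper tag (figure or span) before knowing where the caption lives, which is the inconvenience being described.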
Instead, this is what I'd expect a proper structure to look like:
Image
    .src = title, internal or external link?
    .repository?
    .page = number or null
    .language = code or null
    .format = thumb etc.
    .caption = wikitext parsed recursively
    .link = internal or external link or null
    .size
        .original
            .width = 1234
            .height = 4321
        .specified
            .width = 2468
        .computed
            .width = 2468
            .height = 8642
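One possible way to read the outline above as a data structure, sketched with Python dataclasses. This is only an illustration of the proposal, not anything Parsoid provides: the class names (Dimensions, ImageSize), the field types, and the example title are assumptions, and the caption is modelled as a plain string rather than recursively parsed wikitext to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dimensions:
    width: Optional[int] = None
    height: Optional[int] = None


@dataclass
class ImageSize:
    original: Dimensions = field(default_factory=Dimensions)
    specified: Dimensions = field(default_factory=Dimensions)
    computed: Dimensions = field(default_factory=Dimensions)


@dataclass
class Image:
    src: str                          # title, internal or external link
    format: Optional[str] = None      # "thumb" etc.
    repository: Optional[str] = None
    page: Optional[int] = None
    language: Optional[str] = None
    caption: Optional[str] = None     # simplification: plain text only
    link: Optional[str] = None
    size: ImageSize = field(default_factory=ImageSize)


# The numbers are the ones from the outline; the title is hypothetical.
example = Image(
    src="File:Example.jpg",
    format="thumb",
    size=ImageSize(
        original=Dimensions(1234, 4321),
        specified=Dimensions(width=2468),
        computed=Dimensions(2468, 8642),
    ),
)
print(example.size.computed)
```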
As for documentation, we document the DOM we generate and its
semantics here [2].
It seems that some sections need updates, e.g. noinclude / includeonly /
onlyinclude
<https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#noinclude_.2F_includeonly_.2F_onlyinclude>
As for size, I just looked at the Barack Obama page and here are some
size numbers.
By "concise" I meant an antonym for redundant, not lengthy :-)
1540407 /tmp/Barack_Obama.parsoid.html
1197318 /tmp/Barack_Obama.parsoid.no-data-mw.html
1045161 /tmp/Barack_Obama.php-parser.output.footer-stripped.html
Right now, because we inline template and other editable information
(as inline JSON attributes of the DOM), the output is a bit bulky.
However, we have always planned to move the data-mw attribute into its
own bucket, which we might do at some point, in which case the size
will be closer to the current PHP parser output. If we moved page
properties and other metadata out, it would shrink a little more.
For views that don't need to support editing or any other manipulation
or analyses, we can strip the HTML more aggressively without
affecting the rendering
Stripping HTML altogether would be a huge step forward. :-)
and get close to or even shrink the size below the PHP parser output
size (there might be use cases where that is the appropriate thing
to do). I could get this down to under 1M by stripping rel attributes,
element ids, and about ids for identifying template output.
But for editing use cases (not just in VE), because of additional
markup in place on the page (element ids, other markup for
transclusions, extensions, links, etc.), the output will probably be
somewhat larger than the corresponding PHP parser output. If we can
keep it under 1.1x the PHP parser output size, I think we are good.
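For reference, the ratios implied by the three byte counts quoted earlier can be worked out as below. This is just arithmetic on those numbers; the dictionary keys are labels made up for this sketch.

```python
# Byte counts quoted above for the Barack Obama page.
sizes = {
    "parsoid": 1540407,
    "parsoid_no_data_mw": 1197318,
    "php_parser": 1045161,
}

baseline = sizes["php_parser"]
ratios = {name: size / baseline for name, size in sizes.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")

# Size budget implied by the 1.1x target, in bytes.
target = int(1.1 * baseline)
print("1.1x target:", target)
```

With these numbers the full Parsoid output is about 1.47x the PHP parser output and the version without data-mw is about 1.15x, so the 1.1x target (roughly 1.15 million bytes) would need a little more trimming beyond moving data-mw out.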
I hope we can meet in the middle :-)
Please file bugs and continue to report things that get in the way of
using Parsoid.
Subbu.
[1] https://www.mediawiki.org/wiki/Parsoid/Users
[2] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l