Joaquin,
Thanks for your reply.
Regarding the data-parsoid route, I can't reproduce the trouble I was having. I
suspect I was just getting the /revision/tid part wrong.
Taking a step back, I think part of the problem was that I apparently had an incorrect
mental model of how Parsoid works. I was envisioning something that took wikitext,
parsed it into a semantic parse tree (kind of like mwparserfromhell does), and then
converted that parse tree to HTML. What I was trying to get at was the intermediate
parse tree. Looking at
<https://www.mediawiki.org/wiki/Parsoid/API>, this appeared to be the pagebundle
format, and I was groping around trying to find the API which exposed that. I looked at
the /html routes and thought to myself, "No, that's not what I want. That's
the HTML. I want the parse tree". So I was trying things like:
GET /:domain/v3/page/:format/:title/:revision?
with :format set to "pagebundle". The example URL I tried 404's.
I think the biggest thing that could be done to improve the documentation is to update
<https://www.mediawiki.org/wiki/Parsoid/API>. That's the page you get to most
directly when searching for parsoid documentation.
On Sep 7, 2020, at 6:05 AM, Joaquin Oltra Hernandez
<jhernandez@wikimedia.org> wrote:
Hi Roy,
Some responses inline:
On Fri, Sep 4, 2020 at 6:41 PM Roy Smith <roy@panix.com> wrote:
I know there's been a ton of work done on Parsoid lately. This is great, and the
amount of effort that's gone into this functionality is really appreciated. It's
clear that Parsoid is the way of the future, but the documentation of how you get a
Parsoid parse tree via an API call is kind of confusing.
I found
https://www.mediawiki.org/wiki/Parsoid/API, which looks like it's long out of
date. The last edit was almost 2 years ago. As far as I can tell, most of what it says
is obsolete, and refers to a series of /v3 routes which don't actually exist.
This definitely looks outdated, I'll forward your email to the maintainers so maybe
they can have a look and update it.
I also found
https://en.wikipedia.org/api/rest_v1/#/Page%20content, which seems more in line
with the current reality. But, the call I was most interested in,
/page/data-parsoid/{title}/{revision}/{tid}, doesn't actually respond (at least not on
en.wikipedia.org).
Maybe you can share exactly how you are querying the API and the responses you get, since
this does seem to work fine for me (examples below). I think these APIs are the ones
VisualEditor uses so they should work appropriately.
I tried querying
https://en.wikipedia.org/api/rest_v1/page/html/Banana
first, and got back the response. From it, you can get the revision and "tid" from
the ETag header, like it says in the swagger docs:
ETag header indicating the revision and render timeuuid separated by a slash:
"701384379/154d7bca-c264-11e5-8c2f-1b51b33b59fc" This ETag can be passed to the
HTML save end point (as base_etag POST parameter), and can also be used to retrieve the
exact corresponding data-parsoid metadata, by requesting the specific revision and tid
indicated by the ETag.
With that information, you can then compose the new API call URL:
https://en.wikipedia.org/api/rest_v1/page/data-parsoid/Banana/975959204/7e3fb2f0-eb7b-11ea-bedb-95397ed6461a
that should successfully respond with the metadata.
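The ETag-to-URL step described above can be sketched in Python. (The helper names here are my own; the ETag format is the quoted revision/tid pair from the swagger docs, optionally W/-prefixed for a weak ETag. URL-encoding of the title is omitted for brevity.)

```python
import re

def parse_etag(etag):
    """Split a RESTBase ETag like '"701384379/154d7bca-...-59fc"'
    (possibly W/-prefixed) into a (revision, tid) pair."""
    m = re.match(r'(?:W/)?"?([^/"]+)/([^/"]+)"?$', etag)
    if not m:
        raise ValueError("unexpected ETag format: %r" % etag)
    return m.group(1), m.group(2)

def data_parsoid_url(domain, title, etag):
    """Compose the data-parsoid URL from the ETag of a /page/html response."""
    rev, tid = parse_etag(etag)
    return "https://%s/api/rest_v1/page/data-parsoid/%s/%s/%s" % (
        domain, title, rev, tid)
```

So after fetching /page/html/Banana, passing its ETag header through data_parsoid_url("en.wikipedia.org", "Banana", etag) yields the metadata URL.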
I'm not 100% clear on the difference between the data-mw information on the /page/html
response vs. the data found on the /page/data-parsoid response, but either way you
should be able to use both endpoints as needed.
Eventually, I discovered (see this thread
<https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=976731421#Parsing_wikitext_in_javascript?>),
that the way to get a Parsoid parse tree is via the
https://en.wikipedia.org/api/rest_v1/page/html/ route, and digging the embedded
JSON out of data-mw fragments scattered throughout the HTML. This seems
counter-intuitive, and kind of awkward, since it's not even a full parse tree;
it's just little snippets of parse trees, which I guess correspond to each template
expansion?
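For what it's worth, collecting those scattered data-mw fragments can be done with the standard-library HTML parser; a minimal sketch (the class and function names here are my own):

```python
import json
from html.parser import HTMLParser

class DataMwCollector(HTMLParser):
    """Collect the JSON embedded in data-mw attributes of Parsoid HTML."""
    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        # attrs arrive with HTML entity references already decoded
        for name, value in attrs:
            if name == "data-mw" and value:
                self.fragments.append(json.loads(value))

def extract_data_mw(parsoid_html):
    """Return every data-mw object found in a Parsoid HTML string."""
    collector = DataMwCollector()
    collector.feed(parsoid_html)
    return collector.fragments
```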
I looked around and found
https://www.mediawiki.org/wiki/Specs/HTML/2.1.0 linked on the Parsoid page, which
has extensive documentation on how wikitext <-> HTML is translated. It seems to be
more actively maintained. Hopefully this can give you some insight on how the responses
relate to the wikitext and how to find what you want.
So, taking a step backwards, my ultimate goal is to be able to parse the wikitext of a
page and discover the template calls, with their arguments. On the server side, I'm
doing this in Python with mwparserfromhell, which is fine. But now I need to do it on the
client side, in browser-executed javascript. I've looked at a few client-side
libraries, but if Parsoid really is ready for prime time, it seems silly not to use it,
and it's just a question of finding the right API calls.
You may be interested in the #Template_markup section
(https://www.mediawiki.org/wiki/Specs/HTML/2.1.0#Template_markup) of the previous
spec, given your problem statement.
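Following the #Template_markup structure in that spec (each transclusion's data-mw holds a "parts" list whose template entries carry a target.wt name and a params map of wikitext values), discovering template calls with their arguments might look like this sketch; the function name is my own, and a client-side JavaScript version would be analogous:

```python
def template_calls(data_mw):
    """Yield (template_name, {param: wikitext}) pairs from one data-mw
    object, following the Specs/HTML #Template_markup structure."""
    for part in data_mw.get("parts", []):
        # plain wikitext strings are interleaved with template objects
        if not isinstance(part, dict) or "template" not in part:
            continue
        tpl = part["template"]
        name = tpl.get("target", {}).get("wt", "").strip()
        params = {k: v.get("wt", "") for k, v in tpl.get("params", {}).items()}
        yield name, params
```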
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud