Joaquin,
Thanks for your reply.
Regarding the data-parsoid route, I can't reproduce the trouble I was having. I
suspect I was just getting the /revision/tid part wrong.
Taking a step back, I think part of the problem was that I apparently had an incorrect
mental model of how Parsoid works. I was envisioning something that took wikitext,
parsed it into a semantic parse tree (kind of like mwparserfromhell does), and then
converted that parse tree to HTML. What I was trying to get at was the intermediate
parse tree. Looking at
<https://www.mediawiki.org/wiki/Parsoid/API>, this appeared to be the pagebundle
format, and I was groping around trying to find the API which exposed that. I looked at
the /html routes and thought to myself, "No, that's not what I want. That's
the HTML. I want the parse tree". So I was trying things like:
GET /:domain/v3/page/:format/:title/:revision?
with :format set to "pagebundle". The example URL I tried 404's.
I think the biggest thing that could be done to improve the documentation is to update
<https://www.mediawiki.org/wiki/Parsoid/API>. That's the page you get to most
directly when searching for parsoid documentation.
On Sep 7, 2020, at 6:05 AM, Joaquin Oltra Hernandez
<jhernandez@wikimedia.org> wrote:
Hi Roy,
Some responses inline:
On Fri, Sep 4, 2020 at 6:41 PM Roy Smith <roy@panix.com> wrote:
I know there's been a ton of work done on Parsoid lately. This is great, and the
amount of effort that's gone into this functionality is really appreciated. It's
clear that Parsoid is the way of the future, but the documentation of how you get a
Parsoid parse tree via an API call is kind of confusing.
I found
https://www.mediawiki.org/wiki/Parsoid/API, which looks like it's long out of
date. The last edit was almost 2 years ago. As far as I can tell, most of what it says
is obsolete, and refers to a series of /v3 routes which don't actually exist.
This definitely looks outdated, I'll forward your email to the maintainers so maybe
they can have a look and update it.
I also found
https://en.wikipedia.org/api/rest_v1/#/Page%20content, which seems more in line
with the current reality. But, the call I was most interested in,
/page/data-parsoid/{title}/{revision}/{tid}, doesn't actually respond (at least not on
en.wikipedia.org).
Maybe you can share exactly how you are querying the API and the responses you get, since
this does seem to work fine for me (examples below). I think these APIs are the ones
VisualEditor uses so they should work appropriately.
I tried querying
https://en.wikipedia.org/api/rest_v1/page/html/Banana
first, and got back the response. From it, you can get the revision and "tid" from
the ETag header, like it says in the swagger docs:
ETag header indicating the revision and render timeuuid separated by a slash:
"701384379/154d7bca-c264-11e5-8c2f-1b51b33b59fc" This ETag can be passed to the
HTML save end point (as base_etag POST parameter), and can also be used to retrieve the
exact corresponding data-parsoid metadata, by requesting the specific revision and tid
indicated by the ETag.
With that information, you can then compose the new API call URL:
https://en.wikipedia.org/api/rest_v1/page/data-parsoid/Banana/975959204/7e3fb2f0-eb7b-11ea-bedb-95397ed6461a
that should successfully respond with the metadata.
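The ETag-to-URL step described above can be sketched in Python. (The helper names here are my own; the ETag format is the quoted revision/tid pair from the swagger docs, optionally W/-prefixed for a weak ETag. URL-encoding of the title is omitted for brevity.)

```python
import re

def parse_etag(etag):
    """Split a RESTBase ETag like '"701384379/154d7bca-...-59fc"'
    (possibly W/-prefixed) into a (revision, tid) pair."""
    m = re.match(r'(?:W/)?"?([^/"]+)/([^/"]+)"?$', etag)
    if not m:
        raise ValueError("unexpected ETag format: %r" % etag)
    return m.group(1), m.group(2)

def data_parsoid_url(domain, title, etag):
    """Compose the data-parsoid URL from the ETag of a /page/html response."""
    rev, tid = parse_etag(etag)
    return "https://%s/api/rest_v1/page/data-parsoid/%s/%s/%s" % (
        domain, title, rev, tid)
```

So after fetching /page/html/Banana, passing its ETag header through data_parsoid_url("en.wikipedia.org", "Banana", etag) yields the metadata URL.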
I'm not 100% clear on the difference between the data-mw information on the /page/html
response vs. the data found on the /page/data-parsoid response, but either way you
should be able to use both endpoints as needed.
Eventually, I discovered (see this thread
<https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=976731421#Parsing_wikitext_in_javascript?>),
that the way to get a Parsoid parse tree is via the
https://en.wikipedia.org/api/rest_v1/page/html/ route, and digging the embedded
JSON out of data-mw fragments scattered throughout the HTML. This seems
counter-intuitive, and kind of awkward, since it's not even a full parse tree;
it's just little snippets of parse trees, which I guess correspond to each template
expansion?
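For what it's worth, collecting those scattered data-mw fragments can be done with the standard-library HTML parser; a minimal sketch (the class and function names here are my own):

```python
import json
from html.parser import HTMLParser

class DataMwCollector(HTMLParser):
    """Collect the JSON embedded in data-mw attributes of Parsoid HTML."""
    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        # attrs arrive with HTML entity references already decoded
        for name, value in attrs:
            if name == "data-mw" and value:
                self.fragments.append(json.loads(value))

def extract_data_mw(parsoid_html):
    """Return every data-mw object found in a Parsoid HTML string."""
    collector = DataMwCollector()
    collector.feed(parsoid_html)
    return collector.fragments
```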
I looked around and found
https://www.mediawiki.org/wiki/Specs/HTML/2.1.0 linked on the Parsoid page, which
has extensive documentation on how wikitext <-> HTML is translated. It seems to be
more actively maintained. Hopefully this can give you some insight on how the responses
relate to the wikitext and how to find what you want.
So, taking a step backwards, my ultimate goal is to be able to parse the wikitext of a
page and discover the template calls, with their arguments. On the server side, I'm
doing this in Python with mwparserfromhell, which is fine. But now I need to do it on the
client side, in browser-executed javascript. I've looked at a few client-side
libraries, but if Parsoid really is ready for prime time, it seems silly not to use it,
and it's just a question of finding the right API calls.
You may be interested in the #Template_markup section
(https://www.mediawiki.org/wiki/Specs/HTML/2.1.0#Template_markup) of the previous
spec, given your problem statement.
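Following the #Template_markup structure in that spec (each transclusion's data-mw holds a "parts" list whose template entries carry a target.wt name and a params map of wikitext values), discovering template calls with their arguments might look like this sketch; the function name is my own, and a client-side JavaScript version would be analogous:

```python
def template_calls(data_mw):
    """Yield (template_name, {param: wikitext}) pairs from one data-mw
    object, following the Specs/HTML #Template_markup structure."""
    for part in data_mw.get("parts", []):
        # plain wikitext strings are interleaved with template objects
        if not isinstance(part, dict) or "template" not in part:
            continue
        tpl = part["template"]
        name = tpl.get("target", {}).get("wt", "").strip()
        params = {k: v.get("wt", "") for k, v in tpl.get("params", {}).items()}
        yield name, params
```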
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud