Hi Dan!

On Mon, Dec 8, 2014 at 11:29 AM, Dan Garry <dgarry@wikimedia.org> wrote:
Background: The Mobile Apps Team is working on a restyling of the way the first fold of content is presented in the Wikipedia app. You can see what this looks like in this image.

That looks awesome, can't wait to see it live! Any chance of something like this eventually hitting the desktop site? :-)

Having a high-resolution image so prominently at the top of the page will likely drive a lot of clicks, so we're working on a lightweight image viewer to deal with file pages, which are poorly styled monstrosities on the mobile app. We're going to use the CommonsMetadata API to help us out. :-)

Keep in mind that there is no guarantee the API output accurately represents the file page (templates may lack machine-readable markup, among other things; for example, CommonsMetadata can't figure out the license name for about 5% of MediaViewer pageviews), so you'll still need a link to the raw file page somewhere.

Problem: The CommonsMetadata API can sometimes return HTML [1]. Having HTML in the API response is a bit problematic for us. Native apps make next to no use of HTML when creating links or layouts, so we have to strip the HTML from every API response, lest it be displayed as plaintext to the user. In the short term this is fine; we can strip it and throw the information away. But in the long run it'd be better if the API didn't return HTML.

In the long run CommonsMetadata should die in a fire, together with the Commons paradigm of storing information in license parameters.
You can see the related plans at Commons:Structured data; these include migrating most information to plaintext (file descriptions will probably remain rich text).

In the not so long run, some HTML markup is fairly important. Links can be necessary for attribution, and paragraphs make long descriptions more readable; removing lists and tables makes some descriptions unreadable (map legends tend to use tables, for example). So I think the API would be much less useful if it started stripping HTML. (It does that already in a few cases where the intent is clear, such as stripping the enclosing <p> generated by MediaWiki, or stripping certain kinds of purely presentational markup such as creator templates, but that only works when the source and intent of the markup are known.)

We could add an API parameter to provide a plaintext version, but that would split the cache (both Varnish and memcached). Not a huge deal, but tag stripping is very easy, so if you don't need anything more specific than that, I would say it is simpler to do it on the client side. If more complex logic is needed (e.g. turning <ul>s into star lists), it makes sense to do that in the API instead of forcing each client to reimplement it, but I am not sure how generic such a text representation would be.
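
To illustrate how little is involved in plain tag stripping on the client, here is a minimal sketch in Python using the standard library's html.parser. The names TagStripper and strip_tags are made up for this example, and the input strings are invented rather than real API output; the apps would do the equivalent with whatever parser their platform ships.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text nodes and discards every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    # Collapse whitespace runs (newlines, indentation) into single spaces.
    return " ".join("".join(stripper.parts).split())


print(strip_tags('<p>Photo by <a href="https://example.org/u">Jane &amp; Co</a></p>'))
print(strip_tags("<ul><li>a</li><li>b</li></ul>"))
```

Note that the second call yields "ab": list items and table cells simply run together, which is the kind of case where more specific logic than tag stripping would be needed.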

So, given that we can't do anything meaningful with the HTML in a native app, that means we only have three options:
  • Display the raw HTML directly to the user 
  • Try to parse the HTML for interesting information and update the relevant view's properties using native code
  • Strip any and all HTML tags that are given to us in the JSON
The first two aren't sounding workable at all to me; the first is unworkable from a product standpoint, and the second is an absolutely gigantic can of worms. So I guess we'll be stripping the HTML until such time that this is fixed. :-)

I'm not sure some limited HTML parsing is that bad. The low-hanging fruit is links (MediaViewer currently strips everything else, and most of the time that works decently), and those are never nested, so they can be processed by a trivial SAX parser, for which all platforms surely have libraries.
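
As a rough sketch of what I mean, assuming links are never nested: the event-based parser below (Python's html.parser, which works in the same callback style as SAX) drops every tag but remembers where each <a> started and ended in the resulting plain text. LinkExtractor and extract_links are names invented for this example, and the (start, end, href) tuple is just one possible shape; a native client could feed such ranges into whatever attributed-string or span mechanism its platform provides.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Hypothetical helper: strips all tags, but records <a href> ranges."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = ""      # plain text accumulated so far
        self.links = []     # (start, end, href) offsets into self.text
        self._open = None   # (start offset, href) of the <a> being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (len(self.text), href)

    def handle_endtag(self, tag):
        # Links are assumed not to nest, so one pending link is enough.
        if tag == "a" and self._open is not None:
            start, href = self._open
            self.links.append((start, len(self.text), href))
            self._open = None

    def handle_data(self, data):
        self.text += data


def extract_links(html):
    """Return (plain_text, [(start, end, href), ...]) for an HTML fragment."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()  # flush trailing text
    return parser.text, parser.links


text, links = extract_links('By <a href="https://example.org/jane">Jane</a>, CC BY')
print(text)   # By Jane, CC BY
print(links)  # [(3, 7, 'https://example.org/jane')]
```

Everything other than links is still thrown away here, so this is no better than stripping for lists or tables; it only keeps attribution links usable.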