Hi Dan!
On Mon, Dec 8, 2014 at 11:29 AM, Dan Garry <dgarry(a)wikimedia.org> wrote:
*Background:* The Mobile Apps Team is working on a restyling of the way
the first fold of content is presented in the Wikipedia app. You can see
this image <http://i.imgur.com/dxqfJKd.png> to see what this looks like.
That looks awesome, can't wait to see it live! Any chance of something like
this eventually hitting the desktop site? :-)
Having a high-resolution image so prominently at the top of the page will
likely drive a lot of clicks, so we're working on a lightweight image
viewer to deal with file pages, which are poorly styled monstrosities on
the mobile app. We're going to use the CommonsMetadata API to help us out.
:-)
Keep in mind that there is no guarantee the API output is an accurate
representation of the file page (lack of machine-readable template markup
etc. - for example, CommonsMetadata can't figure out the license name for
about 5% of the MediaViewer pageviews), so you'll still need a link to the
raw file page somewhere.
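
To illustrate the fallback, here is a minimal sketch in Python. It assumes
the metadata has already been decoded into a dict of fields shaped like
{"value": ...} and that the license name sits in a field called
LicenseShortName; the function name, the wording of the labels and the
sample URL are made up for the example.

```python
def license_line(extmetadata, file_page_url):
    """Build a one-line license label that always points at the file page.

    `extmetadata` is assumed to map field names to dicts with a "value" key.
    """
    field = extmetadata.get("LicenseShortName") or {}
    name = (field.get("value") or "").strip()
    if name:
        # The API produced a license name; still link to the full details.
        return f"{name} (details: {file_page_url})"
    # No machine-readable license: send the user to the raw file page.
    return f"License unknown, see {file_page_url}"


print(license_line({"LicenseShortName": {"value": "CC BY-SA 3.0"}},
                   "https://example.org/File:X.jpg"))
print(license_line({}, "https://example.org/File:X.jpg"))
```

Either way the link to the file page is present, so a missing or wrong
license name never leaves the user without the authoritative source.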
*Problem:* The CommonsMetadata API can sometimes return HTML [1]. Having
HTML in the API response is a bit problematic for us. Native apps make next
to no use of HTML when creating links or layouts, so we have to strip the
HTML from every API response, lest it be displayed as plaintext to the
user. In the short term this is fine, we can strip it and throw the
information away. But in the long run it'd be better if the API didn't
return HTML.
In the long run CommonsMetadata should die in a fire, together with the
Commons paradigm of storing information in template parameters.
You can see the related plans at Commons:Structured data
<https://commons.wikimedia.org/wiki/Commons:Structured_data>; these include
migrating most information to plaintext (file descriptions will probably
remain rich text).
In the not so long run, some HTML markup is fairly important. Links can be
necessary for the attribution, paragraphs for making long descriptions more
readable; removing lists and tables makes some descriptions unreadable (map
legends tend to use tables, for example). So I think the API would be much
less useful if it started stripping HTML. (It does that already in a few
cases where the intent is clear, such as stripping the enclosing <p>
generated by MediaWiki, or stripping certain kinds of purely presentational
markup such as creator templates
<https://commons.wikimedia.org/wiki/Template:Creator>, but that only works
when the source and intent of the markup are known.)
We could add an API parameter to provide a plaintext version, but that
would split the cache (both varnish and memcached). Not a huge deal, but
tag stripping is very easy, so if you don't need anything more specific
than that, I would say it is simpler to do it on the client side. If more
complex logic is needed (e.g. turning <ul>s into star lists), it makes
sense to do that in the API instead of forcing each client to reimplement
it, but I am not sure how generic such a text representation would be.
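
For a sense of how little code client-side stripping takes, here is a
minimal sketch in Python using the standard library's html.parser. The
class and function names are invented for the example, and the choice of
which tags count as block-level (and the "* " rendering of list items) is
an assumption, not something the API defines.

```python
from html.parser import HTMLParser


class PlainText(HTMLParser):
    """Drop all tags, keeping line breaks for block elements and '* ' for <li>."""

    BLOCK_TAGS = {"p", "br", "div", "ul", "ol", "li", "tr"}  # assumed set

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n* ")
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty lines.
        lines = [" ".join(line.split()) for line in "".join(self.parts).split("\n")]
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = PlainText()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<p>Map of <b>Paris</b></p>"
                   "<ul><li>red: roads</li><li>blue: rivers</li></ul>"))
```

Tables are the case this does not handle well: each row ends up on its own
line, but the cells run together, which is where a shared server-side text
representation might start to pay off.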
So, given that we can't do anything meaningful with the HTML in a native
app, that means we only have three options:
- Display the raw HTML directly to the user
- Try to parse the HTML for interesting information and update the
relevant view's properties using native code
- Strip any and all HTML tags that are given to us in the JSON
The first two aren't sounding workable at all to me; the first is
unworkable from a product standpoint, and the second is an absolutely
gigantic can of worms. So I guess we'll be stripping the HTML until such
time that this is fixed. :-)
I'm not sure some limited HTML parsing is that bad. The low-hanging fruit
is links (MediaViewer currently strips everything else, and most of the
time that works decently), and those are never nested, so they can be
processed by a trivial SAX parser, for which all platforms surely have
libraries.
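
A minimal sketch of that idea in Python, with the standard library's
event-driven html.parser standing in for a SAX parser: it strips every tag
but records where each <a> starts and ends in the resulting plain text, so
a native view could attach a tap target to that range. The names are
invented for the example, and it relies on the assumption above that links
are never nested.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect plain text plus (start, end, href) ranges for each link."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = ""
        self.links = []    # (start offset, end offset, href) in self.text
        self._open = None  # (start offset, href) of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (len(self.text), href)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            start, href = self._open
            self.links.append((start, len(self.text), href))
            self._open = None

    def handle_data(self, data):
        self.text += data


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.text, parser.links


text, links = extract_links(
    'Photo by <a href="https://example.org/u"><b>Jane</b></a>, 2014')
print(text)
print(links)
```

The offsets index into the stripped text, so the client can build its
native label from the text and apply the links as ranges on top of it.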