This [1] looks quite acrobatic indeed. Can’t we make better use of the
machine-readable markings provided by templates?
<https://commons.wikimedia.org/wiki/Commons:Machine-readable_data>
[1]
https://gerrit.wikimedia.org/r/#/c/80403/4/CommonsMetadata_body.php
It is using the machine-readable data from that page. (Although it's
debatable how machine-readable "look for a <td> with this id, and then
look at the contents of the next sibling <td> you encounter" really is.)
I'm somewhat of a newbie at extracting microformat-style metadata,
though, so it's quite possible there is a better way, or some
higher-level parsing library I could use (something like XPath, maybe,
although it's not really XML I'm looking at).
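
For illustration, the "find the <td> with this id, then read the next
sibling <td>" lookup described above can be sketched with a tolerant HTML
parser instead of an XML one. This is a minimal sketch in Python using only
the standard library, not the code from the Gerrit change. The names
NextCellExtractor and extract_field are hypothetical, the id in the sample
markup is illustrative, and the sketch assumes every <td> is explicitly
closed.

```python
from html.parser import HTMLParser


class NextCellExtractor(HTMLParser):
    """Collect the text of the <td> that follows the <td> with a given id.

    Hypothetical helper for illustration; assumes well-closed <td> tags.
    """

    def __init__(self, label_id):
        super().__init__()
        self.label_id = label_id
        # searching -> in_label -> after_label -> in_value -> done
        self.state = "searching"
        self.depth = 0      # nesting depth of <td> inside the current cell
        self.parts = []     # text fragments found in the value cell

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        if self.state == "searching":
            if dict(attrs).get("id") == self.label_id:
                self.state, self.depth = "in_label", 1
        elif self.state in ("in_label", "in_value"):
            self.depth += 1  # a nested table cell
        elif self.state == "after_label":
            self.state, self.depth = "in_value", 1

    def handle_endtag(self, tag):
        if tag != "td" or self.state not in ("in_label", "in_value"):
            return
        self.depth -= 1
        if self.depth == 0:
            self.state = "after_label" if self.state == "in_label" else "done"

    def handle_data(self, data):
        if self.state == "in_value":
            self.parts.append(data)


def extract_field(html, label_id):
    """Return the whitespace-normalised text of the value cell, or None."""
    parser = NextCellExtractor(label_id)
    parser.feed(html)
    parser.close()
    if parser.state != "done":
        return None
    return " ".join("".join(parser.parts).split())


# Illustrative markup only; real file description pages are more complex.
SAMPLE = (
    '<table><tr><td id="fileinfotpl_aut">Author</td>\n'
    '<td><a href="/wiki/User:Example">Example</a> (talk)</td></tr></table>'
)
print(extract_field(SAMPLE, "fileinfotpl_aut"))
```

The state machine only keeps text from the first <td> opened after the
labelled cell closes, which is exactly the fragile "next sibling" convention
in question: any change to the template's table layout would break it.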
Parsoid might be able to help you with access to template parameters
along with the fully expanded HTML that was produced from them. See [1].
We are going to work on page metadata storage as well, see [2] and [3].
Maybe our storage work could eventually also provide a backend for you.
Gabriel
[1]: