On 10 February 2015 at 00:40, Dario Taraborelli
<dtaraborelli(a)wikimedia.org> wrote:
We should test for these in citation templates. Does your data show which
templates (if any) the broken DOIs were in?
We haven’t checked whether these errors occur systematically within specific
templates, but we know that the code extracted them correctly with no parsing
errors. We’ll share the list of broken DOIs so they can be reviewed and fixed.
FWIW, it looks like it might be possible to automatically fix a good
proportion of them. Glancing through the list, I saw quite a few
entries like:
10.1046/j.1095-8339.2003.00157.x/abs/
10.1111/1532-7795.1301001/enhancedabs/
Both represent a valid DOI + extra text. These will probably have been
copy-pasted from the website URL, which often uses the DOI plus a
suffix like pdf, abstract, etc, and it's an easy mistake to make.
About a fifth of the errors match this pattern - 4000 of the entries
have /abs* in them, and 1200 have /pdf.
It suggests we could try automatically trimming broken ones that match
this pattern and seeing if the new one resolves. If so, the odds are
good it's the intended DOI...
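
A minimal sketch of that trim-and-retry idea, in Python. The suffix list
is an assumption based only on the examples above (abs, enhancedabs,
abstract, pdf), and the names trim_suffix and suggest_fix are
hypothetical. The resolution check is deliberately left as a function
the caller supplies, since how to test whether a DOI resolves is not
settled here.

```python
import re

# Assumed list of URL-style suffixes, taken from the examples in this thread.
SUFFIX_RE = re.compile(r"/(?:enhancedabs|abstract|abs|pdf)/?$", re.IGNORECASE)


def trim_suffix(doi):
    """Return the DOI with a trailing URL-style suffix removed, or None.

    None means the string did not end in one of the assumed suffixes, or
    that trimming would leave something with no prefix/suffix separator.
    """
    match = SUFFIX_RE.search(doi)
    if match is None:
        return None
    trimmed = doi[:match.start()]
    # A DOI needs a "prefix/suffix" shape, so keep at least one slash.
    if "/" not in trimmed:
        return None
    return trimmed


def suggest_fix(doi, resolves):
    """Return a trimmed DOI only if the caller's check says it resolves.

    `resolves` is a caller-supplied callable taking a DOI string and
    returning True or False (hypothetical; not defined in this sketch).
    """
    trimmed = trim_suffix(doi)
    if trimmed is not None and resolves(trimmed):
        return trimmed
    return None
```

Anything suggest_fix returns is only a candidate, so it would still
be worth a human glance before the article is edited.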
One other thing that might be worth checking for is invisible
characters - I couldn't spot any in this file, but I don't know if
it's been sanitised in any way. I've had recurrent problems with
user-provided DOIs in repository data turning out to have zero width
spaces buried somewhere in them, possibly as a result of someone
copy-pasting from a PDF.
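
For what it's worth, a check along these lines can be sketched in
Python with the standard unicodedata module. It assumes that
"invisible" means Unicode format characters (category Cf, which
includes the zero width space U+200B) plus control characters (Cc).
The function names are hypothetical.

```python
import unicodedata

# Assumed definition of "invisible": format (Cf) and control (Cc) characters.
INVISIBLE_CATEGORIES = {"Cf", "Cc"}


def find_invisible(doi):
    """List (index, code point) pairs for invisible characters in a DOI."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(doi)
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES
    ]


def strip_invisible(doi):
    """Return the DOI with those characters removed."""
    return "".join(
        ch for ch in doi
        if unicodedata.category(ch) not in INVISIBLE_CATEGORIES
    )
```

Running find_invisible over the raw list first would show whether
the problem exists at all before anything gets rewritten.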
Andrew.
PS: really pleased that the one September 2002 DOI is still working fine :-)
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk