Re: [OpenAccess] [Spam] Re: Scholarly citations by DOI in Wikipedia

10 Feb 2015

On 10 February 2015 at 00:40, Dario Taraborelli
<dtaraborelli(a)wikimedia.org> wrote:

...
> > We should test for these in citation templates. Does your data show
> > which templates (if any) the broken DOIs were in?
>
> We haven’t checked if these errors occur systematically within specific
> templates, but we know that the code extracted them correctly with no
> parsing errors. We’ll share the list of broken DOIs so they can be
> reviewed and fixed.

FWIW, it looks like it might be possible to automatically fix a good
proportion of them. Glancing through the list, I saw quite a few
entries like:

10.1046/j.1095-8339.2003.00157.x/abs/
10.1111/1532-7795.1301001/enhancedabs/

Both represent a valid DOI + extra text. These will probably have been
copy-pasted from the website URL, which often uses the DOI plus a
suffix like pdf, abstract, etc, and it's an easy mistake to make.
About a fifth of the errors match this pattern - 4000 of the entries
have /abs* in them, and 1200 have /pdf.

It suggests we could try automatically trimming broken ones that match
this pattern and seeing if the new one resolves. If so, the odds are
good it's the intended DOI...
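
As a rough illustration of the trim-and-recheck idea, here is a minimal
Python sketch. The suffix list is an assumption based only on the examples
in this thread, the function names are hypothetical, and `resolves` stands
in for whatever lookup is actually used to confirm that a DOI resolves; it
is supplied by the caller so nothing here depends on a particular service.

```python
import re

# Trailing path segments assumed to be URL residue rather than part of the
# DOI. Illustrative only: taken from the examples quoted in this thread.
_SUFFIX = re.compile(r"/(?:enhancedabs|abstract|abs|pdf)/?$", re.IGNORECASE)


def trim_candidate(doi):
    """Return the DOI with one trailing URL-style suffix removed, or None."""
    trimmed, count = _SUFFIX.subn("", doi.strip())
    # Nothing removed, or what is left no longer looks like prefix/suffix.
    if count == 0 or "/" not in trimmed:
        return None
    return trimmed


def suggest_fix(doi, resolves):
    """Return the trimmed DOI only if the caller's check confirms it.

    `resolves` is a hypothetical callable taking a DOI string and returning
    True when that DOI resolves.
    """
    candidate = trim_candidate(doi)
    if candidate is not None and resolves(candidate):
        return candidate
    return None


if __name__ == "__main__":
    # Stand-in for a real resolution check.
    known = {"10.1046/j.1095-8339.2003.00157.x"}
    print(suggest_fix("10.1046/j.1095-8339.2003.00157.x/abs/", known.__contains__))
```

Only accepting the trimmed form when it resolves keeps the fix conservative:
a DOI that merely happens to end in one of these segments is left alone
unless the shorter form is confirmed.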

One other thing that might be worth checking for is invisible
characters - I couldn't spot any in this file, but I don't know if
it's been sanitised in any way. I've had recurrent problems with
user-provided DOIs in repository data turning out to have zero width
spaces buried somewhere in them, possibly as a result of someone
copy-pasting from a PDF.
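
For the invisible-character check, something along these lines might do.
It is only a sketch: it assumes that any character in Unicode category Cf
(format, which includes the zero width space U+200B) or Cc (control) is
unwanted in a DOI, and the function names are hypothetical.

```python
import unicodedata

# Unicode general categories treated as invisible here (an assumption):
# Cf = format characters such as U+200B, Cc = control characters.
_INVISIBLE = ("Cf", "Cc")


def find_invisible(doi):
    """List (position, code point) for each invisible character found."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(doi)
        if unicodedata.category(ch) in _INVISIBLE
    ]


def strip_invisible(doi):
    """Return the DOI with invisible characters removed."""
    return "".join(
        ch for ch in doi if unicodedata.category(ch) not in _INVISIBLE
    )


if __name__ == "__main__":
    sample = "10.1111/\u200b1532-7795.1301001"
    print(find_invisible(sample))
    print(strip_invisible(sample))
```

Reporting positions as well as stripping makes it possible to tell whether
a file has already been sanitised or simply had none of these characters
to begin with.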

Andrew.

PS: really pleased that the one September 2002 DOI is still working fine :-)

-- 
- Andrew Gray
  andrew.gray(a)dunelm.org.uk
