Extracting PMIDs


Jake Orlowitz

21 Oct 2014

3:57 a.m.


Jeremy Baron

21 Oct

4:05 a.m.

New subject: [OpenAccess] Extracting PMIDs

On Tue, Oct 21, 2014 at 3:57 AM, Jake Orlowitz <jorlowitz(a)gmail.com> wrote:

...

"Do you know if it is possible to extract PubMed ID (PMID) or PMCIDs from Wiki references? Furthermore, could you dump those IDs out into a list for analysis?"

I think so. Can you tell us more about what they want? Using [[wikipedia:ebola virus disease]] as an example:

...

<ref name="Gatherer 2014">{{cite journal | author = Gatherer D | title = The 2014 Ebola virus disease outbreak in West Africa | journal = J. Gen. Virol. | volume = 95 | issue = Pt 8 | pages = 1619–1624 | year = 2014 | pmid = 24795448 | doi = 10.1099/vir.0.067199-0 }}</ref>

One of the params is "pmid". -Jeremy
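As a rough illustration of reading that parameter out of the citation above, here is a minimal Python sketch. The names `CITE_TEMPLATE` and `cite_params` are hypothetical, and the pattern assumes a cite template with no nested templates inside it; real wikitext can nest templates and would need a proper parser.

```python
import re

# Matches one {{cite ...}} template that contains no nested templates
# (an assumption made for this sketch; nested templates are not handled).
CITE_TEMPLATE = re.compile(r"\{\{\s*cite\s+\w+\s*\|([^{}]*)\}\}", re.IGNORECASE)


def cite_params(wikitext):
    """Yield a dict of parameter name -> value for each simple cite template."""
    for match in CITE_TEMPLATE.finditer(wikitext):
        params = {}
        for field in match.group(1).split("|"):
            name, sep, value = field.partition("=")
            if sep:  # skip positional parameters that have no "="
                params[name.strip().lower()] = value.strip()
        yield params


# The reference quoted above from the Ebola virus disease article.
ref = (
    '<ref name="Gatherer 2014">{{cite journal | author = Gatherer D '
    "| title = The 2014 Ebola virus disease outbreak in West Africa "
    "| journal = J. Gen. Virol. | volume = 95 | issue = Pt 8 "
    "| pages = 1619–1624 | year = 2014 | pmid = 24795448 "
    "| doi = 10.1099/vir.0.067199-0 }}</ref>"
)

pmids = [p["pmid"] for p in cite_params(ref) if "pmid" in p]
print(pmids)
```

Splitting the template into named parameters keeps the other fields (such as `doi`) available if they are wanted later.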

Andrew G. West

4:20 a.m.

Jake,

Yes, it's a rather straightforward parse based on the citation format which Jeremy described. Doc James and I already have this coded up for a soon-to-be-published [[WP:MED]] readership/editorship paper.

Searching for PMIDs in the entirety of the Wikipedia article base would be a bit time consuming, but if one needs to pull down only articles in WikiProject Medicine, for example, I am also able to help on that front.

Perhaps we'll take this offline, but if anyone else is interested in the dirty details, feel free to contact one of us off-list. -AW

--
Andrew G. West, PhD
http://www.andrew-g-west.com

On 10/20/2014 11:57 PM, Jake Orlowitz wrote:

...

Hi folks,

Relaying a question from a Stanford medical researcher:

"Do you know if it is possible to extract PubMed ID (PMID) or PMCIDs from Wiki references? Furthermore, could you dump those IDs out into a list for analysis?"

Best,
Jake Orlowitz (Ocaasi)

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Federico Leva (Nemo)

22 Oct

5:46 a.m.

If missing, it would be best to (also) submit a mapping to DBpedia so that they extract PMIDs in the next run. I only found something for ja.wiki:

http://mappings.dbpedia.org/index.php/Mapping_ja:Infobox_Journal
http://mappings.dbpedia.org/server/templatestatistics/en/?template=Vcite_jo… (stats look wrong, too).

Nemo

Maximilian Klein

6:27 p.m.

...


Aaron Halfaker

7:48 p.m.

...


Maximilian Klein

10:06 p.m.

Out of interest, my regex was

pmc\s*\=\s*(.*?)[\|\}]

and then also

pmid\s*\=\s*(.*?)[\|\}]

with the ignorecase flag set.

Make a great day,
Max Klein ‽ http://notconfusing.com/

On Wed, Oct 22, 2014 at 12:48 PM, Aaron Halfaker <aaron.halfaker(a)gmail.com> wrote:

...

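Max's two patterns can be tried in Python roughly as follows. This is only a sketch: the helper name `extract` is hypothetical, the sample citation is made up (its PMC number is a placeholder, not a real identifier), and stripping whitespace from the captured value is an addition, since the lazy `(.*?)` group also captures the space before the closing `|` or `}`.

```python
import re

# The two patterns from the message above, compiled with the ignorecase flag.
PMC_RE = re.compile(r"pmc\s*\=\s*(.*?)[\|\}]", re.IGNORECASE)
PMID_RE = re.compile(r"pmid\s*\=\s*(.*?)[\|\}]", re.IGNORECASE)


def extract(pattern, wikitext):
    """Return every captured value, with surrounding whitespace removed."""
    return [value.strip() for value in pattern.findall(wikitext)]


# Made-up citation; the PMC value is a placeholder for illustration only.
sample = "{{cite journal | title = Example | PMID = 24795448 | pmc = 1234567 }}"

print(extract(PMID_RE, sample))
print(extract(PMC_RE, sample))
```

Because neither pattern starts with a word boundary, each would also match a parameter whose name merely ends in `pmid` or `pmc`; the capture is not limited to digits either, so downstream code may want to validate the values.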

Aaron Halfaker

10:07 p.m.

Ahh. What are pmcs?

On Wed, Oct 22, 2014 at 5:06 PM, Maximilian Klein <isalix(a)gmail.com> wrote:

...

On Wed, Oct 22, 2014 at 12:48 PM, Aaron Halfaker <aaron.halfaker(a)gmail.com> wrote:

> Hey folks,
>
> Somehow I missed this thread, but I've already addressed this request on the Village Pump[1]. See:
> http://datasets.wikimedia.org/public-datasets/enwiki/etc/pmids.articles.201…
>
> I extracted PMIDs with the following regex: /\bpmid *= *[0-9]+\b/i
>
> It includes page_id, page_namespace, page_title, rev_id (most recent), and pmid in TAB-separated values.
>
> Let me know if you have questions or if you think the regex matching strategy is insufficient. It's pretty quick to take another pass.
>
> 1. https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Extracting…
>
> On Wed, Oct 22, 2014 at 1:27 PM, Maximilian Klein <isalix(a)gmail.com> wrote:
>
>> Jake,
>> I have a script that does this already for DOIs; it was a one-line change to make. These files should answer what you were looking for.
>>
>> https://raw.githubusercontent.com/notconfusing/listiness/pmc/pmc_list.txt
>> https://raw.githubusercontent.com/notconfusing/listiness/pmc/pmid_list.txt
>>
>> In the future you can tell them to use halfak's https://pythonhosted.org/mediawiki-utilities/
>> This is the code I used to get those lists: https://github.com/notconfusing/listiness/commit/e140ce9202b9c1098dec40ca1d…
>>
>> Make a great day,
>> Max Klein ‽ http://notconfusing.com/

Jodi Schneider

10:58 p.m.

PubMedCentral: https://en.wikipedia.org/wiki/PubMed_Central

On Thu, Oct 23, 2014 at 12:07 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:

> Ahh. What are pmcs?

Aaron Halfaker

23 Oct

12:15 p.m.

Thanks Jodi, but I know what PubMed Central is. Here, I was (unclearly) asking about the meaning of the "pmc" field. I talked to Max on IRC. He said that "pmc" is an old legacy field name that corresponds to "pmid", so they can be used interchangeably.

I've updated my regex to be /\bpm(id|c) *= *([0-9]+)\b/i and restarted my run over the 2014-10-08 XML dump.

-Aaron

On Wed, Oct 22, 2014 at 5:58 PM, Jodi Schneider <jschneider(a)pobox.com> wrote:

> PubMedCentral: https://en.wikipedia.org/wiki/PubMed_Central
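A minimal sketch of such a pass, using only the Python standard library rather than the tooling actually used for the run. It applies the updated regex to each page of a dump and writes the tab-separated columns listed earlier in the thread (page_id, page_namespace, page_title, rev_id), followed by the matched field name and its value. Everything in `SAMPLE_DUMP` is made up for illustration (the PMC number is a placeholder), `id_rows` is a hypothetical helper name, and the sample omits the XML namespace that real dumps declare on every element. The field name is kept next to each value because PubMed Central IDs and PMIDs are separate identifier series, so merging the two columns would lose that distinction.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# The updated pattern: group 1 is "id" or "c", group 2 is the digits.
PM_RE = re.compile(r"\bpm(id|c) *= *([0-9]+)\b", re.IGNORECASE)

# Tiny stand-in for a dump. Element names follow the MediaWiki export layout,
# but every value is made up, and the XML namespace of real dumps is omitted.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Example article</title>
    <ns>0</ns>
    <id>1</id>
    <revision>
      <id>100</id>
      <text>{{cite journal | pmid = 24795448 | PMC = 1234567 }}</text>
    </revision>
  </page>
</mediawiki>"""


def id_rows(xml_file):
    """Yield (page_id, page_namespace, page_title, rev_id, field, value)."""
    for _event, page in ET.iterparse(xml_file, events=("end",)):
        if page.tag != "page":
            continue
        rev = page.find("revision")
        text = (rev.findtext("text") or "") if rev is not None else ""
        for suffix, value in PM_RE.findall(text):
            yield (page.findtext("id"), page.findtext("ns"),
                   page.findtext("title"), rev.findtext("id"),
                   "pm" + suffix.lower(), value)
        page.clear()  # release the page's children before reading the next one


out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerows(id_rows(io.BytesIO(SAMPLE_DUMP)))
print(out.getvalue(), end="")
```

Parsing incrementally with `iterparse` and clearing each page after use keeps memory flat, which matters for a full-size dump.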
