Re: [Wikitech-l] Datamining infoboxes

23 Oct 2009

2009/10/23 Robert Ullmann <rlullmann(a)gmail.com>:
...
 I've been spending hours on the parsing now and don't find it simple
 at all due to the fact that templates can be nested. Just extracting
 the Infobox as one big lump is hard due to the need to match nested {{
 and }}

 Andrew Dunbar (hippietrail) 
 Hi,

 Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
 the text, then count braces. Start at depth=2, count up and down 'till
 you reach 0, and you are at the end of the template. (you can be picky
 about only counting them if paired if you like ;-) 
Actually you have to find "{{[Ii]nfobox[ _][Ll]anguage".
And I wanted to be robust. It's perfectly legal for single unmatched
braces to appear anywhere and I didn't want them to break my code. As
it happens there don't currently seem to be any in the language
infoboxes.
I couldn't be sure whether there would be any cases where a {{{ or }}}
might show up either. And there are a few other edge cases such as HTML
comments, <nowiki> and friends, template invocations in values, and
possibly even template invocations in names.
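
For illustration, the brace counting discussed above might be sketched
in Python roughly like this. It is only a sketch under assumptions: the
function name extract_infobox is made up, braces are counted in pairs so
a lone { or } is ignored, HTML comments and <nowiki> spans are skipped,
and {{{ / }}} are handled naively as one pair plus a stray brace.

```python
import re

# Start pattern taken from the discussion: "{{[Ii]nfobox[ _][Ll]anguage".
START = re.compile(r"\{\{[Ii]nfobox[ _][Ll]anguage")
# Spans whose contents should not be counted as braces (assumed list).
SKIP = re.compile(r"<!--.*?-->|<nowiki>.*?</nowiki>", re.DOTALL)


def extract_infobox(text):
    """Return the whole infobox as one lump, or None if absent or unbalanced."""
    m = START.search(text)
    if m is None:
        return None
    i = m.start() + 2
    depth = 1  # counted in brace pairs, so the opening "{{" is depth 1
    while i < len(text):
        skip = SKIP.match(text, i)
        if skip:
            i = skip.end()           # jump over a comment or nowiki span
        elif text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                return text[m.start():i]
        else:
            i += 1                   # ordinary character or a single brace
    return None                      # ran off the end without closing
```

Calling extract_infobox(wikitext) on a page's wikitext should give the
lump from "{{Infobox language" to its matching "}}", ready for line or
parameter matching. It does not deal with an infobox wrapped in another
template whose name differs.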

...
  Then just regex match the lines/parameters you want.

 However, if you are pulling the wikitext with the API, the XML parse
 tree option sounds good; then you can just use elementTree (or the
 like) and pull out the parameters directly 
I've got it extracting the name/value pairs from the XML finally but
parsing XML is always a pain. And it still misses Norwegian, Bokmål,
and Nynorsk which wrap the infobox in another template...
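
As a rough sketch of the ElementTree route, something like the
following could pull the name/value pairs out of the XML parse tree.
The element names (template, title, part, name, value, and the index
attribute on unnamed parameters) are my assumption about the shape of
that XML, and infobox_params is a made-up name. Walking every
<template> element, not just the top-level ones, is one possible way to
reach an infobox that is wrapped in another template.

```python
import xml.etree.ElementTree as ET


def _text(el):
    """Flatten an element to plain text, including text of nested elements."""
    return "".join(el.itertext()).strip() if el is not None else ""


def infobox_params(xml_text, wanted="infobox language"):
    """Return {name: value} for the first matching template, else None."""
    root = ET.fromstring(xml_text)
    # iter() visits nested templates too, in document order.
    for tmpl in root.iter("template"):
        title = _text(tmpl.find("title")).replace("_", " ").lower()
        if title != wanted:          # looser than [Ii]nfobox[ _][Ll]anguage
            continue
        params = {}
        for part in tmpl.findall("part"):
            name_el = part.find("name")
            name = _text(name_el)
            if not name and name_el is not None:
                # Unnamed parameters are assumed to carry index="N".
                name = name_el.get("index", "")
            # Nested templates in a value are flattened to their text.
            params[name] = _text(part.find("value"))
        return params
    return None
```

The values here are flattened text, so a template invocation inside a
value shows up as its title and arguments run together; keeping the
nested structure would need a different return type.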

Andrew Dunbar (hippietrail)

...
  Robert


-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net
