On Sun, Feb 24, 2008 at 7:02 AM, Ragib Hasan <ragibhasan(a)gmail.com> wrote:
Hi,
I need to extract only the text from a Wikipedia page. I.e., I
need to remove all wiki markup, section headings, etc., to extract only
the text a reader will read.
For example, for the text:
'''Paris''' ([[Help:IPA|pronounced]] /paʁi/ in French; /ˈpaɹɪs/ in
English) is the [[communes of France|capital city]] of [[France]]. It
is situated on the [[Seine|River Seine]], in northern France, at the
heart of the [[Île-de-France (region)|Île-de-France]] [[Regions of
France|region]] (aka "Paris Region"; in French: ''Région Parisienne''
or ''RP''). The City of Paris has an estimated population of 2,167,994
within its administrative limits (January 2006).
I need to get the following after extraction:
Paris (pronounced /paʁi/ in French; /ˈpaɹɪs/ in English) is the
capital city of France. It is situated on the River Seine, in northern
France, at the heart of the Île-de-France region (aka "Paris Region";
in French: Région Parisienne or RP). The City of Paris has an
estimated population of 2,167,994 within its administrative limits
(January 2006).
Using the Pywikipediabot framework, I can get the raw text, but not the
text without markup. Since I need to do some textual analysis on the
article contents, I need to get rid of all the extra markup, citation
tags, and other templates.
So, what is the best/easiest way to do this? Thanks in advance.
Ragib
--
Ragib Hasan
PhD Student
Dept of Computer Science
University of Illinois at Urbana-Champaign
201 N Goodwin Avenue
Urbana IL 61801
Website:
http://www.ragibhasan.com
http://netfiles.uiuc.edu/rhasan/www
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Since you need it for textual analysis, you have more options.
You can use wikiprep
(
). There is also a
better-maintained WikiPrep, maintained by someone named Tomaz; you can
get that from
.
Download wikiprep.pl and images.pm from there.
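If you only need a rough approximation and would rather stay in Python
(since you already use Pywikipediabot), a few regular expressions cover
the simple cases in your example. This is only a minimal sketch under
my own assumptions: the function name strip_wiki_markup is made up for
illustration, it is not part of WikiPrep or Pywikipediabot, and it does
not handle tables, images, HTML tags other than <ref>, or other complex
markup.

```python
import re

def strip_wiki_markup(text):
    """Roughly reduce simple wiki markup to reader-visible text.

    Hypothetical helper for illustration only; not a full wikitext parser.
    """
    # Drop <ref>...</ref> citations, then self-closing <ref ... /> tags.
    text = re.sub(r"<ref[^>/]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>]*/>", "", text)
    # Remove innermost {{templates}} repeatedly so nested ones go too.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # [[target|label]] becomes label; [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Strip bold/italic quote runs ('' up to ''''').
    text = re.sub(r"'{2,}", "", text)
    # Drop section heading lines such as "== History ==".
    text = re.sub(r"^=+.*?=+[ \t]*$", "", text, flags=re.MULTILINE)
    # Collapse runs of whitespace, including line breaks.
    return re.sub(r"\s+", " ", text).strip()

sample = (
    "== Intro ==\n"
    "'''Paris''' is the [[communes of France|capital city]] of "
    "[[France]].<ref>cite</ref> It is on the [[Seine|River Seine]]."
    "{{fact|date={{now}}}}"
)
print(strip_wiki_markup(sample))
```

For the sample above this prints "Paris is the capital city of France.
It is on the River Seine." For anything beyond such simple articles, a
real preprocessor like WikiPrep is the safer choice.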
Hope this helps.
--
Apple Grew
my blog @