Felipe Ortega wrote:
Hi all.
I'm adding some tweaks to the WikiXRay parser of meta-history dumps. I now extract
internal and external links, and so on, but I'd also like to extract the plain text
(without HTML code and, possibly, also filtering out wiki tags).
Does anyone know a Python library to do that? I believe there should be something out
there, as there exist bots and crawlers automating the data extraction process from one
wiki to another.
Thanks in advance for your comments.
Felipe.
If you have the html, extracting the plain text is really easy. Just
skip everything between < and > and decode entities :P
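A minimal sketch of that approach in Python, using only the standard library (note this is naive: it will mangle pages where a literal "<" appears outside a tag, or where tags contain ">" inside attribute values; a real parser like HTMLParser is safer):

```python
import html
import re

def strip_html(text):
    # Naively drop everything between "<" and ">" (i.e. the tags themselves)
    no_tags = re.sub(r'<[^>]*>', '', text)
    # Decode entities such as &amp;, &eacute;, &#233;
    return html.unescape(no_tags)

print(strip_html('<p>caf&eacute; &amp; <b>tea</b></p>'))  # café & tea
```

For wiki markup (the second part of the question) a regex pass like this is not enough, since wikitext nests templates and links; that part really does call for a dedicated parser.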