Hi Amirouche
On 21.06.20 12:12, Amirouche Boubekki wrote:
I am new to the mailing list. I used to work on sotoki.
My question is somewhat related to my failed attempt to store
stackoverflow dumps inside wiredtiger. Eventually, I figured that
wiredtiger could not keep up with the write load and that it is a pain
point at least with wiredtiger (but also with sqlite lsm extension).
The workaround is to have as much RAM as the data (which is in my
opinion not acceptable), and fine-tune eviction triggers et al.
Your initiative on StackExchange/Sotoki has not been forgotten or lost. We
maintain and develop the tools. We have really improved the scraper and
made many releases these last few months:
https://pypi.org/project/sotoki/#history
My questions are about libzim, zimwriterfs and how full-text search is
implemented:
1) Why do zimwriterfs and libzim succeed at putting together all the
html dumps of wikipedia? Is it because they use a lot of RAM? Or is it a
particular algorithm?
libzim does not use a lot of RAM, otherwise it would not be able to run
on smaller devices like Raspberry Pis or low-end smartphones.
libzim manages to store huge amounts of data and make them available on
really small devices because the file format and libzim itself were
designed for that purpose. I won't explain all the details here, but
everything needed to understand it is available at
https://openzim.org/.
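
To give a rough idea of why such a format needs little RAM, here is a
toy sketch in Python. It is not the actual ZIM layout: the function
names, the cluster size and the binary layout are all made up for this
example. It only illustrates the general principle of grouping content
into independently compressed clusters with a pointer table, so that
reading one entry means a few seeks and decompressing a single cluster
rather than loading the whole file. The openZIM specification remains
the reference for the real format.

```python
import io
import struct
import zlib


def write_archive(blobs, cluster_size=2):
    """Pack blobs into independently compressed clusters (toy format)."""
    clusters = []
    index = []  # one (cluster number, offset in cluster, length) per blob
    for start in range(0, len(blobs), cluster_size):
        group = blobs[start:start + cluster_size]
        offset = 0
        for blob in group:
            index.append((len(clusters), offset, len(blob)))
            offset += len(blob)
        clusters.append(zlib.compress(b"".join(group)))

    out = io.BytesIO()
    # Header: number of blobs, number of clusters.
    out.write(struct.pack("<II", len(index), len(clusters)))
    # Blob index: fixed-size records, so record n is found by arithmetic.
    for entry in index:
        out.write(struct.pack("<III", *entry))
    # Cluster pointer table: absolute position and compressed size.
    position = 8 + 12 * len(index) + 12 * len(clusters)
    for cluster in clusters:
        out.write(struct.pack("<QI", position, len(cluster)))
        position += len(cluster)
    for cluster in clusters:
        out.write(cluster)
    return out.getvalue()


def read_blob(f, n):
    """Read blob n by seeking; only one cluster is decompressed."""
    f.seek(0)
    blob_count, _cluster_count = struct.unpack("<II", f.read(8))
    if not 0 <= n < blob_count:
        raise IndexError(n)
    f.seek(8 + 12 * n)
    cluster_no, offset, length = struct.unpack("<III", f.read(12))
    f.seek(8 + 12 * blob_count + 12 * cluster_no)
    position, size = struct.unpack("<QI", f.read(12))
    f.seek(position)
    data = zlib.decompress(f.read(size))
    return data[offset:offset + length]


archive = write_archive([b"alpha", b"beta", b"gamma"])
print(read_blob(io.BytesIO(archive), 2))
```

Memory use in read_blob is bounded by the size of one decompressed
cluster, whatever the size of the archive.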
2) Follow-up question: how is the full-text search put together? Is
the index written document by document then packed into the zim file?
The full-text search engine has "nothing" to do with the ZIM format. We
use the Xapian engine for that optional feature. We keep only the
keywords in the Xapian index, not the documents (they are already in the
ZIM). For a few years now, this index has been embedded in the ZIM file
itself for a better UX. See https://xapian.org/ for more details.
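
The idea of indexing keywords without storing the documents can be
sketched with a tiny inverted index in Python. This is not Xapian's API
and not how Kiwix builds its index; the function names and the sample
paths are hypothetical. Xapian does far more (stemming, ranking, on-disk
storage). The sketch only shows that the index maps terms to document
paths, and that the text itself is fetched elsewhere (here, it would be
read from the ZIM).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document paths containing it.

    Only terms and paths are kept; the document text is not stored.
    """
    index = defaultdict(set)
    for path, text in documents.items():
        for term in tokenize(text):
            index[term].add(path)
    return index


def search(index, query):
    """Return the sorted paths of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample documents, keyed by an article path.
docs = {
    "A/Paris": "Paris is the capital of France",
    "A/Lyon": "Lyon is a city in France",
}
idx = build_index(docs)
print(search(idx, "france city"))
```

A search returns paths only; displaying a result then means looking up
that path in the archive.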
I am working in my free time on a search engine [1]; my goal is to
have my own search engine that I can use locally. That is why I was
thinking about kiwix, because kiwix via .zim files provides readily
available dumps of many useful resources. The last question is:
If you deal with a large amount of free text and want to build a
full-text search engine, this might indeed be a good choice.
3) how can I read the content of .zim from C code? Are there C
bindings of libzim?
libzim is written in C++, so you won't be able to use it properly from C!
Good luck with your project.
Regards
Emmanuel
--
Kiwix - Wikipedia Offline & more
* Web: https://kiwix.org/
* Twitter: https://twitter.com/KiwixOffline
* Wiki: https://wiki.kiwix.org/