Hi Amirouche -- is this for an offline search?  Would love to read more about it. 

On Sun, Nov 1, 2020 at 6:36 AM Amirouche Boubekki <amirouche.boubekki@gmail.com> wrote:
Hello,


I am working on a search engine (unlike Sphinx or Elasticsearch, more
like Bing or Google). I was planning to use .zim files to feed the
index, but the problem is that there is no systematic way to find the
original URL of the documents.

I am wondering whether one of the following would be possible for the
Kiwix project to do:

A) Add a <meta url="https://foobar"> in the HTML inside the .zim files,

A bis) Add a metadata field per document with the original URL inside
the .zim files,

B) Publish .warc files of the Wikipedia, Stack Overflow dumps, etc., so
that people like myself can reuse them. WARC files are more useful
than .zim files but still less user-friendly than the following
proposal...

C) ... One last alternative is to switch from the custom .zim file
storage to an okvs [0] such as RocksDB or the SQLite LSM extension [1].
The idea is to make it very easy to access the Kiwix dumps from many
programming languages, unlike the current approach, which is limited to
C++ and Python. Also, it would be easier to extend a given dump with
custom fields, unlike the current .zim format, which seems to be
read-only.
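
To make option C (together with the per-document URL field of A bis) a
bit more concrete, here is a minimal sketch. It uses Python's standard
sqlite3 module as a stand-in for an okvs; it is plain SQLite, not the
lsm1 extension or RocksDB. The table layout, the key scheme and the
helper names (open_store, make_key, put, scan) are assumptions made up
for illustration and are not part of libzim or Kiwix. The document path
and URLs are made-up sample data.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a toy ordered key-value store backed by a single SQLite table."""
    db = sqlite3.connect(path)
    # WITHOUT ROWID keeps rows sorted by the primary key, which is what
    # makes ordered range scans possible.
    db.execute(
        "CREATE TABLE IF NOT EXISTS kv ("
        "key BLOB PRIMARY KEY, value BLOB NOT NULL) WITHOUT ROWID"
    )
    return db


def make_key(*parts):
    """Join string parts with a NUL byte (assumes no part contains NUL)."""
    return b"\x00".join(part.encode("utf-8") for part in parts)


def put(db, key, value):
    """Insert or overwrite one key-value pair."""
    db.execute(
        "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
    )


def scan(db, prefix):
    """Return (key, value) pairs whose key starts with prefix, in key order."""
    # Upper bound of the range: the prefix with its last byte incremented.
    # Assumes a non-empty prefix whose last byte is below 0xff.
    assert prefix and prefix[-1] < 0xFF
    upper = prefix[:-1] + bytes([prefix[-1] + 1])
    cursor = db.execute(
        "SELECT key, value FROM kv WHERE key >= ? AND key < ? ORDER BY key",
        (prefix, upper),
    )
    return cursor.fetchall()


# Hypothetical usage: one document stored as several fields, including
# its original URL (sample data only).
db = open_store()
put(db, make_key("doc", "A/Kiwix", "html"), b"<html>...</html>")
put(db, make_key("doc", "A/Kiwix", "url"), b"https://example.org/wiki/Kiwix")
put(db, make_key("doc", "A/Zim", "url"), b"https://example.org/wiki/Zim")

# Fetch every field of a single document with one range scan.
fields = dict(scan(db, make_key("doc", "A/Kiwix") + b"\x00"))
print(fields[make_key("doc", "A/Kiwix", "url")].decode("utf-8"))
```

With a layout like this, adding a custom field to a document is just
one more put() under the same key prefix, which is the kind of
extensibility I have in mind.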

Let me know what you think :-)

Thanks in advance!

[0] https://en.wikipedia.org/wiki/Ordered_Key-Value_Store
[1] https://github.com/sqlite/sqlite/tree/master/ext/lsm1

On Thu, Oct 29, 2020 at 12:14 PM Emmanuel Engelhart <kelson@kiwix.org> wrote:
>
> Hi
>
> I'm very proud to announce the release of our new tool: warc2zim.
>
> warc2zim is a command-line tool for GNU/Linux and macOS that converts
> a WARC file to a ZIM file. Since WARC is a widely used storage format
> in the archiving world, warc2zim offers new opportunities to reuse
> WARC-stored data and benefit from the whole feature set of the ZIM file
> format and readers like Kiwix.
>
> The tool was built in close collaboration with the Webrecorder team.
> It is one milestone of a bigger project called Zimit, a project we run
> with the sponsorship of the Mozilla Foundation.
>
> ZIM files created using that process work slightly differently from
> the traditional ones (though the ZIM specification is formally
> respected). We are currently working on updating all the Kiwix readers,
> but these files already work well with Kiwix Serve.
>
> The tool is distributed at:
> https://pypi.org/project/warc2zim/
>
> More news to come about warc2zim and Zimit in January 2020.
>
> Happy scraping!
> Happy coding!
> Happy offline reading!
>
> Emmanuel
>
> --
> Kiwix - Wikipedia Offline & more
> * Web: https://kiwix.org/
> * Twitter: https://twitter.com/KiwixOffline
> * Wiki: https://wiki.kiwix.org/
>
> _______________________________________________
> Offline-l mailing list
> Offline-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/offline-l



--
Amirouche ~ https://hyper.dev



--
Samuel Klein          @metasj           w:user:sj          +1 617 529 4266