Hello Mitar! I'm glad you are finding the Wikimedia Enterprise dumps useful.
For your tar.gz question: from what I understand, this is the format that
Wikimedia Enterprise's dataset consumers prefer. But if you are interested
in other formats, I would suggest opening a task on Phabricator with a
feature request and adding the Wikimedia Enterprise project tag
(https://phabricator.wikimedia.org/project/view/4929/).
As to the API, I'm only familiar with the endpoints for bulk download, so
you'll want to ask the Wikimedia Enterprise folks, or have a look at their
API documentation here:
https://www.mediawiki.org/wiki/Wikimedia_Enterprise/Documentation
Ariel
On Sat, Jan 1, 2022 at 4:30 PM Mitar <mmitar(a)gmail.com> wrote:
Hi!
Awesome!
Is there any reason they are tar.gz files containing a single file, rather
than simply bzip2 of the file contents? Wikidata dumps are bzip2 of one
JSON file, which allows parallel decompression. Having both tar (why tar
of a single file at all?) and gz really requires one to decompress the
whole thing before it can be processed in parallel. Is there some other
way I am missing?
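(For what it's worth, the tar.gz can at least be processed as a stream, so the decompressed file never has to be materialized on disk, even though decompression itself stays sequential. A minimal Python sketch; the member name and JSON fields here are made up for illustration, not the actual dump layout:)

```python
import io
import json
import tarfile

def make_sample_tar_gz() -> bytes:
    """Build a tiny single-member tar.gz in memory, mimicking a dump file."""
    lines = b"\n".join(
        json.dumps({"title": f"Page {i}"}).encode() for i in range(3)
    )
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="enwiki.ndjson")  # hypothetical member name
        info.size = len(lines)
        tar.addfile(info, io.BytesIO(lines))
    return buf.getvalue()

def stream_titles(raw: bytes):
    """Stream-decode the single tar member line by line, without extracting it."""
    # mode "r|gz" reads the archive strictly sequentially, as a stream
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r|gz") as tar:
        for member in tar:
            f = tar.extractfile(member)
            for line in f:
                yield json.loads(line)["title"]

titles = list(stream_titles(make_sample_tar_gz()))
```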
Wikipedia dumps are done as multistream bzip2 with an additional index
file. That could be nice here too: with an index file one could
immediately jump to the JSON line for the corresponding article.
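(The multistream idea can be sketched in Python: each chunk is compressed as an independent bz2 stream, and an index of byte offsets lets a reader seek to one stream and decompress only it. A toy illustration of the mechanism, not the real dump index format, which stores offset:pageid:title lines:)

```python
import bz2

def build_multistream(chunks):
    """Concatenate independent bz2 streams; return the blob and stream offsets."""
    blob = b""
    index = []
    for text in chunks:
        index.append(len(blob))        # byte offset where this stream starts
        blob += bz2.compress(text.encode())
    return blob, index

def read_chunk(blob, index, i):
    """Jump straight to stream i and decompress only that stream."""
    start = index[i]
    end = index[i + 1] if i + 1 < len(index) else len(blob)
    return bz2.decompress(blob[start:end]).decode()

blob, index = build_multistream(["first stream", "second stream"])
```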
Also, is there an API endpoint or Special page that can return the same
JSON for a single Wikipedia page? The JSON structure looks very useful on
its own (i.e., outside of bulk download).
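(As a partial alternative, the public REST API does serve per-page JSON, e.g. the page summary endpoint, though it is not the same JSON structure as the Enterprise dumps. A small sketch that just builds the request URL:)

```python
from urllib.parse import quote

def rest_summary_url(title: str, wiki: str = "en.wikipedia.org") -> str:
    """Build the REST API summary URL, which returns JSON for a single page."""
    # percent-encode the title so spaces and slashes are safe in the path
    return f"https://{wiki}/api/rest_v1/page/summary/{quote(title, safe='')}"

url = rest_summary_url("Douglas Adams")
```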
Mitar
On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:
I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
October 17-18th are available for public download; see
https://dumps.wikimedia.org/other/enterprise_html/ for more information.
We expect to make updated versions of these files available around the
1st/2nd of the month and the 20th/21st of the month, following the
cadence of the standard SQL/XML dumps.
This is still an experimental service, so there may be hiccups from time
to time. Please be patient and report issues as you find them. Thanks!
Ariel "Dumps Wrangler" Glenn
[1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much more
about Wikimedia Enterprise and its API.
--
http://mitar.tnode.com/
https://twitter.com/mitar_m
_______________________________________________
Wikitech-l mailing list -- wikitech-l(a)lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave(a)lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/