Hello Ahmed, nice to meet you!
As a data analyst who constantly works with the edit data, I would love to
have it updated daily too. But there are serious infrastructural
limitations that make that very difficult.
Both the edit data and pageview data that you're talking about come from
the Hadoop-based Analytics Data Lake
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>. However, because
of limitations in the underlying MediaWiki application databases
<https://www.mediawiki.org/wiki/Manual:Database_layout> that Hive pulls
edit data from, the data requires some complex reconstruction and
denormalization
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Data_Lake/Edits/Pipeline>
that takes several days to a week. This mostly affects the historical data,
but the reconstruction currently has to be done for all history at once
because historical data sometimes changes long after the fact in the
MediaWiki databases. So the entire dataset is regenerated every month,
which would be impossible to do daily.
I'm sure there are strategies that could ultimately fix these problems, but
I'm also sure that they would take great effort to implement, so
unfortunately that's unlikely to happen anytime soon.
In the meantime, you may be able to work around these issues by using
the public replicas of the application databases
<https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connecting_to_the_database_replicas>.
Unlike with the API, you'd have to do the computation yourself, but the
replicas are updated in (near) real time. Quarry
<https://meta.wikimedia.org/wiki/Research:Quarry> is an excellent,
easy-to-use tool for running SQL queries on those replicas.
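As an illustration, the replicas expose MediaWiki's `revision` table, whose `rev_timestamp` column stores timestamps as 14-character strings like `20180321235959`, so a daily edit count can be computed with a simple GROUP BY. Here's a minimal sketch of a helper that builds such a query (the helper name and the specific date range are just examples; in Quarry you'd paste the generated SQL directly):

```python
def daily_edit_count_sql(start_ts, end_ts):
    """Build a SQL query counting edits per day from MediaWiki's
    `revision` table. rev_timestamp is a 14-character string like
    '20180321235959', so its first 8 characters are the day."""
    return (
        "SELECT LEFT(rev_timestamp, 8) AS day, COUNT(*) AS edits\n"
        "FROM revision\n"
        f"WHERE rev_timestamp >= '{start_ts}'\n"
        f"  AND rev_timestamp < '{end_ts}'\n"
        "GROUP BY day\n"
        "ORDER BY day;"
    )

print(daily_edit_count_sql("20180301000000", "20180322000000"))
```

Bear in mind that on a wiki as large as English Wikipedia a scan of `revision` is slow, so you'd want to keep the time range narrow.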
I'm not an expert on the Data Lake, but I'm pretty sure this is broadly
accurate. Corrections from the Analytics team welcome :)
On 22 March 2018 at 08:21, Ahmed Fasih <wuzzyview(a)gmail.com> wrote:
Hello! I have some questions about the latency of some
Wikipedia REST
endpoints from
https://wikimedia.org/api/rest_v1
I see that I can get very recent pageviews data, e.g.
https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/all-agents/hourly/2018032100/2018032300
accessed now, on 2018/03/22, at 0249 UTC, gives me hourly pageviews
on the English Wikipedia up to timestamp "2018032200", so only about 4
hours of latency. Very nice!
In contrast, asking for the daily number of edits via
https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/en.wikipedia/all-editor-types/all-page-types/daily/20180225/20180321
only gives me data up to the end of February, with no March data. This
makes me think the daily datasets are generated only once a month? How
might I gain access to more recent daily data from the
"rest_v1/metrics/edits" endpoints?
Thanks!
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics