Hello Ahmed, nice to meet you!

As a data analyst who constantly works with the edit data, I would love to have it updated daily too. But there are serious infrastructural limitations that make that very difficult.

Both the edit data and the pageview data you mention come from the Hadoop-based Analytics Data Lake. However, because of limitations in the underlying MediaWiki application databases that Hive pulls edit data from, the edit data requires complex reconstruction and denormalization, which takes several days to a week. This mostly affects the historical data, but the reconstruction currently has to be done for all of history at once, because historical data in the MediaWiki databases sometimes changes long after the fact. So the entire dataset is regenerated once a month; a job that takes up to a week per run cannot be repeated daily.

I'm sure there are strategies that could ultimately fix these problems, but I'm also sure that they would take great effort to implement, so unfortunately that's unlikely to happen anytime soon.

In the meantime, you may be able to work around these issues by using the public replicas of the application databases. Unlike with the API, you'd have to do the computation yourself, but the replica data is updated in (near) real-time. Quarry is an excellent, easy-to-use tool for running SQL queries on those replicas.
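
As a rough illustration of the replica approach, here is a minimal Python sketch that builds a SQL query counting revisions per day, which you could paste into Quarry against the enwiki_p database. The `revision` table and its `rev_timestamp` column (a YYYYMMDDHHMMSS string) are part of the MediaWiki schema; the helper names `mw_timestamp` and `daily_edits_sql` are made up for this example. Please treat it as a starting point: as far as I know it counts every non-deleted revision in every namespace, bots included, so the numbers will probably not match the API's definition of an edit exactly.

```python
from datetime import date, timedelta


def mw_timestamp(day: date) -> str:
    """Format a date as a MediaWiki timestamp (YYYYMMDDHHMMSS) at midnight UTC."""
    return day.strftime("%Y%m%d") + "000000"


def daily_edits_sql(first_day: date, last_day: date) -> str:
    """Build a query counting revisions per UTC day, inclusive of both dates.

    Hypothetical helper for illustration; not part of any Wikimedia tool.
    """
    if last_day < first_day:
        raise ValueError("last_day must not be earlier than first_day")
    start = mw_timestamp(first_day)
    # Use the day after last_day as an exclusive upper bound.
    end = mw_timestamp(last_day + timedelta(days=1))
    return (
        # The first 8 characters of rev_timestamp are the YYYYMMDD date.
        "SELECT LEFT(rev_timestamp, 8) AS edit_day, COUNT(*) AS edits\n"
        "FROM revision\n"
        f"WHERE rev_timestamp >= '{start}' AND rev_timestamp < '{end}'\n"
        "GROUP BY edit_day\n"
        "ORDER BY edit_day;"
    )


if __name__ == "__main__":
    # Daily counts for 1 March through 21 March 2018.
    print(daily_edits_sql(date(2018, 3, 1), date(2018, 3, 21)))
```

Running the script prints the query text; the query itself still has to be executed on the replicas, for example through Quarry.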

I'm not an expert on the Data Lake, but I'm pretty sure this is broadly accurate. Corrections from the Analytics team welcome :)


On 22 March 2018 at 08:21, Ahmed Fasih <wuzzyview@gmail.com> wrote:
Hello! I have some questions about the latency of some Wikipedia REST
endpoints from

https://wikimedia.org/api/rest_v1

I see that I can get very recent pageviews data, e.g.

https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/all-agents/hourly/2018032100/2018032300

accessed now, on 2018/03/22, at 0249 UTC, gives me an hourly pageview
count for the English Wikipedia at timestamp "2018032200", so with about 4
hours of latency, very nice!

In contrast, asking for the daily number of edits via

https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/en.wikipedia/all-editor-types/all-page-types/daily/20180225/20180321

only gives me data up to the end of February, with no March data. This
makes me think the daily datasets are generated only once a month? How
might I gain access to more recent daily data like the
"rest_v1/metrics/edits" endpoints?

Thanks!

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics