Hi Ugur,
The pagecounts-raw data is deprecated and hasn’t been updated for a few
years. Have you seen the pagecounts-ez data? It merges the old
pagecounts-raw data with the newer, better pageviews data. You can find it
here:
https://dumps.wikimedia.org/other/pagecounts-ez/
As for the -1 view counts, that’s the first time I’ve heard of that
problem. If the page is in the file, it means the page exists, but I have
no idea what a negative count means. It shouldn’t be possible, and I’m sure
it doesn’t happen in the new data.
The size field is the bytes served, so it could vary as the page is
edited from one minute to the next. But I couldn’t tell you how reliable
it is. One tip would be to look at the page history and see how many bytes
the page has at each revision. You can do this using
https://quarry.wmflabs.org and querying the revision table for rev_size
during the hours you see pageviews. That way you can see the accuracy of
the size data.
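In case it helps, here is a rough Python sketch for flagging the odd
records before comparing anything against rev_size. It assumes each line of
an hourly file has four space-separated fields (project, page title, view
count, bytes served), so please check that against the files you actually
downloaded. The names HourlyRecord, parse_line and bytes_per_view are made
up for illustration, and the sample lines are invented values, not real
data.

```python
from typing import NamedTuple, Optional


class HourlyRecord(NamedTuple):
    project: str
    title: str
    views: int
    bytes_served: int


def parse_line(line: str) -> Optional[HourlyRecord]:
    """Parse one line, ASSUMING the layout 'project title views bytes'."""
    parts = line.split()
    if len(parts) != 4:
        return None  # malformed line: skip it rather than guess
    project, title, views, size = parts
    try:
        return HourlyRecord(project, title, int(views), int(size))
    except ValueError:
        return None  # non-numeric count or size


def bytes_per_view(record: HourlyRecord) -> Optional[float]:
    """Average bytes served per view, or None if the record looks suspect."""
    if record.views <= 0 or record.bytes_served <= 0:
        return None
    return record.bytes_served / record.views


# Invented sample lines, one per hourly snapshot of the same page.
sample = [
    "en Example_page 10 250000",
    "en Example_page 4 -1",
    "en Example_page 8 208000",
]

records = [r for r in map(parse_line, sample) if r is not None]
suspect = [r for r in records if bytes_per_view(r) is None]
averages = [b for b in map(bytes_per_view, records) if b is not None]
```

Dividing bytes served by views gives a per-request size that is easier to
compare with rev_size than the hourly total, though the two will not match
exactly since the bytes served are not the same thing as the wikitext
length.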
Good luck, and we’re here to help.
On Fri, Nov 17, 2017 at 10:18 Ugur Yildirim <ugur.yildirim(a)berkeley.edu>
wrote:
Hi,
We are three graduate students at UC Berkeley, and we are currently
working on a machine learning project for a class that we’re taking.
We’re using the page views data that we believe you maintain:
https://dumps.wikimedia.org/other/pagecounts-raw/
We have two quick questions that we were hoping you could answer:
1) We found views with a size of -1 or 0. Does this mean the page doesn’t
exist?
2) We found some articles have a `size` that varies widely throughout the
hourly snapshots of a day. Is that legitimate, or is there something odd
with the data?
Thanks,
Ugur
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics