Hello Ogier,
Thanks a lot for the wikimaps work, and for your thorough analysis of the pageviews :)
Here is what I found on your two questions, looking at one recent day of pageviews from `user` traffic (I needed the detailed data for this analysis, and we only keep it for 90 days).
> What kind of query can cause theses "-" entries ?
Pages with a defined page_id and an undefined title ('-') represented 0.04% of those pageviews, a bit more than 227k hits.
Among those, 152k requests had a `curid=NUMBER` in their uri_query (meaning they specified the page to view by id only, and we don't extract page_title from ids).
More than 65k have neither a page title nor a page id specified in the URL, but have one specified in HTTP headers. This feels like either a bug or an unexpected user behavior.
And more than 10k use a `diff=` URI pattern, which shows the diff between revisions of a given page without naming the page in the URL.
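To make the three buckets above concrete, here is a minimal sketch of how such requests could be sorted by their uri_query. It is not the query I actually ran: the function name and bucket labels are made up for illustration, and it assumes uri_query is a plain query string that may or may not start with `?`.

```python
from urllib.parse import parse_qs


def classify_untitled_request(uri_query: str) -> str:
    """Bucket a request with an undefined title by how its URL refers to the page.

    Hypothetical helper for illustration only; bucket names are assumptions.
    """
    # parse_qs expects the query without the leading "?"
    params = parse_qs(uri_query.lstrip("?"), keep_blank_values=True)
    # Page given only by numeric id, e.g. "?curid=12345"
    if any(value.isdigit() for value in params.get("curid", [])):
        return "curid"
    # Diff between revisions, with no page named in the URL
    if "diff" in params:
        return "diff"
    # Nothing in the URL identifies the page (it may come from HTTP headers)
    return "no_page_in_url"


print(classify_untitled_request("?curid=12345"))
print(classify_untitled_request("?diff=200&oldid=100"))
print(classify_untitled_request(""))
```

The `curid` check comes first so that a request carrying both parameters is counted once, under the id it provides.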
I also found, for mobile-app cases, that some page titles were incorrectly rejected as invalid for Chinese Wikipedia. This happens on a very small number of lines (fewer than 10 per day from my findings).
> Why the entry "Barack_Obama mobile-app" appears two times ?
The entry appears twice because one of the two has no page_id defined in the request, so it is counted separately from the one that has a page_id defined. While it would be possible to give all rows with the same title a page_id whenever one of those rows has it defined, that could cause problems for hours in which a rename occurs (two different page_ids for the same title). I'll bring the concern to the team, but given the relatively small number of views impacted by this case, chances are we will not prioritise it soon.
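As a rough illustration of the duplicate, and of why merging on title is risky, here is a small sketch. The rows, counts and page_id values are invented, and the grouping is a simplified stand-in for the real aggregation.

```python
from collections import Counter, defaultdict

# (page_title, access_method, page_id) -- all values are made up
rows = [
    ("Barack_Obama", "mobile-app", 12345),
    ("Barack_Obama", "mobile-app", None),   # request without a page_id
    ("Barack_Obama", "mobile-app", 12345),
    ("Some_Page", "desktop", 111),          # before a rename within the hour
    ("Some_Page", "desktop", 222),          # after the rename
]

# Grouping on the full key keeps the row without a page_id separate,
# which is why the same title shows up twice.
by_full_key = Counter(rows)

# Grouping on title and access method only would merge the duplicates...
by_title = Counter((title, access) for title, access, _ in rows)

# ...but a title can map to more than one defined page_id in the same hour,
# so there is no single page_id to attach to the merged row.
ids_per_title = defaultdict(set)
for title, access, page_id in rows:
    if page_id is not None:
        ids_per_title[(title, access)].add(page_id)
ambiguous = {key for key, ids in ids_per_title.items() if len(ids) > 1}

print(by_full_key)
print(by_title)
print(ambiguous)
```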
Please let us know if you have other questions :)
Best
Joseph