On Tue, Nov 25, 2014 at 12:21 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
> We're writing up and redefining the pageview definition. Amongst other
> things, it uses MIME type filtering and folder-level filtering to exclude
> non-pageviews.
Just to clarify the context: we are talking about counting pageviews using
some server-side log (such as an Apache/Varnish request log), right? In
that case there is absolutely no way to tell apart requests to MediaViewer
and normal URLs - the fragment part of the URL is never sent to the server,
so the browser simply requests the normal page and then executes a bunch of
JavaScript.
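To make that concrete, here is a minimal sketch using Python's standard
library (the URL is just the example from below) showing that the request
target the browser sends contains no trace of the fragment:

```python
from urllib.parse import urlsplit

# A MediaViewer-style URL: everything after '#' is interpreted purely
# client-side and never appears in the HTTP request line.
url = ("https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
       "#mediaviewer/File:Marcello_Malpighi_large.jpg")
parts = urlsplit(url)

# What the server actually sees: path (plus query, if any).
request_target = parts.path + (("?" + parts.query) if parts.query else "")
print(request_target)   # /wiki/Mar%C3%A7ello_Malpigi
print(parts.fragment)   # mediaviewer/File:Marcello_Malpighi_large.jpg
```

So from the log's point of view, this request is identical to one for the
plain article.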
This is intentional: sending a request that differs from the request for
the normal page would split the Varnish cache and result in poor
performance.
> The result of this is that it's currently going to be counted as a
> pageview, even though it's... well, not.
The aim of those URLs is to present an image in the context of the page;
the user can access the page simply by closing the viewer. So it is not
necessarily that different from a pageview. If you want detailed
information about a request (did the user just want the intro section or
the full article? Did they read any text at all?), you need client-side
logging anyway, and in that case it is trivial to filter MediaViewer
pageviews based on whether or not the user went back to the text.
> Is there any way you lot could avoid the false anchor strategy and pick a
> URL scheme that won't trigger this? If not, we can just write an exception
> - but I'd rather that we not have to do that every time anyone decides to
> make software.
We could add an exception to Varnish to ignore a certain query parameter
and treat, say,
https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi
*?mediaviewer*#mediaviewer/File:Marcello_Malpighi_large.jpg as a request to
wiki/Mar%C3%A7ello_Malpigi rather than wiki/Mar%C3%A7ello_Malpigi?mediaviewer,
and serve it from the same cache (or use something like ESI to the same
effect). It would still mess up browser caching and any proxies on the
client end, though.
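For illustration, such an exception might look roughly like this in VCL
(an untested sketch; the parameter name "mediaviewer" is just the example
above, and a real rule would need to handle the parameter appearing among
others):

```vcl
sub vcl_recv {
    # Strip the hypothetical "mediaviewer" query parameter so the
    # cache key matches the plain article URL.
    if (req.url ~ "\?mediaviewer$") {
        set req.url = regsub(req.url, "\?mediaviewer$", "");
    }
}
```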
Alternatively we could do something nasty like link to
https://az.wikipedia.org/wiki/Special:MediaViewer/Mar%C3%A7ello_Malpigi#med…
and then have that special page do some sort of redirect, but that sounds
rather horrible and still has some performance hit due to the extra request.
The nice and clean solution would be to have a kind of "landing page" which
does not include the HTML of the wiki page at all, just maybe the set of
images found on it, and load the page under it via AJAX when the user
closes the lightbox. That would have advantages apart from stats (mostly
performance, both server- and client-side), but it would be a major
undertaking and not very well aligned with the current MediaWiki
architecture, I think.
From the other end of the problem, we could log MediaViewer pageviews over
a separate channel so they can be subtracted from the pageview totals if
needed (about a million URLs with MediaViewer hashes are loaded per day, so
that would not add much extra traffic), but I imagine maintaining such a
dirty hack over complex pageview queries is not something you would wish
upon yourselves.
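If such a separate channel existed, the subtraction itself would be
trivial; a toy sketch (the log contents and per-article granularity here
are made up for illustration):

```python
from collections import Counter

# Server-side request log: counts every page load, including ones where
# the user only opened MediaViewer.
pageview_log = ["/wiki/A", "/wiki/A", "/wiki/B"]

# Hypothetical separate channel: client-side events fired when the page
# was loaded with a MediaViewer hash.
mediaviewer_log = ["/wiki/A"]

totals = Counter(pageview_log)
totals.subtract(Counter(mediaviewer_log))
print(dict(totals))  # {'/wiki/A': 1, '/wiki/B': 1}
```

The hard part is not the arithmetic but keeping two logging pipelines
consistent, which is exactly the maintenance burden mentioned above.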
So no, I don't really see any way around this (and also don't see how you
could write an exception - as far as the server is aware, there is
absolutely no difference between
http://example.com and
http://example.com#foo - that's kind of the point of not splitting the
browser cache), except in the very long term by shifting logging to the
client. I
suppose that has to happen eventually anyway, if we want to learn details
like time spent on the page or heatmaps or whether the visitor scrolled to
the bottom of the page.