On Tue, Nov 25, 2014 at 12:21 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
> We're writing up and redefining the pageview definition. Amongst other
> things, it uses MIME type filtering and folder-level filtering to exclude
> non-pageviews.
Just to clarify the context: we are talking about counting pageviews using
some server-side log (such as an Apache/Varnish request log), right? In
that case there is absolutely no way to tell apart requests to MediaViewer
and normal URLs - the fragment part of the URL is never sent to the server,
so the browser simply requests the normal page and then executes a bunch of
JavaScript.
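To make that concrete, here is a minimal sketch using Python's standard
library (the URL is just the example from below) showing that the request
target the browser sends contains no trace of the fragment:

```python
from urllib.parse import urlsplit

# A MediaViewer-style URL: everything after '#' is interpreted purely
# client-side and never appears in the HTTP request line.
url = ("https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
       "#mediaviewer/File:Marcello_Malpighi_large.jpg")
parts = urlsplit(url)

# What the server actually sees: path (plus query, if any).
request_target = parts.path + (("?" + parts.query) if parts.query else "")
print(request_target)   # /wiki/Mar%C3%A7ello_Malpigi
print(parts.fragment)   # mediaviewer/File:Marcello_Malpighi_large.jpg
```

So from the log's point of view, this request is identical to one for the
plain article.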
This is intentional: sending a request that differs from the request for
the normal page would split the Varnish cache and result in poor
performance.
> The result of this is that it's currently going to be counted as a
> pageview, even though it's... well, not.
The aim of those URLs is to present an image in the context of the page;
the user can access the page simply by closing the viewer. So it is not
necessarily that different from a pageview. If you want detailed
information about a request (did the user just want the intro section or
the full article? Did they read any text at all?), you need client-side
logging anyway, and in that case it is trivial to filter MediaViewer
pageviews based on whether or not the user went back to the text.
> Is there any way you lot could avoid the false anchor strategy and pick a
> URL scheme that won't trigger this? If not, we can just write an exception
> - but I'd rather that we not have to do that every time anyone decides to
> make software.
We could add an exception to Varnish to ignore a certain query parameter
and treat, say,
https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi
*?mediaviewer*#mediaviewer/File:Marcello_Malpighi_large.jpg as a request to
wiki/Mar%C3%A7ello_Malpigi rather than wiki/Mar%C3%A7ello_Malpigi?mediaviewer,
and serve it from the same cache (or use something like ESI to the same
effect). It would still mess up browser caching and any proxies on the
client end, though.
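For illustration, such an exception might look roughly like this in VCL
(an untested sketch; the parameter name "mediaviewer" is just the example
above, and a real rule would need to handle the parameter appearing among
others):

```vcl
sub vcl_recv {
    # Strip the hypothetical "mediaviewer" query parameter so the
    # cache key matches the plain article URL.
    if (req.url ~ "\?mediaviewer$") {
        set req.url = regsub(req.url, "\?mediaviewer$", "");
    }
}
```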
Alternatively we could do something nasty like link to
https://az.wikipedia.org/wiki/Special:MediaViewer/Mar%C3%A7ello_Malpigi#med…
and then have that special page do some sort of redirect, but that sounds
rather horrible and still has some performance hit due to the extra request.
The nice and clean solution would be to have a kind of "landing page" which
does not include the HTML of the wiki page at all, just maybe the set of
images found on it, and load the page under it via AJAX when the user
closes the lightbox. That would have advantages apart from stats (mostly
performance, both server- and client-side), but it would be a major
undertaking and not very well aligned with the current MediaWiki
architecture, I think.
From the other end of the problem, we could log MediaViewer pageviews over
a separate channel so they can be subtracted from the pageview totals if
needed (about a million URLs with MediaViewer hashes are loaded per day, so
that would not add much extra traffic), but I imagine maintaining such a
dirty hack over complex pageview queries is not something you would wish
upon yourselves.
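If such a separate channel existed, the subtraction itself would be
trivial; a toy sketch (the log contents and per-article granularity here
are made up for illustration):

```python
from collections import Counter

# Server-side request log: counts every page load, including ones where
# the user only opened MediaViewer.
pageview_log = ["/wiki/A", "/wiki/A", "/wiki/B"]

# Hypothetical separate channel: client-side events fired when the page
# was loaded with a MediaViewer hash.
mediaviewer_log = ["/wiki/A"]

totals = Counter(pageview_log)
totals.subtract(Counter(mediaviewer_log))
print(dict(totals))  # {'/wiki/A': 1, '/wiki/B': 1}
```

The hard part is not the arithmetic but keeping two logging pipelines
consistent, which is exactly the maintenance burden mentioned above.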
So no, I don't really see any way around this (and also don't see how you
could write an exception - as far as the server is aware, there is
absolutely no difference between
http://example.com and
http://example.com#foo - that's kind of the point of not splitting the
browser cache), except in the very long term by shifting logging to the
client. I
suppose that has to happen eventually anyway, if we want to learn details
like time spent on the page or heatmaps or whether the visitor scrolled to
the bottom of the page.