Re: [Wiki-research-l] [Analytics] Wikipedia aggregate clickstream data released - Wiki-research-l

20 Jan 2018

Hoi,
I am a big fan of suggesting that people write articles, or do work, that
will be read and used. In a blog post [1], I suggest accumulating these
clickstreams and using the popular articles that are missing as suggestions
for new articles. Articles that people seek and that are truly missing are
also obvious candidates for such suggestions.
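
A minimal sketch of the accumulation step, assuming each monthly clickstream file has already been parsed into (prev, curr, type, n) rows; the function name and the sample values are made up for illustration. It only sums inbound traffic per target across months. Deciding which targets are actually missing articles would need a separate existence check that is not shown here.

```python
from collections import Counter

def accumulate_inbound(monthly_rows):
    """Sum the inbound click counts per target page over several months.

    `monthly_rows` is an iterable of months, each an iterable of
    (prev, curr, link_type, n) rows (assumed layout, n as a string).
    """
    totals = Counter()
    for rows in monthly_rows:
        for prev, curr, link_type, n in rows:
            totals[curr] += int(n)
    return totals

# Made-up rows standing in for two monthly files.
january = [("A", "B", "link", "10"), ("other-search", "C", "external", "7")]
february = [("A", "B", "link", "5"), ("B", "C", "other", "3")]

totals = accumulate_inbound([january, february])
most_requested = totals.most_common(1)
```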

My question: how hard is it to do this accumulation and analysis for
missing articles, and to combine it with suggestions to authors to write
something that is likely to prove popular? Does this idea have merit?
Thanks,
        GerardM

[1]
https://ultimategerardm.blogspot.nl/2018/01/wikipedia-entering-rabbit-hole.…

On 18 January 2018 at 21:37, Joseph Allemandou <jallemandou(a)wikimedia.org>
wrote:

  Hi Gerard,
 Here are my two cents on your questions.

 About redlinks, you are correct in saying that the 3% of "other" link-type
 entries are jumps from one page to another (identified using the HTTP
 referer) for which no hyperlink from the origin to the target exists in the
 origin page at the moment of computation.
 From my exploration of the dataset, such "other" links happen with the
 "manually-edited-with-error" URL class (the "-" article has a lot of such
 incoming links, for instance), as well as with links that I think have since
 been edited out of the origin page. For instance, in the November 2017
 dataset there are "other" links from the page "Kevin Spacey" to "Dan Savage",
 "hebephilia", "pedophilia" or "Harvey_Weinstein". Those links are confirmed
 as existing at some point in the page in November, but no longer at the
 beginning of December, when the pages' hyperlinks are snapshotted.
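
 To look at such "other" links for a given origin page, something like the sketch below could be used. It assumes the monthly files are tab-separated with the columns prev, curr, type and n; the helper names are hypothetical, and the sample rows (including their counts) are invented for illustration, not taken from the real dataset.

```python
import csv
import io
from collections import Counter

def read_clickstream(fileobj):
    """Yield (prev, curr, type, n) rows from a tab-separated file object."""
    # QUOTE_NONE because page titles may themselves contain quote characters.
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 4:  # skip malformed lines
            yield row

def other_links_from(rows, source):
    """Count targets reached from `source` through links of type 'other'."""
    counts = Counter()
    for prev, curr, link_type, n in rows:
        if prev == source and link_type == "other":
            counts[curr] += int(n)
    return counts

# Invented sample data in the assumed layout.
sample = io.StringIO(
    "Kevin_Spacey\tDan_Savage\tother\t25\n"
    "Kevin_Spacey\tExample_Target\tlink\t900\n"
    "other-search\tKevin_Spacey\texternal\t5000\n"
    "Kevin_Spacey\tHarvey_Weinstein\tother\t40\n"
)
result = other_links_from(read_clickstream(sample), "Kevin_Spacey")
```

 For a real file, an open file handle would be passed to `read_clickstream` instead of the in-memory sample.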

 As for your question about what people are looking for and don't find, the
 one way I can think of to get ideas is detailed session analysis correlated
 with search results, to try to get a signal of pages reached from search
 and not visited for long. Even though I think we have data on the cluster
 that we could use for this, we obviously can't publish such details
 externally, because of privacy concerns.
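
 As a rough illustration of that signal, here is a sketch over entirely hypothetical visit records of the form (page, reached_from_search, dwell_seconds). The record layout, the 10-second threshold and the minimum number of arrivals are all assumptions; this does not describe how data is actually stored or processed on the cluster.

```python
from collections import defaultdict

def short_visit_ratio(visits, threshold_seconds=10, min_arrivals=2):
    """Per page, the share of search arrivals that stayed under the threshold.

    Pages with fewer than `min_arrivals` search arrivals are dropped,
    since a ratio over one or two visits says very little.
    """
    arrivals = defaultdict(int)
    short = defaultdict(int)
    for page, from_search, dwell in visits:
        if not from_search:
            continue  # only visits that were reached from search count
        arrivals[page] += 1
        if dwell < threshold_seconds:
            short[page] += 1
    return {
        page: short[page] / arrivals[page]
        for page in arrivals
        if arrivals[page] >= min_arrivals
    }

# Invented visit records.
visits = [
    ("Page_A", True, 3.0),
    ("Page_A", True, 5.0),
    ("Page_A", False, 120.0),
    ("Page_B", True, 90.0),
    ("Page_B", True, 4.0),
    ("Page_C", True, 2.0),
]
ratios = short_visit_ratio(visits)
```

 A high ratio would only hint that the page did not satisfy the search; it is not proof that an article is missing.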

 Please let me know if what I say makes sense :)
 Many thanks
 Joseph Allemandou

  Hoi,
 Do I understand correctly that the 3% of "other" links are the ones that
 have articles at *this* time, but that did not exist at the time of the
 dump? So in effect they are not red links?

 Is there any way to find the articles people were seeking but could not
 find?
 Thanks,
      GerardM

 On 16 January 2018 at 20:21, Leila Zia <leila(a)wikimedia.org> wrote:

  Hi all,

 For archive happiness:

 The clickstream dataset is now being generated on a monthly basis for 5
 Wikipedia languages (English, Russian, German, Spanish, and Japanese). You
 can access the data at https://dumps.wikimedia.org/other/clickstream/ and
 read more about the release and those who contributed to it at
 https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/

 Best,
 Leila

 --
 Leila Zia
 Senior Research Scientist
 Wikimedia Foundation

 On Tue, Feb 17, 2015 at 11:00 AM, Dario Taraborelli
 <dtaraborelli(a)wikimedia.org> wrote:

 > We’re glad to announce the release of an aggregate clickstream dataset
 > extracted from English Wikipedia
 >
 > http://dx.doi.org/10.6084/m9.figshare.1305770
 >
 > This dataset contains counts of *(referer, article)* pairs aggregated
 > from the HTTP request logs of English Wikipedia. This snapshot captures 22
 > million *(referer, article)* pairs from a total of 4 billion requests
 > collected during the month of January 2015.
 >
 > This data can be used for various purposes:
 > • determining the most frequent links people click on for a given article
 > • determining the most common links people followed to an article
 > • determining how much of the total traffic to an article clicked on a
 > link in that article
 > • generating a Markov chain over English Wikipedia
 >
 > We created a page on Meta for feedback and discussion about this release:
 > https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
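
 The first and last uses listed in the announcement (most frequent links clicked from an article, and a Markov chain over articles) can be sketched as follows. This assumes the data has been loaded as (referer, article, count) tuples; the function names and the sample counts are invented. The probabilities are conditional on the reader moving on to another article recorded in the dataset, so they say nothing about readers who stop reading.

```python
from collections import defaultdict

def transition_probabilities(pairs):
    """Turn (referer, article, count) tuples into per-referer probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for referer, article, n in pairs:
        counts[referer][article] += n
        totals[referer] += n
    return {
        referer: {article: c / totals[referer] for article, c in targets.items()}
        for referer, targets in counts.items()
    }

def top_links(pairs, referer, k=3):
    """Return the k most clicked targets from `referer`, most frequent first."""
    outgoing = defaultdict(int)
    for ref, article, n in pairs:
        if ref == referer:
            outgoing[article] += n
    return sorted(outgoing.items(), key=lambda item: item[1], reverse=True)[:k]

# Invented sample pairs.
pairs = [("A", "B", 30), ("A", "C", 10), ("B", "C", 5)]
probs = transition_probabilities(pairs)
top = top_links(pairs, "A", k=1)
```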

 Ellery and Dario

 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

  _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

 --

 *Dario Taraborelli*
 Director, Head of Research, Wikimedia Foundation
 wikimediafoundation.org • nitens.org • @readermeter
 <http://twitter.com/readermeter>

 --
 *Joseph Allemandou*
 Data Engineer @ Wikimedia Foundation
 IRC: joal