I agree that user privacy is paramount, and people have thought of various whitelist rules and other automatic approaches to filter out personally identifiable information (PII), but they tend not to work once you dig into the data.

One caveat on the link Chris provided: I was only looking at "unsuccessful" queries. Felix seems to be after all queries—and there are plenty of successful queries that give good results that I didn't consider. All the queries that match titles and redirects would dilute (but not at all eliminate) the queries that cause privacy concerns.

I second Erik's suggestion of the Discernatron data. It's not perfect and there's not a lot of it, but it's available.

A moderate-effort way to mine for queries would be to get volunteers to let you have their Wikipedia search history. In Chrome, for example, you can get an extension that will let you view all of your browser history at once (rather than one page at a time). I searched my history for "wikipedia special:search", clicked "All History", then "select all", and pasted the result into a text file. I was able to gather almost 1,200 queries in less than a minute. My home computer yielded 130 or so (that's probably more typical; I search a lot at work, for work). 20 volunteers would get you an admittedly biased sample of ~2,000 queries. It's not a great source, but it's something.
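
For anyone who wants to try this, here is a minimal Python sketch of how the pasted history text could be turned into a list of queries. It assumes the pasted text contains full URLs and that searches show up either with a "search" URL parameter or as a Special:Search/QUERY path; the function name extract_queries and the regular expression are illustrative only, and what a history extension actually exports may differ.

```python
import re
from urllib.parse import urlsplit, parse_qs, unquote

# Rough URL matcher; assumes URLs in the pasted text are whitespace-delimited.
URL_RE = re.compile(r"https?://\S+")


def extract_queries(history_text):
    """Return de-duplicated Wikipedia search queries, in first-seen order."""
    seen = set()
    queries = []
    for url in URL_RE.findall(history_text):
        parts = urlsplit(url)
        if not parts.netloc.endswith("wikipedia.org"):
            continue
        params = parse_qs(parts.query)  # decodes "+" and percent-escapes
        query = None
        if "search" in params:
            # e.g. /w/index.php?search=foo+bar&title=Special:Search
            query = params["search"][0]
        elif "/Special:Search/" in parts.path:
            # e.g. /wiki/Special:Search/foo%20bar
            raw = parts.path.split("/Special:Search/", 1)[1]
            query = unquote(raw).replace("_", " ")
        if query is None:
            continue
        query = query.strip()
        if query and query not in seen:
            seen.add(query)
            queries.append(query)
    return queries
```

Volunteers should still read through the output before sharing it, since their own queries can contain personal information.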

Such a manually mined corpus would have the advantage of being actual human queries. We get a lot of bots, and a lot of queries that you wouldn't necessarily want to optimize for if the goal is to improve the experience of human users.

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation


On Wed, Aug 17, 2016 at 2:10 PM, Chris Koerner <ckoerner@wikimedia.org> wrote:
The discussion around the difficulty of providing such a list (and its relative usefulness) is well summarized in Trey's notes from his research into the matter.


On Wed, Aug 17, 2016 at 12:58 PM, Eran Rosenthal <eranroz89@gmail.com> wrote:
Unfortunately, WMF policy on releasing search queries to the public is too strict.
(Although there are privacy concerns, I'm sure anyone here could easily think of some simple whitelist rules. For more details please refer to https://phabricator.wikimedia.org/T115085 or https://phabricator.wikimedia.org/T8373 or similar bugs in phabricator)

As a workaround, you can use other data as an approximation of what users look for (though you don't get the query itself, only the result, under the assumption that users find what they are looking for):
https://wikimedia.org/api/rest_v1/ - page view data
or as dump:
https://dumps.wikimedia.org/other/analytics/
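
As a rough illustration of the page view route, the Python sketch below builds a request URL for the most-viewed pages on one day and pulls article titles out of the response. The path layout (metrics/pageviews/top/PROJECT/ACCESS/YYYY/MM/DD) and the response shape (an "items" list whose entries hold an "articles" list) are stated from memory and should be checked against the API documentation at the URL above; the function names are made up for this example.

```python
from datetime import date

# Assumed path of the "top pages" endpoint under the REST API root.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"


def top_pageviews_url(project, day, access="all-access"):
    """Build the URL for the most-viewed pages of `project` on `day`."""
    return f"{BASE}/{project}/{access}/{day:%Y/%m/%d}"


def titles_from_top(payload, skip_prefixes=("Special:", "Main_Page")):
    """Extract readable titles from a decoded JSON response (assumed shape)."""
    titles = []
    for item in payload.get("items", []):
        for entry in item.get("articles", []):
            title = entry["article"]
            # Skip pages that nobody reaches by looking for that topic.
            if title.startswith(skip_prefixes):
                continue
            titles.append(title.replace("_", " "))
    return titles
```

Fetch the URL with any HTTP client, decode the JSON, and pass the result to titles_from_top. The titles are only a stand-in for queries: they show where readers ended up, not what they typed.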

Other options (they have their own caveats but you can try):
* Search for "Special:Search/QUERY" in the pagecounts-all-sites linked above (zcat DUMP | grep "Search/") - this can provide you results such as "commons.m.m Special:Search/Jnnjjjnnnnjnjjnbnjbnjnjj 1 5418" so you know 1 user searched for "Jnnjjjnnnnjnjjnbnjbnjnjj" on mobile, at 2016-05-15 13:00-14:00
* Use Google Trends
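
The grep approach in the first option can also be sketched in Python, which makes it easier to decode the query and keep the count. The sketch assumes each dump line has four whitespace-separated fields (project, title, count, bytes), as in the example line above, and that titles are percent-encoded. It only matches the English "Special:Search/" name; the function names are hypothetical.

```python
import gzip
from urllib.parse import unquote

MARKER = "Special:Search/"


def parse_search_line(line):
    """Return (project, query, count) for a search line, else None."""
    fields = line.split()
    if len(fields) != 4:
        return None
    project, title, count, _size = fields
    if MARKER not in title or not count.isdigit():
        return None
    # Everything after the marker is the query; undo URL and title encoding.
    query = unquote(title.split(MARKER, 1)[1]).replace("_", " ")
    if not query:
        return None
    return project, query, int(count)


def search_queries(path):
    """Yield parsed search lines from a gzip-compressed hourly dump file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            parsed = parse_search_line(line)
            if parsed is not None:
                yield parsed
```

Queries recovered this way are raw user input, so the same privacy concerns discussed in this thread apply to anything you extract.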




On Wed, Aug 17, 2016 at 8:18 PM, Stas Malyshev <smalyshev@wikimedia.org> wrote:
Hi!

>> I’m currently writing my bachelor thesis at University Koblenz,
>> Germany. The goal is to improve Wikipedia search by exploiting the
>> text structure of Wikipedia articles. To conduct unbiased user
>> studies I need real world queries so I can compare the novel
>> algorithms against the currently used ones. Are there any query logs
>> existing which I can use for this purpose?

We do have query logs, but they are not publicly accessible for privacy
reasons. You may want to check this out though:
https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries

--
Stas Malyshev
smalyshev@wikimedia.org

_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery






--
Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation
