Quite understandable. It's also possible to augment the dataset so that some
percent (perhaps ~5%) of the rows include the original query, human-reviewed
& PII-safe.
On the PII topic, one missing feature is user geolocation. This would help
disambiguate user intent for location-dependent queries. For instance, [civic
center <https://en.wikipedia.org/w/index.php?search=civic+center>]
(location search), [john marks
<https://en.wikipedia.org/w/index.php?search=john+marks>] (people query),
or [air marshal <https://en.wikipedia.org/w/index.php?search=air+marshal>]
(alternative meanings in US/UK). Reducing the Lat/Lng to the metropolitan
area, or even state level may mitigate the PII impact. You can likely see
examples of Google/Bing/DDG doing geo-based ranking by using a VPN and
running [xyz] queries.
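A rough sketch of the coarsening idea, in Python. Rounding coordinates to a
fixed grid is only a crude stand-in for a real metropolitan-area or state
lookup, which would need a boundary dataset. The function name
coarsen_location and the sample coordinates are hypothetical, for
illustration only.

```python
def coarsen_location(lat, lng, decimals=0):
    """Snap a lat/lng pair to a coarse grid cell.

    Assumption: rounding stands in for a proper metro/state lookup.
    decimals=0 gives cells about one degree wide (one degree of latitude
    is roughly 111 km); decimals=1 gives cells about a tenth of that.
    """
    return (round(lat, decimals), round(lng, decimals))


# Two made-up users a few kilometres apart land in the same coarse cell,
# so the released feature no longer pins down either exact position.
user_a = coarsen_location(37.7793, -122.4193)
user_b = coarsen_location(37.8044, -122.2712)
print(user_a, user_b, user_a == user_b)
```

The cell (or a metro/state id derived from it) could then be released as a
categorical feature instead of the raw coordinates.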
Another feature I'd like to try: one-hot encoding of the top 1-5k page
categories. That is, create N binary columns (one for each of the top
categories across enwiki) in the dataset, where each column has a 1/0 depending
on whether the page for that training row belongs to that column's category.
This would help uprank certain types of page categories, and can usefully
interact w/ the word embedding (word2vec) you're using.
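A minimal sketch of that encoding in plain Python, assuming a mapping from
page id to its set of categories is available. The names top_categories and
one_hot_rows, and the sample pages and categories, are made up for
illustration; a real run would use the top 1-5k enwiki categories.

```python
from collections import Counter


def top_categories(page_categories, n):
    """Return the n most frequent categories, ties broken alphabetically."""
    counts = Counter(c for cats in page_categories.values() for c in cats)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, _ in ranked[:n]]


def one_hot_rows(page_ids, page_categories, columns):
    """Build one 0/1 row per training row: 1 if the page is in the category."""
    return [
        [1 if c in page_categories.get(page_id, ()) else 0 for c in columns]
        for page_id in page_ids
    ]


# Hypothetical page -> categories mapping.
page_categories = {
    "page_a": {"Physicists", "1879 births"},
    "page_b": {"Physicists", "Chemists"},
    "page_c": {"Physicists"},
    "page_d": {"Chemists", "1901 births"},
}

columns = top_categories(page_categories, n=2)
rows = one_hot_rows(["page_a", "page_d", "unknown_page"], page_categories, columns)
print(columns)
print(rows)
```

Pages with no known categories simply get an all-zero row, and the columns
can be appended to the existing numerical feature vector.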
--justin
On Thu, Jan 5, 2017 at 12:48 PM, Trey Jones <tjones(a)wikimedia.org> wrote:
> The privacy impact is greater, but having the original query would be
> useful for folks wanting to create their own query level features & query
> dependent features. You do have a great set of features listed
> <https://phabricator.wikimedia.org/P4677> there. As always, I'd bias for
> action, and release what's possible currently, letting folks play with the
> dataset.
Right now the standard is that all queries that are released must be
reviewed by humans. A query data dump had to be retracted in the past for
containing PII, so I don't see us getting around that (nor would I want to,
really, having seen the kind of info that can be in there).
We did the manual review for the Discernatron query data, but it's not
scalable for the size of dataset needed to do machine learning. However, if
anyone has any good ideas for features, please let us know, and maybe we
can generate those features and share them, too, time permitting.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Fri, Dec 30, 2016 at 2:28 PM, Justin Ormont <justin.ormont(a)gmail.com>
wrote:
> I think the PII impact in releasing a dataset w/ only numerical feature
> vectors is extremely low.
>
> The privacy impact is greater, but having the original query would be
> useful for folks wanting to create their own query level features & query
> dependent features. You do have a great set of features listed
> <https://phabricator.wikimedia.org/P4677> there. As always, I'd bias for
> action, and release what's possible currently, letting folks play with the
> dataset.
>
> I'd recommend having a groupId which is unique for each instance of a user
> running a query. This is used to group together all of the results in a
> viewed SERP, and allows the ranking function to worry only about rank order
> instead of absolute scoring; aka, the scoring only matters relative to the
> other viewed documents.
>
> I'd try out LightGBM & XGBoost in their ranking modes for creating a
> model.
>
> --justin
>
> On Thu, Dec 22, 2016 at 4:00 PM, Erik Bernhardson <
> ebernhardson(a)wikimedia.org> wrote:
>
>> gh it with 100 normalized queries to get a count, and there are 4852
>> features. Lots of them are probably useless, but choosing which ones is
>> probably half the battle. These are ~230MB in pickle format, which stores
>> the floats in binary. This can then be compressed to ~20MB with gzip, so
>> the data size isn't particularly insane. In a released dataset I would
>> probably use 10k normalized queries, meaning about 100x this size. Could
>> plausibly release as CSVs instead of pickled numpy arrays. That will
>> probably increase the data size further,
>
>
>
>
> _______________________________________________
> discovery mailing list
> discovery(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>