On Mon, Jun 5, 2023 at 12:46 PM Vivian Rook <vrook(a)wikimedia.org> wrote:
On Mon, Jun 5, 2023 at 2:43 PM Hal Triedman <htriedman(a)wikimedia.org> wrote:
Hi cloud admins!
My name is Hal Triedman. I'm a Privacy Engineer at WMF, but in my spare time I do a
lot of work on machine learning. One of the things we've been looking into is the
creation of label-query datasets for MediaWiki database queries, with the goal of being
able to fine-tune an AI model that helps users write queries more easily, or to create
embeddings that make past queries easier to search.
Quarry is particularly interesting for this project because it has the following
qualities:
1) its queries run entirely against MediaWiki databases
2) it has been used to make hundreds of thousands of queries
3) many of those queries have relatively descriptive titles about what is happening in
the SQL
Is there any easy way of assembling a database of existing public title-query pairs (e.g.
by running a database query that excludes things like "Untitled query", or just
pulling published queries)? Please let me know, and thanks.
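To make the request concrete, here is a rough sketch of the kind of filter described
above. It is not based on Quarry's actual schema: the table names (query,
query_revision), the column names (title, published, latest_rev_id, text), and the
helper functions are all assumptions made for illustration, and the example runs
against a throwaway in-memory SQLite database rather than Quarry's own. It also
applies both filters at once (published only, no placeholder titles), although either
could be used alone.

```python
import sqlite3

# Hypothetical, simplified schema. Quarry's real tables and columns may differ.
SCHEMA = """
CREATE TABLE query (
    id INTEGER PRIMARY KEY,
    title TEXT,
    published INTEGER NOT NULL DEFAULT 0,
    latest_rev_id INTEGER
);
CREATE TABLE query_revision (
    id INTEGER PRIMARY KEY,
    query_id INTEGER NOT NULL,
    text TEXT
);
"""

# Keep published queries whose title is present and is not the placeholder.
PAIRS_SQL = """
SELECT q.title, r.text
FROM query AS q
JOIN query_revision AS r ON r.id = q.latest_rev_id
WHERE q.published = 1
  AND q.title IS NOT NULL
  AND TRIM(q.title) <> ''
  AND LOWER(TRIM(q.title)) <> 'untitled query'
  AND r.text IS NOT NULL
ORDER BY q.id
"""


def title_query_pairs(conn):
    """Return (title, sql_text) pairs for published, titled queries."""
    return [(title.strip(), text) for title, text in conn.execute(PAIRS_SQL)]


def demo_connection():
    """Build an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO query (id, title, published, latest_rev_id) VALUES (?, ?, ?, ?)",
        [
            (1, "Most edited pages", 1, 10),
            (2, "Untitled query", 1, 11),    # placeholder title: skipped
            (3, "Draft experiment", 0, 12),  # not published: skipped
            (4, None, 1, 13),                # no title: skipped
            (5, "  Active users  ", 1, 14),  # title is trimmed on output
        ],
    )
    conn.executemany(
        "INSERT INTO query_revision (id, query_id, text) VALUES (?, ?, ?)",
        [
            (10, 1, "SELECT page_title FROM page LIMIT 10"),
            (11, 2, "SELECT 1"),
            (12, 3, "SELECT 2"),
            (13, 4, "SELECT 3"),
            (14, 5, "SELECT user_name FROM user LIMIT 5"),
        ],
    )
    return conn


if __name__ == "__main__":
    for title, sql in title_query_pairs(demo_connection()):
        print(f"{title!r}: {sql}")
```

Against the real database, the same WHERE clause would need to be adapted to whatever
the actual tables and placeholder title turn out to be.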
I don't see any reason you couldn't have access to the Quarry database. Does anyone
else?
It seems both reasonable and useful to me. See also backlog tasks like
<https://phabricator.wikimedia.org/T93907> (Database dump for
analysis) and <https://phabricator.wikimedia.org/T151158> (Support
queries against Quarry's own database and ToolsDB).
Bryan
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808