On 10 Dec 2013, at 9:46 AM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:

Yes, but one thing that must be done is to normalize the id space. Presently, categories and pages have overlapping id spaces. They are also non-contiguous, which is a pain when running, say, any ranking algorithm.
> This request seems like it could be easy to fulfill. Am I understanding correctly that the dataset being sought would simply contain a list of pairs of pages (in the cases of internal links) and a list of page/category pairs (in the case of categorization)?
We would need to 1) decide on an id-space organization, 2) dump the data translated into that id space, and 3) build the graph.
> We can simply dump out the categorylinks and pagelinks tables to meet these needs. I understand that the SQL dump format is painful to deal with, but a plain CSV/TSV format should be fine. The mysql client will create a solid TSV format if you just pipe a query to the client and stream the output to a file (or better yet, through bzip2 and then to a file).
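A minimal sketch of that pipeline, assuming a replica database named `enwiki` (the real invocation is shown in the comments; here `printf` stands in for the mysql client so the TSV-through-bzip2 plumbing can be exercised without a server):

```shell
# Against a live replica the command would look like:
#   echo "SELECT pl_from, pl_namespace, pl_title FROM pagelinks;" \
#     | mysql --batch --skip-column-names enwiki \
#     | bzip2 > pagelinks.tsv.bz2
# mysql's --batch mode emits one tab-separated row per line. Below, printf
# stands in for the client, and we round-trip through bzip2 to show the
# stream is usable as-is.
tsv=$(printf '12\t0\tAnarchism\n25\t0\tAutism\n' | bzip2 | bzip2 -dc)
printf '%s\n' "$tsv"
```

Since `--batch` output is plain tab-separated text, the compressed file can be consumed later with `bzcat pagelinks.tsv.bz2 | cut -f1,3`, no SQL parsing needed.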
For 1) I think it would be good to have two contiguous ranges, [0..x) and [x..y): one for categories and one for pages. For 2) it's just a matter of a Java class fiddling with the ids and the titles (the format for links is asymmetric). For 3) I'd love to release a binary compressed version, because it takes much less space, it is immediately usable, and dumping the <x,y> pairs in ASCII is just a single command.
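The remapping in steps 1)-2) can be sketched on the command line: number categories first into a dense range [0..C), then pages into [C..C+P). The input format and old ids below are made up for illustration (one "<type> TAB <old_id>" record per line, with sparse and overlapping old ids):

```shell
# Normalize two overlapping, non-contiguous id spaces into one contiguous
# space: categories get new ids in [0..C), pages in [C..C+P).
remapped=$(printf 'cat\t7\npage\t3\ncat\t42\npage\t9\n' \
  | awk -F'\t' '
      $1 == "cat"  { print "cat\t" $2 "\t" c++ }    # new id in [0..C)
      $1 == "page" { old[p++] = $2 }                # buffer until C is known
      END { for (i = 0; i < p; i++)
              print "page\t" old[i] "\t" c + i }    # new id in [C..C+P)
    ')
printf '%s\n' "$remapped"
```

The page records have to be buffered (or handled in a second pass) because their new ids start at C, which is only known once all categories have been seen.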
Someone previously asked for temporal data. How can we get access to that? We might provide a label file with on/off dates for, say, every category link.
> These two tables track the most recent categorization/link structure of pages, so we wouldn't be able to use them historically.
Ciao,
seba
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics