On 10 Dec 2013, at 9:46 AM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:

Yes, but one thing that must be done is to normalize the id space. Presently, categories and pages have overlapping id spaces. They are also non-contiguous, which is a pain when running, say, any ranking algorithm.
> This request seems like it could be easy to fulfill. Am I understanding correctly that the dataset being sought would simply contain a list of pairs of pages (in the cases of internal links) and a list of page/category pairs (in the case of categorization)?
We would need to 1) decide on an id-space organization, 2) dump the data translated into that id space, and 3) build the graph.
> We can simply dump out the categorylinks and pagelinks tables to meet these needs. I understand that the SQL dump format is painful to deal with, but a plain CSV/TSV format should be fine. The mysql client will create a solid TSV format if you just pipe a query to the client and stream the output to a file (or better yet, through bzip2 and then to a file).
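A minimal sketch of that pipeline, assuming a replica database named `enwiki` (the real invocation is shown in the comments; here `printf` stands in for the mysql client so the TSV-through-bzip2 plumbing can be exercised without a server):

```shell
# Against a live replica the command would look like:
#   echo "SELECT pl_from, pl_namespace, pl_title FROM pagelinks;" \
#     | mysql --batch --skip-column-names enwiki \
#     | bzip2 > pagelinks.tsv.bz2
# mysql's --batch mode emits one tab-separated row per line. Below, printf
# stands in for the client, and we round-trip through bzip2 to show the
# stream is usable as-is.
tsv=$(printf '12\t0\tAnarchism\n25\t0\tAutism\n' | bzip2 | bzip2 -dc)
printf '%s\n' "$tsv"
```

Since `--batch` output is plain tab-separated text, the compressed file can be consumed later with `bzcat pagelinks.tsv.bz2 | cut -f1,3`, no SQL parsing needed.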
For 1) I think it would be good to have two contiguous ranges, [0..x) and [x..y): one for categories and one for pages. For 2) it's just a matter of a Java class fiddling with the ids and the titles (the format for links is asymmetric). For 3) I'd love to release a binary compressed version, because it takes much less space, it is immediately usable, and dumping the <x,y> pairs in ASCII is just a single command.
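The remapping in steps 1)-2) can be sketched on the command line: number categories first into a dense range [0..C), then pages into [C..C+P). The input format and old ids below are made up for illustration (one "<type> TAB <old_id>" record per line, with sparse and overlapping old ids):

```shell
# Normalize two overlapping, non-contiguous id spaces into one contiguous
# space: categories get new ids in [0..C), pages in [C..C+P).
remapped=$(printf 'cat\t7\npage\t3\ncat\t42\npage\t9\n' \
  | awk -F'\t' '
      $1 == "cat"  { print "cat\t" $2 "\t" c++ }    # new id in [0..C)
      $1 == "page" { old[p++] = $2 }                # buffer until C is known
      END { for (i = 0; i < p; i++)
              print "page\t" old[i] "\t" c + i }    # new id in [C..C+P)
    ')
printf '%s\n' "$remapped"
```

The page records have to be buffered (or handled in a second pass) because their new ids start at C, which is only known once all categories have been seen.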
Someone previously asked for temporal data. How can we get access to that? We might provide a label file with on/off dates for, say, every category link.
> These two tables track the most recent categorization/link structure of pages, so we wouldn't be able to use them historically.
Ciao,
seba
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics