[Labs-l] dimension well my queries for very large tables like pagelinks - Tool Labs

John phoenixoverride at gmail.com
Fri Mar 13 17:07:20 UTC 2015


What kind of queries are you doing? Odds are they can be optimized.

On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <marcmiquel at gmail.com> wrote:

> Hello guys,
>
> I have a question regarding Tool Labs. I am doing research on links, and
> although I know exactly what I am looking for, I am struggling to retrieve
> it efficiently...
>
> I would like your opinion, since you know the system well and know what is
> feasible and what is not.
>
> Let me explain what I need to do:
> I have a list of articles in different languages, and for each article I
> need to check its pagelinks to see which pages it links to and which pages
> link to it.
>
> Currently I run one query per article id in this list, and the lists range
> from 80,000 articles in some Wikipedias to 300,000 or more in others. I
> have to repeat this several times, and it is very time consuming (several
> days). I wish I could simply count the total links in each case, but I
> actually need to inspect some of the individual links per article.
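One common way to cut down the per-article round trips is to batch many ids into a single `IN (...)` query. A minimal sketch, assuming the 2015 MediaWiki `pagelinks` schema (`pl_from`, `pl_namespace`, `pl_title`); the batch size and helper names are illustrative:

```python
def chunked(ids, size):
    """Yield successive slices of at most `size` ids."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def batched_queries(page_ids, batch_size=10000):
    """Build one SELECT per batch of ids instead of one per article.

    Yields (sql, params) pairs suitable for cursor.execute(sql, params).
    """
    for batch in chunked(page_ids, batch_size):
        placeholders = ",".join(["%s"] * len(batch))
        sql = ("SELECT pl_from, pl_namespace, pl_title "
               "FROM pagelinks "
               "WHERE pl_from IN ({})".format(placeholders))
        yield sql, batch

# Example: 25,000 article ids become 3 queries instead of 25,000.
queries = list(batched_queries(list(range(25000))))
```

Using parameter placeholders (rather than interpolating ids into the SQL string) keeps the queries safe and lets the driver handle escaping.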
>
> I was thinking about fetching all of pagelinks and iterating over it in
> Python (the language I use for all of this). That would be much faster,
> because it would save all the per-article queries I am doing now. But the
> pagelinks table has millions of rows, and I cannot load all of that because
> MySQL would die. I could buffer the results, but I have not yet tried
> whether that works either.
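Loading everything at once is not the only option: with a server-side (unbuffered) cursor, e.g. `pymysql.cursors.SSCursor`, rows stream to the client one at a time, so only the aggregates need to fit in RAM. A sketch of the aggregation step; the connection setup is omitted and the row shape `(pl_from, pl_title)` is assumed:

```python
from collections import defaultdict

def count_links(rows):
    """Aggregate an iterable of (pl_from, pl_title) rows one at a time,
    keeping only per-page counts in memory rather than the full result set."""
    counts = defaultdict(int)
    for pl_from, pl_title in rows:
        counts[pl_from] += 1
    return counts

# Works the same whether `rows` is a small list (as here) or a streaming
# cursor iterating over millions of pagelinks rows.
fake_rows = [(1, b"Foo"), (1, b"Bar"), (2, b"Foo")]
counts = count_links(fake_rows)
```

The key point is that the consumer never materializes the whole table; it sees one row at a time regardless of the source.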
>
> I am considering creating a personal table in the database with the titles
> and ids, and inner joining it against pagelinks to obtain only the links
> for these 300,000 articles. That way I would retrieve perhaps 20% of the
> table instead of 100%. It could still be maybe 8M rows sometimes (one
> page_title or page_id per row), or even more, loaded into Python
> dictionaries and lists. Would that be a problem? I have no idea how much
> RAM this implies, nor how much I can use on Tool Labs.
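The personal-table idea can be sketched as SQL built in Python. The database and table names (`u1234__mydb.my_articles`) and columns are hypothetical placeholders; the point is that joining inside MySQL filters pagelinks on the server, and covers both link directions:

```python
# Hypothetical user table holding the article list.
create_table = (
    "CREATE TABLE u1234__mydb.my_articles ("
    " ma_id INT UNSIGNED NOT NULL PRIMARY KEY,"
    " ma_namespace INT NOT NULL,"
    " ma_title VARBINARY(255) NOT NULL"
    ")"
)

# Links going out FROM the listed articles (pl_from is the source page id).
outgoing = (
    "SELECT pl.pl_from, pl.pl_namespace, pl.pl_title "
    "FROM pagelinks AS pl "
    "JOIN u1234__mydb.my_articles AS ma ON pl.pl_from = ma.ma_id"
)

# Links pointing TO the listed articles (targets are stored by title).
incoming = (
    "SELECT pl.pl_from, pl.pl_namespace, pl.pl_title "
    "FROM pagelinks AS pl "
    "JOIN u1234__mydb.my_articles AS ma "
    "ON pl.pl_namespace = ma.ma_namespace AND pl.pl_title = ma.ma_title"
)
```

Counting or filtering in SQL (e.g. `SELECT ma.ma_id, COUNT(*) ... GROUP BY ma.ma_id`) would shrink the result further before anything reaches Python.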
>
> I am totally lost when I run into these scale-related problems... I
> thought about asking on the IRC channel, but the question seemed too long
> and too specific. Any hint you can give would really help.
>
> Thank you very much!
>
> Cheers,
>
> Marc Miquel
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>

