[Labs-l] Labs-l Digest, Vol 39, Issue 13

John phoenixoverride at gmail.com
Fri Mar 13 18:51:00 UTC 2015


Where are you getting your list of pages from?

On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <marcmiquel at gmail.com> wrote:

> Hi John,
>
>
> My queries obtain the "inlinks" and "outlinks" for the articles I have
> in a group (x). Then I check (using Python) whether they have inlinks and
> outlinks from another group of articles. Right now I am doing a query for
> each article. I wanted to obtain all the links for group (x) at once and
> then do this check... But getting all links for groups as big as 300,000
> articles would imply 6 million links. Is it possible to obtain all this,
> or is there a MySQL/RAM limit?
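>
> For concreteness, this is roughly what I mean, as a sketch (I use pymysql
> here; the host/db names, ids, titles and chunk size are just examples
> standing in for my real lists, and pl_title comes back as bytes):
>
>     import os
>     import pymysql
>
>     conn = pymysql.connect(
>         read_default_file=os.path.expanduser("~/replica.my.cnf"),
>         host="cawiki.labsdb", db="cawiki_p")
>
>     group_x = [12, 345, 6789]            # page ids of group (x)
>     group_y = {b"Barcelona", b"Girona"}  # target titles of group (y)
>
>     CHUNK = 500  # arbitrary batch size
>     links = []
>     with conn.cursor() as cur:
>         for i in range(0, len(group_x), CHUNK):
>             chunk = group_x[i:i + CHUNK]
>             marks = ",".join(["%s"] * len(chunk))
>             cur.execute("SELECT pl_from, pl_title FROM pagelinks "
>                         "WHERE pl_namespace = 0 AND pl_from IN (%s)" % marks,
>                         chunk)
>             links.extend(cur.fetchall())
>
>     # the check: keep only the links that land in group (y)
>     x_to_y = [(f, t) for (f, t) in links if t in group_y]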
>
> Thanks.
>
> Marc
>
> 2015-03-13 19:29 GMT+01:00 <labs-l-request at lists.wikimedia.org>:
>
>> Send Labs-l mailing list submissions to
>>         labs-l at lists.wikimedia.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>         https://lists.wikimedia.org/mailman/listinfo/labs-l
>> or, via email, send a message with subject or body 'help' to
>>         labs-l-request at lists.wikimedia.org
>>
>> You can reach the person managing the list at
>>         labs-l-owner at lists.wikimedia.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of Labs-l digest..."
>>
>>
>> Today's Topics:
>>
>>    1. dimension well my queries for very large tables like
>>       pagelinks - Tool Labs (Marc Miquel)
>>    2. Re: dimension well my queries for very large tables like
>>       pagelinks - Tool Labs (John)
>>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Fri, 13 Mar 2015 17:59:09 +0100
>> From: Marc Miquel <marcmiquel at gmail.com>
>> To: "labs-l at lists.wikimedia.org" <labs-l at lists.wikimedia.org>
>> Subject: [Labs-l] dimension well my queries for very large tables like
>>         pagelinks - Tool Labs
>> Message-ID:
>>         <CANSEGinZBWYsb0Y9r9Yk8AZo3COwzT4NTs7YkFxj=naEa9d6+w at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hello guys,
>>
>> I have a question regarding Tool Labs. I am doing research on links, and
>> although I know very well what I am looking for, I struggle with how to
>> get it efficiently...
>>
>> I'd like your opinion, because you know the system well and what's
>> feasible and what is not.
>>
>> Let me explain what I need to do:
>> I have a list of articles for different languages, and for each article I
>> need to check its pagelinks to see where it points and what points at it.
>>
>> I currently do one query per article id in this list, which ranges from
>> 80,000 articles in some Wikipedias to 300,000 or more in others. I have to
>> do it several times and it is very time-consuming (several days). I wish I
>> could just count the total number of links in each case, but I need to see
>> some of the individual links per article.
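>>
>> What I do now looks roughly like this sketch (pymysql assumed; the
>> host/db names and ids are just examples):
>>
>>     import os
>>     import pymysql
>>
>>     conn = pymysql.connect(
>>         read_default_file=os.path.expanduser("~/replica.my.cnf"),
>>         host="cawiki.labsdb", db="cawiki_p")
>>
>>     article_ids = [12, 345, 6789]  # really 80,000-300,000 ids
>>     outlinks = {}
>>     with conn.cursor() as cur:
>>         for page_id in article_ids:  # one round trip per article
>>             cur.execute("SELECT pl_title FROM pagelinks "
>>                         "WHERE pl_from = %s AND pl_namespace = 0",
>>                         (page_id,))
>>             outlinks[page_id] = [t for (t,) in cur.fetchall()]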
>>
>> I was thinking about getting all the pagelinks and iterating in Python
>> (which is the language I use for all this). This would be much faster
>> because I'd save all the per-article queries I am doing now. But the
>> pagelinks table has millions of rows and I cannot load all of that;
>> MySQL would die. I could buffer, but I haven't tried whether that works
>> either.
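>>
>> The buffering I mean could, I think, be done with an unbuffered
>> (server-side) cursor, so rows stream one at a time instead of being
>> loaded all at once; a sketch, again assuming pymysql (host/db names and
>> titles are examples, and titles come back as bytes):
>>
>>     import os
>>     import pymysql
>>     import pymysql.cursors
>>
>>     conn = pymysql.connect(
>>         read_default_file=os.path.expanduser("~/replica.my.cnf"),
>>         host="cawiki.labsdb", db="cawiki_p",
>>         cursorclass=pymysql.cursors.SSCursor)  # unbuffered cursor
>>
>>     wanted_titles = {b"Barcelona", b"Girona"}  # titles of my articles
>>     inlinks = {}
>>     with conn.cursor() as cur:
>>         cur.execute("SELECT pl_from, pl_title FROM pagelinks "
>>                     "WHERE pl_namespace = 0")
>>         for pl_from, pl_title in cur:  # streams; RAM stays bounded
>>             if pl_title in wanted_titles:
>>                 inlinks.setdefault(pl_title, []).append(pl_from)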
>>
>> I am considering creating a personal table in the database with the titles
>> and ids, and inner joining it with pagelinks to obtain just the links for
>> these 300,000 articles. That way I would retrieve maybe 20% of the table
>> instead of 100%. It could still be 8M rows sometimes (page_title or
>> page_id, one of the two per row), or even more... loaded into Python
>> dictionaries and lists. Would that be a problem...? I have no idea how
>> much RAM this implies or how much I can use in Tool Labs.
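>>
>> Something like this sketch is what I have in mind (the user-database
>> name is illustrative, and I assume a user database on the same server
>> so the join stays local; pymysql assumed):
>>
>>     import os
>>     import pymysql
>>
>>     conn = pymysql.connect(
>>         read_default_file=os.path.expanduser("~/replica.my.cnf"),
>>         host="cawiki.labsdb")
>>
>>     MYDB = "u1234__links"  # illustrative user-database name
>>     with conn.cursor() as cur:
>>         cur.execute("CREATE TABLE IF NOT EXISTS %s.my_articles ("
>>                     "page_id INT UNSIGNED NOT NULL PRIMARY KEY)" % MYDB)
>>         # ... INSERT the ~300,000 page ids here ...
>>         cur.execute("SELECT pl.pl_from, pl.pl_title "
>>                     "FROM cawiki_p.pagelinks pl "
>>                     "JOIN %s.my_articles a ON a.page_id = pl.pl_from"
>>                     % MYDB)
>>         rows = cur.fetchall()  # only the links for my articles
>>     conn.commit()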
>>
>> I am totally lost when I run into these problems of scale... I thought
>> about writing to the IRC channel, but this seemed too long and too
>> specific. Any hint would really help.
>>
>> Thank you very much!
>>
>> Cheers,
>>
>> Marc Miquel
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Fri, 13 Mar 2015 13:07:20 -0400
>> From: John <phoenixoverride at gmail.com>
>> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>> Subject: Re: [Labs-l] dimension well my queries for very large tables
>>         like pagelinks - Tool Labs
>> Message-ID:
>>         <CAP-JHpn=ToVYdT7i-imp7+XLTkgQ1PtieORx7BTuVAJw=YSbFQ at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> What kind of queries are you doing? Odds are they can be optimized.
>>
>> On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <marcmiquel at gmail.com> wrote:
>>
>> > [...]
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Fri, 13 Mar 2015 17:36:00 +0000
>> From: Tim Landscheidt <tim at tim-landscheidt.de>
>> To: labs-l at lists.wikimedia.org
>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>> Message-ID: <878uf0vlz3.fsf at passepartout.tim-landscheidt.de>
>> Content-Type: text/plain
>>
>> (anonymous) wrote:
>>
>> > [...]
>>
>> > To be clear: I'm not going to make my code proprietary in
>> > any way. I just wanted to know whether I'm entitled to ask
>> > for the source of every Labs bot ;-)
>>
>> Everyone is entitled to /ask/, but I don't think you have a
>> right to /receive/ the source :-).
>>
>> AFAIK, there are two main reasons for the clause:
>>
>> a) WMF doesn't want to have to deal with individual licences
>>    that may or may not have the potential for litigation
>>    ("The Software shall be used for Good, not Evil").  By
>>    requiring OSI-approved, tried-and-true licences, the risk
>>    is negligible.
>>
>> b) Bots and tools running on an infrastructure financed by
>>    donors, like contributions to Wikipedia & Co., shouldn't
>>    be usable for blackmail.  No one should be in a legal
>>    position to demand something "or else ..."  The perpetuity
>>    of OS licences guarantees that everyone can be truly
>>    thankful to developers without having to fear that they
>>    might otherwise shut down devices, delete content, etc.
>>
>> But the nice thing about collaboratively developed open-source
>> software is that it is usually of better quality, so clandestine
>> code is often not that interesting.
>>
>> Tim
>>
>>
>>
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Fri, 13 Mar 2015 11:52:18 -0600
>> From: Ryan Lane <rlane32 at gmail.com>
>> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>> Message-ID:
>>         <CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx+actaKMDuXmg at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <ricordisamoa at openmailbox.org> wrote:
>>
>> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
>> > (verbatim): "Do not use or install any software unless the software is
>> > licensed under an Open Source license".
>> > What about tools and services that are themselves made up of software?
>> > Do they have to be Open Source?
>> > Strictly speaking, do the Terms of use require that all code be made
>> > available to the public?
>> > Thanks in advance.
>> >
>> >
>> As the person who wrote the initial terms and included this clause, I can
>> speak to its spirit (I'm not a lawyer, so I won't try to go into any
>> legal issues).
>>
>> I created Labs with the intent that it could be used as a mechanism to
>> fork the projects as a whole, if necessary. A means to this end was
>> including non-WMF employees in the process of infrastructure operations
>> (which is outside the goals of the tools project in Labs). Tools/services
>> that can't be distributed publicly harm that goal. Tools/services that
>> aren't open source completely break that goal. It's fine if you wish not
>> to maintain the code in a public git repo, but if another tool maintainer
>> wishes to publish your code, there should be nothing blocking that.
>>
>> Depending on external closed-source services is a debatable topic. I know
>> in the past we've decided to allow it. It goes against the spirit of the
>> project, but it doesn't require us to distribute closed-source software in
>> the case of a fork.
>>
>> My personal opinion is that your code should be in a public repository to
>> encourage collaboration. As the terms are written, though, your code is
>> required to be open source, and any libraries it depends on must be as
>> well.
>>
>> - Ryan
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Fri, 13 Mar 2015 11:29:47 -0700
>> From: Pine W <wiki.pine at gmail.com>
>> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>> Message-ID:
>>         <CAF=dyJjO69O-ye+327BU6wC_k_+AQvwUq0rfZvW0YaV=P+iCaA at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Question: are there heightened security or privacy risks posed by having
>> non-open-source code running in Labs?
>>
>> Is anyone proactively auditing Labs software for open source compliance,
>> and if not, should this be done?
>>
>> Pine
>> On Mar 13, 2015 10:52 AM, "Ryan Lane" <rlane32 at gmail.com> wrote:
>>
>> > [...]
>>
>> ------------------------------
>>
>> _______________________________________________
>> Labs-l mailing list
>> Labs-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>>
>>
>> End of Labs-l Digest, Vol 39, Issue 13
>> **************************************
>>
>
>

