Hi all
I'm happy to let you know that new hardware has been ordered by Wikimedia
Deutschland and will probably arrive in about two weeks. We will get two new
systems:
* A more powerful web server, to replace hemlock: Sun Fire X4150, 2x Quad-Core
Xeon, 8GB RAM, 2x73GB SAS HDD. The current web server only has two cores.
* Another database server, to be used for S1 (English Wikipedia), so S1 and S3
no longer have to share a server: Sun Fire X4250, 2x Quad-Core Xeon, 32GB RAM,
16x146GB SAS RAID.
This should improve performance and give us some headroom for growth. Once the
new servers arrive, S3 will be re-imported too, so we will have live data again.
Any ideas for names? To stay with the nightshade theme, how about Jurubeba and
Erubia? Or perhaps we go the "witches' weed" way, with Datura and Mandrake?
Henbane is taken, I think. Amanita sounds nice, too :)
A third server has been ordered, which will also be installed in Amsterdam, but
will not be part of the toolserver cluster. It's a storage server (X4540, 24TB
RAID) that will keep a live backup of all media files.
Cheers,
Daniel
Are page view statistics (as in stats.grok.se)
being imported to the toolserver and made available
in a database for quick reference?
I'm experimenting with using this as a tool for
finding which articles need improvement: Among
short stubs or articles in a watch category,
I'm addressing the most popular articles first.
The only thing I need is the aggregate number
of page views in the previous month.
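In the meantime, one way to get that monthly aggregate is to sum the raw hourly pagecount files that stats.grok.se itself is built on. This is only a sketch: the filename pattern and the four-field line format (`project title views bytes`) are assumptions about the published pagecount dumps, and the function name is made up for illustration:

```python
import gzip
import glob
from collections import defaultdict

def monthly_views(pattern, project="en"):
    """Sum hourly pagecount files matching a glob pattern into
    per-article view totals for one project.

    Each line in a pagecounts file is assumed to look like:
        <project> <page_title> <view_count> <bytes_transferred>
    """
    totals = defaultdict(int)
    for path in glob.glob(pattern):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) == 4 and parts[0] == project:
                    try:
                        totals[parts[1]] += int(parts[2])
                    except ValueError:
                        pass  # skip lines with a malformed count field
    return totals

# e.g. all hourly files for March 2010 (hypothetical local copies):
# views = monthly_views("pagecounts-201003*.gz")
```

The per-article totals could then be joined against a stub list or watch category to rank candidates by popularity.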
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Hello, all.
I'm from ru-wiki; I'm one of the active members of the Connectivity project.
I was very concerned when I learned that Golem can't work anymore because of
the new limitations on the Toolserver. Golem's data is a key part of the
Connectivity project, which works on improving the quality of Wikipedia. The
project works mostly in the Russian and Ukrainian editions of Wikipedia, but
Golem also collects very useful information for every other language except
English, which is still too huge to analyse. The project's code is improving
continuously: two years ago, when ruwiki had about 250k articles and the
project's tools were few in number, an analysis of ruwiki took about 2 hours;
now, when ruwiki has 500k articles and the number of connectivity tools has
increased several times over, an analysis takes about 1 hour 40 minutes.
Improvement could go faster: there is only one programmer in the project now
- Mashiah - and anybody who wants to help him and participate in improving
the code is free to join. Our project needs any help from programmers.
We have noted that the number of isolated articles is directly related to
the authors' awareness of the lack of referencing articles. At certain
periods in February 2009, due to toolserver problems, we were unable to
obtain timely data. During such periods the number of isolated articles
usually grows, and the growth gradually turns to decline once Golem starts
working again. This means that any idle period for Golem leads to a
deterioration in the quality of articles.
The code will be improved in any case, sooner or later, but we want to try
every way of keeping Golem running during the optimization process. I want
to ask whether a hardware upgrade could resolve this problem, and if so,
could you please estimate the models and cost of the required equipment?
Please help us to help Wikipedia.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi,
As our master copy of s4 is missing parts of the database (TS-583), I
will reimport the database today. While this is in progress, there will
be no commonswiki_p database on cassia, the server for s3 and s6. The
import should only take an hour or two. After the import, s4 will be
switched back to cassia, which will fix the problem with user databases
being on the wrong server.
This issue is being tracked in JIRA as MNT-436.
- river.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (HP-UX)
iEYEARECAAYFAkuzN2gACgkQIXd7fCuc5vIGNACgwHNE6wqPA6KuYk+BDtXjoCzL
nQIAn1ymlXYcU+0T2EUyhSKZrFdQ/k7R
=xVyB
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi,
On Monday morning we will switch these clusters from the current server
(hyacinth) to cassia due to previously announced problems with hyacinth.
This will involve a couple of hours read-only time while the user
databases are copied. There should be no interruption to wiki database
access.
This issue is being tracked in JIRA as MNT-423.
- river.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (HP-UX)
iEYEARECAAYFAkuvYLoACgkQIXd7fCuc5vKpjQCfU/0Ag+woVV/HRaxKCh5EGKWl
o8gAoKBLtifIjI6RpM6163MNxoYpdCZ4
=bhXV
-----END PGP SIGNATURE-----
Dear colleagues,
dear Daniel,
thanks for the detailed explanation!
It looks like the Connectivity project has definitely outgrown single-user
status on the Toolserver, and maybe also outgrown the Toolserver structure in
general.
There is a contradiction between considering it a single-user tool and
its real large-scale nature. (For example, it may be considered equal to
all the interwiki bots summed together, or even similar to the semantic wiki
project.)
It looks like there's a REAL need to find some long-term solution - a
complete (or at least major) refactoring of the bot, integrating support into
MediaWiki, financing a dedicated server, or something like this.
OK. But then I would like to ask you to also consider some short-term
solution (say, for two or three months) which might allow the project to
function until a long-term solution is implemented.
Maybe Mashiah is also able to find a way to run Golem with limited
functionality, while still producing the analysis essential for the project?
Of course, I would be happy to discuss Toolserver support opportunities
during the chapters' conference in Berlin!
> Hello Vladimir
> The problem is that Golem uses a very large amount of memory, about 4GB. That's
> 1/8 of the total capacity, and it's memory that cannot be used for normal
> database operations if it's set aside for the memory tables Golem uses (even if
> they are not in use). This by far exceeds the fair share of resources for each
> toolserver user.
> It was only recently discovered that Golem uses so much memory (because it does
> so on the database server, not the normal user server), but it is suspected
> that this is at least one of the causes that triggered system failures in the
> past. We do not currently see the possibility of allowing individual users to
> use that much memory, especially not on the database server. Basically, as it
> is implemented now, Golem is unfit for the toolserver, because it consumes far
> too many resources.
> Earlier today, I asked Mashiah to consider alternative ways to implement the
> network analysis. I think it would be possible to reduce the memory use by at
> least a factor of 8. Should this not be possible, Golem would have to run on a
> dedicated system.
> If there are good reasons and sufficient funding, setting aside a VM or even a
> full server for a special project can be considered. How individual projects
> and chapters can participate more in the governance (and funding) of the
> toolserver is one of the topics that will be discussed at the upcoming
> chapters' conference in April in Berlin. I recommend you bring up the topic of
> Golem there.
> Regards,
> Daniel
PS: Below I quote my reply to Mashiah.
> Hello Mashiah
>
>> Connectivity is a property of a graph as a whole; there is no way to
>> analyze it having just a part of all the nodes and edges. Using the original
>> tables in the language database, or using MyISAM tables, makes the analysis
>> far too slow. The good thing about memory tables is not only that they are
>> located in memory (which is not always true, of course); the engine itself
>> is optimized for speed, and the format is designed to allow that.
>
> If your project requires more resources than are available as your fair share
> on the toolserver, then either the need for resources has to be reduced, or
> the project has to run elsewhere. If there are good reasons and sufficient
> funding, setting aside a VM or even a full server for a special project can
> be considered. How individual projects and chapters can participate more in
> the governance (and funding) of the toolserver is one of the topics that will
> be discussed at the upcoming chapters' conference in April in Berlin. I
> suggest you contact someone who will attend the meeting, and discuss the
> issue with them.
>
> Anyway, if using MySQL's memory tables consumes too many resources, perhaps
> consider alternatives? Have you looked at network analysis frameworks like
> JUNG (Java) or SNAP (C++)? Relational databases are not good at managing
> linked structures like trees and graphs anyway.
>
> The memory requirements shouldn't be that huge anyway: two IDs per edge = 8
> bytes. The German-language Wikipedia, for instance, has about 13 million
> links in the main namespace, so 8*|E| would need only about 100MB even for a
> naive implementation. With a little more effort, it can be nearly halved to
> 4*|E|+4*|V|.
>
> I have used the trivial edge store for analyzing the category structure
> before, and Neil Harris is currently working on a nice standalone
> implementation of this for Wikimedia Germany. This should allow recursive
> category lookups in microseconds.
>
> In any case, something needs to change. You can't expect to frequently use
> 1/8 of the toolserver's RAM. Even more so since this amount of memory can't
> be used by MySQL for caching while you are not using it (because of the way
> the InnoDB buffer pool works).
>
> Regards,
> Daniel
>
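The compact representation sketched in the quote (4*|E|+4*|V|) is essentially a CSR adjacency layout: one offset array indexed by node ID plus one flat array of edge targets, each holding 4-byte integers. A minimal sketch, not the Golem implementation; function names and the sample graph are purely illustrative:

```python
from array import array

def build_csr(num_nodes, edges):
    """Pack directed edges (src, dst) into CSR form.

    offsets[v]..offsets[v+1] index into targets, giving v's out-neighbours.
    Memory is ~4*(|V|+1) + 4*|E| bytes, close to the 4*|E|+4*|V| estimate.
    """
    # count out-degree of each node, then prefix-sum into offsets
    counts = [0] * (num_nodes + 1)
    for src, _ in edges:
        counts[src + 1] += 1
    for v in range(num_nodes):
        counts[v + 1] += counts[v]
    offsets = array("i", counts)

    # place each edge target at the next free slot for its source node
    targets = array("i", [0] * len(edges))
    cursor = list(offsets[:-1])
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbours(offsets, targets, v):
    """Out-neighbours of v as a slice of the flat target array."""
    return targets[offsets[v]:offsets[v + 1]]
```

With the graph packed this way, connectivity analyses (BFS, reachability, isolated-article detection) can run over plain integer arrays instead of MySQL MEMORY tables, keeping the footprint near the 4*|E|+4*|V| bound.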
Vladimir Medeyko schrieb:
> Dear colleagues,
>
> I've heard that the Golem bot, which is the heart of the connectivity
> project, stopped functioning due to the recent toolserver reconfiguration.
>
> Is it possible to adjust the configuration specifically for Golem, or to do
> something else to make it function again?
>
> It is especially a pity that the connectivity project has problems now,
> just two days after the project was reported at Konferencija Wikimedia
> Polska and received much interest from the listeners.
>
> What could be done to fix the situation? Thanks!
--
Medeyko Vladimir Vladimirovich
Wikimedia RU (non-profit partnership)
Director
tel. +7-921-940-39-79
Dear colleagues,
I've heard that the Golem bot, which is the heart of the connectivity project,
stopped functioning due to the recent toolserver reconfiguration.
Is it possible to adjust the configuration specifically for Golem, or to do
something else to make it function again?
It is especially a pity that the connectivity project has problems now, just
two days after the project was reported at Konferencija Wikimedia Polska and
received much interest from the listeners.
What could be done to fix the situation? Thanks!
--
Medeyko Vladimir Vladimirovich
Wikimedia RU (non-profit partnership)
Director
tel. +7-921-940-39-79
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi,
Today we upgraded some software on the Solaris login servers (willow and
wolfsbane). Notable upgrades include Perl 5.10.1, Python 2.6.5, git
1.7.0.3, GNU bash 4.1 and Mercurial 1.5.
A full list of upgrades is available at <https://jira.toolserver.org/browse/MNT-409>
- river.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (HP-UX)
iEYEARECAAYFAkupKVMACgkQIXd7fCuc5vLdqwCfRHQmbY8Gv0kl1GWG5tl03T9l
1FwAnA0X4NtnYMbDG02gAFQdXu6Z64nF
=KJ/1
-----END PGP SIGNATURE-----
Hi;
Yesterday (2010-03-26), the 7z dump for English Wikipedia was completed.[1]
I am downloading it to the /mnt/user-store/dump directory; it will be finished
in a few hours (about 4). It is about 30 GB, so if you need it, you know where
it is - don't download it again! ;)
A tip: in my Python scripts, I decompress it on the fly, like this:
7za e -so ourdump.xml.7z | python ourscript.py
Inside the script, I then read the data from sys.stdin.
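A minimal sketch of a script consuming such a stream - here it just counts the `<page>` elements of the dump as they flow past, without ever holding the multi-GB file in memory (the script and function names are illustrative):

```python
import sys

def count_pages(stream):
    """Count <page> elements in a MediaWiki XML dump read line by line.

    The dump decompresses to one element per line for the major tags,
    so a simple substring check is enough for counting.
    """
    pages = 0
    for line in stream:
        if "<page>" in line:
            pages += 1
    return pages

if __name__ == "__main__":
    print(count_pages(sys.stdin))
```

Used exactly like the tip above: `7za e -so ourdump.xml.7z | python countpages.py`. For real parsing, the same pipe works with an incremental XML parser fed from sys.stdin instead of a line loop.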
Regards
[1]
http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-histor…
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi,
As mentioned previously, we are considering a proposal to convert
nightshade (the Linux login server) to Solaris. Before making a
decision on this, I'd like some input from users. If you have a moment,
please have a look at this wiki page:
<https://wiki.toolserver.org/view/Conversion_of_nightshade_to_Solaris>
- river.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (HP-UX)
iEYEARECAAYFAkurvGgACgkQIXd7fCuc5vKmawCeNSfh0FWsOf4flomNCIyaCyfB
SBcAnRewduLGss30DD0Xa5c8OGCa3un4
=uTdG
-----END PGP SIGNATURE-----