Hi,

There is a way to download Wikipedia dumps for any project / language; the data is from early 2009. I will detail the steps as a note for future reference.

The data is made available as part of the Amazon AWS Public Datasets program (http://aws.amazon.com/publicdatasets/). This method will cost a tiny amount of money, as you have to pay for an Amazon EC2 instance.
1) Create an AWS account
2) Log in to AWS and select the EC2 tab
3) Click 'Launch Instance'
4) Select a configuration; the Basic 64-bit Amazon Linux AMI 1.0 is fine
5) Instance Type: select Micro (this is the cheapest) and press Continue
6) Instance settings: keep the defaults and press Continue
7) Enter key/value pairs, such as key=Name, value=WikiMirror (this is
not strictly required), and press Continue
8) Create a new key pair, give it a name, and press 'Create &
Download your Key Pair'. This is your private EC2 key; store it
somewhere safe.
9) Create a security group; the default settings are fine. Press
Continue.
10) Review your settings and press Launch
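For reference, the same launch can also be done from a shell with Amazon's EC2 API command-line tools. This is only a sketch, not the exact invocation: the AMI ID and key pair name are placeholders you would replace with your own.

```shell
# Sketch only: assumes the EC2 API tools are installed and your AWS
# credentials are configured. ami-XXXXXXXX and WikiMirrorKey are
# placeholders for your chosen AMI and the key pair from step 8.
ec2-run-instances ami-XXXXXXXX -t t1.micro -k WikiMirrorKey
```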
This will start an Amazon EC2 instance. The next step is to make the
Wikimedia public dataset accessible to our instance.
1) Click 'EBS Volumes' on the upper right of the window
2) Select 'Create Volume'
3) In the Snapshot pulldown, scroll down and search for 'Wikipedia XML
Backups (XML)', and enter 500 GB in the size field. Make sure that the
zone matches the zone of your primary volume, and press Create
4) Click 'Attach Volume' and enter /dev/sdf or similar
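The volume steps can likewise be sketched with the EC2 API command-line tools. All IDs and the zone below are placeholders; the real snapshot ID for the Wikipedia dataset is listed on the public datasets page.

```shell
# Sketch: create a 500 GB volume from the Wikipedia snapshot and attach
# it to the running instance. snap-XXXXXXXX, vol-XXXXXXXX, i-XXXXXXXX
# and the availability zone are all placeholders.
ec2-create-volume --snapshot snap-XXXXXXXX --size 500 -z us-east-1a
ec2-attach-volume vol-XXXXXXXX -i i-XXXXXXXX -d /dev/sdf
```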
We have now made the dataset available to our EC2 instance. Let's get
the data (I'll assume a Windows environment; I think the conversion
step below is not necessary for Linux / OS X users):
1) We have a .pem private key, but if we would like to use PuTTY on
Windows then we need to convert the .pem file into a key format that
PuTTY can use.
2) Download puttygen from
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
3) Start puttygen and select 'Load'. Select the key that we
downloaded at step 8 of creating the EC2 instance.
4) Press 'Save private key' and save the new .ppk key
5) Close puttygen and start PuTTY (you can download it from
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html as
well)
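If you have the Unix command-line build of puttygen, the same conversion can be done in one line (mykey.pem and mykey.ppk are placeholder file names):

```shell
# Convert the downloaded .pem key into PuTTY's .ppk format.
puttygen mykey.pem -o mykey.ppk
```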
6) Create a new session: the EC2 server name can be found by going to
the EC2 dashboard on the left and then selecting 'Running Instances'.
At the bottom of the page you will find 'Public DNS' followed by a
long name; copy this name and enter it as the host name in the PuTTY
session.
7) In PuTTY, click on SSH on the left and then Auth. There is a field
for the private key used for authentication, with a Browse button.
Press the Browse button and select the .ppk key from step 4.
8) Start the session and enter ec2-user as the username; the key will
authorize you. We are now logged on to our EC2 instance.
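Linux / OS X users can skip PuTTY entirely and connect with OpenSSH, pointing it at the .pem key from step 8 (the host name below is a placeholder for your instance's Public DNS name):

```shell
# ssh refuses private keys with loose permissions, so tighten them first.
chmod 400 mykey.pem
ssh -i mykey.pem ec2-user@ec2-XX-XX-XX-XX.compute-1.amazonaws.com
```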
9) Enter on the command line: sudo mkdir /mnt/data-store
10) Enter on the command line: sudo mount /dev/sdf /mnt/data-store
(the sdf part depends on what you entered at step 4 of attaching the
volume).
11) cd /mnt/data-store, then enter ls -al, and you will see all the
files available.
The next step is either to copy a file using SCP or to start your own
FTP server on the EC2 instance and download the files that you need.
You will have to pay a small transfer fee, but this is in the range of
cents.
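For example, copying a single dump file with scp might look like this (the host name and dump file name are placeholders; run ls in /mnt/data-store on the instance to see the real file names):

```shell
# Copy one dump file from the instance to the current local directory.
scp -i mykey.pem \
  ec2-user@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:/mnt/data-store/enwiki-pages-articles.xml.bz2 .
```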
Best,
Diederik
On Sun, Nov 28, 2010 at 8:20 AM, Daniel Kinzler <daniel(a)brightbyte.de> wrote:
> It's not that toolserver admins are eccentric in adding such a rule,
> but an issue of WM-DE liability if such information is published.
> However, I think that providing such a file to just a few selected
> people would be acceptable.
> I think so too, but I have to ask our ED. Will send an email now.
> -- daniel
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Check out my about.me profile: http://about.me/diederik