Hi.
I've been asked a few times recently about generating reports of the
most-viewed pages per day, per month, per year, etc. A few years after Domas
first started
publishing this information in raw form, the current situation seems rather
bleak. Henrik has a visualization tool with a very simple JSON API behind it
(<http://stats.grok.se>), but other than that, I don't know of any efforts
to put this data into a database.
Currently, if you want data on, for example, every article on the English
Wikipedia, you'd have to make 3.7 million individual HTTP requests to
Henrik's tool. At one per second, you're looking at over a month's worth of
continuous fetching. This is obviously not practical.
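For a sense of what each of those requests looks like, here's a minimal Python
sketch of a single per-page query against Henrik's JSON API (the exact /json/
URL pattern here is an assumption, not something documented in this message):

    import json
    import urllib.parse
    import urllib.request

    def monthly_views(project, yyyymm, title):
        """Fetch the JSON view-count record for one page and one month."""
        # Assumed URL pattern for stats.grok.se's JSON interface.
        url = "http://stats.grok.se/json/{0}/{1}/{2}".format(
            project, yyyymm, urllib.parse.quote(title, safe=""))
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # ~3.7 million of these at one per second is roughly 43 days of fetching.
    print(monthly_views("en", "201108", "Main_Page"))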
A lot of people were waiting on Wikimedia's Open Web Analytics work to come
to fruition, but it seems that has been indefinitely put on hold. (Is that
right?)
Is it worth a Toolserver user's time to try to create a database of
per-project, per-page page view statistics? Is it worth a grant from the
Wikimedia Foundation to have someone work on this? Is it worth trying to
convince Wikimedia Deutschland to assign resources? And, of course, it
wouldn't be a bad idea if Domas' first-pass implementation was improved on
Wikimedia's side, regardless.
Thoughts and comments welcome on this. There's a lot of desire to have a
usable system.
MZMcBride
We're doing well on the road to 1.18: we went from ~370 un-reviewed
revisions last weekend to ~210 un-reviewed revisions today.
We need to sustain this momentum to have 1.18 ready for release in
time.
One area that hasn't been getting enough attention, though, is FIXME'd
revisions.
On Monday there were 95 FIXMEs, and today that has come down only to 86.
We need that rate to increase significantly, so on Monday I'll be contacting
people with FIXME'd revisions and asking them to take action. If you
don't have time, please let me, Robla, or the list know so that we can
make sure that we have the code in a releasable state in time for
release.
(Information in this email was gleaned from the Revision Report:
http://www.mediawiki.org/wiki/MediaWiki_roadmap/1.18/Revision_report)
Thanks,
Mark.
http://www.mediawiki.org/wiki/NOLA_Hackathon
MediaWiki developers are going to meet in New Orleans, Louisiana, USA,
October 14-16, 2011. Ryan Lane is putting this together and I'm helping
a bit. If you're intending to come, please add your name here, just so
we can start getting an idea of how many people are coming:
http://www.mediawiki.org/wiki/NOLA_Hackathon#Attendees
I'll add more details to the wiki page next week.
--
Sumana Harihareswara
Volunteer Development Coordinator
Wikimedia Foundation
[Resending as plain text]
I maintain a compacted monthly version of the dammit.lt page view stats,
starting with Jan 2010 (not an official WMF project).
This is to preserve our page view counts for future historians (compare the
Twitter archive kept by the Library of Congress).
It could also be used to resurrect
http://wikistics.falsikon.de/latest/wikipedia/en/ which was very popular.
Alas, the author vanished and does not reply to requests, and we don't have
the source code.
I just applied for storage on dataset1 or dataset2, and will publish the
monthly < 2 GB files as soon as possible.
Each day I download the 24 hourly dammit.lt files and compact them into one
file per day.
Each month I compact the daily files into one monthly file.
Major space saving: a monthly file with all hourly page views is 8 GB
(compressed); with only articles that have 5+ page views per month it is even
less than 2 GB.
This is because each page title occurs once instead of up to 24*31 times,
and the 'bytes sent' field is omitted.
All hourly counts are preserved, prefixed by day number and hour number.
Here are the first lines of one such file, whose header also describes the
format (a small decoding sketch follows the sample):
Erik Zachte (on wikibreak till Sep 12)
# Wikimedia article requests (aka page views) for year 2010, month 11
#
# Each line contains four fields separated by spaces
# - wiki code (subproject.project, see below)
# - article title (encoding from original hourly files is preserved to maintain proper sort sequence)
# - monthly total (possibly extrapolated from available data when hours/days in input were missing)
# - hourly counts (only for hours where indeed article requests occurred)
#
# Subproject is language code, followed by project code
# Project is b:wikibooks, k:wiktionary, n:wikinews, q:wikiquote, s:wikisource, v:wikiversity, z:wikipedia
# Note: suffix z added by compression script: project wikipedia happens to be sorted last in dammit.lt files, so add this suffix to fix sort order
#
# To keep hourly counts compact and tidy both day and hour are coded as one character each, as follows:
# Hour 0..23 shown as A..X, convert to number: ordinal (char) - ordinal ('A')
# Day 1..31 shown as A.._ (27=[ 28=\ 29=] 30=^ 31=_), convert to number: ordinal (char) - ordinal ('A') + 1
#
# Original data source: Wikimedia full (=unsampled) squid logs
# These data have been aggregated from hourly pagecount files at http://dammit.lt/wikistats, originally produced by Domas Mituzas
# Daily and monthly aggregator script built by Erik Zachte
# Each day hourly files for previous day are downloaded and merged into one file per day
# Each month daily files are merged into one file per month
#
# This file contains only lines with monthly page request total greater/equal 5
#
# Data for all hours of each day were available in input
#
aa.b File:Broom_icon.svg 6 AV1,IQ1,OT1,QB1,YT1,^K1
aa.b File:Wikimedia.png 7 BO1,BW1,CE1,EV1,LA1,TA1,^A1
aa.b File:Wikipedia-logo-de.png 5 BO1,CE1,EV1,LA1,TA1
aa.b File:Wikiversity-logo.png 7 AB1,BO1,CE1,EV1,LA1,TA1,[C1
aa.b File:Wiktionary-logo-de.png 5 CE1,CM1,EV1,TA1,^N1
aa.b File_talk:Commons-logo.svg 9 CE3,UO3,YE3
aa.b File_talk:Incubator-notext.svg 60 CH3,CL3,DB3,DG3,ET3,FH3,GM3,GO3,IA3,JQ3,KT3,LK3,LL3,MH3,OO3,PF3,XO3,[F3,[O3,]P3
aa.b MediaWiki:Ipb_cant_unblock 5 BO1,JL1,XX1,[F2
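To make the day/hour encoding described in the header above concrete, here is a
minimal Python decoding sketch (just an illustration of the published format,
not one of the aggregator scripts):

    def decode_hourly_counts(field):
        """Yield (day, hour, count) from a field like 'AV1,IQ1,^K1'."""
        for item in field.split(","):
            day = ord(item[0]) - ord("A") + 1   # day 1..31: A.._ (27=[ 28=\ 29=] 30=^ 31=_)
            hour = ord(item[1]) - ord("A")      # hour 0..23: A..X
            count = int(item[2:])               # remaining digits: requests in that hour
            yield day, hour, count

    # First sample line above: monthly total 6, six single-request hours.
    for day, hour, count in decode_hourly_counts("AV1,IQ1,OT1,QB1,YT1,^K1"):
        print(day, hour, count)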
On Fri, Aug 12, 2011 at 6:55 AM, David Gerard <dgerard@gmail.com> wrote:
>
> [posted to foundation-l and wikitech-l, thread fork of a discussion
elsewhere]
>
>
> THESIS: Our inadvertent monopoly is *bad*. We need to make it easy to
> fork the projects, so as to preserve them.
>
> This is the single point of failure problem. The reasons for it having
> happened are obvious, but it's still a problem. Blog posts (please
> excuse me linking these yet again):
>
> * http://davidgerard.co.uk/notes/2007/04/10/disaster-recovery-planning/
> * http://davidgerard.co.uk/notes/2011/01/19/single-point-of-failure/
>
> I dream of the encyclopedia being meaningfully backed up. This will
> require technical attention specifically to making the projects -
> particularly that huge encyclopedia in English - meaningfully
> forkable.
>
> Yes, we should be making ourselves forkable. That way people don't
> *have* to trust us.
>
> We're digital natives - we know the most effective way to keep
> something safe is to make sure there's lots of copies around.
>
> How easy is it to set up a copy of English Wikipedia - all text, all
> pictures, all software, all extensions and customisations to the
> software? What bits are hard? If a sizable chunk of the community
> wanted to fork, how can we make it *easy* for them to do so?
Software and customizations are pretty easy -- that's all in SVN, and most
of the config files are also made visible on noc.wikimedia.org.
If you're running a large site there'll be more 'tips and tricks' in the
actual setup that you may need to learn; most documentation on the setups
should be on wikitech.wikimedia.org, and do feel free to ask for details on
anything that might seem missing -- it should be reasonably complete. But to
just keep a data set, it's mostly a matter of disk space, bandwidth, and
getting timely updates.
For data there are three parts (a rough dump-fetch sketch follows the list):
* page data -- everything that's not deleted/oversighted is in the public
dumps at download.wikimedia.org, but may be a bit slow to build/process due
to the dump system's history; it doesn't scale as well as we really want
with current data size.
More to the point, getting data isn't enough for a "working" fork - a wiki
without a community is an empty thing, so being able to move data around
between different sites (merging changes, distributing new articles) would
be a big plus.
This is a bit awkward with today's MediaWiki (though I think I've seen some
extensions aiming to help); DVCSs like git show good ways to do this sort of thing
-- forking a project on/from a git hoster like github or gitorious is
usually the first step to contributing upstream! This is healthy and should
be encouraged for wikis, too.
* media files -- these are freely copiable, but I'm not sure of the state of
easily obtaining them in bulk. As the data set moved into the terabyte range it
became impractical to just build .tar dumps. There are batch downloader tools
available, and the metadata is all in the dumps and the API.
* user data -- watchlists, emails, passwords, prefs are not exported in
bulk, but you can always obtain your own info so an account migration tool
would not be hard to devise.
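As a rough illustration of the page-data part, here is a minimal Python sketch
for pulling one public dump file; the exact file name is an assumption, so
check the wiki's "latest" directory on the dump server for what is actually
published:

    import shutil
    import urllib.request

    def fetch_dump(wiki="enwiki", dump="pages-articles.xml.bz2"):
        """Stream one public dump file into the current directory."""
        # File name layout is an assumption; adjust to whatever the
        # per-wiki "latest" directory actually lists.
        name = "{0}-latest-{1}".format(wiki, dump)
        url = "http://download.wikimedia.org/{0}/latest/{1}".format(wiki, name)
        with urllib.request.urlopen(url) as resp, open(name, "wb") as out:
            shutil.copyfileobj(resp, out)  # stream to disk; these files are large
        return name

    fetch_dump()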
> And I ask all this knowing that we don't have the paid tech resources
> to look into it - tech is a huge chunk of the WMF budget and we're
> still flat-out just keeping the lights on. But I do think it needs
> serious consideration for long-term preservation of all this work.
This is part of WMF's purpose, actually, so I'll disagree on that point.
That's why, for instance, we insist on using so much open source -- we *want*
everything we do to be reusable or rebuildable independently of us.
-- brion
>
>
> - d.
>
[All apologies for cross-posting]
We are happy to announce that you can now register for
SMWCon Fall 2011
Berlin, September 21–23, 2011
http://semantic-mediawiki.org/wiki/SMWCon_Fall_2011
Registration is at http://de.amiando.com/SMWCon_Fall_2011
SMWCon brings together developers, users, and organizations from the
Semantic MediaWiki community in particular and everyone interested in
managing data in wikis in general. The Fall 2011 event runs for three
days September 21–23, 2011:
* Sept 21: practical tutorials about using SMW (learn about essential
aspects of using SMW) + developer consultation (meet with all developers
and discuss technical questions)
* Sept 22–23: community conference with talks and discussions
The detailed program is taking shape [1]. Contributions are still
possible. Please note that the event takes place at the same time as the
famous Berlin Marathon and a visit by Pope Benedict XVI. Booking hotels
early is recommended.
You can register for the whole event or for the conference days only.
Registration includes lunch and coffee on all days + a conference dinner
on Sept 22nd. Special subsidised rates are available for students.
Moreover, MediaWiki developers are invited to join the first day (in
particular the developer consultations) at a reduced rate.
We are stretching ourselves to keep rates as low as possible in spite of
additional costs incurred by the rooms this time. We are therefore
welcoming sponsors to help support the finances of the meeting, now and
in the future. If your organisation would be interested in becoming an
official supporter of the event, please contact the Open Semantic Data
Association <osda@semantic-mediawiki.org>.
SMWCon Fall 2011 is organised by the Web-Based Systems Group at Free
University Berlin [2] and by MediaEvent Services [3].
Looking forward to seeing you in Berlin!
Markus
[1] http://semantic-mediawiki.org/wiki/SMWCon_Fall_2011
[2] http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/index.html
[3] http://mediaeventservices.com/
First, I realize that this addition to the 1.17 installer is sort of a
"hold it with our fingertips and pinch our nose" feature, so thanks for
putting it in at all... :-)
That said, there are a few bits of tuning I've been trying to add, and
they lead me to a more basic question for which my google-fu is unavailing.
I've changed the page title and copy on the "Log In" page, having gotten
lucky with the names of those messages in the system message dictionary.
I'm trying to find an easy way to take "Click here to return to $pagetitle"
on the post-logout page *out*, since, on a login-required-to-read wiki,
it's lying to the user: they can't do that.
The copy above it about flushing your cache, I found and changed.
But that message, if it's not hardwired in code by accident, lives in a
spot in the message dictionary that I couldn't find (except perhaps by
scrolling through the entire thing)... which leads me to said question:
I tentatively assume that the built-in text search engine might search
*modified* messages in the dictionary, given how one edits them... but is
there any way to make it search the *default* messages? Do they live in a
(pseudo-)namespace the way the modified, user-specified messages do?
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra@baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com 2000 Land Rover DII
St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274