FYI.
I wonder what their pageview definition is. :)
-Jeremy
---------- Forwarded message ----------
From: Eric Mill <eric(a)konklone.com>
Date: Thu, Mar 19, 2015 at 7:55 PM
Subject: [sunlightlabs] Open web traffic data from the US government
To: sunlightlabs(a)googlegroups.com
Hi all,
It seems to have done a good job flying about the internets today, but
I wanted to mention that the US government released a new open dataset
today (and a shiny dashboard), on web traffic to its .gov domains:
https://analytics.usa.gov/
This audience might appreciate 18F's tech post about how it was built,
and why some decisions were made the way they were. I'm particularly
happy that the data includes a view of browser and OS usage across
visitors, and hope that that's helpful to folks as a reference point.
This comes from a big Google Analytics roll-up account, managed by a
team called the Digital Analytics Program at GSA. They've enabled IP
address anonymization and disabled data sharing within Google.
Also, note that the data is currently just snapshots; it's not
publishing archival data. Feedback and suggestions are welcome, preferably
on GitHub.
-- Eric
--
konklone.com | @konklone
Hi!
I'm looking to generate some data around what percentage of articles have
Wikidata descriptions for English, and a few other languages. The reason is
that the Mobile Apps Team is planning on tackling a project next quarter to
get users entering Wikidata descriptions.
What's the best way for me to get this data?
Thanks,
Dan
--
Dan Garry
Associate Product Manager, Mobile Apps
Wikimedia Foundation
Hi all,
Erik Zachte is working on file view stats and is looking for a way to track
Media Viewer image views (for which there is no 1:1 relation between server
hits and actual image views); after some back and forth in
https://phabricator.wikimedia.org/T86914 I proposed the following hack:
whenever the javascript code in MediaViewer determines that an image view
happened (e.g. an image has been displayed for a certain amount of time),
it makes a request to a certain fake image, say
upload.wikimedia.org/wikipedia/commons/thumb/0/00/Virtual-imageview-<real
image name>/<size>px-thumbnail.<ext> . These hits can then be easily
filtered from the varnish request logs and added to the normal requests. We
would add a rule to Varnish to make sure it does not try to look up such
requests in Swift but returns a 404 immediately.
This would be a temporary workaround until there is a proper way to log
virtual image views, such as EventLogging with a non-SQL backend.
Do you see any fundamental problem with this?
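
To make the proposal above concrete, here is a minimal sketch of how such
virtual hits could be built on the client side and recognised on the log
side. It is only an illustration: the function names are hypothetical, the
path layout is taken from the example URL above, and percent-encoding the
image name is an assumption, not something the proposal specifies.

```python
import re
from collections import Counter
from urllib.parse import quote, unquote

# Marker prefix taken from the example URL in the proposal.
VIRTUAL_PREFIX = "Virtual-imageview-"

# Matches /wikipedia/<project>/thumb/0/00/Virtual-imageview-<name>/<size>px-thumbnail.<ext>
VIRTUAL_VIEW_RE = re.compile(
    r"^/wikipedia/(?P<project>[^/]+)/thumb/0/00/"
    + re.escape(VIRTUAL_PREFIX)
    + r"(?P<name>[^/]+)/(?P<size>\d+)px-thumbnail\.(?P<ext>\w+)$"
)


def build_virtual_view_path(project, image_name, size, ext):
    """Build the fake thumbnail path a client would request (hypothetical helper)."""
    encoded = quote(image_name, safe="")  # assumption: name is percent-encoded
    return (
        f"/wikipedia/{project}/thumb/0/00/"
        f"{VIRTUAL_PREFIX}{encoded}/{size}px-thumbnail.{ext}"
    )


def parse_virtual_view(path):
    """Return details of a virtual image view, or None for any other path."""
    match = VIRTUAL_VIEW_RE.match(path)
    if match is None:
        return None
    return {
        "project": match["project"],
        "image": unquote(match["name"]),
        "size": int(match["size"]),
        "ext": match["ext"],
    }


def count_virtual_views(paths):
    """Count virtual views per image from an iterable of request paths."""
    counts = Counter()
    for path in paths:
        view = parse_virtual_view(path)
        if view is not None:
            counts[view["image"]] += 1
    return counts
```

Ordinary thumbnail requests do not match the pattern, so they would stay in
the normal request counts, and the virtual hits could be added on top.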
Forwarding announcement.
Pine
---------- Forwarded message ----------
From: "Manprit Brar" <mbrar(a)wikimedia.org>
Date: Mar 18, 2015 1:19 PM
Subject: [Wikimedia Announcements] Announcing Wikimedia Foundation's New
Open Access Policy
To: <wikimediaannounce-l(a)lists.wikimedia.org>
Hi all,
We're proud to announce that the Wikimedia Foundation today joins the
growing ranks of major institutions with open access policies. Our new Open
Access Policy <https://wikimediafoundation.org/wiki/Open_access_policy>[1]
will ensure that all research the Wikimedia Foundation supports through
grants, equipment, or research collaboration is made widely accessible and
reusable. Research, data, and code developed through these collaborations
will be made available in open access venues and under a free license
<http://freedomdefined.org/>[2] in keeping with the Wikimedia Foundation’s
mission to support free knowledge. You can read more about this effort in
today's blog post
<https://blog.wikimedia.org/2015/03/18/wikimedia-open-access-policy/>[3].
[1] https://wikimediafoundation.org/wiki/Open_access_policy
[2] http://freedomdefined.org/Definition
[3] https://blog.wikimedia.org/2015/03/18/wikimedia-open-access-policy/
Manprit Brar
Legal Counsel
*Wikimedia Foundation*
*NOTICE*: This message may be confidential or legally privileged. If you
have received it by accident, please delete it and let us know about the
mistake. As an attorney for the Wikimedia Foundation, for legal/ethical
reasons I cannot give legal advice to, or serve as a lawyer for, community
members, volunteers, or staff members in their personal capacity. For more
on what this means, please see our legal disclaimer
<https://meta.wikimedia.org/wiki/Wikimedia_Legal_Disclaimer>.
I think we should split up Eventlogging and the other m2 clients (OTRS and
some minor players). Several reasons:
- Backfilling causes replication lag. Using faster out-of-band replication
for EL is easy because it is all simple bulk-INSERT statements, but the
same does not apply for the other clients. They need different approaches.
- Master disk space. Even with the data purging discussed at the MW Summit,
I would feel better if EL had more headroom than it does currently, and
zero possibility of unexpected spikes in disk activity and usage affecting
other services.
- EL is the service most sensitive to connection dropouts. Recently Ori and
Nuria have been tweaking SqlAlchemy, but future connection problems like
those seen last week would be easier to debug without having to risk
affecting other services.
I am therefore arranging to promote the current m2 slave db1046 to master
of an m4 cluster tuned for EL, including backfilling. Analytics-store,
s1-analytics-slave, and the new CODFW server will simply switch to
replicate from the new master.
For switchover of writes, we'll need to coordinate an EL consumer restart
to use a new CNAME of m4-master.eqiad.wmnet and allow vanadium the relevant
network access, and then presumably do a little backfilling. When would be
a reasonable time within the next fortnight or so?
Sean
Hi,
I'm trying to figure out which of the two pageview definitions we
currently have I can use for a question Bob and I are trying to address. It
would be great if you could share your thoughts. If you choose to do so, please
do it by Tuesday, eod, PST.
More details:
*What are we doing?*
We are building an edit recommendation system that identifies the missing
articles in Wikipedia that have a corresponding page in at least one of the
top 50 Wikipedia languages, ranks them, and recommends the ranked articles
to editors whom the algorithm judges likely to be interested in editing
them.
*Where does pageview definition come into play?*
When we want to rank missing articles. To do the ranking, we want to
consider the pageviews to the article in the languages the article exists
in, and use this information to estimate the expected traffic in the
language the article is missing in.
*Why does it matter which pageview definition we use?*
We would like to use webstatscollector pageview definition since the hourly
data we have based on this definition goes back to roughly September 2014.
If we go with the new pageview definition, we will have data for the past
2.5 months. The longer the period we have data for, the better.
*Why don't you then just use webstatscollector data?*
We're inclined to do that but we need to make sure that data works for the
kind of analysis we want to do. Per discussions with Oliver,
webstatscollector data has a lot of pageviews from bots and spiders. The
question is: is the effect of bot/spider traffic, i.e., the number of
pageviews they add to each page, roughly uniform across all pages? If that
is the case, webstatscollector definition will be our choice.
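
One possible way to check this, sketched here under assumptions: for the
period where both definitions are available, compute each page's bot share
(the fraction of its webstatscollector views that the new definition does
not count) and look at how much that share varies across pages. This reads
"uniform" as "a similar fraction of each page's views", which is only one
interpretation of the question. The function names and the idea of using the
coefficient of variation as the summary are illustrative choices, not an
agreed method.

```python
from statistics import mean, pstdev


def bot_share_by_page(legacy_views, filtered_views):
    """Per-page fraction of legacy views that the filtered definition drops.

    legacy_views: page -> view count under the webstatscollector definition.
    filtered_views: page -> view count under the new definition (assumed to
    exclude bot/spider traffic). Pages with no legacy views are skipped.
    """
    shares = {}
    for page, total in legacy_views.items():
        if total <= 0:
            continue
        dropped = total - filtered_views.get(page, 0)
        shares[page] = dropped / total
    return shares


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    values = list(values)
    mu = mean(values)
    if mu == 0:
        return 0.0
    return pstdev(values) / mu
```

A value near zero would suggest bots inflate every page by about the same
proportion, in which case ranking by webstatscollector counts should be close
to ranking by the new definition. A large value would suggest the bot effect
is concentrated on some pages and could distort the ranking.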
I appreciate your thoughts on this.
Best,
Leila
Hey all,
After the patches to the definition following the previous hand-coding
run (see older threads) I've run a second set of tests. These can be
seen at https://commons.wikimedia.org/wiki/File:Pageviews_QA_2.png and
https://commons.wikimedia.org/wiki/File:Pageviews_QA_jittered_2.png
There's nothing particularly shocking in the new definition; it
follows the seasonal pattern that we're used to. I think we can call
the new definition done, with these tweaks! It's also not as unstable
as the legacy definition (good luck to whoever now has the
responsibility of explaining why pageviews abruptly halved in the
middle of February).
Have fun,
--
Oliver Keyes
Research Analyst
Wikimedia Foundation