Ori is in the process of obsoleting the current modality of the ClickTracking extension; it will be replaced at some point with what the E3 team is building for collection on Vanadium (and then folded into Kraken as that becomes available).
Depending on ops and the handling of the data-collection piece, this may take a while, but I am asking: is anyone actually using ClickTracking for anything anymore? I know some things are (or may still be) recording (AFTv5, MoodBar, etc.), but to my knowledge none of that data is supposed to be in use at this point. This does not cover the clicktracking currently being done by the E3 team (obviously), but it does cover past editor-engagement work.
If that is not the case, reply to this thread with what you are using and for how long, and Ori will make sure that collection for those streams is handled somehow for the duration it is needed, or that some sort of replacement implementation is done. Otherwise, let's let those streams die.
Note this only covers data generated by the clicktracking extension specifically.
Hi all,
you are cordially invited to the first ever IRC office hours of the
Foundation's recently formed Analytics team, taking place in
#wikimedia-analytics on Freenode on Monday, July 30 at 19:00 UTC /
noon PT (http://www.timeanddate.com/worldclock/fixedtime.html?hour=19&day=30&month=0…
).
It is an opportunity to ask all your analytics and statistics related
questions about Wikipedia and the other Wikimedia projects, in
particular regarding the Wikimedia Report Card and the upcoming
"Kraken" analytics platform. See also the blog post that the team just
published: https://blog.wikimedia.org/2012/07/25/meet-the-analytics-team/
, as well as https://www.mediawiki.org/wiki/Analytics
General information about IRC office hours is available at
https://meta.wikimedia.org/wiki/IRC_office_hours .
Regards,
--
Tilman Bayer
Senior Operations Analyst (Movement Communications)
Wikimedia Foundation
IRC (Freenode): HaeB
The main purpose of this message is to add a log of this thread to a mailing list for archiving (below).
A secondary purpose is to outline some of what has been discussed. Here is what I've gathered:
In the short term, packets to Vanadium will continue through the flow agreed on in the request ticket:
(old <http://wikitech.wikimedia.org/view/Squid_logging>): client -(clicktracking)-> api.php -(udp2log)-> emery -> pulled down from log files
(agreed): <client side> -(clicktracking)-> api.php -(udp2log)-> vanadium
(alternate): <client side> -(clicktracking)->api.php -(0mq?)-> vanadium
[NB: (alternate) would break existing users of clicktracking, of which there should be none, but that can be addressed in a different thread].
Ori is moving for this change:
(intermediate): <client side> -(clicktracking/E3 extension)-> bits.wikimedia.org -(0mq)-> vanadium
(permanent): <client side> -(anything)-> bits.wikimedia.org -(0mq)-> Scribe -> Kraken
[NB: both break existing users of clicktracking]
[NB: for permanent, c.f. <http://www.mediawiki.org/wiki/Analytics/Kraken/Pixel_Service>]
Bikeshedding:
Because cross-DC redundant tunneling is not in place, vanadium is not reachable by everything. This may take 1-2 months, or longer. The intermediate step is thus modified to replace bits with a specific bits host in eqiad. We can revisit moving the varnish rule up to cover all of bits at a later date (as far as I'm concerned, I'm happy deferring this until Kraken needs its pixel server, but whatever).
Mark has also requested that this be properly packaged and puppetized. Ori will be using labs as a testbed for this setup, a la the way Patrick is currently handling a similar request for Wikipedia Zero.
Asher has requested that the pub/sub model proposed by Ori be reversed. This seems reasonable.
0mq allows for different queue configurations than pub-sub. There is some consideration of using UDP multicast instead. This should probably be revisited when Kraken goes online.
Current actions (so CT has a map):
At some point, nothing gets done without Mark, since he wrote the puppet manifests for varnish. However, it's reasonable to get as much done as possible under Mark's approval before he actually gets hands-on with this config. Mark, if you are not okay with any of this stuff, tell me. No point in continuing if it isn't going to happen. ;-)
I guess technically we could punt on the whole thing for 1-2 months. However, since at some point something like this needs to be tested on varnish on the cluster, we should probably take the opportunity to get this running on a single varnish machine while we have an engineer willing to do the lifting, packaging and puppetization. Analytics is way too busy on other parts to worry about collection.
If we punt for longer than 1-2 months, then I guess Ori can't be held accountable when he takes down the cluster again with too many calls to api.php. :-D
If we get it on a single instance now, I'm inclined to let ops decide when/if they want to move the config to cover all of bits (and just inform Ori so he can update any extensions to point to the edges). The only hard deadline for rolling out to all of esams would be when Kraken goes online and needs the config to point the pub-sub at their scribe servers instead of vanadium.
Other than the above, I consider the whole thing settled. Last I checked, none of you report to me, so I'm not involved at all. :-P
P.S. Today is SysAdmin Appreciation day. There are three bottles of whiskey (probably already added to Ryan Lane's stash) and 2 dozen cookies/baked goods on CT's desk. I bought them for ops to dispose of how they see fit.
http://www.someecards.com/workplace-cards/my-job-is-to-annoy
Take care,
terry
On Jul 17, 2012, at 3:56 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
> Hey Mark, Asher,
>
> For event-tracking, could we add a VCL hook to bits.wikimedia.org that rewrites a specific URL to hit vanadium.eqiad.wmnet:8000?
>
> I have a simple HTTP server there that parses and stores the query strings of all incoming requests. I mean to use it as a way of capturing events from JavaScript code (for A/B testing features, for example). It responds to all requests with HTTP status code 204 ("No Content") and an empty body. But vanadium isn't public-facing, so I need to expose a URL.
>
> Something like this should work, assuming vanadium is reachable from bits: http://p.defau.lt/?RhrkVPxrdhv0vPKvIwaRNQ
>
> Very crude benchmarking (see http://p.defau.lt/?VRssDYUMq1djVFHzlyN_Yw) clocks the server at ~1,600 reqs/sec, which would add up to ~140 mil. / day. My plan is to be extremely conservative and limit ourselves to 200k reqs / day, ramping up very gradually iff it's stable enough. Although 200k sounds tiny, it can comfortably accommodate some interesting metrics -- enwiki averages 140k edits / day, for example.
>
> Adding the URL to Varnish would complete this request: https://rt.wikimedia.org/Ticket/Display.html?id=3152
>
> Let me know what you think.
> Thanks,
> Ori
>
> --
> Ori Livneh
> ori(a)wikimedia.org
>
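For concreteness, the collector Ori describes above could be sketched like this (a minimal Python stand-in, not his actual implementation; the handler name and in-memory storage are illustrative):

```python
# Minimal sketch of a beacon collector: parse and store the query string of
# every incoming request, then answer 204 No Content with an empty body.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

events = []  # captured query-string payloads, newest last

class BeaconHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record the parsed query string; the response itself carries nothing.
        events.append(parse_qs(urlparse(self.path).query))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet

def serve(port=0):
    """Start the collector on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), BeaconHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A request such as `GET /event?action=edit&ab=1` would then append `{'action': ['edit'], 'ab': ['1']}` to `events` while the client sees only an empty 204.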
On Jul 17, 2012, at 5:12 PM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
> Great latency results for your collector! I don't think it matters much at the traffic rate you're talking about, but I think we'd want to consider a different approach, or a public endpoint other than bits, if use of this will be seriously ramped up in the future. bits serves ~40k requests/sec via 4 servers in the US and 2 in Europe, with enough spare capacity for a couple of those hosts to die. The >99.6% cache hit rate is important to the small server footprint, which a shift in the number of backend http requests varnish has to make could impact. Additionally, bits servers in Europe can't hit private servers in eqiad and use the public eqiad bits IP as their backend. An EU request would take a couple hundred ms due to network latency, hitting varnish in both the EU and US.
>
> I'm adding Patrick because we've discussed sending udp packets for a mobile analytics project directly from varnish via inline C. If progress is made there, perhaps your server could be modified to receive udp messages instead of http requests? It would be friendlier to EU users, since varnish could respond with a 204 immediately while whatever happens to get the udp packet forwarded to eqiad happens behind the scenes.
On Jul 17, 2012, at 10:13 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
> Much obliged for the thoughtful response!
>
>
> UDP might not be the best option because data integrity is important. IIRC most implementations will fragment datagrams greater than 1472 bytes and will silently drop datagrams if a fragment is lost or delayed, which could easily skew our data if we're not super careful. Order and reliability count, and UDP is hard to reason about.
>
> varnishlog might be a better option if you're willing to allow vanadium to maintain a persistent connection to the varnish caches (over SSH perhaps, with varnishlog instead of a login shell). Alternately the varnish caches could pipe varnishlog into some lightweight tool that sends things to vanadium. (Maybe this is the use-case for 0MQ that Terry has been itching for.) If I write it, would you be able to help with deployment / testing? (I think we could keep it pretty simple..)
>
> --
> Ori Livneh
> ori(a)wikimedia.org
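The 1472-byte figure Ori cites is simple header arithmetic: a standard Ethernet frame carries a 1500-byte IP MTU, and subtracting the IPv4 and UDP headers leaves the largest datagram payload that avoids fragmentation.

```python
# Back-of-the-envelope check of the 1472-byte UDP payload limit.
ETHERNET_MTU = 1500  # bytes of IP packet per standard Ethernet frame
IPV4_HEADER = 20     # minimum IPv4 header, without options
UDP_HEADER = 8       # fixed UDP header size

max_unfragmented_payload = ETHERNET_MTU - IPV4_HEADER - UDP_HEADER
print(max_unfragmented_payload)  # → 1472
```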
On Jul 19, 2012, at 4:40 AM, Mark Bergsma <mark(a)wikimedia.org> wrote:
> Hi Ori,
>
> Besides Asher's response, which I fully agree with, let me add the following:
>
> First of all, when we gave you that server vanadium a few weeks ago, you argued for it by saying that you wanted to reduce the coupling with / dependencies on / impact on production as much as possible. But then you didn't mention any of this, and your proposed change, using bits, does quite the opposite. Let's not do that.
>
> Solutions around varnishlog and ssh/connections sound clunky. Sending udp packets from Varnish would be fine I think, but you don't want that.
>
> Why don't we see if we can integrate your requirements with the plans the analytics team has with their Hadoop cluster? That would avoid duplication of effort as well.
On Jul 19, 2012, at 10:16 AM, Ori Livneh <ori(a)wikimedia.org> wrote:
> Hi Mark,
>
> Thanks for your note. The design (capturing event data from URLs) is the plan for Kraken, and my work on the public-facing part of the stack is in collaboration with the analytics team, whose efforts are currently invested in storage and computation. I'm looping in David Schoonover, with whom I've been working to coordinate efforts. Once data is piping into vanadium, I'm going to drop server-side work entirely and focus on growing a client-side event tracking library, and that's going to integrate directly with Kraken.
>
> To state the obvious: any analytics solution is going to need a channel for incoming data if we hope to do anything more interesting than searching for patterns in /dev/random. There needs to be some endpoint that client-side JavaScript code can hit, or we'll have no way of tracking client-side state, which is increasingly AJAX-driven and therefore not easily gleaned by looking at bare request logs.
>
> Serializing state into URL params (as opposed to tracking data by issuing POST requests with JSON body, for example) is how we get a system designed to crunch page views (Kraken) to fulfill UX/UI testing requirements. So there is no duplicated effort here. A client-side library that transparently captures and transmits state in AJAX request URLs is going to help Kraken along.
>
> I don't think the change list on Gerrit is an inelegant solution. The coupling problem with the click tracking extension was that it was using MediaWiki to parse event data from incoming requests and to generate successful responses, which didn't scale. My proposed solution has Varnish doing nothing more than responding to /beacon.gif with an empty response. I can't think of a way of implementing a tracking endpoint that would scale better or that would be more lightweight.
>
> Transferring tracking data over a persistent SSH connection sucks, I agree, and I didn't go that route. I chose to do something very close to UDP, which is to pipe tracking request URLs from varnishlog into an unbuffered ZeroMQ publisher socket. The implementation does not require anything to be listening on the other end -- if the client on Vanadium dies, data is dropped on the floor, and the connection would be reestablished transparently once it is back up. I don't think this is going to perform worse than UDP, but I am not particular about this point -- UDP would be fine as well.
>
> Asher was going to test what impact running varnishlog with a URL pattern will have on load. If it's minimal, would this be OK?
>
> Thanks,
>
> --
> Ori Livneh
> ori(a)wikimedia.org
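The varnishlog-to-ZeroMQ forwarder Ori describes could be sketched as follows, under assumptions: the endpoint (`tcp://*:8001`) and the pipe invocation are illustrative, not the real configuration. If no subscriber (vanadium) is connected, PUB sockets silently discard the messages, matching the "dropped on the floor" behaviour he describes.

```python
# Sketch of a forwarder: read request URLs from varnishlog on stdin and
# publish each non-empty line on a ZeroMQ PUB socket. With no subscriber
# connected, publishes succeed and the data is silently dropped.
import sys
import zmq

def forward(endpoint="tcp://*:8001", source=sys.stdin):
    context = zmq.Context()
    publisher = context.socket(zmq.PUB)
    publisher.bind(endpoint)
    sent = 0
    for line in source:
        line = line.strip()
        if line:
            publisher.send_string(line)
            sent += 1
    publisher.close()
    context.term()
    return sent
```

It would be fed along the lines of Asher's test command: `varnishlog -c -m 'RxURL:^/event.gif' -i RxURL | python forward.py`.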
On Jul 19, 2012, at 10:54 AM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
> It's good to know that the work here is in fact to create the public injection point for Kraken and not a duplicate effort. That likely means the total request rate will be much greater than what's driven by editor engagement tests, possibly up to a request per pageview.
>
> I will test varnishlog on a bits server with a regex to capture /beacon requests to get a feel for the resulting resource utilization. It still requires inspection of every bits request from shared memory (significantly more data per request than what goes into an access log) to pick out a few, so it may not be the most efficient solution.
>
> If varnish can send udp packets for specific requests, there's also the option of having it send one for /beacon requests to something listening on localhost, which could itself use 0mq or another reliable transport to pass messages on to kraken. That would probably address most concerns over udp, while also eliminating out of band processing of every bits request in order to find beacons.
>
> Yet another option would be to build a new beacon.wikimedia.org endpoint. You could have much greater flexibility over implementation choices if not piggybacking on bits, but with an operational and capital cost that would also delay release.
On Jul 20, 2012, at 1:38 PM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
> It looks like varnishlog is actually quite efficient at finding specific requests based on a field regex and fetching one of the many log fields from matching requests. 'varnishlog -c -m RxURL:"^/event.gif" -i RxURL' utilized 5% of a core on a production bits server while it was serving ~6.2k reqs/sec, vs. far more for an unfiltered varnishlog process. So this seems feasible, provided that whatever process reads stdout from varnishlog (or directly accesses varnish shm) is similarly efficient and has no risk of runaway failure cases that might impact varnish performance.
>
> This is invasive to bits, but seems reasonable in terms of asynchronously passing beacon messages from user requests (varnish returns an immediate 204 no matter what), and decoupling failures of the event reader or vanadium from users and varnish. Mark, what do you think?
On Jul 20, 2012, at 2:07 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
> Thanks a bunch for testing this, Asher!
>
> --
> Ori Livneh
> ori(a)wikimedia.org
On Jul 23, 2012, at 4:01 AM, Mark Bergsma <mark(a)wikimedia.org> wrote:
>
> On Jul 20, 2012, at 10:38 PM, Asher Feldman wrote:
>
>> It looks like varnishlog is actually quite efficient at finding specific requests based on a field regex and fetching one of the many log fields from matching requests. 'varnishlog -c -m RxURL:"^/event.gif" -i RxURL' utilized 5% of a core on a production bits server while it was serving ~6.2k reqs/sec, vs. far more for an unfiltered varnishlog process. So this seems feasible, provided that whatever process reads stdout from varnishlog (or directly accesses varnish shm) is similarly efficient and has no risk of runaway failure cases that might impact varnish performance.
>>
>> This is invasive to bits, but seems reasonable in terms of asynchronously passing beacon messages from user requests (varnish returns an immediate 204 no matter what), and decoupling failures of the event reader or vanadium from users and varnish. Mark, what do you think?
>
> Yeah, this seems reasonable, but:
>
> a) needs to be setup in a clean way (puppet configuration management, packaging of software used), and
> b) we need a way to transfer data from esams to the private collector (in eqiad). esams can't talk to it directly.
>
> --
> Mark Bergsma <mark(a)wikimedia.org>
> Lead Operations Architect
> Wikimedia Foundation
On Jul 23, 2012, at 9:13 AM, Mark Bergsma <mark(a)wikimedia.org> wrote:
>
> On Jul 20, 2012, at 10:38 PM, Asher Feldman wrote:
>
>> It looks like varnishlog is actually quite efficient at finding specific requests based on a field regex and fetching one of the many log fields from matching requests. 'varnishlog -c -m RxURL:"^/event.gif" -i RxURL' utilized 5% of a core on a production bits server while it was serving ~6.2k reqs/sec, vs. far more for an unfiltered varnishlog process. So this seems feasible, provided that whatever process reads stdout from varnishlog (or directly accesses varnish shm) is similarly efficient and has no risk of runaway failure cases that might impact varnish performance.
>>
>> This is invasive to bits, but seems reasonable in terms of asynchronously passing beacon messages from user requests (varnish returns an immediate 204 no matter what), and decoupling failures of the event reader or vanadium from users and varnish. Mark, what do you think?
>
> Can't we use scribe for this, as is already the plan for kraken (as far as I understand it)? That would probably also solve the problem of esams contacting pmtpa/eqiad internal hosts...
>
> --
> Mark Bergsma <mark(a)wikimedia.org>
> Lead Operations Architect
> Wikimedia Foundation
On Jul 23, 2012, at 9:58 AM, David Schoonover <dschoonover(a)wikimedia.org> wrote:
> That's my cue.
>
> So I actually think this is a really elegant solution to the question of "how do you get Varnish (or whoever) to talk to scribe?" ZMQ is fucking fantastic -- super stable, super efficient, and with a lot of care in the little bits. For those not in the know: zmq is a wrapper around Unix domain sockets. It's like Super IPC. In the case where you're using it for plain IPC, it's merely a nice interface with almost zero overhead, but also providing some convenient features. One of those, importantly, is that writing to a dangling ZMQ socket doesn't vomit all over syslog with errors -- the bits just quietly end up in /dev/null. (You can configure it to yell, if you really want, iirc.)
>
> In the short-term, I'm not precisely sure what Ori plans on using as the consumer, but it would be great to have our own toolbox of connectors to, say, File, UDP, Scribe, etc. Then we'd have one interface that we could plug anything into. (We could theoretically upgrade our other custom connectors in nginx, etc with something like that, and have one universal backend, but I digress.)
>
> When Kraken comes online, we'd swap out that short-term backend with a Scribe connector. Easy and elegant.
>
> +1
>
> --
> David Schoonover
> dsc(a)wikimedia.org
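The "dangling socket" property David mentions is easy to demonstrate (the port here is arbitrary): sending on a PUB socket with no connected subscriber returns normally, and the message is simply discarded rather than erroring or queuing up.

```python
# Illustration: publishing into the void is a silent no-op for PUB sockets.
import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5597")        # arbitrary port, nobody subscribed
pub.send_string("nobody is listening")  # returns normally; message dropped
pub.close()
ctx.term()
```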
On Jul 24, 2012, at 1:06 PM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
> I think where this stands is that Ori needs to finalize a transport method for moving data off of varnish servers, and it sounds like ZMQ is appropriate and compatible with future kraken plans.
>
> That leaves a question of how to move ZMQ packets from esams to eqiad. ZMQ supports multicast udp (could possibly use existing multicast forwarding infrastructure?) and tcp as transports. Mark, do you have a preference / could you provide Ori some guidance?
On Jul 24, 2012, at 4:40 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
> Update: I packaged this and put it up on a ppa on launchpad. Binaries are available for Ubuntu Precise, which is what I _think_ the Varnish machines are running. To install:
>
> apt-add-repository ppa:ori-livneh/e3
> apt-get update
> apt-get install zpubsub
>
> replete with a man page -- zpubsub(1)
>
> --
> Ori Livneh
> ori(a)wikimedia.org
On Jul 24, 2012, at 10:15 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
> From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org, which I presume is gmond. If we could open an additional port (bound to a zmq publisher socket that makes the filtered log stream available for vanadium to subscribe to), that would work.
>
> I'm not sure multicast makes sense because the flow of communication is many-to-one, not one-to-many. The way I see it, vanadium could persist a connection to each varnish machine (4 on eqiad, 4 on pmtpa, 2 on esams = 10 total). The pub/sub pattern ensures that if vanadium crashes, the varnishes don't care, and just let the log data drop.
>
> ZeroMQ pub/sub sockets support multicast over pgm or epgm, but I think that adds a layer of complexity (vs. unicast) that isn't needed or wanted for tracking events from A/B tests with fractional roll-outs.
>
> If you're squeamish about this -- which I understand! -- just remember: all these calls are currently hitting api.php, which entails failed cache lookups on the Squids*, followed by work for the Mediawiki instances, which generate UDP packets, which end up on emery. This setup is capable of knocking out the site, as I found out in June.
>
>
> * See:
> $ curl -is --data "action=clicktracking" http://en.wikipedia.org/w/api.php | grep X-Cache
> X-Cache: MISS from cp1004.eqiad.wmnet
> X-Cache-Lookup: MISS from cp1004.eqiad.wmnet:3128
> X-Cache: MISS from cp1017.eqiad.wmnet
> X-Cache-Lookup: MISS from cp1017.eqiad.wmnet:80
>
>
> --
> Ori Livneh
> ori(a)wikimedia.org
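The subscriber side Ori outlines (vanadium persisting a connection to each varnish box) could look roughly like this; the endpoint list is illustrative, and note Asher later suggests reversing the connection direction:

```python
# Sketch of the vanadium side: one SUB socket connected out to every
# varnish publisher. If the subscriber dies, the publishers don't care.
import zmq

def subscribe(endpoints):
    """Return a single SUB socket connected to every publisher endpoint."""
    context = zmq.Context.instance()
    sub = context.socket(zmq.SUB)
    sub.setsockopt_string(zmq.SUBSCRIBE, "")  # no topic filter: take everything
    for endpoint in endpoints:
        sub.connect(endpoint)  # one SUB socket may connect to many PUBs
    return sub
```

In use, something like `subscribe(["tcp://cp3001.esams:8001", "tcp://cp3002.esams:8001"])` (hostnames hypothetical) would fan all ten publishers into one receive loop.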
On Jul 25, 2012, at 4:04 AM, Mark Bergsma <mark(a)wikimedia.org> wrote:
>
> On Jul 25, 2012, at 7:15 AM, Ori Livneh wrote:
>
>> From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org, which I presume is gmond. If we could open an additional port (bound to a zmq publisher socket that makes the filtered log stream available for vanadium to subscribe to), that would work.
>
> Err, no you can't. vanadium is on the eqiad internal network, and has a private address. Since there's no NAT and no tunneling over the Internet, you can't reach esams currently. Sure you didn't test from another host? :)
>
>> I'm not sure multicast makes sense because the flow of communication is many-to-one, not one-to-many. The way I see it, vanadium could persist a connection to each varnish machine (4 on eqiad, 4 on pmtpa, 2 on esams = 10 total). The pub/sub pattern ensures that if vanadium crashes, the varnishes don't care, and just let the log data drop.
>
> 2 more in esams soon, BTW.
>
>> ZeroMQ pub/sub sockets support multicast over pgm or epgm, but I think that adds a layer of complexity (vs. unicast) that isn't needed or wanted for tracking events from A/B tests with fractional roll-outs.
>>
>> If you're squeamish about this -- which I understand! -- just remember: all these calls are currently hitting api.php, which entails failed cache lookups on the Squids*, followed by work for the Mediawiki instances, which generate UDP packets, which end up on emery. This setup is capable of knocking out the site, as I found out in June.
>
>> On Tuesday, July 24, 2012 at 1:06 PM, Asher Feldman wrote:
>>
>>> I think where this stands is that Ori needs to finalize a transport method for moving data off of varnish servers, and it sounds like ZMQ is appropriate and compatible with future kraken plans.
>>>
>>> That leaves a question of how to move ZMQ packets from esams to eqiad. ZMQ supports multicast udp (could possibly use existing multicast forwarding infrastructure?) and tcp as transports. Mark, do you have a preference / could you provide Ori some guidance?
>
> We're actually working on connecting the internal subnets of pmtpa/eqiad and esams, via redundant tunnels. That would allow direct unicast and multicast connectivity with no proxying or other hacks. Some experiments have already been done a while back, but it won't be available and reliable until we finish a router migration, which is 1-2 months out. I think that would be the cleanest and nicest solution, but it's a question whether this can wait for that.
>
> --
> Mark Bergsma <mark(a)wikimedia.org>
> Lead Operations Architect
> Wikimedia Foundation
On Jul 25, 2012, at 5:03 AM, Terry Chay <tchay(a)wikimedia.org> wrote:
> I want to pull Gabriel for a couple ticks tomorrow to see if we can get this unstuck a bit. I'm not sure I want to wait 1-2 months with E3 clicktracking stuff going to api.php and risking another outage. Let's see if we can find a solution that is feasible under the current infrastructure and switch to the router solution when that's available.
>
> If someone reminds me tomorrow about this, I'll have Ori bring Gabriel up to speed on what this discussion is about… I might forget because I had a bad case of the insomnias last night.
>
>
> On Jul 25, 2012, at 4:04 AM, Mark Bergsma wrote:
>
>>
>> On Jul 25, 2012, at 7:15 AM, Ori Livneh wrote:
>>
>>> From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org, which I presume is gmond. If we could open an additional port (bound to a zmq publisher socket that makes the filtered log stream available for vanadium to subscribe to), that would work.
>>
>> Err, no you can't. vanadium is on the eqiad internal network, and has a private address. Since there's no NAT and no tunneling over the Internet, you can't reach esams currently. Sure you didn't test from another host? :)
>>
>>> I'm not sure multicast makes sense because the flow of communication is many-to-one, not one-to-many. The way I see it, vanadium could persist a connection to each varnish machine (4 on eqiad, 4 on pmtpa, 2 on esams = 10 total). The pub/sub pattern ensures that if vanadium crashes, the varnishes don't care, and just let the log data drop.
>>
>> 2 more in esams soon, BTW.
>
> I guess we need a standard for the machine count at which multicast udp becomes better than pub/sub. I don't think 12 (4/dc) is it, though. ;-)
On Jul 25, 2012, at 3:36 PM, Asher Feldman <afeldman(a)wikimedia.org> wrote:
> On Tue, Jul 24, 2012 at 10:15 PM, Ori Livneh <ori(a)wikimedia.org> wrote:
> From vanadium (eqiad), I can connect to port 8649 on cp300[1-2].esams.wikimedia.org, which I presume is gmond. If we could open an additional port (bound to a zmq publisher socket that makes the filtered log stream available for vanadium to subscribe to), that would work.
>
> The number of varnish servers will change, data centers get failed over, etc. I think you'd want the publishers to establish the connection with vanadium, not the other way around.
>
>
terry chay 최태리
Director of Features Engineering
Wikimedia Foundation
“Imagine a world in which every single human being can freely share in the sum of all knowledge. That's our commitment.”
p: +1 (415) 839-6885 x6832
m: +1 (408) 480-8902
e: tchay(a)wikimedia.org
i: http://terrychay.com/
w: http://meta.wikimedia.org/wiki/User:Tychay
aim: terrychay
More from tsuna about how StumbleUpon uses Kafka:
Begin forwarded message:
> From: tsuna <tsunanet(a)gmail.com>
> Subject: Re: Scribe Packaging Effort
> Date: July 27, 2012 1:08:29 AM EDT
> To: Andrew Otto <otto(a)wikimedia.org>
>
> On Thu, Jul 26, 2012 at 9:01 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
>> What do you guys use Kafka for?
>
> Just as a simple message bus between different components. For
> example when certain events happen on our site, we create a message
> corresponding to the event and send it to the appropriate Kafka topic,
> and consumers interested in these sort of messages can get them and
> handle them however they want.
>
> In your case it looks like you are shipping logs around and you want
> to have a pipeline where one of the stages is transforming the
> messages. I think this should be easy to build with Flume, as they
> have APIs for sources and sinks, so although I've never done it
> myself, I expect it would be rather straightforward to write your own
> Agent that transforms messages and insert it in between the producers
> and the HDFS sink. This sounds like it would be simpler than dealing
> with Scribe's problems or throwing Storm into the picture.
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
Hey guys,
I reached out to this guy yesterday about the bug I ran into in Scribe. He had posted on the scribe-server google group that he had fixed this bug, and I also wanted to let him know about our potential efforts to standardize Scribe packaging.
Here's his opinion on Scribe:
> I personally gave up on Scribe, I'd recommend that you
> consider Flume as a better replacement, that is more supported and
> developed. Scribe has never been really well written or maintained,
> it's just one of the many hacks that Facebook released.
In general, Scribe does seem to be pretty much abandoned. There have been a couple of pull requests merged in the last year, but beyond that there isn't much activity: https://github.com/facebook/scribe/commits/master
It would be really interesting to know how (and if?) Facebook still uses Scribe internally. I'm pretty sure they've done a lot more with Hadoop since 2008-2010 when Scribe was being more actively promoted. Maybe they're using Flume instead now? We need a Facebook insider, anyone know one?
-Ao
Begin forwarded message:
> From: tsuna <tsunanet(a)gmail.com>
> Subject: Re: Scribe Packaging Effort
> Date: July 26, 2012 12:45:30 AM EDT
> To: Andrew Otto <otto(a)wikimedia.org>
>
> On Wed, Jul 25, 2012 at 9:05 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
>> Hi Benoît,
>
> Hi Andrew,
>
>> In the meantime, I have another question! I just ran into a problem that
>> you say you fixed in this thread:
>> https://groups.google.com/group/scribe-server/tree/browse_frm/month/2010-01…
>
> Are you referring to this?
>> [Thu Nov 19 18:29:59 2009] "[hdfs] Connecting to HDFS"
>> *** glibc detected *** ./scribed: munmap_chunk(): invalid pointer: 0x0000000001ea19c3 ***
>
>> However, the commit you link to 404s. I'm willing to rebuild scribe with
>> whatever fix or release version is necessary. Can you point me in the right
>> direction? What source should I use to build scribe to fix this bug?
>
> If you're referring to the bug above, it's a very old bug, it must be
> fixed upstream already. I can't believe you're running into the same
> bug almost 3 years later, it must be a different issue.
>
> Either way, I personally gave up on Scribe, I'd recommend that you
> consider Flume as a better replacement, that is more supported and
> developed. Scribe has never been really well written or maintained,
> it's just one of the many hacks that Facebook released.
>
> Good luck.
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
Belated cross-post for those who are interested in Git analytics.
-Sumana
-------- Original Message --------
Subject: [Wikitech-l] Git code review metrics
Date: Fri, 20 Apr 2012 16:42:23 -0700
From: Erik Moeller <erik(a)wikimedia.org>
Reply-To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Following up on the earlier thread by Rob [1], Rob and I kicked around
the question of what metrics/targets for code review we want to surface
on an ongoing basis. We're not going to invest in a huge dashboard
project right now, but we'll try to get at least some of the key
metrics generated and visualized automatically. Help is appreciated,
starting with deciding which metrics we should look at.
Here's what we came up with, by priority:
1) Most important: Time series graph of # of open changesets
Target: Number of open changesets should not exceed 200.
Optional breakdown:
- mediawiki/core
- mediawiki/extensions
- WMF-deployed extensions
- specific repos
2) Important: Aging trends.
- Time series graph of # open changesets older than a, b, c days
(to indicate troubling aging trends, e.g. a=3, b=5, c=7)
- Target: There should be 0 changes that haven't been looked at
at all for more than 7 days.
- Including only: Changes which have not received a -1 review, -1
verification, or -2
- Optional breakdown as above
- Rationale: We're looking for tendencies of complete neglect of
submissions here, which is why we exclude changes that already
received a -1 or -2.
3) Possibly useful:
- Per-reviewer or reviewee(?) statistics regarding merge activity,
number of -1s, neglected code, etc.
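The aging buckets in (2) are easy to compute once changeset metadata is in hand. Here is a minimal sketch; the record shape (`status`, `worst_label`, `created`) is hypothetical, not what Gerrit's query interface actually returns:

```python
from datetime import datetime, timedelta

def aging_buckets(changesets, now, thresholds=(3, 5, 7)):
    """Count open changesets older than each threshold (in days),
    excluding anything that already got a -1 review/verification or
    a -2, since we only want to surface complete neglect."""
    counts = {t: 0 for t in thresholds}
    for c in changesets:
        if c["status"] != "open" or c["worst_label"] <= -1:
            continue  # merged/abandoned, or already reviewed negatively
        age_days = (now - c["created"]).days
        for t in thresholds:
            if age_days > t:
                counts[t] += 1
    return counts

# Toy data illustrating the exclusion rule.
now = datetime(2012, 4, 20)
changesets = [
    {"status": "open", "worst_label": 0, "created": now - timedelta(days=8)},
    {"status": "open", "worst_label": -1, "created": now - timedelta(days=10)},
    {"status": "open", "worst_label": 0, "created": now - timedelta(days=4)},
    {"status": "merged", "worst_label": 0, "created": now - timedelta(days=30)},
]
print(aging_buckets(changesets, now))  # {3: 2, 5: 1, 7: 1}
```

The target in (2) is then simply `counts[7] == 0`.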
Any obvious thinking errors in the above / do the targets make sense /
should we look at other metrics or approaches?
Erik
[1] http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/059940.html
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hey Jan. Yes, this tool is almost what I need. The issue is that it
only supports 7 languages, while we are currently translating content
into nearly 40. Is it possible to expand it to all languages of Wikipedia?
James Heilman
On Tue, Jul 24, 2012 at 6:00 AM, <analytics-request(a)lists.wikimedia.org> wrote:
> Send Analytics mailing list submissions to
> analytics(a)lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/analytics
> or, via email, send a message with subject or body 'help' to
> analytics-request(a)lists.wikimedia.org
>
> You can reach the person managing the list at
> analytics-owner(a)lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Analytics digest..."
>
>
> Today's Topics:
>
> 1. Re: Calculating page views for projects in other languages
> (Jan Ainali)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 23 Jul 2012 15:02:53 +0200
> From: Jan Ainali <jan.ainali(a)wikimedia.se>
> To: "A mailinglist for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics."
> <analytics(a)lists.wikimedia.org>
> Subject: Re: [Analytics] Calculating page views for projects in other
> languages
> Message-ID:
> <CAKwu9WHOiC34E1uZwJiAEaCdHpyL_fQH4jnZozG9j7ayhxhMHQ(a)mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I just wanted to let you know of a tool that Holger Motzkau
> (User:Prolineserver) made. It does not really solve your problem, but
> it comes close: it lists the page views for a category on one Wikipedia
> (plus all the interwiki links and the QRpedia statistics). It shouldn't
> be too hard to feed it with a list of articles carrying a certain
> template, I guess.
>
> http://toolserver.org/~prolineserver/glamorous/glamorous_cats.php
>
> --
> Best,
> Jan Ainali
> Chairman, Wikimedia Sverige <http://se.wikimedia.org/wiki/Huvudsida>
>
>
> 2012/7/23 Erik Zachte <ezachte(a)wikimedia.org>
>
>> James seeks a one page overview of most read articles for any wiki/project.
>>
>> A list of articles per project could be retrieved from the mediawiki API.
>> I did something similar with list of articles per category (incl subcat, x
>> levels deep).
>> Perl script on request.
>>
>> Then the machine readable version of grok could be used to retrieve article
>> counts. (see Dario's comment)
>> However this might not scale well to 1(0),000's of projects and
>> 100,000's of pages.
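For the first step Erik describes, the page list can come from the MediaWiki API's list=categorymembers query. A minimal sketch of building such a request (one level only; the x-levels-deep subcategory recursion would need repeated calls following cmtype=subcat results):

```python
from urllib.parse import urlencode

def categorymembers_url(lang, category, limit=500):
    """Build a MediaWiki API query URL listing the pages in a category
    on one language Wikipedia. Continuation (cmcontinue) and subcategory
    recursion are left out of this sketch."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": limit,
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, urlencode(params))

print(categorymembers_url("fr", "Paris"))
```

The resulting JSON would then be fed page-by-page into the grok page view counts, as Dario's earlier comment describes.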
>>
>> In the somewhat longer run I see two developments that might warrant
>> putting
>> this on hold:
>>
>> 1)
>> The new analytics cluster will be used to aggregate page and image views.
>> (Another use case would be aggregating image views per donating GLAM
>> institute)
>> Precisely which aggregations are needed is better determined once the
>> infrastructure is available and capacity is known.
>>
>> 2)
>> There are scripts to aggregate Domas' hourly page view feeds into monthly
>> files.
>> These aggregates are much smaller (after cruft removal, only 2 GB per
>> month) without losing hourly resolution, and are easy to download and
>> archive/process somewhere else.
>> http://lists.wikimedia.org/pipermail/wikitech-l/2011-August/054644.html
>> These need final work. I spoke to long-time Wikimedia developer EMW at
>> Wikimania and he might be interested in taking this on, starting in
>> October. From these aggregates the 1(0),000's of project overviews
>> could be generated in a batch process, though only after each month
>> completes.
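The core of that rollup is a big group-by over the hourly files. A rough sketch of the simplest possible version, assuming the hourly pagecounts line format of `project page_title views bytes` (this sketch sums views and so drops the hourly breakdown that Erik's actual scripts preserve, and skips the cruft removal entirely):

```python
from collections import Counter

def aggregate_hourly(lines, monthly=None):
    """Fold one hourly pagecounts file into running per-page totals.
    Each input line is 'project page_title views bytes'."""
    if monthly is None:
        monthly = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        project, title, views, _size = parts
        monthly[(project, title)] += int(views)
    return monthly

# Two toy hourly files folded into one running total.
hour1 = ["fr Paris 3 120000", "en Paris 5 400000"]
hour2 = ["fr Paris 2 80000"]
totals = aggregate_hourly(hour2, aggregate_hourly(hour1))
print(totals[("fr", "Paris")])  # 5
```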
>>
>> Erik Zachte
>>
>>
>> -----Original Message-----
>> From: analytics-bounces(a)lists.wikimedia.org
>> [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dario
>> Taraborelli
>> Sent: Friday, July 20, 2012 6:56 PM
>> To: A mailinglist for the Analytics Team at WMF and everybody who has an
>> interest in Wikipedia and analytics.
>> Subject: Re: [Analytics] Calculating page views for projects in other
>> languages
>>
>> James,
>>
>> can you expand on this request? If you are interested in per-article
>> pageview stats you can use: http://stats.grok.se/
>>
>> For example: http://stats.grok.se/fr/201207/Paris
>> A machine readable version: http://stats.grok.se/json/fr/201207/Paris
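Summing a month of views from that machine-readable version could look roughly like this; the `daily_views` field name is an assumption about the shape of the JSON the /json/ endpoint returned, not something confirmed here:

```python
import json

def monthly_views(grok_json):
    """Sum the per-day counts in a stats.grok.se JSON payload,
    assuming a 'daily_views' mapping of date -> view count."""
    data = json.loads(grok_json)
    return sum(data.get("daily_views", {}).values())

# A made-up payload in the assumed shape, for illustration only.
sample = json.dumps({
    "title": "Paris",
    "project": "fr",
    "month": "201207",
    "daily_views": {"2012-07-01": 1200, "2012-07-02": 1350},
})
print(monthly_views(sample))  # 2550
```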
>>
>> Dario
>>
>> On Jul 20, 2012, at 9:38 AM, Sumana Harihareswara wrote:
>>
>> > James: Good places to add your requests to:
>> >
>> > https://lists.wikimedia.org/mailman/listinfo/toolserver-l
>> >
>> > https://www.mediawiki.org/wiki/Annoying_large_bugs
>> >
>> >
>> > --
>> > Sumana Harihareswara
>> > Engineering Community Manager
>> > Wikimedia Foundation
>> >
>> >
>> >
>> > On 07/20/2012 12:38 PM, Diederik van Liere wrote:
>> >> It is always tricky to convince someone to start working on a request
>> >> for yourself. Given the fact that there is an existing code base, I
>> >> would say that your best bet is to study it and tweak it to your own
>> >> requirements. If you have specific technical questions, there are
>> >> enough people within the different Wikimedia communities who can help
>> >> you.
>> >>
>> >> Good luck!
>> >>
>> >> Diederik
>> >>
>> >> Sent from my iPhone
>> >>
>> >> On 2012-07-20, at 11:47, James Heilman <jmh649(a)gmail.com> wrote:
>> >>
>> >>> This is something I was hoping to convince someone with programming
>> >>> skills to take on. What prevents me from doing it is my complete
>> >>> lack of programming skills thus the request here.
>> >>>
>> >>> --
>> >>> James Heilman
>> >>> MD, CCFP-EM, Wikipedian
>> >>>
>> >>> The Wikipedia Open Textbook of Medicine
>> >>> www.opentextbookofmedicine.com
>> >>>
>> >>> _______________________________________________
>> >>> Analytics mailing list
>> >>> Analytics(a)lists.wikimedia.org
>> >>> https://lists.wikimedia.org/mailman/listinfo/analytics
>> >>
>> >> _______________________________________________
>> >> Analytics mailing list
>> >> Analytics(a)lists.wikimedia.org
>> >> https://lists.wikimedia.org/mailman/listinfo/analytics
>> >>
>> >
>> >
>> > _______________________________________________
>> > Analytics mailing list
>> > Analytics(a)lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
This is something I was hoping to convince someone with programming
skills to take on. What prevents me from doing it is my complete lack
of programming skills thus the request here.
--
James Heilman
MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com