Neil Harris wrote:
We have a long-standing problem with AOL, which is
that they insist on
being a single giant cluster of anonymizing proxies. Should we
consider sending a cookie to AOL browsers which issue edit requests,
to give them some kind of identity?
Some comments to make this clearer:
* we detect AOL users by IP address range, not browser user-agent
* when I say "hash", I mean "keyed hash with a secret key"
We can also generate the cookie values nicely by generating them as
"seed" + hash(key1 + "seed" + key1)
where + is string concatenation, key1 is a secret key only our server
knows, and the seed is a random string.
By the nature of the hash algorithm, it is hard to discover key1, but we
(who know key1) can trivially verify whether it is a genuine cookie or
not without doing a database lookup, which will scale nicely for the
long-term future. Making the strings at least 512 bits (64 chars) long
and using a good hash will effectively frustrate brute-force attacks on
the key.
* we can re-use the blocking subsystem to do this, by giving it a
variety of policies which can be enforced in the blocking table, eg
"block" or "force-anon-cookie"
To generate the user pseudonyms, after checking that
hash(key1 + "seed" + key1) is the same as the second part of the cookie
we then generate
hash(key2 + "seed" + key2)
where key2 is another secret key, and truncate it to some reasonable
length. Note that we don't need the authenticating hash from the cookie
string, as it is simply dependent on the seed and adds no greater
security once it has been verified.
A nicer way of showing the hashes of the cookie-strings might be as
strings with a leading hash thus: #ekls9fjr5i39 (using base-36 encoding
using [a-z][0-9]). For added revelatory power, we could even present it
as #aol:ekls9fjr5i39, so that the originating ISP information was not lost.
Note that even though the actual keyed hash of the cookie string may be
512 bits long, we only need use the first 8 base-36 encoded characters
as a user pseudonym to have less than one chance in 2 * 10^12 of an
accidental collision, so using 12 as shown above is plenty.
So, the cost of doing this is mostly the cost of hashing, at two hashes
per edit, one to verify the cookie and one to generate the pseudonym
string. On a modern processor, each hash will take roughly 2 or 3
milliseconds to execute, which is trivial at Wikipedia's current edit
rate of an edit every 2 or 3 seconds. (We could also memo-cache verified
cookie -> name lookups in memcached, if we wanted to accelerate things
further, thus eliminating the hashing entirely in most cases except for
the first time, but that's probably too much complexity for too little
gain).
-- N.