Those CSVs don't have column names -- what do they represent?
On 8/13/14, 7:24 AM, Gilles Dubuc wrote:
Just to get a sense of the scale,
"non-standard" sizes > 1280
represent approximately 2 TB of Swift storage at the moment. And all
sizes <= 1280 (where we can't tell "non-standard/standard" apart)
represent approximately 16 TB. As for "standard sizes" > 1280, they
total around 1.6 TB.
It's hard to estimate how much we're looking to save on sizes < 1280
due to the issue I've described earlier. But it's probably something
expressed in terabytes.
Filippo told me that the space I've just mentioned doesn't take into
account the swift replication (currently 3 copies), which means we're
actually talking about three times as much physical storage space.
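
For anyone who wants to redo the back-of-the-envelope math, here's the raw
arithmetic (the logical figures quoted above multiplied by the current
replication factor of 3):

    # Logical Swift usage per category (TB), from the figures above.
    REPLICAS = 3
    logical_tb = {
        "non-standard, width > 1280": 2.0,
        "standard, width > 1280": 1.6,
        "width <= 1280 (standard/non-standard indistinguishable)": 16.0,
    }

    for category, tb in logical_tb.items():
        print(f"{category}: {tb} TB logical, {tb * REPLICAS} TB physical")
    print(f"total: {sum(logical_tb.values()) * REPLICAS:.1f} TB physical")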
I've looked at the number of hits for sizes > 1280: "non-standard"
thumbnails are viewed 3.3 times less often than "standard" ones. That
means some strange sizes are getting a decent amount of traffic, but I
haven't looked at the distribution yet to see whether some sizes
clearly stand out and might be "standard" sizes we don't know about
lurking in there.
I've attached Filippo's CSV dumps, so that everyone can have fun at
home extracting meaning from that data.
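
The kind of distribution check I have in mind would look roughly like this,
assuming (to be confirmed, since the dumps have no header row) that each row
carries a width and a hit count in its first two columns; the filename is
hypothetical:

    import csv
    from collections import Counter

    # Assumed layout: width,hits -- the dumps have no header, so confirm
    # the column order before trusting the output.
    hits_by_width = Counter()
    with open("thumb_hits.csv", newline="") as f:  # hypothetical filename
        for row in csv.reader(f):
            hits_by_width[int(row[0])] += int(row[1])

    # Widths above 1280 that draw a disproportionate share of traffic are
    # candidates for "standard" sizes we don't know about yet.
    for width, hits in hits_by_width.most_common(30):
        if width > 1280:
            print(width, hits)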
For reference, this is the list of "standard" sizes we've come up
with, by hunting for various areas of the code that govern thumbnail
sizes served:
On Wed, Aug 13, 2014 at 12:59 PM, Gilles Dubuc <gilles@wikimedia.org> wrote:
The context is that Filippo from Ops would like to run a regular
cleanup job that deletes thumbnails from swift that have
non-mediawiki-requested sizes, when they haven't been accessed for
X amount of time. Currently we keep all thumbnails forever.
The idea is that 3rd-party tools requesting odd sizes would result
in less storage space being used, as what they request would be deleted
after a while. This would be accompanied by documentation for
developers indicating that the best performance is obtained
when using a predefined set of sizes currently in use by the
various tools in production (core, extensions, mobile apps and
sites, etc.).
This is an interim solution while we still store thumbnails on
swift, which in itself is something we want to change in the future.
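
To make that concrete, the job could look roughly like the sketch below. It
assumes python-swiftclient; the container name, width list and age threshold
are placeholders, and since Swift doesn't track last-access out of the box it
falls back to the last-modified timestamp as an approximation of "not
accessed for X amount of time":

    import re
    from datetime import datetime, timedelta, timezone
    from swiftclient.client import Connection

    STANDARD_WIDTHS = {320, 640, 800, 1024, 1280}   # placeholder list
    MAX_AGE = timedelta(days=30)                     # the "X amount of time"
    WIDTH_RE = re.compile(r"/(\d+)px-")              # e.g. .../640px-Foo.jpg
    CONTAINER = "wikipedia-commons-local-thumb"      # placeholder container

    conn = Connection(authurl="https://swift.example/auth/v1.0",
                      user="account:user", key="secret")
    _, objects = conn.get_container(CONTAINER, full_listing=True)

    now = datetime.now(timezone.utc)
    for obj in objects:
        m = WIDTH_RE.search(obj["name"])
        if not m or int(m.group(1)) in STANDARD_WIDTHS:
            continue                                 # keep standard sizes
        age = now - datetime.fromisoformat(obj["last_modified"] + "+00:00")
        if age > MAX_AGE:
            conn.delete_object(CONTAINER, obj["name"])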
- we want to use less storage space
Yes
- images we are generating and caching for not-Wikipedia
should be the first to go
Yes. More accurately, images we are currently generating for
unknown 3rd parties requesting unusual sizes.
- we assume weird sizes are from not-Wikipedia. So let's cache
them for less time
Either they are coming from unknown 3rd parties, or from defunct
code. And yes, the idea is to keep them in swift for a period,
instead of keeping them in swift forever.
- except, that doesn't work, because of tall images
We can't differentiate requests coming from core's file page for
tall images from odd sizes for anything below 1280px width. Above
that, it's a lot easier to tell the difference between code we run
and 3rd parties, which means we're probably already going to
see some significant storage savings. In fact Filippo has given me
figures from production; I just have to compile them to know how
much storage we're talking about. I'll do that soon and it will be
a good opportunity to see how much we're "missing out" due to the
<1280 tall images case.
- so maybe we should change the image request format?
If the thumbnail URL format allowed constraining by height in addition
to width, we could keep the existing file page behavior and
differentiate "ours vs. theirs" thumbnail requests for sizes below
1280px. It would be a lot of work; we have to see if it's worth it.
- If you want to prioritize Wiki[mp]edia thumbnails, why not
use the referrer header instead? Why use the width parameter
to detect this?
Referrer is unreliable in the real world: browsers can suppress
it, and so can proxies, etc. The width parameter doesn't tell us the
source either. If we receive a request for a "469" width, we can't tell
whether it's coming from a 3rd party or from a visitor of the file page
of an image that is, for example, 469px wide and 1024px tall.
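
To illustrate the ambiguity with made-up numbers (only the scaling rule
matters here, the height steps are hypothetical):

    # A tall original: the file page picks target heights, and the
    # resulting widths land on arbitrary values.
    orig_w, orig_h = 469, 1024

    for target_h in (200, 400, 600, 800):  # made-up height steps
        width = round(orig_w * target_h / orig_h)
        print(f"{target_h}px tall -> {width}px wide in the thumbnail URL")
    # e.g. 400px tall -> 183px wide: from the URL alone, that request is
    # indistinguishable from a 3rd party asking for width 183.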
- Are we sure we'll improve overall performance by evicting
certain files from cache quicker? Why not trust the LRU cache
algorithm?
Performance, no, but storage space, yes. The idea is that the
performance impact would be limited to clients requesting
weird image sizes. I don't think we have an LRU option to speak of;
it would be a job written by Ops.
- as maintainers of the wikimedia media file servers, we want
to reduce the number of images cached in order to save storage
space and cost?
Yes, and in particular this would allow us to use the existing
capacity for more useful purposes, such as pre-generating all
expected thumbnail sizes at upload time. That would mean
"official" clients, or clients sticking to the extensive list
of sizes we'll support, never hit a thumbnail size that needs
to be generated on the fly.
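
As a rough illustration of the pre-generation idea, assuming a plain HTTP
warm-up against the thumb URLs at upload time (the width list, URL pattern
and file are placeholders, not the real supported set):

    import requests

    SUPPORTED_WIDTHS = (320, 640, 800, 1024, 1280)   # placeholder list
    THUMB_URL = ("https://upload.wikimedia.org/wikipedia/commons/thumb/"
                 "{path}/{w}px-{name}")

    def pregenerate(path, name):
        # Request every supported width once so the thumbnail is rendered
        # and stored before any client asks for it.
        for w in SUPPORTED_WIDTHS:
            requests.get(THUMB_URL.format(path=path, w=w, name=name),
                         timeout=30)

    pregenerate("a/ab/Example.jpg", "Example.jpg")    # hypothetical file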
is it possible to cache based on a last accessed timestamp?
When we move away from swift, this is exactly what we want to set
up. Although it would be interesting to contemplate making
exceptions for widely used sizes. What I'm describing is a
temporary solution while we still live in the thumbnails-on-swift
status quo.
- if an image size has not been accessed within x number of
days purge it from the cache
Basically this is an attempt to do this on swift, while not
touching sizes that we know are requested by a lot of clients.
On Wed, Aug 13, 2014 at 12:35 PM, dan-nl <dan.entous.wikimedia@gmail.com> wrote:
what is the main use case?
- as maintainers of the wikimedia media file servers, we want
to reduce the number of images cached in order to save storage
space and cost?
- and/or something else?
is it possible to cache based on a last accessed timestamp?
- if an image size has not been accessed within x number of
days purge it from the cache
with kind regards,
dan
On Aug 13, 2014, at 11:18, Neil Kandalgaonkar <neilk@neilk.net> wrote:
I think I need more context. Is this what
you're saying?
- we want to use less storage space
- images we are generating and caching for not-Wikipedia should be the first to go
- we assume weird sizes are from not-Wikipedia. So let's cache them for less time
- except, that doesn't work, because of tall images
- so maybe we should change the image request format?
If this is accurate I have a few questions:
- If you want to prioritize Wiki[mp]edia thumbnails, why not use the referrer header instead? Why use the width parameter to detect this?
- Are we sure we'll improve overall performance by evicting certain files from cache quicker? Why not trust the LRU cache algorithm?
On 8/13/14, 1:36 AM, Gilles Dubuc wrote:
> Currently the file page provides a set of different image sizes for the user to directly access. These sizes are usually width-based. However, for tall images they are height-based. The thumbnail URLs, which are used to generate them, pass only a width.
>
> What this means is that tall images end up with arbitrary thumbnail widths that don't follow the set of sizes meant for the file page. The end result from an ops perspective is that we end up with very diverse widths for thumbnails. Not a problem in itself, but the exposure of these random-ish widths on the file page means that we can't set a different caching policy for non-standard widths without affecting the images linked from the file page.
>
> I see two solutions to this problem, if we want to introduce different caching tiers for thumbnail sizes that come from mediawiki and thumbnail sizes that were requested by other things.
>
> The first one would be to always keep the size progression on the file page width-bound, even for soft-rotated images. The first drawback of this is that for very skinny/very wide images the file size progression between the sizes could become steep. The second drawback is that we'd often offer fewer size options, because they'd be based on the smallest dimension.
>
> The second option would be to change the syntax of the thumbnail URLs in order to allow a height constraint. This is a pretty scary change.
>
> If we don't do anything, it simply means that we'll have to apply the same caching policy to every size smaller than 1280. We could already save quite a bit of storage space by evicting non-standard sizes larger than that, but sizes lower than 1280 would have to stay the way they are now.
>
> Thoughts?
--
Neil Kandalgaonkar <neilk@neilk.net>
_______________________________________________
Multimedia mailing list
Multimedia@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/multimedia