Hi guys,
I am getting ready to turn off my work email until I return from vacation on Sep. 24.
Before I go, here’s my recommended to-do list for next week, with a focus on Media Viewer improvements.
It is based in part on this updated improvements plan:
https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Improvements#Tasks
I considered emailing each of you individually about specific tasks on this list, but a single message to the team may be more helpful, so we’re all on the same page.
Gilles
* Lead Wednesday's sprint planning meeting
* Update the Etherpad, Mingle site and Improvements page
* Deploy and monitor 'Pre-render thumbnails on backend' (#301)
* Investigate better performance metrics solution for Media Viewer (#881)
* Run "versus" test on a dedicated labs instance (#884)
Mark
* Complete and deploy ‘More Details’ button with project icon (#830, 873)
* Complete and deploy Download and Share buttons (#841, 834)
* Develop Disable / Enable tools (#836, 719, 870)
* Show clearer indication of authorship (#837)
* Work closely with Pau on all design questions (via IRC in your AM, then email in the PM)
Gergo
* Disable MediaViewer for logged-in Commons users (#822)
* Help with back-end on Disable / Enable tools, as needed (#836, 719, 870)
* Update 'File: Page vs. MMV' performance dashboard with improved user data (#880)
* Performance histogram for MediaViewer and File page (#815)
* Help with attribution/licensing improvements for metadata cleanup effort
Pau
* Update specs with all assets for Disable / Enable tools (#836, 719, 870)
* Finalize design for Caption above the fold + how to expand/collapse it (#589)
* Propose a design for how to access ‘Help’ or ‘About’ pages from Media Viewer
* Join our IRC MM channel at the end of your day (to work more closely with Mark)
Keegan
* Respond to new comments on consultation or MV talk pages
* Thank contributors to consultation whose suggestions made it to the list
* Archive old comments on Media Viewer talk page
Abbey
* Write findings from Disable / Enable user testing with Pau
* Add a few more screenshots to the August report (with credits)
* Post wiki report on July user studies, for transparency reasons
See Current Cycle wall for more details:
http://ur1.ca/h7w5s
I hope this outline is helpful. There may be other priority tasks that are not listed here, but I wanted to give you a sense of action items that seem important for the next 2 weeks, from my standpoint.
In my absence, Erik and Howie will tag team to provide product input on all this. Erik will join next Wednesday’s sprint planning meeting and can help test important features for deployment.
I look forward to seeing you all the following Wednesday, Sep. 24, at the 9am PT weekly meeting. I will not check work email during that time. If you need to reach me urgently, my personal email is fabriceflorin(a)gmail.com .
Good luck and speak to you soon!
Fabrice
_______________________________
Fabrice Florin
Product Manager, Multimedia
Wikimedia Foundation
https://www.mediawiki.org/wiki/User:Fabrice_Florin_(WMF)
Hi Thomas,
I'm not really talking about the specific query *engine* that will work
on the file topic data. (Well, maybe a little, in general terms about
some of the functionality we might want in such a search).
What I'm more talking about is the kind of data that will likely need to
be stored on the CommonsData wikibase to make any such query engine
*possible* with reasonable speed -- in particular, not just the most
specific Q-numbers that apply to a file, but (IMO) *any* Q-number for
which the file should be returned if the topic corresponding to that
Q-number is searched for.
I'm saying that such a Q-number needs to be included on the item on
CommonsData for the file -- it's not enough that if one used Wikidata to
look up the more specific Q-number, then the less specific Q-number
would be returned: I'm saying that lookup already needs to have been
done (and maintained), so the less specific Q-number is already sitting
on CommonsData when someone comes to search for it.
This doesn't need to be a manual process (though the presence of a
Q-number on a CommonsData item perhaps needs to be subject to manual
overrule, in case the inference chain has gone wrong, and it really
isn't relevant); but what I'm saying is that you can't wait to do the
inference when the search request comes in -- instead the relevant
Q-numbers for each file need to be pre-computed, and stored on the
CommonsData item, so that when the search request comes in, they are
already there to be searched on. That denormalisation of information
really needs to be in place whatever the fine coding of the engine --
it's data design, rather than engine coding.
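To make that data-design point concrete, here's a rough sketch in Python
of what I mean by pre-computing the denormalised topic list (all names
and the toy lookup are hypothetical; the real CommonsData/Wikibase
machinery will obviously differ):

    # Pre-compute the denormalised topic list for a file item. "Lead"
    # topics are the Q-numbers assigned directly; "implied" topics are
    # everything reachable from them via the inference chain on Wikidata,
    # computed ahead of time rather than at query time.

    def implied_topics(lead_qid, lookup):
        """Walk the inference chain upwards from one lead topic."""
        found, frontier = set(), [lead_qid]
        while frontier:
            for parent in lookup(frontier.pop()):
                if parent not in found:
                    found.add(parent)
                    frontier.append(parent)
        return found

    def denormalised_topics(lead_qids, lookup):
        """Everything that should sit on the CommonsData item, ready to index."""
        topics = set(lead_qids)
        for qid in lead_qids:
            topics |= implied_topics(qid, lookup)
        return topics

    # Toy stand-in for the Wikidata inference chain:
    parents = {"Q_old_spot": ["Q_pig"], "Q_pig": ["Q_animal"], "Q_animal": []}
    print(denormalised_topics(["Q_old_spot"], lambda q: parents.get(q, [])))
    # -> {'Q_old_spot', 'Q_pig', 'Q_animal'}  (a set, so order may vary)

The point being: that whole computed set, not just "Q_old_spot", is what
gets written onto the file's item.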
-- James.
On 13/09/2014 20:56, Thomas Douillard wrote:
> Hi James, I don't understand (I must admit I did not read the whole topic).
> Are we talking about a specific query engine ? The one the development team
> will implement in Wikibase, or are we talking of something else ?
>
> If we do not know that, it seems difficult to have this conversation at that
> point.
>
> 2014-09-13 21:51 GMT+02:00 James Heald <j.heald(a)ucl.ac.uk>:
>
>> "Let the ops worry about time" is not an answer.
>>
>> We're talking about something we're hoping to turn into a world-class
>> mass-use image bank, and its front-line public-facing search capability.
>>
>> That's on an altogether different scale to WDQ running a few hundred
>> searches a day.
>>
>> Moreover, we're talking about a public-facing search capability, where
>> your user clicks a tag and they want an updated results set *instantly*
>> -- their sitting around while the server makes a cup of tea, or declares
>> the query is too complex and goes into a sulk is not an option.
>>
>>
>> If the user wants a search on "palace" and "soldier", there simply is not
>> time for the server to first recursively build a list of every palace it
>> knows about, then every image related to each of those palaces, then every
>> soldier it knows about, every image related to each of those soldiers, then
>> intersect the two (very big) lists before it can start delivering any image
>> hits at all. That is not acceptable. A random internet user wants those
>> hits straight away.
>>
>> The only way to routinely be able to deliver that is denormalisation.
>>
>> It's not a question of just buying some more blades and filling up some
>> more racks. That doesn't get you a big enough factor of speedup.
>>
>> What we have is a design challenge, which needs a design solution.
>>
>> -- James.
>>
>>
>>
>>
>>
>>> Let the ops worry about time, I have not heard them complain about a
>>> search
>>> dystopia yet. Even the Wiki Data Query has reasonable response times
>>> compared to the power it offers in the queries. And that is on wmflabs,
>>> not a production server.
>>> You're saying that even when we make the effort to get structured linked
>>> data we should not exploit the single most important advantage it offers.
>>> It does not make sense.
>>> It's almost like just repeating the category system again but with
>>> different software (albeit it offers multilinguality).
>>>
>>> /Jan
>>>
>>>
>>>
>>> _______________________________________________
>>> Multimedia mailing list
>>> Multimedia(a)lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/multimedia
>>>
>>>
>>
>> _______________________________________________
>> Wikidata-l mailing list
>> Wikidata-l(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>>
>
>
>
> _______________________________________________
> Wikidata-l mailing list
> Wikidata-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>
Hey, all!
This new post is to respond to various points from GerardM and P.
Blissenbach, previously responding to me on wikidata-l. I'm also
cross-posting it to multimedia-l, who no doubt will be able to put me
straight about lots of things.
In particular, where will "topics" to be associated with image files be
stored, and how will they be searched ?
* Where will topics be stored ? *
On the question of where the list of topics will be stored, the initial
thoughts of the Structured Data team would seem to be clear: they are to
be stored on the new CommonsData wikibase.
See eg:
https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Sl…
("topic links")
https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVj…
(API design and class diagram)
* How would topics be searched ? *
Gerard wrote:
> I am really interested how you envision searching when all those topics are
> isolated and attached to each file..
The trite answer is: in the same way you would search any other database
-- by setting an index.
It should be very very simple to pull the identities of all files on
CommonsData related to topic Qnnnnn.
* Why not store information with the Q-items on WikiData, regarding what
files are related ? *
One could do this. Essentially what we have here is a many-many join.
Each file can have many topics. Each topic can have many files. So the
classic relational approach would be a separate join table.
Moving the information out of main Wikidata makes Wikidata smaller and
leaner to query, particularly for queries that simply aren't interested
in images.
As to whether you really do have a join table, or whether you just
consider it all part of CommonsData, that's really up to the developers.
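As a toy illustration of that many-to-many join (Python dictionaries
standing in for whatever store the developers actually choose; all the
identifiers are made up):

    # The file<->topic relation as a plain join "table", plus the two
    # indexes you would actually query: files by topic, and topics by file.
    from collections import defaultdict

    join_rows = [                    # (file item on CommonsData, topic Q-number)
        ("M101", "Q_mona_lisa"),
        ("M101", "Q_leonardo"),
        ("M202", "Q_pig"),
    ]

    files_by_topic = defaultdict(set)
    topics_by_file = defaultdict(set)
    for file_id, qid in join_rows:
        files_by_topic[qid].add(file_id)
        topics_by_file[file_id].add(qid)

    print(files_by_topic["Q_leonardo"])   # -> {'M101'}
    print(topics_by_file["M101"])         # -> {'Q_mona_lisa', 'Q_leonardo'}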
* What about the natural hierarchical structure ? *
eg
Leonardo da Vinci
--> Mona Lisa
--> --> Files depicting the Mona Lisa
Shouldn't the fact that it was Leonardo that painted the Mona Lisa only
be stored in one place, on Mona Lisa, (or perhaps on Leonardo); but
*not* multiple times, separately on every single depiction file?
*A*: Probably not, for several reasons.
Trying to find things (and also, to accurately represent things) in a
hierarchical structure is the bane of Commons at the moment; it also
makes searching Wikidata significantly non-trivial.
So the most significant reason is retrieval.
Suppose we have an image with topic "Gloucestershire Old Spot" (a breed
of pig). We also want to be able to retrieve the image rapidly if
somebody keys in "Pig".
Similarly, if we have an image of the "Mona Lisa", we also want it to be
in the set of images returned when somebody keys in "Leonardo".
For simple searches, one could imagine walking down the Wikidata tree from
"Pig" or from "Leonardo", compiling a list of derived search terms, and
then building a union set of hits. Slightly more cumbersome than just
pulling everything tagged "Pig" from a relational database, but not so
different from what WDQ manages.
However, suppose one is combining "pig" and "country house": does one
then have to go down the tree to first identify every single country
house, and unify the hits for each one of those searches, before
computing the intersection with "pig" ? Or does one instead simply go
through the hitset for "pig" and see if it is also tagged "country house" ?
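Here's a minimal sketch of why the denormalised version stays cheap
(made-up data; the only point is that the combined query becomes a set
intersection rather than a tree walk at request time):

    # With implied topics already stored on each file item, an AND query
    # is an intersection of pre-built hit sets -- no recursive expansion
    # of "every country house" while the user waits.
    files_by_topic = {
        "Q_pig":           {"M202", "M305", "M410"},
        "Q_country_house": {"M305", "M999"},
    }

    def search_all(*qids):
        sets = [files_by_topic.get(q, set()) for q in qids]
        return set.intersection(*sets) if sets else set()

    print(search_all("Q_pig", "Q_country_house"))   # -> {'M305'}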
Now it's not a bad idea to identify "lead topics" and "implied topics"
associated with an image. Each time a new topic was added to an image,
one would want a lookup to be made on Wikidata and a list of implied
topics also to be added. Similarly if a topic identified as a "lead
topic" was changed (eg perhaps a country house had been mis-identified),
one would also want the list of implied topics to be updated (eg what
county it was in, which family it was associated with, etc).
Also the system would need to be looking out for relevant changes on
Wikidata -- eg, as a result of a new claim being added
("Gloucestershire Old Spot is a type of Pig"), what was previously an
independent lead topic "Pig" might become an implied topic.
Similarly, if something in the chain of implications was changed, the
consequences of that change would need to be reflected (eg if a parish
that the country house was in had been assigned to the wrong county; or
a work that the depicted work was derivative of had been assigned to the wrong
painter).
Having to monitor such things is the price of denormalisation.
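As a sketch of what that monitoring might look like (purely illustrative;
in practice this would presumably go through the job queue and be done
lazily in the background):

    # When a claim changes on Wikidata (say, the county a parish sits in),
    # find the files whose stored implied topics are now stale and queue
    # them for recomputation, rather than redoing inference at query time.
    parents = {"Q_house": ["Q_parish"], "Q_parish": ["Q_old_county"]}

    def chain(qid):
        """All Q-numbers reachable upwards from qid in the toy hierarchy."""
        out, frontier = set(), [qid]
        while frontier:
            for p in parents.get(frontier.pop(), []):
                if p not in out:
                    out.add(p)
                    frontier.append(p)
        return out

    lead_topics_by_file = {"M777": {"Q_house"}, "M202": {"Q_pig"}}

    def stale_files(changed_qid):
        return {f for f, leads in lead_topics_by_file.items()
                if any(changed_qid == q or changed_qid in chain(q) for q in leads)}

    # The parish is reassigned to a different county:
    print(stale_files("Q_parish"))   # -> {'M777'}  (M202 is untouched)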
The question one has to ask is what is more troublesome: having to
propagate changes like this to multiple places in a denormalised
structure where multiple copies of the same information need to be
present (which can be done in quite a lazy background way); or,
alternatively, having to navigate the normalised structure every time a
user wants to build a results set, an overhead which directly affects
the speed at which the user can be returned those results ?
* How will searching by users likely be done in practice ? *
A classic approach in combinatorial searching is to give the user an
initial set of hits, and then encourage them to refine that set.
This implies, on the basis of the current query and hit-set, trying to
identify the best refinement options to offer them.
There may be classic properties like location and time-period. Or there
may be tags that can be identified as particularly rich in the return
set. Or properties, of which those tags are the values, that are
particularly rich in the return set.
But a really classic approach in image searching is simpler than that.
It simply shows a random selection of images from the current hit set,
lets the user reveal the tags that are associated with any one of them,
and then lets the user add one of those tags to the user's query.
This is how, in the first instance, I would expect an image search on
topics to be implemented -- because it's such a well-known
technique, often works so well, and is so (comparatively)
straightforward to implement.
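In code, that refinement loop is roughly the following (a sketch with
made-up data; a real implementation would page through the hit set and
rank the revealed tags, but the shape is the same):

    # Show a small random sample of the current hit set together with the
    # extra tags on each sampled file; the user picks one tag to add to
    # the query, and the hit set is re-intersected.
    import random

    files_by_topic = {
        "Q_pig":           {"M202", "M305", "M410"},
        "Q_country_house": {"M305", "M999"},
    }
    topics_by_file = {"M202": {"Q_pig"},
                      "M305": {"Q_pig", "Q_country_house"},
                      "M410": {"Q_pig"},
                      "M999": {"Q_country_house"}}

    def refine_options(query_topics, sample_size=2):
        hits = set.intersection(*(files_by_topic[q] for q in query_topics))
        sample = random.sample(sorted(hits), min(sample_size, len(hits)))
        return {f: topics_by_file[f] - set(query_topics) for f in sample}

    # Start from "pig"; the tags revealed on the sample suggest
    # "country house" as a possible refinement to click on.
    print(refine_options(["Q_pig"]))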
So that's why (IMO) the ability to refine searches by adding another
topic needs to be so fast and responsive. In terms of design, this is
the optimisation that will affect user experience.
* What about images stored on local language wikis? *
Gerard wrote:
> I also am really interested to know when you have all those files isolated
> on Commons, how you will include media files that are NOT on Commons.. This
> is a normal use case.
The project is called Structured Data for Commons, and the wikibase
being built for it is quite often being called CommonsData.
But it seems to me there is no particular reason why it should not be
straightforward to roll out essentially the same structure to local
language wikis as well.
I would have thought it would be fairly easy to then implement a
federated search that finds all files matching these criteria on
*either* Commons *or* en-wiki (say).
Would one actually implement that all in one wikibase (ImagesData, say,
rather than CommonsData) ? That's a call I'd leave to the experts.
On the one hand, it probably would make it easy to search for all files
matching the criteria on *any* wiki.
What I suspect is more likely, and probably makes more sense, is to
converge the images themselves to all live in one place. So if the same
fair-use image was used on multiple fair-use wikis, it would only be
stored once (though each fair-use wiki would retain its own File page
for it). Such a structure should also make transfers to Commons much
easier -- compared to the copy-and-paste by bot at the moment, which
loses all the file-page history and most of the upload history.
But there are blockers in the way of that at the moment -- in
particular, blockers that need to be addressed for image patrollers and
fair-use enforcement specialists still to be able to do their job in
such a set-up. To start with, there are lots of tools they use that at
the moment only run on one wiki but would need to effectively run on two
(or perhaps, the fact that it was two wikis would need to be hidden).
They would need equivalent admin and deletion rights on both xx-wiki and
the xx partition of Images wiki. Ideally they would be able to see
changes to the two on the same watchlist. etc etc.
So it may be some time before running the same image search across all
wikis can be supported by the system itself. But it will surely be
supported through middleware sooner than that.
So that's some thoughts (or maybe some mis-thoughts) about file-topic
searching and storage.
Now, tell me what I've got wrong. :-)
All best,
James.
Currently the file page provides a set of different image sizes for the
user to directly access. These sizes are usually width-based. However, for
tall images they are height-based. The thumbnail URLs used to generate
them pass only a width.
What this means is that tall images end up with arbitrary thumbnail widths
that don't follow the set of sizes meant for the file page. The end result
from an ops perspective is that we end up with very diverse widths for
thumbnails. Not a problem in itself, but the exposure of these random-ish
widths on the file page means that we can't set a different caching policy
for non-standard widths without affecting the images linked from the file
page.
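A small sketch of why the widths come out random-ish for tall images (the
numbers are illustrative, not the actual file-page size steps):

    # For a tall image the file page offers height-based steps, but the
    # thumbnail URL can only carry a width, so the width written into the
    # URL is whatever the chosen height maps to for that aspect ratio.
    def width_for_height(target_height, orig_width, orig_height):
        return round(orig_width * target_height / orig_height)

    height_steps = [250, 500, 1000, 2000]   # hypothetical file-page steps
    portrait = (1536, 2560)                 # width x height of a 3:5 image

    for h in height_steps:
        print(h, "->", width_for_height(h, *portrait))
    # 250 -> 150, 500 -> 300, 1000 -> 600, 2000 -> 1200: none of these line
    # up with the standard width buckets, so the cache can't tell them
    # apart from arbitrary widths requested by other clients.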
I see two solutions to this problem, if we want to introduce different
caching tiers for thumbnail sizes that come from MediaWiki and thumbnail
sizes that were requested by other clients.
The first one would be to always keep the size progression on the file page
width-bound, even for soft-rotated images. The first drawback of this is
that for very skinny/very wide images the file size progression between the
sizes could become steep. The second drawback is that we'd often offer
fewer size options, because they'd be based on the smallest dimension.
The second option would be to change the syntax of the thumbnail urls in
order to allow height constraint. This is a pretty scary change.
If we don't do anything, it simply means that we'll have to apply the same
caching policy to every size smaller than 1280. We could already save quite
a bit of storage space by evicting non-standard sizes larger than that, but
sizes lower than 1280 would have to stay the way they are now.
Thoughts?
We created a set of these; you can see them at:
https://integration.wikimedia.org/ci/view/BrowserTests/view/MultimediaViewe…
In the near future I would like to refine exactly what is required for
these builds as regards
* browser
* version
* OS (if applicable)
* target environment (beta, test2, mw.o)
as well as whatever additional feature coverage we might provide.
Thanks,
-Chris
Thanks to everyone who participated in the recent Media Viewer Consultation!
Our multimedia team appreciates the many constructive suggestions to improve the viewing experience for readers and casual editors on Wikimedia projects. We reviewed about 130 community suggestions and prioritized a number of important development tasks for the next release of this feature. Those prioritized tasks have now been added to the improvements list on the consultation page. (1)
We have already started development of the most critical improvements: 'must-have’ features that this consultation helped identify and that have been validated through user testing — see research findings (2) and design prototype (3). We plan to complete all these 'must have' improvements by the end of October and will deploy them incrementally, starting this week. For more details, see our improvement plan (4) and development tasks for the next 6 weeks (5).
As we release these improvements, we will post regular updates on the Media Viewer talk page (6). We invite you to review these improvements and share your feedback.
The Foundation is also launching a file metadata cleanup drive (7) to add machine-readable attributions and licenses to files that lack them. This will lay the groundwork for the structured data partnership (8) with the Wikidata team, to enable better search and re-use of media in our projects. We encourage everyone to join these efforts.
This community consultation was very productive for us and we look forward to more collaborations in the future.
Thanks again to all our gracious contributors. We consider ourselves lucky to have so many great community partners.
Onward!
Fabrice
(1) https://meta.wikimedia.org/wiki/Community_Engagement_(Product)/Media_Viewer…
(2) https://www.mediawiki.org/wiki/Media_Viewer_Research_Round_2_(August_2014)
(3) http://multimedia-alpha.wmflabs.org/wiki/Rapa_Nui_National_Park#mediaviewer…
(4) https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Improvements
(5) http://ur1.ca/h7w5s
(6) https://www.mediawiki.org/wiki/Talk:Multimedia/About_Media_Viewer
(7) https://meta.wikimedia.org/wiki/File_metadata_cleanup_drive
(8) https://commons.wikimedia.org/wiki/Commons:Structured_data
_______________________________
Fabrice Florin
Product Manager, Multimedia
Wikimedia Foundation
https://www.mediawiki.org/wiki/User:Fabrice_Florin_(WMF)
>> > The first three we can get from pretty much either API, or extract
>> > directly from a dump file. The latter is eluding us though, for two
>> > reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, is
>> > actually in the /b/ba/ directory - but where this /b/ba/ comes from
>> > (a hash?) is unclear to us now, and it's not something we find in the
>> > dumps - though we can get it from one of the APIs.
>>
>
> Yes, /b/ba is based on the first two digits of the MD5 hash of the title:
>
> md5( "30C3_Commons_Machinery_2.jpg" ) -> ba253c78d894a80788940a3ca765debb
>
> But this is "arcane knowledge" which nobody should really rely on. The
> canonical way would be to use
>
> https://commons.wikimedia.org/wiki/Special:Redirect/file/30C3_Commons_Machi…
>
> Which generates a redirect to
>
> https://upload.wikimedia.org/wikipedia/commons/b/ba/30C3_Commons_Machinery_…
>
> To get a thumbnail, you can directly manipulate that URL, by inserting
> "thumb/" and the desired size in the correct location (maybe
> Special:Redirect can do that for you, but I do not know how):
>
>
> https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/30C3_Commons_Mach…
>
If I am not mistaken you can use thumb.php to get the needed thumb?
<https://commons.wikimedia.org/w/thumb.php?f=Example.jpg&width=100>
(That’s what I used in my CommonsDownloader [1])
[1] <
https://github.com/Commonists/CommonsDownloader/blob/master/commonsdownload…
>
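For completeness, a rough Python sketch of both routes (the hashed path
follows the "arcane knowledge" explained above, so treat it as an
illustration rather than something to rely on; Special:Redirect is said
to accept a width parameter too, though I haven't double-checked):

    import hashlib

    def hashed_path(title):
        """The /b/ba/ directory comes from the MD5 of the title."""
        name = title.replace(" ", "_")
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        return f"{digest[0]}/{digest[:2]}/{name}"

    print(hashed_path("30C3 Commons Machinery 2.jpg"))
    # -> b/ba/30C3_Commons_Machinery_2.jpg

    # The stable ways to get a thumbnail without relying on that layout:
    title = "Example.jpg"
    print(f"https://commons.wikimedia.org/w/thumb.php?f={title}&width=100")
    print(f"https://commons.wikimedia.org/wiki/Special:Redirect/file/{title}?width=100")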
Hope that helps,
--
Jean-Frédéric
Hi Jonas,
Awesome project!
I’m cc-ing the WMF Multimedia team, who might have some more answers :)
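(Not an authoritative answer, but for the thumbnail question below: the
imageinfo API can hand back a sized thumbnail URL, plus the author and
license metadata, in one request, which at least avoids scraping the file
page. As far as I know it does not tell you which sizes are already
cached on the servers, though. A sketch, assuming the requests library:)

    import requests

    params = {
        "action": "query",
        "format": "json",
        "titles": "File:30C3 Commons Machinery 2.jpg",
        "prop": "imageinfo",
        "iiprop": "url|extmetadata",
        "iiurlwidth": 640,               # the ~640px thumbnail mentioned below
    }
    r = requests.get("https://commons.wikimedia.org/w/api.php", params=params)
    page = next(iter(r.json()["query"]["pages"].values()))
    info = page["imageinfo"][0]
    print(info["thumburl"])              # URL of the 640px-wide thumbnail
    print(info["extmetadata"]["Artist"]["value"])
    print(info["extmetadata"]["LicenseShortName"]["value"])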
2014-09-04 12:26 GMT+02:00 Jonas Öberg <jonas(a)commonsmachinery.se>:
> Dear all,
>
> some of you may have been at our presentation during Wikimania and you'll
> find this familiar, but for the rest of you, I'm working with Commons
> Machinery on software that will hope to identify images on the web, even
> when they are used outside of their original context, to provide automatic
> attribution and a referral back to its origin. Imagine a blogger using a
> photo from Commons, visiting that blog and having a browser plugin overlay
> a small icon showing that the image is from Commons and inviting to find
> out more - even if the blogger forgot to attribute.
>
> We're currently working on an addon for Firefox to do just this, and we've
> previously worked out a backend to store the information we need to make
> these matches, some utilities for perceptual image hashing etc. We would
> love to work with images from Wikimedia Commons as a first dataset to
> explore how this will all work in practice.
>
> But in order to do so, we need information from Commons, and we want to
> make this as easy on the WMF servers as possible, so we'd appreciate some
> help and pointers. What we're looking at retrieving is information about
> (1) title, (2) author, (3) license, and (4) thumbnails of medium size.
>
> The first three we can get from pretty much either API, or extract
> directly from a dump file. The latter is eluding us though, for two
> reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, is actually
> in the /b/ba/ directory - but where this /b/ba/ comes from (a hash?) is
> unclear to us now, and it's not something we find in the dumps - though we
> can get it from one of the APIs.
>
> The other is thumbnail sizes. We need to retrieve a reasonably sized image
> (but in many cases less than the original size) of about 640px wide, so
> that we can then run a perceptual hash algorithm on this file.
>
> From what we can understand, you can request any size thumbnail on an
> image simply by prefixing it with the size you want (like
> 123x-Filename.jpg). But it seems really silly to always request 640x for
> instance, since that would mean the WMF servers would need to generate that
> for us specifically if the resolution doesn't exist.
>
> What we'd find much more appealing is to be able to determine before
> making the call what sizes already exist and which can be retrieved without
> the WMF servers needing to rescale them for us. And while the viewer on
> Commons does seem to offer thumbnails in various sizes, we can't seem to get
> that information from any API.
>
> We can scrape the Commons web page for this information, but we figured
> that people here might have good ideas for how we approach this with
> minimal impact on the WMF servers :)
>
> Sincerely,
> Jonas
>
>
> _______________________________________________
> Commons-l mailing list
> Commons-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/commons-l
>
>
--
Jean-Frédéric
Thanks for the list email. I must have typed it wrong and it bounced, so I
emailed you guys directly.
Things I'm interested in…
- Abandon rate (and where a user is in the process)
- browse button vs drag and drop usage
- frequency that the page returns an error, and what that error is (we
still don't have all required fields marked as required)
- frequency of users expanding/adding additional information beyond the
defaults (extra languages, additional categories)
*Jared Zimmerman * \\ Director of User Experience \\ Wikimedia Foundation
M +1 415 609 4043 \\ @jaredzimmerman <http://loo.ms/g0>
On Wed, Sep 3, 2014 at 11:19 AM, Mark Holmquist <mholmquist(a)wikimedia.org>
wrote:
> On Wed, Sep 03, 2014 at 10:49:54AM -0700, Jared Zimmerman wrote:
> > Is this done? Anything interesting to report? If not done, when is it
> > scheduled for?
>
> Hi, Jared. It would be super if you could contact the Multimedia team via
> our mailing list, multimedia(a)lists.wikimedia.org, it's much more reliable.
>
> UW is only partially EventLogging'd right now, but we have a few
> in-progress patches that should add more data.
>
> Is there anything in particular that you'd like to know?
>
> --
> Mark Holmquist
> Software Engineer, Multimedia
> Wikimedia Foundation
> mtraceur(a)member.fsf.org
> https://wikimediafoundation.org/wiki/User:MHolmquist
>