Multimedia September 2014

multimedia@lists.wikimedia.org

26 participants
24 discussions

by Fabrice Florin

Hi guys, I am getting ready to turn off my work email until I return from vacation on Sep. 24. Before I go, here’s my recommended to-do list for next week, with a focus on Media Viewer improvements. It is based in part on this updated improvements plan: https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Improvements#Tasks I considered emailing each of you individually about specific tasks on this list, but a single message to the team may be more helpful, so we’re all on the same page. Gilles * Lead Wednesday's sprint planning meeting * Update the Etherpad, Mingle site and Improvements page * Deploy and monitor 'Pre-render thumbnails on backend’ (#301) * Investigate better performance metrics solution for Media Viewer (#881) * Run "versus" test on a dedicated labs instance (#884) Mark * Complete and deploy ‘More Details’ button with project icon (#830, 873) * Complete and deploy Download and Share buttons (#841, 834) * Develop Disable / Enable tools (#836, 719, 870) * Show clearer indication of authorship (#837) * Work closely with Pau on all design questions (via IRC in your AM, then email in the PM) Gergo * Disable MediaViewer for logged-in Commons users (#822) * Help with back-end on Disable / Enable tools, as needed (#836, 719, 870) * Update 'File: Page vs. MMV' performance dashboard with improved user data (#880) * Performance histogram for MediaViewer and File page (#815) * Help with attribution/licensing improvements for metadata cleanup effort Pau * Update specs with all assets for Disable / Enable tools (#836, 719, 870) * Finalize design for Caption above the fold + how to expand/collapse it (#589) * Propose a design for how to access ‘Help’ or ‘About’ pages from Media Viewer * Join our IRC MM channel at the end of your day (to work more closely with Mark) Keegan * Respond to new comments on consultation or MV talk pages * Thank contributors to consultation whose suggestions made it to the list * Archive old comments on Media Viewer talk page Abbey * Write findings from Disable / Enable user testing with Pau * Add a few more screenshots on August report (with credits) * Post wiki report on July user studies, for transparency reasons See Current Cycle wall for more details: http://ur1.ca/h7w5s I hope this outline is helpful. There may be other priority tasks that are not listed here, but I wanted to give you a sense of action items that seem important for the next 2 weeks, from my standpoint. In my absence, Erik and Howie will tag team to provide product input on all this. Erik will join next Wednesday’s sprint planning meeting and can help test important features for deployment. I look forward to seeing you all the following Wednesday, Sep. 24, at the 9am PT weekly meeting. I will not check work email during that time. If you need to reach me urgently, my personal email is fabriceflorin(a)gmail.com . Good luck and speak to you soon! Fabrice _______________________________ Fabrice Florin Product Manager, Multimedia Wikimedia Foundation https://www.mediawiki.org/wiki/User:Fabrice_Florin_(WMF)

9 years, 8 months

Re: [Multimedia] [Wikidata-l] Commons file-topic searching and storage (was Re: Commons Categories again)

by James Heald

Hi Thomas, I'm not really talking about the specific query *engine* that will work on the file topic data. (Well, maybe a little, in general terms about some of the functionality we might want in such a search). What I'm more talking about is the kind of data that will likely need to stored on the CommonsData wikibase to make any kind of such query engine *possible* with reasonable speed -- in particular not just the most specific Q-numbers that apply to a file, but (IMO) *any* Q-number that the file should be returned from if the topic corresponding to that Q-number was searched for. I'm saying that such a Q-number needs to be included on the item on CommonsData for the file -- it's not enough that if used Wikidata to look up the more specific Q-number, then the less specific Q-number would be returned: I'm saying that lookup already needs to have been done (and maintained), so the less specific Q-number is already sitting on CommonsData when someone comes to search for it. This doesn't need to be a manual process (though the presence of a Q-number on a CommonsData item perhaps needs to subject to manual overrule, in case the inference chain has gone wrong, and it really isn't relevant); but what I'm saying is that you can't wait to do the inference when the search request comes in -- instead the relevant Q-numbers for each file need to be pre-computed, and stored on the CommonsData item, so that when the search request comes in, they are already there to be searched on. That denormalisation of information really needs to be in place whatever the fine coding of the engine -- it's data design, rather than engine coding. -- James. On 13/09/2014 20:56, Thomas Douillard wrote: > Hi James, I don't understand (I must admit I did not read the whole topic). > Are we talking about a specific query engine ? The one the development team > will implement in Wikibase, or are we talking of something else ? > > If we do not know that, I seems difficult to have this conversation at that > point. > > 2014-09-13 21:51 GMT+02:00 James Heald <j.heald(a)ucl.ac.uk>: > >> "Let the ops worry about time" is not an answer. >> >> We're talking about the something we're hoping to turn into a world-class >> mass-use image bank, and its front-line public-facing search capability. >> >> That's on an altogether different scale to WDQ running a few hundred >> searches a day. >> >> Moreover, we're talking about a public-facing search capability, where >> you're user clicks a tag and they want an updated results set *instantly* >> -- their sitting around while the server makes a cup of tea, or declares >> the query is too complex and goes into a sulk is not an option. >> >> >> If the user wants a search on "palace" and "soldier", there simply is not >> time for the server to first recursively build a list of every palace it >> knows about, then every image related to each of those palaces, then every >> soldier it knows about, every image related to each of those soldiers, then >> intersect the two (very big) lists before it can start delivering any image >> hits at all. That is not acceptable. A random internet user wants those >> hits straight away. >> >> The only way to routinely be able to deliver that is denormalisation. >> >> It's not a question of just buying some more blades and filling up some >> more racks. That doesn't get you a big enough factor of speedup. >> >> What we have is a design challenge, which needs a design solution. >> >> -- James. >> >> >> >> >> >>> Let the ops worry about time, I have not heard them complain about a >>> search >>> dystopia yet. Even the Wiki Data Query has reasonable response time >>> compairing to the power it offers in the queries. And that is on wmflabs, >>> not a production server. >>> You're saying that even when we make the effort to get structured linked >>> data we should not exploit the single most important advantage it offers. >>> It does not make sense. >>> It almost like just repeating the category sysem again but with another >>> software (albeit it offers multilinguality). >>> >>> /Jan >>> >>> >>> >>> _______________________________________________ >>> Multimedia mailing list >>> Multimedia(a)lists.wikimedia.org >>> https://lists.wikimedia.org/mailman/listinfo/multimedia >>> >>> >> >> _______________________________________________ >> Wikidata-l mailing list >> Wikidata-l(a)lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wikidata-l >> > > > > _______________________________________________ > Wikidata-l mailing list > Wikidata-l(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wikidata-l >

9 years, 8 months

Commons file-topic searching and storage (was Re: Commons Categories again)

by James Heald

Hey, all! This new post is to respond to various points from GerardM and P. Blissenbach, previously responding to me on wikidata-l. I'm also cross-posting it to multimedia-l, who no doubt will be able to put me straight about lots of things. In particular, where will "topics" to be associated with image files be stored, and how will they be searched ? * Where will topics be stored ? * On the question of where the list of topics will be stored, the initial thoughts of the Structured Data team would seem to be clear: they are to be stored on the new CommonsData wikibase. See eg: https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Sl… ("topic links") https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVj… (API design and class diagram) * How would topics be searched ? * Gerard wrote: > I am really interested how you envision searching when all those topics are > isolated and attached to each file.. The trite answer is: in the same way you would search any other database -- by setting an index. It should be very very simple to pull the identities of all files on CommonsData related to topic Qnnnnn. * Why not store information with the Q-items on WikiData, regarding what files are related ? * One could do this. Essentially what we have here is a many-many join. Each file can have many topics. Each topic can have many files. So the classic relational approach would be a separate join table. Moving the information out of main Wikidata makes Wikidata smaller and leaner to query, particularly for queries that simply aren't interested in images. As to whether you really do have a join table, or whether you just consider it all part of CommonsData, that's really up to the developers. * What about the natural hierarchical structure ? * eg Leonardo da Vinci --> Mona Lisa --> --> Files depicting the Mona Lisa Shouldn't the fact that it was Leonardo that painted the Mona Lisa only be stored in one place, on Mona Lisa, (or perhaps on Leonardo); but *not* multiple times, separately on every single depiction file? *A*: Probably not, for several reasons. Trying to find things (and also, to accurately represent things) in a hierarchical structure is the bane of Commons at the moment; it also makes searching Wikidata significantly non-trivial. So the most significant reason is retrieval. Suppose we have an image with topic "Gloucestershire Old Spot" (a breed of pig). We also want to be able to retrieve the image rapidly if somebody keys in "Pig". Similarly, if we have an image of the "Mona Lisa", we also want it to be in the set of images with somebody keying in "Leonardo" For simple searches, one could image walking down the wikidata tree from "Pig" or from "Leonardo", compiling a list of derived search terms, and then building a union set of hits. Slightly more cumbersome than just pulling everything tagged "Pig" from a relational database, but not so different from what WDQ manages. However, suppose one is combining "pig" and "country house", does one then have to go down the tree to first identify every single country house, and unify the hits for each one of those searches, before computing the intersection with "pig" ? Or does one instead simply go through the hitset for "pig" and see if it is also tagged "country house" ? Now it's not a bad idea to identify "lead topics" and "implied topics" associated with an image. Each time a new topic was added to an image, one would want a lookup to be made on Wikidata and a list of implied topics also to be added. Similarly if a topic identified as a "lead topic" was changed (eg perhaps a country house had been mis-identified), one would also want the list of implied topics to be updated (eg what county it was in, which family it was associated with, etc). Also the system would need to be looking out for relevant changes on Wikidata -- eg if as a result of a new claim being added ("Gloucestershire Old Spot is a type of Pig"), what was previously an independent lead topic "Pig" might become an implied topic. Similarly, if something in the chain of implications was changed, the consequences of that change would need to be reflected (eg if a parish that the country house was in had been assigned to the wrong county; or a work that the work was derivative of had been assigned to the wrong painter). Having to monitor such things is the price of denormalisation. The question one has to ask is what is more troublesome: having to propagate changes like this to multiple places in a denormalised structure where multiple copies of the same information need to be present (which can be done in quite a lazy background way); or, alternatively, having to navigate the normalised structure every time a user wants to build a results set, an overhead which directly affects the speed at which the user can be returned those results ? * How will searching by users likely be done in practice ? * A classic approach in combinatorial searching is to give the user an initial set of hits, and then encourage them to refine that set. This implies, on the basis of the current query and hit-set, trying to identify the best refinement options to offer them. There may be classic properties like location and time-period. Or there may be tags that can be identified as particularly rich in the return set. Or properties which those tags are the values of that are particularly rich in the return set. But a really classic approach in image searching is simpler than that. It simply shows a random selection of images from the current hit set, lets the user reveal the tags that are associated with any one of them, and then lets the user add one of those tags to the user's query. This is how, in the first instance, I would expect an image search on topics to be first implemented -- because it's such a well-known technique, often works so well, and is so (comparatively) straightforward to implement. So that's why (IMO) the ability to refine searches by adding another topic needs to be so fast and responsive. In terms of design, this is the optimisation that will affect user experience. * What about images stored on local language wikis? * Gerard wrote: > I also am really interested to know when you have all those files isolated > on Commons, how you will include media files that are NOT on Commons.. This > is a normal use case. The project is called Structured Data for Commons, and the wikibase being built for it is quite often being called CommonsData. But it seems to me there is no particular reason why it should not be straightforward to roll out essentially the same structure to local language wikis as well. I would have thought it would be fairly easy to then implement a federated search, that finds all files matching these criteria on *either* Commons *or* en-wiki (say). Would one actually implement that all in one wikibase (ImagesData, say, rather than CommonsData) ? That's a call I'd leave to the experts. On the one hand, it probably would make it easy to search for all files matching the criteria on *any* wiki. What I suspect is more likely, and probably makes more sense, is to converge the images themselves to all live in one place. So if the same fair-use image was used on multiple fair-use wikis, it would only be stored once (though each fair-use wiki would retain it's own File page for it). Such a structure should also make transfers to Commons much easier -- compared to the copy-and-paste by bot at the moment, which loses all the file-page history and most of the upload history. But there are blockers in the way of that at the moment -- in particular, blockers that need to be addressed for image patrollers and fair-use enforcement specialists still to be able to do their job in such a set-up. To start with there are lots of tools they use, that at the moment only run on one wiki but would need to effectively run on two (or perhaps, the fact that it was two wikis would need to be hidden). They would need equivalent admin and deletion rights on both xx-wiki and the xx partition of Images wiki. Ideally they would be able to see changes to the two on the same watchlist. etc etc. So it may be some time before running the same image search across all wikis can be supported by the system itself. But it will surely be supported through middleware sooner than that. So that's some thoughts (or maybe some mis-thoughts) about file-topic searching and storage. Now, tell me what I've got wrong. :-) All best, James.

9 years, 8 months

Image sizes on the file page

by Gilles Dubuc

Currently the file page provides a set of different image sizes for the user to directly access. These sizes are usually width-based. However, for tall images they are height-based. The thumbnail urls, which are used to generate them pass only a width. What this means is that tall images end up with arbitrary thumbnail widths that don't follow the set of sizes meant for the file page. The end result from an ops perspective is that we end up with very diverse widths for thumbnails. Not a problem in itself, but the exposure of these random-ish widths on the file page means that we can't set a different caching policy for non-standard widths without affecting the images linked from the file page. I see two solutions to this problem, if we want to introduce different caching tiers for thumbnail sizes that come from mediawiki and thumbnail sizes that were requested by other things. The first one would be to always keep the size progression on the file page width-bound, even for soft-rotated images. The first drawback of this is that for very skinny/very wide images the file size progression between the sizes could become steep. The second drawback is that we'd often offer less size options, because they'd be based on the smallest dimension. The second option would be to change the syntax of the thumbnail urls in order to allow height constraint. This is a pretty scary change. If we don't do anything, it simply means that we'll have to apply the same caching policy to every size smaller than 1280. We could already save quite a bit of storage space by evicting non-standard sizes larger than that, but sizes lower than 1280 would have to stay the way they are now. Thoughts?

9 years, 8 months

MMV browser tests for IE versions

by Chris McMahon

We created a set of these, you can see them at: https://integration.wikimedia.org/ci/view/BrowserTests/view/MultimediaViewe… In the near future I would like to refine exactly what is required for these builds as regards * browser * version * OS (if applicable) * target environment (beta, test2, mw.o) as well as whatever additional feature coverage we might provide. Thanks, -Chris

9 years, 8 months

Media Viewer Consultation Results

by Fabrice Florin

Thanks to everyone who participated in the recent Media Viewer Consultation! Our multimedia team appreciates the many constructive suggestions to improve the viewing experience for readers and casual editors on Wikimedia projects. We reviewed about 130 community suggestions and prioritized a number of important development tasks for the next release of this feature. Those prioritized tasks have now been added to the improvements list on the consultation page. (1) We have already started development of the most critical improvements: 'must-have’ features that this consultation helped identify and that have been validated through user testing — see research findings (2) and design prototype (3). We plan to complete all these 'must have' improvements by the end of October and will deploy them incrementally, starting this week. For more details, see our improvement plan (4) and development tasks for the next 6 weeks (5). As we release these improvements, we will post regular updates on the Media Viewer talk page (6). We invite you to review these improvements and share your feedback. The foundation is also launching a file metadata cleanup drive (7) to add machine-readable attributions and licenses on files lack them. This will lay the groundwork for the structured data partnership (8) with the Wikidata team, to enable better search and re-use of media in our projects. We encourage everyone to join these efforts. This community consultation was very productive for us and we look forward to more collaborations in the future. Thanks again to all our gracious contributors. We consider ourselves lucky to have so many great community partners. Onward! Fabrice (1) https://meta.wikimedia.org/wiki/Community_Engagement_(Product)/Media_Viewer… (2) https://www.mediawiki.org/wiki/Media_Viewer_Research_Round_2_(August_2014) (3) http://multimedia-alpha.wmflabs.org/wiki/Rapa_Nui_National_Park#mediaviewer… (4) https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Improvements (5) http://ur1.ca/h7w5s (6) https://www.mediawiki.org/wiki/Talk:Multimedia/About_Media_Viewer (7) https://meta.wikimedia.org/wiki/File_metadata_cleanup_drive (8) https://commons.wikimedia.org/wiki/Commons:Structured_data _______________________________ Fabrice Florin Product Manager, Multimedia Wikimedia Foundation https://www.mediawiki.org/wiki/User:Fabrice_Florin_(WMF)

9 years, 8 months

Odd requestlog entry

by Oliver Keyes

URL runs " http://commons.wikimedia.org/w/index.php?title=Image:Grandville_torrent.jpg…", user_agent is "MediaWiki/1.24wmf19". Is this you guys? -- Oliver Keyes Research Analyst Wikimedia Foundation

9 years, 8 months

Re: [Multimedia] [Commons-l] Hashing Wikimedia Commons

by Jean-Frédéric

> > The first three we can get from pretty much either API, or extract >> directly from >> > a dump file. The latter is eluding us though, for two reasons. One is >> that a >> > file, like 30C3_Commons_Machinery_2.jpg, is actually in the /b/ba/ >> directory - >> > but where this /b/ba/ comes from (a hash?) is unclear to us now, and >> it's not >> > something we find in the dumps - though we can get it from one of the >> APIs. >> > > Yes, /b/ba ist based on the first two digits of the MD5 hash of the title: > > md5( "30C3_Commons_Machinery_2.jpg" ) -> ba253c78d894a80788940a3ca765debb > > But this is "arcane knowledge" which nobody should really rely on. The > canonical > way would be to use > > https://commons.wikimedia.org/wiki/Special:Redirect/file/30C3_Commons_Machi… > > Which generates a redirect to > > https://upload.wikimedia.org/wikipedia/commons/b/ba/30C3_Commons_Machinery_… > > To get a thumbnail, you can directly manipulate that URL, by inserting > "thumb/" > and the desired size in the correct location (maybe Special:Redirect can > do that > for you, but I do not know how): > > > https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/30C3_Commons_Mach… > If I am not mistaken you can use thumb.php to get the needed thumb? <https://commons.wikimedia.org/w/thumb.php?f=Example.jpg&width=100> (That’s what I used in my CommonsDownloader [1]) [1] < https://github.com/Commonists/CommonsDownloader/blob/master/commonsdownload… > Hope that helps, -- Jean-Frédéric

9 years, 8 months

Re: [Multimedia] [Commons-l] Hashing Wikimedia Commons

by Jean-Frédéric

Hi Jonas, Awesome project! I’m cc-ing the WMF Multimedia team, who might have some more answers :) 2014-09-04 12:26 GMT+02:00 Jonas Öberg <jonas(a)commonsmachinery.se>: > Dear all, > > some of you may have been at our presentation during Wikimania and you'll > find this familiar, but for the rest of you, I'm working with Commons > Machinery on software that will hope to identify images on the web, even > when they are used outside of their original context, to provide automatic > attribution and a referral back to its origin. Imagine a blogger using a > photo from Commons, visiting that blog and having a browser plugin overlay > a small icon showing that the image is from Commons and inviting to find > out more - even if the blogger forgot to attribute. > > We're currently working on an addon for Firefox to do just this, and we've > previously worked out a backend to store the information we need to make > these matches, some utilities for perceptual image hashing etc. We would > love to work with images from Wikimedia Commons as a first dataset to > explore how this will all work in practice. > > But in order to do so, we need information from Commons, and we want to > make this as easy on the WMF servers as possible, so we'd appreciate some > help and pointers. What we're looking at retrieving is information about > (1) title, (2) author, (3) license, and (4) thumbnails of medium size. > > The first three we can get from pretty much either API, or extract > directly from a dump file. The latter is eluding us though, for two > reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, is actually > in the /b/ba/ directory - but where this /b/ba/ comes from (a hash?) is > unclear to us now, and it's not something we find in the dumps - though we > can get it from one of the APIs. > > The other is thumbnail sizes. We need to retrieve a reasonably sized image > (but in many cases less than the original size) of about 640px wide, so > that we can then run a perceptual hash algorithm on this file. > > From what we can understand, you can request any size thumbnail on an > image simply by prefixing it with the size you want (like > 123x-Filename.jpg). But it seems really silly to always request 640x for > instance, since that would mean the WMF servers would need to generate that > for us specifically if the resolution doesn't exist. > > What we'd find much more appealing is to be able to determine before > making the call what sizes already exist and which can be retrieved without > the WMF servers needing to rescale them for us. And while the viewer on > Commons do seem to offer thumbnails in various sizes, we can't seem to get > that information from any API. > > We can scrape the Commons web page for this information, but we figured > that people here might have good ideas for how we approach this with > minimal impact on the WMF servers :) > > Sincerely, > Jonas > > > _______________________________________________ > Commons-l mailing list > Commons-l(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/commons-l > > -- Jean-Frédéric

9 years, 8 months

Re: [Multimedia] instrumentation of current (old) upload wizard

by Jared Zimmerman

thanks for the list email, I must have typed it wrong and it bounced, so I emailed you guys directly. Things I'm interested in… - Abandon rate (and where a user is in the process) - browse button vs drag and drop usage - frequency that the page returns an error, and what that error is (we still don't have all required fields marked as required) - frequency of users expanding/adding additional information beyond the defaults (extra languages, additional categories) *Jared Zimmerman * \\ Director of User Experience \\ Wikimedia Foundation M +1 415 609 4043 \\ @jaredzimmerman <http://loo.ms/g0> On Wed, Sep 3, 2014 at 11:19 AM, Mark Holmquist <mholmquist(a)wikimedia.org> wrote: > On Wed, Sep 03, 2014 at 10:49:54AM -0700, Jared Zimmerman wrote: > > I this done? anything interesting to report? if not done, when is is > > scheduled for? > > Hi, Jared. It would be super if you could contact the Multimedia team via > our mailing list, multimedia(a)lists.wikimedia.org, it's much more reliable. > > UW is only partially EventLogging'd right now, but we have a few > in-progress > patches that should add more data. > > Is there anything in particular that you'd like to know? > > -- > Mark Holmquist > Software Engineer, Multimedia > Wikimedia Foundation > mtraceur(a)member.fsf.org > https://wikimediafoundation.org/wiki/User:MHolmquist >

9 years, 8 months

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

Multimedia September 2014