Thanks for cc'ing me Jonathan, I wouldn't have seen this otherwise.

TL;DR - Objectively measurable criteria. Clear process. No surprises.

The context in which I gave Vector as a good example *of process* was after the presentation about the future of 'Flow' at Wikimania.[1] I highly recommend reading the slides of that session if you haven't already - great stuff![2] In particular, I was talking about how the Usability Initiative team were the first at the WMF to use an opt-in Beta process. It was the use of iterative development, progressive rollout, and closed-loop feedback that made their work a successful *process*. I wasn't talking about the Vector skin per se.

Significantly, they had a publicly declared and measurable criterion for determining what counted as "community acceptance/support": an 80% retention rate among opt-in users. They did not lock down the features of one version of their beta and move to the next version until they could show that 80% of the people who tried it preferred it. Moreover, they stuck to this objective criterion for measuring consensus support all the way to the final rollout.[3]
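
(To make this concrete: the "gate" they held themselves to really was that simple. The snippet below is purely an illustrative sketch with invented numbers and names, not the Usability Initiative's actual tooling.)

    # Illustrative only: graduate a beta feature when opt-in retention >= 80%.
    RETENTION_THRESHOLD = 0.80  # the publicly declared criterion

    def retention_rate(opted_in, opted_out):
        """Share of users who tried the beta and kept it switched on."""
        return 0.0 if opted_in == 0 else float(opted_in - opted_out) / opted_in

    def may_graduate(opted_in, opted_out):
        return retention_rate(opted_in, opted_out) >= RETENTION_THRESHOLD

    # e.g. 10,000 people tried the beta and 1,500 switched it back off:
    print(may_graduate(10000, 1500))  # True - 85% retention clears the bar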

This system was a great way to identify people who were willing to change but had concerns, as opposed to getting bogged down by people who would never willingly accept a change, or by people who would accept every change regardless. It also meant that those people became 'community advocates' for the new system, because they had positive experiences of their feedback being taken into account.

And I DO remember the process, and the significance the team (which included Trevor Parscal) attached to it, because in 2009 I interviewed the whole team in person for the Wikipedia Weekly podcast.[4] Far from "looking at the past through rose coloured glasses", I recall the specific pain-points on the day the Vector skin became the default: the inter-language links list being auto-collapsed, and the Wikipedia logo being updated.[5] The fact that it was THESE things that caused all the controversy on the day Vector went from Beta to opt-out is instructive. These were the two things that were NOT part of the Beta testing period - no process, just surprises. The people who had valid feedback had not been given an opportunity to provide it, so it came instead as swift criticism on mailing lists.[6]

My support for the concept of a clearly defined, objectively measured rollout *process* for new features is not new. When Fabrice announced "beta features" in November 2013, I was the first to respond - referring to the same examples and telling the same story about the Usability Initiative's processes.[7]

Then, as now, the "beta features" tab lists the number of users who have opted in to a tool, but there is no comparative or objective explanation of what that number actually means! For example, it tells me that 33,418 people have opted in to "Hovercards", but is that good? How long did it take to reach that level? How many people have switched it off? What proportion of the active editorship is that? And most importantly - what relationship does this number have to whether Hovercards will 'graduate' from, or 'fail', the opt-in Beta process?
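
(Purely to illustrate the kind of context I mean - the figures and field names below are hypothetical, since I don't know what the Beta Features system actually records:)

    # Illustrative only: put a raw opt-in count into context.
    from datetime import date

    opted_in_now   = 33418   # the number shown on the Beta features tab today
    ever_opted_in  = 41000   # hypothetical: everyone who ever switched it on
    active_editors = 80000   # hypothetical: active editors across projects
    launched       = date(2014, 2, 6)   # hypothetical launch date
    today          = date(2015, 7, 28)

    retention       = float(opted_in_now) / ever_opted_in    # how many stayed?
    share_of_active = float(opted_in_now) / active_editors   # how representative?
    opt_ins_per_day = float(ever_opted_in) / (today - launched).days

    print("retention {:.0%}, share of active editors {:.0%}, ~{:.0f} opt-ins/day"
          .format(retention, share_of_active, opt_ins_per_day))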

Which brings me to the point I made to Jonathan, and also to Pau, at Wikimania about the future of Flow.
I've come to believe that the two things we Wikimedians hate most are:
1) The absence of a clear process, or a failure to follow that process
2) Being surprised

We can, generally, abide outcomes and decisions that we don't like (e.g. article-deletion debates) as long as the process by which the decision was arrived at was clearly explained and objectively followed. I believe this is why there was so much anger and frustration about the 'autoconfirm article creation trial' on en.wp[8] and the 'superprotect' controversy - because they represented, respectively, a failure to follow a process and a surprise.

So, even more than the Vector skin or even the Visual Editor, Flow ABSOLUTELY MUST have a clear, objectively measurable *process* for measuring community consensus, because it will be replacing community-designed and community-operated workflows (e.g. [9]). This means that once it is enabled on a particular workflow:
1) an individual user can't opt out and return to the old system.
2) it will most affect, and be most used by, admins and other very-active-users. 
Therefore, I believe that this development must be an iterative process of working on one workflow on one wiki at a time, with objective measures of consensus support that are at least partially *determined by the affected community itself*. This will be the only way that Flow can gain community consensus for replacing the existing template/sub-page/gadget/transclusion/category-based workflows.[10]

Because Flow will be updating admin-centric workflows, if it is rolled out in a way that is anything less than this, it will strike the community as hubris - "it is necessary to destroy the town in order to save it".[11]

-Liam / Wittylama

P.S. While you're at it, please make ALL new features go through the "Beta features" system, with a consistent and discoverable process. As it is, some things live there permanently in limbo, some things DO have a process associated with them, and some things bypass the beta system altogether. As bawolff said, this means people feel they don't have any influence over the rollout process and therefore choose not to be involved at all.[12]

[1] https://wikimania2015.wikimedia.org/wiki/Submissions/User(s)_Talk(ing):_The_future_of_wiki_discussions 
[2] https://wikimania2015.wikimedia.org/wiki/File:User(s)_Talk(ing)_-_Wikimania_2015.pdf
[3] https://blog.wikimedia.org/2010/05/13/a-new-look-for-wikipedia/
[4] Sorry - I can't find the file anymore though. This was the page: https://en.wikipedia.org/wiki/Wikipedia:WikipediaWeekly/Episode76
[5] https://blog.wikimedia.org/2010/05/13/wikipedia-in-3d/
[6] https://commons.wikimedia.org/wiki/Talk:Wikipedia/2.0#Logo_revisions_need_input
[7] https://lists.wikimedia.org/pipermail/wikimedia-l/2013-November/128896.html
[8] https://en.wikipedia.org/wiki/Wikipedia:Autoconfirmed_article_creation_trial
[9] https://wikimania2015.wikimedia.org/w/index.php?title=File:User(s)_Talk(ing)_-_Wikimania_2015.pdf&page=4
[10] https://wikimania2015.wikimedia.org/w/index.php?title=File:User(s)_Talk(ing)_-_Wikimania_2015.pdf&page=8
[11] https://en.wikipedia.org/wiki/B%E1%BA%BFn_Tre#Vietnam_War
[12] https://lists.wikimedia.org/pipermail/design/2015-July/002355.html


wittylama.com
Peace, love & metadata

On 27 July 2015 at 22:51, Jonathan Morgan <jmorgan@wikimedia.org> wrote:
On Mon, Jul 27, 2015 at 11:02 AM, Ryan Lane <rlane32@gmail.com> wrote:

For instance, if a change negatively affects an editor's workflow, it should
be reflected in data like "avg/p95/p99 time for x action to occur", where x
is some normal editor workflow.


That is indeed one way you can provide evidence of correlation; but in live deployments (which are, at best, quasi-experiments), you seldom get results that are as unequivocal as the example you're presenting here.  And quantifying the influence of a single causal factor (such as the impact of a particular UI change on time-on-task for this or that editing workflow) is even harder.
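
(To be clear, the arithmetic itself is the easy part - something like the sketch below, with made-up numbers; the hard part is attributing a change in those numbers to any single cause.)

    # Illustrative only: avg/p95/p99 of "seconds to complete an edit".
    import math

    edit_times = [42, 55, 48, 61, 39, 300, 47, 52, 44, 650, 58, 49]

    def percentile(values, p):
        """Nearest-rank percentile of a non-empty sample."""
        ordered = sorted(values)
        k = max(0, int(math.ceil(p / 100.0 * len(ordered))) - 1)
        return ordered[k]

    avg = sum(edit_times) / float(len(edit_times))
    p95, p99 = percentile(edit_times, 95), percentile(edit_times, 99)

    print("avg {:.0f}s, p95 {:.0f}s, p99 {:.0f}s".format(avg, p95, p99))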

Knowing that something occurs isn't the same as knowing why. Take the English Wikipedia editor decline. There has been a lot of good research on this subject, and we have confidently identified a set of factors that are likely contributors. Some of these can be directly measured: the decreased retention rate of newcomers; the effect of early, negative experiences on newcomer retention; a measurable increase over time in phenomena (like reverts, warnings, new article deletions) that likely cause those negative experiences. But none of us who have studied the editor decline believe that these are the only factors. And many community members who have read our research don't even accept our premises, let alone our findings.

I'm not at all afraid of sounding pedantic here (or of writing a long-ass wall of text), because I think that many WMF and former-WMF participants in this discussion are glossing over important stuff: Yes, we need a more evidence-based product design process. But we also need a more collaborative, transparent, and iterative deployment process. Having solid research and data on the front-end of your product lifecycle is important, but it's not some kind of magic bullet and is no substitute for community involvement in product design (throughout the lifecycle).

We have an excellent Research & Data team. The best one we've ever had at WMF. Pound-for-pound, they're as good as or better than the Data Science teams at Google or Facebook. None of them would ever claim, as you seem to here, that all you need to build good products are well-formed hypotheses and access to buckets of log data. 

I had a great conversation with Liam Wyatt at Wikimania (cc'ing him, in case he doesn't follow this list). We talked about strategies for deploying new products on Wikimedia projects: what works, what doesn't. He held up the design/deployment process for Vector as an example of good process, one that we should (re)adopt. 

Vector was created based on extensive user research and community consultation[1]. Then WMF made a beta, and invited people across projects to opt in and try it out on prototype wikis[2]. The product team set public criteria for when it would release the product as the default across production projects: retention of 80% of the Beta users who had opted in, after a certain amount of time. When a beta tester opted out, they were sent a survey to find out why[3]. The product team attempted to triage the issues reported in these surveys, address them in the next iteration, or, if they couldn't or wouldn't fix them, at least publicly acknowledge the feedback. Then they created a phased deployment schedule, and stuck to it[4].

This was, according to Liam (who's been around the movement a lot longer than most of us at WMF), a successful strategy. It built trust, and engaged volunteers as both evangelists and co-designers. I am personally very eager to hear from other community members who were around at the time what they thought of the process, and/or whether there are other examples of good WMF product deployments that we could crib from as we re-assess our current process. From what I've seen, we still follow many good practices in our product deployments, but we follow them haphazardly and inconsistently. 

Whether or not we (WMF) think it is fair that we have to listen to "vocal minorities" (Ryan's words), these voices often represent and influence the sentiments of the broader, less vocal contributor base in important ways. And we won't be able to get people to accept our conclusions, however rigorously we demonstrate them or however carefully we couch them in scientific trappings, if they think we're fundamentally incapable of building something worthwhile, or of deploying it responsibly.

We can't run our product development like "every non-enterprise software company worth a damn" (Steven's words), and that shouldn't be our goal. We aren't a start-up (most of which fail) that can focus all our resources on one radical new idea. We aren't a tech giant like Google or Facebook, that can churn out a bunch of different beta products, throw them at a wall and see what sticks. 

And we're not a commercial community-driven site like Quora or Yelp, which can constantly monkey with its interface and feature set in order to maximize ad revenue, or try out any old half-baked strategy to monetize its content. There's a fundamental difference between Wikimedia and Quora. In Quora's case, a for-profit company built a platform and invited people to use it. In Wikimedia's case, a bunch of volunteers created a platform, filled it with content, and then a non-profit company was created to support that platform, content, and community.

Our biggest opportunity to innovate, as a company, is in our design process. We have a dedicated, multi-talented, active community of contributors. Those of us who are getting paid should be working on strategies for leveraging that community to make better products, rather than trying to come up with new ways to perform end runs around them.