Studying the relationship between remixing & learning

With more than 10 million users, the Scratch online community is the largest online community where kids learn to program. Since it was created, a central goal of the community has been to promote “remixing” — the reworking and recombination of existing creative artifacts. As the video above shows, remixing programming projects in the current web-based version of Scratch is as easy as clicking on the “see inside” button on a project page and then clicking on the “remix” button in the web-based code editor. Today, close to 30% of projects on Scratch are remixes.

Remixing is central to Scratch because its designers believed that it could play an important role in learning. After all, Scratch was designed first and foremost as a learning community with its roots in the Constructionist framework developed at MIT by Seymour Papert and his colleagues. The design of the Scratch online community was inspired by Papert’s vision of a learning community similar to Brazilian Samba schools (Henry Jenkins writes about his experience of Samba schools in the context of Papert’s vision here), and a comment Marvin Minsky made in 1984:

Adults worry a lot these days. Especially, they worry about how to make other people learn more about computers. They want to make us all “computer-literate.” Literacy means both reading and writing, but most books and courses about computers only tell you about writing programs. Worse, they only tell about commands and instructions and programming-language grammar rules. They hardly ever give examples. But real languages are more than words and grammar rules. There’s also literature – what people use the language for. No one ever learns a language from being told its grammar rules. We always start with stories about things that interest us.

In a new paper — titled “Remixing as a pathway to Computational Thinking” — recently published at the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), we used a series of quantitative measures of online behavior to try to uncover evidence that might support the theory that remixing in Scratch is positively associated with learning.

Of course, because Scratch is an informal environment with no set path for users, no lesson plan, and no quizzes, measuring learning is an open problem. In our study, we built on two different approaches to measuring learning in Scratch. The first approach considers the number of distinct types of programming blocks a user has employed over her lifetime in Scratch (there are 120 in total) — something that can be thought of as a block repertoire or vocabulary. This measure has been used to model informal learning in Scratch in an earlier study. Using this approach, we hypothesized that users who remix more will have a faster rate of growth in their code vocabulary.
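To make the vocabulary measure concrete, here is a minimal sketch of how it could be computed. The record layout and field names are our illustration, not the actual Scratch schema:

    # Sketch: computing a user's cumulative "block vocabulary" over time.
    # The records and field names below are illustrative, not the actual
    # Scratch data model.
    from collections import defaultdict

    projects = [
        {"user": "alice", "shared_at": "2012-01-03", "blocks": {"move", "say"}},
        {"user": "alice", "shared_at": "2012-02-11", "blocks": {"move", "repeat"}},
        {"user": "alice", "shared_at": "2012-03-20", "blocks": {"if", "broadcast"}},
    ]

    def vocabulary_growth(projects):
        """Return, per user, the running count of distinct block types used."""
        seen = defaultdict(set)      # user -> block types used so far
        growth = defaultdict(list)   # user -> vocabulary size after each project
        for p in sorted(projects, key=lambda p: p["shared_at"]):
            seen[p["user"]] |= p["blocks"]
            growth[p["user"]].append(len(seen[p["user"]]))
        return dict(growth)

    print(vocabulary_growth(projects))  # {'alice': [2, 3, 5]}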

Controlling for a number of factors (e.g., the user’s age and general level of activity), we found evidence of a small but positive relationship between the number of remixes a user has shared and her block vocabulary, as measured by the unique blocks she used in her non-remix projects. Intriguingly, we also found a strong association between the number of downloads by a user and her vocabulary growth. One interpretation is that this learning might also be associated with less active forms of appropriation, like the process of reading source code described by Minsky.

The second approach we used considered specific concepts in programming, such as loops, or event-handling. To measure this, we utilized a mapping of Scratch blocks to key programming concepts found in this paper by Karen Brennan and Mitchel Resnick. For example, in the image below are all the Scratch blocks mapped to the concept of “loop”.

We looked at six concepts in total (conditionals, data, events, loops, operators, and parallelism). In each case, we hypothesized that if someone had never used a given concept before, they would be more likely to use that concept after encountering it while remixing an existing project.
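As a rough illustration of this exposure test, the sketch below flags whether a user’s first use of a concept came after remixing a project that contained it. The concept-to-block mapping echoes the Brennan and Resnick mapping in spirit but shows only a small illustrative subset, and the history format is hypothetical:

    # Sketch: did a user's first use of a concept follow exposure via remixing?
    # CONCEPT_BLOCKS is a small illustrative subset, not the full mapping.
    CONCEPT_BLOCKS = {
        "loops": {"repeat", "forever", "repeat until"},
        "conditionals": {"if", "if else"},
    }

    def uses_concept(blocks, concept):
        return bool(blocks & CONCEPT_BLOCKS[concept])

    def exposed_then_used(history, concept):
        """history: time-ordered (kind, blocks) pairs, where kind is 'own'
        for the user's de novo code or 'remix_source' for the blocks of a
        project the user remixed. Returns True if the first own use of the
        concept came after a remix exposed the user to it."""
        exposed = False
        for kind, blocks in history:
            if kind == "own" and uses_concept(blocks, concept):
                return exposed  # was the first own use preceded by exposure?
            if kind == "remix_source" and uses_concept(blocks, concept):
                exposed = True
        return False  # the user never used the concept

    history = [
        ("own", {"move", "say"}),
        ("remix_source", {"repeat", "move"}),  # exposure to loops via a remix
        ("own", {"repeat", "say"}),            # first own use of a loop
    ]
    print(exposed_then_used(history, "loops"))  # True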

Using this second approach, we found that users who had never used a concept were more likely to do so if they had been exposed to the concept through remixing. Although some concepts were more widely used than others, we found a positive relationship between concept use and exposure through remixing for each of the six concepts. We found that this relationship held even when we excluded obvious examples of cutting and pasting blocks of code. In all of these models, we found what we believe is evidence of learning through remixing.

Of course, there are many limitations in this work. What we found are all positive correlations — we do not know if these relationships are causal. Moreover, our measures do not really tell us whether someone has “understood” the usage of a given block or programming concept. However, even with these limitations, we are excited by the results of our work, and we plan to build on what we have. Our next steps include developing and utilizing better measures of learning, as well as looking at other methods of appropriation, like viewing the source code of a project.

This blog post and the paper it describes are collaborative work with Sayamindu Dasgupta, Andrés Monroy-Hernández, and William Hale. The paper is released as open access so anyone can read the entire paper here. This blog post was also posted on Sayamindu Dasgupta’s blog and on Medium by the MIT Media Lab.

More Community Data Science Workshops

Pictures from the CDSW sessions in Spring 2014

After two successful rounds in 2014, I’m helping put on another round of the Community Data Science Workshops. Last year, our 40+ volunteer mentors taught more than 150 absolute beginners the basics of programming in Python, data collection from web APIs, and tools for data analysis and visualization, and we’re still in the process of improving our curriculum and scaling up.

Once again, the workshops will be totally free of charge and open to anybody. Once again, they will be possible through the generous participation of a small army of volunteer mentors.

We’ll be meeting for four sessions over three weekends:

  • Setup and Programming Tutorial (April 10 evening)
  • Introduction to Programming (April 11)
  • Importing Data from web APIs (April 25)
  • Data Analysis and Visualization (May 9)

If you’re interested in attending, or in volunteering as a mentor, go to the information and registration page for the current round of workshops and sign up before April 3rd.

Consider the Redirect

In wikis, redirects are special pages that silently take readers from the page they are visiting to another page. Although their presence is noted in tiny gray text (see the image below), most people use them all the time without ever knowing they exist. Redirects exist to make linking between pages easier; they populate Wikipedia’s search autocomplete list and are generally helpful in organizing information. In the English Wikipedia, redirects make up more than half of all article pages.

[Image: the small gray “Redirected from” notice on Wikipedia’s Seattle article]

Over the years, I’ve spent some time contributing to Redirects for Discussion (RfD). I think of RfD as an ultra-low-stakes version of Articles for Deletion, where Wikipedians decide whether to delete or keep articles. If a redirect is deleted, viewers are taken to a search results page and almost nobody notices. That said, because redirects are almost never viewed directly, almost nobody notices if a redirect is kept either!

I’ve told people that if they want to understand the soul of a Wikipedian, they should spend time participating in RfD. When you understand why arguing about, and working hard to reach consensus on, how Wikipedia should handle individual redirects — where any outcome is invisible — is an enjoyable way to spend your spare time, you understand what it means to be a Wikipedian.

That said, wiki researchers rarely take redirects into account. For years, I’ve suspected that accounting for redirects is important for Wikipedia research and that several classes of findings are noisy or misleading because most people haven’t done so. As a result, earlier this year I worked with my colleague Aaron Shaw at Northwestern to build a longitudinal dataset that captures the dynamic nature of redirects. Our work was published as a short paper at OpenSym several months ago.

It turns out that taking redirects into account correctly (especially if you are looking at activity over time) is tricky, because MediaWiki stores redirects as normal pages that happen to start with special redirect text. Like other pages, redirects can be updated and changed over time, and frequently are. As a result, taking redirects into account in any study that looks at activity over time requires looking at the text of every revision of every page.
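To illustrate what that check looks like in practice, here is a minimal sketch. The regular expression follows MediaWiki’s redirect syntax, where a redirect is ordinary wikitext beginning with “#REDIRECT [[Target]]”; the revision records are invented examples:

    # Sketch: tracking a page's redirect status across its revision history.
    # A MediaWiki redirect is ordinary wikitext that begins with
    # "#REDIRECT [[Target]]" (case-insensitive), so every revision's text
    # must be inspected. The revisions below are invented examples.
    import re

    REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[(.+?)\]\]", re.IGNORECASE)

    def redirect_target(wikitext):
        """Return the redirect target, or None for a normal page."""
        match = REDIRECT_RE.match(wikitext)
        return match.group(1) if match else None

    revisions = [
        ("2004-05-01", "Seattle is a city in the state of Washington..."),
        ("2006-08-19", "#REDIRECT [[Seattle, Washington]]"),
        ("2009-02-02", "Seattle is the largest city in the Pacific Northwest..."),
    ]

    # A page can flip between being an article and being a redirect.
    for timestamp, text in revisions:
        print(timestamp, redirect_target(text))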

Using our dataset, Aaron and I showed that the distribution of edits across pages in English Wikipedia (a relationship used in many research projects) looks pretty close to log-normal when we remove redirects and very different when we don’t. After all, half of all articles are really just redirects, and because they are just redirects, these “articles” are almost never edited.

[Figure: the distribution of edits across pages in English Wikipedia, with and without redirects]

Another puzzling finding that’s been reported in a few places — and that I have repeated several times myself — is that edits and views are surprisingly uncorrelated. I’ll write more about this later, but the short version is that we found that a big chunk of this can, in fact, be explained by considering redirects.

We’ve published our code and data and the article itself is online because we paid the ACM’s open access fee to ransom the article.

Another Round of Community Data Science Workshops in Seattle

Pictures from the CDSW sessions in Spring 2014

I am helping coordinate three and a half day-long workshops in November for anyone interested in learning how to use programming and data science tools to ask and answer questions about online communities like Wikipedia, free and open source software, Twitter, civic media, etc. This will be a new and improved version of the workshops run successfully earlier this year.

The workshops are for people with no previous programming experience and will be free of charge and open to anyone.

Our goal is that, after the three workshops, participants will be able to use data to produce numbers, hypothesis tests, tables, and graphical visualizations to answer questions like:

  • Are new contributors to an article in Wikipedia sticking around longer or contributing more than people who joined last year?
  • Who are the most active or influential users of a particular Twitter hashtag?
  • Are people who participated in a Wikipedia outreach event staying involved? How do they compare to people who joined the project outside of the event?

If you are interested in participating, fill out our registration form here before October 30th. We were heavily oversubscribed last time so registering may help.

If you already know how to program in Python, it would be really awesome if you would volunteer as a mentor! Being a mentor will involve working with participants and talking them through the challenges they encounter in programming. No special preparation is required. If you’re interested, send me an email.

Community Data Science Workshops Post-Mortem

Earlier this year, I helped plan and run the Community Data Science Workshops: a series of three (and a half) day-long workshops designed to help people learn basic programming and data science tools in order to ask and answer questions about online communities like Wikipedia and Twitter. You can read our initial announcement for more about the vision.

The workshops were organized by myself, Jonathan Morgan from the Wikimedia Foundation, long-time Software Carpentry teacher Tommy Guy, and a group of 15 volunteer “mentors” who taught project-based afternoon sessions and worked one-on-one with more than 50 participants. Interest was overwhelming, and we were ultimately constrained by the number of mentors who volunteered. Unfortunately, this meant that we had to turn away most of the people who applied. Although it was not emphasized in recruiting or used as a selection criterion, a majority of the participants were women.

The workshops were all free of charge and sponsored by the UW Department of Communication, who provided space, and the eScience Institute, who provided food.

The curriculum for all four sessions is online.

The workshops were designed for people with no previous programming experience. Although most of our participants were from the University of Washington, we had non-UW participants from as far away as Vancouver, BC.

Feedback we collected suggests that the sessions were a huge success, that participants learned enormously, and that the workshops filled a real need in the Seattle community. Between workshops, participants organized meet-ups to practice their programming skills.

Most excitingly, just as we based our curriculum for the first session on the Boston Python Workshop’s, others have been building off our curriculum. Elana Hashman, who was a mentor at the CDSW, is coordinating a set of Python Workshops for Beginners with a group at the University of Waterloo and with sponsorship from the Python Software Foundation using curriculum based on ours. I also know of two university classes that are tentatively being planned around the curriculum.

Because a growing number of groups have been contacting us about running their own events based on the CDSW — and because we are currently making plans to run another round of workshops in Seattle late this fall — I coordinated with a number of other mentors to go over participant feedback and to put together a long write-up of our reflections in the form of a post-mortem. Although our emphasis is on things we might do differently, we provide a broad range of information that might be useful to people running a CDSW (e.g., our budget). Please let me know if you are planning to run an event so we can coordinate going forward.

Community Data Science Workshops in Seattle

Photo from the Boston Python Workshop – a similar workshop run in Boston that has inspired and provided a template for the CDSW.

On three Saturdays in April and May, I will be helping run three day-long project-based workshops at the University of Washington in Seattle. The workshops are for anyone interested in learning how to use programming and data science tools to ask and answer questions about online communities like Wikipedia, Twitter, free and open source software, and civic media.

The workshops are for people with no previous programming experience, and the goal is to bring together researchers as well as participants and leaders in online communities. The workshops will all be free of charge and open to the public, given availability of space.

Our goal is that, after the three workshops, participants will be able to use data to produce numbers, hypothesis tests, tables, and graphical visualizations to answer questions like:

  • Are new contributors to an article in Wikipedia sticking around longer or contributing more than people who joined last year?
  • Who are the most active or influential users of a particular Twitter hashtag?
  • Are people who participated in a Wikipedia outreach event staying involved? How do they compare to people who joined the project outside of the event?

If you are interested in participating, fill out our registration form here. The deadline to register is Wednesday, March 26th. We will let participants know if we have room for them by Saturday, March 29th. Space is limited and will depend on how many mentors we can recruit for the sessions.

If you already have experience with Python, please consider helping out at the sessions as a mentor. Being a mentor will involve working with participants and talking them through the challenges they encounter in programming. No special preparation is required. If you’re interested, send me an email.

Doctor of Philosophy

On Wednesday, I successfully defended my PhD dissertation in front of a ridiculously packed house at the MIT Media Lab. I am humbled by the support shown by the MIT Sloan, Media Lab, and Harvard communities. Earlier today, I finished up paperwork and submitted my archival copies. I’m done.

Although I’ve often heard PhDs described as emotional roller coasters, I feel enormously blessed in that I honestly can’t relate. My eight years at MIT and Harvard have been almost universally positive and I have learned and grown indescribably. As excited as I am about my next chapter at the University of Washington, I’m going to miss my life here. Deeply.

My dissertation was three essays on volunteer mobilization in peer production. Once I have a chance to catch up and recover, I’ll be posting the previously unpublished pieces. The Remixing Dilemma was included in the dissertation and is already online. The Media Lab AV team shot professional video of the talk. When I get a copy of the video, I’ll post that too.

But because I think it’s important, I’ve formatted and published the acknowledgments section of the dissertation today. Although there are too many folks to thank, I’ve highlighted the contributions of my co-authors and friends, Aaron Shaw and Andrés Monroy-Hernández, and my almost unbelievably incredible group of advisors: Eric von Hippel, Yochai Benkler, Mitch Resnick, and Tom Malone.

The Wikipedia Gender Gap Revisited

In a new paper, recently published in the open access journal PLOS ONE, Aaron Shaw and I build on new research in survey methodology to describe a method for estimating bias in opt-in surveys of contributors to online communities. We use the technique to reevaluate the most widely cited estimate of the gender gap in Wikipedia.

A series of studies have shown that Wikipedia’s editor-base is overwhelmingly male. This extreme gender imbalance threatens to undermine Wikipedia’s capacity to produce high quality information from a full range of perspectives. For example, many articles on topics of particular interest to women tend to be under-produced or of poor quality.

Given the open and often anonymous nature of online communities, measuring contributor demographics is a challenge. Most demographic data on Wikipedia editors come from “opt-in” surveys where people respond to open, public invitations. Unfortunately, very few people answer these invitations, and results from opt-in surveys are unreliable because respondents are rarely representative of the community as a whole. The most widely cited estimate, from a large 2008 survey by the Wikimedia Foundation (WMF) and United Nations University in Maastricht (UNU-MERIT), suggested that only 13% of contributors were female. However, the very same survey suggested that less than 40% of Wikipedia’s readers were female. We know, from several reliable sources, that Wikipedia’s readership is evenly split by gender — a sign of bias in the WMF/UNU-MERIT survey.

In our paper, we combine data from a nationally representative survey of the US by the Pew Internet and American Life Project with the opt-in data from the 2008 WMF/UNU-MERIT survey to come up with revised estimates of the Wikipedia gender gap. The details of the estimation technique are in the paper, but the core steps are listed below (a rough code sketch of the reweighting step follows the list):

  1. We use the Pew dataset to provide baseline information about Wikipedia readers.
  2. We apply a statistical technique called “propensity scoring” to estimate the likelihood that a US adult Wikipedia reader would have volunteered to participate in the WMF/UNU-MERIT survey.
  3. We follow a process originally developed by Valliant and Dever to weight the WMF/UNU-MERIT survey to “correct” for estimated bias.
  4. We extend this weighting technique to Wikipedia editors in the WMF/UNU data to produce adjusted estimates of the demographics of their sample.
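For readers who want a feel for the mechanics, here is a rough sketch of the reweighting step using logistic regression from scikit-learn. The synthetic data, covariate names, and simple inverse-odds weights are our illustration only; the paper follows Valliant and Dever’s procedure in detail:

    # Rough sketch of propensity-score weighting of an opt-in sample against
    # a reference sample. All data here is synthetic and the covariates are
    # illustrative; see the paper for the actual procedure.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def fake_sample(n, optin):
        """Stand-in for survey responses; real covariates come from the surveys."""
        return pd.DataFrame({
            "age": rng.integers(18, 70, n),
            "hours_online": rng.integers(0, 40, n),
            "optin": optin,
        })

    # Reference readers (e.g., Pew) stacked with opt-in readers (e.g., WMF/UNU-MERIT).
    readers = pd.concat([fake_sample(1000, 0), fake_sample(300, 1)])

    covariates = ["age", "hours_online"]
    model = LogisticRegression().fit(readers[covariates], readers["optin"])

    # Weight each opt-in editor by the inverse odds of opting in, so that
    # the kinds of people unlikely to answer an opt-in survey count for more.
    editors = fake_sample(200, 1)
    editors["female"] = (rng.random(200) < 0.15).astype(int)
    p = model.predict_proba(editors[covariates])[:, 1]
    weights = (1 - p) / p

    print("raw female share:", editors["female"].mean())
    print("adjusted female share:", np.average(editors["female"], weights=weights))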

Using this method, we estimate that the proportion of female US adult editors was 27.5% higher than the original study reported (22.7%, versus 17.8%), and that the total proportion of female editors was 26.8% higher (16.1%, versus 12.7%). These findings are consistent with other work showing that opt-in surveys tend to undercount women.

Overall, these results reinforce the basic substantive finding that women are vastly under-represented among Wikipedia editors.

Beyond Wikipedia, our paper describes a method that online communities can use to estimate contributor demographics from opt-in surveys in a way that is more credible than relying entirely on opt-in data. Advertising-intelligence firms like ComScore and Quantcast provide demographic data on the readership of an enormous proportion of websites. With these sources, almost any community can use our method (and source code) to replicate a similar analysis by: (1) surveying the community’s readers (or a random subset) with the same instrument used to survey contributors; (2) combining the results for readers with reliable demographic data about the readership population from a credible source; (3) reweighting the survey results using the method we describe.

Although our new estimates will not help us close the gender gap in Wikipedia or address its troubling implications, they give us a better picture of the problem. Additionally, our method offers an improved tool to build a clearer demographic picture of other online communities in general.

Job Market Materials

Last year, I applied for academic, tenure track, jobs at several communication departments, information schools, and in HCI-focused computer science programs with a tradition of hiring social scientists.

Being “on the market” — as it is called — is both scary and time consuming. Like me, many candidates have never been on the market before. Candidates are asked to produce documents in genres — e.g., cover letters, research statements, teaching statements, diversity statements — that most candidates have never written, read, or even heard of.

Candidates often rely on their supervisors for advice. I did so, and my advisors were extremely helpful. The reality, however, is that although candidates’ advisors may sit on hiring committees, most have not been on the candidates’ side of the job market themselves for years or even decades.

The Internet is full of websites, like the academic jobs wiki, Academia StackExchange, and the Chronicle of Higher Education forums for people on the market. Confused and insecure candidates ask questions of the form, “Does blank matter?” and the answer is usually, “Doing/having blank may help/hurt, but it is only one factor of many.” The result is that candidates worry about everything. Then they worry about what they should be worrying about, but are not.

The most helpful thing, for me, was to read and synthesize the material submitted by recent successful job market candidates. For example, Michael Bernstein — a friend from MIT, now at Stanford — published his research and teaching statements on his website and I found both useful as I prepared mine. That said, I was surprised by how little material like this I could find on the web. For example, I could not find any examples of recent job market cover letters from successful candidates in fields close to mine.

So to help fill this gap, I am publishing all of my job market material. I’ve posted both the PDFs of the material I submitted as well as the LaTeX templates I used to generate the documents in my packet. My packet included:

  • Research Statement (TeX) — A description of my research to date and my current trajectory. Following a convention I have seen others follow, I “cited” my own work (but only my work) to form a curated bibliography of my own publications and working papers.
  • Teaching Statement (TeX) — A two-page description of my approach to teaching, a list of my teaching experience, and a description of sample courses.
  • Diversity Statement (TeX) — A description of how I think about diversity and how I have, and will, engage with it in my teaching and research.
  • Cover Letter (TeX) — Each application I sent had a customized cover letter. I wrote mine on MIT letterhead. Since each letter is different, I have published the letter I sent to the department where I took the job (UW Communication). Because my new department did not request research and teaching statements, the cover letter includes material taken from both. For departments that requested separate statements, I limited myself to a shorter (1.5-page) version of the letter with a similar structure.
  • Writing Samples — I included three or four of my papers to every job I applied to. The selection of articles changed a bit depending on the department but I included at least one single-authored paper in each packet.
  • Letters of Recommendation — Because I didn’t write these and haven’t seen them, I can’t share them. I requested letters from my four committee members: Eric von Hippel, Yochai Benkler, Mitch Resnick, and Tom Malone.
  • Curriculum Vitae (TeX) — I have tried to keep my CV up-to-date during graduate school. I keep my CV in git and have a little CGI script automatically rebuild the published version whenever an update is committed (a toy sketch of this kind of rebuild script follows this list).
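For the curious, here is a toy version of that kind of rebuild automation, written as a git post-receive hook. The paths, branch name, and the use of latexmk are all guesses at this sort of setup, not my actual script:

    #!/usr/bin/env python3
    # Toy git post-receive hook that rebuilds a LaTeX CV into a web directory
    # whenever a commit is pushed. The paths, branch, and latexmk invocation
    # are guesses at this kind of setup, not the actual script.
    import subprocess
    import tempfile

    REPO = "/home/mako/cv.git"     # hypothetical bare repository
    WEBDIR = "/var/www/html/cv/"   # hypothetical publish location

    with tempfile.TemporaryDirectory() as workdir:
        # Check out the latest commit into a scratch directory.
        subprocess.run(["git", "--git-dir", REPO, "--work-tree", workdir,
                        "checkout", "-f", "master"], check=True)
        # Rebuild the PDF and copy it into the public web directory.
        subprocess.run(["latexmk", "-pdf", "cv.tex"], cwd=workdir, check=True)
        subprocess.run(["cp", workdir + "/cv.pdf", WEBDIR], check=True)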

I hope people going “on the market” will find these materials useful. Obviously, you should not copy or reuse the text of any of my material. It is your application, after all. That said, please do help yourself to the formatting and structure.

Finally, I would encourage anyone who builds on my material to republish their own material to help other candidates. If you do, I’d appreciate a link back or comment on this blog post so that my readers can find your improvements.

London and Michigan

I’ll be spending the week after next (June 17-23) in London for the annual meeting of the International Communication Association where I’ll be presenting a paper. This will be my first ICA and I’m looking forward to connecting with many new colleagues in the discipline. If you’re one of them, reading this, and would like to meet up in London, please let me know!

Starting June 24th, I’ll be in Ann Arbor, Michigan for four weeks of the ICPSR summer program in applied statistics at the Institute for Social Research. I have been wanting to sign up for some of their advanced methods classes for years and am planning to take the opportunity this summer before I start at UW. I’ll be living with my friends and fellow Berkman Cooperation Group members Aaron Shaw and Dennis Tennen.

I would love to make connections and meet people in both places so, if you would like to meet up, please get in contact.

The Cost of Inaccessibility at the Margins of Relevance

I use RSS feeds to keep up with academic journals. Because of an undocumented and unexpected feature (bug?) in my (otherwise wonderful) free software newsreader NewsBlur, many articles published over the last year were marked as having been read before I saw them.

Over the last week, I caught up. I spent hours going through abstracts and downloading papers that looked interesting or relevant to my research. Because I did this for hundreds of articles, it gave me an unusual opportunity to reflect on my journal reading practices in a systematic way.

On a number of occasions, there were potentially interesting articles in non-open access journals that neither MIT nor Harvard subscribes to and that were otherwise not accessible to me. In several cases where the research was obviously important to my work, I made an interlibrary request, emailed the papers’ authors for copies, or tracked down a colleague at an institution with access.

Of course, articles that look potentially interesting from the title and abstract often end up being less relevant or well executed on closer inspection. I tend to cast a wide net, skim many articles, and put them aside when it’s clear that the study is not for me. This week, I downloaded many of these possibly relevant papers to, at least, give a skim. But only if I could download them easily. On three or four occasions, I found inaccessible articles at this margin of relevance. In these cases, I did not bother trying to track down the articles.

Of course, what appear to be marginally relevant articles sometimes end up being a great match for my research, and I will end up citing and building on the work. I found several surprisingly interesting papers last week. The articles that were locked up have no chance at this.

When people suggest that a lack of open access hinders the spread of scholarship, a common retort is that the people who need the work have, or can finagle, access. For the papers we know we need, this might be true. As someone with access to two of the most well-endowed libraries in academia who routinely requests otherwise inaccessible articles through several channels, I would have told you, a week ago, that locked-down journals were unlikely to keep me from citing anybody.

So it was interesting to watch myself do a personal cost calculation in a way that sidelined published scholarship — and that open access publishing would have prevented. At the margin of relevance to one’s research, open access may make a big difference.

The Remixing Dilemma: The Trade-off Between Generativity and Originality

This post was written with Andrés Monroy-Hernández. It is a summary of a paper just published in American Behavioral Scientist. You can also read the full paper: The remixing dilemma: The trade-off between generativity and originality. It is part of a series of papers I have written with Monroy-Hernández using data from Scratch. You can find the others on my academic website.

Remixing — the reworking and recombination of existing creative artifacts — represents a widespread, important, and controversial form of social creativity online. Proponents of remix culture often speak of remixing in terms of rich ecosystems where creative works are novel and highly generative. However, examples like this can be difficult to find. Although there is a steady stream of media being shared freely on the web, only a tiny fraction of these projects are remixed even once. On top of this, many remixes are not very different from the works they are built upon. Why is some content more attractive to remixers? Why are some projects remixed in deeper and more transformative ways?
[Remix diagram]
We try to shed light on both of these questions using data from Scratch — a large online remixing community. Although we find support for several popular theories, we also present evidence in support of a persistent trade-off that has broad practical and theoretical implications. In what we call the remixing dilemma, we suggest that characteristics of projects that are associated with higher rates of remixing are also associated with simpler and less transformative types of derivatives.

Our study is focused on two interrelated research questions. First, we ask why some projects shared in remixing communities are more or less generative than others. “Generativity” — a term we borrow from Jonathan Zittrain — describes creative works that are likely to inspire follow-on work. Several scholars have offered suggestions for why some creative works might be more generative than others. We focus on three central theories:

  1. Projects that are moderately complicated are more generative. The free and open source software motto “release early and release often” suggests that simple projects will offer more obvious opportunities for contribution than more polished projects. That said, projects that are extremely simple (e.g., completely blank slates) may also be uninspiring to would-be contributors.
  2. Projects by prominent creators are more generative. The reasoning for this claim comes from the suggestion that remixing can act as a form of cultural conversation and that the work of popular creators can act like a common medium or language.
  3. Projects that are remixes themselves are more generative. The reasoning for this final claim comes from the idea that remixing thrives through the accumulation of contributions from groups of people building on each other’s work.

Our second question focuses on the originality of remixes and asks when more or less transformative remixing occurs. For example, highly generative projects may be less exciting if the projects produced based on them are all near-identical copies of antecedent projects. For a series of reasons — including the fact that increased generativity might come by attracting less interested, skilled, or motivated individuals — we suggest that each of the factors associated with generativity will also be associated with less original forms of remixing. We call this trade-off the remixing dilemma.

We answer both of our research questions using a detailed dataset from Scratch, where young people build, share, and collaborate on interactive animations and video games. The community was built to support users of the Scratch programming environment, a desktop application with functionality similar to Flash created by the Lifelong Kindergarten Group at the MIT Media Lab. Scratch is designed to allow users to build projects by integrating images, music, sound, and other media with programming code. Scratch is used by more than a million users, most of them under 18 years old.

To test our three theories about generativity, we measure whether or not, as well as how many times, Scratch projects were remixed in a dataset that includes every shared project. Although Scratch is designed as a remixing community, only around one tenth of all Scratch projects are ever remixed. Because more popular projects are remixed more frequently simply because of exposure, we control for the number of times each project is viewed.
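As a stylized illustration of this kind of model, the sketch below fits a bare-bones logistic regression of whether a project is ever remixed on complexity, creator prominence, remix status, and (log) views. The variable names and synthetic data are ours; the models in the paper are considerably more involved:

    # Stylized generativity model: is a project ever remixed? The synthetic
    # data and variable names are illustrative; the paper's models are richer.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 5000
    projects = pd.DataFrame({
        "blocks": rng.poisson(60, n),                   # code complexity
        "creator_followers": rng.poisson(5, n),         # creator prominence
        "is_remix": (rng.random(n) < 0.1).astype(int),  # remix vs. de novo
        "views": rng.poisson(30, n),
    })

    # Invented outcome so the example runs; real outcomes come from Scratch.
    xb = (-3 + 0.02 * projects.blocks - 0.0001 * projects.blocks ** 2
          + 0.05 * projects.creator_followers + 0.5 * projects.is_remix
          + 0.02 * projects.views)
    projects["remixed"] = (rng.random(n) < 1 / (1 + np.exp(-xb))).astype(int)

    # The quadratic term captures "moderate complexity is most generative";
    # log views control for exposure.
    projects["log_views"] = np.log1p(projects["views"])
    fit = smf.logit("remixed ~ blocks + I(blocks ** 2) + creator_followers"
                    " + is_remix + log_views", data=projects).fit()
    print(fit.summary())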

Our analysis shows at least some support for all three theories of generativity described above. (1) Projects with moderate amounts of code are remixed more often than either very simple or very complex projects. (2) Projects by more prominent creators are more generative. (3) Remixes are more likely to attract remixers than de novo projects.

To test our theory that there is a trade-off between generativity and originality, we build a dataset that includes every Scratch remix and its antecedent. For each pair, we construct a measure of originality by comparing the remix to its antecedent and computing an “edit distance” (a concept we borrow from software engineering) to determine how much the projects differ.
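As a toy illustration of the idea, the sketch below computes such a distance over flattened sequences of block names using Python’s difflib; the measure in the paper differs in its details:

    # Toy edit distance between a remix and its antecedent, computed over
    # flattened sequences of block names. The paper's measure differs in
    # detail; this only illustrates the idea.
    import difflib

    antecedent = ["when flag clicked", "forever", "move", "if on edge, bounce"]
    remix = ["when flag clicked", "forever", "move", "turn",
             "if on edge, bounce", "play sound"]

    matcher = difflib.SequenceMatcher(a=antecedent, b=remix)
    # Count inserted, deleted, and replaced blocks between the two versions.
    distance = sum(max(i2 - i1, j2 - j1)
                   for op, i1, i2, j1, j2 in matcher.get_opcodes()
                   if op != "equal")
    print(distance)  # 2: the remix adds "turn" and "play sound"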

We find strong evidence of a trade-off: (1) Projects of moderate complexity are remixed more lightly than more complicated projects. (2) Projects by more prominent creators tend to be remixed in less transformative ways. (3) Cumulative remixing tends to be associated with shallower and less transformative derivatives. That said, our support for (1) is qualified in that we do not find evidence of the increased originality for the simplest projects as our theory predicted.

Two plots of estimated values for prototypical projects. Panel 1 (left) displays predicted probabilities of being remixed. Panel 2 (right) displays predicted edit distances. Both panels show predicted values for both remixes and de novo projects from 0 to 1,204 blocks (99th percentile).

We feel that our results raise difficult but important challenges, especially for the designers of social media systems. For example, many social media sites track and display user prominence with leaderboards or lists of aggregate views. This technique may lead to increased generativity by emphasizing and highlighting creator prominence. That said, it may also lead to a decrease in originality of the remixes elicited. Our results regarding the relationship of complexity to generativity and originality of remixes suggest that supporting increased complexity, at least for most projects, may have fewer drawbacks.

As supporters and advocates of remixing, we feel that although highly generative works that lead to highly original derivatives may be rare and difficult for system designers to support, understanding remixing dynamics and encouraging these rare projects remain worthwhile and important goals.

Benjamin Mako Hill, Massachusetts Institute of Technology
Andrés Monroy-Hernández, Microsoft Research

For more, see our full paper, “The remixing dilemma: The trade-off between generativity and originality,” published in American Behavioral Scientist 57(5), pp. 643–663. (Official link, paywalled.)

MIT LaTeX Stationery

Color MIT LetterHead Example

The MIT graphic identity website provides downloadable stationery templates for letterhead and envelopes. They provide both Microsoft Word and LaTeX templates. But although they provide both black and white and color templates for Word, they only provide the monochrome templates for LaTeX. When writing cover letters for the job market this year, I was not particularly interested in compromising on color and was completely unwilling to compromise on TeX.

As a result, I ended up modifying each of the three templates to include color. In the process, I fixed a few bugs and documented one tricky issue. I’ve published a git repository with my changes. It includes branches for each of the three “old” black and white templates as well as my three new color templates. I hope others at MIT find it useful. I’ve tried to keep the changes minimal.

I’ve emailed the folks at MIT Communication Production Services to see if they want to publish my modified versions. Until then, anyone interested can help themselves to the git repository. LaTeX user that you are, you probably prefer that anyway.

Conversation on Freedom and Openness in Learning

On Monday, I was a visitor and guest speaker in a session on “Open Learning” in a class on Learning Creative Learning which aims to offer “a course for designers, technologists, and educators.” The class is being offered publicly by the combination — surprising but very close to my heart — of Peer 2 Peer University and the MIT Media Lab.

The hour-long session was facilitated by Philipp Schmidt and was mostly structured around a conversation with Audrey Watters and myself. The rest of the course materials and other video lectures are on the course website.

You can watch the video on YouTube or below. I thought it was a thought-provoking conversation!

If you’re interested in alternative approaches to learning and free software philosophy, I would also urge you to check out an essay I wrote in 2002: The Geek Shall Inherit the Earth: My Story of Unlearning. Keep in mind that the essay is probably the most personal thing I have ever published, and I wrote it more than a decade ago as a twenty-one-year-old undergraduate at Hampshire College. Although I’ve grown and learned enormously in the last ten years, and although I would not write the same document today, I am still proud of it.

The Cost of Collaboration for Code and Art

This post was written with Andrés Monroy-Hernández for the Follow the Crowd Research Blog. The post is a summary of a paper forthcoming in Computer-Supported Cooperative Work 2013. You can also read the full paper: The Cost of Collaboration for Code and Art: Evidence from Remixing. It is part of a series of papers I have written with Monroy-Hernández using data from Scratch. You can find the others on my academic website.

Does collaboration result in higher quality creative works than individuals working alone? Is working in groups better for functional works like code than for creative works like art? Although these questions lie at the heart of conversations about collaborative production on the Internet and peer production, it can be hard to find research settings where you can compare across both individual and group work and across both code and art. We set out to tackle these questions in the context of a very large remixing community.

Example of a remix in the Scratch online community, and the project it is based on. The orange arrows indicate pieces that were present in the original and reused in the remix.

Remixing platforms provide an ideal setting to answer these questions. Most support the sharing, and collaborative rating, of both individually and collaboratively authored creative works. They also frequently combine code with artistic media like sound and graphics.

We know that increased collaboration often leads to higher quality products. For example, studies of Wikipedia have suggested that vandalism is detected and removed within minutes, and that high quality articles in Wikipedia, by several measures, tend to be produced by more collaboration. That said, we also know that collaborative work is not always better — for example, brainstorming produces fewer good ideas when done in groups. We attempt to answer this broad question, asked many times before, in the context of remixing: which is the better description, “the wisdom of crowds” or “too many cooks spoil the broth”? That, fundamentally, forms our paper’s first research question: Are remixes, on average, higher quality than single-authored works?

A number of critics of peer production, and some fans, have suggested that mass collaboration on the Internet might work much better for certain kinds of works. The argument is that free software and Wikipedia can be built by a crowd because they are functional. But more creative works — like music, a novel, or a drawing — might benefit less from, or even be hurt by, participation by a crowd. Our second research question tries to get at this possibility: Are code-intensive remixes higher quality than media-intensive remixes?

We try to answer these questions using a detailed dataset from Scratch – a large online remixing community where young people build, share, and collaborate on interactive animations and video games. The community was built to support users of the Scratch programming environment: a desktop application with functionality similar to Flash created by the Lifelong Kindergarten Group at the MIT Media Lab. Scratch is designed to allow users to build projects by integrating images, music, sound, and other media with programming code. Scratch is used by more than a million, mostly young, users.

Measuring quality is tricky, and we acknowledge that there are many ways to do it. In the paper, we rely most heavily on a measure of peer ratings in Scratch called loveits — very similar to “likes” on Facebook. We find similar results with several other metrics, and we control for the number of views a project receives.

In answering our first research question, we find that remixes are, on average, rated as being of lower quality than works of single authorship. This finding was surprising to us but holds up across a number of alternative tests and robustness checks.

In answering our second question, we find rough support for the common wisdom that remixing tends to be more effective for functional works than for artistic media. The more code-intensive a project is, on average, the closer the gap is between a remix and a work of single authorship. But the more media-intensive a project is, the bigger the gap. You can see the relationships that our model predicts in the graph below.

Two plots of estimated values for prototypical projects showing the predicted number of loveits using our estimates. In the left panel, the x-axis varies number of blocks while holding media intensity at the sample median. The right panel varies the number of media elements while holding the number of blocks at the sample median. Ranges for each are from 0 to the 90th percentile.

Both of us are supporters and advocates of remixing. As a result, we were initially a little troubled by our result in this paper. We think the finding suggests an important limit to the broadest claims of the benefit of collaboration in remixing and peer production.

That said, we also reject the blind repetition of the mantra that collaboration is always better — for every definition of “better,” and for every type of work. We think it’s crucial to learn and understand the limitations and challenges associated with remixing and we’re optimistic that this work can influence the design of social media and collaboration systems to help remixing and peer production thrive.

For more, see our full paper, The Cost of Collaboration for Code and Art: Evidence from Remixing.