What do people do when they edit Wikipedia through Tor?

Note: I have not published blog posts about my academic papers over the past few years. To ensure that my blog contains a more comprehensive record of my published papers and to surface these for folks who missed them, I will be periodically (re)publishing blog posts about some “older” published projects. This post is closely based on a previously published post by Kaylea Champion on the Community Data Science Blog.

Many individuals use Tor to reduce their visibility to widespread internet surveillance.

One popular approach to protecting our privacy online is to use the Tor network. Tor protects users from being identified by their IP address, which can be tied to a physical location. However, if you’d like to contribute to Wikipedia using Tor, you’ll run into a problem. Although most IP addresses can edit without an account, Tor users are blocked from editing.

Tor users attempting to contribute to Wikipedia are shown a screen that informs them that they are not allowed to edit Wikipedia.

Other research by my team has shown that Wikipedia’s attempt to block Tor is imperfect and that some people have been able to edit despite the ban. As part of this work, we built a dataset of more than 11,000 contributions to Wikipedia via Tor and used quantitative analysis to show that contributions from Tor were of about the same quality as contributions from other new editors and other contributors without accounts. Of course, given the unusual circumstances Tor-based contributors face, we wondered whether a deeper look at the content of their edits might reveal more about their motives and the kinds of contributions they seek to make. Kaylea Champion (then a student, now faculty at UW Bothell) led a qualitative investigation to explore these questions.

Given the challenges of studying anonymity seekers, we designed a novel “forensic” qualitative approach inspired by techniques common in computer security and criminal investigation. We applied this new technique to a sample of 500 editing sessions and categorized each session based on what the editor seemed to be intending to do.

Most of the contributions we found fell into one of the two following categories:

  • Many contributions were quotidian attempts to add to the encyclopedia. Tor-based editors added facts, fixed typos, and updated train schedules. There’s no way to know whether these individuals knew they were just getting lucky in their ability to edit or were patiently reloading to evade the ban.
  • Second, we found harassing comments and vandalism. Unwelcome conduct is common in online environments, and it is sometimes more common when the likelihood of being identified decreases. Some of the harassing comments we observed were direct responses to being banned as a Tor user.

Although these were most of what we observed, we also found evidence of several types of contributor intent:

  • We observed activism: when a contributor tried to draw attention to journalistic accounts of environmental and human rights abuses committed by a mining company, only to have editors linked to the mining company repeatedly remove their edits. Another example involved an editor trying to diminish the influence of proponents of alternative medicine.
  • We also observed quality maintenance activities when editors used Wikipedia’s rules on appropriate sourcing to remove personal websites being cited in conspiracy theories.
  • We saw edit wars with Tor editors participating in a back-and-forth removal and replacement of content as part of a dispute, in some cases countering the work of an experienced Wikipedia editor whom even other experienced editors had gauged to be biased.
  • Finally, we saw Tor-based editors participating in non-article discussions, such as investigations into administrator misconduct and protests against the Wikipedia platform’s mistrust of Tor editors.
An exploratory mapping of our themes in terms of the value a type of contribution represents to the Wikipedia community and the importance of anonymity in facilitating it. Anonymity-protecting tools play a critical role in facilitating contributions on the right side of the figure, while edits on the left are more likely to occur even when anonymity is impossible. Contributions toward the top reflect valuable forms of participation in Wikipedia, while edits at the bottom reflect damage.

In all, these themes led us to reflect on how the risks individuals face when contributing to online communities sometimes diverge from the risks the communities face in accepting their work. Expressing minoritized perspectives, maintaining community standards even when you may be targeted by the rulebreaker, highlighting injustice, or acting as a whistleblower can be very risky for an individual, and may not be possible without privacy protections. Of course, in platforms seeking to support the public good, such knowledge and accountability may be crucial.


This work was published as a paper at CSCW: Kaylea Champion, Nora McDonald, Stephanie Bankes, Joseph Zhang, Rachel Greenstadt, Andrea Forte, and Benjamin Mako Hill. 2019. A Forensic Qualitative Analysis of Contributions to Wikipedia from Anonymity Seeking Users. Proceedings of the ACM on Human-Computer Interaction. 3, CSCW, Article 53 (November 2019), 26 pages. https://doi.org/10.1145/3359155

This project was conducted by Kaylea Champion, Nora McDonald, Stephanie Bankes, Joseph Zhang, Rachel Greenstadt, Andrea Forte, and Benjamin Mako Hill. This work was supported by the National Science Foundation (awards CNS-1703736 and CNS-1703049) and included the work of two undergraduates supported through an NSF REU supplement.

Effects of Algorithmic Flagging on Fairness: Quasi-experimental Evidence from Wikipedia

Note: I have not published blog posts about my academic papers over the past few years. To ensure that my blog contains a more comprehensive record of my published papers and to surface these for folks who missed them, I will be periodically (re)publishing blog posts about some “older” published projects. This particular post is closely based on a previously published post by Nate TeBlunthuis from the Community Data Science Blog.

Many online platforms are adopting AI and machine learning as a tool to maintain order and high-quality information in the face of massive influxes of user-generated content. Of course, AI algorithms can be inaccurate, biased, or unfair. How do signals from AI predictions shape the fairness of online content moderation? How can we measure an algorithmic flagging system’s effects?

In our paper published at CSCW, Nate TeBlunthuis, together with myself and Aaron Halfaker, analyzed the RCFilters system: an add-on to Wikipedia that highlights and filters edits that a machine learning algorithm called ORES identifies as likely to be damaging to Wikipedia. This system has been deployed on large Wikipedia language editions and is similar to other algorithmic flagging systems that are becoming increasingly widespread. Our work measures the causal effect of being flagged in the RCFilters user interface.

Screenshot of Wikipedia edit metadata on Special:RecentChanges with RCFilters enabled. Highlighted edits with a colored circle to the left side of other metadata are flagged by ORES. Different circle and highlight colors (white, yellow, orange, and red in the figure) correspond to different levels of confidence that the edit is damaging. RCFilters does not specifically flag edits by new accounts or unregistered editors, but does support filtering changes by editor types.

Our work takes advantage of the fact that RCFilters, like many algorithmic flagging systems, create discontinuities in the relationship between the probability that a moderator should take action and whether a moderator actually does. This happens because the output of machine learning systems like ORES is typically a continuous score (in RCFilters, an estimated probability that a Wikipedia edit is damaging), while the flags (in RCFilters, the yellow, orange, or red highlights) are either on or off and are triggered when the score crosses some arbitrary threshold. As a result, edits slightly above the threshold are both more visible to moderators and appear more likely to be damaging than edits slightly below. Even though edits on either side of the threshold have virtually the same likelihood of truly being damaging, the flagged edits are substantially more likely to be reverted. This fact lets us use a method called regression discontinuity to make causal estimates of the effect of being flagged in RCFilters.

Charts showing the probability that an edit will be reverted as a function of ORES scores in the neighborhood of the discontinuous threshold that triggers the RCfilters flag. The jump in the increase in reversion chances is larger for registered editors compared to unregistered editors at both thresholds.

To understand how this system may affect the fairness of Wikipedia moderation, we estimate the effects of flagging on edits on different groups of editors. Comparing the magnitude of these estimates lets us measure how flagging is associated with several different definitions of fairness. Surprisingly, we found evidence that these flags improved fairness for categories of editors that have been widely perceived as troublesome—particularly unregistered (anonymous) editors. This occurred because flagging has a much stronger effect on edits by the registered than on edits by the unregistered.

We believe that our results are driven by the fact that algorithmic flags are especially helpful for finding damage that can’t be easily detected otherwise. Wikipedia moderators can see the editor’s registration status in the recent changes, watchlists, and edit history. Because unregistered editors are often troublesome, Wikipedia moderators’ attention is often focused on their contributions, with or without algorithmic flags. Algorithmic flags make damage by registered editors (in addition to unregistered editors) much more detectable to moderators and so help moderators focus on damage overall, not just damage by suspicious editors. As a result, the algorithmic flagging system decreases the bias that moderators have against unregistered editors.

This finding is particularly surprising because the ORES algorithm we analyzed was itself demonstrably biased against unregistered editors (i.e., the algorithm tended to greatly overestimate the probability that edits by these editors were damaging). Despite the fact that the algorithms were biased, their introduction could still lead to less biased outcomes overall.

Our work shows that although it is important to design predictive algorithms to avoid such biases, it is equally important to study fairness at the level of the broader sociotechnical system. Since we first published a preprint of our paper, a follow-up piece by Leijie Wang and Haiyi Zhu replicated much of our work and showed that differences between different Wikipedia communities may be another important factor driving the effect of the system. Overall, this work suggests that social signals and social context can interact with algorithmic signals, and together these can influence behavior in important and unexpected ways.


The full citation for the paper is: TeBlunthuis, Nathan, Benjamin Mako Hill, and Aaron Halfaker. 2021. “Effects of Algorithmic Flagging on Fairness: Quasi-Experimental Evidence from Wikipedia.” Proceedings of the ACM on Human-Computer Interaction 5 (CSCW): 56:1-56:27. https://doi.org/10.1145/3449130.

We have also released replication materials for the paper, including all the data and code used to conduct the analysis and compile the paper itself.

My Chair

I realize that because I have several chairs, the phrase “my chair” is ambiguous. To reduce confusion, I will refer to the head of my academic department as “my office chair” going forward.

The Financial Times has been printing an obvious error on its “Market Data” page for 18 months and nobody else seems to have noticed

Market Data section of the Financial Times US Edition print edition from May 5, 2021.

If you’ve flipped through printed broadsheet newspapers, you’ve probably seen pages full of tiny text listing prices and other market information for stocks and commodities. And you’ve almost certainly just turned the page. Anybody interested in this market prices today will turn to the internet where these numbers are available in real time and where you don’t need to squint to find what you need. This is presumably why many newspapers have stopped printing these types of pages or dramatically reduced the space devoted to them. Major financial newspapers however—like the Financial Times (FT)—still print multiple pages of market data daily. But does anybody read them?

The answer appears to be “no.” How do I know? I noticed an error in the FT‘s “Market Data” page that anybody looking in the relevant section of the page would have seen. And I have seen it reproduced every single day for the last 18 months.

In early May last year, I noticed that the Japanese telecom giant Nippon Telegraph and Telephone (NTT) was listed twice on the FT‘s list of the 500 largest global companies: once as “Nippon T&T” and also as “Nippon TT.” One right above the other. All the numbers are identical. Clearly a mistake.

Reproduction of the FT Market Data section showing a subset of Japanese companies from the FT 500 list of global companies. The duplicate lines are highlighted in yellow. This page is from today’s paper (November 26, 2022).

Wondering if it was a one-off error, I looked at a copy of the paper from about a week before and saw that the error did not exist then. I looked at a copy from one day before and saw that it did. Since the issue was apparently recurring, but new at the time, I figured someone at the paper would notice and fix it quickly. I was wrong. It has been 18 months now and the error has been reproduced every single day.

Looking through the archives, it seems that the first day the error showed up was May 5, 2021. I’ve included a screenshot from the electronic paper version from that day—and from the fifth of every month since then (or the sixth if the paper was not printed on the fifth)—that shows that the error is reproduced every day. A quick look in the archives suggests it not only appears in the US edition but also in the UK, European, Asian, and Middle East editions. All of them.

Why does this matter? The FT prints over 112,000 copies of its paper, six days a week. This duplicate line takes up almost no space, of course, so it’s not a big deal on its own. But devoting two full broadsheet pages to market data that is out date as soon as it is printed—much of which nobody appears to be reading—doesn’t seem like a great use of resources. There’s an argument to made that papers like the FT print these pages not because they are useful but because doing so is a signal of the publications’ identities as serious financial papers. But that hardly seems like a good enough reason on its own if nobody is looking at them. It seems well past time for newspapers to stop wasting paper and ink on these pages.

I respect that some people think that printing paper newspapers at all is wasteful when one can just read the material online. Plenty of people disagree, of course. But who will disagree with a call to stop printing material that evidence suggests is not being seen by anybody? If an error this obvious can exist for so long, it seems clear that nobody—not even anybody at the FT itself—is reading it.

The Hidden Costs of Requiring Accounts

Should online communities require people to create accounts before participating?

This question has been a source of disagreement among people who start or manage online communities for decades. Requiring accounts makes some sense since users contributing without accounts are a common source of vandalism, harassment, and low quality content. In theory, creating an account can deter these kinds of attacks while still making it pretty quick and easy for newcomers to join. Also, an account requirement seems unlikely to affect contributors who already have accounts and are typically the source of most valuable contributions. Creating accounts might even help community members build deeper relationships and commitments to the group in ways that lead them to stick around longer and contribute more.

In a new paper published in Communication Research, I worked with Aaron Shaw provide an answer. We analyze data from “natural experiments” that occurred when 136 wikis on Fandom.com started requiring user accounts. Although we find strong evidence that the account requirements deterred low quality contributions, this came at a substantial (and usually hidden) cost: a much larger decrease in high quality contributions. Surprisingly, the cost includes “lost” contributions from community members who had accounts already, but whose activity appears to have been catalyzed by the (often low quality) contributions from those without accounts.


A version of this post was first posted on the Community Data Science blog.

The full citation for the paper is: Hill, Benjamin Mako, and Aaron Shaw. 2020. “The Hidden Costs of Requiring Accounts: Quasi-Experimental Evidence from Peer Production.” Communication Research, 48 (6): 771–95. https://doi.org/10.1177/0093650220910345.

If you do not have access to the paywalled journal, please check out this pre-print or get in touch with us. We have also released replication materials for the paper, including all the data and code used to conduct the analysis and compile the paper itself.

Q&A about doing a PhD with my research group

Ever considered doing research about online communities, free culture/software, and peer production full time? It’s PhD admission season and my research group—the Community Data Science Collective—is doing an open-to-anyone Q&A about PhD admissions this Friday November 5th. We’ve got room in the session and its not too late to sign up to join us!

The session will be a good opportunity to hear from and talk to faculty recruiting students to our various programs at the University of Washington, Purdue, and Northwestern and to talk with current and previous students in the group.

I am hoping to admit at least one new PhD advisee to the Department of Communication at UW this year (maybe more) and am currently co-advising (and/or have previously co-advised) students in UW’s Allen School of Computer Science & Engineering, Department of Human-Centered Design & Engineering, and Information School.

One thing to keep in mind is that my primary/home department—Communication—has a deadline for PhD applications of November 15th this year.

The registration deadline for the Q&A session is listed as today but we’ll do what we can to sneak you in even if you register late. That said, please do register ASAP so we can get you the link to the session!

https://blog.communitydata.science/join-us-for-the-cdsc-phd-application-q-a/

Returning to DebConf

I first started using Debian sometime in the mid 90s and started contributing as a developer and package maintainer more than two decades years ago. My first very first scholarly publication, collaborative work led by Martin Michlmayr that I did when I was still an undergrad at Hampshire College, was about quality and the reliance on individuals in Debian. To this day, many of my closest friends are people I first met through Debian. I met many of them at Debian’s annual conference DebConf.

Given my strong connections to Debian, I find it somewhat surprising that although all of my academic research has focused on peer production, free culture, and free software, I haven’t actually published any Debian related research since that first paper with Martin in 2003!

So it felt like coming full circle when, several days ago, I was able to sit in the virtual DebConf audience and watch two of my graduate student advisees—Kaylea Champion and Wm Salt Hale—present their research about Debian at DebConf21.

Salt presented his masters thesis work which tried to understand the social dynamics behind organizational resilience among free software projects. Kaylea presented her work on a new technique she developed to identifying at risk software packages that are lower quality than we might hope given their popularity (you can read more about Kaylea’s project in our blog post from earlier this year).

If you missed either presentation, check out the blog post my research collective put up or watch the videos below. If you want to hear about new work we’re doing—including work on Debian—you should follow our research group blog, and/or follow or engage with us in the Fediverse (@communitydata@social.coop), or on Twitter (@comdatasci).

And if you’re interested in joining us—perhaps to do more research on FLOSS and/or Debian and/or a graduate degree of your own?—please be in touch with me directly!

Wm Salt Hale’s presentation plus Q&A. (WebM available)
Kaylea Champion’s presentation plus Q&A. (WebM available)

NSF CAREER Award

In exciting professional news, it was recently announced that I got an National Science Foundation CAREER award! The CAREER is the US NSF’s most prestigious award for early-career faculty. In addition to the recognition, the award involves a bunch of money for me to put toward my research over the next 5 years. The Department of Communication at the University of Washington has put up a very nice web page announcing the thing. It’s all very exciting and a huge honor. I’m very humbled.

The grant will support a bunch of new research to develop and test a theory about the relationship between governance and online community lifecycles. If you’ve been reading this blog for a while, you’ll know that I’ve been involved in a bunch of research to describe how peer production communities tend to follow common patterns of growth and decline as well as a studies that show that many open communities become increasingly closed in ways that deter lots of the kinds contributions that made the communities successful in the first place.

Over the last few years, I’ve worked with Aaron Shaw to develop the outlines of an explanation for why many communities because increasingly closed over time in ways that hurt their ability to integrate contributions from newcomers. Over the course of the work on the CAREER, I’ll be continuing that project with Aaron and I’ll also be working to test that explanation empirically and to develop new strategies about what online communities can do as a result.

In addition to supporting research, the grant will support a bunch of new outreach and community building within the Community Data Science Collective. In particular, I’m planning to use the grant to do a better job of building relationships with community participants, community managers, and others in the platforms we study. I’m also hoping to use the resources to help the CDSC do a better job of sharing our stuff out in ways that are useful as well doing a better job of listening and learning from the communities that our research seeks to inform.

There are many to thank. The proposed work was the direct research of the work I did as the Center for Advanced Studies in the Behavioral Sciences at Stanford where I got to spend the 2018-2019 academic year in Claude Shannon’s old office and talking through these ideas with an incredible range of other scholars over lunch every day. It’s also the product of years of conversations with Aaron Shaw and Yochai Benkler. The proposal itself reflects the excellent work of the whole CDSC who did the work that made the award possible and provided me with detailed feedback on the proposal itself.

Identifying “Underproduced” Software

I wrote this blog post with Kaylea Champion and a version of this post was originally posted on the Community Data Science Collective blog.

Critical software we all rely on can silently crumble away beneath us. Unfortunately, we often don’t find out software infrastructure is in poor condition until it is too late. Over the last year or so, I have been supporting Kaylea Champion on a project my group announced earlier to measure software underproduction—a term we use to describe software that is low in quality but high in importance.

Underproduction reflects an important type of risk in widely used free/libre open source software (FLOSS) because participants often choose their own projects and tasks. Because FLOSS contributors work as volunteers and choose what they work on, important projects aren’t always the ones to which FLOSS developers devote the most attention. Even when developers want to work on important projects, relative neglect among important projects is often difficult for FLOSS contributors to see.

Given all this, what can we do to detect problems in FLOSS infrastructure before major failures occur? Kaylea Champion and I recently published a paper laying out our new method for measuring underproduction at the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) 2021 that we believe provides one important answer to this question.

A conceptual diagram of underproduction. The x-axis shows relative importance, the y-axis relative quality. The top left area of the graph described by these axes is 'overproduction' -- high quality, low importance. The diagonal is Alignment: quality and importance are approximately the same. The lower right depicts underproduction -- high importance, low quality -- the area of potential risk.
Conceptual diagram showing how our conception of underproduction relates to quality and importance of software.

In the paper, we describe a general approach for detecting “underproduced” software infrastructure that consists of five steps: (1) identifying a body of digital infrastructure (like a code repository); (2) identifying a measure of quality (like the time to takes to fix bugs); (3) identifying a measure of importance (like install base); (4) specifying a hypothesized relationship linking quality and importance if quality and importance are in perfect alignment; and (5) quantifying deviation from this theoretical baseline to find relative underproduction.

To show how our method works in practice, we applied the technique to an important collection of FLOSS infrastructure: 21,902 packages in the Debian GNU/Linux distribution. Although there are many ways to measure quality, we used a measure of how quickly Debian maintainers have historically dealt with 461,656 bugs that have been filed over the last three decades. To measure importance, we used data from Debian’s Popularity Contest opt-in survey. After some statistical machinations that are documented in our paper, the result was an estimate of relative underproduction for the 21,902 packages in Debian we looked at.

One of our key findings is that underproduction is very common in Debian. By our estimates, at least 4,327 packages in Debian are underproduced. As you can see in the list of the “most underproduced” packages—again, as estimated using just one more measure—many of the most at risk packages are associated with the desktop and windowing environments where there are many users but also many extremely tricky integration-related bugs.

This table shows the 30 packages with the most severe underproduction problem in Debian, shown as a series of boxplots.
These 30 packages have the highest level of underproduction in Debian according to our analysis.

We hope these results are useful to folks at Debian and the Debian QA team. We also hope that the basic method we’ve laid out is something that others will build off in other contexts and apply to other software repositories.

In addition to the paper itself and the video of the conference presentation on Youtube by Kaylea, we’ve put a repository with all our code and data in an archival repository Harvard Dataverse and we’d love to work with others interested in applying our approach in other software ecosytems.


For more details, check out the full paper which is available as a freely accessible preprint.

This project was supported by the Ford/Sloan Digital Infrastructure Initiative. Wm Salt Hale of the Community Data Science Collective and Debian Developers Paul Wise and Don Armstrong provided valuable assistance in accessing and interpreting Debian bug data. René Just generously provided insight and feedback on the manuscript.

Paper Citation: Kaylea Champion and Benjamin Mako Hill. 2021. “Underproduction: An Approach for Measuring Risk in Open Source Software.” In Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2021). IEEE.

Contact Kaylea Champion (kaylea@uw.edu) with any questions or if you are interested in following up.

The Free Software Foundation and Richard Stallman

I served as a director and as a voting member of the Free Software Foundation for more than a decade. I left both positions over the last 18 months and currently have no formal authority in the organization.

So although it is now just my personal opinion, I will publicly add my voice to the chorus of people who are expressing their strong opposition to Richard Stallman’s return to leadership in the FSF and to his continued leadership in the free software movement. The current situation makes me unbelievably sad.

I stuck around the FSF for a long time (maybe too long) and worked hard (I regret I didn’t accomplish more) to try and make the FSF better because I believe that it is important to have voices advocating for social justice inside our movement’s most important institutions. I believe this is especially true when one is unhappy with the existing state of affairs. I am frustrated and sad that I concluded that I could no longer be part of any process of organizational growth and transformation at FSF.

I have nothing but compassion, empathy, and gratitude for those who are still at the FSF—especially the staff—who are continuing to work silently toward making the FSF better under intense public pressure. I still hope that the FSF will emerge from this as a better organization.

Reflections on Janet Fulk and Peter Monge

In May 2019, my research group was invited to give short remarks on the impact of Janet Fulk and Peter Monge at the International Communication Association‘s annual meeting as part of a session called “Igniting a TON (Technology, Organizing, and Networks) of Insights: Recognizing the Contributions of Janet Fulk and Peter Monge in Shaping the Future of Communication Research.

Youtube: Mako Hill @ Janet Fulk and Peter Monge Celebration at ICA 2019

I gave a five-minute talk on Janet and Peter’s impact to the work of the Community Data Science Collective by unpacking some of the cryptic acronyms on the CDSC-UW lab’s whiteboard as well as explaining that our group has a home in the academic field of communication, in no small part, because of the pioneering scholarship of Janet and Peter. You can view the talk in WebM or on Youtube.

[This blog post was first published on the Community Data Science Collective blog.]