Democratizing Data Science


Community Data Science Collective
University of Washington
https://communitydata.science/
 
Benjamin Mako Hill
makohill@uw.edu
https://mako.cc/academic/

Talk outline

  • Democratizing data science?
  • Two experiments:
    • Community Data Science Workshops
    • Scratch Community Blocks
  • Reflections and Takeaways

Democratizing data science?

Why democratize data science?

Compared to professional programmers, end-user programmers...

  • are more common
  • have distinct interests
  • carry out different tasks
  • are best supported by distinct tools

End-User Data Science?

How can we empower everyone to access and explore their personal data (e.g., from social media)?

Conversational Programmers

Conversational Data Science?

How can we help everyone develop the concepts and language they need to understand data science (e.g., to communicate and collaborate effectively with data science professionals)?

Experiment 1

Community Data Science Workshops

Our Goal

Train end-user data scientists who can ask and answer questions using Python and web data.

Participants

  • Social scientists & humanists
  • UI/UX designers and researchers
  • Activists & non-profit staff
  • Bloggers
  • Wikipedia contributors

Questions Like

  • Are new contributors to an article in Wikipedia sticking around longer or contributing more than people who joined last year?
  • Who are the most active or influential users of a particular Twitter hashtag?
  • Are people who participated in a Wikipedia outreach event staying involved? How do they compare to people who joined the project outside of the event?

Our Philosophy

  • Absolute beginners.
  • Problem solvers, not programmers.
  • As independent as possible. As soon as possible.
  • Documentation, openness, and reproducibility from the start.
  • Use the tools and skills people arrive with. Visualization in spreadsheets is OK.

Principles

  • Students write real programs on their computers
  • 5:1 learner:mentor ratio (or better!)
  • Project-based work

Four Sessions

  • 0. Setup and Tutorial (Python 3, Anaconda)
  • 1. Introduction to Programming
  • 2. Importing Data from Web APIs
  • 3. Data Analysis and Visualization

Daily Schedule

Morning: Interactive lecture in IPython & Terminal

Afternoon: Independent work on projects

Words we never say:

class, object, method, recursion, list comprehension, unit testing

Many CDSWs!

  • Seattle (UW): Spring 2014, Fall 2014, Spring 2015, Spring 2016
  • Waterloo: Fall 2014
  • Chicago and DC: Planning
  • Seven courses at UW taught by three instructors

Student Outcomes

Community data science looks different demographically from professional data science.

Edits per day in English Wikipedia to all articles in the category “People shot dead by law enforcement officers.” (Work done by Mary Dickson during Session #3 of the Fall 2014 Community Data Science Workshops.)
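
A minimal sketch of how a per-day edit count like this one could be computed, assuming the revision timestamps are already in hand as ISO 8601 strings (the format the MediaWiki API returns when asked for `rvprop=timestamp`). The function name and the sample timestamps are hypothetical and are not taken from the workshop materials.

```python
from collections import Counter
from datetime import datetime


def edits_per_day(timestamps):
    """Count edits per calendar day (UTC) from ISO 8601 timestamp strings."""
    days = Counter()
    for ts in timestamps:
        # MediaWiki-style timestamps look like "2014-11-25T03:12:45Z".
        day = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").date()
        days[day.isoformat()] += 1
    # Return (day, count) pairs sorted chronologically.
    return sorted(days.items())


# Hypothetical sample data, for illustration only.
sample = [
    "2014-11-24T23:59:01Z",
    "2014-11-25T03:12:45Z",
    "2014-11-25T18:40:10Z",
]
for day, count in edits_per_day(sample):
    print(day, count)
```

The resulting (day, count) pairs could be written to a CSV file and charted in a spreadsheet, in keeping with the workshop philosophy of using the tools people arrive with.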

Experiment 2

Scratch Community Blocks

“As students work, the [learning] system can capture their inputs and collect evidence of their problem-solving sequences, knowledge, and strategy use […].”


“As you’re leveling up your skills, you can keep track of progress right on the dashboard. Points, badges, and skill levels are all visible and updated in realtime.”


What about children as data scientists?

Empirical setting

Scratch Community Blocks

  • Project metadata
  • User metadata
  • Site-wide statistics

Studies with children

Approx. four months of user testing
2,500 active Scratch users invited
Approx. 700 active users with ~1600 projects

Online ethnographic observations of activity (projects, comments, and forum posts)
Interviews, surveys, and face-to-face workshops

Projects by children

Doughnut chart showing the types of blocks the viewer of the project has used in their previously shared projects.

Created by Jondroid (13 years old) and Chewie184 (12 years old).

Number of scoops of ice cream on the cone is determined by the number of followers of the viewer.

AwesomeNemo (12 years old)

“I was thinking about how I could use it to show statistics of how many followers and things you have, like not just in a bar chart”

AwesomeNemo (12 years old)

Project that gives the viewer a “talkativeness” score based on the cumulative string-length of titles and descriptions of projects the viewer has shared in the past.

Pichu_is_awesome (13 years old)
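
The “talkativeness” score described above can be sketched in Python as a simple sum of string lengths. This is an illustration only: the original project was built with Scratch blocks, and the dictionary keys and sample projects below are hypothetical.

```python
def talkativeness(projects):
    """Sum the character lengths of every project's title and description."""
    return sum(
        len(p.get("title", "")) + len(p.get("description", ""))
        for p in projects
    )


# Hypothetical shared projects for one viewer.
shared = [
    {"title": "Maze", "description": "Use arrow keys."},
    {"title": "Pong", "description": ""},
]
print(talkativeness(shared))
```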

Project that lets the viewer “buy” virtual accessories for their virtual doll. The buying power (amount of “money”) is determined by the number of shared projects or followers.

Pichu_is_awesome (13 years old)

“I was trying to think of something that, somebody hadn’t done yet, and I didn’t see that. And also I really like to do art on Scratch and that was a good opportunity to use that and mix the two together.”

Pichu_is_awesome (13 years old)

Reflections by children

“epic! looks like we need to use more pen blocks. :D”

DragonCat (16 years old)

“Average no. of loves - four. Well, that’s not depressing at all :'(”

Awo14 (13 years old)

Five critical data literacies

  1. Data collection and retention has privacy implications
  2. Data analysis requires skepticism and interpretation
  3. Data can come with assumptions and hidden decisions
  4. Data-driven algorithms can cause exclusion
  5. Measuring and reporting on data can affect the system that created the data

Data collection and retention has privacy implications

“[…] it does feel like a bit of a way to stalk Scratchers from a project. Though I suppose it doesn't make much of a difference since a stalker could always just stalk them on the site, but you get what I mean. Someone could set themselves up with some of these blocks and get themselves a constant relay of the person’s activity.”

Raindrop (15 years old)

Data-driven algorithms can cause exclusion

“I love these new Scratch Blocks! However I did notice that they could be used to exclude new scratchers or scratchers with not a lot of followers by using a code: like this:
 when flag clicked
 if then user's followers < 300
   stop all.
I do not think this a big problem as it would be easy to remove this code but I did just want to bring this to your attention in case this not what you would want the blocks to be used for”

Jondroid (13 years old)
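
To make the concern concrete, here is a rough Python analogue of the gate Jondroid describes, plus a helper that estimates what share of viewers it would turn away. Only the threshold of 300 comes from the quote; the function names and the sample follower counts are hypothetical.

```python
FOLLOWER_THRESHOLD = 300  # value taken from the quoted example


def can_run(viewer_followers, threshold=FOLLOWER_THRESHOLD):
    """Return False when the project would stop for this viewer."""
    return viewer_followers >= threshold


def excluded_share(follower_counts, threshold=FOLLOWER_THRESHOLD):
    """Fraction of viewers that the follower gate would turn away."""
    if not follower_counts:
        return 0.0
    blocked = sum(1 for n in follower_counts if not can_run(n, threshold))
    return blocked / len(follower_counts)


# Hypothetical follower counts for six viewers.
viewers = [0, 12, 45, 299, 300, 1500]
print(f"{excluded_share(viewers):.0%} of these viewers would be excluded")
```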

Reflections

A more democratized data science is possible!
There is overwhelming interest!

Democratizing data science is about helping individuals follow their interests with data, not about learning data science as an end in itself. End-user data scientists have broad and surprising interests.

They require different tools and different forms of support.

Situating learning experiences in familiar cultural contexts and enabling learners to analyze social and behavioral data can spark conversations that explore critical data literacies.

Papers

Hill, Benjamin Mako, Dharma Dailey, Richard T. Guy, Ben Lewis, Mika Matsuzaki, and Jonathan T. Morgan. 2017. “Democratizing Data Science: The Community Data Science Workshops and Classes.” In Big Data Factories: Collaborative Approaches, 115–35. Computational Social Sciences. Berlin, Germany: Springer Nature. https://doi.org/10.1007/978-3-319-59186-5_9.

Dasgupta, Sayamindu, and Benjamin Mako Hill. 2017. “Scratch Community Blocks: Supporting Children As Data Scientists.” In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17), 3620–3631. New York, New York: ACM. https://doi.org/10.1145/3025453.3025847.

Hautea, Samantha, Sayamindu Dasgupta, and Benjamin Mako Hill. 2017. “Youth Perspectives on Critical Data Literacies.” In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17), 919–930. New York, New York: ACM. https://doi.org/10.1145/3025453.3025823.

Acknowledgments

Community Data Science Workshop volunteers, organizers, and participants!
Sayamindu Dasgupta
Assistant Professor, University of North Carolina
Members of the Scratch community
Mitchel Resnick, Natalie Rusk, and Hal Abelson, and members of the Lifelong Kindergarten group, MIT Media Lab
National Science Foundation
Grants DRL-1417663 and DRL-1417952

Slide Appendix

Data analysis requires skepticism and interpretation

“At one point the follower blocks, it said I have slightly more followers than I do. And, that was kind of confusing when I was trying to make the project. […] I pulled up a second [browser] tab and compared the [data from Scratch Community Blocks and the data in my profile].”

Alec (14 years old)

Data can come with assumptions and hidden decisions

Commenter A: that's so cool! almost 00.5% of all the users on scratch have viewed my projects and that's a lot :B but crossstitch’s results are indeed slightly dubious… over 100% of people have viewed his projects which is awesome but impossible - love the project!! ^o^
 
Commenter B: @CommenterA i think it's because its based on views, not each specific player.
 
Commenter A: @CommenterB that's awesome :D people who haven't registered on scratch have viewed a significant amount of his projects yes
 
Project Creator: @CommenterA Yeah what @CommenterB said is correct


“This project does not and cannot calculate unique viewers. So some views can be the exact same person who viewed the project earlier.”
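
The arithmetic behind the “over 100%” result can be sketched as follows. The numbers and names are hypothetical, and the per-view log is an assumption made purely for illustration: as the project note says, the project itself cannot calculate unique viewers.

```python
def percent_of_users(count, registered_users):
    """Express a count as a percentage of registered users."""
    if registered_users <= 0:
        raise ValueError("registered_users must be positive")
    return 100.0 * count / registered_users


# Hypothetical view log: repeat views and logged-out views all count.
view_log = ["ana", "ana", "ben", "ana", "guest", "guest"]
registered_users = 4

# Counting raw views can exceed 100% of the user base.
by_views = percent_of_users(len(view_log), registered_users)

# Counting each viewer once cannot exceed the number of distinct viewers.
by_unique = percent_of_users(len(set(view_log)), registered_users)

print(by_views, by_unique)
```

The gap between the two figures is the hidden decision the commenters uncovered: the statistic is based on views, not on distinct people.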

Measuring and reporting on data can affect the system that created the data

“I think this was a great idea! I am just a bit worried that people will make these projects and take it the wrong way, saying that followers are the most important thing in on Scratch. […]”

Survey response