Fairness and Bias

A closer look at BookCorpus, the text dataset that helps train large language models for Google, OpenAI, Amazon, and others

Photo by Javier Quiroga on Unsplash

BookCorpus has helped train at least thirty influential language models (including Google’s BERT, OpenAI’s GPT, and Amazon’s Bort), according to HuggingFace.

But what exactly is inside BookCorpus?

This is the research question that Nicholas Vincent and I ask in a new working paper that attempts to address some of the “documentation debt” in machine learning research — a concept discussed by Dr. Emily M. Bender and Dr. Timnit Gebru et al. in their Stochastic Parrots paper.

While many researchers have used BookCorpus since it was first introduced, documentation remains sparse. The original paper that introduced the dataset described it as…

New research on Twitter’s timeline curation algorithm sheds light on how it shapes what we’re exposed to.

Original photo by Sneha Cecil on Unsplash, styled by the author via Deep Dream

How does Twitter’s algorithm change what users see in their timelines? In a new research study from the Computational Journalism Lab, we present evidence of several shifts that result from Twitter’s timeline algorithm. Specifically, compared to the old-fashioned chronological timeline, Twitter’s algorithm:

  • ↘️ Showed fewer external links,
  • ✨ Elevated lots of “suggested” tweets (from non-followed accounts),
  • ↗️ Showed a greater diversity of sources,
  • 📊 Slightly shifted exposure to different topics, and
  • 🔊 Had a slight partisan “echo chamber” effect.


Twitter’s timeline curation algorithm now directs the attention of more than 150 million daily active users. According to Twitter, the algorithm…

Following The Markup’s example, I split this blog into the main findings and this “show your work” piece.

We tested Twitter’s algorithm by creating a group of “puppet” accounts, then comparing their “latest tweets” chronological timelines to their “top tweets” algorithmic timelines.
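To make that comparison concrete, here is a minimal sketch of how two timeline snapshots from the same puppet account could be summarized and compared. It is not the code from the paper: the `Tweet` record, its fields, and the function names are hypothetical, and the three metrics are simplified stand-ins for the external-link, “suggested” tweet, and source-diversity measures mentioned in the findings.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Tweet:
    """Hypothetical record for one tweet seen in a timeline snapshot."""
    author: str
    has_external_link: bool


def timeline_summary(timeline: List[Tweet], followed: Set[str]) -> Dict[str, float]:
    """Summarize one timeline with three simple exposure metrics."""
    n = len(timeline)
    if n == 0:
        return {"link_rate": 0.0, "suggested_rate": 0.0, "unique_sources": 0.0}
    return {
        # Share of tweets that contain an external link
        "link_rate": sum(t.has_external_link for t in timeline) / n,
        # Share of tweets from accounts the puppet does not follow
        "suggested_rate": sum(t.author not in followed for t in timeline) / n,
        # Number of distinct accounts appearing in the timeline
        "unique_sources": float(len({t.author for t in timeline})),
    }


def compare_timelines(
    chronological: List[Tweet], algorithmic: List[Tweet], followed: Set[str]
) -> Dict[str, float]:
    """Return algorithmic minus chronological for each metric."""
    chrono = timeline_summary(chronological, followed)
    algo = timeline_summary(algorithmic, followed)
    return {name: algo[name] - chrono[name] for name in chrono}


# Toy data (made up for illustration): a puppet that follows accounts "a" and "b"
followed = {"a", "b"}
chronological = [
    Tweet("a", True), Tweet("b", True), Tweet("a", False), Tweet("b", False),
]
algorithmic = [
    Tweet("a", True), Tweet("c", False), Tweet("d", False), Tweet("b", False),
]

diff = compare_timelines(chronological, algorithmic, followed)
print(diff)
```

In this toy example a negative `link_rate` difference means the algorithmic timeline showed fewer external links, and a positive `suggested_rate` difference means it showed more tweets from non-followed accounts. A real audit would compute such differences across many accounts and snapshots before drawing any conclusion.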

This piece summarizes the technical details from my forthcoming paper auditing Twitter’s timeline curation algorithm. The main findings are in this blog and the full details are in the research paper, but here I will summarize the following:

  • 🧦 How we set up “sock puppet” accounts to emulate typical users
  • 🦾 How we ran automated timeline collection
  • 🦠 How we clustered COVID-19 tweets by topic
  • 🟦 🟥 How we generated partisan labels for accounts
  • 💬 Other frequently asked questions

🧦 Setting up Sock Puppets

Sock-puppet auditing involves emulating…

Might it be time to create an “FDA for algorithms”?

Photo by James Lee on Unsplash

In the United States, there is currently no federal institution that protects the public from harmful algorithms.

We can buy eggs, get a vaccine, and drive on highways knowing there are systems in place to protect our safety: the USDA checks our eggs for salmonella, the FDA checks vaccines for safety and effectiveness, and the NHTSA makes sure highway turns are smooth and gentle enough for high speeds.

But what about when we run a Google search or look up a product on Amazon? What do we know about the safety of the algorithms behind these systems? …

Literally just an easy recipe for basic cinnamon granola, with pretty pictures.

Why does it always take so long to scroll to the actual recipe? I do not have a cute story, other than that I have been experimenting with granola recipes for two years, from the New York Times to random blogs to classic books like “Joy of Cooking” and “How to Cook Everything.” I landed on a synthesized recipe that provides a simple and delicious “base” granola. Here it is:

Simple Cinnamon Granola

Dry Ingredients:

  • 3.5 cups oats (for chunky granola: 1 cup ground + 2.5 cups whole oats)
  • 1 cup chopped nuts (pecans, walnuts, almonds, or a mix)
  • 1/2 tsp salt
  • 1 heaping…

Posting to Facebook feels like trying to entertain a UFC stadium, while posting to Medium feels like an open mic.

Facebook feels like a UFC stage, while Medium feels like an open mic

As the pandemic swept across the world and we all started spending more time on Facebook and other apps, I decided to stop lurking all the time and start participating more. The widespread resonance of the term “doomscrolling” made me wonder: why do we spend so much time scrolling through these feeds if they make us miserable?

I thought participation might be part of the solution, and that sharing useful, interesting, accurate information would improve the Facebook experience for me and for my friends. …

There are currently thousands of propaganda websites masquerading as local news websites across the United States, as the New York Times reported in October 2020 and the Columbia Journalism Review reported in August 2020.

The network of websites spells disaster for the news ecosystem on a number of levels, especially if the sites receive a lot of attention. As Renée DiResta articulated in this WIRED piece, there is an important distinction between “free speech” and “free reach.” Free speech entails Brian Timpone’s ability to write and publish “propaganda ordered up by dozens of think tanks, political operatives, corporate executives and…

The event illustrates how TikTok’s algorithms can make mass political communication more accessible, but it is still no democratic utopia.

Over the summer, I crunched the numbers on about 80,000 TikTok videos pertaining to the prank on Trump’s re-election rally in Tulsa. My main interest was understanding how TikTok’s algorithms may have played a role in promoting the prank. This post summarizes findings from my workshop research paper, which was presented at the RecSys 2020 workshop on responsible recommendation.


Why did the Trump administration want to ban TikTok? A few weeks ago, the app seemed to be days away from its death. And yet many of us were still asking: why, exactly, is Trump trying to ban it? …

Applying an important lesson from Dr. Ruha Benjamin’s book, “Race After Technology” — there may be a difficult truth beneath the glitch.

From rocknrollmonkey on Unsplash, “a little robot

If you’ve seen The Matrix, you likely remember the déjà vu scene, in which Neo notices a black cat walk by twice:

Neo sees a black cat walk by twice. GIF from mcmacsta on tenor

Even watching the animated GIF can induce some disturbing chills. And that sense of disturbance is no coincidence: as Trinity quickly explains to Neo, this minor “glitch” involving the black cat is actually an important sign. It indicates that the agents of the Matrix have changed something in the program, rearranging the reality that Neo, Trinity, Morpheus, and others must face.

As Dr. Ruha Benjamin

Breaking down a “data visceralization” with principles from Data Feminism, a book by Catherine D’Ignazio and Lauren Klein.

A screenshot from ProPublica’s story, “What Coronavirus Job Losses Reveal About Racism in America”

As articulated by authors Catherine D’Ignazio and Lauren Klein, Data Feminism is “a way of thinking about data, both their uses and their limits, that is informed by direct experience, by a commitment to action, and by intersectional feminist thought.” It has seven core principles:

  1. Examine power
  2. Challenge power
  3. Elevate emotion and embodiment
  4. Rethink binaries and hierarchies
  5. Embrace pluralism
  6. Consider context
  7. Make labor visible

In this post, I will illustrate some principles from Data Feminism by breaking down this unemployment chart recently published by ProPublica.

Challenging Power Knowledge about Unemployment

To apply the first two core principles from Data Feminism (examine power and challenge power)…

Jack Bandy

PhD student studying AI, ethics, and media. Trying to share things I learn in plain English. 🐦 @jackbandy
