
Closing the gender data gap in AI

Posted March 7, 2022
[Illustration: numerous people making the "Break the Bias" gesture]

The first computer algorithm is said to have been written in the early 1840s, for Charles Babbage's proposed Analytical Engine.

The name of the programmer? Ada Lovelace, a mathematician dubbed a “female genius” in her posthumous biography.

As computing developed in the century after Lovelace's death, the typing work involved in creating computer programs was seen as "women's work," a role viewed as akin to switchboard operator or secretary. Women wrote the software while men built the hardware, the latter seen at the time as the more prestigious of the two tasks. And during the Space Race of the 1950s and '60s, three Black women working as "human computers" broke gender and racial barriers to help NASA send its first astronauts into orbit. Though this history was featured in the film Hidden Figures, it is still not widely known today.

It would not be an overstatement to say that the field of computer programming was defined by women, and yet women represent just 19.7% of today's software developer workforce, according to the U.S. Bureau of Labor Statistics. Alongside this massive reversal is the growing prevalence of bias in the data used to program computer software and artificial intelligence (AI).

How did this gender data gap emerge in AI? What consequences does it have, not only for women but across broader society? And perhaps most importantly, how can we resolve it to ensure unbiased machine learning algorithms in the future?

What is the gender data gap?

The gender data gap is the product of a world in which men are assumed to be the default. It can take many forms: women are 73% more likely to be seriously injured in a frontal car crash, largely because airbags, seatbelts and collision-testing procedures have been designed around the height and weight of a standard male body. And voice-recognition technology is less likely to accurately transcribe the speech of people who are not white men, notably because virtual assistants are largely trained on datasets dominated by white male voices.

This type of data bias can exist within every subfield of AI, from machine learning to natural language processing. Though widely prevalent, these biases do not arise from some great global plot to minimize women. Rather, as Caroline Criado Perez writes in her book Invisible Women: Data Bias in a World Designed for Men, "One of the most important things to say about the gender data gap is that it is not generally malicious, or even deliberate. Quite the opposite. It is simply the product of a way of thinking that has been around for millennia and is therefore a kind of not thinking. A double not thinking, even: men go without saying, and women don't get said at all."

How excluding gender from data collection impacts women

Artificial intelligence cannot exist without humans, and that is precisely where the challenge lies. A 2021 academic paper published in the Journal of Marketing Management notes that "Algorithms have been shown to learn gender and incorporate gender biases, thus reflecting damaging stereotypes about women."

This has massive implications when collecting, annotating and validating AI training data. How can you create artificial intelligence of value if it doesn’t really know the end user it has been designed to support? And, are you really maximizing your market when you don’t include historically marginalized groups in your data?

As AI becomes increasingly inextricable from the human experience, we must understand how existing social biases, such as sexism and racism, become embedded in the world of data. Given that U.S. Bureau of Labor Statistics figures show white men are overrepresented in software programming, the question is not whether bias exists in data, but which forms of bias exist and to what extent.

Fundamentally, the only way to avoid gender bias in AI data is to intentionally and methodically root it out.

Working against bias with more inclusive AI data practices

Eliminating data bias is not an easy, one-and-done initiative. This is as true in AI as it is in society at large. Rather, it takes an ongoing, evolving process to ensure your data offers the most representative, inclusive portrait of the marketplace.

The first step is identifying where bias lives in the first place. Anti-woman bias can certainly be shockingly obvious, but it can also be subtle and, as Criado Perez notes, so systematically ingrained in our ways of operating that we don't even notice it. For instance, a 2019 Cornell University analysis of more than 11 million online reviews, conducted using a word-embedding algorithm, showed that women tend to be more commonly associated with negative attributes (e.g., fickle, impulsive, lazy) than with "positive" characteristics such as loyal, sensible or industrious.

Most importantly, that paper notes that “algorithms designed to learn from human language output, also learn gender bias.”
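
To make that mechanism concrete, here is a minimal, hypothetical sketch of how gender associations can be probed in word embeddings. The four-dimensional vectors below are invented purely for illustration; a real audit would use trained embeddings (such as word2vec or GloVe) and an established test such as the Word Embedding Association Test (WEAT).

```python
import numpy as np

# Toy embedding table: these vectors are made up for the sake of a
# runnable example. A real audit would load trained embeddings.
embeddings = {
    "he":          np.array([ 0.8, 0.1,  0.3, 0.0]),
    "she":         np.array([-0.8, 0.1,  0.3, 0.0]),
    "lazy":        np.array([-0.5, 0.4, -0.2, 0.1]),
    "industrious": np.array([ 0.6, 0.3, -0.1, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One common probe: project attribute words onto a "gender direction,"
# the difference between feminine and masculine anchor words.
gender_direction = embeddings["she"] - embeddings["he"]

for word in ("lazy", "industrious"):
    score = cosine(embeddings[word], gender_direction)
    leaning = "feminine" if score > 0 else "masculine"
    print(f"{word}: {score:+.3f} ({leaning}-leaning in this toy space)")
```

Scaled up across millions of documents, this kind of projection is how researchers quantify which attributes a model has learned to associate with women versus men.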

The next step in reducing bias, then, is to interrogate where the data is coming from and who is producing it. Three critical ways to intentionally counter bias are to use diverse datasets; to ensure diversity among the individuals collecting, annotating and validating the data; and to ensure diversity among those writing the algorithms themselves. Greater representation on these fronts will help reduce bias, and choosing an AI data partner that is focused on diversity is an excellent way to start this journey.
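
As a concrete starting point, a dataset audit can begin with simply counting who is represented. The sketch below assumes each training record carries self-reported demographic metadata; the "speaker_gender" field, the sample values and the 30% threshold are all hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical training records: the field name and values are
# assumptions made for the sake of a runnable example.
records = [
    {"text": "...", "speaker_gender": "female"},
    {"text": "...", "speaker_gender": "male"},
    {"text": "...", "speaker_gender": "male"},
    {"text": "...", "speaker_gender": "male"},
    {"text": "...", "speaker_gender": "nonbinary"},
]

counts = Counter(r["speaker_gender"] for r in records)
total = sum(counts.values())

for group, n in counts.most_common():
    share = n / total
    print(f"{group}: {n} of {total} samples ({share:.0%})")
    # Flag groups falling below an (arbitrary, example) representation floor.
    if share < 0.30:
        print("  -> underrepresented; consider targeted data collection")
```

A report like this won't fix bias on its own, but it makes gaps visible early, before they are baked into a trained model.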

With AI algorithms playing a greater role in our daily lives, it is our collective duty to #BreakTheBias and ensure an inclusive experience for all users. Indeed, bringing greater gender equity and equality helps us all: As the saying goes, a rising tide lifts all boats.


Check out our solutions

Test and improve your machine learning models via our global AI Community of 1 million+ annotators and linguists.
