What is data annotation and why does it matter?

Daily life is guided by algorithms. Even the simplest decisions — an estimated time of arrival from a GPS app or the next song in the streaming queue — can filter through artificial intelligence and machine learning algorithms. We rely on these algorithms for reasons that range from personalization to efficiency, but their ability to deliver on those promises depends on data annotation: the process of accurately labeling datasets to train artificial intelligence to make future decisions. Data annotation is the workhorse behind our algorithm-driven world.
What is data annotation?
Computers can’t process visual information the way human brains do: A computer needs to be told what it’s interpreting and provided context in order to make decisions. Data annotation makes those connections. It’s the human-led task of labeling content such as text, audio, images and video so it can be recognized by machine learning models and used to make predictions.
Data annotation is both a critical and impressive feat when you consider the current rate of data creation. By 2025, an estimated 463 exabytes of data will be created globally on a daily basis, according to The Visual Capitalist — and that research was done before the COVID-19 pandemic accelerated the role of data in daily interactions. Now, the global data annotation tools market is projected to grow nearly 30% annually over the next six years, according to GM Insights, especially in the automotive, retail and healthcare sectors.
Why does it matter?
Data is the backbone of the customer experience. How well you know your clients directly impacts the quality of their experiences. As brands gather more and more insight on their customers, AI can help make the data collected actionable. According to Gartner, by 2022, 70% of customer interactions are expected to filter through technologies like machine learning (ML) applications, chatbots and mobile messaging.
“AI interactions will enhance text, sentiment, voice, interaction and even traditional survey analysis,” says Gartner vice-president Don Scheibenreif on the analyst firm’s blog. But in order for chatbots and virtual assistants to create seamless customer experiences, brands need to make sure the datasets guiding these decisions are high-quality.
As it currently stands, data scientists spend a significant portion of their time preparing data, according to a survey by data science platform Anaconda. Part of that is spent fixing or discarding anomalous/non-standard pieces of data and making sure measurements are accurate. These are vital tasks, given that algorithms rely heavily on understanding patterns in order to make decisions, and that faulty data can translate into biases and poor predictions by AI.

Types of data annotation
Data annotation is a broad practice, but every type of data has its own labeling process. Here are some of the most common types:
- Semantic annotation: Semantic annotation is a process where concepts like people, places or company names are labeled within a text to help machine learning models categorize new concepts in future texts. This is a key part of AI training to improve chatbots and search relevance.
- Image annotation: This type of annotation ensures that machines recognize an annotated area as a distinct object and often involves bounding boxes (imaginary boxes drawn on an image) and semantic segmentation (the assignment of meaning to every pixel); a sketch of a bounding-box record follows this list. These labeled datasets can be used to guide autonomous vehicles or as part of facial recognition software.
- Video annotation: Similar to image annotation, video annotation applies techniques like bounding boxes on a frame-by-frame basis, often via a dedicated video annotation tool, to capture movement. Data uncovered through video annotation is key for computer vision models that conduct localization and object tracking.
- Text categorization: Text categorization is the process of assigning categories to sentences or paragraphs by topic, within a given document.
- Entity annotation: The process of helping a machine understand unstructured text. A wide variety of techniques can be used to build that understanding, such as Named Entity Recognition (NER), where words within a body of text are annotated with predetermined categories (e.g., person, place or thing). Another example is entity linking, where parts of a text (e.g., a company and the place where it’s headquartered) are tagged as related.
- Intent extraction: Intent extraction is the process of labeling phrases or sentences with intent in order to build a library of ways people use certain verbiage. For example, “How do I make a reservation?” and “Can I confirm my reservation?” both contain the same keyword but have different intents; the second sketch after this list pairs each utterance with an intent label. It’s another key tool for teaching chatbot algorithms to make decisions about customer requests.
- Phrase chunking: Phrase chunking involves tagging parts of speech with their grammatical definitions (e.g., noun or verb).
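To make image annotation concrete, here is a minimal sketch of what a single bounding-box record might look like. The schema (field names such as "image_id", "bbox" and "segmentation") is a hypothetical illustration loosely modeled on common object-detection formats, not the output of any particular annotation tool:

```python
# Hypothetical image-annotation record (illustrative schema only).
# The bounding box is stored as [x, y, width, height] in pixels; the
# optional segmentation is a polygon of (x, y) vertices marking which
# pixels belong to the labeled object.
image_annotation = {
    "image_id": "frame_000123.jpg",        # example file name
    "width": 1920,
    "height": 1080,
    "objects": [
        {
            "label": "pedestrian",
            "bbox": [704, 312, 128, 290],  # x, y, w, h
            "segmentation": [[704, 312], [832, 315], [828, 602], [706, 598]],
        },
        {
            "label": "traffic_light",
            "bbox": [1510, 88, 42, 110],
        },
    ],
}

# Annotators produce thousands of such records; a training pipeline then
# converts them into the tensors a detection or segmentation model expects.
for obj in image_annotation["objects"]:
    x, y, w, h = obj["bbox"]
    print(f'{obj["label"]}: {w}x{h} px box at ({x}, {y})')
```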
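Text labeling works the same way in spirit. The second sketch below shows how entity spans (as in NER) and intent labels might be attached to customer utterances; the field names and label set are assumptions chosen for illustration, not a standard:

```python
# Hypothetical text-annotation records (illustrative schema only).
# Entities are character-offset spans with a category, the way NER
# training data is commonly annotated; each utterance also carries an
# intent label of the kind used to train chatbot classifiers.
text_annotations = [
    {
        "text": "How do I make a reservation at the Berlin office?",
        "intent": "book_reservation",      # assumed intent label
        "entities": [
            {"start": 35, "end": 41, "label": "LOCATION"},  # "Berlin"
        ],
    },
    {
        "text": "Can I confirm my reservation?",
        "intent": "confirm_reservation",   # same keyword, different intent
        "entities": [],
    },
]

# Sanity-check that each span really covers the text it claims to label.
for record in text_annotations:
    for ent in record["entities"]:
        span = record["text"][ent["start"]:ent["end"]]
        print(f'{record["intent"]}: {ent["label"]} -> "{span}"')
```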
An evolving science
In the same way that data is constantly evolving, the process of data annotation is becoming more sophisticated. To put it in perspective, four or five years ago, it was enough to label a few points on a face and create an AI prototype based on that information. Now, there can be as many as 20 dots on the lips alone.
The ongoing transition from scripted chatbots to conversational AI is one of the developments promising to bridge the gap between artificial and natural interactions. At the same time, consumer trust in AI-derived solutions is gradually increasing. A recent study published in Harvard Business Review found that people are far more likely to accept an algorithm’s recommendation when it concerns a product’s practicality or objective performance.
Algorithms will continue to shape consumer experience for the foreseeable future — but algorithms can be flawed, and can suffer from the same biases as their creators. Ensuring AI-powered experiences are pleasant, efficient and effective requires data annotation done by diverse teams with a nuanced understanding of what they’re annotating. Only then can we ensure data-based solutions are as accurate and representative as possible.