Five reasons why data annotation matters
Artificial intelligence (AI) is ingrained in how we experience the world. Machine learning and algorithms queue up our favorite songs, detect anomalies in our online banking, and even drive our vehicles and help diagnose our illnesses.
But none of this would be possible if no one ‘told’ these AI-powered platforms what they are seeing, and how to interpret it. Alongside advanced algorithms and powerful computers, human-led data annotation is the invisible infrastructure behind our AI-driven present and future.
Data annotation is the answer to a problem that’s persisted since the summer of 1966, when AI pioneer Marvin Minsky first instructed a group of MIT grad students to get a computer to identify objects in a scene. That experiment failed to accomplish its goal, and the underlying problem continued to stump programmers for decades. For all the evolution of natural language processing and computer vision over the past half-century, accurately labeled datasets remain the critical foundation for teaching machines.
1. In order to develop models, computers need to understand what they’re seeing
Of all the brilliant things computers can do, finding subtle patterns and inferences in data is still not their strong suit. Instead, data needs to be annotated to give computers an idea of what they’re looking at. Data annotation is the human-led process of adding metadata tags to mark up certain elements of text, images, audio and video clips.
Annotation techniques vary from project to project. In image annotation, for example, data annotators can draw and label 2D or 3D bounding boxes around objects of interest to help improve detection for autonomous vehicles. Another form of image annotation called “landmark annotation” labels anatomical or structural points of interest, which can help computers recognize faces and emotions. Text annotation can include identifying parts of speech like adjectives and nouns, while entity disambiguation links named entities (persons, locations, products, companies, etc.) mentioned in the text to their entries in a knowledge base. This latter sort of annotation is a vital part of designing AI training datasets that act as the backbone of tools like chatbots, virtual assistants or search engines.
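As a rough illustration of what such metadata can look like, the sketch below represents a 2D bounding box and an entity link as plain Python records. The field names, the label values, the example sentence and the knowledge-base identifier are all hypothetical; they are not taken from any particular annotation tool.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox2D:
    """A hypothetical 2D bounding-box annotation, in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        """Area of the box in square pixels."""
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


@dataclass
class EntityLink:
    """A hypothetical entity annotation: a text span tied to a knowledge-base entry."""
    start: int        # index of the first character of the span
    end: int          # index one past the last character of the span
    entity_type: str  # e.g. "person", "location", "product", "company"
    kb_id: str        # made-up identifier, for illustration only


# An annotator draws a box around an object of interest in an image.
box = BoundingBox2D(label="pedestrian", x_min=120, y_min=80, x_max=180, y_max=240)

# An annotator marks a named entity in a sentence and links it to a knowledge base.
sentence = "Book a room in Paris for Friday."
start = sentence.index("Paris")
link = EntityLink(
    start=start,
    end=start + len("Paris"),
    entity_type="location",
    kb_id="kb:paris-france",
)

print(box.label, box.area())
print(sentence[link.start:link.end], "->", link.kb_id)
```

Real annotation formats usually carry more context, such as the image or document ID and the annotator who made the label, but the core idea is the same: a human attaches structured tags to a specific region of the raw data.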
2. Processing data can be time-consuming
A recent survey by data science platform Anaconda found that data scientists spend about 45% of their time preparing data, which includes processes like cleaning data (fixing or discarding anomalous or non-standard entries and making sure measurements are accurate). That is an improvement on the 80–90% quoted in past surveys, but it is a reminder that data preparation remains a time-consuming part of AI projects.
Once a diverse set of data representing the different elements, verbiage and images of your business has been retrieved and combined from multiple sources, and the cleansing and formatting process is complete, that data needs to be “loaded” into a storage system. According to Anaconda’s survey, data scientists spend about a fifth (19%) of their time on data loading.
The cost of making data usable is driving businesses across industries like healthcare, automotive and eCommerce to outsource their data annotation projects. A 2020 report from Grand View Research suggests that the global data annotation tools market — worth $390 million in 2019 — will see compound annual growth of 27% from 2020 to 2027.
In addition to saving time, outsourcing data annotation to third-party providers can help mitigate harmful biases in AI, which, if left unchecked, could have far-reaching ethical implications for consumers and society as a whole. Working with an experienced, established partner that has a large and diverse team of human annotators makes it easier to achieve a suitable demographic distribution for your projects.
3. Bad data has its own costs
Part of the reason AI-enabled industries are investing in data annotation is that the alternative, doing nothing and working with bad data, is expensive too. Humans understand intent and tone, and are better at subjectivity and nuance than computers. Consider a scenario where a customer uses the phrase “cancellation fee” with a chatbot that handles hotel reservations. A customer asking “Is there a cancellation fee if I cancel with less than 24 hours’ notice?” has a different intent than someone saying “I agree to pay the cancellation fee.”
If the AI-powered chatbot misunderstands the intent, the reservation could be cancelled by mistake, costing the hotel the booking and negatively impacting the customer experience. Data annotation gives the algorithm a framework for how people phrase requests, from which it can extrapolate to new ones. But it begins with that tagged dataset.
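To make the idea of a tagged dataset concrete, here is a minimal sketch of intent-labeled utterances for the hotel scenario above. The intent names, the third utterance and the helper function are hypothetical; a real system would train a model on many such examples rather than look them up.

```python
# Hypothetical intent-labeled utterances. The first two come from the
# scenario above; the third is made up for illustration.
LABELED_UTTERANCES = [
    ("Is there a cancellation fee if I cancel with less than 24 hours' notice?",
     "ask_cancellation_policy"),
    ("I agree to pay the cancellation fee.",
     "confirm_cancellation"),
    ("How much is the cancellation fee for a weekend stay?",
     "ask_cancellation_policy"),
]


def intents_for_phrase(phrase, examples):
    """Return the set of intents whose utterances contain the given phrase."""
    phrase = phrase.lower()
    return {intent for text, intent in examples if phrase in text.lower()}


# The same keyword appears under more than one intent, so matching on the
# phrase alone cannot tell the chatbot what the customer wants.
matching_intents = intents_for_phrase("cancellation fee", LABELED_UTTERANCES)
print(sorted(matching_intents))
```

Because the phrase maps to more than one intent, the human-assigned labels are what let a model learn to separate a question about the policy from an instruction to cancel.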
4. The data we create is constantly evolving
By 2025, it’s estimated that 463 exabytes of data will be created daily across the globe. For reference, one exabyte is one billion gigabytes, a staggering amount of information. Although much of that data will be unusable, a lot of valuable data is hidden within it and will need to go through the processes above.
According to an article in AI journal the Synced Review, the Open Dataset released by Waymo (formerly Google’s autonomous vehicle project) includes 3,000 driving scenes totaling over 16.7 hours of video data. The 600,000 frames have “approximately 25 million 3D bounding boxes and 22 million 2D bounding boxes” annotated. The article also points to the evolution of facial key-point labelling, a technique where dots are put on a human face to help with recognition. Data annotators used to need only a few dots on the face; now there can be as many as 20 dots on the lips alone and over 200 on the entire face.
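A quick back-of-the-envelope calculation gives a sense of the annotation density these figures imply. The inputs are the rounded numbers quoted above, so the derived values are approximations rather than official dataset statistics.

```python
# Rounded figures as quoted in the article above.
scenes = 3_000
hours_of_video = 16.7
frames = 600_000
boxes_3d = 25_000_000
boxes_2d = 22_000_000

total_seconds = hours_of_video * 3600

# Derived, approximate quantities.
seconds_per_scene = total_seconds / scenes
frames_per_second = frames / total_seconds
boxes_3d_per_frame = boxes_3d / frames
boxes_2d_per_frame = boxes_2d / frames

print(f"Seconds per scene:  {seconds_per_scene:.1f}")
print(f"Frames per second:  {frames_per_second:.1f}")
print(f"3D boxes per frame: {boxes_3d_per_frame:.1f}")
print(f"2D boxes per frame: {boxes_2d_per_frame:.1f}")
```

On these numbers, each scene lasts about 20 seconds, and every annotated frame carries roughly 42 3D boxes and 37 2D boxes.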
5. How we use data is also changing
Just as the types of data being created are always evolving, what consumers expect in return for their data is evolving too. They expect that if they provide increasing access to their personal preferences and data, virtual assistants and chatbots will become far more accurate at understanding them and their needs. They expect technology using biometrics to work on the first try. They expect autonomous vehicles to be safe, perhaps even safer than when they drive themselves.
This constant evolution of the way we use data is entwined with an AI-powered future. But in order to continue this march forward, data needs to be clean and precisely annotated. And this human-led invisible infrastructure of data annotators has never played a more critical role.