Training data is key for natural language processing algorithms
Natural language is the conversational language that we use in our daily lives. It represents a great opportunity for artificial intelligence (AI) — if machines can understand natural language, then the potential use for technology like chatbots increases dramatically.
For years now, chatbots have received a lot of attention in the media and at AI conferences, owing to advancements in natural language processing (NLP) technology.
What is natural language processing?
Natural language is the spoken words that you use in daily conversations with other people. Not long ago, machines could not understand it. But now, data scientists are working on artificial intelligence technology that can understand natural language, unlocking future breakthroughs and immense potential.
There are three basic steps to natural language processing:
- The machine processes what the user said, and interprets the meaning according to a series of algorithms.
- The machine decides the appropriate action in response to what the user said.
- The machine produces an appropriate output response in a language that the user can understand.
In addition, there are three main actions performed by natural language processing technology: understand, action, and reaction.
- Understand: First, the machine must understand the meaning of what the user is saying in words. This step uses natural language understanding (NLU), a subset of NLP.
- Action: Second, the machine must act on what the user said. For example, if you say “Hey Alexa, order toilet paper on Amazon!”, Alexa will understand the request and place the order for you.
- Reaction: Finally, the machine must react to what the user said. Once Alexa has successfully ordered toilet paper for you on Amazon, she should tell you: “I ordered toilet paper and it should be delivered tomorrow.”
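The understand, action, and reaction stages above can be sketched as a small pipeline. This is a toy illustration, not a real assistant: the keyword-based intent matcher and the canned responses are invented for this example.

```python
def understand(utterance: str) -> str:
    """NLU step: map the user's words to an intent (toy keyword matching)."""
    text = utterance.lower()
    if "order" in text:
        return "place_order"
    if "weather" in text:
        return "get_weather"
    return "unknown"

def act(intent: str) -> bool:
    """Action step: carry out what the intent calls for.
    A real assistant would call an ordering or weather API here."""
    return intent in ("place_order", "get_weather")

def react(intent: str, success: bool) -> str:
    """Reaction step: produce a response the user can understand."""
    if not success:
        return "Sorry, I didn't understand that."
    if intent == "place_order":
        return "I ordered it and it should be delivered tomorrow."
    return "Here is today's weather forecast."

def handle(utterance: str) -> str:
    intent = understand(utterance)
    return react(intent, act(intent))
```

Real systems replace the keyword matcher with a statistical model trained on annotated examples, but the three-stage flow stays the same.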
Now, many companies and data scientist groups are working on NLP research. But NLP applications such as chatbots still don’t have the same conversational ability as humans, and many chatbots are only able to respond with a few select phrases.
Sentiment analysis is an important part of NLP, especially when building chatbots. Sentiment analysis is the process of identifying and categorizing opinions in a piece of text, often with the goal of determining the writer’s attitude towards something. It affects the “reaction” stage of NLP. The same input text could require different reactions from the chatbot depending on the user’s sentiment, so sentiments must be annotated in order for the algorithm to learn them.
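To make the idea concrete, here is a minimal lexicon-based sentiment scorer. The word lists are hand-picked for illustration only; production systems learn sentiment from large volumes of annotated text rather than from a fixed lexicon.

```python
import re

# Illustrative mini-lexicons, not a real sentiment resource.
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"terrible", "hate", "awful", "angry", "bad"}

def sentiment(text: str) -> str:
    """Label text positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A chatbot could branch on this label in its reaction stage, apologizing to an angry user instead of replying with a stock phrase.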
To improve the decision-making ability of AI models, data scientists must feed them large volumes of training data so the models can learn patterns from it. But raw data, such as an audio recording or a collection of text messages, is useless for training machine learning models. The data must first be labeled and organized into a training dataset.
Natural language processing and data annotation
Entity annotation is the process of labeling unstructured sentences with information so that a machine can read them. For example, this could involve labeling all people, organizations and locations in a document. In the sentence “My name is Andrew,” Andrew must be properly tagged as a person’s name to ensure that the NLP algorithm is accurate.
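Entity annotations are often stored as token/label pairs; the BIO scheme shown below is one common convention (B- marks the beginning of an entity, I- its continuation, O everything else). The labels here are hand-assigned for illustration.

```python
# "My name is Andrew", annotated so a machine can read it.
sentence = ["My", "name", "is", "Andrew"]
labels   = ["O",  "O",    "O",  "B-PER"]  # Andrew tagged as a person name

def entities(tokens, tags):
    """Collect (text, entity_type) pairs from BIO-tagged tokens."""
    found = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            found.append((token, tag[2:]))
        elif tag.startswith("I-") and found:
            # Continuation token: append to the current entity.
            text, kind = found[-1]
            found[-1] = (text + " " + token, kind)
    return found
```

An NER model is then trained to predict these tags for unseen sentences, which is why consistent human annotation matters.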
Linguistic text annotation is also crucial for NLP. Sometimes, instead of tagging people or place names, AI community members are asked to tag which words are nouns, verbs, adverbs, etc. These data annotation tasks can quickly become complicated, as not everyone has the necessary knowledge to distinguish the parts of speech.
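A part-of-speech annotation for a sentence is again just a list of token/tag pairs. The example below uses the widely used Penn Treebank tag abbreviations; the sentence and its tags are hand-made for illustration.

```python
# "The dog runs quickly", annotated with Penn Treebank-style POS tags.
annotated = [
    ("The",     "DT"),   # determiner
    ("dog",     "NN"),   # noun, singular
    ("runs",    "VBZ"),  # verb, 3rd person singular present
    ("quickly", "RB"),   # adverb
]

def tokens_with_tag(pairs, tag):
    """Pull out every token carrying a given part-of-speech tag."""
    return [token for token, t in pairs if t == tag]
```

Annotators produce thousands of such labeled sentences, and a tagging model learns to reproduce the labels on new text.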
Is natural language processing different for different languages?
In English, there are spaces between words, but in some other languages, like Japanese, there aren’t. The technology required for audio analysis is the same for English and Japanese. But for text analysis, Japanese requires the extra step of separating each sentence into words before individual words can be annotated.
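The contrast can be sketched in a few lines. English tokenization can simply split on whitespace, while Japanese needs a segmentation step first. The greedy longest-match segmenter below runs over a tiny hand-made lexicon purely for illustration; real pipelines use full morphological analyzers such as MeCab with complete dictionaries.

```python
# Tiny illustrative lexicon for the sentence "私は犬が好きです" (I like dogs).
LEXICON = {"私", "は", "犬", "が", "好き", "です"}

def segment_en(text: str):
    """English: whitespace already marks the word boundaries."""
    return text.split()

def segment_ja(text: str):
    """Japanese: greedily match the longest lexicon entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit as-is
            i += 1
    return tokens
```

Only after this segmentation step can individual Japanese words be annotated the same way English words are.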
The whole process for natural language processing requires building out the proper operations and tools, collecting raw data to be annotated, and hiring both project managers and workers to annotate the data. Here at TELUS International, we’ve built a community of crowdworkers who are language experts and who turn raw data into clean training datasets for machine learning. A typical task for our crowdworkers involves working with a foreign language document and tagging the words that are people names, place names, company names, etc.
For machine learning, training data volume is the key to success. It’s true that there are more English speakers worldwide, and that English is the main language for most of the top companies doing AI research, such as Amazon, Apple, and Facebook. But it doesn’t matter which company “has the most advanced technology” or which language is “easier to learn,” because the most important factor is the volume of training data available. The same holds for other fields of artificial intelligence, such as computer vision and content categorization: in every one of them, the volume and quality of training data are the bottleneck to technological advancement. If you are looking for help in gathering clean training data for your next innovative technology, TELUS International’s AI Data Solutions can help.