To make sure your bot responds to user utterances with related tasks, it’s important that you test the bot with a variety of user inputs. Evaluating a bot with a large sample of expected user inputs not only provides insights into bot responses but also gives you a great opportunity to train the bot in interpreting diverse human expressions. You can perform all the training-related activities for a bot from the Testing module. We will use a sample Flight Booking bot consisting of the following tasks for use as examples across this article.
Testing the Bot
Simply put, testing a bot means checking whether the bot can respond to a user intent with the most relevant task. Given the flexibility of language, users will use a wide range of phrases to express the same intent. For example, you can rephrase “I want to change my ticket from San Francisco to Los Angeles on Jan 1” as “Please change my travel date. Can’t make it on Jan 1.” The trick is to train the bot to map both of these utterances to the Modify Booking task.
So, the first step in testing a bot is to identify a representative sample of user utterances against which to test the bot responses. Look for sources of data that reflect real-world usage of the language, such as support chat logs, online communities, and FAQ pages of relevant portals.
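One way to keep such a sample organized is as a list of utterances paired with the task each one should trigger. The sketch below is illustrative only: the `UtteranceCase` type, the `accuracy` helper, and the keyword-based `toy_classifier` are hypothetical stand-ins and are not part of the platform.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class UtteranceCase:
    """One sample utterance and the task it is expected to trigger."""
    utterance: str
    expected_task: str


# Sample utterances taken from the examples in this article.
SAMPLE_CASES: List[UtteranceCase] = [
    UtteranceCase(
        "I want to change my ticket from San Francisco to Los Angeles on Jan 1",
        "Modify Booking",
    ),
    UtteranceCase(
        "Please change my travel date. Can't make it on Jan 1.",
        "Modify Booking",
    ),
]


def toy_classifier(utterance: str) -> Optional[str]:
    """Hypothetical stand-in for the bot: matches on a single keyword."""
    return "Modify Booking" if "change" in utterance.lower() else None


def accuracy(
    cases: List[UtteranceCase],
    classify: Callable[[str], Optional[str]],
) -> float:
    """Fraction of cases for which the classifier returns the expected task."""
    if not cases:
        return 0.0
    hits = sum(1 for case in cases if classify(case.utterance) == case.expected_task)
    return hits / len(cases)
```

Keeping the expected task next to each utterance makes it easy to spot which phrasings still need training after a test pass.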
How to test the bot
Follow these steps to test a bot:
- Open the bot that you want to test.
- Hover over the left navigation panel and select Testing -> Utterance Testing.
- In the Type a user utterance field, enter the utterance that you want to test. Example: Rescheduling my plan. Cancel my ticket to LA.
Types of Test Results
When you test a user utterance against a bot, the NLP engine tries to find the bot tasks that match the intent. The NLP engine takes a hybrid approach, combining the Machine Learning, Fundamental Meaning, and Knowledge Graph (if the bot has one) models to score the matching intents on relevance. It classifies each matching intent as either a Possible Match or a Definitive Match.
Definitive Matches get high confidence scores and are assumed to be perfect matches for the user utterance. In published bots, if the user input matches a single Definitive Match, the bot directly executes the task. If the utterance matches multiple Definitive Matches, they are sent as options for the end user to choose from. Possible Matches, on the other hand, are intents that score reasonably well against the user input but do not inspire enough confidence to be termed exact matches. Internally, the system further classifies Possible Matches into good and unsure matches based on their scores. If an end-user utterance generates Possible Matches in a published bot, the bot sends these matches as “Did you mean?” suggestions to the end user.
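The handling of Definitive and Possible Matches in a published bot can be pictured as a small decision function. The following is a minimal sketch under stated assumptions: the two thresholds, the `respond` function, and the action labels are hypothetical, the actual score ranges used by the platform are not given in this article, and the sketch skips the re-scoring of Possible Matches described later. It also assumes that Definitive Matches take priority when both kinds are present.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds; the platform's real values are not stated here.
DEFINITIVE_THRESHOLD = 0.9
POSSIBLE_THRESHOLD = 0.5


def respond(scores: Dict[str, float]) -> Tuple[str, List[str]]:
    """Map intent scores to an action label and the intents it applies to."""
    definitive = sorted(
        task for task, score in scores.items() if score >= DEFINITIVE_THRESHOLD
    )
    possible = sorted(
        task
        for task, score in scores.items()
        if POSSIBLE_THRESHOLD <= score < DEFINITIVE_THRESHOLD
    )
    if len(definitive) == 1:
        return ("execute", definitive)       # run the task directly
    if len(definitive) > 1:
        return ("choose", definitive)        # offer the options to the user
    if possible:
        return ("did_you_mean", possible)    # suggest the Possible Matches
    return ("unidentified", [])              # nothing scored high enough
```

For example, `respond({"Modify Booking": 0.7})` yields a "did_you_mean" suggestion for Modify Booking, because the score clears the possible threshold but not the definitive one.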
Below are the possible outcomes of a user utterance test:
- Single Match (Possible or Definitive): The NLP engine finds a match for the user utterance with a single intent or task. The intent is displayed below the User Utterance field. If it is a correct match, you can move on to test the next utterance or you can also further train the task to improve its score. If it is an incorrect match, you can mark it as incorrect and select the appropriate intent.
- Multiple Matches (Possible or Definitive or Both): The NLP engine identifies multiple intents that match the user utterance. From the results, select the radio button for the matching task and train it.
- Unidentified Intent: The user input did not match any task in any of the linked bots. Select an intent and train it to match the user utterance.
Analyzing the Test Results
When you test a user utterance, in addition to the matching intents you will also see an NLP Analysis box that provides a quick overview of the shortlisted intents, the NLP models that shortlisted them, the corresponding scores, and the final winner. Under the Fundamental Meaning tab, you can see the scores of all the intents even if they aren’t shortlisted. As mentioned above, the Kore.ai NLP engine uses Machine Learning, Fundamental Meaning, and Knowledge Graph (if any) models to match intents.
If the NLP engine finds a single Definitive Match through one of the underlying models, you will see the task as the matching intent. If the test identifies more than one Definitive Match, you will receive them as options from which to pick the right intent.
If the models shortlist more than one Possible Match, all the shortlisted intents are re-scored by the Ranking and Resolver using the Fundamental Meaning model to determine the final winner. Sometimes, multiple Possible Matches secure the same score even after the re-scoring, in which case they are presented as multiple matches for the developer to select one. You can click the tab with the name of the learning model in the NLP Analysis box to view the intent scores.
- Machine Learning (ML) Model: The ML model tries to match the user input with the task label and the training utterances of each task. If the user input consists of multiple sentences, each sentence is run separately against the task name as well as the task utterances. The Machine Learning Model section in the NLP Analysis shows only the names of the tasks that secure a positive score. In general, the more training utterances you add to a task, the better its chances of discovery. For more information, read Machine Learning.
- Fundamental Meaning Model: Apart from the ML model, each task in the bot is also scored against the user input using a comprehensive custom NLP algorithm that involves different combinations of task names, synonyms, and patterns. The Fundamental Meaning Model tab shows the analysis for all the intents in the bot. Click the tab to view the scores of each task, and details of how the scores are calculated as explained below:
- Words Matched: The score given for the number of words in the user input that matched words in the task name or a trained utterance for the task.
- Word Coverage: The score given for the ratio of the words matched with that of the overall words in the task, including task name, field names, utterances, and synonyms.
- Exact Words: The score given for the number of words that matched exactly and not by synonyms.
- Sentence Structure: Bonus for the sentence structure match to the user input.
- Word Position: The score given to a word based on its position in the sentence. Individual words towards the start of the sentence are given higher preference, with extra credit if the word is near the sentence start.
- Order Bonus: Bonus for the number of words in the same order as the task label.
- Role Bonus: Bonus for the number of primary and secondary roles (subject/verb/object) matched.
- Spread Bonus: Bonus for the difference between the position of first and last matched words in a pattern. The higher the difference, the greater the score.
- Penalty: Penalty if there are several phrases before the task name or if there is a conjunction in the middle of the task label.
- Knowledge Collection: If the bot has a Knowledge Graph, the user utterances are processed to extract the terms, which are then mapped to the Knowledge Graph to fetch the relevant paths. All the paths that contain more than a preset threshold number of terms are shortlisted for further screening. A path that covers 100% of the terms and contains a similar FAQ is considered a perfect match.
- Ranking and Resolver: The Ranking and Resolver determines the final winner of the entire NLP computation. If either the ML model or the Knowledge Graph finds a perfect match, the Ranking and Resolver doesn’t re-score the intent and presents it as a matched intent. If there are multiple perfect matches, they are presented as options from which the developer can choose. The Ranking and Resolver re-scores all the other good and unsure matches identified by the three models using the Fundamental Meaning model. After re-scoring, if the final score of an intent crosses a certain threshold, it too is considered a match.
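The Ranking and Resolver flow described above can be sketched roughly as follows. This is a simplified illustration, not the platform's implementation: the `Candidate` type, the `resolve` function, the `rescore` callback (standing in for the Fundamental Meaning model), and the example scores and task names other than Modify Booking are all hypothetical. Returning only the top-scoring intents after re-scoring is also an assumption, based on the description of tied Possible Matches.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    intent: str
    model: str        # "ML", "FM", or "KG": the model that shortlisted it
    definitive: bool  # True when that model reported a perfect match


def resolve(
    candidates: List[Candidate],
    rescore: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Return the winning intent(s); more than one means the developer chooses."""
    # Perfect matches from the ML model or Knowledge Graph skip re-scoring.
    perfect = sorted(
        {c.intent for c in candidates if c.definitive and c.model in ("ML", "KG")}
    )
    if perfect:
        return perfect

    # Everything else is re-scored with the stand-in Fundamental Meaning scorer.
    rescored = {c.intent: rescore(c.intent) for c in candidates}
    passing = {intent: s for intent, s in rescored.items() if s >= threshold}
    if not passing:
        return []

    # Keep every intent that ties for the best score.
    best = max(passing.values())
    return sorted(intent for intent, s in passing.items() if s == best)


# Hypothetical re-scoring results, used only to exercise the sketch.
EXAMPLE_FM_SCORES: Dict[str, float] = {
    "Modify Booking": 0.8,
    "Cancel Booking": 0.8,
    "Book Flight": 0.3,
}


def example_rescore(intent: str) -> float:
    return EXAMPLE_FM_SCORES.get(intent, 0.0)
```

With these example scores, two Possible Matches for Modify Booking and Cancel Booking tie at 0.8 and are both returned, which corresponds to the multiple-matches outcome in Utterance Testing.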
Improving the Bot
Training is how you enhance the performance of the NLP engine to prioritize one bot task or user intent over another based on the user input. You should test and, if needed, train your bot for all possible user utterances and inputs.
Train the bot
- After you enter a User Utterance, depending on the test result do one of the following to open the training options:
- For an unmatched intent: From the Select an Intent drop-down list, select the intent that you want to match with the user utterance.
- For multiple matched intents: Select the radio button for the intent you want to match.
- For a single matched intent: Click the name of the matched intent.
- The user utterance that you entered gets displayed in the field under the ML Utterances section. To add the utterance to the intent, click Save & Train. You can add as many utterances as you want, one after another. For more information, read Machine Learning.
- Under the Intent Synonyms section, each word in the task name appears as a separate line item. Enter synonyms for the words to improve the NLP interpreter’s accuracy in recognizing the correct task. For more information, read Managing Synonyms.
- Under the Intent Patterns section, enter task patterns for the intent. For more information, read Managing Patterns.
- When you are done making the relevant training entries, click Re-Run Utterance to check whether the intent now gets a higher confidence score.
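The effect of adding an utterance and re-running the test can be pictured with a toy model. The sketch below is only an analogy: `ToyIntentStore` and its word-overlap (Jaccard) score are hypothetical and far simpler than the platform's Machine Learning model, but they show why a newly trained utterance raises the score on the next run.

```python
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())


class ToyIntentStore:
    """Hypothetical store of training utterances per intent."""

    def __init__(self) -> None:
        self.utterances: Dict[str, List[str]] = {}

    def train(self, intent: str, utterance: str) -> None:
        """Add an utterance to the intent (analogous to Save & Train)."""
        self.utterances.setdefault(intent, []).append(utterance)

    def score(self, intent: str, utterance: str) -> float:
        """Best word overlap between the input and any trained utterance."""
        words = tokens(utterance)
        best = 0.0
        for trained in self.utterances.get(intent, []):
            trained_words = tokens(trained)
            union = words | trained_words
            if union:
                best = max(best, len(words & trained_words) / len(union))
        return best


store = ToyIntentStore()
store.train("Modify Booking", "change my ticket")
before = store.score("Modify Booking", "Rescheduling my plan")

# Train with the tested utterance, then re-run the same test.
store.train("Modify Booking", "Rescheduling my plan")
after = store.score("Modify Booking", "Rescheduling my plan")
```

In this toy run, `after` is higher than `before` because the tested utterance is now part of the intent's training data.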
Mark an Incorrect Match
When a user input matches an incorrect task, do the following to match it with the right intent: