Once you have built and trained your bot, the most important question is: how good is your bot’s NLP model? Evaluating your bot’s performance shows how well it understands user utterances. Conversational bots need to be constantly trained for better NLP and tested against user utterances. After every round of training, we recommend you check whether the training was appropriate and whether it has improved or degraded the NLP model.
Kore.ai Bots Platform provides the following tools to test the bot NLP:
- Machine Learning Model Analysis
- Batch testing
- NLP detailed Analysis
Machine Learning Model Analysis
The Machine Learning Model (ML Model) graphically represents the performance of your trained utterances against the intent. The ML Model graph evaluates all the training utterances against each bot task and plots them into one of the quadrants for an intent:
- True Positive: Utterances trained for an intent are identifying that intent, as expected.
- True Negative: Utterances not trained for an intent are not identifying that intent, as expected.
- False Positive: Utterances not trained for an intent are incorrectly identifying that intent.
- False Negative: Utterances trained for an intent are not identifying that intent.
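The four quadrants above can be sketched as a small function. This is only an illustration, not the platform's implementation: the name `quadrant` is hypothetical, and it assumes that an utterance "identifies" an intent when the ML score for that intent is positive.

```python
def quadrant(trained_for_intent: bool, ml_score: float) -> str:
    """Place one utterance in a quadrant of a single intent's graph."""
    # Assumption: a positive ML score means the utterance identified the intent.
    identified = ml_score > 0
    if trained_for_intent:
        # Trained for this intent: a match is expected.
        return "True Positive" if identified else "False Negative"
    # Not trained for this intent: no match is expected.
    return "False Positive" if identified else "True Negative"


if __name__ == "__main__":
    print(quadrant(True, 0.82))    # trained for the intent and matched
    print(quadrant(False, 0.35))   # not trained for the intent but matched
```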
A quick look at the graph and you know which utterance-intent matches are accurate and which of them can be further trained to produce better results.
Consider the following key points when referring to the graph:
- The higher up an utterance is in the True quadrants, the better it exhibits the expected behavior.
- Utterances at a moderate level in the True quadrants can be further trained for better scores.
- The utterances falling into the False quadrants need immediate attention.
- Utterances that fall into the true quadrants of multiple bot tasks denote overlapping bot tasks that need to be fixed.
Understanding a Good/Bad ML model
Let us consider a banking bot as an example for understanding a good or bad ML Model. The bot has multiple tasks with more than 300 trained utterances. The image below depicts 4 tasks and their associated utterances.
The model in this scenario is fairly well trained: most of the utterances pertaining to a task are concentrated in the True Positive quadrant, and most of the utterances for other tasks are in the True Negative quadrant.
The developer can work to improve the following aspects of this model:
- In the ML model for task ‘Get Account Balance’ we can see a few utterances (B) in the False Positive quadrant.
- An utterance trained for ‘Get Account Balance’ appears in the True Negative quadrant (C).
- Though the model is well trained and most of the utterances for this task are higher up in the True Positive quadrant, some of the utterances still have a very low score (A).
- When you hover over the dots, you can see the utterance. For A, B, and C, though the utterance should have exactly matched the intent, it has a low or negative score because a similar utterance has been trained for another intent.
Note: In such cases, it’s best to try the utterance with the Test & Train module and check the intents that the ML engine returns along with the associated sample utterances. Fine-tune the utterances and try again.
- The task named Report Loss of Card contains limited utterances which are concentrated together.
Let us now compare it with the ML Model for the Travel Bot below:
The model is trained with a lot of conflicting utterances, resulting in a scattered view of utterances. This is considered a bad model and needs to be re-trained with a smaller set of utterances that do not relate to multiple tasks in the bot.
Batch Testing
Batch Testing helps you discern the ability of the bot to correctly identify the expected intents for a given set of utterances. It involves running a series of tests to get a detailed statistical analysis and gauge the performance of your bot’s NLP model. To conduct a batch test, you can use the predefined test suites available in the builder or create your own custom test suites.
Running a test suite is an asynchronous process and may take time. Once the testing is initiated, a report is created with all the test results and displayed against the test suite. The important elements that the developer needs to note in the report are:
- Success %: the percentage of utterances in the test for which the intent was correctly recognized.
- Precision: the number of correctly classified utterances divided by the total number of utterances that got classified (correctly or incorrectly) to an existing task.
- Recall: the number of correctly classified utterances divided by the total number of utterances that got classified correctly to any existing task or classified incorrectly as an absence of an existing task.
- F1 Score: the harmonic mean of Precision and Recall.
It is important to periodically run the test suite against the successful user utterances to ensure that changes to the model are not adversely affecting it.
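The Precision, Recall, and F1 Score definitions above can be worked through in a few lines. This is a minimal sketch: the function name `batch_metrics` and the counts in the example are made up for illustration and are not taken from an actual batch test report. F1 is computed here with the standard formula 2PR / (P + R).

```python
def batch_metrics(correct: int, wrong_task: int, missed: int) -> dict:
    """Compute precision, recall, and F1 from batch test counts.

    correct    -- utterances classified to the expected task
    wrong_task -- utterances classified to a task, but the wrong one
    missed     -- utterances incorrectly classified as matching no task
    """
    classified = correct + wrong_task      # everything that landed on a task
    expected = correct + missed            # denominator from the Recall definition
    precision = correct / classified if classified else 0.0
    recall = correct / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts for one test suite run.
    print(batch_metrics(correct=8, wrong_task=2, missed=2))
```

With these hypothetical counts, precision and recall both come to 8/10, so F1 is also about 0.8. When the two values differ, F1 is pulled toward the lower one.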
NLP detailed Analysis
The Kore.ai Bot Builder tool follows a white-box approach to NLP analysis. Unlike other platforms, it gives the developer a detailed analysis that includes the scoring patterns from all the engines, so that they can identify, analyze, and then modify the training to better associate a user utterance with an intent.
You can use the built-in Test & Train module to test any utterance against the bot’s trained NLP model and view the detailed analysis. You can also pick an utterance from the Analyze section to view its NLP analysis. The NLP analysis consists of results from all three engines, each of which responds with definitive and possible matches of an intent against the user utterance.
The winner is identified based on the below rules:
- If only one engine returns a definitive match, it is marked as a clear winner and identified as the intent.
- If more than one engine returns a definitive match, then one of the following happens:
  - If a unique intent is identified across multiple engines with no other definitive intent, then that intent is marked as a clear winner.
  - If more than one intent is identified as definitive, then all the identified intents are presented to the user as the matched intents and the user can select the required intent.
- If no intent is identified as definitive and there are one or more probable intents, then the intents are sent for further scoring to the ‘Ranking & Resolver’ engine.
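The winner rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not the platform's code: the name `pick_winner`, the outcome labels, and the dictionary input shape (engine name mapped to the intents it returned) are all hypothetical. It also treats a single engine that returns several definitive intents the same way as several engines that disagree, a case the rules do not spell out.

```python
from typing import Dict, List, Tuple


def pick_winner(definitive: Dict[str, List[str]],
                probable: Dict[str, List[str]]) -> Tuple[str, List[str]]:
    """Apply the winner rules to per-engine definitive and probable matches."""
    # Collect the distinct intents that any engine marked as definitive.
    definitive_intents = sorted({i for intents in definitive.values() for i in intents})
    if len(definitive_intents) == 1:
        # One engine, or several engines agreeing on one intent: clear winner.
        return ("winner", definitive_intents)
    if len(definitive_intents) > 1:
        # Conflicting definitive intents: let the user choose.
        return ("ask_user", definitive_intents)
    probable_intents = sorted({i for intents in probable.values() for i in intents})
    if probable_intents:
        # No definitive match: hand the probable intents over for re-scoring.
        return ("rank_and_resolve", probable_intents)
    return ("no_match", [])


if __name__ == "__main__":
    print(pick_winner({"FM": ["Find ATM"]}, {"ML": ["Find ATM"]}))
    print(pick_winner({}, {"ML": ["Pay Bill"], "KG": ["BillPay FAQ"]}))
```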
The ranking and resolver engine rescores the qualified intents based on the Fundamental meaning model and presents the result to the user as the probable options. It presents only the top 10 percentile of the results after re-scoring.
Each engine uses different rules and thresholds to return definitive and probable matches.
- Machine Learning Engine tries to identify a single top matched task and returns all the intents scored against the user utterance with a positive or negative score. Ideally, it returns one positively scored intent and all the other intents with a negative score. It also evaluates the intent based on cosine similarity. If this score is above 95%, the intent is marked as a Definitive match. If there’s no exact match, all the intents with a positive score are termed probable matches.
- Fundamental Meaning Model scores the user utterance against all the bot intents based on words matched, word coverage, and the number of exact words versus synonyms, and applies bonuses (for sentence structure, word position, etc.) and penalties to arrive at a score for each intent. To qualify an intent, at least 60% of the words in the user utterance should match the intent name.
- Knowledge Graph Engine uses a dual approach of bot ontology and TF-IDF-based scoring. The words in the user utterance are matched against the terms in the knowledge graph, and anything with more than 50% of its terms matched is considered qualified.
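The thresholds listed above can be illustrated with a rough sketch. Everything here is a simplification made for illustration: the function names are hypothetical, words are compared literally (no synonyms, bonuses, penalties, or TF-IDF scoring), and the 60% rule is read literally as the share of utterance words found in the intent name.

```python
import re


def _words(text: str) -> set:
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ml_match_type(cosine_similarity: float, intent_score: float) -> str:
    """ML engine: above 95% similarity is definitive, a positive score is probable."""
    if cosine_similarity > 0.95:
        return "definitive"
    if intent_score > 0:
        return "probable"
    return "none"


def fm_qualifies(utterance: str, intent_name: str) -> bool:
    """FM model: at least 60% of the utterance words match the intent name."""
    utterance_words = _words(utterance)
    if not utterance_words:
        return False
    matched = utterance_words & _words(intent_name)
    return len(matched) / len(utterance_words) >= 0.60


def kg_path_qualifies(utterance: str, path_terms: list) -> bool:
    """KG engine: more than 50% of the path terms appear in the utterance."""
    if not path_terms:
        return False
    utterance_words = _words(utterance)
    matched = [t for t in path_terms if t.lower() in utterance_words]
    return len(matched) / len(path_terms) > 0.50


if __name__ == "__main__":
    print(ml_match_type(0.97, 0.4))
    print(kg_path_qualifies("How do I transfer money abroad?",
                            ["Transfer", "Money", "International"]))
```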
NLP Detection Examples
To understand the NLP detection, let us use the example of a Bank bot with the following details:
- The bot consists of 5 Dialog Tasks and a Default Dialog
- The intents have been trained with Synonyms, Patterns and ML utterances
- The bot consists of a knowledge graph defined with 86 FAQs distributed in 4 top-level terms
Scenario 1 – NLP Analysis with FM identifying a Definitive match
- The Fundamental Meaning (FM) model identified the utterance as a Definitive match.
- The Machine Learning (ML) model also identified it as a Possible match.
- The score returned for the identified task is 6 times higher than the other intent scores. Also, all the words in the intent name are present in the user utterance. Thus the FM model termed it a Definitive match.
- The ML model matches the Find ATM intent as a Probable match.
Scenario 2 – NLP Analysis with ML identifying a Definitive match
- The ML Model returns a Definitive match, with the other models returning no match.
- The FM model could not identify this task because none of the words in the task name Transfer Funds matched the words in the user utterance.
Scenario 3 – NLP Analysis with KG identifying a Definitive match
- The user utterance is “How do I make a transfer money to a London account?”
- The user utterance contains all the terms required to match the Knowledge task path Transfer, Money, International.
- The term international is identified as a synonym of London, which the user used in the utterance.
- As 100% of the path terms matched, the path was qualified. As part of confidence scoring, the terms in the user query are similar to those of the actual Knowledge task question. Thus, it returns a score of 100.
- As the score returned is above 100, the intent is marked as a Definitive match and selected.
- The FM engine found it a Probable match because the key term Transfer is present in the user utterance.
- The ML engine found the utterance a Probable match because the utterance did not fully match any trained utterance.
Scenario 4 – NLP Analysis with Multiple engines returning probable match and selecting a single match
- All 3 engines returned possible matches and no definitive match.
- The ML Model has 1 possible match and the FM Model has 2 possible matches, of which 1 is common. The Knowledge Graph has 1 possible match. All the possible matches identified are re-ranked in the Ranking and Resolver.
- The Ranking and Resolver component returned the highest score for the single match (Task name – “When can I start making payments using BillPay plus?”) from the Knowledge Graph engine. The scores for the other probable matches come out to be lower than 2 percentile of the top score and are thus ignored. The winner, in this case, is the query returned by ‘KG’, and it is presented to the user.
- Though most of the keywords in the user utterance map to the keywords in the KG query, this is still not a definitive match because:
  - The number of path terms matched is not 100%.
  - The KG engine returned the score with 64.72% probability. Had we used the word ‘Billpay’ instead of ‘bill pay’, the score would have been 87.71% (still not a 100% match).
- As the score is between the 60% and 80% thresholds, the query is presented as part of the ‘Did-you-mean’ dialog and not as a complete winner. If the score were above 80%, the platform would have given out the response without re-confirming with the ‘Did-you-mean’ dialog.
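The 60% and 80% thresholds described in this scenario can be sketched as follows. The function name `kg_presentation` and the returned labels are hypothetical. The text does not say what happens at exactly 80% or below 60%, so the boundary handling and the "not_presented" outcome are assumptions.

```python
def kg_presentation(score_percent: float) -> str:
    """Decide how a Knowledge Graph match is presented, given its score."""
    if score_percent > 80:
        # Above 80%: respond directly, no re-confirmation.
        return "respond"
    if score_percent >= 60:
        # Between 60% and 80%: ask the user through 'Did-you-mean'.
        return "did_you_mean"
    # Assumption: below 60% the query is not offered to the user.
    return "not_presented"


if __name__ == "__main__":
    print(kg_presentation(64.72))  # score from the scenario with 'bill pay'
    print(kg_presentation(87.71))  # score quoted for 'Billpay'
```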
Scenario 5 – NLP Analysis with Multiple engines returning probable match and resolver returning back multiple results
- All the engines detected probable matches
- KG returned with 2 possible paths
- The Ranking and Resolver found the 2 queries with scores less than 2% apart.
- Both the Knowledge tasks are selected and presented to the user as ‘Did-you-mean’.
- Both paths were selected because the terms in both matched and the score for both paths is more than 60%.
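Scenarios 4 and 5 differ only in how many candidates stay close to the top score after re-ranking. A rough sketch of that filter is shown below. The name `within_proximity` is hypothetical, and reading "2%" as a margin relative to the top score is an assumption, since the text does not define exactly how the margin is measured.

```python
def within_proximity(scores: dict, proximity_percent: float = 2.0) -> list:
    """Return the candidates whose score is within a margin of the top score."""
    if not scores:
        return []
    top = max(scores.values())
    # Assumption: the margin is taken relative to the top score.
    cutoff = top * (1 - proximity_percent / 100)
    kept = [name for name, score in scores.items() if score >= cutoff]
    # Highest score first; ties are broken by name for a stable order.
    return sorted(kept, key=lambda name: (-scores[name], name))


if __name__ == "__main__":
    # Hypothetical re-ranked scores: two KG paths close together, one far behind.
    print(within_proximity({"KG path 1": 72.0, "KG path 2": 71.2, "FM task": 40.0}))
```

When a single candidate survives the filter, it is the winner, as in Scenario 4. When several survive, they are all offered to the user, as in Scenario 5.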
Scenario 6 – NLP Analysis with No match
- None of the engines could identify any trained intent or Knowledge query
- In this scenario, the default intent will be triggered.