Once you have built and trained your bot, the most important question is: how good is your bot’s NLP model? Evaluating your bot’s performance shows how well it understands user utterances. Conversational bots need to be trained continually for better NLP and tested against user utterances. After every round of training, we recommend checking whether the training was appropriate and whether it has improved or degraded the NLP model.
Kore.ai Bots Platform provides the following tools to test the bot NLP:
- Machine Learning Model Analysis
- Batch testing
- NLP detailed Analysis
Machine Learning Model Analysis
The Machine Learning Model (ML Model) graphically represents the performance of your trained utterances against the intent. The ML Model graph evaluates all the training utterances against each bot task and plots them into one of the quadrants for an intent:
- True Positive: Utterance matching the right intent.
- True Negative: Utterances not trained for an intent are not identifying that intent, as expected.
- False Positive: Utterances not trained for an intent are incorrectly identifying that intent.
- False Negative: Utterances trained for an intent are not identifying that intent.
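The four quadrants above can be sketched as a small classification function. This is only an illustration, not the platform's implementation: the function name `quadrant` is hypothetical, and it assumes that a positive ML score means the model identified the intent for that utterance.

```python
def quadrant(trained_for_intent: bool, score: float) -> str:
    """Place one utterance in a quadrant of a single intent's graph.

    Assumption: a score above zero means the ML model identified
    this intent for the utterance; zero or below means it did not.
    """
    identified = score > 0
    if trained_for_intent:
        # Trained for this intent: identified is correct, missed is an error.
        return "True Positive" if identified else "False Negative"
    # Not trained for this intent: identified is an error, missed is correct.
    return "False Positive" if identified else "True Negative"


if __name__ == "__main__":
    print(quadrant(True, 0.82))    # trained for the intent and identified
    print(quadrant(False, 0.15))   # not trained for the intent but identified
```

An utterance is evaluated once per intent, so the same utterance can land in a different quadrant on each intent's graph.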
A quick look at the graph and you know which utterance-intent matches are accurate and which of them can be further trained to produce better results.
Consider the following key points when referring to the graph:
- The higher up an utterance is in the True quadrants, the better it exhibits the expected behavior.
- Utterances at a moderate level in the True quadrants can be further trained for better scores.
- The utterances falling into the False quadrants need immediate attention.
- Utterances that fall into the true quadrants of multiple bot tasks denote overlapping bot tasks that need to be fixed.
Understanding a Good/Bad ML model
Let us consider a banking bot as an example for understanding a good or bad ML Model. The bot has multiple tasks with more than 300 trained utterances. The image below depicts 4 tasks and the associated utterances.
The model in this scenario is fairly well trained: most of the utterances pertaining to a task are concentrated in the True Positive quadrant, and most of the utterances for other tasks are in the True Negative quadrant.
The developer can work to improve the following aspects of this model:
- In the ML model for task ‘Get Account Balance’ we can see a few utterances (B) in the False Positive quadrant.
- An utterance trained for ‘Get Account Balance’ appears in the True Negative quadrant (C).
- Though the model is well trained and most of the utterances for this task are higher up in the True Positive quadrant, some of the utterances still have a very low score. (A)
- When you hover over the dots, you can see the utterance. For A, B, and C, though the utterance should have exactly matched the intent, it has a low or negative score as a similar utterance has been trained for another intent.
Note: In such cases, it’s best to try the utterance with the Test & Train module and check the intents that the ML engine returns along with the associated sample utterances. Fine-tune the utterances and try again.
- The task named Report Loss of Card contains limited utterances which are concentrated together.
Let us now compare it with the ML Model for the Travel Bot below:
The model is trained with a lot of conflicting utterances, resulting in a scattered view of utterances. This is considered a bad model and needs to be retrained with a smaller set of utterances that do not relate to multiple tasks in the bot.
Batch Testing
Batch Testing helps you discern the ability of the bot to correctly identify the expected intents for a given set of utterances. This involves execution of a series of tests to get a detailed statistical analysis and gauge the performance of your bot’s NLP model. To conduct a batch test, you can use predefined test suites available in the builder or create your own custom test suites.
Running a test suite is an asynchronous process and may take time. Once the testing is initiated, the report will be created with all the test results and displayed against the test-suite. The important elements that the developer needs to note in the report are:
- Success %: the percentage of test utterances for which the intent was recognized correctly.
- Precision: the number of correctly classified utterances divided by the total number of utterances that were classified (correctly or incorrectly) to an existing task.
- Recall: the number of correctly classified utterances divided by the total number of utterances that were either classified correctly to an existing task or classified incorrectly as matching no existing task.
- F1 Score: the harmonic mean of Precision and Recall.
It is important to periodically run the test suite against the successful user utterances to ensure that changes to the model are not adversely affecting it.
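The Precision, Recall, and F1 definitions above can be worked through in a few lines. This is a minimal sketch, not the platform's report code: the function name `batch_metrics` and the count names are hypothetical, and F1 is computed as the harmonic mean of Precision and Recall, which is the standard definition.

```python
def batch_metrics(correct: int, wrong_task: int, missed: int) -> dict:
    """Compute precision, recall, and F1 from batch-test counts.

    correct    -- utterances classified to the expected task
    wrong_task -- utterances classified to an existing but unexpected task
    missed     -- utterances that should have matched a task but matched none
    (All three names are illustrative assumptions.)
    """
    classified = correct + wrong_task
    precision = correct / classified if classified else 0.0
    relevant = correct + missed
    recall = correct / relevant if relevant else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    for name, value in batch_metrics(correct=80, wrong_task=10, missed=20).items():
        print(f"{name}: {value:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 Score by maximizing only Precision or only Recall.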
NLP detailed Analysis
The Kore.ai Bot Builder tool follows a white-box approach to NLP analysis. Unlike other platforms, it gives the developer a detailed analysis, consisting of the scoring patterns from all the engines, so that the developer can identify, analyze, and then modify the training to better associate the user utterance with an intent.
You can use the built-in Test & Train module to test any utterance against the bot’s trained NLP model and view the detailed analysis. You can also pick an utterance from the Analyze section to view its NLP analysis. The NLP analysis consists of results from all three engines, each of which responds with definitive and possible matches of an intent against the user utterance.
The winner is identified based on the below rules:
- If only one engine returns a definitive match, it is marked as a clear winner and identified as the intent.
- If more than one engine returns a definitive match, then one of the following happens:
  - If a unique intent is identified across multiple engines with no other definitive intent, then that intent is marked as a clear winner.
  - If more than one intent is identified as definitive, then all the identified intents are presented to the user as the matched intents and the user can select the required intent.
- If no intent is identified as definitive and there are one or more probable intents, then the intents are sent for further scoring to the ‘Ranking & Resolver’ engine.
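The winner-selection rules above can be sketched as a single function. This is an illustration under stated assumptions, not the platform's code: the function name `resolve`, the outcome labels, and the input shape (a mapping from engine name to the intents it returned) are all hypothetical. The case where a single engine returns several definitive intents is not covered by the rules, so this sketch treats it as ambiguous.

```python
from typing import Dict, List, Tuple


def resolve(definitive: Dict[str, List[str]],
            probable: Dict[str, List[str]]) -> Tuple[str, List[str]]:
    """Apply the winner rules to per-engine definitive and probable matches."""
    unique_definitive = sorted(
        {intent for intents in definitive.values() for intent in intents}
    )
    if len(unique_definitive) == 1:
        # One engine matched, or several engines agree on the same intent.
        return ("winner", unique_definitive)
    if len(unique_definitive) > 1:
        # Several definitive intents: the user chooses among them.
        return ("ambiguous", unique_definitive)
    unique_probable = sorted(
        {intent for intents in probable.values() for intent in intents}
    )
    if unique_probable:
        # No definitive match: hand the probables over for rescoring.
        return ("rank_and_resolve", unique_probable)
    return ("no_match", [])


if __name__ == "__main__":
    print(resolve({"ml": ["Get Balance"], "fm": ["Get Balance"]}, {}))
    print(resolve({"ml": ["Get Balance"], "fm": ["Transfer Funds"]}, {}))
    print(resolve({}, {"ml": ["Get Balance"], "kg": ["Report Loss of Card"]}))
```

Collecting the definitive intents into a set covers both "clear winner" cases with one check, because a single engine's match and an agreement across engines both leave exactly one unique intent.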
The Ranking & Resolver engine rescores the qualified intents based on the Fundamental Meaning model and presents the results to the user as the probable options. It presents only the top 10 percentile of the results after rescoring.
Each engine uses different rules and thresholds to return definitive and probable matches.
- Machine Learning Engine tries to identify a single top matched task and returns all the intents scored against the user utterance with a positive or negative score. Ideally, it returns one positively scored intent and all the other intents with a negative score. It also evaluates the intent based on cosine similarity. If this score is above 95%, the intent is marked as a definitive match. If there is no definitive match, all the intents with a positive score are termed probable matches.
- Fundamental Meaning Model scores the user utterance against all the bot intents based on words matched, word coverage, and the number of exact words versus synonyms, and applies bonuses (for sentence structure, word position, etc.) and penalties to arrive at a score for each intent. To qualify an intent, at least 60% of the words in the user utterance should match the intent name.
- Knowledge Graph Engine uses a dual approach of bot ontology and TF-IDF based scoring. The words in the user utterance are matched against the terms in the knowledge graph, and anything with more than 50% of the terms matching is considered qualified.
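The thresholds described above can be illustrated with a simplified sketch. None of this is the engines' actual logic: all function and constant names are hypothetical, the Fundamental Meaning check ignores synonyms, bonuses, and penalties, and the Knowledge Graph check assumes the 50% applies to the share of a path's terms found in the utterance.

```python
import math
from typing import Iterable, Sequence

ML_DEFINITIVE = 0.95    # cosine similarity above this is a definitive match
FM_MIN_COVERAGE = 0.60  # share of utterance words that must match the intent name
KG_MIN_TERMS = 0.50     # share of knowledge graph terms that must be exceeded


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ml_match(similarity: float, score: float) -> str:
    """Label an ML result as definitive, probable, or none."""
    if similarity > ML_DEFINITIVE:
        return "definitive"
    return "probable" if score > 0 else "none"


def fm_qualifies(utterance: str, intent_name: str) -> bool:
    """Exact-word coverage only; a stand-in for the real scoring."""
    words = utterance.lower().split()
    intent_words = set(intent_name.lower().split())
    if not words:
        return False
    matched = sum(1 for word in words if word in intent_words)
    return matched / len(words) >= FM_MIN_COVERAGE


def kg_qualifies(utterance_words: Iterable[str], path_terms: Sequence[str]) -> bool:
    """True when more than half of the path's terms occur in the utterance."""
    present = {word.lower() for word in utterance_words}
    if not path_terms:
        return False
    matched = sum(1 for term in path_terms if term.lower() in present)
    return matched / len(path_terms) > KG_MIN_TERMS


if __name__ == "__main__":
    print(ml_match(similarity=0.97, score=0.4))
    print(fm_qualifies("get my account balance", "Get Account Balance"))
    print(kg_qualifies(["reset", "password"], ["reset", "password", "online"]))
```

Keeping the three thresholds as named constants makes it easy to see that each engine qualifies matches on a different scale, which is why their results are reconciled afterwards rather than compared directly.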