Key Performance Indicators for Generative AI chatbots

Choosing the right Key Performance Indicators (KPIs) for your generative AI chatbot is vital to designing a system that really delivers for you. This article is part of Calibrtr’s guide to chatbots. You can go back to the series intro here or visit our guides to prompting, evaluating, and choosing the right LLM and tools for your chatbot.

Tailoring chatbot evaluation to tasks

It makes sense to tailor your evaluation to the type of chatbot you are designing. We talked about some of the common types of chatbots in our intro article. Here we set out, for each type, what we want it to achieve and the criteria for good performance.

Type	Examples	Aims	Performance criteria
Customer service and sales	Website queries, general customer service, booking agents, product customisation agents	To deliver successful experiences which please customers and deliver leads or sales	Number of sales, ROI, user experience, brand consistency, reliable delivery of outcomes
Data analysis and information provision	General data analysis, corporate policy and knowledge banks	To deliver information and analysis which supports business goals	ROI, reliability of information, accuracy and quality of insight, lack of hallucinations
Training and education	Teaching bots, interview practice bots	To impart or test knowledge in a helpful and engaging way	ROI, level of improvement of knowledge, user satisfaction
Entertainment	Virtual friends and partners, gaming	To provide entertainment or companionship	User satisfaction, length of engagement, number of return visits, ROI
Advisors	Tax advisors, financial advisors, legal advisors	To provide advice on specific expert areas, with or without a human in the loop	ROI, reliability of information, regulatory compliance, user satisfaction

KPIs: general

Each use case needs a specific set of KPIs. Some are relevant to many use cases, and later in the article we’ll talk about more specific ones for each of the use cases identified above. Here are some of the general KPIs you might want to think about:

Return on investment. Generative AI systems have recurring costs from either calling a closed-source LLM (e.g. ChatGPT) or hosting an open-source model (e.g. Qwen, Llama). We go into much more detail about costs here [LINK]. For your chatbot, you need to think about whether the cost of providing the service is justified in one of three ways: 1. it saves you enough (by replacing a more expensive way of providing the service); 2. it generates enough new revenue (e.g. improved sales); or 3. it is covered by user payments (e.g. an entertainment bot paid for by subscription).
Number of users. A lot of users can indicate that your tool is successful. However, in some cases a high number might be a negative, e.g. a customer service bot getting a lot of complaints.
Number of return/engaged users. People coming back to your chatbot can indicate that they have had successful interactions.
Length of interaction. What counts as good here depends on your use case. For quick website queries, for instance, you want the fewest interactions necessary to deliver the information the client needs. For an entertainment AI, you want as many long interactions as possible.
Successful interactions and actions. Is your AI delivering a successful outcome for the vast majority of users?
Actions and integrations. Is it completing actions (e.g. booking a meeting, collecting contact info) successfully? Is it properly handing over information and actions to integrated services?
Escalations. Needing to escalate isn’t inherently a negative. The question is whether the chatbot is escalating when it should. As part of your design you should have a clear idea of when you would like it to escalate, and then track whether this is happening.
Failures. How often is the chatbot failing? Failure might mean being unable to answer a question, unable to respond, unable to understand a question or unable to take an action.
Latency. A slow response is annoying for users. Is there an issue with your model choice, or your hosting arrangements?
Website bounce rate and chatbot bounce rate. Visitors reaching your website and not engaging with the chatbot could indicate that something needs changing, or could mean that your website serves most of their needs and your chatbot is not needed in most cases. However, visitors opening the chatbot but leaving after one or two unsuccessful interactions is more concerning. You’ll want to replay those interactions to try to understand what happened. Get in touch to talk about Calibrtr's replay tools.
Customer satisfaction. Measuring this via a survey at the end of the interaction, or feedback buttons after every message, can help you understand how well your AI system is performing.
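
As a rough illustration, several of the general KPIs above can be computed from a log of conversations. The sketch below is in Python and assumes a hypothetical log format: the Conversation fields, the roi helper and the choice of the 95th percentile for latency are all illustrative assumptions, not part of any particular tool. Your own logging will look different.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Conversation:
    """One logged chatbot conversation (hypothetical schema)."""
    succeeded: bool            # did the user get the outcome they wanted?
    escalated: bool            # was the conversation handed to a human?
    should_escalate: bool      # reviewer's judgement: should it have been?
    latencies_ms: list[float]  # response time of each bot message


def roi(benefit: float, cost: float) -> float:
    """Return on investment as a fraction: (benefit - cost) / cost."""
    return (benefit - cost) / cost


def general_kpis(conversations: list[Conversation]) -> dict[str, float]:
    """Summarise a non-empty list of conversations into a few headline rates."""
    n = len(conversations)
    all_latencies = [ms for c in conversations for ms in c.latencies_ms]
    # An escalation decision is "correct" when it matches the reviewer's view.
    correct = sum(c.escalated == c.should_escalate for c in conversations)
    return {
        "success_rate": sum(c.succeeded for c in conversations) / n,
        "failure_rate": sum(not c.succeeded for c in conversations) / n,
        "escalation_rate": sum(c.escalated for c in conversations) / n,
        "escalation_accuracy": correct / n,
        # Last of the 19 cut points that split the data into 20 equal groups.
        "p95_latency_ms": quantiles(all_latencies, n=20, method="inclusive")[-1],
    }
```

For example, roi(150.0, 100.0) gives 0.5, i.e. a 50% return. Tracking a high percentile of latency rather than the average is a design choice: it shows the slow responses that annoy users, which an average can hide.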

Specific KPIs for different use cases

KPIs: for customer service and sales bots

Customer service bots and sales bots are designed to deliver successful experiences which please customers, and deliver leads or sales. Some of the specific KPIs you might want to think about for this use case are:

Conversion rate. How many interactions lead to the desired outcome? This might be a sale, an enquiry passed to the relevant department, or a problem solved.
Brand experience. This is more of a qualitative measure. How well did the interaction match your brand guidelines for tone? Is it creating a positive impression? This can be measured through a survey to get direct user feedback, or through an AI powered analysis of conversations.
User satisfaction. This is one where you might want to pop a survey at the end of each interaction to check for satisfaction levels.
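
Conversion rate and user satisfaction both reduce to simple ratios. The following Python sketch shows one way to compute them; the function names are hypothetical, and treating scores of 4 or 5 on a five-point survey as "satisfied" is an assumed convention you may want to change.

```python
def conversion_rate(conversions: int, interactions: int) -> float:
    """Share of interactions that ended in the desired outcome."""
    if interactions == 0:
        return 0.0  # no traffic yet, so report zero rather than divide by zero
    return conversions / interactions


def csat(scores: list[int], satisfied_from: int = 4) -> float:
    """Share of survey scores at or above the 'satisfied' threshold."""
    if not scores:
        return 0.0
    return sum(score >= satisfied_from for score in scores) / len(scores)


# Example: 30 sales from 200 chatbot interactions.
print(f"conversion rate: {conversion_rate(30, 200):.1%}")
```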

KPIs: for data analysis and information provision

Data analysis and information provision systems are designed to deliver information and analysis which supports business goals. Some of the specific KPIs you might want to think about for this use case are:

Reliability of information. Is the chatbot reliably providing correct, unbiased and trustworthy information?
Quality of insight. Is the chatbot system providing valuable and high quality insight?
Hallucinations. Is the chatbot system hallucinating?
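
One way to put a number on hallucinations is to have reviewers check a random sample of answers and count how many contain invented information. Because a sample only estimates the true rate, it helps to report a range alongside the headline figure. The Python sketch below uses the Wilson score interval for that purpose; the sampling approach and the function name are assumptions for illustration, not a prescribed method.

```python
from math import sqrt


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion k/n (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one reviewed answer")
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# Example: reviewers flagged 6 hallucinations in a sample of 200 answers.
low, high = wilson_interval(6, 200)
print(f"hallucination rate: {6 / 200:.1%} (95% interval {low:.1%} to {high:.1%})")
```

The same calculation works for reliability of information, by counting answers that reviewers marked as correct instead.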

KPIs: for training and education

Training and education systems are designed to impart or test knowledge in a helpful and engaging way. Some of the specific KPIs you might want to think about for this use case are:

Completion. Did users complete the whole course or exercise? Were there many breaks taken?
Progress. In your system design, you can think about how to assess progress. For example, a test administered at the start and end of the course can show how much user understanding has increased.
User satisfaction. Again, a user satisfaction survey is good for this type of chatbot to get a more qualitative assessment of user experience.
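
Completion and progress can be sketched as follows, assuming each learner takes a test before and after the course. The Learner record and its field names are hypothetical, and measuring progress only for learners who completed the course is an assumption made here for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Learner:
    """One learner's results (hypothetical schema, scores in percent)."""
    pre_score: float   # test score before the course
    post_score: float  # test score after the course
    completed: bool    # did they finish the whole course?


def completion_rate(learners: list[Learner]) -> float:
    """Share of learners who finished the course."""
    return sum(l.completed for l in learners) / len(learners)


def mean_gain(learners: list[Learner]) -> float:
    """Average score improvement among learners who completed the course."""
    finished = [l for l in learners if l.completed]
    return sum(l.post_score - l.pre_score for l in finished) / len(finished)
```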

KPIs: for entertainment

Entertainment AI systems are designed to entertain, or to provide companionship. Some of the specific KPIs you might want to think about for this use case are:

Average daily sessions. Ideally you’ll see an increase in this KPI.
Duration of sessions. Again, for entertainment use-cases you will probably want to see an increase in this.
Referrals. Referrals are a good indicator that an entertainment system is popular and providing a good service.
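
Average daily sessions and session duration can be derived from session start and end times. The Python sketch below assumes sessions are available as (start, end) pairs, which is a hypothetical format. Note that it averages over days with at least one session, so quiet days with no sessions at all are not counted; that is a simplification you may want to revisit.

```python
from collections import Counter
from datetime import datetime, timedelta


def session_kpis(sessions: list[tuple[datetime, datetime]]) -> dict[str, float]:
    """Summarise a non-empty list of (start, end) session times."""
    # Count sessions per calendar day, keyed by the day the session started.
    per_day = Counter(start.date() for start, _ in sessions)
    total = sum((end - start for start, end in sessions), timedelta())
    return {
        "avg_daily_sessions": len(sessions) / len(per_day),
        "avg_session_minutes": total.total_seconds() / 60 / len(sessions),
    }
```

Comparing these figures week on week shows whether engagement is rising, which is the trend the KPIs above are looking for.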

KPIs: for expert advisory AI systems

Advisory AI systems are designed to provide advice on specific expert areas, with or without a human in the loop (depending on the legal and regulatory requirements for each expert area). Some of the specific KPIs you might want to think about for this use case are:

Accuracy of advice. In a situation like this you are likely to have an approval process where responses are signed off by a human (or possibly an expert AI system) before going to a user. How many of these responses are inaccurate?
Engagement and additional areas. A good conversation with a high level of engagement is a positive, as is the user asking for advice on more areas, as this is likely to signify that the user feels that the advice is useful.
Conversion or sales. Does the conversation lead to one of your desired outcomes, e.g. a warm lead for your human sales team, an additional service being added to an account, or a request for an appointment?

Conclusion

These are some of the ways that you might want to measure your chatbot’s performance. The most important factor is having a really clear idea of what you want your AI to achieve, and then building your KPIs and other assessments from that.

Get in touch!

Calibrtr offers a generative AI cost management and performance review platform, with tools to forecast and manage costs, build and evaluate prompts, experiment with different models, A/B test prompts, monitor performance, and build in human-in-the-loop or human-out-of-the-loop approvals and reviews. We’re currently in beta, so get in touch with us to find out more!

Our limited beta program is open

If you'd like to apply to join our beta program, please give us a contact email and a brief description of what you're hoping to achieve with Calibrtr.

Frequently Asked Questions

What are Key Performance Indicators (KPIs) for generative AI chatbots?

KPIs for generative AI chatbots are metrics used to evaluate their performance based on various criteria, such as user engagement, successful interactions, and return on investment. These KPIs help in understanding how well the chatbot is meeting its intended goals.

Why is it important to choose the right KPIs for a generative AI chatbot?

Choosing the right KPIs is vital because it ensures the chatbot is evaluated based on metrics that align with its intended purpose and goals. This helps in optimizing the chatbot's performance and ensuring it delivers the desired outcomes.

What are some general KPIs relevant for many use cases?

General KPIs include return on investment (ROI), number of users, number of return/engaged users, length of interaction, successful interactions and actions, actions and integrations, escalations, failures, latency, website bounce rate and chatbot bounce rate, and customer satisfaction.

How do KPIs differ for customer service and sales bots?

For customer service and sales bots, specific KPIs include conversion rate, brand experience, and user satisfaction. These metrics focus on delivering successful customer experiences, generating leads or sales, and maintaining brand consistency.

What KPIs are important for data analysis and information provision bots?

Important KPIs for data analysis and information provision bots include reliability of information, quality of insight, and absence of hallucinations. These metrics ensure that the chatbot provides accurate, valuable, and trustworthy information.

What KPIs should be considered for training and education bots?

For training and education bots, KPIs include completion rate, progress in knowledge, and user satisfaction. These metrics assess how well the chatbot is imparting knowledge and engaging users in the learning process.

What are the key KPIs for entertainment AI systems?

Key KPIs for entertainment AI systems include average daily sessions, duration of sessions, and referrals. These metrics measure user engagement, satisfaction, and the popularity of the entertainment provided by the AI.

What KPIs are relevant for expert advisory AI systems?

Relevant KPIs for expert advisory AI systems include accuracy of advice, user engagement and additional areas of inquiry, and conversion or sales. These metrics ensure that the chatbot provides reliable advice and generates valuable interactions and leads.

How can I ensure my chatbot's performance is aligned with my business goals?

To ensure your chatbot's performance is aligned with your business goals, have a clear idea of what you want your AI to achieve. Then, build your KPIs and other assessments based on these goals to continuously monitor and optimize performance.

What services does Calibrtr offer for managing and evaluating AI chatbots?

Calibrtr offers a generative AI cost management and performance review platform. It includes tools to forecast and manage costs, build and evaluate prompts, experiment with different models, conduct A/B testing of prompts, monitor performance, and integrate human-in-the-loop or human-out-of-the-loop approvals and reviews.