Key Performance Indicators for Generative AI chatbots
Choosing the right Key Performance Indicators (KPIs) for your generative AI chatbot is vital to designing a system that delivers what you need. This article is part of Calibrtr’s guide to chatbots. You can go back to the series intro here, or visit our guides to prompting, evaluating, and choosing the right LLM and tools for your chatbot.
Tailoring chatbot evaluation to tasks
It makes sense to tailor your evaluation to the type of chatbot you are designing. We’ve talked about some of the common types of chatbots in our intro article. Let’s go through them and think about what we want each one to achieve, in terms of both aims and criteria for good performance.
Type | Examples | Aims | Performance criteria |
---|---|---|---|
Customer service and sales | Website queries, general customer service, booking agents, product customisation agents | To deliver successful experiences which please customers and deliver leads or sales | Number of sales, ROI, user experience, brand consistency, reliable delivery of outcomes |
Data analysis and information provision | General data analysis, corporate policy and knowledge banks | To deliver information and analysis which supports business goals | ROI, reliability of information, accuracy and quality of insight, lack of hallucinations |
Training and education | Teaching bots, interview practice bots | To impart or test knowledge in a helpful and engaging way | ROI, level of improvement of knowledge, user satisfaction |
Entertainment | Virtual friends and partners, gaming | To provide entertainment or companionship | User satisfaction, length of engagement, number of return visits, ROI |
Advisors | Tax advisors, financial advisors, legal advisors | To provide advice on specific expert areas, with or without a human in the loop | ROI, reliability of information, regulatory compliance, user satisfaction |
KPIs: general
Each use case needs a specific set of KPIs. Some are relevant to many use cases, and later in the article we’ll talk about more specific ones for the use cases we’ve identified. Here are some of the general KPIs you might want to think about:
- Return on investment. Generative AI systems have recurring costs, from either calling a closed-source LLM (e.g. ChatGPT) or hosting an open-source model (e.g. Qwen, Llama). We go into much more detail about costs here [LINK]. For your chatbot, you need to think about whether the cost of providing the chatbot service is justified because it 1. saves you enough (by replacing a more expensive way of providing the service); 2. generates enough new revenue (e.g. improved sales); or 3. is covered by user payments (e.g. an entertainment bot paid for by subscription).
- Number of users. A lot of users can indicate that your tool is successful. However, in some cases high usage is a bad sign, e.g. a customer service bot receiving a lot of complaints.
- Number of return/engaged users. People coming back to your chatbot can indicate that they have had successful interactions.
- Length of interaction. What counts as good here depends on your use case. For quick website queries, for instance, you want the fewest exchanges needed to deliver the information the client needs. For an entertainment AI, you want as many long interactions as possible.
- Successful interactions and actions. Is your AI delivering a successful outcome for the vast majority of users?
- Actions and integrations. Is it completing actions (e.g. booking a meeting, collecting contact info) successfully? Is it properly handing over information and actions to integrated services?
- Escalations. Needing to escalate isn’t inherently a negative. The question is whether it is escalating when it should. As part of your design you should have a clear idea of when you would like it to escalate and then track whether this is happening.
- Failures. How often is the chatbot failing? Failure might mean being unable to answer a question, respond at all, understand a question, or take an action.
- Latency. A slow response is annoying for users. If responses are slow, check whether the cause is your model choice or your hosting arrangements.
- Website bounce rate and chatbot bounce rate. Visitors reaching your website and not engaging with the chatbot could indicate that something needs changing, or could mean that your website serves most of their needs and your chatbot is not needed in most cases. However, visitors opening the chatbot but leaving after one or two unsuccessful interactions is more concerning. You’ll want to replay those interactions to understand what happened. Get in touch to talk about Calibrtr’s replay tools.
- Customer satisfaction. Measuring this via a survey at the end of the interaction, or feedback buttons after every message, can help you understand how well your AI system is performing.
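Several of the general KPIs above can be computed from a simple log of conversations. The following is a minimal sketch, not a definitive implementation: the `Session` record and all of its field names are assumptions made for illustration, the ROI formula is the simple "net benefit divided by cost" version, and "chatbot bounce" is taken to mean a session of one or two turns that did not succeed. It treats escalation as something to score against a reviewer’s judgement of when a handover was needed.

```python
from dataclasses import dataclass, field
from statistics import mean, quantiles
from typing import Optional


@dataclass
class Session:
    """One chatbot conversation. Every field name here is hypothetical."""
    user_id: str
    turns: int                  # user messages in the session
    succeeded: bool             # did the user get the outcome they wanted?
    escalated: bool             # did the bot hand over to a human?
    should_escalate: bool       # reviewer's view: was a handover needed?
    latency_ms: list[float] = field(default_factory=list)
    csat: Optional[int] = None  # 1-5 survey score, None if unanswered


def roi(benefit: float, cost: float) -> float:
    """Simple ROI: net benefit divided by cost (0.5 means a 50% return)."""
    return (benefit - cost) / cost


def general_kpis(sessions: list[Session]) -> dict[str, float]:
    """Summarise a non-empty list of sessions with at least two latencies."""
    n = len(sessions)
    sessions_per_user: dict[str, int] = {}
    for s in sessions:
        sessions_per_user[s.user_id] = sessions_per_user.get(s.user_id, 0) + 1

    escalated = [s for s in sessions if s.escalated]
    needed = [s for s in sessions if s.should_escalate]
    correct = [s for s in escalated if s.should_escalate]
    latencies = [ms for s in sessions for ms in s.latency_ms]
    scores = [s.csat for s in sessions if s.csat is not None]

    return {
        "users": len(sessions_per_user),
        "return_user_rate": sum(c > 1 for c in sessions_per_user.values())
        / len(sessions_per_user),
        "success_rate": sum(s.succeeded for s in sessions) / n,
        "mean_turns": mean(s.turns for s in sessions),
        # Of the sessions the bot escalated, how many needed it?
        "escalation_precision": len(correct) / len(escalated) if escalated else 0.0,
        # Of the sessions that needed escalation, how many got it?
        "escalation_recall": len(correct) / len(needed) if needed else 0.0,
        # 95th percentile; "inclusive" never extrapolates beyond the slowest reply.
        "p95_latency_ms": quantiles(latencies, n=20, method="inclusive")[-1],
        "chatbot_bounce_rate": sum(s.turns <= 2 and not s.succeeded for s in sessions) / n,
        "mean_csat": mean(scores) if scores else float("nan"),
    }


# Made-up sessions, purely to show the shape of the input.
example_sessions = [
    Session("a", 2, True, False, False, [400, 600]),
    Session("a", 5, True, False, False, [500, 700], csat=5),
    Session("b", 1, False, False, True, [3000]),
    Session("c", 4, False, True, True, [800, 900], csat=3),
]
kpis = general_kpis(example_sessions)
```

Splitting escalation into precision and recall is one way to separate "escalating too eagerly" from "failing to escalate when it should". Whether a high or low `mean_turns` is good still depends on your use case.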
Specific KPIs for different use cases
KPIs: for customer service and sales bots
Customer service bots and sales bots are designed to deliver successful experiences which please customers, and deliver leads or sales. Some of the specific KPIs you might want to think about for this use case are:
- Conversion rate. How many interactions lead to the desired outcome? This might be a sale, an enquiry passed to the relevant department, or a problem solved.
- Brand experience. This is more of a qualitative measure. How well did the interaction match your brand guidelines for tone? Is it creating a positive impression? This can be measured through a survey to get direct user feedback, or through an AI-powered analysis of conversations.
- User satisfaction. This is one where you might want to pop a survey at the end of each interaction to check for satisfaction levels.
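As a rough sketch of the conversion rate KPI, assume each finished conversation has been tagged with a single outcome label. The label names below (`sale`, `enquiry_passed`, `problem_solved`, `abandoned`) are hypothetical, and you would choose your own set of desired outcomes.

```python
from collections import Counter

# Hypothetical labels for the outcomes this bot is meant to achieve.
DESIRED_OUTCOMES = {"sale", "enquiry_passed", "problem_solved"}


def conversion_rate(outcomes: list[str]) -> float:
    """Share of conversations that ended in any desired outcome."""
    if not outcomes:
        return 0.0
    return sum(o in DESIRED_OUTCOMES for o in outcomes) / len(outcomes)


def outcome_breakdown(outcomes: list[str]) -> dict[str, float]:
    """Share of conversations per outcome label, desired or not."""
    counts = Counter(outcomes)
    return {label: count / len(outcomes) for label, count in counts.items()}


# Made-up data: one label per finished conversation.
example_outcomes = [
    "sale", "abandoned", "problem_solved", "abandoned",
    "enquiry_passed", "sale", "abandoned", "abandoned",
]
rate = conversion_rate(example_outcomes)
breakdown = outcome_breakdown(example_outcomes)
```

Looking at the breakdown alongside the single rate shows which kind of outcome is driving the number, which matters when a sale and a solved problem are worth very different amounts to you.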
KPIs: for data analysis and information provision
Data analysis and information provision systems are designed to deliver information and analysis which supports business goals. Some of the specific KPIs you might want to think about for this use case are:
- Reliability of information. Is the chatbot reliably providing correct, unbiased and trustworthy information?
- Quality of insight. Is the chatbot system providing valuable and high-quality insight?
- Hallucinations. Is the chatbot system hallucinating, i.e. stating things that are not supported by your data or sources, and if so, how often?
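One way to put a number on hallucinations is to have reviewers check a sample of responses and count how many contain an unsupported claim. The sketch below assumes you have such a reviewed sample. Because a sample only estimates the true rate, it also reports a Wilson score interval, which is one common choice for a proportion and is a suggestion here rather than something this guide prescribes. The function names and example counts are illustrative.

```python
from math import sqrt


def hallucination_rate(flagged: int, reviewed: int) -> float:
    """Share of reviewed responses that reviewers flagged as hallucinated."""
    if reviewed == 0:
        raise ValueError("no responses were reviewed")
    return flagged / reviewed


def wilson_interval(flagged: int, reviewed: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = hallucination_rate(flagged, reviewed)
    denom = 1 + z * z / reviewed
    centre = (p + z * z / (2 * reviewed)) / denom
    half_width = z * sqrt(p * (1 - p) / reviewed + z * z / (4 * reviewed**2)) / denom
    return centre - half_width, centre + half_width


# Made-up review: 6 of 200 sampled responses were flagged.
rate = hallucination_rate(6, 200)
low, high = wilson_interval(6, 200)
```

A wide interval is a sign that the reviewed sample is too small to tell whether a change to your prompt or model has really moved the rate.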
KPIs: for training and education
Training and education systems are designed to impart or test knowledge in a helpful and engaging way. Some of the specific KPIs you might want to think about for this use case are:
- Completion. Did users complete the whole course or exercise? Did they take many breaks?
- Progress. In your system design, you can think about how to assess progress. For example, a test administered at the start and end of the course can show how much user understanding has increased.
- User satisfaction. Again, a user satisfaction survey is good for this type of chatbot to get a more qualitative assessment of user experience.
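The completion and progress KPIs can be sketched as follows, assuming each learner has a test score from before and after the course. The `Learner` record is hypothetical. For progress, the sketch uses the normalised gain, i.e. the improvement as a share of the improvement that was still possible, which is one common way to compare learners who started at different levels. A plain difference between the two scores would also work.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Learner:
    """One user's record. Field names are hypothetical."""
    completed: bool    # did they finish the whole course or exercise?
    pre_score: float   # test score before the course
    post_score: float  # test score after the course


def completion_rate(learners: list[Learner]) -> float:
    """Share of learners who finished (expects a non-empty list)."""
    return sum(l.completed for l in learners) / len(learners)


def normalised_gain(pre: float, post: float, max_score: float = 100.0) -> float:
    """Improvement as a share of the headroom left above the pre-test score."""
    if pre >= max_score:
        return 0.0  # no room left to improve
    return (post - pre) / (max_score - pre)


# Made-up learners, scores out of 100.
example_learners = [
    Learner(True, 40, 70),
    Learner(True, 60, 80),
    Learner(False, 50, 50),
]
finished = completion_rate(example_learners)
mean_gain = mean(
    normalised_gain(l.pre_score, l.post_score)
    for l in example_learners
    if l.completed
)
```

Here the gain is averaged over learners who completed the course. Including those who dropped out would give a more conservative figure.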
KPIs: for entertainment
Entertainment AI systems are designed to entertain, or to provide companionship. Some of the specific KPIs you might want to think about for this use case are:
- Average daily sessions. Ideally you’ll see an increase in this KPI.
- Duration of sessions. Again, for entertainment use cases you will probably want to see this increase.
- Referrals. Referrals are a good indicator that an entertainment system is popular and providing a good service.
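Average daily sessions and session duration can both be derived from session start and end times. This is a minimal sketch assuming you log a `(start, end)` pair of timestamps for each session. The data is made up, and the daily average only counts days that had at least one session.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical log format: (session start, session end).
SessionTimes = tuple[datetime, datetime]


def average_daily_sessions(sessions: list[SessionTimes]) -> float:
    """Mean sessions per day, over days that had at least one session."""
    per_day = Counter(start.date() for start, _ in sessions)
    return len(sessions) / len(per_day)


def mean_session_minutes(sessions: list[SessionTimes]) -> float:
    """Mean session duration in minutes."""
    return mean((end - start).total_seconds() / 60 for start, end in sessions)


# Made-up sessions across two days.
example_times = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 20)),
    (datetime(2024, 5, 1, 21, 0), datetime(2024, 5, 1, 21, 40)),
    (datetime(2024, 5, 2, 20, 0), datetime(2024, 5, 2, 20, 30)),
]
daily = average_daily_sessions(example_times)
minutes = mean_session_minutes(example_times)
```

To see whether either KPI is increasing, you would compute these per week or per month and compare periods rather than look at a single overall figure.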
KPIs: for expert advisory AI systems
Advisory AI systems are designed to provide advice on specific expert areas, with or without a human in the loop (depending on the legal and regulatory requirements for each expert area). Some of the specific KPIs you might want to think about for this use case are:
- Accuracy of advice. In a situation like this you are likely to have an approval process where responses are signed off by a human (or possibly an expert AI system) before going to a user. How many of these responses are inaccurate?
- Engagement and additional areas. High engagement in a conversation is a positive sign, as is the user asking for advice on more areas: both suggest that the user finds the advice useful.
- Conversion or sales. Does the conversation lead to one of your desired outcomes, e.g. a warm lead for your human sales team, an additional service being added to an account, or a request for an appointment?
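If responses go through an approval step, the reviewer’s decisions give you the data for the accuracy KPI. The sketch below assumes each review is recorded as an advice area plus whether the reviewer judged the draft accurate. The record shape and area names are hypothetical.

```python
from collections import defaultdict

# Hypothetical review record: (advice area, reviewer judged it accurate?).
Review = tuple[str, bool]


def inaccuracy_by_area(reviews: list[Review]) -> dict[str, float]:
    """Share of reviewed drafts judged inaccurate, per advice area."""
    totals: dict[str, int] = defaultdict(int)
    wrong: dict[str, int] = defaultdict(int)
    for area, accurate in reviews:
        totals[area] += 1
        if not accurate:
            wrong[area] += 1
    return {area: wrong[area] / totals[area] for area in totals}


def overall_inaccuracy(reviews: list[Review]) -> float:
    """Share of all reviewed drafts judged inaccurate (non-empty list)."""
    return sum(not accurate for _, accurate in reviews) / len(reviews)


# Made-up sign-off decisions.
example_reviews = [
    ("tax", True), ("tax", False), ("legal", True),
    ("tax", True), ("legal", True),
]
by_area = inaccuracy_by_area(example_reviews)
overall = overall_inaccuracy(example_reviews)
```

Breaking the rate down by area helps show whether inaccurate responses are spread evenly or concentrated in one expert area that needs a better prompt, better source material, or closer human review.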
Conclusion
These are some of the ways that you might want to measure your chatbot’s performance. The most important factor is having a really clear idea of what you want your AI to achieve, and then building your KPIs and other assessments from that.
Get in touch!
Calibrtr offers a generative AI cost management and performance review platform, with tools to forecast and manage costs, build and evaluate prompts, experiment with different models, A/B test prompts, monitor performance, and build in human-in-the-loop or human-out-of-the-loop approvals and reviews. We’re currently in beta, so get in touch with us to find out more!