Evaluating Chatbots
So, you’ve built a chatbot and want to make sure it’s performing well? You’ve come to the right place. This article is part of Calibrtr’s guide to chatbots. You can go back to the series Intro here, or visit our guides to building prompts, designing KPIs and choosing the right LLM and tools for your chatbot.
At Calibrtr we believe that AI has huge potential for designing brilliant products and boosting productivity. But to reach that potential we need the right tools and techniques to make the most of it.
Evaluation vs guardrails
Guardrails are a much broader topic than evaluation: they are the rules, tools and other controls on how an LLM behaves, covering everything from how the model was trained in the first place to how it works in production.
Depending on the model you choose, there will be guardrails already built into the system by the model’s developer. The question then becomes whether those guardrails are sufficient - do they do enough to prevent hallucinations, incorrect answers and offensive content? If they don’t, you may need to find a different model or more specialised tools.
If they do, your evaluation can then focus on how that model performs as prompted by you and in your use-case. And that’s what we look at in this article.
What should I evaluate?
In our guide to writing prompts, we talked about evaluating how well your chatbot follows the instructions you’ve given it in the prompt vs how well it performs in the real world. To recap, some of the questions you might want to ask are:
Does it do what you want?
- Do your instructions work as you intend?
- Is it only talking about the subjects you want it to?
- Can you trick the AI into revealing more information than it should, or making up information?
- Is the AI’s tone right for your brand?
- Do functions like sending info to your CRM system actually work?
Does it perform in the real world?
- Is the AI making clients happy?
- Is it making conversations with clients more efficient - getting more leads, saving money on customer service?
- Is it having a positive or negative effect on your brand experience?
We give lots more examples of performance indicators that you could use for your chatbot in our Key Performance Indicators for Chatbots article.
Where in the loop should I evaluate?
Once you know what you want to evaluate, you need to think about where in the process you want to evaluate. The key evaluation points for a chatbot are:
A. In the prompt design stage. You want to make sure that your prompt does what you intend it to do (see our guide to writing prompts).
B. Before the response is sent to a client. This sort of checkpoint stops the AI from responding to a query instantly, but depending on your industry you may need a human-in-the-loop sign-off before the response goes to a client (see the sketch after this list).
C. After the response has been sent. You’ll often see this with ChatGPT’s web interface- after each response, you have the option to evaluate it, or report offensive or inaccurate information.
D. At the end of the conversation. This can be purely quantitative - how many messages, what latency - or you can gather user feedback for more qualitative evaluation.
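To make point B more concrete, here is a minimal sketch of a pre-send checkpoint in Python. It assumes a hypothetical generate_reply() function wrapping your LLM call and a send_to_client() callback for delivering approved messages; a real implementation would sit behind a review dashboard rather than a console prompt.

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class PendingReply:
    conversation_id: str
    draft: str

# Drafts wait here until a human reviewer approves or rejects them.
review_queue: "Queue[PendingReply]" = Queue()

def handle_message(conversation_id: str, user_message: str) -> None:
    draft = generate_reply(user_message)  # hypothetical LLM call
    review_queue.put(PendingReply(conversation_id, draft))
    # Nothing is sent yet - the reply only goes out after human sign-off.

def review_next(send_to_client) -> None:
    pending = review_queue.get()
    decision = input(f"Draft reply:\n{pending.draft}\nApprove? [y/n] ")
    if decision.strip().lower() == "y":
        send_to_client(pending.conversation_id, pending.draft)
    # Rejected drafts could be logged and used to re-calibrate the prompt.
```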
Who should be evaluating my chatbot’s performance?
Once you know what and where you want to evaluate, the next question is who. AI is all about saving time and money, right? Shouldn’t AI be the fastest, cheapest and best way of evaluating your prompts? Yes, and no. There are areas where AI is better suited to evaluation than humans, and areas where humans have the edge.
What sorts of evaluation are computers good at?
- Anything where you want thousands of variables tested quickly. For instance, if you wanted to pose thousands of different questions to a website query bot to make sure it consistently answers correctly, a computer - either a programme or an AI - would be your best bet (see the sketch after this list).
- The same goes for things like prompt compression tools which can very quickly analyse how a prompt can be reduced in size without losing meaning.
- Assessing prompts in multiple languages is also something an AI would be better at than most humans.
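As a concrete illustration of the bulk-testing idea above, here is a minimal sketch in Python. The ask_bot() helper is a hypothetical stand-in for whatever call sends a question to your chatbot and returns its answer, and the simple must_contain check could be swapped for an LLM-as-judge comparison.

```python
# Hypothetical test cases - in practice you would load thousands from a file.
test_cases = [
    {"question": "What are your opening hours?", "must_contain": "9am"},
    {"question": "Do you ship to France?", "must_contain": "yes"},
]

def run_suite(cases) -> float:
    """Run every test question through the bot and return the pass rate."""
    passed = 0
    for case in cases:
        answer = ask_bot(case["question"])  # hypothetical chatbot call
        if case["must_contain"].lower() in answer.lower():
            passed += 1
        else:
            print(f"FAIL: {case['question']!r} -> {answer!r}")
    return passed / len(cases)

print(f"Pass rate: {run_suite(test_cases):.1%}")
```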
What sorts of evaluation are humans good at?
- Humans are needed in the loop for any use-case where there’s a regulatory or legal requirement. For instance, in many countries, there’s a legal requirement for medical or legal advice to be assessed by a doctor or lawyer before it goes to the client. Calibrtr offers a tool to build in just this sort of checkpoint into your product [LINK].
- Situations where human insight or a more flexible set of actions is needed. For instance, you might design in a ‘speak to a manager’ feature, where a conversation can be escalated to a human if your AI is unable to resolve an issue.
- Situations where you need user feedback on the experience - for example, a simple post-session feedback form which asks the user to rate the experience on a scale of 1-5 and gives space for written feedback (see the sketch after this list).
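As a minimal sketch of the post-session feedback idea in the last bullet, the snippet below collects a 1-5 rating plus free-text comments and reports the average score. The in-memory store and field names are assumptions for illustration; in production you would write to a database and surface the results on a dashboard.

```python
from statistics import mean

# In-memory store for illustration only - use a database in production.
feedback_log: list[dict] = []

def record_feedback(session_id: str, rating: int, comment: str = "") -> None:
    """Store one post-session survey response (rating on a 1-5 scale)."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    feedback_log.append({"session": session_id, "rating": rating, "comment": comment})

def average_rating() -> float:
    """Average satisfaction across all recorded sessions."""
    return mean(entry["rating"] for entry in feedback_log)

record_feedback("session-42", 4, "Quick answer, but a bit formal.")
print(f"Average rating: {average_rating():.2f}")
```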
Humans and AIs working together
Designing evaluation processes which use human and AI capabilities together is a smart way to reduce the number of evaluations humans carry out, while bringing in the best of human capabilities where needed. This can move beyond just ‘evaluation’ into hybrid AI-human solutions and products, for example:
- ‘Speak to a manager’ escalation routes where an AI can ask if a client would like to speak to a human.
- Triage, where an AI carries out an initial evaluation of a client’s requirements before directing them to either self-service resources or a human.
- Risk-based escalation, where particular code words or circumstances (e.g. too many back-and-forth messages) prompt an AI conversation to be escalated to a human (see the sketch after this list). This can either be proactive, where the conversation gets instantly escalated to a human, or retrospective, where the conversation is replayed later and the prompts re-calibrated. Calibrtr offers a conversation replay tool for just this use-case - contact us on the form at the bottom of the page for a demo!
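Below is a minimal sketch of the risk-based escalation pattern from the last bullet. The trigger keywords, the turn limit and the notify_human() and ai_reply() callbacks are all assumptions for illustration; a real system would tune these to your own risk profile.

```python
# Assumed trigger words and limits - tune these to your own risk profile.
ESCALATION_KEYWORDS = {"complaint", "lawyer", "refund", "cancel"}
MAX_TURNS_BEFORE_ESCALATION = 8

def should_escalate(messages: list[str]) -> bool:
    """Decide whether the conversation should be handed to a human."""
    if len(messages) > MAX_TURNS_BEFORE_ESCALATION:
        return True  # too much back and forth
    latest = messages[-1].lower()
    return any(word in latest for word in ESCALATION_KEYWORDS)

def handle_turn(messages: list[str], user_message: str, ai_reply, notify_human) -> None:
    messages.append(user_message)
    if should_escalate(messages):
        notify_human(messages)  # proactive hand-off to a human agent
    else:
        messages.append(ai_reply(user_message))  # hypothetical LLM call
```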
How exactly do I do all of this?
There are lots of tools available for chatbot evaluation. At Calibrtr, we’ve come up with some really useful ones to help you evaluate your chatbot throughout the loop. Here are some ideas:
A. In the prompt design stage, you can use our prompt evaluation tools that run thousands of tests to challenge your prompt’s performance.
B. We’ve developed a tool which creates a simple, intuitive workflow for human-in-the-loop reviews of AI responses before they are sent to a client.
C. We also offer simple surveys which can be added to your chat interface after the response has been sent.
D. We’ve got some great tools to use at the end of the conversation. Firstly, we can help you design user surveys to build up your qualitative performance data. We also offer a unique conversation replay tool which allows you to re-run the same client conversation with a different prompt to see if you could have improved performance (a generic sketch of the replay idea follows below).
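To illustrate the replay idea in point D - this is a generic sketch, not Calibrtr's tool - the snippet below re-runs a logged conversation's user turns against a new system prompt. The call_llm() helper and the message format are assumptions.

```python
def replay(conversation: list[dict], new_system_prompt: str) -> list[dict]:
    """Re-run a logged conversation's user turns with a different prompt."""
    history: list[dict] = []
    replayed: list[dict] = []
    for turn in conversation:
        if turn["role"] != "user":
            continue  # skip the original assistant replies
        history.append(turn)
        reply = call_llm(new_system_prompt, history)  # hypothetical LLM call
        new_turn = {"role": "assistant", "content": reply}
        replayed.append(new_turn)
        history.append(new_turn)
    return replayed

# Compare `replayed` against the original assistant turns to judge whether the
# new prompt would have handled the same client conversation better.
```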
Want to try out our services?
Calibrtr offers a Generative AI cost management and performance review platform, with tools to forecast and manage costs, build and evaluate prompts, experiment with different models, A/B test prompts, monitor performance and build in human-in-the-loop or out-of-the-loop approvals and reviews. We’re currently in beta - get in touch with us to find out more!
Our limited beta program is open
If you'd like to apply to join our beta program, please give us a contact email and a brief description of what you're hoping to achieve with Calibrtr.