Human-in-the-loop verification for LLMs
Large Language Model (LLM) applications are everywhere, but they are still an emerging technology, and there are still questions around odd, or even dangerous, outputs from these systems. If you are in a field where an inaccurate output could cause real issues for you, whether financial, reputational or legal, how do you harness generative AI safely? One answer is human-in-the-loop verification.
What is Human-in-the-loop verification?
Quite simply, it means putting a human sign-off stage into the conversational loop between a user and an AI. Your user asks a question, the LLM generates an answer, and a human verifier reviews that answer before it is released to the user. This differs from other sorts of evaluation, which we talk more about in our evaluation of LLM chatbots article: those look back at interactions that have already happened, whereas human-in-the-loop verification stops a response reaching the user until it has been reviewed.
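As a rough sketch of what that workflow can look like in code (the class and function names here are illustrative rather than any particular product's API, and `llm_generate` and `send_to_user` stand in for whatever model call and delivery channel you use):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class DraftResponse:
    """An LLM answer held back until a human reviewer signs it off."""
    user_question: str
    llm_answer: str
    status: ReviewStatus = ReviewStatus.PENDING
    reviewer_notes: Optional[str] = None


def handle_question(question: str,
                    llm_generate: Callable[[str], str],
                    review_queue: List[DraftResponse]) -> DraftResponse:
    """Generate an answer, but park it in the review queue instead of replying."""
    draft = DraftResponse(user_question=question, llm_answer=llm_generate(question))
    review_queue.append(draft)
    return draft


def release_if_approved(draft: DraftResponse,
                        send_to_user: Callable[[str], None]) -> bool:
    """Only send the answer once a human reviewer has marked it approved."""
    if draft.status is ReviewStatus.APPROVED:
        send_to_user(draft.llm_answer)
        return True
    return False
```

The important property is that nothing reaches the user until a reviewer changes the status to approved.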
Why would I want to use human-in-the-loop verification?
We’ve all seen some of the unusual, even dangerous, outputs from chatbots, like the poison sandwich recipes from the Savey-Meal bot recipe generator, or DPD’s chatbot writing poetry that insults DPD and its own service levels. These examples aren’t great for company reputation, but it’s unlikely that many people would follow dangerous advice from a meal planning bot. However, what about generative AI systems that do have a degree of authority, where we do expect people to follow their advice?
We’re talking about systems that give medical, legal or financial advice. If a medical AI system from a reputable healthcare provider told you that your mole, for example, wasn’t cancerous, would you be inclined to listen? Or if a financial advice bot provided an inaccurate version of the tax code, without any signal that it was false? These situations could be really dangerous for users, and open providers up to a whole host of legal, financial and reputational risks. A human-in-the-loop verification system avoids these risks by having a human expert sign off each response before it goes out.
If I have to sign it off, what’s the point?
It might seem like a waste of time to have a generative AI system if it still requires human sign-off. But the key factors are speed and volume. If it takes a human 30 minutes to compose some advice for a user, they can manage 14-16 cases in a seven- to eight-hour day. If it takes them 5 minutes to review and quality-check advice an AI has composed, they can manage 12 cases an hour, or 84-96 cases a day: a six-fold increase in productivity. This can be pushed further by standardising some responses, leaving most of the human review time for the most complex cases.
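A quick back-of-the-envelope check of those numbers, assuming a seven- to eight-hour working day:

```python
def cases_per_day(minutes_per_case: float, working_hours: float) -> int:
    """How many cases one person can handle in a working day."""
    return int(working_hours * 60 // minutes_per_case)

for hours in (7, 8):
    drafting = cases_per_day(30, hours)   # human writes the advice from scratch
    reviewing = cases_per_day(5, hours)   # human reviews and signs off an AI draft
    print(f"{hours}h day: {drafting} drafted vs {reviewing} reviewed "
          f"({reviewing / drafting:.0f}x throughput)")
# 7h day: 14 drafted vs 84 reviewed (6x throughput)
# 8h day: 16 drafted vs 96 reviewed (6x throughput)
```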
Do I need to sign off every single response?
That’s a judgement only you can make. If your business is high-risk and there’s a lot of variability in responses, you may need to sign off every single response for the lifetime of your product. However, if you have a lot of standard responses and your AI system is reliably providing a correct response each time, you may be able to switch to a sampling methodology. This could be random, so 1 in 10 or 1 in 100 responses gets human intervention. It could be risk-based: you could design your AI system to flag the more complex responses for human intervention. It could also be based on user feedback, so you allow all responses to go out but have an escalation route for users flagging issues; however, this is a riskier approach.
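To make those options concrete, here is a minimal sketch of a sampling policy, assuming you supply your own risk score (for example from a classifier or the model's own uncertainty); the threshold and sample rate are placeholders:

```python
import random

SAMPLE_RATE = 0.1        # random review: roughly 1 in 10 responses
RISK_THRESHOLD = 0.7     # risk-based review: hold anything scored above this


def needs_human_review(risk_score: float) -> bool:
    """Decide whether a response should be held for human sign-off.

    risk_score is assumed to come from your own classifier or heuristic
    (topic sensitivity, model uncertainty, etc.); it is illustrative only.
    """
    if risk_score >= RISK_THRESHOLD:       # risk-based sampling
        return True
    return random.random() < SAMPLE_RATE   # random 1-in-N sampling
```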
Isn’t this going to slow down my service?
Yes, absolutely. Generative AI services can respond in seconds, whereas even with the best-staffed team in the world, your human-reviewed responses will take minutes or hours. However, that trade-off is likely to be worth it if you are in a high-risk industry where legally you need a human to sign off. In this type of situation, it’s not about making a brand new AI service that replaces a lawyer, say, but about speeding up existing types of service and making them more efficient. If you are replacing an even slower human-led service, an AI-supported service is likely to be an improvement. Even if not, the reassurance that a human is reviewing outputs is likely to make the delay acceptable to users. You can also consider allowing AI responses to go out first, but with a warning label that they haven’t been signed off yet.
How should I evaluate my human-in-the-loop service?
AI-supported services are all about efficiency, speed and cost savings, as well as finding innovative ways to engage with users. Here are some of the things you might want to evaluate (a sketch of how a few of these might be computed follows the list):
- Verification accuracy rates
- Turnaround times for human review
- Percentage of LLM outputs requiring correction
- Cost and time savings
- Increased productivity of staff
- Experience of staff: this is a brand new type of work and may end up being stressful for staff, particularly if it’s all they do or if there’s a lot of risk involved in making rushed decisions.
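As an illustrative sketch of how the first few of these might be computed from your review logs (the record structure here is invented for the example, not a real tool's schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class ReviewRecord:
    """One reviewed response, as your workflow tool might log it."""
    submitted_at: datetime
    reviewed_at: datetime
    llm_answer: str
    final_answer: str   # what actually went out to the user
    approved: bool


def summarise(records: List[ReviewRecord]) -> Dict[str, float]:
    """Compute simple verification metrics over a batch of reviews."""
    total = len(records)
    corrected = sum(r.final_answer != r.llm_answer for r in records)
    turnaround = sum((r.reviewed_at - r.submitted_at for r in records), timedelta())
    return {
        "approval_rate": sum(r.approved for r in records) / total,
        "correction_rate": corrected / total,
        "avg_turnaround_minutes": turnaround.total_seconds() / 60 / total,
    }
```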
You should also be thinking about feedback loops for your prompts and system design. This means taking some or all of the following steps to continuously improve your system (a small sketch of the first two steps follows the list):
- Capturing detailed feedback from human reviewers
- Analysing patterns in verification results
- Using insights to refine prompts and fine-tune models
- Updating training data based on common errors or misunderstandings
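As a small illustrative sketch of the first two steps, assuming reviewers tag each correction with an error category of your choosing:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def common_error_patterns(corrections: Iterable[Tuple[str, str]],
                          top_n: int = 5) -> List[Tuple[str, int]]:
    """Count reviewer-assigned error categories to see where prompts need work.

    Each item is (error_category, reviewer_note), e.g.
    ("outdated_tax_rate", "Quoted last year's threshold instead of the current one").
    """
    counts = Counter(category for category, _ in corrections)
    return counts.most_common(top_n)
```

The most frequent categories then become candidates for prompt changes, retrieval fixes or additional fine-tuning data.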
Ok, but how do I actually do this in practice?
You need a human-in-the-loop workflow tool. And here’s one we made earlier… Calibrtr’s human-in-the-loop tool provides a workflow for approving or editing your AI outputs before they go to your users. Alongside our usage dashboards, conversation replay tools, and prompt-building and compression tools, this can help you build robust, cost-effective and safer generative AI systems.
Our limited beta program is open
If you'd like to apply to join our beta program, please give us a contact email and a brief description of what you're hoping to achieve with calibrtr.