Human-in-the-loop verification for LLMs
Large Language Model (LLM) applications are everywhere, but they are still an emerging technology, and there are still questions around odd, or even dangerous, outputs from these systems. If you are in a field where an inaccurate output could cause real issues for you, whether financial, reputational or legal, how do you harness generative AI safely? One answer is human-in-the-loop verification.
What is Human-in-the-loop verification?
Quite simply, it means putting a human sign-off stage into the conversational loop between a user and an AI. Your user asks a question, the LLM generates an answer, and a human verifier reviews that answer before it is released to the user. This differs from other sorts of evaluation, which we talk more about in our evaluation of LLM chatbots article: those look back at interactions that have already happened, whereas human-in-the-loop verification stops a response reaching the user until it has been reviewed.
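As a rough sketch of what that workflow can look like in code (the class and function names here are illustrative rather than any particular product's API, and `llm_generate` and `send_to_user` stand in for whatever model call and delivery channel you use):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class DraftResponse:
    """An LLM answer held back until a human reviewer signs it off."""
    user_question: str
    llm_answer: str
    status: ReviewStatus = ReviewStatus.PENDING
    reviewer_notes: Optional[str] = None


def handle_question(question: str,
                    llm_generate: Callable[[str], str],
                    review_queue: List[DraftResponse]) -> DraftResponse:
    """Generate an answer, but park it in the review queue instead of replying."""
    draft = DraftResponse(user_question=question, llm_answer=llm_generate(question))
    review_queue.append(draft)
    return draft


def release_if_approved(draft: DraftResponse,
                        send_to_user: Callable[[str], None]) -> bool:
    """Only send the answer once a human reviewer has marked it approved."""
    if draft.status is ReviewStatus.APPROVED:
        send_to_user(draft.llm_answer)
        return True
    return False
```

The important property is that nothing reaches the user until a reviewer changes the status to approved.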
Why would I want to use human-in-the-loop verification?
We’ve all seen some of the unusual, even dangerous, outputs from chatbots, like the poison sandwich recipes from the Savey-Meal bot recipe generator, or DPD’s chatbot writing poetry that insults DPD and its own service levels. These examples aren’t great for company reputation, but it’s unlikely that many people would follow dangerous advice from a meal planning bot. However, what about generative AI systems that do have a degree of authority, where we do expect people to follow their advice?
We’re talking about systems that give medical, legal or financial advice. If a medical AI system from a reputable healthcare provider told you that your mole, for example, wasn’t cancerous, would you be inclined to listen? Or if a financial advice bot provided an inaccurate version of the tax code, without any signal that it was false? These situations could be really dangerous for users, and open providers up to a whole host of legal, financial and reputational risks. A human-in-the-loop verification system avoids these risks by having a human expert sign off each response before it goes out.
If I have to sign it off, what’s the point?
It might seem like a waste of time to have a generative AI system if it still requires human sign-off. But the key factors are speed and volume. If it takes a human 30 minutes to compose some advice for a user, they can manage 14-16 cases in a seven- to eight-hour day. If it takes them 5 minutes to review and quality-check advice an AI has composed, they can manage 12 cases an hour, or 84-96 cases a day: a six-fold increase in productivity. This can be pushed further by standardising some responses, leaving most of the human review time for the most complex cases.
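A quick back-of-the-envelope check of those numbers, assuming a seven- to eight-hour working day:

```python
def cases_per_day(minutes_per_case: float, working_hours: float) -> int:
    """How many cases one person can handle in a working day."""
    return int(working_hours * 60 // minutes_per_case)

for hours in (7, 8):
    drafting = cases_per_day(30, hours)   # human writes the advice from scratch
    reviewing = cases_per_day(5, hours)   # human reviews and signs off an AI draft
    print(f"{hours}h day: {drafting} drafted vs {reviewing} reviewed "
          f"({reviewing / drafting:.0f}x throughput)")
# 7h day: 14 drafted vs 84 reviewed (6x throughput)
# 8h day: 16 drafted vs 96 reviewed (6x throughput)
```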
Do I need to sign off every single response?
That’s a judgement only you can make. If your business is high-risk and there’s a lot of variability in responses, you may need to sign off every single response for the lifetime of your product. However, if you have a lot of standard responses and your AI system is reliably providing a correct response each time, you may be able to switch to a sampling methodology. This could be random, so 1 in 10 or 1 in 100 responses gets human intervention. It could be risk-based: you could design your AI system to flag the more complex responses for human intervention. It could also be based on user feedback, so you allow all responses to go out but have an escalation route for users flagging issues; however, this is a riskier approach.
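To make those options concrete, here is a minimal sketch of a sampling policy, assuming you supply your own risk score (for example from a classifier or the model's own uncertainty); the threshold and sample rate are placeholders:

```python
import random

SAMPLE_RATE = 0.1        # random review: roughly 1 in 10 responses
RISK_THRESHOLD = 0.7     # risk-based review: hold anything scored above this


def needs_human_review(risk_score: float) -> bool:
    """Decide whether a response should be held for human sign-off.

    risk_score is assumed to come from your own classifier or heuristic
    (topic sensitivity, model uncertainty, etc.); it is illustrative only.
    """
    if risk_score >= RISK_THRESHOLD:       # risk-based sampling
        return True
    return random.random() < SAMPLE_RATE   # random 1-in-N sampling
```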
Isn’t this going to slow down my service?
Yes, absolutely. Generative AI services can respond in seconds, whereas even with the best-staffed team in the world, your human-reviewed responses will take minutes or hours. However, that trade-off is likely to be worth it if you are in a high-risk industry where legally you need a human to sign off. In this type of situation, it’s not about making a brand new AI service that replaces a lawyer, say, but about speeding up existing types of service and making them more efficient. If you are replacing an even slower human-led service, an AI-supported service is likely to be an improvement. Even if not, the reassurance that a human is reviewing outputs is likely to make the delay acceptable to users. You can also consider allowing AI responses to go out first, but with a warning label that they haven’t been signed off yet.
How should I evaluate my human-in-the-loop service?
AI-supported services are all about efficiency, speed and cost savings, as well as finding innovative ways to engage with users. Here are some of the things you might want to evaluate (a sketch of how a few of these might be computed follows the list):
- Verification accuracy rates
- Turnaround times for human review
- Percentage of LLM outputs requiring correction
- Cost and time savings
- Increased productivity of staff
- Experience of staff: this is a brand new type of work and may end up being stressful for staff, particularly if it’s all they do or if there’s a lot of risk involved in making rushed decisions.
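As an illustrative sketch of how the first few of these might be computed from your review logs (the record structure here is invented for the example, not a real tool's schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class ReviewRecord:
    """One reviewed response, as your workflow tool might log it."""
    submitted_at: datetime
    reviewed_at: datetime
    llm_answer: str
    final_answer: str   # what actually went out to the user
    approved: bool


def summarise(records: List[ReviewRecord]) -> Dict[str, float]:
    """Compute simple verification metrics over a batch of reviews."""
    total = len(records)
    corrected = sum(r.final_answer != r.llm_answer for r in records)
    turnaround = sum((r.reviewed_at - r.submitted_at for r in records), timedelta())
    return {
        "approval_rate": sum(r.approved for r in records) / total,
        "correction_rate": corrected / total,
        "avg_turnaround_minutes": turnaround.total_seconds() / 60 / total,
    }
```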
You should also be thinking about feedback loops for your prompts and system design. This means taking some or all of the following steps to continuously improve your system (a small sketch of the first two steps follows the list):
- Capturing detailed feedback from human reviewers
- Analysing patterns in verification results
- Using insights to refine prompts and fine-tune models
- Updating training data based on common errors or misunderstandings
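As a small illustrative sketch of the first two steps, assuming reviewers tag each correction with an error category of your choosing:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def common_error_patterns(corrections: Iterable[Tuple[str, str]],
                          top_n: int = 5) -> List[Tuple[str, int]]:
    """Count reviewer-assigned error categories to see where prompts need work.

    Each item is (error_category, reviewer_note), e.g.
    ("outdated_tax_rate", "Quoted last year's threshold instead of the current one").
    """
    counts = Counter(category for category, _ in corrections)
    return counts.most_common(top_n)
```

The most frequent categories then become candidates for prompt changes, retrieval fixes or additional fine-tuning data.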
Ok, but how do I actually do this in practice?
You need a human-in-the-loop workflow tool. And here’s one we made earlier… Calibrtr’s human-in-the-loop tool provides a workflow for approving or editing your AI outputs before they go to your users. Alongside our usage dashboards, conversation replay tools, and prompt-building and compression tools, this can help you build robust, cost-effective and safer generative AI systems.
Our limited beta program is open
If you'd like to apply to join our beta program, please give us a contact email and a brief description of what you're hoping to achieve with calibrtr.