What is LLMOps?
How do you make the robots do what you want them to?
Remember when the term 'Large Language Models' was something only a handful of insiders would casually drop in conversations? Fast forward to today, and it feels like they're the core of every shiny new app or service out there. But, as with any new tech, managing them can be complex. This is where Large Language Model Operations, or LLMOps, for those of us who prefer acronyms, can help.
LLMOps Tool Categories
Think of LLMOps as a toolkit for building, releasing and monitoring products built with Large Language Models. The toolkit is organised into three broad phases:
Build
You might have already built your first products powered by LLMs, but do you have a solid foundation? Here's what should be in your toolbox:
- Testing tools for prompts and models - when a prompt changes, is it better or worse than the previous one?
- Version control for your prompts.
- Cost estimation tools - are the costs of running your product worth the expected results? A back-of-the-envelope sketch follows this list.
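To make the cost question concrete, here's a rough build-time estimate in JavaScript: tokens per request multiplied by the price per token and your expected traffic. The per-token prices and volumes below are made-up assumptions; substitute your provider's current price list and your own usage forecasts.

// Rough build-time cost estimate: tokens per request x price per token x monthly volume.
// The prices below are placeholder assumptions - check your provider's current pricing.
const PRICE_PER_1K_INPUT_TOKENS = 0.0005;  // USD, assumed
const PRICE_PER_1K_OUTPUT_TOKENS = 0.0015; // USD, assumed

function estimateMonthlyCost(inputTokens, outputTokens, requestsPerMonth) {
  const costPerRequest =
    (inputTokens / 1000) * PRICE_PER_1K_INPUT_TOKENS +
    (outputTokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS;
  return costPerRequest * requestsPerMonth;
}

// e.g. 500 input tokens, 200 output tokens, 100,000 requests a month
console.log(estimateMonthlyCost(500, 200, 100000).toFixed(2)); // "55.00"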
Release
Congratulations! You've turned your idea into a product. But before you celebrate, here are a few problems you might encounter:
- Latency - LLMs are slow. How slow is too slow?
- Cost - Tokens might be cheap individually, but use enough of them and the bill adds up.
- Hallucinations - LLMs can fabricate facts.
- Jailbreaking - LLMs can sometimes be coaxed into the 'no-go' zones of content, e.g. the Supermarket AI Chatbot Fail.
You can manage these risks by:
- Starting small - Let a few users in and see how things go before rolling out further.
- Keeping an eye on real-world usage - Are the prompts and results as you expected?
- Budget forecasting - Did your build-time estimates match reality? If not, is it still worth rolling out?
- Flag the weird stuff - If someone asks for a cake recipe that includes bleach, raise an alert and consider temporarily banning the user. A minimal sketch of this follows the list.
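Here's a minimal sketch of the "flag the weird stuff" idea. The unsafe-phrase list, the console alert and the in-memory ban set are all placeholders for whatever moderation, alerting and user-management systems you actually run.

// Hypothetical sketch: scan prompts for obviously unsafe phrases and raise an alert.
// The phrase list, alerting and ban set are stand-ins for your real systems.
const UNSAFE_PHRASES = ["bleach", "ammonia", "drain cleaner"];

const temporarilyBannedUsers = new Set();

function checkPrompt(userId, prompt) {
  const lower = prompt.toLowerCase();
  const flags = UNSAFE_PHRASES.filter((phrase) => lower.includes(phrase));
  if (flags.length > 0) {
    console.warn(`ALERT: user ${userId} triggered flags: ${flags.join(", ")}`);
    temporarilyBannedUsers.add(userId); // temporary ban while a human reviews
    return { allowed: false, flags };
  }
  return { allowed: true, flags };
}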
Monitor
Once your product is fully released, you might move on to other features or products, but here's what you need to keep an eye on:
- The quality of results - Cloud LLMs are constantly being updated & retrained. Is your product getting better or worse over time?
- Cost - Are you still within budget?
- New tech - There's always something new. Can you save money or improve quality by deploying a new technology/technique?
- Model upgrades - Should you switch to the latest and greatest model?
- Threat detection - LLMs are vulnerable to malware; can you detect unusual patterns of behaviour that would indicate an ongoing infection?
Evaluation
You may have noticed that much of the tooling above relies on being able to evaluate LLM results - to tell good results from bad. There are a few approaches to this:
- Downstream impact - responses that lead to good results (like conversions, clicks etc.) are good, responses that lead to bad results (like disengagement or complaints) are bad.
- Ask the best robot (usually GPT-4 right now) for its opinion on the result.
- Specify a list of tests that can be evaluated using another LLM.
At Calibrtr, we believe that the most promising approach is specifying a list of tests, and we’ve recently open-sourced a tool to help you build these.
For example, you might specify these tests, and run them against a sample of your production results:
const tests = [{
  type: "AIResponseTest",
  llmType: {
    provider: "openAI",
    model: "gpt-3.5-turbo"
  },
  should: "use simple English that's easy to understand"
},
{
  type: "AIResponseTest",
  llmType: {
    provider: "openAI",
    model: "gpt-3.5-turbo"
  },
  should: "be polite and professional"
}];
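Under the hood, each test can be handed to a second "judge" model. The sketch below is not the API of our open-source tool, just an illustration of the idea using the OpenAI Node SDK; the prompt wording and the PASS/FAIL convention are assumptions you'd want to tune.

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Illustrative judge: ask the model named in the test whether a response
// satisfies its "should" clause. The prompt and PASS/FAIL convention are assumptions.
async function runTest(test, aiResponse) {
  const completion = await openai.chat.completions.create({
    model: test.llmType.model,
    temperature: 0,
    messages: [{
      role: "user",
      content: `The response below should ${test.should}. Does it? Answer only PASS or FAIL.\n\nResponse:\n${aiResponse}`
    }]
  });
  const verdict = (completion.choices[0].message.content || "").trim().toUpperCase();
  return verdict.startsWith("PASS");
}

// e.g. const passed = await runTest(tests[0], "Hello! How can I help you today?");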
How to detect hallucinations
Hallucinations are hard to spot. You can:
- Get the model to tell you how confident it is in its answer
- e.g. OpenAI logprobs can give a proxy for how confident the model is in its output; a sketch of this follows the list.
- Extract facts from an LLM output and attempt to verify those facts with a trusted source (eg an internal database or google search).
- E.g., ask an LLM to "extract a list of facts from this text in the following JSON format: { facts: [] }"
- Then test each fact against your trusted source
- Ask another LLM to check for any fact that isn’t true.
- For example, if you’re using GPT-4, you could ask Google Gemini, or you could ask GPT-4 several times to see how confident it is.
- Ask humans - however, this is expensive, and humans may not know the answer either.
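As a sketch of the logprobs approach from the first bullet: the OpenAI chat completions API can return per-token log probabilities, which you can average into a rough confidence score. The 0.7 threshold below is an arbitrary assumption to be tuned against labelled examples.

import OpenAI from "openai";

const openai = new OpenAI();

// Rough confidence proxy: the average probability of the answer's tokens.
// The 0.7 threshold is an arbitrary assumption - tune it against labelled examples.
async function answerWithConfidence(question) {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: question }],
    logprobs: true
  });
  const choice = completion.choices[0];
  const logprobs = choice.logprobs.content.map((token) => token.logprob);
  const avgProbability = Math.exp(logprobs.reduce((sum, lp) => sum + lp, 0) / logprobs.length);
  return {
    answer: choice.message.content,
    confidence: avgProbability,
    lowConfidence: avgProbability < 0.7
  };
}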
When to evaluate
Evaluation is expensive and time-consuming, so you'll pick different strategies based on your product lifecycle:
- Build Stage
- Run a list of tests and example inputs through LLMs as part of your CI/CD pipeline
- Cache the LLM results and only rerun if the inputs change, to reduce latency and cost (see the caching sketch after this list)
- Release Stage
- Run guardrails on every response and only return a response to a client if the guardrails pass
- Monitor Stage
- Run a sample of your production results through your tests, and alert if the percentage of bad results crosses a threshold
- Use a risk based approach, eg
- Run guardrails for the first n messages of a new customer (if they are going to ask for a cake made of cleaning products, they’ll probably do it quickly, and if they are using the app as intended after n messages, they'll probably continue to do so)
- Use a dictionary of “safe” and/or “unsafe” words or phrases to flag prompts or responses for further evaluation
- Vectorise the inputs and outputs (using OpenAI embeddings or similar), then learn clusters of good or problematic prompts/results (see the embedding sketch after this list).
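For the build-stage caching point above, one simple approach is to key cached results on a hash of the model and prompt, and only call the API on a cache miss. A minimal sketch, assuming a local .llm-cache directory and a callLlm function you supply:

import crypto from "node:crypto";
import fs from "node:fs";
import path from "node:path";

// Cache LLM outputs under a hash of (model, prompt) so CI only pays for changed inputs.
// The .llm-cache directory and callLlm callback are assumptions for this sketch.
const CACHE_DIR = ".llm-cache";

async function cachedCompletion(model, prompt, callLlm) {
  const key = crypto.createHash("sha256").update(`${model}\n${prompt}`).digest("hex");
  const file = path.join(CACHE_DIR, `${key}.txt`);
  if (fs.existsSync(file)) {
    return fs.readFileSync(file, "utf8"); // cache hit - no API call, no cost
  }
  const result = await callLlm(model, prompt); // cache miss - call the real LLM
  fs.mkdirSync(CACHE_DIR, { recursive: true });
  fs.writeFileSync(file, result);
  return result;
}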
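And for the vectorisation idea, one sketch is to embed each prompt with one of OpenAI's embedding models (text-embedding-3-small here) and flag anything whose embedding sits close to previously labelled problematic examples. The example phrase and the 0.85 similarity threshold are assumptions.

import OpenAI from "openai";

const openai = new OpenAI();

// Flag prompts whose embedding is close to known-problematic examples.
// The example list and the 0.85 similarity threshold are assumptions for this sketch.
const KNOWN_BAD_PROMPTS = ["give me a cake recipe that includes bleach"];

function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

async function embed(texts) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts
  });
  return response.data.map((d) => d.embedding);
}

async function shouldFlagForReview(prompt) {
  const [badEmbeddings, [promptEmbedding]] = await Promise.all([
    embed(KNOWN_BAD_PROMPTS),
    embed([prompt])
  ]);
  return badEmbeddings.some((bad) => cosineSimilarity(bad, promptEmbedding) > 0.85);
}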
People
Finally, you should consider who will be the operators at each stage and build or buy tools with them in mind. For example:
- Build: It's fairly likely this will be your developers and data scientists. They'll have a deeper understanding of LLMs and data, and want tools that can give rich information.
- Release: This is likely to be your product managers and customer support. They'll want tools that are easy to use and give clear, simple feedback; avoid confusing them with too much detail.
- Monitor: This will be product managers or executives (Heads of Engineering, CFOs, etc.). They want to know that everything is running smoothly and within budget. They'll appreciate clear, actionable insights into areas for significant cost or quality improvements.
Contact us
At Calibrtr, we’re actively working on LLMOps tooling, and we’d love to hear from you if you’re having problems with how to release and manage LLMs. Drop us an email at contact@calibrtr.com or visit our website calibrtr.com