How to reduce the running costs of Generative AI


Unlike traditional software, generative AI has recurring costs. In this article we’ll dive into why this is the case, how pricing works (focusing on LLMs as a service through an API), and what companies can do to reduce those costs.

What drives Gen AI costs?

Generative AI is much more energy-intensive than traditional computing. It starts with training: huge data sets and months of intensive computing on hundreds or thousands of processors in a datacentre. These costs are large and growing: GPT-4 is estimated to have cost upward of $100m to train, and Anthropic CEO Dario Amodei has suggested we might soon see the $1bn or even $10bn model.

The scale of the models, with millions or billions of parameters, means they continue to need large amounts of computer hardware and energy in use. Sasha Luccioni at Hugging Face calculates that generative AI tasks use up to 33 times the energy of other types of AI/ML systems; costs are task-dependent, with image generation taking 60 times more energy than text generation. Generative AI datacentres are also major users of water, with demand spiking when big models are trained.

There are moves towards making datacentres and generative AI more efficient. However, there are other costs too: datacentres require substantial capital investment to build, particularly for the vast numbers of specialist graphics processing units (GPUs) they need. Salaries for machine learning engineers able to develop these systems are also very high.

How does pricing work for LLMs?

Pricing models for LLMs vary depending on how you are buying your generative AI. For consumers, the standard model is a subscription, using closed-source services like ChatGPT or Claude as an individual user through their web or app interfaces.

However, if you want to build products around an LLM, you need to be able to call it via an API or other interface. Your options here are closed-source or open-source models.

Closed-source providers such as OpenAI (ChatGPT) and Anthropic (Claude) offer LLMs as a service through models which are hosted, managed and maintained by the cloud AI provider. You pay for inputs and outputs, measured in tokens.

For open-source models, you download the model and host it yourself (either on premises or in the cloud), managing and maintaining it yourself, although you can upgrade as new versions are released to keep up with the latest developments. You pay the hosting costs.


Tell me more about closed source pricing

Gen AI providers charge for each message sent to their models (the input) and the response sent back (the output). Messages are broken down into tokens, which represent words or parts of words. Each system does tokenisation differently, but broadly speaking the average is about 1.3 tokens per word in English. If you want to see exactly how a piece of text breaks down, the cloud AI providers offer tokenisers on their websites. Here's the OpenAI version as an example.
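
If you'd rather count tokens in code, here's a minimal sketch using OpenAI's open-source tiktoken library. The model name is just an example; encodings are model-specific, so the same text can produce different counts on different models.

```python
# Count tokens for a piece of text with OpenAI's tiktoken library
# (pip install tiktoken). Encodings are model-specific, so the
# same text can give different counts on different models.
import tiktoken

text = "How do I reduce the running costs of generative AI?"
enc = tiktoken.encoding_for_model("gpt-4o")  # model name is an example
tokens = enc.encode(text)
print(f"{len(text.split())} words -> {len(tokens)} tokens")
```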

Companies charge a small amount per token. Input tokens (the message sent to the AI) commonly cost less than output tokens (the response from the AI), as generating a response takes more memory and compute than reading an input. LLM APIs are also stateless: the model can't remember earlier turns, so you have to resend the whole prompt and message history with each request to keep things consistent, and you pay for those tokens every time.

The overall cost of an interaction with an AI is the cost of the input prompt plus the output reply, multiplied across the number of back-and-forth messages; and because the growing history is resent as input on every turn, long conversations get progressively more expensive per message. We have a prompt cost calculator where you can have a play with costs for different models and prompt/conversation sizes.
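
To make that arithmetic concrete, here's a rough Python estimator under the assumptions above (about 1.3 tokens per word, full history resent each turn). The per-token prices are illustrative placeholders, not any provider's actual price list, so expect your real numbers to differ.

```python
# Rough per-session cost estimator for a multi-turn chat, assuming
# ~1.3 tokens per word and that the full history is resent as input
# on every turn. Prices are illustrative placeholders only.
TOKENS_PER_WORD = 1.3
INPUT_PRICE_PER_M = 5.00    # $ per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # $ per million output tokens (assumed)

def session_cost(prompt_words, words_per_message, turns):
    input_tokens = output_tokens = 0.0
    history = prompt_words * TOKENS_PER_WORD   # system prompt
    message = words_per_message * TOKENS_PER_WORD
    for _ in range(turns):
        history += message          # user message joins the history
        input_tokens += history     # whole history sent as input
        output_tokens += message    # model writes a reply
        history += message          # the reply joins the history too
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# The chatbot example from the next section: 200-word prompt,
# 5 back-and-forth exchanges of 50 words each
cost = session_cost(200, 50, 5)
print(f"${cost:.4f} per session, ${cost * 1000:.2f} per 1,000 sessions")
```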


How much can it cost?

It’s amazing how quickly costs can ramp up: a simple website chatbot session (5 back-and-forth messages of 50 words each, plus a 200-word prompt) could cost as little as $21.67 per 1,000 sessions using GPT-4o, whereas a long session with a therapy bot (100 messages of 200 words each, plus a 12,000-word prompt) using Claude Opus could cost as much as $66.40 for a single interaction, or $66,400 per 1,000.

So, when you are designing your products, keep in mind that a more expensive model, big prompts, or a longer conversation can really increase your costs.

Calibrtr offers experimentation tools to test your prompts and requirements against different models and find the price vs performance sweet spot for your product. Here's our demo if you'd like to try it out.

How to reduce costs?

There are a few different options:

  • Keep prompts and context as short as possible. Calibrtr offers a prompt compression tool that keeps the meaning of a prompt while reducing its size.
  • The more rules you put into your prompt, the better control you have over the LLM’s behaviour, but this can drive up costs by increasing prompt size. One option is to fine-tune a model on examples of perfect conversations to improve behaviour. This can let you shrink the prompt, but it is slower to iterate on than a prompt and costs money to do.
  • If you need a lot of context but it’s driving up your costs, look into retrieval augmented generation (RAG), where a retriever service identifies the right piece of information to feed to your LLM to inform the answer (see the sketch after this list).
  • Identifying the right model for the right outcome is vital. We offer a model experiment service where you can try out different models with your prompts to identify the best outcome.
  • Tracking your costs and the outcomes of your product interactions helps you control what you are spending and what it is doing. Calibrtr offers a spend monitoring dashboard for GenAI deployments.
  • We also have guides to evaluating chatbot performance to help you make sure your prompt is doing exactly what you want it to.
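
To illustrate the RAG idea from the list above: instead of stuffing every document into the prompt, retrieve only the most relevant snippet and send that. The toy retriever below scores by word overlap, which is a deliberate simplification; a real system would use vector embeddings and a vector store.

```python
# Toy illustration of RAG retrieval: pick the most relevant snippet
# rather than sending every document in the prompt. Word-overlap
# scoring is a simplification; real systems use vector embeddings.
import string

def words(text):
    # Lowercase and strip punctuation so "refund?" matches "refund"
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def retrieve(query, documents, top_k=1):
    query_words = words(query)
    return sorted(documents,
                  key=lambda d: len(query_words & words(d)),
                  reverse=True)[:top_k]

docs = [
    "Refunds are processed within 5 working days of approval.",
    "Our support line is open Monday to Friday, 9am to 5pm.",
    "Standard shipping to the EU takes 3 to 7 working days.",
]
question = "When will my refund be processed?"
context = retrieve(question, docs)[0]
prompt = f"Answer using only this context: {context}\n\nQuestion: {question}"
print(prompt)  # only one short snippet goes into the prompt
```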


Should I just host my own open-source model?

Looking at all the potential costs of LLMs as a service, it’s tempting to think that hosting an open-source model is a more attractive option. We go into more detail on the wider pros and cons of this in our article What type of LLM do you need?

But if we are just looking at costs, there’s a lot of complexity. Firstly, how big a model? For some small LLMs, like Apple’s OpenELM models (between 270 million and 3 billion parameters), all you need is a smartphone or a commercially available laptop. For the much larger open-source models available (Llama 2 from Meta comes in sizes from 7bn to 70bn parameters, and Qwen-72B from Alibaba has, unsurprisingly, 72bn parameters) you will need specialist hardware or to host it in the cloud. Bigger isn’t necessarily better: smaller, more focused models might be a better choice for your task than a very large general-purpose model.
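
As an illustration of how little code it can take to run a small open model yourself, here's a minimal sketch using the Hugging Face transformers library. The model name is just an example of a small instruct model; hardware requirements grow quickly with model size.

```python
# Minimal sketch of self-hosting a small open model with Hugging Face
# transformers (pip install transformers torch). The model here is an
# example of a small instruct model that can run on ordinary hardware;
# larger models need GPUs or cloud hosting.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2-0.5B-Instruct")
result = generator("Explain what an LLM token is, in one sentence.",
                   max_new_tokens=60)
print(result[0]["generated_text"])
```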

If you’ve got that hardware already, then you just need to pay for electricity (although this can still be significant). Otherwise you need to look at cloud hosting for your model.

In both cases, you'll need to think about latency: how fast your model can respond to a query. Faster, unsurprisingly, requires more hardware and compute, driving up costs. And of course, more users means more capacity and therefore more cost.


Which one is cheaper: closed or open source?

We would love to be able to give you a categorical answer. But it all depends on your requirements and use cases.

Some people have tried to calculate the cost differences and the point at which open source wins out against a closed-source model. Bruno Rucy from Pipedrive calculated that for a 7bn-parameter model, self-hosting beats ChatGPT-3.5 at a threshold of 73,846 document summarisations per day. However, Mohammed Talib found that running larger open-source models could be significantly more expensive, particularly when trying to match the performance and latency of the LLM-as-a-service offering: “When hosted on AWS, LLAMA3, which comes in 8B and 70B parameter variants, costs about $18 per million tokens, a significantly higher price than GPT-3.5 Turbo, which costs $2 per million tokens”.
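
For your own workload, the break-even arithmetic is simple enough to sketch. The per-token API price below reuses the illustrative GPT-3.5 Turbo figure quoted above; the hosting bill and monthly volume are pure assumptions to replace with your own numbers.

```python
# Break-even sketch: fixed self-hosting bill vs per-token API pricing.
# The API price reuses the illustrative figure quoted above; the
# hosting bill and monthly volume are assumptions to swap for your own.
API_PRICE_PER_M = 2.00           # $/million tokens (GPT-3.5 Turbo figure above)
SELF_HOST_PER_MONTH = 5_000.00   # assumed monthly GPU hosting bill

tokens_per_month = 200_000_000   # assumed workload
api_bill = tokens_per_month / 1_000_000 * API_PRICE_PER_M
print(f"API: ${api_bill:,.0f}/month vs self-hosting: ${SELF_HOST_PER_MONTH:,.0f}/month")

# The volume at which the two bills match
break_even_tokens = SELF_HOST_PER_MONTH / API_PRICE_PER_M * 1_000_000
print(f"Self-hosting only wins above {break_even_tokens:,.0f} tokens/month")
```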

Will costs change over time?

Unfortunately, we don’t have a crystal ball here at Calibrtr. There are different forces at play on AI costs: on one hand, models are becoming more efficient, and as smaller, more tailored models arrive, costs may fall. On the other hand, deployment volumes are likely to increase, and the fundamentals of energy use and computing infrastructure are likely to keep costs up.

What we would say is that building in cost and performance discipline now will help you to build your products on a strong foundation for the future.


Want to try out our services?

Calibrtr offers a Generative AI cost management and performance review platform, with tools to forecast and manage costs, build and evaluate prompts, experiment with different models, A/B test prompts, monitor performance, and build in human-in-the-loop (or out-of-the-loop) approvals and reviews. We’re currently in beta: get in touch with us to find out more!

Our limited beta program is open

If you'd like to apply to join our beta program, please give us a contact email and a brief description of what you're hoping to achieve with calibrtr.


Frequently Asked Questions

Why are generative AI running costs higher than traditional software?
Generative AI running costs are higher because these systems are more energy-intensive, require large amounts of computer hardware, and involve significant ongoing costs for both training and usage. Additionally, salaries for machine learning engineers are high, contributing to the overall expense.

How does pricing work for LLMs as a service?
Pricing for LLMs as a service through APIs typically involves costs per input and output token. Closed-source providers charge based on the number of tokens processed, with output tokens usually costing more than input tokens. The overall cost depends on the size of the prompts, the number of interactions, and the length of each interaction.

What should I consider when choosing between closed-source and open-source models?
When choosing between closed-source and open-source LLM models, consider factors such as cost, ease of setup, control over data privacy, maintenance requirements, model capabilities, and the scale of deployment. Closed-source models offer ease of use and maintenance, while open-source models can offer greater customization and potentially lower costs for high-volume applications.

How can I reduce running costs?
To reduce running costs, keep prompts and context as short as possible, use compression tools, fine-tune models with specific examples, employ retrieval augmented generation (RAG) to manage context, choose the right model for your needs, and track costs and performance to optimize spending.

Should I host my own open-source model?
Hosting your own open-source model can offer greater control over data privacy and potentially lower costs for high-volume applications. However, it requires significant hardware resources, ongoing maintenance, and can result in higher costs for larger models or when aiming for low-latency performance.

Which is cheaper: closed source or open source?
The cost-effectiveness of closed-source versus open-source LLM models depends on your specific requirements and use cases. Closed-source models might be more cost-effective for smaller-scale or lower-volume applications, while open-source models could be more economical for large-scale deployments with the necessary hardware infrastructure.

Will costs change over time?
The costs of generative AI are likely to change over time due to evolving model efficiencies, the development of smaller and more tailored models, and increasing deployment volumes. Building cost and performance discipline now will help prepare for future cost fluctuations.

What does Calibrtr offer?
Calibrtr offers a Generative AI cost management and performance review platform that includes tools for cost forecasting, prompt building and evaluation, model experimentation, A/B testing, performance monitoring, and human-in-the-loop or out-of-the-loop approvals and reviews. These tools help optimize the cost and performance of generative AI deployments.