Optimise Large * Models in Production
What we have covered so far:
Part 1: Putting Large Models in Production: https://medium.com/@ravishtiwari/putting-large-models-in-production-part-1-51a753ee20d44
Part 2: Running Large Models in Production: https://medium.com/@ravishtiwari/running-large-models-in-production-c56f77e9486d
Once we have a model that we can move to production, we need to start thinking about optimisations to ensure profitability, usability, and adoption, which is what makes the effort sustainable and fuels the next leg of innovation.
We need to understand that we can't use the same yardstick for training and inference. Training is a one-time activity; inference is an ongoing process that runs every time a query hits our LLM/LVM/L*M model in production. Optimise inference to improve the RoI on your model.
What are the areas of Optimisation here?
Broadly speaking, these are the areas where we can optimise:
- Software Stack: the tools we use and how we configure them.
- Hardware Stack: the hardware we choose.
- Fine Tuning Approach: what level and type of fine-tuning we want to do.
- RoI Expectation: how soon and how much we expect our model to earn.
Why exactly are we discussing these topics here?
Training and/or fine-tuning is a one-time exercise (we might repeat it every quarter or so based on need), but it carries a huge cost. Inference, however, has to run continuously, online or batch, sync or async: whenever a query is sent to the model, inference needs to run. This cost is huge as well, since for most LLMs, if we need good throughput (which all of us do), we need a GPU. This is where a lot of cash burn can happen, even if our fine-tuning is optimised.
We need to understand that training an ML model (large or otherwise), fine-tuning an existing model, and running the model in production all have different optimisation and cost implications.
If we want our model to be financially viable and sustainable, then inference is where we will spend most of the $$$, and that is what we need to optimise.
To improve the RoI on an ML model running in production, we need to look at the end-to-end process and pipeline, including the fine-tuning approach, the hardware we use, the software tool stack, and our RoI expectations.
But, why not include Training from scratch?
Training Large Models
Training is a one-time activity. It might take multiple iterations to get a model that meets our requirements, but we won't do it daily (and most GPU-rich companies are already in that race; there isn't much left for the GPU-poor). This article is about how to optimise LLM/LVM/L*M models in production, from an inference perspective.
Fine Tuning Optimisations
Now this is the route a lot of organisations are going to take. IMHO, it also makes sense — if there is something that can be customised as needed, why train from scratch?
There are multiple ways to fine-tune, and a few of them are optimised for cost, others for performance, and some for coverage. The three most popular approaches I am seeing are:
Full Parameter Fine Tuning: adjust all the parameters of the model during the fine-tuning process. Since we are adjusting every parameter, the fine-tuning run is long, expensive, and compute heavy.
LoRA: we only change a small subset of parameters. LoRA gives us transfer learning that is memory and storage efficient: we freeze the pre-trained model weights and inject small trainable adapters on top of them. Because we don't touch all the parameters, it is faster and cheaper to fine-tune than Full Parameter Fine Tuning.
QLoRA: in a nutshell, LoRA on top of a quantized base model; the frozen pre-trained weights are loaded in lower precision (typically 4-bit), hence the name QLoRA.
Now, based on what level of fine-tuning we are doing, LoRA vs Full Parameter tuning is going to give us different token throughput.
If we can quantize the model in question, it is going to be the cheapest option.
Depending on the business requirement, expenditure expectation, and time constraints, select the fine-tuning method best suited to the problem at hand, based on how much we want to spend on fine-tuning and on inference after that.
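To make the LoRA vs QLoRA distinction concrete, here is a minimal sketch of the setup using Hugging Face PEFT and bitsandbytes. The base model name, target modules, and hyperparameters are illustrative assumptions, not recommendations for any specific workload.

```python
# A minimal sketch of LoRA / QLoRA fine-tuning setup with Hugging Face PEFT.
# Model name, target modules, and hyperparameters here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM you have access to

# QLoRA part: load the frozen base model in 4-bit so it fits in far less GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA part: freeze the base weights and inject small trainable adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which modules to adapt is model specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

With a 7B-class model, the 4-bit base weights plus adapters typically fit on a single 24 GB GPU, which is exactly why this route is so much cheaper than full parameter fine-tuning.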
Software Stack Optimisations
We need to understand that any ML application (not just Large Language Models) is different from traditional server-side applications, and LLMs are in a league of their own.
For software stack optimisation, we need to be open to experimental and new libraries that can help us improve performance. Based on my experience, I can suggest the following:
- Instead of Flask or Django, use tools tuned for the LLM use case, such as FastChat.
- When using FastChat, run the CPU controller and GPU workers on different VMs, and create more than one controller (see the FastChat section of this post: https://medium.com/@ravishtiwari/running-large-models-in-production-c56f77e9486d).
- Make use of vLLM and other libraries that support and improve distributed inference (https://docs.vllm.ai/en/latest/serving/distributed_serving.html, https://www.kaggle.com/code/aisuko/distributed-inference-with-accelerate/notebook); see the sketch after this list.
- When using LoRA fine-tuned models, make use of LoRAX: https://loraexchange.ai/
- Be open to quantising the model when a performance boost is needed and we can accept some drop in accuracy.
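As a concrete example of the vLLM suggestion above, here is a minimal sketch of batched generation with tensor parallelism across two GPUs. The model name and sampling settings are placeholders; adapt them to your deployment.

```python
# A minimal sketch of batched generation with vLLM, assuming two GPUs on one node.
# The model name and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    tensor_parallel_size=2,  # shard the model across 2 GPUs for distributed inference
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Why does inference dominate the cost of running an LLM?"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```

For online serving, vLLM also ships an OpenAI-compatible HTTP server that accepts the same tensor-parallel settings, which is usually the better fit behind a production endpoint.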
Hardware Stack Optimisations
We need to choose the most suitable hardware to get optimal performance and the best price-performance out of our model. We also need to remind ourselves:
- Some models can run on CPU, but not all of them can.
- Not all GPUs are the same; each GPU has a different configuration and hence different performance.
Let’s look at some of the popular and in-demand GPUs:
- L4 → 24 GB GPU memory, 300 GB/s memory bandwidth
- L40 → 48 GB GPU memory, around 860 GB/s memory bandwidth
- A100 → 80 GB GPU memory, around 2 TB/s memory bandwidth
- H100 SXM → 80 GB GPU memory, around 3.4 TB/s memory bandwidth
Now, looking at the above, is it fair to expect the same throughput and price-performance from these different GPUs? IMHO: not a fair ask.
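A quick back-of-the-envelope calculation makes the point: GPU memory alone already rules some pairings out. The sketch below only counts the weights; KV cache, activations, and framework overhead come on top, and the 7B figure is just an example.

```python
# Rough rule of thumb: will the model weights even fit in GPU memory?
# This only counts weights; KV cache, activations and framework overhead add more.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1B params * N bytes ≈ N GB

for precision, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"7B model @ {precision}: ~{weight_memory_gb(7, nbytes):.1f} GB of weights")

# ~14 GB at fp16 fits on an L4 (24 GB) with headroom for the KV cache,
# while the same model at fp32 (~28 GB) already pushes us towards an L40 or A100.
```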
- We are willing to use GPUs, but we need to understand that GPUs are different from CPUs.
- Depending on the use case, TPUs can deliver better price-performance.
- Specialised chips are tailored for specific needs and can therefore offer better performance.
Choosing hardware is as important as software for performance, if not more so. For example, consider this:
We have an SD XL model with a LoRA fine-tuned adapter of around 5–10 million parameters, optimised for specific tasks via transfer learning. We want to run this model in production, and we have the following hardware to use:
- NVIDIA T4
- NVIDIA L4
- Google Cloud TPU
For this particular use case, I have seen an almost 8x performance boost when switching from a T4 to TPUs, and almost 3x compared to an L4. But is the TPU cheaper than an L4 or T4? Unit-cost wise it is not, but we need to look at price-performance and the user-experience improvement.
Not all hardware is the same; benchmark and decide which one is best suited for the use case.
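For the SD XL example above, a benchmark can be as simple as timing a few generations after a warm-up run. The sketch below assumes a CUDA GPU and the diffusers library; the checkpoint and LoRA paths are placeholders, and a TPU run would use a JAX/Flax variant of the same measurement.

```python
# A minimal sketch of benchmarking an SD XL + LoRA pipeline on a CUDA GPU.
# Checkpoint and LoRA paths are placeholders; a TPU run would use a JAX/Flax pipeline instead.
import time
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/your-lora-adapter")  # hypothetical adapter path

prompt = "a product photo of a ceramic mug, studio lighting"
pipe(prompt, num_inference_steps=30)  # warm-up run so caching effects don't skew results

torch.cuda.synchronize()
start = time.perf_counter()
runs = 5
for _ in range(runs):
    pipe(prompt, num_inference_steps=30)
torch.cuda.synchronize()
print(f"average latency: {(time.perf_counter() - start) / runs:.2f}s per image")
```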
RoI Expectation
This is the most important aspect of running an LLM in production, and it is where things either go to plan or go off track.
Inference is an ongoing process. Every time a user request hits our inference endpoint, our inference pipeline runs and costs us $$$.
We need to clearly identify two things: what differentiator are we offering, and what RoI expectation do we have?
Differentiator → Do we want to be the cheapest offering out there, or to create a value-add around the model for users? Being the cheapest can only be sustained through large scale or cash burn.
Return on Investment → How soon do we expect our model/AI service/GenAI API to start making money? How much does it need to generate in order to sustain itself? What is the value proposition we are going to create? What is our burn appetite?
We need to have a very clear goal and expectations in place; only then can we sustain such initiatives.
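One way to force that clarity is to put the unit economics into a few lines of code. The numbers below are made up; plug in your own GPU cost, sustained throughput, and pricing, and note that this assumes 100% utilisation.

```python
# A toy RoI sanity check with made-up numbers and 100% utilisation assumed.
gpu_cost_per_hour = 4.0           # USD per GPU-hour (assumption)
throughput_tokens_per_sec = 1500  # sustained generation throughput under load (assumption)
price_per_1k_tokens = 0.002       # what we charge users (assumption)

tokens_per_hour = throughput_tokens_per_sec * 3600
cost_per_1k_tokens = gpu_cost_per_hour / (tokens_per_hour / 1000)
revenue_per_hour = (tokens_per_hour / 1000) * price_per_1k_tokens

print(f"serving cost: ${cost_per_1k_tokens:.4f} per 1K tokens")
print(f"gross margin per GPU-hour: ${revenue_per_hour - gpu_cost_per_hour:.2f}")
# A negative margin here means the unit economics do not work at this utilisation,
# no matter how good the model is.
```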
IMHO, a lot of AI startups are going to crash and burn due to unrealistic or unclear RoI expectations, not because they lack a good model.
That's a wrap. Let me know what you think about LLMs in production. Share your experience; I would love to know more.
PS: The content is human-generated, so it might have not-so-appropriate word choices in some cases.
PPS: The images are all GPT-generated, so all the blame goes to the prompt I used and the model I chose.
Got a question or a suggestion or found inaccuracies?
I would love to connect with you and hear your feedback.
Let’s get in touch