"We need to fine-tune an LLM for our business" is one of the most common requests in AI projects right now — and one of the most frequently misapplied. Fine-tuning is a specific, well-defined technique with a specific set of use cases. It is not the answer to every AI customisation problem, and understanding when it's the right tool is the first and most important step.
Do you actually need fine-tuning?
Before fine-tuning anything, it's worth ruling out two alternatives that solve a large share of "customise the AI for our business" requests more cheaply and faster.
Prompt engineering and system prompts. A well-crafted system prompt with clear instructions, examples, and constraints can get a general-purpose model to behave in a domain-appropriate way for a surprising range of use cases: answering in a specific tone, following a specific format, applying specific business rules. This costs little beyond engineering time and the extra tokens a longer prompt adds to each request, and it can be iterated in minutes rather than days.
Retrieval-Augmented Generation (RAG). If the core problem is that the model doesn't know your company's specific information (your product catalogue, your internal documentation, your policies), RAG usually solves it more effectively than fine-tuning. RAG retrieves relevant information from a knowledge base and feeds it into the model's context at query time, so nothing needs retraining when the information changes.
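To make the retrieve-then-prompt idea concrete, here is a minimal sketch using only the Python standard library. It assumes a tiny in-memory knowledge base and a simple word-overlap cosine score; the `DOCUMENTS` list, the function names, and the prompt wording are all hypothetical, and a production system would more likely use embeddings and a vector store.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; in practice these would be chunks of real documents.
DOCUMENTS = [
    "Refunds are accepted within 30 days of purchase with a receipt.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Place the retrieved text into the prompt that is sent to the model."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long does shipping take?", DOCUMENTS))
```

Updating the knowledge base here means editing `DOCUMENTS`; the model itself is untouched, which is the property that makes RAG attractive for information that changes.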
Fine-tuning earns its place when the goal is changing the model's behaviour in a way that prompting can't reliably achieve — a specific output format applied consistently across thousands of varied inputs, a particular reasoning style, domain-specific terminology and judgment that's hard to fully specify in a prompt, or significantly reducing latency and cost by using a smaller fine-tuned model instead of a larger general-purpose one for a narrow task.
Three approaches — and when to use each
Full fine-tuning updates every parameter in the model. It gives the most freedom to change the model's behaviour, but it requires substantial compute, a large dataset, and careful handling to avoid "catastrophic forgetting", where the model loses general capabilities while gaining domain-specific ones. For most business applications, this is more than necessary.
LoRA (Low-Rank Adaptation) freezes the original model weights and trains a small set of additional low-rank matrices that adapt the model's behaviour. It requires dramatically less memory and compute than full fine-tuning, trains faster, and the resulting adapter file is small, often tens to a few hundred megabytes rather than the full model size. For the large majority of business fine-tuning use cases, LoRA or one of its variants (such as QLoRA, which quantises the frozen base model to reduce memory further) is the right starting point.
Instruction fine-tuning is not a separate training method but a specific application of either method above. It trains the model on instruction-response pairs in a particular style, which is useful when the goal is consistent behaviour across a defined set of task types, such as customer support responses or structured data extraction.
For nearly every business fine-tuning project, LoRA-based approaches offer the best balance of cost, speed, and result quality.
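A short calculation shows why the adapter is so small. For a frozen weight matrix W of shape d_out × d_in, LoRA trains two matrices B (d_out × r) and A (r × d_in) and uses W + (alpha / r) · BA as the effective weight. The sketch below counts the trainable parameters and applies the update with plain Python lists; the layer size of 4096, the rank of 8, and the function names are illustrative assumptions, not recommended settings.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning versus LoRA on one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # entries of B (d_out x r) plus A (r x d_in)
    return full, lora

def lora_effective_weight(W, B, A, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), with matrices as lists of lists."""
    scale = alpha / rank
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]

full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%
```

In this example the adapter trains under half a percent of the parameters in that one matrix, and the frozen W is never modified, so the adapter can be stored and shipped separately.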
Building the dataset — the part that actually matters most
Model architecture and training method get most of the attention, but dataset quality determines the outcome more than almost anything else. A small, clean, well-curated dataset consistently outperforms a large, messy one.
Format. Most fine-tuning datasets are structured as instruction-response pairs — an input (a question, a task, a prompt) and the desired output. Consistency in format across examples matters more than people expect; inconsistent formatting teaches the model inconsistent behaviour.
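One common way to store instruction-response pairs is JSON Lines, with one JSON object per line. The sketch below checks that every record uses the same keys and has non-empty text. The key names `instruction` and `response` are an assumed schema for illustration; use whatever field names your training framework expects.

```python
import json

REQUIRED_KEYS = {"instruction", "response"}  # hypothetical schema

def validate_jsonl(lines):
    """Return (line_number, problem) pairs for records that break the schema."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
            problems.append((number, "unexpected keys"))
        elif not all(isinstance(record[k], str) and record[k].strip() for k in REQUIRED_KEYS):
            problems.append((number, "empty or non-string field"))
    return problems

sample = [
    '{"instruction": "Classify the ticket: my invoice is wrong", "response": "billing"}',
    '{"prompt": "Classify the ticket: app crashes", "completion": "technical"}',
    '{"instruction": "Classify the ticket: reset password", "response": ""}',
]
print(validate_jsonl(sample))  # [(2, 'unexpected keys'), (3, 'empty or non-string field')]
```

Running a check like this before every training run is a cheap way to catch the formatting drift that creeps in when several people contribute examples.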
Quantity. This varies enormously by task complexity, but as a rough starting reference: a few hundred high-quality examples can produce a noticeable shift in behaviour for narrow tasks. A few thousand is a more typical target for production-quality results on moderately complex tasks. Tens of thousands or more is usually reserved for more ambitious behavioural changes.
Quality over quantity. Examples that are inconsistent, contain errors, or represent edge cases poorly will actively teach the model the wrong behaviour. It is common, and usually correct, to spend more time curating and reviewing a smaller dataset rather than rushing to assemble a larger one.
Where the data comes from. Real historical examples from your business — past support tickets and their correct resolutions, past documents and their correct classifications, past queries and ideal responses — are usually far more valuable than synthetically generated examples, because they capture the actual distribution of inputs the model will see in production, including the messy edge cases that matter most.
Held-out test data. Set aside a portion of the dataset — commonly 10–20% — that the model never sees during training, used purely for evaluating performance afterward. Skipping this step makes it impossible to honestly assess whether the fine-tuning actually worked, versus the model simply memorising the training examples.
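A held-out split can be as simple as a seeded shuffle. The sketch below is one way to do it, assuming the examples fit in memory; the 15% default and the function name are illustrative choices within the 10–20% range mentioned above.

```python
import random

def split_dataset(examples, test_fraction=0.15, seed=42):
    """Shuffle deterministically and return (train, test) with no overlap."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]

train, test = split_dataset(range(100))
print(len(train), len(test))  # 85 15
```

Fixing the seed matters: if the split changes between runs, examples can move from the test set into training, and scores from different runs stop being comparable. Removing duplicate or near-duplicate examples before splitting helps for the same reason.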
Choosing a base model
The choice of base model depends on the deployment target and the task complexity, and this decision interacts directly with the edge-versus-cloud question covered in our article on edge AI vs cloud AI.
For tasks requiring strong general reasoning and where compute isn't tightly constrained, larger open-weight models in the 7–70 billion parameter range are common starting points, fine-tuned via LoRA. For deployment on constrained hardware or where inference cost at scale is the primary concern, smaller models in the 1–7 billion parameter range, sometimes further distilled or quantised, are more appropriate — accepting a capability trade-off in exchange for speed, cost, and the ability to run on more modest hardware.
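A rough memory estimate helps narrow the choice early. The sketch below multiplies parameter count by bytes per parameter for a few common precisions. It covers the weights only; activations, the KV cache, and optimizer state during training all add to this, so treat the numbers as a lower bound. The function name and the precision labels are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions, precision):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

for size in (1, 7, 70):
    print(f"{size}B:", {p: weight_memory_gb(size, p) for p in BYTES_PER_PARAM})
```

By this estimate a 7 billion parameter model needs about 14 GB for its weights at 16-bit precision and about 3.5 GB at 4-bit, which is why quantisation features so heavily in deployment on modest hardware.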
Licensing matters as much as capability for commercial use — confirm the base model's licence permits the specific commercial deployment intended, since this varies significantly between model families and versions.
The training process itself
With a LoRA-based approach, the practical training process generally follows this shape:
- Prepare the dataset in the format expected by the training framework, split into a training set and an evaluation set for monitoring, with the held-out test set kept apart from both
- Select hyperparameters — learning rate, LoRA rank, number of training epochs — starting from established defaults for the chosen base model and method rather than guessing from scratch
- Run training, monitoring the loss curve to check the model is actually learning and not diverging or overfitting
- Evaluate on the held-out test set after training completes, not just by spot-checking a few outputs informally
- Iterate — adjust the dataset, hyperparameters, or both based on evaluation results, and retrain
This is typically an iterative process rather than a single successful run. Expect two to four training cycles before reaching a result that's ready for more rigorous evaluation, even on relatively well-prepared datasets.
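The loss-monitoring step can be partly automated. The sketch below is a simple heuristic, not a standard algorithm: it assumes one training loss and one evaluation loss recorded per epoch, flags a run as diverging if training loss ends higher than it started, and flags overfitting if evaluation loss has not improved for `patience` epochs while training loss kept falling. The function name and thresholds are assumptions for illustration.

```python
def diagnose_loss(train_loss, eval_loss, patience=2):
    """Classify a run from per-epoch losses as 'diverging', 'overfitting', or 'ok'."""
    if not train_loss or len(train_loss) != len(eval_loss):
        raise ValueError("need one train and one eval loss per epoch")
    if len(train_loss) >= 2 and train_loss[-1] > train_loss[0]:
        return "diverging"
    # Index of the epoch with the lowest evaluation loss (first one on ties).
    best_epoch = min(range(len(eval_loss)), key=eval_loss.__getitem__)
    epochs_since_best = len(eval_loss) - 1 - best_epoch
    if epochs_since_best >= patience and train_loss[-1] < train_loss[best_epoch]:
        return "overfitting"
    return "ok"

print(diagnose_loss([2.0, 1.5, 1.2, 1.0, 0.8], [2.1, 1.7, 1.6, 1.7, 1.9]))  # overfitting
print(diagnose_loss([2.0, 2.5, 3.1], [2.1, 2.6, 3.3]))                      # diverging
print(diagnose_loss([2.0, 1.5, 1.2], [2.1, 1.7, 1.5]))                      # ok
```

An "overfitting" result usually points toward fewer epochs or more varied data; a "diverging" result usually points toward a lower learning rate.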
Evaluation — how to know if it actually worked
Evaluation needs to go beyond "the outputs look reasonable when I read a few of them." Three layers are worth building into any serious evaluation process:
Automated metrics on the held-out test set — accuracy, exact match, or task-specific metrics depending on what the model is meant to do, run against examples the model never saw during training.
Comparison against the base model — running the same test set through both the fine-tuned model and the original, unmodified base model, to confirm the fine-tuning actually improved performance on the target task rather than just changing behaviour in a way that feels different without being better.
Human review on a representative sample — automated metrics don't capture everything, particularly for generative tasks where there's no single "correct" answer. Having a domain expert review a sample of outputs catches issues automated metrics miss entirely.
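The first two layers can share one small harness. The sketch below computes exact match for the base and fine-tuned models on the same held-out references and reports the difference. The outputs are made-up placeholders for a ticket-classification task; in practice they would come from running both models over the test set.

```python
def exact_match(predictions, references):
    """Fraction of predictions equal to the reference after trimming and lowercasing."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and the same length")
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return hits / len(references)

# Placeholder data: hypothetical model outputs on four held-out examples.
references = ["billing", "technical", "billing", "account"]
base_outputs = ["Billing", "This looks technical.", "account", "account"]
tuned_outputs = ["billing", "technical", "billing", "technical"]

base_score = exact_match(base_outputs, references)
tuned_score = exact_match(tuned_outputs, references)
print(f"base={base_score:.2f} tuned={tuned_score:.2f} delta={tuned_score - base_score:+.2f}")
# base=0.50 tuned=0.75 delta=+0.25
```

Exact match suits classification and extraction tasks where one answer is correct. For free-form generation it is too strict, which is where the human review layer carries more of the weight.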
Deployment considerations
A fine-tuned model that performs well in evaluation still needs a production deployment plan. Key considerations:
- Inference latency and cost at expected production volume
- Whether the LoRA adapter is merged into the base model for deployment or kept separate (separate adapters allow swapping between multiple fine-tuned variants without duplicating the full base model)
- Monitoring in production to catch behaviour drift or failure modes that didn't appear in evaluation
- A clear process for retraining as more data becomes available or as business requirements evolve
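The storage side of the merge-or-keep-separate decision is simple arithmetic. The sketch below compares the two options for several fine-tuned variants of one base model. The sizes used (a 14 GB base and 0.2 GB adapters) are hypothetical, and the calculation ignores serving overhead, so it illustrates the shape of the trade-off rather than real costs.

```python
def storage_gb(base_gb, adapter_gb, variants, merged):
    """Total storage for `variants` fine-tuned versions of one base model."""
    if merged:
        # Each merged model is a full copy of the base with the adapter folded in.
        return variants * base_gb
    # One shared base plus a small adapter file per variant.
    return base_gb + variants * adapter_gb

print(storage_gb(14, 0.2, variants=5, merged=True))   # 70
print(storage_gb(14, 0.2, variants=5, merged=False))  # 15.0
```

Merging removes the adapter's extra computation at inference time, so it tends to suit a single production variant, while separate adapters suit teams running several variants side by side.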
Common mistakes
- Reaching for fine-tuning before trying prompt engineering or RAG — often a more expensive solution to a problem that didn't need it
- Training on a dataset too small or too inconsistent to teach the intended behaviour reliably
- Skipping a held-out test set — making it impossible to honestly evaluate whether training succeeded
- Evaluating only by reading a handful of outputs — missing systematic failure patterns that only show up at scale
- Not comparing against the base model — sometimes the "improvement" isn't actually an improvement
- Ignoring base model licensing terms for the intended commercial use case
Realistic cost and timeline
| Stage | Typical Duration |
|---|---|
| Determining if fine-tuning is the right approach | 2–5 days |
| Dataset collection and curation | 1–3 weeks (often the longest stage) |
| Base model selection and setup | 2–4 days |
| Initial training run | Hours to a day, depending on dataset size and compute |
| Evaluation and iteration (2–4 cycles) | 1–2 weeks |
| Deployment setup and monitoring | 1 week |
| Total (typical first project) | 4–8 weeks |
Dataset preparation is usually the longest stage, and the stage most worth investing time in. Teams that rush dataset curation to get to training faster often spend more total time on the project than teams that invest properly upfront, because poor data quality forces additional training cycles later.
Fine-tuning works well when the problem genuinely calls for it, and the difference between a successful project and a disappointing one usually comes down to dataset quality and honest evaluation — not the sophistication of the training method. Start by confirming fine-tuning is actually the right tool, invest disproportionately in the dataset, and evaluate rigorously before calling it done.
If you're working through an AI model project and want to discuss the right approach for your specific use case, get in touch with us. AI model training and fine-tuning is one of our three core pillars at Manthrix.
