Fine-tuning updates model weights on your labeled examples so behavior — format, tone, classification boundaries — becomes internalized. It is powerful and easy to misuse. Teams fine-tune because a blog post said so, then discover that a retrieval index would have shipped fresher policy answers in half the time.
This guide is a decision framework for engineering leads and product owners: when fine-tuning is the right capital allocation, when it is waste, and how LoRA and QLoRA changed the economics.
The order that actually works
In most enterprises the sequence is: strong system prompt and few-shot examples, then RAG if answers need private or changing facts, then fine-tuning when evaluation shows stable failure modes that retrieval and prompting cannot fix.
Skipping steps burns GPU budget and encodes outdated policies in weights until someone retrains. Document the decision — not only the model name — so the next team does not repeat the experiment.
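The escalation order above can be sketched as a small decision helper. This is only an illustration: `Findings`, its fields, and `next_step` are hypothetical names invented here, and a real decision record would carry more context than four booleans.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Hypothetical summary of what has been tried and what evaluation shows."""
    prompt_and_few_shot_tried: bool
    needs_private_or_changing_facts: bool
    rag_tried: bool
    stable_failures_remain: bool  # same error class keeps repeating on the eval set


def next_step(f: Findings) -> str:
    """Return the next thing to try, in the order prompting -> RAG -> fine-tuning."""
    if not f.prompt_and_few_shot_tried:
        return "prompting"
    if f.needs_private_or_changing_facts and not f.rag_tried:
        return "rag"
    if f.stable_failures_remain:
        return "fine-tuning"
    return "stop"  # nothing left to fix by escalating


# Prompting is done, answers depend on private facts, RAG not tried yet.
print(next_step(Findings(True, True, False, True)))
```

Storing the `Findings` values alongside the chosen step is one way to document the decision rather than only the model name.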
Fine-tune when these signals are true
- Output schema must be identical across millions of calls: JSON fields, legal clauses, medical codes.
- Prompt engineering plateaus on a held-out eval set, with the same error class repeating.
- Latency and token cost dominate: shorter prompts after tuning repay the training cost in weeks.
- You need on-prem inference with a smaller model that cannot carry eight thousand tokens of RAG context on every call.
- The task is classification, extraction, routing, or summarization with stable input-output pairs you can label.
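The claim that shorter prompts repay training in weeks is easy to check with back-of-envelope arithmetic. The sketch below assumes a flat per-token price and a fixed number of prompt tokens saved per call. Every number in the example is a made-up placeholder, not a quote from any provider.

```python
def breakeven_days(training_cost_usd: float,
                   calls_per_day: float,
                   tokens_saved_per_call: float,
                   usd_per_1k_tokens: float) -> float:
    """Days until prompt-token savings cover the one-off training cost."""
    daily_saving = calls_per_day * tokens_saved_per_call / 1000 * usd_per_1k_tokens
    if daily_saving <= 0:
        raise ValueError("tuning saves nothing per day; it never pays back")
    return training_cost_usd / daily_saving


# Hypothetical inputs: $1,000 training run, 50k calls/day,
# 1,500 prompt tokens removed per call, $0.0005 per 1k input tokens.
days = breakeven_days(1000, 50_000, 1500, 0.0005)
print(f"break-even after about {days:.0f} days")
```

With these placeholder inputs the payback lands a little under four weeks. Halve the call volume and it doubles, which is why traffic estimates belong in the decision record.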
Do not fine-tune when
- Facts change weekly: product catalog, pricing, compliance macros.
- Users must cite source documents line by line.
- You have fewer than two hundred quality labeled examples and no labeling pipeline.
- The problem is solved by tool calling or JSON mode you have not configured yet.
- You cannot describe the task as input-to-output pairs. In that case, you are not ready.
LoRA, QLoRA, and full fine-tuning
| Method | Typical cost | Best for |
|---|---|---|
| Full fine-tuning | $500–$5k+ per run | Maximum quality ceiling |
| LoRA | $100–$1k per run | Production default for 7B–13B models |
| QLoRA | $50–$500 per run | Budget labs, rapid iteration |
| Provider API fine-tuning | Per-token training fee | When data can leave your perimeter |
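
One reason adapter methods sit lower in the cost column is the number of weights that actually receive gradients. LoRA freezes a weight matrix and trains two low-rank factors instead, so the trainable count for that matrix drops from `d_in * d_out` to `rank * (d_in + d_out)`. QLoRA additionally keeps the frozen base weights in 4-bit quantized form. The sketch below only does the counting. The 4096 dimension and rank 8 are illustrative assumptions, not a recommendation.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for the two LoRA factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


d = 4096   # assumed hidden size of one projection matrix
r = 8      # assumed LoRA rank

full = full_params(d, d)        # 16,777,216
lora = lora_params(d, d, r)     # 65,536
print(f"LoRA trains {lora / full:.2%} of this matrix's weights")
```

The same ratio applies to every matrix you attach an adapter to, which is why rank and the choice of target layers are the main cost levers.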
Data quality and operations
One thousand consistent examples beat one hundred thousand noisy pairs. Start with fifty to one hundred gold examples from domain experts. Scale with LLM-assisted generation plus human review.
Budget for quarterly reviews: turn production failures into training rows, retrain, and compare the result against the baseline on the golden set. Version adapters like any other dependency.
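The comparison against the baseline can be reduced to a simple gate. This is a minimal sketch that assumes each golden-set example gets a score between 0 and 1 and that a small tolerated drop is acceptable. The function name and the `max_drop` default are hypothetical.

```python
def passes_regression_gate(baseline_scores: list[float],
                           candidate_scores: list[float],
                           max_drop: float = 0.01) -> bool:
    """True if the retrained adapter's mean golden-set score has not
    fallen more than `max_drop` below the baseline's mean score."""
    if not baseline_scores or len(baseline_scores) != len(candidate_scores):
        raise ValueError("both runs must score the same non-empty golden set")
    baseline = sum(baseline_scores) / len(baseline_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    return candidate >= baseline - max_drop


# Same four golden examples, scored 1.0 (correct) or 0.0 (wrong).
print(passes_regression_gate([1, 1, 0, 1], [1, 1, 1, 1]))  # True
print(passes_regression_gate([1, 1, 0, 1], [1, 0, 0, 1]))  # False
```

A mean hides which examples flipped, so in practice it helps to log the per-example differences next to the adapter version as well.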
Implementation pitfalls
Teams ship demos without access control on the retrieval index, then discover that legal has blocked the rollout. Map SSO groups to document metadata before polishing the UI.
Another pitfall: optimizing generation while retrieval recall is below eighty percent on golden questions. Fix the index and chunking first — no prompt will substitute for missing documents.
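Retrieval recall on golden questions can be measured before any generation work starts. The sketch below uses one common definition, assumed here: a question counts as a hit if at least one of its known-relevant documents appears in the top `k` results. The document ids and question keys are placeholders.

```python
def recall_at_k(retrieved: dict[str, list[str]],
                relevant: dict[str, set[str]],
                k: int) -> float:
    """Fraction of golden questions with at least one relevant doc in the top k."""
    if not relevant:
        raise ValueError("golden set is empty")
    hits = 0
    for question, gold_docs in relevant.items():
        top_k = retrieved.get(question, [])[:k]
        if gold_docs.intersection(top_k):
            hits += 1
    return hits / len(relevant)


golden = {"q1": {"d1"}, "q2": {"d9"}}                   # known-relevant docs per question
results = {"q1": ["d3", "d1"], "q2": ["d2", "d4"]}      # ranked retriever output

score = recall_at_k(results, golden, k=2)
print(score)  # 0.5
if score < 0.8:
    print("fix the index and chunking before tuning prompts")
```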
Operating the system after launch
Assign a business owner for corpus freshness and a technical owner for pipelines. Weekly review of refused queries and low-score retrievals feeds backlog for new documents or metadata fixes.
Budget for a quarterly evaluation, and rerun it when providers ship new base models. A regression run on the golden set is cheaper than incident response after a silent quality drop.
Next steps for your organization
Document the decision record: what must be true in answers, how often facts change, and cost of failure. Scope a four-to-eight-week pilot with named metrics.
If you need hands-on architecture, evaluation design, or production integration, our LLM and RAG services follow the same delivery model described across this AI cluster.
Data labeling workflow that scales
Start with fifty gold examples written by domain experts — not scraped tickets without review. Use LLM-assisted drafting only with human approval on each row before it enters training.
Version datasets like code. Tag by policy era so you do not train on pre-GDPR wording. Automate deduplication — near-duplicate rows teach the model to memorize phrasing, not rules.
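Automated deduplication does not need heavy tooling to start. A minimal sketch follows, using the standard library's `difflib.SequenceMatcher` on whitespace- and case-normalized text. The 0.9 threshold is an assumption to tune on your own data, and the pairwise comparison is quadratic, so this suits thousands of rows rather than millions.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def drop_near_duplicates(rows: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each row; drop later rows whose
    similarity to any kept row reaches the threshold."""
    kept: list[str] = []
    kept_normalized: list[str] = []
    for row in rows:
        norm = normalize(row)
        is_duplicate = any(
            SequenceMatcher(None, norm, seen).ratio() >= threshold
            for seen in kept_normalized
        )
        if not is_duplicate:
            kept.append(row)
            kept_normalized.append(norm)
    return kept


rows = [
    "Refunds are allowed within 30 days.",
    "Refunds are  allowed within 30 days.",   # differs only by spacing
    "Escalate billing disputes to tier 2.",
]
print(drop_near_duplicates(rows))
```

Running this per policy-era tag, rather than across the whole dataset, keeps rows whose wording legitimately changed between eras.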
Signals you are ready for LoRA
| Signal | Threshold | Action |
|---|---|---|
| Held-out eval plateau | 3+ iterations | Consider LoRA on error class |
| Labeled pairs | 200+ reviewed | Pilot adapter |
| Policy change frequency | Weekly | Prefer RAG for facts |
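
The table above can be encoded as a checklist that runs next to your eval reports. This is a sketch with hypothetical names. The thresholds are copied from the table, and each row is evaluated independently rather than combined into one verdict.

```python
def readiness_actions(plateau_iterations: int,
                      reviewed_pairs: int,
                      facts_change_weekly: bool) -> list[str]:
    """Return the table's suggested action for every signal that fires."""
    actions = []
    if plateau_iterations >= 3:      # held-out eval plateau for 3+ iterations
        actions.append("consider LoRA on error class")
    if reviewed_pairs >= 200:        # 200+ reviewed labeled pairs
        actions.append("pilot adapter")
    if facts_change_weekly:          # weekly policy changes
        actions.append("prefer RAG for facts")
    return actions


print(readiness_actions(plateau_iterations=3, reviewed_pairs=250, facts_change_weekly=False))
```

Note that "pilot adapter" and "prefer RAG for facts" can both fire at once. That matches the split between tuning for behavior and retrieving for facts.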
Frequently Asked Questions
- LoRA often needs 500–5,000 quality pairs; start with 50–100 gold examples.
- RAG is usually the right choice for facts, fine-tuning for behavior.
- Retrain when behavior or policy changes, and plan for at least quarterly retraining in production.