Fine-tuning large language models has become significantly more accessible, but 'accessible' doesn't mean 'always advisable.' After working with dozens of teams on fine-tuning projects, I've developed a framework for when fine-tuning makes sense — and when you should stick with prompting, RAG, or a combination of both.

The Decision Framework

Fine-tune when: (1) you need consistent output formatting that prompting can't reliably achieve, (2) you have domain-specific knowledge that's poorly represented in the base model, (3) you need to reduce latency by encoding behavior in weights rather than long system prompts, or (4) you need to reduce costs by using a smaller fine-tuned model instead of a larger general one.

Don't fine-tune when: (1) your data changes frequently (RAG is better), (2) you need transparency into the model's reasoning (fine-tuning is a black box), (3) you have fewer than 500 high-quality examples, or (4) prompt engineering with a capable base model achieves 90%+ of your target quality.
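To make the trade-offs concrete, the criteria above can be sketched as a simple checklist function. The field names and the 0.9 quality threshold are illustrative stand-ins for the judgments described above, not a formal rubric — treat this as a thinking aid, not a decision engine.

```python
from dataclasses import dataclass

@dataclass
class UseCase:
    # Reasons to fine-tune (from the list above)
    needs_strict_formatting: bool      # prompting can't reliably hit the format
    has_underrepresented_domain: bool  # base model lacks the domain knowledge
    latency_or_cost_sensitive: bool    # long prompts / big models too slow or costly
    # Reasons not to
    data_changes_frequently: bool      # favors RAG over baked-in weights
    needs_transparent_reasoning: bool  # fine-tuned behavior is hard to inspect
    num_quality_examples: int          # curated, expert-reviewed examples
    prompt_baseline_quality: float     # 0.0-1.0, best quality from prompting alone

def should_finetune(uc: UseCase) -> bool:
    # Any "don't" condition vetoes fine-tuning outright.
    if uc.data_changes_frequently or uc.needs_transparent_reasoning:
        return False
    if uc.num_quality_examples < 500:
        return False
    if uc.prompt_baseline_quality >= 0.9:
        return False
    # Otherwise, at least one positive reason must apply.
    return (uc.needs_strict_formatting
            or uc.has_underrepresented_domain
            or uc.latency_or_cost_sensitive)
```

Note the veto structure: a single disqualifier (volatile data, opacity concerns, too few examples, or an already-good prompt baseline) outweighs any number of positive reasons.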

The Practical Steps

Data quality is everything. 500 excellent examples outperform 10,000 mediocre ones. Invest in data curation: have domain experts review every training example, remove ambiguous cases, and ensure consistent labeling. Use the base model to generate candidate outputs, then have humans edit them — this produces training data that's aligned with the model's natural distribution.
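A minimal sketch of that draft-with-the-model, edit-with-a-human loop, writing the common chat-style JSONL training format. The `generate` and `human_edit` callables are hypothetical stand-ins: the first would call your base model, the second would return the expert-corrected output, or None to drop an ambiguous case entirely, per the guidance above.

```python
import json

def build_training_file(prompts, generate, human_edit, path="train.jsonl"):
    """Assemble a fine-tuning dataset from model drafts + human edits.

    generate(prompt) -> model's candidate output
    human_edit(prompt, draft) -> corrected output, or None to discard
    Returns the number of examples kept.
    """
    kept = 0
    with open(path, "w") as f:
        for prompt in prompts:
            draft = generate(prompt)           # model's candidate output
            final = human_edit(prompt, draft)  # expert review / correction
            if final is None:                  # ambiguous case -> remove it
                continue
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": final},
            ]}
            f.write(json.dumps(record) + "\n")
            kept += 1
    return kept
```

Because the assistant turns start as model drafts, the resulting targets stay close to the model's natural distribution, which is exactly the property the human-edits-model-output approach is after.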

Cost Reality

Fine-tuning a 7B-parameter model typically costs $50-200 in compute. Serving that fine-tuned 7B model instead of prompting a 70B general-purpose model can then cut per-inference cost by 10-20x. For high-volume use cases, the ROI is clear within weeks.
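The break-even arithmetic is simple: divide the one-time fine-tuning cost by the per-request savings. The per-request prices below are illustrative assumptions (not vendor quotes) chosen to match a 10x savings factor; plug in your own numbers.

```python
def breakeven_requests(finetune_cost, cost_per_req_large, cost_per_req_small):
    """Requests needed before one-time fine-tuning cost pays for itself."""
    return finetune_cost / (cost_per_req_large - cost_per_req_small)

# Assumed prices: $0.002/request on the large general model vs
# $0.0002/request on the fine-tuned 7B (a 10x reduction), against the
# high end of the $50-200 fine-tuning cost above.
reqs = breakeven_requests(200.0, 0.002, 0.0002)  # ≈ 111,111 requests
```

At, say, 10,000 requests per day, that break-even point arrives in under two weeks, which is where the "ROI within weeks" claim comes from.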