Organizations that want to fine-tune large language models for specialized advisory tasks often face hardware constraints. Free-tier GPUs on platforms such as Kaggle and Colab cap session time, which makes multi-epoch runs difficult. A new arXiv paper by Md Millat Hosen addresses this with a practical adapter-handoff recipe, and it also delivers a cautionary finding about the reliability of synthetic training data.
The Adapter-Handoff Recipe
The paper, titled "Fine-Tuning a 7B Advisor on Free-Tier GPUs: An Adapter-Handoff Recipe and a Synthetic-Data Reliability Caution," describes a three-epoch QLoRA fine-tune of Mistral-7B-Instruct-v0.3 (4-bit NF4, LoRA rank 16, using Unsloth). The training was completed across two free-tier 16 GB GPUs: a Tesla P100 first, then a T4. By checkpointing only the small LoRA adapter (41.9 million parameters), the fine-tune could resume on the second machine without transferring optimizer or scheduler state. According to the paper, adapter-only handoff is sufficient, meaning the binding constraint is per-step VRAM and per-session wall-clock time, not aggregate compute.
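The 41.9 million figure can be sanity-checked with a little arithmetic. The sketch below assumes LoRA adapters are attached to all seven linear projections in each Mistral-7B transformer block; that is an assumption, since the summary above does not list the target modules. The layer shapes are the published Mistral-7B dimensions. The size helper is also illustrative: the bytes-per-parameter value is a free choice, and file-format overhead is ignored.

```python
# Rough LoRA adapter parameter count for Mistral-7B at rank 16.
# ASSUMPTION: adapters on all seven projections per block
# (q, k, v, o, gate, up, down); the article does not state the targets.

HIDDEN = 4096          # model (residual stream) width
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP inner width
NUM_LAYERS = 32

# (in_features, out_features) of each adapted linear layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, projections=PROJECTIONS, num_layers=NUM_LAYERS):
    """LoRA adds A (rank x in) and B (out x rank) per layer: rank * (in + out)."""
    per_block = sum(rank * (n_in + n_out) for n_in, n_out in projections.values())
    return per_block * num_layers


def adapter_megabytes(n_params, bytes_per_param=4):
    """Approximate checkpoint size in MB (10^6 bytes), ignoring file overhead."""
    return n_params * bytes_per_param / 1e6


if __name__ == "__main__":
    n = lora_param_count(rank=16)
    print(f"trainable LoRA parameters: {n:,}")
    print(f"approx. size at 4 bytes/param: {adapter_megabytes(n):.0f} MB")
```

Under these assumptions the count comes to 41,943,040, consistent with the reported 41.9 million. Even at 4 bytes per parameter that is roughly 168 MB, which helps explain why the adapter alone is easy to carry from one session to the next.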
Evaluation Results: Quality vs. Data Fidelity
In a blind held-out comparison against the un-fine-tuned base model, the fine-tuned model's BERTScore F1 rose by 0.063, indicating closer similarity to the synthetic training distribution. The paper stresses that this is a fidelity signal, not a quality signal. A blind LLM-as-judge evaluation preferred the base model on 46% of prompts and the fine-tuned model on only 18%. In addition, a source-verified factuality audit uncovered four confident errors from the fine-tuned model on policy-sensitive topics, while the base model made none.
| Metric | Base Model | Fine-Tuned Model |
|---|---|---|
| BERTScore F1 (vs. synthetic training distribution) | Baseline | +0.063 (higher) |
| Blind LLM-as-judge preference (% of prompts) | 46% | 18% |
| Confident errors in factuality audit (policy-sensitive topics) | 0 | 4 |
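Note that the two preference shares sum to 64%, so about 36% of prompts fall into neither column; the figures above do not say how those were scored. As a minimal sketch of how such pairwise shares are typically tallied, the snippet below counts per-prompt verdicts. The verdict labels, the `tie` category, and the toy data are all hypothetical and are not taken from the paper's evaluation harness.

```python
from collections import Counter

# HYPOTHETICAL verdict labels; the paper's harness may use different ones.
LABELS = ("base", "fine_tuned", "tie")


def preference_shares(verdicts):
    """Return the fraction of prompts on which each label was the judge's verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    unknown = set(verdicts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    counts = Counter(verdicts)
    return {label: counts[label] / len(verdicts) for label in LABELS}


# Toy data for illustration only (10 prompts), NOT the paper's results.
toy_verdicts = ["base"] * 5 + ["fine_tuned"] * 2 + ["tie"] * 3
shares = preference_shares(toy_verdicts)

if __name__ == "__main__":
    for label, share in shares.items():
        print(f"{label}: {share:.0%}")
```

Keeping an explicit third category makes it obvious when the two headline percentages do not account for every prompt.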
Synthetic Data Reliability Concern
The paper traces these errors not to fine-tuning artifacts but to the training data itself. Each audited error was already present in the Gemini-generated training answers. A random-sample audit found verifiable errors in a sizable fraction of responses: 28-40% (single-judge, n=40). The author attributes the performance drop to the synthetic-data pipeline, not to the adapter-handoff method. The dataset, adapter, cross-GPU notebooks, and full evaluation harness are released so the results can be reproduced on a single 16 GB GPU.
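With only 40 sampled responses, the audited error rate carries substantial sampling uncertainty. The sketch below computes a 95% Wilson score interval for a proportion. The counts 11/40 and 16/40 are assumptions back-derived from the reported 28-40% range, not figures quoted from the paper, and the Wilson interval is a standard choice made here for illustration rather than the paper's own method.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # ASSUMED counts: 11/40 is about 28%, 16/40 is 40%.
    for errors in (11, 16):
        lo, hi = wilson_interval(errors, 40)
        print(f"{errors}/40 -> 95% interval roughly {lo:.0%} to {hi:.0%}")
```

At this sample size each interval spans well over 20 percentage points, so the audit is best read as evidence that the error rate is substantial rather than as a precise estimate.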
Implications for Enterprise AI
For technology leaders considering low-cost fine-tuning of LLMs for specialized advisory roles (e.g., in supply chain or trade compliance), the paper offers a practical hardware-constrained recipe. However, the synthetic data reliability issue is a critical reminder: data quality must be verified independently, as errors in training data can propagate even with careful model optimization. The open-source release allows enterprises to audit and replicate the findings.