How big does my corpus need to be before fine-tuning makes sense?

For LoRA on a domain-conversational task, a few thousand high-quality examples is often enough to move eval scores meaningfully. For full fine-tuning, tens of thousands. The honest answer is that "enough" depends on the task — calibration runs on a representative slice tell us before the full training compute is committed.

Will the trained model run on commodity hardware, or do I need a GPU cluster?

It depends on the parameter count and the latency target. A 7B-class model with LoRA adapters runs on a single modern GPU. A 70B-class model wants multiple GPUs or quantization. The deployment target is scoped before the base model is chosen — I do not train a model the customer cannot afford to host.

Do I own the weights, or do you?

You do. The training engagement transfers ownership of the trained weights, the training pipeline code, and the eval harness to you. The base model carries its own license (some open-weights bases are commercial-use-friendly, some are not); I scope license compatibility in the base-model selection report before the first training run.

Self-Model Training & Fine-Tuning

End-to-end training pipeline for a customer-owned model — corpus to deployable artifact, with a published eval score before promotion.

This is the training engagement. The deliverable is a customer-owned model — base architecture selected for the deployment target, fine-tuned on a corpus the customer prepared with me, packaged with a reproducible training pipeline, and shipped with an eval score against the customer's actual service tasks. The weights live with the customer. The training code lives in the customer's repository. The eval harness lives next to it.

The buyer for this page is a support or product leader who already has a corpus — helpdesk transcripts, knowledge base articles, recorded calls, product documentation — and has decided the model is strategic rather than a side feature. The decision matters. A wrapper is shipped in a week; a self-model takes a quarter. If the corpus does not yet exist or the strategic case has not been made, the wrapper engagement on the parent Customer Service AI Models page is the honest answer.

The pipeline, end to end

Corpus preparation

Extraction from helpdesk transcripts (Zendesk, Salesforce, Freshdesk, Intercom), knowledge bases (Confluence, Notion, SharePoint), recorded calls (transcribed with the customer's choice of pipeline), and product documentation. PII redaction with reviewable rules. Deduplication. Quality scoring so the training set is not weighted toward the noisiest tickets.

Base model selection

Open-weights families for self-hosting; license-compatible options for commercial use; parameter count chosen against the latency and hardware budget. The decision is technical fit, not brand preference. I scope this against the customer's deployment target — on-prem GPU pool, VPC, or hybrid — before any training compute is committed.

Fine-tuning approach

LoRA or QLoRA where the corpus size and the hardware budget justify them; full fine-tuning where the volume warrants it; distillation where the deployment target is edge inference. Reproducible training runs with versioned hyperparameters, so the run that produced the deployable weights can be replayed against a different base or a refreshed corpus.

Eval harness

Built against the customer's actual service tasks — answering a returning customer about a known ticket, summarizing a long thread for a human agent, escalating cleanly when the confidence threshold drops. Pre- and post-training scores published with the deliverable. The eval harness lives with the customer and is run on every retraining run, not just the first.

What I ship

Corpus extraction and preparation. Helpdesk, KB, calls, product docs — extracted, redacted, deduplicated, quality-scored.
Base-model selection report. Open-weights candidates, license review, parameter-count tradeoffs against the deployment target.
Fine-tuning pipeline. LoRA, QLoRA, full, or distillation — versioned hyperparameters, reproducible runs.
Eval harness against real customer tasks. Pre- and post-training scores published with the deliverable.
Retraining-trigger policy. Written, with corpus-drift threshold and eval gate.
Handover documentation and 30-day support. Runbook a second engineer can execute.

Where it fits

The corpus already exists

Three years of Zendesk. A Confluence KB. A library of recorded calls. The data is there; the pipeline that turns it into a trained model is what is missing.

The wrapper is hitting limits

The per-token bill climbs with ticket volume; the hallucination rate on domain vocabulary will not improve with more prompt engineering. The next step is a fine-tune.

Regulation requires in-perimeter inference

Healthcare, finance, defense-adjacent. The data cannot leave the compliance boundary. The model needs to live inside it.

How I work

Every engagement opens with a written training brief: corpus inventory, base-model candidates with tradeoffs, fine-tuning approach, eval-task definition, hardware and license constraints, and a phased plan. The brief ships before any training compute is committed. Calibration runs come first; the full run follows only when the calibration eval confirms the approach is sound. The principal carrying the work is described on the about page.

Engagement model

Discovery and corpus preparation run two weeks. Calibration fine-tune runs one to two weeks. Full training and eval-harness construction run three to six weeks. Handover runs one to two weeks. A 30-day support window follows. Wrapper engagements — when that is the honest scope — ship in days. The decision is named on day one.