Do I need on-prem GPUs, or can self-hosted live in the cloud?

Cloud is fine, as long as it is a tenancy the customer controls (their VPC on Azure, AWS, or GCP). "Self-hosted" in this engagement means the model runs on infrastructure the customer pays for and inspects — not necessarily on a metal rack in a closet. On-prem GPU is the right answer when the data cannot leave the perimeter; cloud VPC is the right answer when the cloud is already the platform.

How do you decide when to retrain?

A written trigger policy with two conditions: corpus drift (the distance between the embedding distributions of the current corpus and the training corpus crosses a measured threshold) and eval-score drift (the production model's score on the customer's actual tasks degrades past a configured threshold). Retraining is gated by the eval harness: a new model is promoted only if it beats the production model on the customer's tasks, not on a public benchmark.
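As a rough sketch, the two triggers and the promotion gate could be written down as code rather than prose. Every name and threshold here (RetrainPolicy, should_retrain, should_promote, the numeric values) is a hypothetical illustration, not the policy from any specific engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    """Hypothetical written policy; real thresholds come from measurement."""
    corpus_drift_threshold: float       # max tolerated corpus drift score
    eval_drop_threshold: float          # max tolerated drop from the baseline eval score
    min_promotion_margin: float = 0.0   # candidate must beat production by more than this


def should_retrain(policy, corpus_drift, baseline_eval, current_eval):
    """Return the list of trigger conditions that fired (empty means no retrain)."""
    reasons = []
    if corpus_drift > policy.corpus_drift_threshold:
        reasons.append("corpus_drift")
    if baseline_eval - current_eval > policy.eval_drop_threshold:
        reasons.append("eval_score_drift")
    return reasons


def should_promote(policy, candidate_eval, production_eval):
    """Eval gate: both scores are measured on the customer's own task set."""
    return candidate_eval - production_eval > policy.min_promotion_margin
```

Returning the list of fired conditions, rather than a bare boolean, lets the alert say why a retrain was requested.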

What does "predictable cost" mean in practice?

Inference cost is dominated by GPU-hours, not by per-token charges. With autoscaling bounded by a cost ceiling and traffic monitored against a forecast, the monthly bill becomes a known range rather than a surprise. The dashboard reports cost in dollars per thousand conversations, so finance can reconcile against the customer-volume forecast directly.
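The arithmetic behind that dashboard figure is simple enough to sketch. The function names, the flat hourly GPU rate, and the 730-hour month are assumptions made for illustration; real billing also carries storage, networking, and reserved-capacity terms.

```python
def cost_per_thousand_conversations(gpu_hours, dollars_per_gpu_hour, conversations):
    """GPU spend normalized to dollars per 1,000 conversations."""
    if conversations <= 0:
        raise ValueError("conversations must be positive")
    return gpu_hours * dollars_per_gpu_hour * 1000 / conversations


def monthly_cost_range(min_gpus, max_gpus, dollars_per_gpu_hour, hours_in_month=730):
    """Known range for the monthly bill when autoscaling stays within [min_gpus, max_gpus]."""
    low = min_gpus * dollars_per_gpu_hour * hours_in_month
    high = max_gpus * dollars_per_gpu_hour * hours_in_month
    return low, high
```

For example, 1,440 GPU-hours at an assumed $2.50 per GPU-hour across 120,000 conversations works out to $30 per thousand conversations, and a fleet bounded between 2 and 6 GPUs at that rate lands between $3,650 and $10,950 a month.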

Self-Hosted Model Operations

Production operations for a self-hosted model — keeping it running, keeping it accurate, keeping the bill predictable. Real production, not a notebook.

This is the operations engagement. The model is trained and deployed; what is missing is the production discipline that keeps it running through traffic spikes, drift, hardware failures, and the inevitable Sunday afternoon when a customer asks why the bot just refused fifty consecutive queries. The deliverable is the operations layer — inference serving, observability, drift monitoring, retraining triggers — that turns a model on a GPU into a production service.

The buyer for this page is a CTO or platform leader who has the model in production (or near it) and needs the operations layer to look like a real production system. Escaping the per-token tax does not make inference free; self-hosted inference still needs the runbook, the dashboards, and the alerts that any other production service needs. The parent page, Customer Service AI Models, describes the model itself; this page is about keeping it running.

Deployment topology

On-prem GPU clusters where the buyer owns the hardware. Customer VPC on Azure, AWS, or GCP where the cloud is the platform. Hybrid where the inference lives on prem and the orchestration lives in the cloud. The topology is scoped against the customer's latency target, throughput profile, redundancy requirements, and existing platform discipline.

Inference serving

vLLM, TGI, TensorRT-LLM, or Ollama depending on the model architecture, the hardware, and the latency budget. Batching strategy tuned to the traffic shape — interactive web chat wants low p99 latency, batched email response generation wants throughput. The serving layer is monitored, the autoscaling is bounded, and the cost ceiling is enforced.
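One way to read "bounded autoscaling with an enforced cost ceiling" is as a clamp on the replica count. The sketch below is not tied to any serving framework's autoscaler. The function name, the queue-depth signal, and the per-replica capacity figure are all assumptions.

```python
import math


def target_replicas(queue_depth, per_replica_capacity, min_replicas, max_replicas,
                    dollars_per_replica_hour, hourly_cost_ceiling):
    """Replica count that follows demand but never exceeds the cost ceiling."""
    affordable = math.floor(hourly_cost_ceiling / dollars_per_replica_hour)
    if affordable < min_replicas:
        # The policy itself is inconsistent: the floor already breaks the ceiling.
        raise ValueError("cost ceiling cannot pay for min_replicas")
    demand = math.ceil(queue_depth / per_replica_capacity)
    upper = min(max_replicas, affordable)
    return max(min_replicas, min(demand, upper))
```

When demand exceeds what the ceiling allows, the function returns the capped count. The excess traffic then has to queue, shed, or fall back, and that is the point at which an alert should page someone.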

Observability and cost

Request- and token-level metrics: prompt tokens, completion tokens, latency percentiles (p50, p95, p99), refusal rates, fallback triggers, cache hit rate. Cost dashboards in dollars per thousand conversations, not in vendor-API tokens. Alerting that pages a human on drift, not just on outages.
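To make those metrics concrete, here is a minimal sketch that summarizes a window of request records. The record fields (latency_ms, refused, prompt_tokens, completion_tokens) are a hypothetical log schema, and the nearest-rank percentile is one common definition among several; a production stack would normally lean on its metrics backend for this.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """Aggregate one window of request records (hypothetical schema) into dashboard figures."""
    latencies = [r["latency_ms"] for r in requests]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "refusal_rate": sum(1 for r in requests if r["refused"]) / len(requests),
        "prompt_tokens": sum(r["prompt_tokens"] for r in requests),
        "completion_tokens": sum(r["completion_tokens"] for r in requests),
    }
```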

Retraining cadence

Written trigger policy. Corpus drift threshold (the distance between the embedding distributions of the current corpus and the training corpus passes a measured threshold). Eval-gate criteria (the new model must beat the production model on the customer's actual tasks, not on a public benchmark, before promotion). The retraining run is reproducible from the same pipeline the original training run used.
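The text above does not name a drift metric, so the following is only one simple possibility: the cosine distance between the mean embedding of the training corpus and the mean embedding of the current corpus. The function names are illustrative. Richer measures, such as population-level statistical distances, may suit a given corpus better.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-length centroid")
    return 1.0 - dot / (norm_a * norm_b)


def corpus_drift(training_embeddings, current_embeddings):
    """Assumed drift score: distance between the two corpus centroids."""
    return cosine_distance(centroid(training_embeddings), centroid(current_embeddings))
```

The threshold this score is compared against is the "measured" part: it would be calibrated from historical windows in which the model was known to be healthy.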

What I ship

Inference serving stack. Tuned to the customer's latency and throughput profile.
Observability dashboards. Token-level metrics, latency percentiles, refusal rates, cost in dollars per thousand conversations.
Alerting. Drift, outage, cost-ceiling breach, refusal-rate spike — paging a human, not just lighting up a dashboard.
Retraining-trigger policy. Written, with the drift threshold and the eval gate named.
Cost ceilings and rate limits. So a runaway loop or a viral mention does not burn the quarter.
Runbook. A second engineer can run the system without a meeting.
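As an illustration of the rate-limit item, a token bucket is one common way to stop a runaway loop from consuming unbounded inference capacity. It is shown here as a sketch, not as the specific mechanism delivered. The class name and parameters are assumptions, and the clock is passed in explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.last = now

    def allow(self, now, cost=1.0):
        """Return True and spend `cost` tokens if the request fits, else False."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Setting cost to the expected token count of a request, instead of 1 per request, would make the same bucket act as a rough spend limiter.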

Where it fits

From notebook to production

The model is running on a single GPU under a Streamlit demo a data scientist set up. It works for the demo. It will not survive the traffic when the marketing team turns on the announcement. I engineer the production deployment — serving stack, observability, autoscaling, cost ceilings, alerting — so the model can be put in front of real customers.

Predictable cost

The CFO wants the cost of inference to be a known number per month, not a surprise on the cloud bill. I instrument the cost dashboard in dollars per thousand conversations, enforce the cost ceiling, and provide the cost-attribution view the finance team can reconcile against the customer-volume forecast.

Drift is happening, no one is measuring it

The model was good in February. It is March. The product team shipped a new SKU line, the support team adopted new vocabulary, and the refusal rate is creeping up. I wire drift monitoring against the corpus and the customer's actual KPIs, with alerts that fire before the dashboard turns red — not after.

How I work

Discovery scopes the deployment target, the latency and throughput profile, the cost ceiling, the existing observability stack, and the retraining requirements. Implementation runs in two-week sprints with the serving stack live at the end of sprint one and observability layered in over sprints two and three. The principal carrying the work is described on the about page.

Engagement model

Discovery runs two weeks. Implementation runs four to ten weeks depending on the deployment topology and existing platform maturity. Handover includes the serving stack, the dashboards, the alerting, the retraining policy, the runbook, and a 30-day support window. Retainer arrangements exist for organizations running the model continuously and needing ongoing drift triage. To scope an operations engagement, get in touch.