The Dev Blog

Custom LLMs with Auto-Scaling on AWS, Azure & Google Cloud: Cut Inference Costs 60–80%

Mohan Appalabhaktula & Sai Savitha Rasmi · 28 min read

TL;DR

Problem: LLM inference workloads are GPU-bound, bursty, and have 3-5 minute cold starts. Naive horizontal scaling causes OOM crashes, not more throughput.

Approach: Queue-based async architecture (SQS / Service Bus / Pub/Sub) decouples client traffic from GPU workers. Scale workers on queue depth, not CPU. Scale to zero during off-hours.

Result: 60-80% cost reduction vs. frontier API pricing at 200K+ requests/month, with no OOM crashes and no user-visible 503 errors.

Key insight: The queue is not a performance optimization - it’s what makes the architecture correct. Without it, a traffic spike hits GPU memory directly and crashes workers. With it, the spike becomes latency, not downtime.

  • Self-hosting beats API pricing above ~200K requests/month (replacing GPT-4o tier)
  • Fine-tuned 8B models routinely beat GPT-4o-mini zero-shot on domain-specific tasks
  • Same architecture pattern works on AWS (SQS + ECS), Azure (Service Bus + AKS + KEDA), and GCP (Pub/Sub + GKE + KEDA)
  • Scale to zero off-hours: saves 40-70% of monthly GPU cost for business-hour workloads

When This Architecture Makes Sense

Before reading further, confirm this applies to your situation.

Use this architecture when:

  • Volume exceeds 50K inference requests/month and is growing
  • Your task is well-defined: classification, extraction, summarization, structured output
  • You have labeled data to fine-tune a small model (500+ examples minimum, 5K+ for good results)
  • Data privacy or compliance requirements prevent sending data to external APIs
  • Your team has at least one engineer who can own GPU infrastructure
  • Latency tolerance is >2 seconds (async queue adds latency, not removes it)

Avoid this architecture when:

  • You need frontier reasoning: complex multi-step tasks, open-ended generation, advanced coding
  • Volume is under 50K/month (engineering overhead isn’t worth it yet)
  • You need real-time streaming responses (queue pattern is async by design)
  • Speed to market matters more than cost right now
  • Your team has no ML ops or DevOps capacity to run GPU infrastructure

The honest math: If you’re using GPT-4o-mini and it handles your task, don’t self-host until volume exceeds 200K-500K/month. The cost difference doesn’t justify the operational overhead at low volume.


What Most Engineers Get Wrong

The most common mistake: scaling on the wrong metric.

When engineers first see GPU utilization at 98% and CPU at 20%, they write a CloudWatch alarm on CPU. The autoscaler sees nothing to do. The GPU worker OOM-crashes. Users get 503 errors. The autoscaler finally triggers - 3 minutes after the damage.

CPU and memory scaling are wrong signals for LLM workloads because:

  • GPU inference is GPU-bound, not CPU-bound. CPU is never the bottleneck.
  • OOM kills are not graceful. The process dies; the load balancer doesn’t know until the health check fails.
  • Cold start (3-5 min) is longer than any reasonable alarm evaluation window. By the time a new instance is ready, the burst has already caused failures.

The second most common mistake: no queue.

Engineers familiar with stateless HTTP services reach for a load balancer + autoscaling group. This works when requests are lightweight and fast. LLM inference is neither. Without a queue:

  • Burst traffic hits GPU memory directly
  • GPU memory fills → OOM kill → instance crash
  • Scale-up alarm triggers → new instance cold starts → repeat

The third mistake: aggressive scale-down.

Using the same short cooldown (60-90s) for scale-down as scale-up causes instance flapping. The queue empties, workers scale down, the next batch of requests causes another cold start cycle. Use a 10-minute scale-down cooldown minimum.

The fourth mistake: not accounting for KV cache memory in VRAM budget calculations.

Engineers size their GPU instances for model weights and forget that vLLM allocates a large block of GPU memory for KV cache at startup. A Llama 3 8B model in FP16 uses ~16GB of VRAM for weights. But vLLM with --gpu-memory-utilization 0.90 on a 24GB A10G will allocate the remaining ~6GB to KV cache — meaning batch size and context length are both constrained by that remaining pool. If you’re running long-context requests (8K+ tokens) with large batch sizes, the KV cache fills and vLLM starts rejecting requests before you hit any CPU or network limit. Budget VRAM as: model weights + KV cache at your max concurrent request count + operating headroom. Instrument vllm:gpu_cache_usage_perc from day one.
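That budget is easy to sanity-check on paper. The sketch below uses Llama 3 8B's published dimensions (32 layers, 8 KV heads via GQA, head dim 128); the 1 GB operating headroom is an assumption to tune:

```python
# Back-of-envelope VRAM budget for Llama 3 8B on a 24 GB A10G.
# Per-token KV cache in FP16: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes.
def kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_kv_tokens(vram_gb, weights_gb, headroom_gb=1.0):
    """Total tokens of KV cache (shared by ALL concurrent sequences) that fit
    after model weights and operating headroom. headroom_gb is an assumption."""
    free_bytes = (vram_gb - weights_gb - headroom_gb) * 1024**3
    return int(free_bytes // kv_cache_bytes_per_token())

print(kv_cache_bytes_per_token())   # 131072 bytes = 128 KiB per cached token
print(max_kv_tokens(24, 16))        # 57344 tokens across every in-flight request
```

At 128 KiB per token, eight concurrent 8K-context requests alone consume the entire pool - which is exactly why long-context batches hit the KV cache wall long before any CPU limit.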

The fifth mistake: treating fine-tuning as a one-time project rather than a continuous loop.

Teams invest in a fine-tuning sprint, ship the model, and move on. Six months later, the model is underperforming on new request patterns that weren’t in the original training set, and nobody has been collecting labeled data because the process wasn’t operationalized. Fine-tuning is not a milestone — it’s a maintenance contract. The architecture in this post logs every request/response pair to S3 precisely because that data is the raw material for the next training run. Schedule a monthly review cycle: pull a random sample, label it, compute accuracy against your held-out test set, retrain if drift exceeds threshold. Treat model performance regression the same way you treat service latency regression — it’s an operational metric, not a project outcome.
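A minimal sketch of that monthly check - sample size and the 3-point drift budget are assumptions to tune against your own baseline:

```python
import random

def sample_for_labeling(request_log, k=200, seed=None):
    """Pull a random sample of logged requests for human labeling."""
    return random.Random(seed).sample(request_log, min(k, len(request_log)))

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def should_retrain(current_acc, baseline_acc, max_drop=0.03):
    """Treat accuracy drift like a latency SLO: past the threshold, retrain."""
    return (baseline_acc - current_acc) > max_drop

print(should_retrain(0.89, 0.94))   # True - 5 points of drift exceeds the 3-point budget
```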


Introduction: Why LLM Scaling Is Fundamentally Different

It’s 2am. Your on-call engineer gets paged: GPU workers are down, 40,000 requests are timing out across the queue, and the auto-scaler is spinning up replacements that crash before they finish loading weights. The incident isn’t a bug in your model — it’s the architecture. You built the compute layer correctly and forgot to build the buffer in front of it.

Your LLM API bill hit $2,000 last month. You’re running 200K classification requests against GPT-4o because your product manager insisted on “the best model.” Half those requests are structured extraction tasks a fine-tuned 8B model would handle at 94% accuracy for $200/month.

This is the scaling problem most engineering teams hit around month 3 of production. And it’s not just a cost problem - it’s an architectural one.

Why LLMs don’t scale like normal services:

  • A typical REST microservice handles more load by adding more instances behind a load balancer. The math is linear.
  • An LLM inference server loads a multi-gigabyte model into GPU VRAM on startup. That process takes 3-5 minutes and cannot be interrupted.
  • GPU memory is finite and non-elastic. When it fills, the process doesn’t slow down gracefully - it crashes with an OOM kill.
  • Unlike stateless HTTP handlers, each GPU worker has a fixed throughput ceiling determined by model size, VRAM, and batch configuration - not by CPU or network.

Naive horizontal scaling of GPU instances - the approach that works for every other service - causes OOM crashes and cascading failures under load. The solution is a fundamentally different architecture: async queue-based processing with queue-depth-driven auto-scaling.

This post is the engineering guide for that architecture. We cover when self-hosting wins, the full queue-based design on AWS, Azure, and GCP, real cost math, production pitfalls, and a FAQ based on common implementation questions.

Upfront caveat: Self-hosting has real engineering overhead. If you’re under 50K requests/month and don’t have ML ops experience, this probably isn’t for you yet. We’ll tell you exactly when it is.


Why Naive Auto-Scaling Fails for LLM Workloads

Before building the right architecture, understand exactly why the obvious approach breaks - and why it breaks badly.

The naive approach: CPU-based auto-scaling

Most engineers’ first instinct is to put a load balancer in front of GPU instances and auto-scale on CPU or memory thresholds. Here’s what actually happens:

  • GPU VRAM is not elastic. A Llama 3 8B model loads ~16GB into VRAM at startup. That memory is locked for the lifetime of the process. When requests arrive faster than the model can process them, there’s nowhere for the overflow to go.
  • Cold start takes 3-5 minutes, not seconds. CloudWatch triggers the scale-up alarm. AWS provisions the EC2 instance. Docker pulls the image. vLLM loads weights, allocates KV cache, warms CUDA kernels. By the time the instance is ready, the traffic burst has already caused timeouts on the existing instances.
  • OOM kills are silent to the load balancer. When GPU memory fills, the vLLM process dies with an OOM signal - no HTTP error, no graceful drain. The load balancer’s health check sees the port is gone and marks the instance unhealthy - but the auto-scaling policy may spin up a replacement and repeat the cycle.
  • CPU scaling metric is the wrong signal. GPU inference is GPU-bound, not CPU-bound. GPU utilization at 98% and CPU at 20% means your scaling policy sees “nothing to do” while your model is actually at capacity.

Why CPU vs. Queue scaling produces different failure modes

The core difference is when the system knows it’s overloaded:

  • CPU scaling: learns about overload after OOM crashes have already occurred. Response: crash, recover, repeat.
  • Queue-depth scaling: knows about overload before any GPU worker touches the request. Response: buffer the request, add capacity, drain the backlog.

CPU scaling timeline:
  Traffic spike → instances overwhelmed → OOM crash → 503 errors to users → alarm triggers
  → new instance cold start (3-5 min) → still OOM → more crashes

Queue scaling timeline:
  Traffic spike → requests queue in SQS → alarm triggers → new instance cold start
  → workers drain queue → users get results (with latency, not errors)

Queuing latency is observable, configurable, and recoverable. OOM crashes are none of those.

Scaling strategy comparison

| Approach | Handles traffic spikes | Cost at idle | Cold start impact | Failure mode | Production-safe |
|---|---|---|---|---|---|
| CPU/memory-based scaling | No - OOM before scaling | Medium | Fatal - instance dies first | Hard crash | No |
| Direct GPU scaling (no queue) | No - race conditions | High | Fatal | Hard crash | No |
| Queue + GPU scaling (SQS/KEDA) | Yes - queue absorbs burst | Low (scale to zero) | Absorbed - queue buffers during startup | Queuing latency | Yes |
| Queue + reserved GPU (min=1) | Yes | High (always on) | None - instance pre-warmed | Queuing latency only | Yes (expensive) |

Queue-based scaling is the only approach that handles the combination of slow cold starts, expensive idle cost, and bursty traffic that LLM workloads produce.


Alternatives Considered Before This Architecture

Every team that lands on queue-based scaling has usually tried at least one of these first. Here is why each was rejected.

Direct GPU scaling with a load balancer. Put an ALB in front of a GPU Auto Scaling group and scale on CPU or request count. This is the stateless HTTP playbook and it breaks immediately on LLM workloads. GPU VRAM is not elastic — the load balancer routes a burst at the instance, the instance OOM-crashes, and the health check failure triggers a replacement that also crashes during its 3-minute startup window. The load balancer never absorbs pressure; it amplifies it.

Kubernetes with HPA on CPU or memory. Horizontal Pod Autoscaler on CPU looks sensible until you remember that GPU inference is GPU-bound, not CPU-bound. CPU stays low while GPU saturates. HPA sees nothing to scale. When GPU workers OOM-kill, HPA eventually fires on error rate — a lagging signal that arrives 3–5 minutes after the damage. KEDA with a queue-depth trigger fixes this, which is why the Azure and GCP implementations in this post use KEDA, not HPA.

Serverless GPU (AWS Lambda with GPU, Replicate, Modal). Managed serverless GPU inference eliminates ops burden and handles cold starts for you. The cost model flips: you pay per invocation instead of per hour. At low volume this is ideal. At 500K requests/month on a 400-token workload, serverless GPU pricing runs 3–8× higher than a self-managed spot cluster. We evaluated this option at 300K requests/month and found it cost more than the API we were trying to replace.

Managed inference APIs (OpenAI, Anthropic, Together AI, Fireworks). The correct answer at low volume and for frontier-class reasoning tasks. Rejected here specifically because (a) volume exceeded the cost threshold where self-hosting breaks even, (b) the workload — domain classification and extraction — is exactly where a fine-tuned 8B model beats a general-purpose frontier model, and (c) data privacy requirements prevented sending production data off-premises.

| Alternative | Why rejected |
|---|---|
| ALB + GPU ASG (CPU scaling) | OOM crashes before autoscaler fires; lagging signal |
| Kubernetes HPA on CPU/memory | Wrong metric; GPU-bound workloads never trigger CPU HPA |
| Serverless GPU | 3–8× cost at production volume; per-invocation pricing doesn’t amortize |
| Managed inference APIs | Linear cost scaling; privacy constraint; fine-tuned accuracy advantage |

When to Self-Host vs. Use LLM APIs: Decision Framework

Before touching infrastructure, answer these questions honestly:

| Signal | Recommendation |
|---|---|
| >50K inference requests/month | Self-hosting ROI turns positive |
| <10K requests/month | Stick with APIs - engineering overhead isn’t worth it |
| 10K–50K requests/month | Gray zone - depends on complexity and team capacity |
| HIPAA / SOC2 / GDPR data privacy required | Self-hosting strongly favored |
| Need fine-tuned models on proprietary data | Self-hosting essential |
| Need frontier reasoning (GPT-4o / Claude Opus tier) | Stick with APIs - open models still lag here |
| Latency <200ms hard requirement | Self-hosting with dedicated GPU, no cold starts |
| Small team, no ML ops experience | APIs until team matures |

The most important thing to understand: self-hosting a 7B model does not replace GPT-4o. It replaces the 80% of your workload that doesn’t need frontier-level reasoning - classification, extraction, summarization, reformatting, routing. For that 80%, a fine-tuned open-weight model often beats a larger API model on your specific domain while costing 10× less.


Production Architecture: Queue-Based LLM Auto-Scaling

The system has five layers. Each one exists for a specific reason - skip any of them and you’ll discover why at 3am.

High-level architecture

Client Request
      |
      v
[API Gateway] --> [Lambda]
                     |
                     v
              [SQS Queue]  <-- absorbs traffic spikes
                     |
                     v (queue depth metric)
         [CloudWatch Alarm]
                     |
                     v
      [ECS Auto-Scaling Policy]
                     |
                +----+----+
                |         |
                v         v
         [GPU Worker] [GPU Worker]  (scale 0 -> N)
           vLLM + Llama 3 8B
                |         |
                +----+----+
                     |
              +------+------+
              |             |
              v             v
     [ElastiCache]    [RDS PostgreSQL]
       Redis cache      Audit logs

Data flow: Client sends request -> Lambda validates and enqueues -> SQS holds it -> GPU workers pull at capacity -> results written to DynamoDB -> client polls for result.

The critical insight: clients and GPU workers are fully decoupled. A client can submit a request while all GPU workers are cold. The request waits in SQS. Workers spin up. Request gets processed. No timeouts, no 503s.

The layering is deliberate: each layer has a single contract with the layers adjacent to it and no knowledge of anything else. The API Gateway and Lambda layer’s contract is “accept a request, return a job ID, guarantee enqueueing.” The queue’s contract is “hold messages durably until a worker is ready, re-deliver on failure.” The worker layer’s contract is “pull one message at a time, process it, write the result, delete the message.” The storage layer’s contract is “hold results until the client retrieves them.” Break any of these contracts — by letting workers receive pushed requests, by skipping the queue for “fast” requests, by having the API layer wait synchronously on a worker — and you’ve reintroduced tight coupling that will fail under load.
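The worker contract above - pull one, process, write, then delete - can be sketched with an in-memory queue standing in for SQS (the fake queue and names are illustrative, not the boto3 API):

```python
import collections

def drain(queue, results, infer):
    """Pull one message at a time; delete only after the result is written.
    With real SQS, 'delete' is sqs.delete_message after a successful run,
    and a failure simply lets the visibility timeout expire for redelivery."""
    processed = 0
    while queue:
        msg = queue[0]                      # receive (visibility timeout starts)
        try:
            results[msg["job_id"]] = infer(msg["prompt"])
            queue.popleft()                 # delete: the result is durably recorded
            processed += 1
        except Exception:
            break                           # leave the message for redelivery
    return processed

q = collections.deque([{"job_id": "j1", "prompt": "classify: refund request"}])
out = {}
drain(q, out, infer=lambda p: "billing")
print(out)                                  # {'j1': 'billing'}
```

Note the ordering: the result is written before the message is deleted. Reverse it and a crash between the two steps silently loses a job.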

AWS Self-Hosted LLM Architecture

API Gateway
Auth + validation
Lambda
Route + job ID
SQS Queue
Absorbs traffic spikes · Dead-letter queue for failures · Decouples clients from GPU workers
ECS + GPU Auto-Scaling
g5.xlarge / g4dn.xlarge · vLLM serving Llama 3 8B · Scale 0→N on queue depth
ElastiCache
Redis prompt cache · 15–30% hit rate
+
RDS PostgreSQL
Audit logs · cost attribution

SQS is the critical buffer - without it, traffic spikes hit GPU directly and cause OOM crashes.

Why Each Component

API Gateway + Lambda validates requests, handles auth, and returns async job IDs. Keeps the interface clean - consumers don’t need to know whether inference takes 200ms or 2 seconds.

SQS Queue is the most important component most people skip. Without a queue, a traffic spike hits your GPU instances directly. GPU memory fills up, processes OOM, instance crashes. With SQS, the queue absorbs the burst and workers pull at a sustainable pace. It also gives you a dead-letter queue for failed inferences automatically.

ECS with GPU Auto-Scaling is the core compute. ECS tasks run your vLLM container on GPU instances. The auto-scaling policy watches SQS queue depth - when messages pile up, ECS spins up more GPU tasks. When the queue drains, it scales back down. Critically: scale to zero when idle (nights/weekends) is the single biggest cost lever.

ElastiCache (Redis) caches responses for repeated prompts. In production classification and support workloads, 15–30% of prompts are duplicates or near-duplicates. Cache those, and your effective GPU throughput jumps significantly.
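A sketch of the lookup path, with a dict standing in for Redis - the key scheme and 24-hour TTL are assumptions:

```python
import hashlib, json

def cache_key(model, system_prompt, user_input):
    """Deterministic key: identical model + prompts hash to the same entry."""
    payload = json.dumps([model, system_prompt, user_input], sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_infer(cache, model, system_prompt, user_input, infer):
    key = cache_key(model, system_prompt, user_input)
    if key in cache:                     # real Redis: cache.get(key)
        return cache[key], True          # cache hit - zero GPU time
    result = infer(user_input)
    cache[key] = result                  # real Redis: cache.set(key, result, ex=86400)
    return result, False

cache = {}
cached_infer(cache, "llama3-8b", "Classify the ticket.", "refund please", lambda x: "billing")
result, hit = cached_infer(cache, "llama3-8b", "Classify the ticket.", "refund please", lambda x: "billing")
print(hit)   # True - the second identical request never reaches the GPU
```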

RDS PostgreSQL stores request/response logs. Not optional - you need this for cost attribution per team/feature, debugging, and collecting fine-tuning data.


Why This Architecture Works: The Reasoning Behind Each Decision

Most architecture guides tell you what to build. This explains why each component earns its place.

The router (API Gateway + Lambda) exists to enforce contract boundaries. Clients get a synchronous response: a job ID and a status URL. They never interact with GPU workers directly. This separation means you can change your GPU infrastructure - scale it, replace it, move clouds - without touching client code. The Lambda also enforces input validation and token limits before anything hits the queue.

The queue (SQS / Service Bus / Pub/Sub) converts traffic spikes into latency. This is the core insight. Without the queue, a spike in client requests creates backpressure directly on GPU memory. GPU memory is not elastic - it either fits or it doesn’t. When it doesn’t fit, the process crashes. With a queue between clients and GPUs, the spike becomes a longer queue, not a crashed worker. Latency increases; error rate stays at zero.

The workers (ECS + vLLM) pull at their own pace. Workers pull from the queue rather than receiving pushed requests. This means each worker only takes on as much work as it can handle. If a worker is processing a 2,000-token prompt, it simply doesn’t pull the next message until it’s done. No overflow, no OOM.

Queue-depth scaling predicts load; CPU scaling reacts to failure. When the queue grows, you know more capacity is needed before anyone OOM-crashes. When CPU spikes, the failure is already happening. Queue depth is a leading indicator; CPU is a lagging one.

Scale-to-zero is viable because the queue acts as the buffer during cold start. Without a queue, scaling to zero means the first request after a quiet period has no worker and gets a 503. With a queue, the first request after a quiet period sits in the queue for 3-5 minutes while a worker cold-starts. The user sees latency, not an error. That’s a tolerable trade-off for a workload that processes asynchronously.

The cache (Redis) works because LLM workloads are more repetitive than they appear. Classification and extraction tasks reuse the same system prompt with different inputs. Near-identical tickets, listings, and documents produce the same outputs. A 20% cache hit rate on 100K requests/day is 20K GPU inferences saved - roughly 3-4 hours of GPU time at no cost.


Architecture Tradeoffs: What This Design Gives Up

No architecture is free. This one trades specific things for the reliability and cost properties it delivers. Know these before you commit.

Async-only — no streaming. The queue decouples clients from workers, which is exactly what makes the architecture resilient. That same decoupling makes token streaming structurally impossible through SQS. When a client submits a request, they get a job ID, not a stream of tokens. For interactive chat, autocomplete, or anything where users watch tokens appear in real time, this architecture is the wrong choice. You need a direct WebSocket or gRPC connection to vLLM for those cases, which means a separate, always-warm serving path alongside the queue-based batch system.

Cold start latency is real and user-visible. Scale-to-zero is the biggest cost lever, but the first request after an idle period waits 3–5 minutes for a worker to start and load model weights. The EventBridge pre-warm schedule mitigates this for predictable business-hour workloads. For unpredictable spikes — a viral product launch, a scheduled job that fires outside normal hours — users will see multi-minute waits. Accept that or keep a permanently warm instance (at $70–250/month depending on instance type).

Operational complexity versus managed APIs. Managed API providers give you one API key and a billing dashboard. This architecture gives you ECS task definitions, SQS visibility timeouts, CloudWatch alarms, GPU AMI version pinning, CUDA driver compatibility, vLLM upgrade windows, and model versioning across rolling deployments. That’s real engineering overhead. One team member needs to own this full-time once volume justifies self-hosting. If your team is currently at 3 engineers and the infrastructure person also owns the product backend, the operational surface of this system will hurt.

SQS does not guarantee strict ordering. SQS standard queues deliver messages at-least-once, in roughly FIFO order, but with no strict guarantee. If your workload requires strict ordering — processing events in the exact sequence they were submitted — you need SQS FIFO queues, which have throughput limits (3,000 messages/second with batching) and cost more. For most classification and extraction workloads, order doesn’t matter. Know whether yours does before you deploy.


Production Challenges: What Hits You After Launch

These are the operational issues that appear after the architecture is running, not during the build.

Cold start latency is a UX problem, not just a cost problem. When the system scales to zero overnight and a user triggers the first request of the day, they wait 3-5 minutes. For async batch workloads, that’s fine. For anything user-facing, it’s not. Fix: scheduled EventBridge rule that sets min_capacity = 1 at business hours start (e.g. 8:45 AM), back to 0 at close. One-instance pre-warm costs ~$70-250/month depending on instance type.
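One way to express that schedule is Application Auto Scaling scheduled actions. The sketch below only builds the `put_scheduled_action` parameter sets; the cron expressions assume US East business hours expressed in UTC, so adjust both:

```python
def prewarm_actions(cluster, service,
                    open_cron="cron(45 12 ? * MON-FRI *)",    # 8:45 AM ET (12:45 UTC)
                    close_cron="cron(0 23 ? * MON-FRI *)"):   # 7:00 PM ET
    """Parameter sets for boto3 application-autoscaling put_scheduled_action:
    raise MinCapacity to 1 before opening, drop it back to 0 at close."""
    common = {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
    }
    return [
        {**common, "ScheduledActionName": "prewarm-open",
         "Schedule": open_cron, "ScalableTargetAction": {"MinCapacity": 1}},
        {**common, "ScheduledActionName": "prewarm-close",
         "Schedule": close_cron, "ScalableTargetAction": {"MinCapacity": 0}},
    ]

actions = prewarm_actions("llm-cluster", "llm-inference")
# for a in actions:
#     boto3.client("application-autoscaling").put_scheduled_action(**a)
```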

GPU idle cost accumulates fast. A single g5.xlarge running 24/7 costs ~$724/month on-demand. If your workload is 8 hours/day on weekdays, you’re paying for 128 idle hours per week (nights + weekends). Scale-to-zero with proper cold-start handling eliminates that. Most teams delay implementing this and then are surprised by the bill.

Queue spike handling requires tuned cooldowns. A message flood (bug in upstream system, scheduled job firing unexpectedly) can trigger aggressive scale-up. Without a max instance cap, a bug that sends 100K messages in 5 minutes can spin up 50+ GPU instances. Always set max_capacity before you go live. Always set an AWS Budgets daily alert at 150% of expected spend.

Spot instance interruptions during long inferences. Spot VMs get interrupted with 2 minutes notice. If you’re mid-inference on a 3-minute generation task, that inference is lost. Mitigate with: (a) SQS visibility timeout longer than p99 inference latency - interrupted messages become visible again and are retried, and (b) idempotent workers that check DynamoDB before processing to skip already-completed jobs.
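A sketch of the idempotency check, with a dict standing in for the DynamoDB job table (with real DynamoDB this is a read of the job status, or a conditional put guarded by `attribute_not_exists`):

```python
def process_once(status_table, job_id, prompt, infer):
    """Skip jobs that a previous (spot-interrupted, then redelivered) attempt
    already completed. status_table stands in for a DynamoDB job-status table."""
    if status_table.get(job_id) == "done":
        return None                      # duplicate delivery - do no GPU work
    result = infer(prompt)
    status_table[job_id] = "done"        # real DynamoDB: put_item with the result
    return result

jobs = {}
process_once(jobs, "j1", "classify: refund", lambda p: "billing")        # runs inference
print(process_once(jobs, "j1", "classify: refund", lambda p: "billing")) # None - skipped
```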

Model versioning in production is more complex than in dev. You can’t just update the Docker tag and roll forward. Model behavior changes affect downstream systems that may have tuned their confidence thresholds or output parsing for the previous version. Use blue/green: deploy v2 to a separate ECS service, shift 5% of traffic via weighted ALB routing, monitor accuracy and latency for 24-48 hours before cutting over.

vLLM memory configuration requires tuning per workload. --gpu-memory-utilization 0.90 is a reasonable default, but with long-context requests or large batch sizes, it’s possible to OOM even below that threshold due to KV cache growth. Monitor vllm:gpu_cache_usage_perc from vLLM’s /metrics endpoint. If it regularly exceeds 85%, reduce --max-model-len or batch size.
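A sketch of that check against a /metrics scrape (Prometheus text format; the sample line below is illustrative, not captured vLLM output):

```python
def parse_gauge(metrics_text, name):
    """Pull one gauge value out of a Prometheus-format /metrics scrape."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    return None

scrape = "vllm:gpu_cache_usage_perc 0.87\n"   # illustrative sample scrape line
usage = parse_gauge(scrape, "vllm:gpu_cache_usage_perc")
if usage is not None and usage > 0.85:
    print("KV cache pressure - reduce --max-model-len or batch size")
```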


LLM Inference Cost Analysis: Self-Hosted vs. API

Let’s run three real scenarios. Numbers use AWS US East on-demand pricing.

Hardware costs

g4dn.xlarge  (1× T4 16GB):    $0.526/hr  → ~$379/month
g5.xlarge    (1× A10G 24GB):  $1.006/hr  → ~$724/month

Spot instances (60–70% discount, interruption risk):
g4dn.xlarge spot: ~$0.16/hr  → ~$115/month
g5.xlarge spot:   ~$0.35/hr  → ~$252/month

GPU Instance Comparison

| Instance | GPU | VRAM | On-Demand/mo | Spot/mo | Best for |
|---|---|---|---|---|---|
| g4dn.xlarge | T4 | 16 GB | ~$379 | ~$115 | 7B models, cost-first |
| g5.xlarge | A10G | 24 GB | ~$724 | ~$252 | 8B–13B, balanced |
| g5.2xlarge | A10G | 24 GB | ~$873 | ~$300 | Higher CPU/RAM needs |
| g5.12xlarge | 4× A10G | 96 GB | ~$3,493 | ~$1,200 | 70B models |

Throughput (Mistral 7B / Llama 3 8B with vLLM)

g4dn.xlarge (T4):   ~20-40 req/min  (short prompts, <512 tokens)
g5.xlarge (A10G):  ~50-100 req/min  (short prompts, <512 tokens)

These numbers vary enormously with prompt length, output length, batch size, and quantization. Always load-test your specific workload before committing.

Cost Comparison: 200K Requests/Month

| Option | Assumptions | Monthly cost |
|---|---|---|
| GPT-4o-mini API | 300 input + 50 output tokens avg · 200K requests | ~$15 |
| GPT-4o API | Same volume, frontier model needed for accuracy | ~$250 |
| Self-hosted Llama 3 8B (fine-tuned) | 1× g5.xlarge spot + auto-scaling · scale to zero off-hours | ~$150–250 |

Honest result: if GPT-4o-mini handles your task, just use the API. Self-hosting doesn’t win here.

High Volume: 2M Requests/Month

| Option | Assumptions | Monthly cost |
|---|---|---|
| GPT-4o API | 2M × same token estimate | ~$2,500 |
| Self-hosted (2–3× g5.xlarge spot) | Auto-scaling cluster · near-zero marginal cost per request | ~$600–900 |

At 2M req/month, self-hosting saves ~$21,000/year. At 200K with cheap models, the API often wins.

Annual Savings Summary

| Monthly volume | API model | API cost/mo | Self-hosted/mo | Monthly saving | Annual saving |
|---|---|---|---|---|---|
| 200K requests | GPT-4o-mini | ~$15 | ~$200 | -$185 (API wins) | API wins |
| 200K requests | GPT-4o | ~$250 | ~$200 | ~$50 | ~$600 |
| 500K requests | GPT-4o | ~$625 | ~$250 | ~$375 | ~$4,500 |
| 1M requests | GPT-4o | ~$1,250 | ~$350 | ~$900 | ~$10,800 |
| 2M requests | GPT-4o | ~$2,500 | ~$750 | ~$1,750 | ~$21,000 |
| 5M requests | GPT-4o | ~$6,250 | ~$1,500 | ~$4,750 | ~$57,000 |

Self-hosting infrastructure cost stays roughly flat as volume grows (you add more instances, but marginal cost per request falls). API cost scales linearly with every request.
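The crossover falls out of per-request token economics. The list prices per million tokens below are assumptions pinned to the table's 200K-requests-at-~$250 figure - plug in current pricing:

```python
def api_cost_per_month(requests, in_tokens=300, out_tokens=50,
                       in_price_per_m=2.50, out_price_per_m=10.00):
    """Linear API cost: every request pays for its own tokens."""
    per_request = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6
    return requests * per_request

def breakeven_requests(self_hosted_monthly, **kwargs):
    """Volume at which a roughly flat self-hosted bill beats the linear API bill."""
    per_request = api_cost_per_month(1, **kwargs)
    return int(self_hosted_monthly / per_request)

print(api_cost_per_month(200_000))   # 250.0 - the 200K GPT-4o row above
print(breakeven_requests(250))       # 200000 - past this, self-hosting wins
```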


Step-by-Step Implementation on AWS

AWS components and why each one is chosen

| Component | Role | Why not the alternative |
|---|---|---|
| API Gateway | Auth, rate limiting, request validation | ALB lacks auth - you’d need to add it separately |
| Lambda | Enqueue requests, return job ID | No GPU needed; cheap per-invocation |
| SQS | Buffer between clients and GPU workers | Managed, durable, DLQ built-in |
| ECS on EC2 | Run vLLM containers on GPU instances | ECS Fargate doesn’t support GPU tasks |
| Application Auto Scaling | Scale ECS tasks on SQS depth | CloudWatch native, no third-party needed |
| ElastiCache (Redis) | Cache repeated prompts | 15-30% cache hit rate reduces GPU load |
| RDS PostgreSQL | Audit log, cost attribution, fine-tune data | Structured queries on request/response pairs |
| DynamoDB | Job status and results | Low latency for polling clients |

Step 1: Choose your model

For classification, extraction, and routing use Llama 3 8B or Mistral 7B - fast, cheap, fine-tunable on a single g5.xlarge. For summarization and generation you need Llama 3 70B or Mixtral 8x7B, which requires multi-GPU or a larger instance.

Apply AWQ 4-bit quantization to get 2× throughput at minimal quality loss for most classification/extraction tasks:

# Install quantization tooling
pip install autoawq

# Quantize to 4-bit (reduces VRAM from ~16GB to ~5GB for 8B model)
python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')
model.quantize(tokenizer, quant_config={'zero_point': True, 'q_group_size': 128, 'w_bit': 4})
model.save_quantized('./llama3-8b-awq')
"

Step 2: Dockerize with vLLM

vLLM gives you continuous batching, PagedAttention (efficient KV cache), and an OpenAI-compatible API endpoint - meaning your existing client code using openai SDK works against your self-hosted model with a single base URL change.
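With the openai SDK that change is literally swapping base_url; the stdlib sketch below shows the same OpenAI-compatible wire call vLLM serves (the endpoint host is a placeholder for your internal service):

```python
import json

def chat_request(base_url, model, user_msg, max_tokens=64):
    """Build the OpenAI-compatible /v1/chat/completions request that
    vLLM's api_server accepts. base_url is your internal endpoint."""
    return {
        "url": base_url.rstrip("/") + "/v1/chat/completions",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": user_msg}],
            "max_tokens": max_tokens,
        }),
    }

req = chat_request("http://llm-internal:8000",
                   "meta-llama/Meta-Llama-3-8B-Instruct",
                   "Classify this ticket: refund request")
print(req["url"])   # http://llm-internal:8000/v1/chat/completions
```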

FROM vllm/vllm-openai:latest

ENV MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
ENV QUANTIZATION="awq"
ENV MAX_MODEL_LEN="4096"

EXPOSE 8000

# Health check for ECS
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server", \
            "--model", "/model", \
            "--quantization", "awq", \
            "--max-model-len", "4096", \
            "--gpu-memory-utilization", "0.90", \
            "--max-num-seqs", "64"]

Push to ECR:

# Authenticate and push
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789.dkr.ecr.us-east-1.amazonaws.com

docker build -t llm-inference .
docker tag llm-inference:latest \
  123456789.dkr.ecr.us-east-1.amazonaws.com/llm-inference:v1.0.0
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/llm-inference:v1.0.0

Step 3: ECS Task Definition

Key parts of the task definition JSON - the GPU resource requirement is what most tutorials miss:

{
  "family": "llm-inference",
  "requiresCompatibilities": ["EC2"],
  "networkMode": "awsvpc",
  "cpu": "4096",
  "memory": "16384",
  "containerDefinitions": [
    {
      "name": "vllm",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/llm-inference:v1.0.0",
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ],
      "portMappings": [{ "containerPort": 8000 }],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
        "interval": 30,
        "timeout": 10,
        "retries": 3,
        "startPeriod": 120
      },
      "environment": [
        { "name": "MODEL_NAME", "value": "meta-llama/Meta-Llama-3-8B-Instruct" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/llm-inference",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "vllm"
        }
      }
    }
  ]
}

Step 4: Auto-Scaling Policy

Scale on SQS queue depth. This is the CloudWatch alarm + application auto-scaling config in Terraform:

resource "aws_sqs_queue" "inference" {
  name                       = "llm-inference-queue"
  visibility_timeout_seconds = 300
  message_retention_seconds  = 3600
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.inference_dlq.arn
    maxReceiveCount     = 3
  })
}

resource "aws_appautoscaling_policy" "scale_up" {
  name               = "llm-scale-up"
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.inference.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  policy_type        = "StepScaling"

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    cooldown                = 120
    metric_aggregation_type = "Average"
    step_adjustment {
      metric_interval_lower_bound = 0
      scaling_adjustment          = 1
    }
  }
}

# Scale-down policy referenced by the queue_empty alarm below
resource "aws_appautoscaling_policy" "scale_down" {
  name               = "llm-scale-down"
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.inference.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  policy_type        = "StepScaling"

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    cooldown                = 600
    metric_aggregation_type = "Average"
    step_adjustment {
      metric_interval_upper_bound = 0
      scaling_adjustment          = -1
    }
  }
}

resource "aws_cloudwatch_metric_alarm" "queue_deep" {
  alarm_name          = "llm-queue-backlog"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = aws_sqs_queue.inference.name }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 2
  threshold           = 100          # scale up when 100+ messages queued for 2 min
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_appautoscaling_policy.scale_up.arn]
}

resource "aws_cloudwatch_metric_alarm" "queue_empty" {
  alarm_name          = "llm-queue-empty"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = aws_sqs_queue.inference.name }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 10           # scale down only after 10 min of empty queue
  threshold           = 0
  comparison_operator = "LessThanOrEqualToThreshold"
  alarm_actions       = [aws_appautoscaling_policy.scale_down.arn]
}

Set min_capacity = 0 on the Application Auto Scaling target to enable scale-to-zero on idle. That single setting saves 40-70% of your monthly GPU bill if your workload has any off-hours pattern.

To make the scaling math concrete: at 500 queued messages with messages_per_worker = 50, the scaler computes ceil(500 / 50) = 10 workers. On g5.xlarge spot at $0.35/hr, that burst capacity costs $3.50/hr — and it releases automatically once the queue drains. A 2,000-message spike from a batch job runs for roughly 20–30 minutes at peak, costs under $2.00, and leaves no residual cost. This is why queue-depth scaling is the right abstraction: the cost of burst capacity is bounded by the size of the queue, not by a human operator’s reaction time.

Step 4b: Worker Scaling Logic (Python)

The Terraform manages the AWS side. Here’s the Python side - a lightweight scaler process that watches SQS queue depth and adjusts the ECS service’s desired count. It runs separately from the GPU workers:

import boto3
import time
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class ScalingConfig:
    queue_url: str
    cluster_name: str
    service_name: str
    min_workers: int = 0
    max_workers: int = 10
    messages_per_worker: int = 50   # target: 50 queued msgs per GPU worker
    scale_up_cooldown: int = 120    # seconds between scale-up actions
    scale_down_cooldown: int = 600  # longer cooldown for scale-down

class WorkerScaler:
    """
    Polls SQS queue depth and adjusts ECS task count accordingly.
    Run this as a separate lightweight Lambda or ECS task (no GPU needed).
    """
    def __init__(self, config: ScalingConfig):
        self.config = config
        self.sqs = boto3.client('sqs')
        self.ecs = boto3.client('ecs')
        self.last_scale_up = 0
        self.last_scale_down = 0

    def get_queue_depth(self) -> int:
        resp = self.sqs.get_queue_attributes(
            QueueUrl=self.config.queue_url,
            AttributeNames=['ApproximateNumberOfMessagesVisible']
        )
        return int(resp['Attributes']['ApproximateNumberOfMessagesVisible'])

    def get_current_workers(self) -> int:
        resp = self.ecs.describe_services(
            cluster=self.config.cluster_name,
            services=[self.config.service_name]
        )
        service = resp['services'][0]
        return service['runningCount']

    def desired_workers(self, queue_depth: int) -> int:
        """Calculate desired worker count from queue depth."""
        if queue_depth == 0:
            return self.config.min_workers
        # ceil(queue_depth / messages_per_worker), clamped to [min, max]
        import math
        desired = math.ceil(queue_depth / self.config.messages_per_worker)
        return max(self.config.min_workers, min(self.config.max_workers, desired))

    def scale(self, target: int) -> None:
        self.ecs.update_service(
            cluster=self.config.cluster_name,
            service=self.config.service_name,
            desiredCount=target
        )
        logger.info(f"Scaled ECS service to {target} tasks")

    def run_once(self) -> None:
        now = time.time()
        queue_depth = self.get_queue_depth()
        current = self.get_current_workers()
        target = self.desired_workers(queue_depth)

        if target > current:
            # Scale up: respect cooldown
            if now - self.last_scale_up >= self.config.scale_up_cooldown:
                logger.info(f"Scaling up: {current} -> {target} (queue: {queue_depth})")
                self.scale(target)
                self.last_scale_up = now
        elif target < current:
            # Scale down: use longer cooldown to avoid flapping
            if now - self.last_scale_down >= self.config.scale_down_cooldown:
                logger.info(f"Scaling down: {current} -> {target} (queue: {queue_depth})")
                self.scale(target)
                self.last_scale_down = now
        else:
            logger.debug(f"No scaling needed: {current} workers, {queue_depth} queued")

    def run_loop(self, interval_seconds: int = 30) -> None:
        """Run scaling loop. Call this from your ECS task or Lambda."""
        while True:
            try:
                self.run_once()
            except Exception as e:
                logger.error(f"Scaling error: {e}")
            time.sleep(interval_seconds)

Key decisions in this code:

  • messages_per_worker = 50 means you want at most 50 queued messages per active GPU worker. Tune this to your p50 inference latency and your SLA. At 30 req/min per worker, 50 queued = ~100s of backlog per worker.
  • Scale-up cooldown is short (120s) - you want to add capacity quickly.
  • Scale-down cooldown is long (600s) - you want to drain the queue fully before removing capacity. Removing workers too early causes a second scale-up cycle.
  • min_workers = 0 enables true scale-to-zero. Set to 1 during business hours via a scheduled Lambda if cold starts are unacceptable.

Step 5: Lambda Request Handler

The Lambda function validates requests, pushes to SQS, and returns a job ID. Clients poll a separate endpoint (or use WebSocket for streaming):

import json
import boto3
import uuid
from datetime import datetime

sqs = boto3.client('sqs')
dynamodb = boto3.resource('dynamodb')

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/llm-inference-queue"
JOB_TABLE = "llm-jobs"

def handler(event, context):
    body = json.loads(event.get('body', '{}'))

    # Validate
    if 'prompt' not in body:
        return {'statusCode': 400, 'body': json.dumps({'error': 'prompt required'})}
    if len(body['prompt']) > 8000:
        return {'statusCode': 400, 'body': json.dumps({'error': 'prompt too long'})}

    job_id = str(uuid.uuid4())
    submitted_at = datetime.utcnow().isoformat()
    authorizer = event.get('requestContext', {}).get('authorizer') or {}

    # Record the job so the poll endpoint can answer immediately
    dynamodb.Table(JOB_TABLE).put_item(Item={
        'job_id': job_id,
        'status': 'queued',
        'submitted_at': submitted_at,
    })

    # Push to SQS
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            'job_id': job_id,
            'prompt': body['prompt'],
            'model': body.get('model', 'llama3-8b'),
            'max_tokens': body.get('max_tokens', 512),
            'team': authorizer.get('team', 'unknown'),
            'submitted_at': submitted_at,
        })
    )

    return {
        'statusCode': 202,
        'body': json.dumps({
            'job_id': job_id,
            'status': 'queued',
            'poll_url': f'/jobs/{job_id}'
        })
    }
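The poll endpoint is a second, smaller Lambda that reads the job record. A minimal sketch of its response shaping - the item comes from `dynamodb.Table(JOB_TABLE).get_item(Key={'job_id': job_id}).get('Item')`, and field names like `result` are assumptions matching the worker examples in this post:

```python
import json

def job_response(item):
    """Shape a DynamoDB job item into the poll endpoint's HTTP response.

    `item` is the raw DynamoDB item dict, or None if the job_id is unknown.
    The 'status'/'result' field names are illustrative assumptions.
    """
    if item is None:
        return {'statusCode': 404,
                'body': json.dumps({'error': 'unknown job_id'})}
    body = {'job_id': item['job_id'], 'status': item['status']}
    if item['status'] == 'completed':
        body['result'] = item.get('result')
    return {'statusCode': 200, 'body': json.dumps(body)}
```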

Failure Modes and Recovery Playbook

These are the specific failure scenarios this architecture encounters in production, how to detect each one, and how to recover. Keep this list in your runbook.

Queue backlog grows unbounded, workers not scaling

  • Triggers: CloudWatch alarm threshold set too high; autoscaling policy not attached to alarm; ECS service max capacity already hit; scaling cooldown still active from a prior event.
  • Detect: ApproximateNumberOfMessagesVisible rising for >5 minutes while RunningCount stays flat. Set a CloudWatch alarm: queue depth >500 for 5 consecutive minutes AND ECS running count unchanged.
  • Recover: Check ECS service events in the console for placement failures (no EC2 capacity in the ASG). Check if max_capacity is already reached — if so, this is a capacity ceiling problem, not a scaling problem. Verify the alarm is in ALARM state and the scaling policy is correctly attached. If scaling is stuck mid-cooldown, you can manually update ECS desired count.

Worker OOM crash loop — instances crash immediately after starting

  • Triggers: --gpu-memory-utilization set too high for the actual workload; concurrent request batch size exceeds KV cache budget; model weights + KV cache exceed available VRAM.
  • Detect: ECS task stop reason OutOfMemoryError: Container killed due to memory usage in CloudWatch Logs. vLLM metric vllm:gpu_cache_usage_perc consistently above 90% before crash.
  • Recover: Lower --gpu-memory-utilization to 0.80 and redeploy. If KV cache is the culprit, reduce --max-model-len. In the medium term, audit VRAM budget: model weights in FP16 + KV cache overhead at your max batch size must fit within GPU VRAM with margin.

Spot interruption mid-inference causing duplicate processing

  • Triggers: AWS 2-minute spot interruption notice; worker terminates while holding SQS message visibility lock; message becomes visible again and is delivered to a second worker.
  • Detect: DynamoDB job_id entries with status completed that arrive twice; elevated NumberOfMessagesDeleted spikes followed by re-appearances in ApproximateNumberOfMessagesVisible.
  • Recover: Idempotent worker design handles this automatically — workers must check DynamoDB for job_id existence before processing and skip if already completed. Ensure visibility_timeout_seconds is longer than p99 inference latency + 120s buffer so interrupted messages don’t re-appear while the original worker might still be finishing.
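The idempotent guard reduces to a conditional DynamoDB write. A minimal sketch, assuming a `status` attribute on the job record (the status values are assumptions from this post's examples):

```python
def claim_job(table, job_id: str) -> bool:
    """Atomically claim a job via a conditional write.

    `table` is a boto3 DynamoDB Table resource (anything with the same
    update_item signature works). Returns False when another worker
    already claimed or completed the job - the duplicate delivery case.
    """
    try:
        table.update_item(
            Key={'job_id': job_id},
            UpdateExpression='SET #s = :processing',
            ConditionExpression='#s = :queued',
            ExpressionAttributeNames={'#s': 'status'},
            ExpressionAttributeValues={':processing': 'processing',
                                       ':queued': 'queued'},
        )
        return True
    except Exception as e:
        # boto3 raises ClientError with this code on a failed condition
        if 'ConditionalCheckFailed' in str(e):
            return False
        raise
```

Workers call `claim_job` before inference and simply delete the SQS message when it returns False.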

Cold start thundering herd — scale-up triggered by burst, all new workers fail health checks simultaneously

  • Triggers: Queue depth spikes from 0 to thousands; ECS scales from 0 to max_capacity simultaneously; all instances are loading model weights at the same time; none pass health check for 3–5 minutes; ECS marks them all unhealthy and tries to replace them.
  • Detect: Repeated Task stopped events in ECS service history within 2 minutes of launch. The usual cause is HealthCheckGracePeriodSeconds (or the container startPeriod) set shorter than model load time - ECS kills instances before they finish loading.
  • Recover: Set startPeriod in the ECS health check to 300 seconds (longer than the worst-case model load time). Stage scale-ups with a smaller initial step adjustment so not all capacity comes online at once. The pre-warm scheduled rule eliminates this scenario entirely for predictable workloads.

Dead-letter queue filling — requests poisoning the queue

  • Triggers: Malformed request bodies that pass Lambda validation but fail in the worker; upstream bug sending invalid prompt formats; model rejecting specific input patterns that trigger vLLM exceptions.
  • Detect: ApproximateNumberOfMessagesVisible on the DLQ rising. Set a CloudWatch alarm on DLQ depth >10.
  • Recover: Pull a sample of DLQ messages, identify the common malformed pattern, fix the Lambda input validation to reject it upstream. Purge the DLQ after fixing. If the issue is a vLLM bug with specific inputs, add a try/except in the worker that writes a failure status to DynamoDB rather than allowing the retry cycle to fill the DLQ.
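Triaging a DLQ sample is mostly bucketing message bodies by failure signature. A hedged sketch that mirrors the Lambda validation rules above - the bucket names are illustrative:

```python
import json
from collections import Counter

def triage_dlq(bodies):
    """Bucket DLQ message bodies by a coarse failure signature.

    The 8000-char limit mirrors the Lambda validation in this post;
    adjust the checks to your own request schema.
    """
    buckets = Counter()
    for body in bodies:
        try:
            msg = json.loads(body)
        except (json.JSONDecodeError, TypeError):
            buckets['invalid_json'] += 1
            continue
        if not isinstance(msg.get('prompt'), str):
            buckets['missing_prompt'] += 1
        elif len(msg['prompt']) > 8000:
            buckets['prompt_too_long'] += 1
        else:
            buckets['other'] += 1
    return buckets
```

Feed it bodies pulled with `sqs.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=10)`; the dominant bucket tells you which upstream validation to tighten.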

Real-World Case Study: E-Commerce Product Categorization

Setup and performance results

A mid-size e-commerce platform had 500K product listings and needed to auto-categorize new listings and extract attributes (color, size, material) from seller descriptions.

Volume: 80K inference requests/month. Accuracy requirement: 92%+ (human review for low-confidence results).

E-Commerce Categorization Pipeline

  1. Ingest: seller listing (e.g. "Blue cotton crew neck t-shirt, relaxed fit, 200gsm...") → SQS queue
  2. Infer: Llama 3 8B (fine-tuned on 50K labeled listings) → category + confidence score
  3. Route: confidence ≥ 0.92 → auto-apply; confidence < 0.92 → human review queue

Results:

  • 94% accuracy (vs 89% GPT-4o-mini zero-shot)
  • 180ms p50 latency (p99: 450ms)
  • $300/mo vs $800/mo with GPT-4o

Fine-tuned 8B beat GPT-4o-mini on domain accuracy while costing 60% less than GPT-4o.

Why self-hosting won in this case

Three factors aligned for this to be the right call:

  1. The fine-tuned model beat the API on domain accuracy. GPT-4o-mini scored 89% on their taxonomy zero-shot. Llama 3 8B fine-tuned on 50K labeled listings scored 94%. More accuracy, lower cost.
  2. Cost was 60% less than the API alternative. $300/month vs $800/month to achieve equivalent accuracy.
  3. Data privacy was a hard requirement. Product descriptions with pricing strategy and supplier info could not leave the VPC.

This is the pattern where self-hosting wins: well-defined task, enough labeled data to fine-tune, privacy requirement, and volume above 50K/month.
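The confidence-gated routing step reduces to a few lines. The 0.92 threshold comes from the case study's accuracy requirement; everything else here is illustrative:

```python
def route_prediction(category: str, confidence: float,
                     threshold: float = 0.92) -> tuple:
    """Auto-apply high-confidence predictions, queue the rest for review.

    Returns a (destination, category) pair. Tune the threshold on a
    held-out set so auto-applied predictions meet your accuracy bar.
    """
    if confidence >= threshold:
        return ('auto_apply', category)
    return ('human_review', category)
```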


Production Pitfalls: What Goes Wrong After Deployment

The five mistakes that will get you after deployment:

Common Production Failures

  1. No request queue. Traffic spike → GPU memory fills → OOM crash → total outage. Always put SQS in front. No exceptions.
  2. No auto-scaling max cap. Unbounded scale-up on a traffic spike (or a bug producing infinite requests) runs up a five-figure AWS bill overnight. Set a hard max instance count and a budget alarm.
  3. Ignoring cold start time. GPU instances take 3–5 minutes to boot and load model weights. If you're fully scaled to zero, the first request burst after a quiet period has a 3-minute wait. Pre-warm at least one instance during business hours.
  4. No API fallback. Your self-hosted model will go down. When it does, you need a circuit breaker that automatically routes to an API provider (OpenAI/Anthropic). The cost of 1 hour of downtime exceeds the API cost of that hour by 10–100×.
  5. Aggressive batching without latency testing. Batching increases throughput but destroys tail latency. A batch size of 32 might give 2× throughput at the cost of p99 latency jumping from 500ms to 4s. Find your batch sweet spot under realistic load, not synthetic benchmarks.

Production Checklist

  • SQS dead-letter queue configured for failed inferences
  • CloudWatch alarms: GPU utilization, queue depth, error rate, p99 latency
  • Auto-scaling max capacity set with AWS Budgets daily cost alerts
  • Circuit breaker to API provider fallback (e.g. using tenacity + OpenAI SDK)
  • Model versioning: tagged Docker images, never overwrite latest in production
  • Load tested at 2× expected peak traffic
  • Pre-warm schedule: keep minimum 1 instance running during business hours
  • Request/response logging to S3 (future fine-tuning data)
  • Health check on vLLM /health endpoint with ECS replacement trigger
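The circuit-breaker checklist item deserves a concrete shape. A minimal, dependency-free sketch - the thresholds are illustrative, and in production you would pair it with retry logic (e.g. tenacity) around the fallback call:

```python
import time

class CircuitBreaker:
    """Route to an API fallback after repeated local-inference failures.

    After `failure_threshold` consecutive failures the breaker opens and
    use_fallback() returns True until `reset_after` seconds pass, then it
    allows one local attempt again (half-open). `clock` is injectable for
    testing.
    """
    def __init__(self, failure_threshold: int = 5, reset_after: float = 60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def use_fallback(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # half-open: permit one local attempt
            self.opened_at = None
            self.failures = 0
            return False
        return True
```

Wrap your inference call: if `use_fallback()` is True, call the API provider; otherwise try the self-hosted endpoint and record success/failure.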

Running This Architecture on Azure and Google Cloud

The same queue-based architecture works on Azure and GCP. The components map directly - only the service names change.

Component mapping across clouds

| Role | AWS | Azure | Google Cloud |
|---|---|---|---|
| Request queue | SQS | Azure Service Bus | Cloud Pub/Sub |
| Serverless handler | Lambda | Azure Functions | Cloud Functions |
| Container orchestration | ECS on EC2 | AKS (Azure Kubernetes Service) | GKE (Google Kubernetes Engine) |
| GPU instances | g5.xlarge / g4dn | NC-series (A100, A10) | a2-highgpu (A100), g2-standard (L4) |
| Autoscaling trigger | CloudWatch + App Auto Scaling | KEDA on AKS | KEDA on GKE or GKE Autopilot |
| Container registry | ECR | Azure Container Registry | Artifact Registry |
| Prompt cache | ElastiCache (Redis) | Azure Cache for Redis | Memorystore (Redis) |
| Audit logs | RDS PostgreSQL | Azure Database for PostgreSQL | Cloud SQL (PostgreSQL) |
| Job results | DynamoDB | Cosmos DB | Firestore |
| Secrets | Secrets Manager | Azure Key Vault | Secret Manager |

Azure: AKS + Service Bus architecture

On Azure, the natural equivalent is AKS (Kubernetes) with KEDA (Kubernetes Event-Driven Autoscaling) scaling pods on Service Bus queue depth. KEDA is purpose-built for this pattern and handles the scaling math for you.

# KEDA ScaledObject: scale AKS GPU pods based on Service Bus queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
spec:
  scaleTargetRef:
    name: llm-inference-deployment
  minReplicaCount: 0          # scale to zero when idle
  maxReplicaCount: 10
  cooldownPeriod: 600         # 10 min cooldown before scaling down
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: llm-inference-queue
        namespace: your-servicebus-namespace
        messageCount: "50"    # 1 pod per 50 queued messages
      authenticationRef:
        name: servicebus-trigger-auth

The GPU node pool uses Azure NC-series VMs. For 8B models, Standard_NC4as_T4_v3 (1x T4) is the Azure equivalent of g4dn.xlarge. For larger models, Standard_NC24ads_A100_v4 (1x A100 80GB) handles 70B models.

# Create AKS GPU node pool
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpupool \
  --node-count 0 \
  --node-vm-size Standard_NC4as_T4_v3 \
  --node-taints sku=gpu:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10

Azure-specific notes:

  • KEDA is more flexible than CloudWatch-based scaling - it supports 50+ event sources out of the box.
  • Azure Container Instances (ACI) does not support GPU for production workloads. Use AKS.
  • Standard_NC4as_T4_v3 spot pricing is roughly $0.12-0.18/hr - similar to AWS spot g4dn.xlarge.
  • Azure Service Bus has built-in dead-letter queues and message lock (equivalent to SQS visibility timeout).

Google Cloud: GKE + Pub/Sub architecture

On GCP, GKE with KEDA is the cleanest path. GKE Autopilot supports GPU nodes in preview; for production use Standard GKE with a dedicated GPU node pool.

# KEDA ScaledObject: scale GKE GPU pods based on Pub/Sub subscription backlog
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
spec:
  scaleTargetRef:
    name: llm-inference-deployment
  minReplicaCount: 0
  maxReplicaCount: 10
  cooldownPeriod: 600
  triggers:
    - type: gcp-pubsub
      metadata:
        subscriptionName: llm-inference-sub
        mode: SubscriptionSize      # scale on undelivered message count
        value: "50"                  # 1 pod per 50 unacked messages
      authenticationRef:
        name: pubsub-trigger-auth

# Create GKE GPU node pool (L4 GPU - best price/performance for 8B models)
gcloud container node-pools create gpu-pool \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=0 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=10 \
  --node-taints=sku=gpu:NoSchedule

GCP-specific notes:

  • g2-standard-4 with L4 GPU is the best value for 7B-8B models on GCP. L4 delivers similar throughput to A10G at lower cost.
  • Cloud Pub/Sub uses subscriptions (not queues) - each consumer group needs its own subscription on the same topic.
  • GCP’s preemptible capacity is now sold as “Spot VMs” - same 60-70% discount and same interruption risk as AWS Spot.
  • GKE has native Workload Identity for service account auth - no IAM credentials in containers.

GPU instance price comparison (spot pricing, approximate)

| GPU | AWS | Azure | GCP | VRAM | Best for |
|---|---|---|---|---|---|
| T4 | g4dn.xlarge ~$0.16/hr | NC4as_T4_v3 ~$0.15/hr | n1 + T4 ~$0.14/hr | 16 GB | 7B-8B models |
| A10G / equivalent | g5.xlarge ~$0.35/hr | NC6ads_A10_v4 ~$0.32/hr | g2-standard-4 (L4) ~$0.28/hr | 24 GB | 8B-13B models |
| A100 40GB | p4d.24xlarge (8× A100) ~$1.20/hr per GPU | NC24ads_A100_v4 ~$1.10/hr | a2-highgpu-1g ~$1.00/hr | 40 GB | 30B-70B models |

Prices vary by region and change frequently - always check current spot pricing before committing.

Which cloud should you choose?

Choose AWS if: your team already runs on AWS, or you need the most mature managed GPU ecosystem (ECS GPU support, SageMaker for fine-tuning pipelines).

Choose Azure if: your company uses Microsoft 365 / Azure AD (Entra ID integration simplifies auth), or you have Azure credits through an enterprise agreement.

Choose GCP if: you want the best GPU price/performance ratio (L4 instances are often the cheapest option), or you plan to use Vertex AI for fine-tuning alongside your serving infrastructure.

The architecture is the same on all three. Pick the cloud your team already knows.


Advanced Patterns: Beyond Basic Auto-Scaling

Three patterns worth knowing - each deserves its own post:

Intelligent model routing. Route simple tasks (yes/no classification, regex-extractable fields) to a fast 7B model. Route complex tasks (multi-step reasoning, long-form generation) to a 70B or fall back to an API. A lightweight token-count heuristic or a tiny classifier model is enough to route 80% of requests correctly.

Continuous fine-tuning loop. Log all request/response pairs. Human reviewers label a subset monthly. Fine-tune on that labeled data. Repeat. Over 6 months, a mediocre 7B model can become the best model for your specific task - at no increase in hardware cost.

Multi-LoRA serving. vLLM supports loading multiple LoRA adapters on a single base model. One g5.xlarge running Llama 3 8B can serve 5–10 different fine-tuned variants for different tasks, sharing the base model weights in GPU memory. One GPU, ten specialized models.


Cost Optimization Strategies for Self-Hosted LLMs

Once you have the queue-based architecture running, these are the highest-leverage cost levers in order of impact.

1. Scale to zero during off-hours (biggest single lever)

Set min_capacity = 0 in your auto-scaling policy. For a workload that runs 8 hours/day on weekdays:

  • 128 hours/week idle (nights + weekends) vs. 40 hours active
  • At $0.35/hr spot g5.xlarge: the idle hours alone cost ~$45/week (128 × $0.35) - that drops to $0
  • Effective monthly cost falls from ~$255 (running 24/7) to ~$61 (business hours only) for a single worker

Implementation: Use a scheduled EventBridge rule to set min capacity to 1 during business hours (prevents the first cold start of the day from being user-visible), and 0 overnight.
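Equivalently to an EventBridge rule + Lambda, Application Auto Scaling’s own scheduled actions can flip min capacity directly. A sketch that builds the two action payloads - the cron expressions (UTC) and action names are assumptions; pass each dict to `boto3.client('application-autoscaling').put_scheduled_action(**params)`:

```python
def business_hours_actions(resource_id: str,
                           open_cron: str = 'cron(0 13 ? * MON-FRI *)',
                           close_cron: str = 'cron(0 1 ? * TUE-SAT *)'):
    """Build scheduled actions: min capacity 1 at open, 0 at close.

    `resource_id` is the ECS service identifier, e.g.
    'service/<cluster>/<service>'. Adjust crons to your business hours.
    """
    base = {
        'ServiceNamespace': 'ecs',
        'ResourceId': resource_id,
        'ScalableDimension': 'ecs:service:DesiredCount',
    }
    return [
        {**base, 'ScheduledActionName': 'warm-at-open', 'Schedule': open_cron,
         'ScalableTargetAction': {'MinCapacity': 1}},
        {**base, 'ScheduledActionName': 'zero-at-close', 'Schedule': close_cron,
         'ScalableTargetAction': {'MinCapacity': 0}},
    ]
```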

2. Spot instances with interruption handling

GPU spot instances save 60-70% over on-demand pricing. The risk is interruption with 2 minutes warning. Mitigate it:

  • Use the SQS visibility timeout to handle interrupted messages automatically. When a spot instance is interrupted mid-inference, the message becomes visible again after the timeout and another worker picks it up.
  • Set visibility_timeout_seconds to at least your p99 inference latency + a 120s buffer.
  • Run a mix: 1 on-demand instance (always available) + N-1 spot instances (cheap capacity).

3. AWQ 4-bit quantization (doubles throughput, minimal quality loss)

Quantizing a Llama 3 8B model from FP16 to 4-bit AWQ:

  • Reduces VRAM from ~16GB to ~5GB
  • Fits a 7B model on a T4 (16GB) with room for larger batch sizes
  • Increases throughput ~2x on same hardware
  • Quality loss on classification/extraction tasks: typically <1% on your fine-tuned domain

For most production classification and extraction workloads, the quality difference is negligible and the cost difference is significant.

4. Prompt caching with Redis (free throughput)

In production LLM workloads, 15-30% of prompts are exact or near-exact duplicates:

  • Classification tasks with fixed templates: the system prompt is identical for every request
  • Support ticket routing: same categories, same instructions, different ticket text
  • Document extraction: same schema prompt, different documents

Cache the full response keyed on a hash of the prompt (plus the model and generation parameters). At a 20% cache hit rate on 100K requests/day, you skip 20K GPU inferences/day - roughly 3-4 hours of GPU time at sub-second classification latency, more for longer generations.
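A minimal cache-aside sketch, assuming a redis-py client. The key must cover everything that changes the output - model, prompt, and generation parameters:

```python
import hashlib
import json

def cache_key(model: str, prompt: str, max_tokens: int) -> str:
    """Deterministic cache key over everything that affects the output."""
    payload = json.dumps({'m': model, 'p': prompt, 't': max_tokens},
                         sort_keys=True)
    return 'llm:' + hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(cache, generate, model: str, prompt: str,
                    max_tokens: int = 512, ttl: int = 3600) -> str:
    """Check Redis before hitting the GPU.

    `cache` is a redis.Redis client (or any object with get/set);
    `generate` is your inference call. TTL is illustrative.
    """
    key = cache_key(model, prompt, max_tokens)
    hit = cache.get(key)
    if hit is not None:
        return hit.decode() if isinstance(hit, bytes) else hit
    result = generate(prompt)
    cache.set(key, result, ex=ttl)  # redis-py: SET with TTL in seconds
    return result
```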

5. Request batching (tune carefully)

vLLM’s continuous batching groups multiple requests into a single forward pass. The throughput improvement is real, but the tail latency impact requires testing:

  • Batch size 8: ~1.8x throughput, p99 latency increases ~40%
  • Batch size 32: ~2.5x throughput, p99 latency increases ~300-500%
  • Batch size 64: ~3x throughput, p99 latency can exceed 5-10 seconds

Rule of thumb: Find the batch size where p99 latency is still within your SLA, then set that as your ceiling. Never tune batch size against synthetic benchmarks - test with realistic prompt length distributions.

6. Model routing by task complexity

Not all requests need the same model:

def route_request(prompt: str, task_type: str) -> str:
    """Route to appropriate model based on task complexity."""
    if task_type in ("classification", "extraction", "routing"):
        # Fine-tuned 7B handles these at 94%+ accuracy
        return "llama3-8b-finetuned"
    elif task_type == "summarization" and len(prompt) < 2000:
        # Smaller model handles short summarization well
        return "llama3-8b-finetuned"
    elif task_type in ("generation", "reasoning", "complex_qa"):
        # Route complex tasks to API - cheaper than running 70B locally
        return "gpt-4o-mini-api"
    else:
        return "llama3-70b"  # heavy local model for everything else

Routing 70% of requests to the 8B model and 30% to the API often costs less than running the 70B model for everything.
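To sanity-check that claim against your own traffic, the blended-cost arithmetic is one function - the per-1K costs below are placeholders, not measured numbers:

```python
def blended_monthly_cost(requests: int, local_fraction: float,
                         local_cost_per_1k: float,
                         api_cost_per_1k: float) -> float:
    """Monthly cost of splitting traffic between local model and API.

    Substitute your own measured cost-per-1K-requests for each path.
    """
    local = requests * local_fraction * local_cost_per_1k / 1000
    api = requests * (1 - local_fraction) * api_cost_per_1k / 1000
    return round(local + api, 2)

# Illustrative: 100K req/mo, 70% local at $1/1K, 30% API at $5/1K
# → 70 + 150 = $220/mo
```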

Cost optimization summary

| Strategy | Effort | Monthly savings (typical) |
|---|---|---|
| Scale to zero off-hours | Low (1 config change) | 40-70% of GPU cost |
| Spot instances | Low (ASG config) | 60-70% vs on-demand |
| AWQ quantization | Medium (one-time) | Doubles throughput = halves instance count |
| Prompt caching | Medium (Redis setup) | 15-30% of GPU cost |
| Request batching | Low (vLLM config) | 20-50% throughput gain |
| Model routing | High (routing logic) | Highly variable |

Frequently Asked Questions

Q: How do I handle the async response pattern? My clients expect synchronous responses.

Use WebSockets or Server-Sent Events for streaming. For pure HTTP, return a job ID immediately (HTTP 202) and have clients poll /jobs/{id}. Most clients tolerate 200-500ms polling intervals without visible UX impact. For latency-sensitive use cases, keep at least 1 warm GPU instance running during peak hours to avoid cold start waits.
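A client-side polling loop, sketched with injectable sleep/clock so it’s testable; the terminal status names are assumptions matching this post’s examples:

```python
import time

def poll_job(fetch_status, job_id: str, interval: float = 0.3,
             timeout: float = 60.0, sleep=time.sleep, clock=time.monotonic):
    """Poll the /jobs/{id} endpoint until the job reaches a terminal state.

    `fetch_status` wraps your HTTP GET and returns the decoded JSON body.
    Raises TimeoutError if the job is still pending after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status(job_id)
        if status.get('status') in ('completed', 'failed'):
            return status
        if clock() >= deadline:
            raise TimeoutError(f'job {job_id} still pending after {timeout}s')
        sleep(interval)
```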

Q: What happens if a GPU instance is interrupted mid-inference (spot instance)?

SQS visibility timeout handles this automatically. When the instance disappears, SQS makes the message visible again after the timeout expires. Another worker picks it up and reprocesses it. Set visibility_timeout_seconds to at least your p99 inference latency + 120 seconds. Store partial results in DynamoDB so you can detect and deduplicate retried requests.

Q: Should I use ECS Fargate or EC2 for GPU workloads?

EC2 only. ECS Fargate does not support GPU task resource requirements. You need the EC2 launch type with the ECS GPU-optimized AMI (e.g. al2023-ami-ecs-gpu-hvm-*, which ships with NVIDIA drivers). On Azure and GCP, this means AKS/GKE on dedicated GPU node pools, not serverless container offerings.

Q: How do I choose between Llama 3 8B and 70B for my use case?

Start with 8B fine-tuned on your domain. Evaluate on a 500-sample held-out test set. If accuracy meets your threshold, ship it. 70B is only worth the cost (5-8x more expensive to run) if the task requires multi-step reasoning, complex instruction following, or long-context understanding that 8B genuinely can’t do. For classification, extraction, summarization, and routing: 8B fine-tuned almost always wins on cost/accuracy tradeoff.

Q: How do I handle model versioning safely in production?

  • Tag every Docker image with a semantic version (e.g. llm-inference:llama3-8b-v2.1.0). Never overwrite latest in production.
  • Deploy new model versions to a separate ECS service with 5% of traffic (weighted target groups on ALB).
  • Monitor accuracy metrics and p99 latency for 24-48 hours before shifting full traffic.
  • Keep the previous version image in ECR for fast rollback (change ECS task definition, redeploy).

Q: What monitoring should I set up on day one?

At minimum:

  • SQS: ApproximateNumberOfMessagesVisible (queue depth) - alarm at >500 for 5 min
  • SQS: NumberOfMessagesSent vs NumberOfMessagesDeleted (throughput balance)
  • ECS: GPU utilization per task (via CloudWatch Container Insights)
  • vLLM: request latency p50/p99 (scrape from /metrics endpoint, push to CloudWatch)
  • vLLM: num_requests_waiting (internal queue depth - leading indicator of saturation)
  • AWS Budgets: daily cost alert at 120% of expected spend

Q: Can I use this architecture for streaming responses?

The async queue pattern doesn’t work for streaming - you can’t stream tokens back through SQS. For streaming use cases, use a direct gRPC or WebSocket connection to the vLLM endpoint with a thin load balancer in front. Accept that streaming workloads are harder to scale to zero - you need at least 1 warm instance during expected usage hours. Hybrid is common: queue-based for batch/async tasks, direct connection for interactive streaming.

Q: Is vLLM the only option for serving?

No. Common alternatives:

  • TGI (Text Generation Inference) by HuggingFace - similar performance, better HuggingFace model compatibility
  • Triton Inference Server - better for multi-model serving and custom ops
  • Ollama - simpler setup, good for development, not designed for production throughput
  • LiteLLM - proxy/router layer, useful for multi-provider routing

vLLM wins on raw throughput for OpenAI-compatible API patterns, which is why it’s the most common production choice.

Q: How do I set the right messages_per_worker threshold for my workload?

Start with 50 as a baseline and tune from load test results. The target: with N workers at messages_per_worker=50, the queue should drain faster than it fills at peak. Concretely, if one g5.xlarge processes 60 requests/minute and peak rate is 300 req/min, you need 5 workers. A queue depth of 250 triggers ceil(250/50) = 5 — correct. Reduce messages_per_worker for long-generation workloads (less queued work per worker). Raise it for fast classification tasks. Never exceed: (worker throughput req/min) × (acceptable queue latency seconds / 60).
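That sizing rule is easy to encode. A sketch of both the ceiling formula and the ceil-and-clamp math the scaler uses - the defaults mirror this post’s numbers:

```python
import math

def max_messages_per_worker(throughput_per_min: float,
                            max_queue_latency_s: float) -> int:
    """Upper bound from the rule above: throughput × (latency budget / 60)."""
    return int(throughput_per_min * (max_queue_latency_s / 60.0))

def workers_needed(queue_depth: int, messages_per_worker: int,
                   min_workers: int = 0, max_workers: int = 10) -> int:
    """Same ceil-and-clamp calculation as the WorkerScaler above."""
    if queue_depth <= 0:
        return min_workers
    return max(min_workers,
               min(max_workers, math.ceil(queue_depth / messages_per_worker)))
```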

Q: What’s the difference between SQS-based scaling on ECS versus KEDA on Kubernetes?

Both scale on queue depth — same logic, different implementation. CloudWatch + Application Auto Scaling has a minimum 60-second evaluation window and coarser steps. KEDA polls every 10-30 seconds with continuous scaling. KEDA also gives standard Kubernetes tooling (Helm, kubectl, GitOps) for the GPU fleet. Trade-off: KEDA requires running Kubernetes. Start with ECS + CloudWatch. Migrate to EKS + KEDA when you need finer scaling granularity for spiky workloads or tight latency SLAs.

Q: How do I run this architecture in a HIPAA or SOC2 compliant environment?

The queue-based architecture is compliance-friendly: no data leaves your VPC. Key requirements: (1) Deploy SQS via VPC endpoint — no public internet routing, (2) Encrypt SQS messages with a KMS customer-managed key, (3) Encrypt EBS and S3 at rest, (4) Enable CloudTrail and CloudWatch Logs for all ECS task output, (5) Use least-privilege IAM roles for Lambda→SQS, ECS→SQS, and ECS→DynamoDB, (6) Block outbound internet from GPU instances — load model weights from S3 at container build time, not from HuggingFace Hub at runtime. The async pattern also helps compliance: every request/response is written to DynamoDB and S3, giving you a complete audit trail with timestamps and identity context.


Key Takeaways: LLM Infrastructure on AWS, Azure, and GCP

A reference summary for engineers evaluating or implementing this architecture.

Architecture fundamentals:

  • Queue-based async architecture is the only pattern that handles bursty GPU workloads without OOM crashes
  • Scale on queue depth (SQS ApproximateNumberOfMessagesVisible), never on CPU or GPU utilization
  • Workers pull from the queue at their own pace — never push requests to GPU workers directly
  • The queue is the structural correctness layer, not a performance optimization

Cost fundamentals:

  • Self-hosting breaks even vs. GPT-4o at approximately 200K requests/month
  • Self-hosting loses vs. GPT-4o-mini at almost any volume — if mini handles your task, keep using it
  • Scale-to-zero is the single highest-ROI config change: set min_capacity = 0 and save 40-70% of GPU cost
  • Spot instances cut instance cost 60-70%; SQS visibility timeout handles interruptions automatically

Model selection fundamentals:

  • Fine-tuned 8B models beat frontier API models on well-defined domain tasks (classification, extraction, summarization)
  • Start with 8B quantized (AWQ 4-bit), evaluate against a held-out test set, move to 70B only if accuracy gap is real
  • Multi-LoRA: one base model instance can serve 5-10 fine-tuned variants — critical for multi-team deployments

Operational fundamentals:

  • Cold start: 3-5 minutes minimum for vLLM + 8B model; pre-warm with scheduled EventBridge rule at business hours open
  • Scale-down cooldown: 10 minutes minimum; shorter causes flapping and repeat cold starts
  • DLQ: configure a dead-letter queue from day one; production inference will fail, and you need to know why
  • Max capacity: always set before launch; a bug causing infinite requests + no cap = five-figure AWS bill overnight
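The pre-warm rule from the cold-start bullet can be generated programmatically. This sketch (helper name is ours) emits an EventBridge cron expression that fires a few minutes before business open, so workers are already warm when traffic arrives:

```python
def prewarm_schedule(open_hour_utc: int, cold_start_min: int = 5) -> str:
    """EventBridge cron expression firing cold_start_min minutes before
    business open on weekdays, absorbing the GPU cold-start window."""
    fire = (open_hour_utc * 60 - cold_start_min) % (24 * 60)  # minutes past midnight
    hour, minute = divmod(fire, 60)
    return f"cron({minute} {hour} ? * MON-FRI *)"
```

For a 09:00 UTC open, prewarm_schedule(9) yields "cron(55 8 ? * MON-FRI *)" — a rule you would attach to a scale-up target via EventBridge.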

Monitoring minimums:

  • SQS queue depth (primary scaling metric + main health signal)
  • vLLM num_requests_waiting (internal saturation indicator)
  • vLLM gpu_cache_usage_perc (VRAM pressure before OOM)
  • AWS Budgets daily alert at 150% of expected spend

Should You Self-Host? The Decision Guide

Self-host if:

  • You’re spending >$500/month on LLM APIs today
  • You have ML/DevOps engineers who can own GPU infrastructure
  • You need data privacy or regulatory compliance (HIPAA, SOC2, GDPR)
  • You’re running tasks where a fine-tuned 7B/8B matches or beats the frontier API model on your domain

Stick with APIs if:

  • You’re under 50K requests/month
  • You need frontier reasoning (complex multi-step tasks, open-ended generation)
  • Your team doesn’t have GPU infrastructure experience
  • Speed to market matters more than cost optimization right now

The hybrid approach (recommended for most teams):

  • Self-host for high-volume, well-defined tasks - classification, extraction, summarization
  • Use API providers for complex, low-volume tasks - content generation, complex reasoning
  • Build with an abstraction layer so you can swap between self-hosted and API without changing application code
# Abstraction layer: swap provider with config change
import openai

class LLMClient:
    def __init__(self, provider: str = "self-hosted"):
        if provider == "self-hosted":
            self.client = openai.AsyncOpenAI(
                base_url="http://your-vllm-endpoint:8000/v1",
                api_key="not-needed"
            )
            self.model = "meta-llama/Meta-Llama-3-8B-Instruct"
        else:
            self.client = openai.AsyncOpenAI()  # uses OPENAI_API_KEY
            self.model = "gpt-4o-mini"

    async def complete(self, prompt: str, **kwargs) -> str:
        resp = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        return resp.choices[0].message.content

The investment in self-hosting pays off - but only at scale, with the right workload, and with the operational maturity to run GPU infrastructure. Run the math on your specific numbers before committing.


What We’d Change in V2

This architecture works. These are the four things we’d do differently if we were building it again today, with the benefit of 12 months of production operation.

Move to KEDA on EKS instead of CloudWatch-based ECS scaling. The CloudWatch + Application Auto Scaling path works but has real limitations: scaling decisions take 60–120 seconds because CloudWatch metrics have evaluation windows, and step scaling policies require manual tuning for each queue size range. KEDA on EKS scales on exact queue depth in near-real time (10–30 second polling), supports fractional scaling targets, and has first-class support for SQS, Service Bus, and Pub/Sub without custom metric math. The trade-off is Kubernetes operational complexity — worth it at scale, overkill for the initial deployment.

Add request priority queues. The current architecture treats all requests equally — a low-priority batch job from an internal tool competes with a latency-sensitive user-facing request on the same queue. V2 would run two separate SQS queues: a high-priority queue (user-facing, latency SLA <30s) and a low-priority queue (batch jobs, best-effort). Workers check the high-priority queue first and fall back to low-priority when it’s empty. This is a small operational complexity increase with a large UX improvement for the interactive path.
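The fallback ordering is simple enough to sketch. Here in-memory queues stand in for the two SQS queues — in production each get_nowait would be a receive_message call against the corresponding queue URL:

```python
from queue import Empty, Queue

def next_message(high: Queue, low: Queue):
    """Pull from the high-priority queue first; fall back to low-priority
    only when it is empty. Returns None when both queues are drained."""
    for q in (high, low):
        try:
            return q.get_nowait()
        except Empty:
            continue
    return None
```

The worker loop stays a single loop; only the receive step changes, which is why the operational cost of this upgrade is small.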

Implement speculative decoding for throughput improvement. vLLM supports speculative decoding — using a small draft model (e.g. a 1B model) to propose token sequences that the main model then verifies in parallel. For our classification and extraction workloads with predictable output formats, speculative decoding consistently improved throughput 1.5–2× with no quality degradation. The operational cost is running a tiny draft model alongside the main model; the VRAM overhead is minor. This should have been enabled from the start.

Add per-team cost attribution dashboards from day one. We instrumented the Lambda to log team from the JWT claims into every SQS message. We wrote it to DynamoDB. We never built the dashboard until month 8, when engineering leadership started asking why the GPU bill was growing. A CloudWatch dashboard with cost broken down by team/feature, built in week 1, would have surfaced the 20% of usage from a single internal analytics pipeline that was running without any batching or caching. Attribution makes waste visible. Visible waste gets fixed.
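Tagging each message with its origin is only a few lines. This sketch (our own helper; attribute names are illustrative) builds the send_message kwargs so a downstream consumer can roll up usage by team and feature:

```python
import json

def attributed_message(payload: dict, team: str, feature: str) -> dict:
    """SQS send_message kwargs carrying team/feature as message attributes,
    ready to be aggregated into a per-team cost dashboard downstream."""
    return {
        "MessageBody": json.dumps(payload),
        "MessageAttributes": {
            "team": {"DataType": "String", "StringValue": team},
            "feature": {"DataType": "String", "StringValue": feature},
        },
    }
```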


Conclusion: 6 Lessons From Production

This architecture is what you converge on after making the common mistakes. Here are the six things that matter most:

Lesson 1: The queue is not an optimization - it’s structural correctness. Without a queue, GPU worker crashes are your traffic spike handler. With a queue, bursts become latency instead of downtime. You cannot make this architecture work reliably without it. Every other optimization is secondary.

Lesson 2: Scale on queue depth, not CPU. Always. CPU tells you the failure is happening. Queue depth tells you the failure is coming. The 3-5 minute cold start window means you need a leading indicator, not a lagging one. This single metric change is the difference between “reactive crash loop” and “proactive scale-up.”

Lesson 3: Scale-to-zero is your largest cost lever - implement it first. Most teams skip scale-to-zero early and then are surprised when the bill arrives. For an 8-hour weekday workload, nights and weekends represent 76% of the week. With min_capacity = 0, that’s 76% of your GPU cost eliminated. With a pre-warm schedule for cold starts, users don’t feel it.
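The 76% figure is straight arithmetic:

```python
on_hours = 8 * 5                          # business hours per week
week_hours = 24 * 7                       # 168
off_fraction = 1 - on_hours / week_hours  # fraction of the week with zero demand
print(f"{off_fraction:.1%}")              # -> 76.2%
```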

Lesson 4: A fine-tuned 8B model beats a frontier API model for most production tasks. GPT-4o is a general-purpose model. Your task is not general-purpose. A Llama 3 8B fine-tuned on 5,000 labeled examples of your specific task - classification, extraction, summarization - routinely outperforms GPT-4o-mini zero-shot on that task at 60-80% lower cost. The fine-tuning investment pays back within weeks.

Lesson 5: Build the abstraction layer before you need it. When your self-hosted model goes down at 2am, your circuit breaker should automatically route to an API provider. When you want to A/B test a new model version, you should be able to shift traffic with a config change. Wrapping your LLM client with a single base_url and model config value costs one hour to implement and will save you from incidents.
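A minimal version of that failover is sketched below — real code would add retry budgets, timeouts, and error classification; `primary` and `fallback` here are any two clients exposing an async `complete` method like the abstraction layer's:

```python
async def complete_with_fallback(primary, fallback, prompt: str) -> str:
    """Try the self-hosted endpoint first; route to the API provider on failure."""
    try:
        return await primary.complete(prompt)
    except Exception:
        # Self-hosted path down (cold start, OOM, deploy) -- fail over to the API.
        return await fallback.complete(prompt)
```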

Lesson 6: Treat fine-tuning as infrastructure, not a project. A fine-tuning sprint that ends with a shipped model and no ongoing process is technical debt with a timer. Request patterns drift, new categories emerge, and edge cases accumulate in your dead-letter queue. The teams that compound the most value from self-hosting are the ones that operationalize the feedback loop: log everything, label a sample monthly, retrain on drift, redeploy with blue/green. Treat model accuracy regression the same way you treat p99 latency regression — as an operational signal that requires a response, not a project retrospective item.

Numbers to know

Metric                                       Value
GPU cold start time (vLLM + 8B model)        3-5 minutes
Throughput on g5.xlarge (short prompts)      50-100 req/min
Redis prompt cache hit rate (production)     15-30%
Scale-to-zero savings (8hr/day workloads)    ~60-70% of GPU cost
Breakeven vs GPT-4o API                      ~200K req/month
Annual savings at 2M req/month vs GPT-4o     ~$50,000
Scale-down cooldown minimum                  10 minutes
Messages per worker target (queue depth)     50 per active GPU worker

The architecture here is what production converges to after you’ve run into the failure modes. Build the queue first, cap your max instance count before launch, test cold start behavior under realistic load, and let the scaling math drive infrastructure decisions - not intuition about CPU.

Mohan Appalabhaktula

Software engineer writing about AI, distributed systems, and the craft of building great software.

Sai Savitha Rasmi

Data engineer exploring AI infrastructure, cloud systems, and smarter data pipelines.
