You Scaled the Wrong Thing We hit the wall six weeks after shipping LLM inference to production. Not a crash, not an outage - just latency quietly climbing past SLO while every metric we were watching looked fine. CPU normal. Pod count healthy. Request rate within bounds. It took an hour of digging to find the actual problem: our autoscaler was treating a 200-token summary request and an 8,000-token document analysis as identical units of work. …