At NetEase Games, we learned a hard lesson about large language model (LLM) inference in production: elastic compute is only useful if data can move just as fast.
On paper, serverless GPU infrastructure looked like a good fit for inference workloads. Game traffic is bursty, peaks differ by title and time of day, and reserving GPU capacity for every possible spike is expensive. But …