You tap “Run” on a GPT‑powered assistant and then watch the spinner. Seconds stretch into minutes, token counts climb, and your OpenAI invoice creeps higher. Latency and cost have become the invisible tax on the large language model boom, especially when a single tough query can trigger thousands of fresh inference tokens. A new research proposal called sleep‑time compute argues that those tokens are often spent in the wrong phase of the workflow. Instead of cramming all reasoning into the moment the user hits Enter, why not let the model “think” during its idle hours, transform raw context into reusable insight, and slash the bill when the real question finally arrives?
The idea feels familiar to anyone who ever scheduled a database index or compiled code before shipping: preprocess while nobody is looking, respond instantly when they are. Yet applying that mindset to language models requires fresh benchmarks, careful accounting, and proof that offline effort transfers to online accuracy. Kevin Lin and colleagues from Letta and UC Berkeley supply exactly that evidence in “Sleep‑time Compute: Beyond Inference Scaling at Test‑time,” and their numbers suggest a rethink of how enterprise AI products budget GPU cycles.
Traditional test‑time scaling tells an LLM to work harder when the question is hard: sample multiple chains of thought, extend the reasoning trace, rerank responses, or fork dozens of candidate answers in parallel. Those tricks boost accuracy for math, coding, and knowledge tasks, but they also inflate latency and cost. Users wait; vendors pay. Worse, the paradigm assumes each query is a stateless one‑off that arrives with its full context in the same request.
In the real world, contexts persist. Customer‑support bots reread the same knowledge base, coding agents navigate the same repository, and research copilots revisit a shared document corpus. The authors argue that in these stateful settings, enormous chunks of reasoning are performed redundantly. Sleep‑time compute exploits that redundancy by letting the model pre‑parse the context during idle windows, create a distilled, inference‑ready representation, and store it for later reuse. When the user finally asks, the LLM answers in a fraction of the tokens because much of the heavy lifting is already baked into the prompt.
Why sleep‑time compute rewrites the cost curve

The researchers formalize the workflow in two phases. During sleep‑time the model sees only the context c, predicts likely angles of interest, and produces a rewritten context c′ that contains intermediate deductions, structured summaries, or cached chain‑of‑thought snippets. During test‑time the user’s query q arrives. The model now receives c′ instead of the raw context and can reach the correct answer with a far smaller compute budget b. Because idle hours are cheap and parallelizable, the organization pays low‑priority rates for the preprocessing and preserves premium inference capacity for user‑facing responsiveness.
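To make the split concrete, here is a minimal sketch of the two phases in Python. The `call_llm` helper and the prompt wording are placeholders of our own, not the paper’s implementation; the point is only the shape of the pipeline: enrich c into c′ offline, then answer q against c′ with a small budget b.

```python
# Minimal sketch of the two-phase workflow. `call_llm` stands in for whatever
# chat-completion client you use; the prompts are illustrative only.

def call_llm(prompt: str, max_tokens: int) -> str:
    """Placeholder for a real LLM call (e.g., an OpenAI or Anthropic client)."""
    raise NotImplementedError

def sleep_time(context: str, budget_tokens: int = 4096) -> str:
    """Offline phase: turn the raw context c into an enriched context c'."""
    prompt = (
        "You will later be asked questions about the context below.\n"
        "Pre-compute useful intermediate results: summaries, derived facts, "
        "and partial chains of thought.\n\n"
        f"Context:\n{context}"
    )
    notes = call_llm(prompt, max_tokens=budget_tokens)
    # c' = raw context plus cached reasoning, stored for reuse.
    return f"{context}\n\n# Pre-computed notes\n{notes}"

def test_time(c_prime: str, query: str, budget_tokens: int = 256) -> str:
    """Online phase: answer q against c' under a small token budget b."""
    prompt = f"{c_prime}\n\nQuestion: {query}\nAnswer concisely."
    return call_llm(prompt, max_tokens=budget_tokens)
```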
To quantify the benefit, the team split two classic math‑reasoning suites, GSM‑Symbolic and AIME, into Stateful variants where every problem is decomposed into a context paragraph and a separate question. They also built Multi‑Query GSM‑Symbolic, in which each context spawns several related questions, mimicking a user who keeps poking at the same document. The evaluation compared GPT‑4o, GPT‑4o‑mini, o1, o3‑mini, Claude Sonnet, and DeepSeek‑R1 under three conditions: standard test‑time scaling, sleep‑time compute with different offline budgets, and pass@k parallel sampling.
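As a rough illustration of what a stateful benchmark item looks like, a decomposed problem can be modeled as a shared context plus one or more questions. The dataclass and the toy word problem below are invented for illustration, not items from the released benchmarks.

```python
from dataclasses import dataclass, field

@dataclass
class StatefulProblem:
    """One benchmark item: the context is shared, questions arrive separately."""
    context: str          # e.g., the narrative setup of a GSM-Symbolic problem
    questions: list[str]  # one question (Stateful GSM/AIME) or several (Multi-Query)
    answers: list[str] = field(default_factory=list)

# Toy example of the decomposition (not an actual dataset item).
example = StatefulProblem(
    context="A bakery sells 12 muffins per tray and bakes 7 trays each morning.",
    questions=[
        "How many muffins does the bakery bake each morning?",
        "If 15 muffins go unsold, how many are sold?",
    ],
)
```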
What the experiments show

Across every model except o1, the sleep‑time strategy pushed the accuracy‑per‑token frontier outward. On Stateful GSM‑Symbolic and Stateful AIME the authors report:

- roughly 5x less test‑time compute to reach the same accuracy as standard test‑time scaling;
- accuracy gains of up to 13 percent on Stateful GSM‑Symbolic and 18 percent on Stateful AIME when the sleep‑time budget is scaled up;
- and, on Multi‑Query GSM‑Symbolic, an average cost per query about 2.5x lower once the offline work is amortized across related questions.
Perhaps more striking, sleep‑time compute beat the canonical pass@k trick at equal test‑time budgets. Pass@k assumes an oracle verifier can instantly pick the best of k sampled answers, an unrealistic crutch in production. Sleep‑time compute reaches higher accuracy without that luxury because the heavy reasoning already lives in c′.
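For reference, pass@k is usually measured with the standard unbiased estimator from the code‑generation literature: draw n samples, count the c correct ones, and compute the probability that a random subset of k contains at least one success. The snippet below shows that estimator; the paper’s comparison simply grants pass@k a free oracle to pick the winning sample, which is exactly the assumption that rarely holds in production.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: P(at least one of k samples is correct),
    given n total samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 16 samples, 3 correct -> pass@4
print(round(pass_at_k(16, 3, 4), 3))
```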
The payoff is sensitive to how predictable the eventual question is. When the researchers binned GSM items by the log probability that Llama‑2 assigned to the question given the context, the accuracy delta between sleep‑time and baseline widened for the most predictable quintile. In plain English: the more obvious the follow‑up question, the bigger the win from preparing your homework in advance.
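A sketch of that predictability signal: score the question’s log probability under a proxy language model, conditioned on the context, and use the score to decide how much offline effort a context deserves. The snippet uses GPT‑2 via Hugging Face transformers purely to keep the example small; the paper’s analysis used a Llama‑2 model.

```python
# Score log p(question | context) with a small proxy LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def question_logprob(context: str, question: str) -> float:
    """Sum of log-probabilities of the question tokens, given the context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    q_ids = tok(" " + question, return_tensors="pt").input_ids  # leading space for GPT-2 BPE
    ids = torch.cat([ctx_ids, q_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    # logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    q_start = ctx_ids.shape[1]
    target = ids[0, q_start:]
    preds = log_probs[q_start - 1 : q_start - 1 + target.shape[0]]
    return preds.gather(1, target.unsqueeze(1)).sum().item()
```

Binning items by this score and comparing accuracy within each bin reproduces the qualitative pattern the authors describe: the higher the conditional likelihood of the question, the more sleep‑time compute helps.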
Numbers are one thing; product implications are another. The authors run a real repository test called SWE‑Features in which an agent must modify three or more files to implement a feature. With only low test‑time budgets, sleep‑time compute cut token use by about 50 percent while matching F1, meaning faster merges and lower GPU bills on continuous‑integration bots. At very high budgets, classic test‑time reasoning regained a slight edge in precision, suggesting a hybrid policy: allocate offline compute aggressively when latency matters or when contexts will be reused, and fall back to rich online chains only for one‑off or highly unpredictable queries.
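One way to encode that hybrid policy is a simple gating function. Everything below, including the threshold values, is a hypothetical heuristic of our own rather than anything the paper prescribes.

```python
# Hypothetical gate: pre-compute offline only when it is likely to pay off.
def should_precompute(expected_queries: int,
                      latency_sensitive: bool,
                      question_logprob: float,
                      logprob_threshold: float = -40.0) -> bool:
    if expected_queries >= 2:
        return True                    # cost amortizes across related queries
    if latency_sensitive and question_logprob > logprob_threshold:
        return True                    # predictable question, user is waiting
    return False                       # one-off, surprising query: reason online
```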
The framework also opens doors for synthetic data generation. If sleep‑time reasoning produces rich natural‑language representations of a codebase or document, those artifacts themselves become training data for future fine‑tuning—a virtuous loop where offline thinking seeds the next generation of model improvements without scraping more internet text.
Operationally, the technique invites engineering questions. How often should the context cache refresh? How large can c′ grow before it cancels the token savings? Which idle cycles are really free in a shared cluster? Yet none of these hurdles look as formidable as the current reality of paying real‑time prices for redundant reasoning. Enterprises that already schedule nightly builds, search‑index crawls, or materialized views have mental models for this optimization.
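A back‑of‑the‑envelope model helps with the size question: the enriched context pays off only while the reasoning tokens it saves per query outweigh the extra prompt tokens it adds, once the (discounted) offline cost is amortized. The token counts and the discount factor below are made up for illustration.

```python
# Break-even check for an enriched context c' (illustrative numbers only).
def precompute_pays_off(sleep_tokens: int,            # tokens spent building c'
                        extra_prompt_tokens: int,      # |c'| - |c|, re-read on every query
                        saved_reasoning_tokens: int,   # online tokens avoided per query
                        expected_queries: int,
                        offline_discount: float = 0.5) -> bool:
    offline_cost = sleep_tokens * offline_discount     # idle/batch pricing
    per_query_delta = saved_reasoning_tokens - extra_prompt_tokens
    return per_query_delta * expected_queries > offline_cost

# Example: 3,000 offline tokens, 800 extra prompt tokens per query,
# 1,500 reasoning tokens saved per query, 5 expected queries -> True.
print(precompute_pays_off(3000, 800, 1500, 5))
```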
Where offline thinking fits next

Sleep‑time compute is not a silver bullet. Queries that blindside the system or contexts that mutate too rapidly will still demand fresh chains of thought. The paper itself flags open research into adaptive policies that predict when offline investment will pay off, perhaps by estimating context entropy or user intent distribution. Even so, the core takeaway stands: large language models do not need to think only when the user is watching. By borrowing an age‑old computing trick (do tomorrow’s work tonight), developers can cut latency, shrink bills, and still climb the accuracy ladder.
The upshot: Your next LLM feature might not require a bigger model or a deeper reasoning budget. It might simply require letting the model sleep on the problem first.