
LLMs are rapidly becoming default tooling across finance, from research workflows to internal engineering support. But the gap between “impressive demo” and “safe deployment” is still wide, especially in regulated trading environments where reliability, traceability, and controls matter as much as raw velocity.
We spoke with Ilya Navogitsyn, a Quantitative Developer working at the intersection of trading, quant research, and production engineering, about what LLMs are actually good for today, and what they’re not.
Q: Where are LLMs genuinely useful in quant work today: research, code, documentation, incident response, or post-trade analysis?
Ilya: LLMs are best at compressing thinking, cutting time to first draft and time to diagnosis. They’re not great at inventing alpha, but they remove friction around the work that surrounds it.
Where they’re genuinely useful: producing first drafts of code and documentation, summarizing research and post-trade analysis, and speeding up diagnosis during incident response.
They’re weakest where precision and deep field knowledge matter most: signal discovery, new hypothesis generation, and live decision-making. LLMs don’t reliably generate novel hypotheses. They help you narrow down or broaden what you already put in front of them, which is useful, but it’s not the same as doing original research.
Q: What’s the biggest misconception executives have about “LLMs will make us faster” in a regulated trading environment?
Ilya: Executives often confuse “writing code faster” with “shipping safely.” In regulated trading, speed is limited by auditability, testing, approvals, and risk controls, not by how quickly someone can produce code.
There’s also a trap: LLMs produce a lot of plausible output extremely quickly. That can increase the need for review, because now you have more surface area to verify. Without strong internal processes, LLMs don’t accelerate teams: they either slow you down with extra verification work or, worse, they increase risk, which is a far worse outcome than being slow.
Q: What’s your baseline safety bar before an LLM touches anything close to production decisions?
Ilya: My baseline is simple: if I can’t explain exactly why the system did something, it shouldn’t be anywhere near production decisions.
Before an LLM touches anything close to live trading, it needs hard boundaries, full observability, and the ability to fail loudly. Every output must be reviewable, reproducible, and easy to say “no” to.
And if a bad decision can lose real money quickly, which describes most of trading, there has to be a human in the loop with veto power.
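To make that bar concrete, here is a minimal Python sketch of a hard boundary with full logging and a human veto. The names (ActionProposal, propose_action, require_human_approval) and the plain log output are illustrative assumptions, not a description of any real desk’s stack.

```python
# A minimal sketch, not a real trading stack: every model call becomes a logged,
# reproducible proposal that a human must explicitly approve before anything acts on it.
import json
import logging
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_guardrail")

@dataclass
class ActionProposal:
    proposal_id: str
    created_at: str
    inputs: dict      # exactly what the model saw, so the run is reproducible
    output: str       # raw model output; never executed directly
    approved: bool = False

def propose_action(model_call, inputs: dict) -> ActionProposal:
    """Run the model, persist the full record, and return a proposal awaiting approval."""
    proposal = ActionProposal(
        proposal_id=str(uuid.uuid4()),
        created_at=datetime.now(timezone.utc).isoformat(),
        inputs=inputs,
        output=model_call(inputs),
    )
    # Fail loudly and leave a trail: log the whole proposal before anyone can act on it.
    log.info("proposal=%s", json.dumps(asdict(proposal)))
    return proposal

def require_human_approval(proposal: ActionProposal) -> bool:
    """A human with veto power decides; the default answer is no."""
    answer = input(f"Approve proposal {proposal.proposal_id}? [y/N] ").strip().lower()
    proposal.approved = answer == "y"
    return proposal.approved
```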
Q: Copilot vs. agent: what’s the first workflow you’d let an “agentic” system own end-to-end, and what’s absolutely off-limits?
Ilya: I’d first trust an agent with running experiments and writing reports: backtests, validations, summaries. That work is time-consuming and relatively easy to review.
What’s off-limits is anything that can place trades, change risk limits, or push to production. In high-volatility environments, the downside is asymmetric. If the system is wrong, it can lead to significant losses very quickly, and the failure modes are not always obvious until it’s too late.
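A rough sketch of that split, assuming a simple tool-dispatch layer in Python; the tool names and the allow/deny sets are hypothetical. The point is that the boundary is enforced in code rather than by convention.

```python
# Illustrative tool-dispatch policy: research workflows are allowed end-to-end,
# anything that touches orders, risk limits, or production deploys is rejected.
ALLOWED_TOOLS = {"run_backtest", "run_validation", "write_report"}
FORBIDDEN_TOOLS = {"place_order", "change_risk_limit", "deploy_to_production"}

def dispatch(tool_name: str, tool_args: dict, registry: dict):
    """Route an agent's tool request through a hard allowlist."""
    if tool_name in FORBIDDEN_TOOLS:
        raise PermissionError(f"{tool_name} is off-limits for agents")
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"{tool_name} is not on the allowlist")
    return registry[tool_name](**tool_args)
```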
Q: How do you evaluate an LLM system on a desk: accuracy, calibration, latency, cost, auditability, or “did it avoid one catastrophic mistake”?
Ilya: All of those metrics matter, but on a trading desk they’re secondary. The real question is: did it avoid a catastrophic mistake?
I care much more about calibration and failure modes than raw accuracy. A system that’s occasionally wrong but clearly uncertain is safer than one that’s confidently wrong, because trusting the latter can lead to unacceptable losses.
Latency and cost matter once it’s proven safe. But auditability is non-negotiable. If you can’t reconstruct what it saw and why it produced an output, you don’t have a system you can trust.
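As an illustration of that auditability requirement, here is a minimal sketch of an append-only audit record that captures what the model saw and what it produced; the field names and the JSONL store are assumptions made for the example, not a prescribed format.

```python
# Illustrative append-only audit trail: each model call is stored with enough
# context to reconstruct what it saw and what it produced.
import hashlib
import json
from datetime import datetime, timezone

def write_audit_record(prompt: str, context_docs: list[str], model_id: str,
                       output: str, path: str = "llm_audit.jsonl") -> dict:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "prompt": prompt,
        # Hashes identify the exact documents the model saw without duplicating them.
        "context_sha256": [hashlib.sha256(d.encode()).hexdigest() for d in context_docs],
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```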
Q: What does a good “LLM incident” look like? How do you detect it, contain it, and learn without killing experimentation?
Ilya: A good LLM incident should be boring: it gets caught early, nothing ships, and nobody loses money.
You usually spot it because something looks off: inputs are stale, outputs don’t make sense, confidence jumps when it shouldn’t. You contain it with guardrails and fallbacks that already exist, not with ad-hoc heroics.
Then you write it up, fix the gap, and move on. The goal isn’t to avoid every mistake. The goal is to ensure mistakes don’t turn into surprises, and don’t escape into production.
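The early detection Ilya describes can be as simple as pre-flight checks that route to an existing fallback. This sketch assumes illustrative thresholds for input staleness and confidence jumps, which in practice would be tuned to the desk.

```python
# Illustrative pre-flight checks: stale inputs or a sudden confidence jump block
# the output and send the workflow to its existing fallback path.
from datetime import datetime, timedelta, timezone

MAX_INPUT_AGE = timedelta(minutes=5)   # illustrative threshold
MAX_CONFIDENCE_JUMP = 0.3              # illustrative threshold

def preflight_ok(input_timestamp: datetime, confidence: float,
                 previous_confidence: float) -> bool:
    """Return False (use the fallback) when inputs are stale or confidence jumps."""
    if datetime.now(timezone.utc) - input_timestamp > MAX_INPUT_AGE:
        return False
    if abs(confidence - previous_confidence) > MAX_CONFIDENCE_JUMP:
        return False
    return True
```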
Q: If everyone gets the same foundation model, where does the edge move: proprietary data, evaluation, or governance?
Ilya: The edge moves to your data, your evaluation, and your process discipline.
The firms that win won’t have the “best” LLM; they’ll be the ones who know when not to trust it. The temptation to ship something that sounds smart has to go through a cold reality check. And right now, only humans can do that reliably.