
The Data Problem That Comes Before the Model in Investment AI

Source: openai

Back in early March, OpenAI published a case study on Balyasny Asset Management, detailing how the multi-strategy hedge fund built an AI research engine using GPT-5.4, structured agent workflows, and a rigorous internal evaluation framework. I’ve reread the piece a few times since it came out, and the model choice and agent design get the most attention. But the part that keeps pulling me back is the part the case study barely touches: what the information layer actually looks like, and why it is the harder problem.

Balyasny is a multi-strategy fund, which matters because it shapes the scope of what an AI research engine has to handle. A focused long/short equity fund with thirty names in the portfolio has a tractable information problem. You can almost manage that manually. Balyasny runs dozens of investment teams across equity, fixed income, merger arbitrage, macro, and commodities strategies, covering thousands of securities globally. The information problem at that scale is not just “how do we summarize documents faster.” It’s closer to: how do you build a retrieval system that can surface the right documents, in the right temporal context, across heterogeneous sources, for any analyst on any desk, on demand.

The Point-in-Time Problem

Anyone who has built backtesting infrastructure for a quantitative fund has internalized the idea of point-in-time data. The intuition is simple: when you’re evaluating what a model would have done in January 2023, you can only use information that actually existed in January 2023. This sounds obvious, but financial data violates it constantly.

Companies restate earnings. Index compositions change. Ratings get revised retroactively in databases. Analysts publish reports that reference not-yet-public information. If your system pulls current data and projects it backwards for evaluation purposes, you introduce look-ahead bias. Every evaluation metric you compute is optimistic. Every workflow you validate is validating against a fiction.

For a traditional quantitative fund using structured data, this problem has established tooling. Vendors like FactSet and Compustat maintain point-in-time fundamental databases, preserving historical snapshots of data as it was originally reported before any revisions. You query against a timestamp, and you get the world as it looked then.

For an AI research engine operating over unstructured text, the problem is messier. Earnings call transcripts need timestamps tied to when they were published, not to the fiscal period the earnings cover. News articles need to be tagged not just by publish date but by what information was actually in the public domain at the time of writing. An analyst note from a sell-side desk might reference preliminary data that was later corrected. The retrieval corpus has to carry enough temporal metadata to enforce point-in-time consistency at query time.
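
To make the distinction concrete, here is a minimal sketch of a point-in-time filter over a text corpus. It is my own illustration, not anything described in the case study: the Document class, its field names, and the sample records are all hypothetical. The point it shows is that the filter keys on publication time, never on the fiscal period a document describes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    # When the document became publicly available (hypothetical field name).
    published_at: datetime
    # Fiscal period the document describes. Deliberately NOT used for filtering.
    period_end: datetime


def utc(year: int, month: int, day: int) -> datetime:
    """Small helper so every timestamp in the example is timezone-aware."""
    return datetime(year, month, day, tzinfo=timezone.utc)


def visible_as_of(corpus: list[Document], as_of: datetime) -> list[Document]:
    """Return only documents that had already been published at `as_of`."""
    return [doc for doc in corpus if doc.published_at <= as_of]


# Illustrative records; the dates are made up for the example.
corpus = [
    Document("q3-2022-call", "Q3 2022 earnings call transcript",
             published_at=utc(2022, 11, 1), period_end=utc(2022, 9, 30)),
    Document("q4-2022-call", "Q4 2022 earnings call transcript",
             published_at=utc(2023, 2, 2), period_end=utc(2022, 12, 31)),
]

# Evaluating "as of" mid-January 2023: the Q4 period has ended, but its
# transcript has not been published yet, so it must stay invisible.
visible = visible_as_of(corpus, utc(2023, 1, 15))
print([doc.doc_id for doc in visible])  # ['q3-2022-call']
```

Filtering on period_end instead would have let the Q4 transcript leak into a January evaluation, which is exactly the look-ahead bias described above.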

Building this correctly is not glamorous engineering. It’s the kind of thing that gets underspecified in early prototypes and then causes subtle, hard-to-diagnose evaluation errors for months. The model has nothing to do with it.

Corpus Heterogeneity at Fund Scale

The documents that matter for investment research don’t share a format. SEC EDGAR has structured filing metadata but unstructured prose inside. A 10-K annual report might run two hundred pages, dense with accounting notes and legal language. An 8-K material event filing might be three paragraphs. Earnings call transcripts are conversational and speaker-tagged. News feeds are high-volume, mixed-quality, and require duplicate detection across wire services. Alternative data sources, such as satellite imagery analysis, app download trends, and credit card transaction summaries, are often structured but domain-specific, and they require their own ingestion and normalization pipelines.

A retrieval system that works well across all of these is doing a lot of behind-the-scenes work before any language model sees a token. Chunking strategies that work for a 10-K are wrong for a news feed. Embedding models that perform well on financial prose may degrade on tabular data extracted from filings. Metadata schemas that capture what you need for equities research may be wrong for merger arbitrage, where the critical facts are deal terms, regulatory status, and closing probability rather than revenue growth and margin trends.

This is one of the less-discussed reasons why agent workflows make sense in this context. A single agent with a single retrieval mechanism will be miscalibrated for some large fraction of queries. Routing different research tasks to agents with specialized retrieval configurations (different chunking logic, different embedding models optimized for their domain) gets you closer to a system that performs consistently across the breadth of a multi-strategy book.
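
As a rough sketch of what that routing could look like, the snippet below maps a task type to its own retrieval configuration. Everything in it is an assumption made for illustration: the task names, chunk sizes, metadata fields, and embedding model labels are placeholders, not real model names and not anything Balyasny has described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    chunk_tokens: int                 # target chunk size
    chunk_overlap: int                # tokens shared between adjacent chunks
    embedding_model: str              # placeholder label, not a real model name
    metadata_fields: tuple[str, ...]  # fields the retriever can filter on


# Hypothetical per-task settings. The numbers are illustrative, not tuned.
CONFIGS = {
    "annual_filing": RetrievalConfig(
        1200, 200, "filings-embedder", ("ticker", "filing_type", "section")),
    "news": RetrievalConfig(
        200, 0, "news-embedder", ("ticker", "source", "published_at")),
    "merger_arb": RetrievalConfig(
        600, 100, "filings-embedder", ("deal_id", "deal_terms", "regulatory_status")),
}

# Fallback for task types nobody has configured yet.
DEFAULT_CONFIG = RetrievalConfig(500, 50, "general-embedder", ("ticker",))


def route(task_type: str) -> RetrievalConfig:
    """Pick the retrieval configuration for a research task type."""
    return CONFIGS.get(task_type, DEFAULT_CONFIG)


print(route("merger_arb").metadata_fields)
print(route("macro_note") is DEFAULT_CONFIG)  # True
```

The design choice worth noticing is that the merger arbitrage entry differs mainly in its metadata schema, which mirrors the point above that the critical facts change by strategy, not just the document length.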

Scope Reduction at Scale

There’s another problem that becomes visible only at fund scale: the model can’t read everything, and neither can the analyst. For any given investment question, the potentially relevant corpus might contain thousands of documents. The research engine has to reduce that scope intelligently before synthesis happens.

In practice this is a two-stage problem. First, broad retrieval: identify the candidate documents that might be relevant. Second, relevance ranking and filtering: score that candidate set and pass only the most relevant subset to the synthesis stage. This is standard RAG architecture, but the ranking step matters enormously in finance, because a model that is handed incomplete context and asked to synthesize confidently tends to hallucinate to fill the gaps.
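
The two stages can be sketched in a few lines. This is a toy under stated assumptions: term overlap stands in for the embedding search of stage one, and Jaccard similarity stands in for a learned reranker in stage two. The function names, the corpus, and the query are all invented for the example.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenizer; a stand-in for real text processing."""
    return set(text.lower().split())


def broad_retrieve(query: str, corpus: dict[str, str], min_overlap: int = 1) -> list[str]:
    """Stage 1: cheap, recall-oriented. Keep any doc sharing a query term."""
    q = tokenize(query)
    return [doc_id for doc_id, text in corpus.items()
            if len(q & tokenize(text)) >= min_overlap]


def rerank(query: str, candidates: list[str], corpus: dict[str, str],
           top_k: int = 2) -> list[str]:
    """Stage 2: precision-oriented. Score candidates and keep the best top_k."""
    q = tokenize(query)

    def score(doc_id: str) -> float:
        terms = tokenize(corpus[doc_id])
        return len(q & terms) / len(q | terms)  # Jaccard similarity

    return sorted(candidates, key=score, reverse=True)[:top_k]


CORPUS = {
    "a": "supplier concentration risk in the semiconductor segment",
    "b": "supplier revenue grew in the quarter",
    "c": "dividend policy remained unchanged",
    "d": "risk factors include supplier concentration and currency exposure",
}
QUERY = "supplier concentration risk"

candidates = broad_retrieve(QUERY, CORPUS)
print(candidates)                        # ['a', 'b', 'd']
print(rerank(QUERY, candidates, CORPUS))  # ['a', 'd']
```

Document "b" survives the broad stage on a single shared word and is dropped by the ranking stage, which is the behavior you want: stage one errs toward recall, stage two decides what the model actually sees.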

A useful heuristic here is to treat missing information differently from negative information. If the retrieval system finds no documents about a specific risk factor the analyst asked about, the correct output is “no relevant documents found” rather than a synthesized answer that extrapolates from tangentially related material. Getting this distinction right in the agent workflow requires explicit handling, not just prompting the model to be cautious.
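
One way to make that handling explicit is to decide in ordinary code, before the model is ever called, whether there is enough support to synthesize at all. The sketch below is a hypothetical illustration: the Status enum, the ResearchResult type, the answer_question function, and the 0.5 relevance threshold are all my assumptions, and the synthesize callable stands in for whatever model call a real system would make.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    ANSWERED = "answered"
    NO_RELEVANT_DOCUMENTS = "no_relevant_documents"


@dataclass(frozen=True)
class ResearchResult:
    status: Status
    answer: Optional[str]
    sources: tuple[str, ...]


def answer_question(
    question: str,
    ranked: list[tuple[str, float]],  # (doc_id, relevance score) pairs
    synthesize: Callable[[str, tuple[str, ...]], str],
    min_score: float = 0.5,           # arbitrary threshold for the example
) -> ResearchResult:
    """Call the synthesizer only when some document clears the threshold."""
    supported = tuple(doc_id for doc_id, score in ranked if score >= min_score)
    if not supported:
        # Missing information is reported as missing, not papered over.
        return ResearchResult(Status.NO_RELEVANT_DOCUMENTS, None, ())
    return ResearchResult(Status.ANSWERED, synthesize(question, supported), supported)


def fake_synthesize(question: str, doc_ids: tuple[str, ...]) -> str:
    """Stand-in for a model call."""
    return f"Synthesis from {len(doc_ids)} document(s)"


weak_hits = [("news-17", 0.21), ("10k-2022", 0.34)]
result = answer_question("What is the litigation exposure?", weak_hits, fake_synthesize)
print(result.status.value)  # no_relevant_documents
```

Because the empty case returns before synthesize is invoked, the model never gets the chance to extrapolate from tangential material, which is a stronger guarantee than a cautionary line in the prompt.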

GPT-5.4’s longer context window matters here, but not in the way people usually frame it. The benefit isn’t just “you can fit more in.” It’s that you can pass more of the candidate document set to the synthesis stage without aggressive filtering, which reduces the risk of the ranking step discarding something important. For financial documents that are long, densely cross-referenced, and where the critical sentence might be in footnote forty-seven of a 10-Q, this is a real capability improvement over models with shorter context windows.

The Evaluation Problem When Ground Truth Is Noisy

The investment research domain has a structural challenge that doesn’t appear in most LLM evaluation setups: ground truth is ambiguous and delayed. Whether a research synthesis was correct is often not determinable until months or years later, and even then the outcome is confounded by factors unrelated to research quality. A correct thesis can lose money because of macro shocks. A flawed thesis can make money in the short term because the market moved for unrelated reasons.

This means evaluation has to operate at the level of process quality rather than outcome quality. Did the system accurately represent what was in the source documents? Did it surface the relevant risk factors? Did it flag when its retrieval was incomplete? These are evaluable without knowing whether the investment worked out.
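
Two of those questions can be turned into simple checks. The sketch below is an assumption-heavy illustration, not a description of Balyasny's framework: it supposes the system emits each claim alongside a verbatim supporting quote and a source id, and that a domain expert has labeled which risk factors a correct synthesis should surface. The Claim type and both scoring functions are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    statement: str
    source_id: str
    quote: str  # verbatim span the system says supports the statement


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def grounded_fraction(claims: list[Claim], sources: dict[str, str]) -> float:
    """Share of claims whose quote really appears in the cited source."""
    if not claims:
        return 0.0
    grounded = 0
    for claim in claims:
        quote = normalize(claim.quote)
        source_text = normalize(sources.get(claim.source_id, ""))
        # An empty quote never counts as support.
        if quote and quote in source_text:
            grounded += 1
    return grounded / len(claims)


def risk_coverage(expected: set[str], surfaced: set[str]) -> float:
    """Share of expert-labeled risk factors that the synthesis surfaced."""
    return len(expected & surfaced) / len(expected) if expected else 1.0


sources = {"10q": "Revenue declined 4% year over year.\nThe company faces supplier concentration risk."}
claims = [
    Claim("Revenue fell 4%", "10q", "Revenue declined 4% year over year."),
    Claim("Margins expanded", "10q", "Gross margin expanded 200 basis points"),
]
print(grounded_fraction(claims, sources))                                      # 0.5
print(risk_coverage({"supplier concentration", "fx"}, {"supplier concentration"}))  # 0.5
```

Neither score says anything about whether the trade made money. Both can be computed the day the synthesis is produced, which is what makes process-level evaluation workable when outcomes are delayed and confounded.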

Building this kind of process evaluation is labor-intensive. It requires domain experts to construct benchmarks tied to real research tasks, with human-labeled ground truth about what a correct synthesis looks like. The investment in this infrastructure, not just the model integration, is what separates an AI research engine from an AI research toy.

The Balyasny case study describes rigorous model evaluation as a core design principle. From the outside, here is what that most likely looks like in practice: a library of task-specific benchmarks built against real research workflows, run systematically before any model or configuration change goes near production. That kind of infrastructure is slow to build and invisible in a press release, which is why it’s usually the part that gets cut from the summary.

The broader lesson, applicable well outside finance, is that the data layer and the evaluation layer are the long poles in the tent for any AI system operating in a high-stakes domain. The model selection is often the easiest decision in the stack.
