
We tested 6 LLMs on SEC filing extraction

Financial Datasets Team · 5 min read

Introducing EarningsBench

A single inaccurate data point can result in billions of dollars in losses. This is why hedge funds, asset managers, and investment banks spend millions on trusted, institutional-grade data from a handful of data providers.

At Financial Datasets, we're building a new key performance indicator (KPI) product. KPIs are the operational metrics companies report in earnings releases that reveal how a business is actually performing: annual recurring revenue, same-store sales, and passenger seat miles.

Unlike financial statements, KPIs aren't standardized. They live in earnings releases, press releases, and earnings call transcripts with inconsistent formatting, and every company reports them differently, which makes extraction at scale hard.

Our accuracy bar at Financial Datasets is 100%. So, we built EarningsBench, a new benchmark to find out which LLM can accurately extract structured KPIs from real earnings releases, across different sectors, tickers, and formats.

Benchmark design

Dataset: 60 earnings releases across 12 sectors, 1 ticker per sector, covering multiple quarters.

Sectors: Airlines (DAL), Banks (JPM), Biotech (AMGN), Industrials (GE), Insurance (PGR), Oil & Gas (XOM), Pharma (LLY), REITs (PLD), Restaurants (MCD), Retail (WMT), SaaS (DDOG), Semiconductors (NVDA).

Task: We define a list of KPIs for each sector, along with rules for how to extract them. We call this the "taxonomy." For example:

  • Airlines: load factor, CASM-ex-fuel (cost per available seat mile excluding fuel), passenger yield
  • Banks: NIM (net interest margin), CET1 ratio (Common Equity Tier 1), ROTCE (return on tangible common equity)
  • REITs: FFO per share (funds from operations), same-store NOI growth (net operating income), lease spreads
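To make the idea concrete, here is a minimal sketch of what a per-sector taxonomy could look like as data. All field names and rule strings are illustrative, not our production schema; the KPIs themselves are the ones listed above.

```python
# Illustrative per-sector KPI taxonomy. Field names and rule text are
# hypothetical; only the KPI choices come from the sector lists above.
TAXONOMY = {
    "Airlines": [
        {"kpi": "load_factor", "unit": "%",
         "rule": "Use the consolidated quarterly figure, never a monthly or segment value."},
        {"kpi": "casm_ex_fuel", "unit": "cents",
         "rule": "Cost per available seat mile excluding fuel, from the adjusted cost table."},
        {"kpi": "passenger_yield", "unit": "cents"},
    ],
    "Banks": [
        {"kpi": "nim", "unit": "%"},
        {"kpi": "cet1_ratio", "unit": "%",
         "rule": "Pick one regulatory basis and apply it to every bank in the sector."},
        {"kpi": "rotce", "unit": "%"},
    ],
}
```

The extraction rules live alongside the KPI definitions so the same disambiguation text is sent to every model.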

The model receives raw narrative text and tables, and must return structured JSON with the extracted KPI values and metadata.
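For illustration, a structured result for one airline KPI might look like the snippet below. The ticker, period, values, and field names are hypothetical, not our actual schema.

```python
import json

# Hypothetical example of the structured JSON the model is asked to return.
# All values are made up for illustration.
raw_model_output = """
{
  "ticker": "DAL",
  "period": "Q3",
  "kpis": [
    {
      "name": "load_factor",
      "value": 86.5,
      "unit": "%",
      "basis": "consolidated",
      "source": "operating statistics table"
    }
  ]
}
"""
result = json.loads(raw_model_output)  # a parse failure alone catches many bad outputs
```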

Models tested: Opus 4.6, Opus 4.6 with Extended Thinking, Sonnet 4.6, Sonnet 4.6 with Extended Thinking, GPT 5.4 (high reasoning), GPT 5.4 (medium reasoning). All 6 models worked on the same 60 earnings releases with identical prompts, taxonomies, and parsing. No information sharing between models. We ran 10 evaluation rounds, refining our approach between each round based on observed errors. The results reported here are from the final round.

How we measured accuracy: We read every earnings release and checked the extracted values by hand against the source document. Every disagreement between models was verified against the original earnings release.
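The post reports both accuracy and recall. One plausible way to define them (an assumption on our part; the exact formulas aren't spelled out here) is accuracy over the values a model returned, and recall over the KPIs actually present in the release:

```python
def score(extracted: dict, gold: dict) -> tuple[float, float]:
    """One plausible definition of the two headline metrics.

    accuracy: of the KPI values the model returned, how many match the
              hand-verified value from the earnings release.
    recall:   of the KPIs actually present in the release, how many the
              model returned at all.
    """
    returned = {k: v for k, v in extracted.items() if v is not None}
    correct = sum(1 for k, v in returned.items() if gold.get(k) == v)
    accuracy = correct / len(returned) if returned else 0.0
    recall = len(returned.keys() & gold.keys()) / len(gold) if gold else 0.0
    return accuracy, recall
```

Under these definitions a model can have high accuracy but low recall by only answering when it is confident, which is exactly the trade-off visible in the results table.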

Results

Model                   Accuracy   Recall   Avg Cost   Avg Time
GPT 5.4 (medium)        97.2%      73%      $0.132     63s
GPT 5.4 (high)          98.4%      80%      $0.210     148s
Sonnet 4.6              95.3%      80%      $0.117     13s
Sonnet 4.6 (thinking)   97.2%      83%      $0.156     49s
Opus 4.6                98.9%      83%      $0.200     15s
Opus 4.6 (thinking)     99.3%      80%      $0.249     49s

Opus 4.6 with Extended Thinking hit 99.3% accuracy across 60 earnings releases. It found KPI values that no other model extracted, like prior-period YoY comparison data buried in narrative text and tables. We manually verified these against the source earnings releases and they weren't hallucinations. Opus just read comparison columns that the others skipped.

Both Anthropic models improved with Extended Thinking enabled. Opus went from 98.9% to 99.3%. Sonnet went from 95.3% to 97.2%. Thinking helps when an earnings release reports the same metric multiple ways (GAAP vs adjusted, different business segments, monthly vs quarterly) and the model needs to pick the right one.

GPT 5.4 (high) was 3-10x slower than the Anthropic models (148s vs 13-49s per earnings release) without a meaningful accuracy advantage. GPT 5.4 (medium) had decent accuracy at 97.2% but the lowest recall at 73%, meaning it skipped too many KPIs entirely. The task is input-heavy table extraction and the Opus models handled it best.

Accuracy vs. cost per filing

Cost was surprisingly low across the board. Opus 4.6 with Extended Thinking, the most accurate model, costs $0.249 per earnings release. The task is input-heavy but output-light, which keeps costs down. Prompt caching helps too since the system prompt and taxonomy are reused across earnings releases. For production use, cost is not the blocker, accuracy is.
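As a sketch of the caching point: with the Anthropic Messages API, the shared system prompt and taxonomy can be marked cacheable so that only the earnings release text changes between requests. The model id and prompt strings below are placeholders, not our production values.

```python
SYSTEM_PROMPT = "Extract the KPIs defined in the taxonomy below as JSON."
TAXONOMY_TEXT = "(sector taxonomy goes here)"

def build_request(release_text: str) -> dict:
    # The system block is identical for every release in a sector, so it
    # is a natural candidate for prompt caching; only the user message
    # (the release itself) varies per request.
    return {
        "model": "claude-opus-4-6",  # placeholder model id
        "max_tokens": 4096,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "\n\n" + TAXONOMY_TEXT,
                "cache_control": {"type": "ephemeral"},  # Messages API cache marker
            }
        ],
        "messages": [{"role": "user", "content": release_text}],
    }
```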

Accuracy by sector

Four sectors are "solved": Oil & Gas, Pharma, SaaS, and Semiconductors. Every model gets 100%. These earnings releases have clean, unambiguous tables. Model selection doesn't matter here.

The hard sectors separate the models. Insurance (74% for Sonnet 4.6), Industrials (85-86% for both Sonnets), REITs (93-94% for Sonnet and GPT-high), and Retail (89-90% for four models). These are the earnings releases with segment breakdowns, multi-basis reporting, and KPI variant ambiguity.

Model                   Airlines   Banks   Biotech   Industrials   Insurance   Oil & Gas   Pharma   REITs   Restaurants   Retail   SaaS   Semis
Opus 4.6 (thinking)     100%       100%    100%      97%           100%        100%        100%     98%     100%          97%      100%   100%
Opus 4.6                100%       100%    100%      95%           100%        100%        100%     100%    100%          90%      100%   100%
Sonnet 4.6 (thinking)   100%       97%     100%      86%           100%        100%        100%     100%    88%           89%      100%   100%
Sonnet 4.6              99%        98%     100%      85%           74%         100%        100%     94%     96%           89%      100%   100%
GPT 5.4 (high)          97%        99%     100%      100%          100%        100%        97%      93%     95%           100%     100%   100%
GPT 5.4 (medium)        94%        95%     100%      100%          100%        100%        100%     97%     91%           89%      100%   100%

Extended Thinking closes the gap in hard sectors. Sonnet goes from 74% to 100% on Insurance, 94% to 100% on REITs. Opus goes from 95% to 97% on Industrials, 90% to 97% on Retail. The easy sectors don't need Extended Thinking. The hard ones do.

GPT 5.4 has a different error profile than Anthropic. GPT-high scores 100% on Industrials and Retail (where Sonnet struggles) but drops on Airlines (97%), Restaurants (95%), and REITs (93%). GPT-medium struggles on Retail (89%) and Restaurants (91%) where GPT-high gets 100%. The models don't fail on the same filings, which is why cross-model consensus is a useful validation signal.

Where models get it wrong

Every outlier error we found came down to ambiguity. The models can read the tables fine. They fail when an earnings release presents two valid values for the same metric and the model picks the wrong one.

Multi-segment companies

Some companies report the same metric for multiple business segments in a single earnings release. When you ask for "segment margin" without specifying which segment, models randomly pick one or the other, and not consistently. Without Extended Thinking, models grabbed different segments across different quarters for the same company.

Mixed reporting periods

Some companies report monthly and quarterly figures side by side in the same earnings release. When a model sees both, it tends to grab the first number it encounters. Without Extended Thinking, one model systematically grabbed monthly figures instead of quarterly totals, leading to consistent understatements across every earnings release for that company.

Multi-basis reporting

REITs and other companies often report the same metric on multiple accounting bases - consolidated vs pro-rata, total vs subtotal. The numbers are close enough to look right but different enough to matter. Models without Extended Thinking consistently grabbed subtotals instead of totals, or picked the wrong basis.

Regulatory variants

Banks and financial institutions report key ratios under multiple regulatory frameworks. Both numbers are real, both come from the same earnings release, but analysts care about one specific variant. Without explicit disambiguation, models split evenly on which to pick.

What we learned

After 10 rounds of evaluation across 60 earnings releases, a few things became clear.

Finding the number is easy, picking the right one is hard

Models have no trouble reading a table. The errors happen when an earnings release reports the same metric two different ways and the model has to choose. GAAP vs adjusted, monthly vs quarterly, data center revenue vs total revenue, two different regulatory approaches to the same capital ratio. The number is right there in the earnings release either way. The question is which one to grab.

Thinking mode helps models make better choices

Without it, the model tends to grab the first value it sees. With Extended Thinking enabled, it can reason through which variant the taxonomy actually wants. This is where the accuracy gains come from.
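In API terms, Extended Thinking is a per-request toggle. With the Anthropic Messages API it looks roughly like the helper below; the budget value is an arbitrary example, not what we use in production.

```python
def with_thinking(request: dict, budget_tokens: int = 2048) -> dict:
    # Enable Extended Thinking on an existing Messages API request dict.
    # The budget caps how many tokens the model may spend reasoning
    # before it produces the structured JSON answer.
    return {
        **request,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }
```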

Prompt engineering matters as much as model selection

With a basic taxonomy, Opus 4.6 with Extended Thinking had significantly more outlier errors. After adding disambiguation rules (like the ones described above), most of those errors went away. If you benchmark a model with bad prompts, you're measuring your prompts, not the model.

Write general rules, not ticker-specific ones

Disambiguation rules need to work across every company in a sector, not just the one where you first noticed the error. When you're building for 10,000+ tickers across 100+ sectors, general rules are the only way to scale.

Model consensus is useful but not sufficient

When 5 out of 6 models agree on a value, the outlier is almost always wrong. We still manually verify every extraction against the source earnings release. Consensus helps us prioritize what to check first.
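A sketch of how consensus can prioritize review (a hypothetical helper, not our production code):

```python
from collections import Counter

def flag_outliers(values_by_model: dict[str, float]) -> list[str]:
    """Return the models whose extracted value should be reviewed first.

    With 6 models, a 5-1 split almost always means the lone model is
    wrong, so only its extraction is flagged. Anything weaker than an
    (n-1)-way agreement flags every model for review.
    """
    counts = Counter(values_by_model.values())
    majority_value, majority_n = counts.most_common(1)[0]
    if majority_n < len(values_by_model) - 1:
        return list(values_by_model)  # no strong consensus: check everything
    return [m for m, v in values_by_model.items() if v != majority_value]
```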

Our production choice

We're moving forward with Opus 4.6 with Extended Thinking for both real-time extraction and historical backfill: 99.3% accuracy across 60 earnings releases, $0.249 per earnings release, 49 seconds per earnings release. But there's a big caveat. It's the best model we tested, and 99.3% still isn't good enough for production. Our bar is 100%.

We cannot ship raw model output, and neither should you. We run multiple validation steps on top of the model's output before anything reaches production. A single inaccurate data point can result in billions in losses for our customers. We'd rather take a few extra minutes and be 100% correct.
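As an illustration of what "validation steps on top of the model's output" can mean, here is a minimal sketch; the specific checks are hypothetical, not our actual pipeline.

```python
def validate(kpi: dict) -> list[str]:
    """Run cheap sanity checks on one extracted KPI. Any problem sends
    the value to manual review instead of production. The checks shown
    here are illustrative only."""
    problems = []
    value = kpi.get("value")
    if value is None:
        problems.append("missing value")
    elif kpi.get("unit") == "%" and not -100.0 <= value <= 100.0:
        problems.append("percentage outside a plausible range")
    if not kpi.get("source"):
        problems.append("no source location to verify against the release")
    return problems
```

Checks like these catch the mechanical failures; the disambiguation errors described above still require comparing against the source document.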


Want access to the KPI data we're extracting? Contact us at [email protected]