June 8, 2026
We sent a 58-page SEC filing to every document API. Here's what came back.
Berkshire Hathaway's Q3 2024 10-Q tested against extraction and reasoning APIs. Real outputs, real accuracy, no spin.
Benchmarks on 2-page invoices are easy: everything works on clean, short documents. The real test is what happens when you throw a long, table-heavy SEC filing at a document API and ask it hard questions.
We used Berkshire Hathaway's Q3 2024 10-Q (58 pages of dense financial tables) and asked five questions that require real understanding, not just field extraction.
The five questions
- Net earnings for Q3 2024 — literal value, buried in financial statements
- YoY revenue growth rate — requires finding two numbers and computing a percentage
- Insurance float amount — domain-specific term, not labeled as "float" in the document
- Debt-to-equity ratio — requires reading the balance sheet, extracting two values, and dividing
- Largest operating segment by revenue — requires comparing multiple line items across a table
Results: extraction vs. reasoning accuracy
We tested each question against standard extraction (pull literal values) and reasoning (compute answers). Here's how they performed:
| Question | Type | Extraction | Reasoning |
|---|---|---|---|
| Net earnings Q3 2024 | Literal lookup | $26,251M | $26,251M |
| YoY revenue growth | Computation | Raw numbers only | -0.23% |
| Insurance float | Domain reasoning | Not found | $171B |
| Debt-to-equity ratio | Multi-field computation | Separate fields | 0.26 |
| Largest segment | Comparison | Table dump | Insurance |
Extraction answered only the literal lookup (1 of 5). Reasoning answered all five, including the ones that require finding multiple values across different pages and computing the result.
Latency and cost breakdown
On the 58-page Berkshire 10-Q:
| Metric | Extract | Analyze |
|---|---|---|
| Latency (5 fields) | ~4s | ~18s |
| Credits used | 58 (1/page) | 116 (2/page) |
| Cost at $0.01/credit | $0.58 | $1.16 |
| Agent steps per question | — | 2-4 tool calls |
Analyze is slower because it's doing real work — searching the document, reading specific pages, running computations. But for questions that extraction can't answer at all, the extra latency and cost are irrelevant. A wrong answer in 4 seconds isn't better than a right answer in 18.
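The credit arithmetic in the table above can be reproduced in a few lines. This is a minimal sketch, assuming credits scale linearly with page count at the per-page rates shown; `run_cost` is a hypothetical helper for illustration, not part of any API.

```python
def run_cost(pages: int, credits_per_page: int,
             price_per_credit: float = 0.01) -> tuple[int, float]:
    """Return (credits, dollars) for one document run.

    Assumes linear per-page pricing, as in the table above.
    """
    credits = pages * credits_per_page
    # Round to cents so float noise doesn't leak into the result.
    return credits, round(credits * price_per_credit, 2)


extract_credits, extract_cost = run_cost(58, 1)   # extract: 1 credit per page
analyze_credits, analyze_cost = run_cost(58, 2)   # analyze: 2 credits per page
print(extract_credits, extract_cost)
print(analyze_credits, analyze_cost)
```

On the 58-page filing, the difference between the two modes is 58 credits per run.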
What document reasoning returned
The /analyze endpoint handled all five. Here's the debt-to-equity example with full reasoning trace:
{
  "data": {"debt_to_equity": 0.26},
  "reasoning": {"debt_to_equity": "Total debt: $127,299M (Notes payable $8,331M + long-term debt $118,968M, p.4). Total equity: $488,669M (p.4). Ratio: 127299/488669 = 0.26"},
  "sources": {"debt_to_equity": [
    "Notes payable and other borrowings: $8,331 (p.4)",
    "Long-term debt: $118,968 (p.4)",
    "Total Berkshire Hathaway shareholders' equity: $488,669 (p.4)"
  ]},
  "confidence": {"debt_to_equity": 0.96},
  "steps": {"debt_to_equity": [
    {"tool": "search", "args": {"query": "total debt borrowings"}},
    {"tool": "read_pages", "args": {"start_page": 4, "end_page": 4}},
    {"tool": "compute", "args": {"code": "result = (8331+118968)/488669"}}
  ]}
}
Three steps: search for the right section, read the balance sheet, compute the ratio. Every step is logged. Your agent can explain exactly how it arrived at 0.26.
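Because the sources are returned alongside the answer, a caller can re-derive the number instead of trusting it. The sketch below parses the dollar amounts out of the `sources` strings from the response above and recomputes the ratio. It assumes, for this example only, that the last source is the denominator (equity) and the rest sum to the numerator (debt); `amounts` and `verify_ratio` are hypothetical helpers, not part of the API.

```python
import json
import re

# Trimmed copy of the response shown above (data and sources only).
RESPONSE = json.loads("""
{
  "data": {"debt_to_equity": 0.26},
  "sources": {"debt_to_equity": [
    "Notes payable and other borrowings: $8,331 (p.4)",
    "Long-term debt: $118,968 (p.4)",
    "Total Berkshire Hathaway shareholders' equity: $488,669 (p.4)"
  ]}
}
""")


def amounts(sources: list[str]) -> list[int]:
    """Pull the first dollar figure out of each source string."""
    values = []
    for text in sources:
        match = re.search(r"\$([\d,]+)", text)
        if match is None:
            raise ValueError(f"no dollar amount in source: {text!r}")
        values.append(int(match.group(1).replace(",", "")))
    return values


def verify_ratio(resp: dict, field: str = "debt_to_equity",
                 tol: float = 0.005) -> tuple[float, bool]:
    """Recompute the ratio from sources and compare it to the reported value.

    Assumption: the last source is the denominator, all others are summed.
    """
    *debt_parts, equity = amounts(resp["sources"][field])
    recomputed = sum(debt_parts) / equity
    return recomputed, abs(recomputed - resp["data"][field]) <= tol


recomputed, ok = verify_ratio(RESPONSE)
print(f"{recomputed:.4f}", ok)
```

The tolerance allows for the reported value being rounded to two decimal places.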
DocVQA benchmark: how we measure accuracy at scale
Beyond our custom financial benchmarks, we run the industry-standard DocVQA (Document Visual Question Answering) benchmark: 5,349 questions (the size of its validation split) across real-world documents. The metric is ANLS (Average Normalized Levenshtein Similarity), scored 0-1, which gives partial credit for near-miss answers such as small OCR errors.
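For readers unfamiliar with ANLS, here is a minimal sketch of the metric as it is commonly defined: for each question, take the best similarity against any accepted answer, zero it out if the normalized edit distance reaches 0.5, and average over questions. The lowercasing and whitespace stripping are assumptions about normalization, and this is not the official evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def anls(predictions: list[str], references: list[list[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions."""
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Answers too far from the reference score zero.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


print(anls(["$26,251M", "Insurence"], [["$26,251M"], ["Insurance"]]))
```

An exact match scores 1.0, a one-character slip scores slightly below 1.0, and an unrelated answer scores 0.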
Published reference scores for context:
| Approach | ANLS Score |
|---|---|
| GPT-4o (direct vision) | ~0.92 |
| Claude 3.5 Sonnet | ~0.90 |
| Gemini 1.5 Pro | ~0.89 |
| Donut (specialized model) | ~0.84 |
| Tesseract + LayoutLM | ~0.78 |
We run DocVQA on every release to catch accuracy regressions. The benchmark suite is part of our CI — we don't ship if accuracy drops.
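A release gate of this kind can be as simple as comparing the new score against a stored baseline. The following is a sketch of the idea only; `passes_gate`, the baseline value, and the tolerance are illustrative assumptions, not a description of the actual CI setup.

```python
def passes_gate(current: float, baseline: float,
                tolerance: float = 0.0) -> bool:
    """True if the new ANLS score has not dropped below baseline - tolerance."""
    return current >= baseline - tolerance


# Hypothetical numbers: the last released score and the candidate build's score.
BASELINE_ANLS = 0.90
candidate_anls = 0.91

if passes_gate(candidate_anls, BASELINE_ANLS):
    print("accuracy gate passed")
else:
    print("accuracy gate failed: do not ship")
```

A small nonzero tolerance is one way to avoid blocking releases on run-to-run noise, at the cost of letting slow drift through.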
The pattern: extraction for fields, reasoning for questions
This benchmark made the distinction clear:
Extraction works for
- "What is the net earnings figure?"
- "Who are the parties in this contract?"
- "What is the invoice total?"
Questions with literal answers in the document.
Reasoning works for
- "What is the YoY revenue growth rate?"
- "Which segment grew fastest?"
- "Does the total match the line items?"
Questions that require computation or comparison.
If your agent only extracts, it can tell you what's in the document. If it reasons, it can tell you what the document means.
We publish our benchmark suite and run it on every release. Try analyze on your own documents — the free tier gives you 100 credits to test with.