We sent a 58-page SEC filing to every document API. Here's what came back.

Benchmarks on 2-page invoices are easy. Everything works on clean, short documents. The real test is what happens when you throw a long, table-heavy SEC filing at a document API and ask it hard questions.

We used Berkshire Hathaway's Q3 2024 10-Q (58 pages of dense financial tables) and asked five questions that require real understanding — not just field extraction.

The five questions

Net earnings for Q3 2024 — literal value, buried in financial statements
YoY revenue growth rate — requires finding two numbers and computing a percentage
Insurance float amount — domain-specific term, not labeled as "float" in the document
Debt-to-equity ratio — requires reading the balance sheet, extracting two values, and dividing
Largest operating segment by revenue — requires comparing multiple line items across a table

Results: extraction vs. reasoning accuracy

We ran each question through standard extraction (which pulls literal values) and reasoning (which computes answers). Here's how they performed:

Question	Type	Extraction	Reasoning
Net earnings Q3 2024	Literal lookup	$26,251M	$26,251M
YoY revenue growth	Computation	Raw numbers only	-0.23%
Insurance float	Domain reasoning	Not found	$171B
Debt-to-equity ratio	Multi-field computation	Separate fields	0.26
Largest segment	Comparison	Table dump	Insurance

Extraction answered only the literal lookup (1 of 5). Reasoning answered all five, including the ones that require finding multiple values in the document and computing the answer.
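The computed answers come down to simple arithmetic once the right inputs have been located. Here is a minimal sketch of the YoY growth calculation. The function name is ours, and the input figures are placeholders chosen for easy checking, not values from the 10-Q.

```python
def yoy_growth_pct(current: float, prior: float) -> float:
    """Year-over-year growth, expressed as a percentage of the prior-period value."""
    if prior == 0:
        raise ValueError("prior-period value must be non-zero")
    return (current - prior) / prior * 100.0


# Placeholder inputs in $M (illustrative only, not taken from the filing).
growth = yoy_growth_pct(current=9_900.0, prior=10_000.0)
print(f"{growth:.2f}%")  # -1.00%
```

The hard part for a document API is not this formula but finding the two matching line items, for the same metric and comparable periods, in a long filing.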

Latency and cost breakdown

On the 58-page Berkshire 10-Q, comparing Extract (extraction) with Analyze (reasoning):

Metric	Extract	Analyze
Latency (5 fields)	~4s	~18s
Credits used	58 (1/page)	116 (2/page)
Cost at $0.01/credit	$0.58	$1.16
Agent steps per question	—	2-4 tool calls

Analyze is slower because it's doing real work: searching the document, reading specific pages, running computations. But for questions that extraction can't answer at all, the extra ~14 seconds and $0.58 hardly matter. A wrong answer in 4 seconds isn't better than a right answer in 18.
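The cost rows in the table follow directly from page count and credits per page. A small sketch that reproduces them, using the per-page rates and the $0.01 credit price from the table; the `Mode` class and function name are illustrative, not part of any SDK.

```python
from dataclasses import dataclass

PRICE_PER_CREDIT_USD = 0.01  # credit price used in the table above


@dataclass(frozen=True)
class Mode:
    name: str
    credits_per_page: int


EXTRACT = Mode("extract", 1)  # 1 credit per page
ANALYZE = Mode("analyze", 2)  # 2 credits per page


def run_cost(pages: int, mode: Mode) -> tuple[int, float]:
    """Return (credits used, cost in USD) for one pass over a document."""
    credits = pages * mode.credits_per_page
    return credits, round(credits * PRICE_PER_CREDIT_USD, 2)


for mode in (EXTRACT, ANALYZE):
    credits, usd = run_cost(58, mode)
    print(f"{mode.name}: {credits} credits, ${usd:.2f}")
# extract: 58 credits, $0.58
# analyze: 116 credits, $1.16
```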

What document reasoning returned

Here's the debt-to-equity response from the /analyze endpoint, with its full reasoning trace:

{
  "data": {"debt_to_equity": 0.26},
  "reasoning": {"debt_to_equity": "Total debt: $127,299M (Notes payable $8,331M + long-term debt $118,968M, p.4). Total equity: $488,669M (p.4). Ratio: 127299/488669 = 0.26"},
  "sources": {"debt_to_equity": [
    "Notes payable and other borrowings: $8,331 (p.4)",
    "Long-term debt: $118,968 (p.4)",
    "Total Berkshire Hathaway shareholders' equity: $488,669 (p.4)"
  ]},
  "confidence": {"debt_to_equity": 0.96},
  "steps": {"debt_to_equity": [
    {"tool": "search", "args": {"query": "total debt borrowings"}},
    {"tool": "read_pages", "args": {"start_page": 4, "end_page": 4}},
    {"tool": "compute", "args": {"code": "result = (8331+118968)/488669"}}
  ]}
}

Three steps: search for the right section, read the balance sheet, compute the ratio. Every step is logged. Your agent can explain exactly how it arrived at 0.26.
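Because the sources are returned alongside the answer, a caller can re-run the arithmetic instead of trusting it. The sketch below does that for the response above. It assumes the sources arrive in the order shown (two debt items, then equity) and that each string contains a single dollar figure; neither is a documented guarantee.

```python
import re

# Trimmed copy of the response shown above.
response = {
    "data": {"debt_to_equity": 0.26},
    "sources": {"debt_to_equity": [
        "Notes payable and other borrowings: $8,331 (p.4)",
        "Long-term debt: $118,968 (p.4)",
        "Total Berkshire Hathaway shareholders' equity: $488,669 (p.4)",
    ]},
}

AMOUNT = re.compile(r"\$([\d,]+(?:\.\d+)?)")


def dollar_amount(source: str) -> float:
    """Pull the first dollar figure out of a source string."""
    match = AMOUNT.search(source)
    if match is None:
        raise ValueError(f"no dollar amount in {source!r}")
    return float(match.group(1).replace(",", ""))


# Assumed ordering: notes payable, long-term debt, shareholders' equity.
notes, long_term, equity = (
    dollar_amount(s) for s in response["sources"]["debt_to_equity"]
)
recomputed = (notes + long_term) / equity
reported = response["data"]["debt_to_equity"]

# Accept the reported value if it matches the recomputation to two decimals.
assert round(recomputed, 2) == reported, (recomputed, reported)
print(f"recomputed={recomputed:.4f} reported={reported}")
# recomputed=0.2605 reported=0.26
```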

DocVQA benchmark: how we measure accuracy at scale

Beyond our custom financial benchmarks, we run the industry-standard DocVQA (Document Visual Question Answering) benchmark — 5,349 questions across real-world documents. The metric is ANLS (Average Normalized Levenshtein Similarity), scored 0-1.
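For readers who haven't met ANLS: each prediction is scored by its normalized edit distance to the closest reference answer, and anything at or beyond a 0.5 distance threshold scores zero. The sketch below follows that commonly used definition. The lowercasing and whitespace stripping are typical but are our assumption here, not a description of our scoring harness.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def anls(predictions: list[str], references: list[list[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions) if predictions else 0.0


# Toy data: an exact match, a one-character typo, and a miss.
score = anls(
    ["$26,251", "insurence", "banana"],
    [["$26,251"], ["Insurance"], ["apple"]],
)
print(f"{score:.3f}")  # 0.630
```

The threshold is what makes ANLS forgiving of OCR-style typos while still giving no credit to answers that are simply wrong.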

Published reference scores for context:

Approach	ANLS Score
GPT-4o (direct vision)	~0.92
Claude 3.5 Sonnet	~0.90
Gemini 1.5 Pro	~0.89
Donut (specialized model)	~0.84
Tesseract + LayoutLM	~0.78

We run DocVQA on every release to catch accuracy regressions. The benchmark suite is part of our CI — we don't ship if accuracy drops.
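A release gate of this kind can be as simple as comparing the new score with the last shipped one and returning a non-zero exit code on regression. This is a minimal sketch of the idea, not our actual CI job; the baseline, tolerance, and function names are hypothetical.

```python
BASELINE_ANLS = 0.900  # hypothetical score of the last shipped release
TOLERANCE = 0.005      # hypothetical allowance for run-to-run noise


def accuracy_gate(candidate: float, baseline: float = BASELINE_ANLS,
                  tolerance: float = TOLERANCE) -> bool:
    """True if the candidate score has not regressed beyond the tolerance."""
    return candidate >= baseline - tolerance


def main(candidate: float) -> int:
    """Return a process exit code: 0 to allow the release, 1 to block it."""
    if accuracy_gate(candidate):
        print(f"PASS: ANLS {candidate:.3f} (baseline {BASELINE_ANLS:.3f})")
        return 0
    print(f"FAIL: ANLS {candidate:.3f} fell below baseline {BASELINE_ANLS:.3f}")
    return 1


# In a CI job the candidate score would come from the benchmark run and the
# return value would be passed to sys.exit(); here it is hard-coded.
exit_code = main(0.893)  # prints the FAIL line; exit_code is 1
```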

The pattern: extraction for fields, reasoning for questions

This benchmark made the distinction clear:

Extraction works for

"What is the net earnings figure?"
"Who are the parties in this contract?"
"What is the invoice total?"

Questions with literal answers in the document.

Reasoning works for

"What is the YoY revenue growth rate?"
"Which segment grew fastest?"
"Does the total match the line items?"

Questions that require computation or comparison.

If your agent only extracts, it can tell you what's in the document. If it reasons, it can tell you what the document means.

We publish our benchmark suite and run it on every release. Try /analyze on your own documents: the free tier gives you 100 credits to test with.