
Extraction Accuracy Benchmark — v1

Run date: 2026-05-25 · Pipeline version: main branch

Status: First publishable benchmark. 14 of 15 runnable fixtures PASS on real broker OMs from a 22-fixture corpus across 5 CRE asset classes. The single failure is documented LLM non-determinism (±1 unit on a borderline rent-roll row), not a deal-defect class error. Methodology, corpus, and defect log included so buyer diligence can independently re-verify.

TL;DR

Fixtures runnable on real PDFs	15
Fixtures passing end-to-end	14
Pass rate	93.3%
Total corpus fixtures (including JSON-only)	22
Asset classes represented	5 (multifamily, mixed-use, NNN, office, industrial)
Geographic states	13 (CA, NJ, NY, CO, AR, PA, WA, IL, OR, NV, AZ, MN, TN)
Validation checks in registry (at v1 run; now 43)	41
Validation checks pinned across the corpus	~36
LLM + Reducto extraction time per fixture	~3:15 average
LLM + Reducto cost per fixture	~$2.00 average
Single full-corpus run	~50 min, ~$15–30

The one failing fixture (philly_3800k_12_unit) is failing for the right reason: the regression suite caught a real signal of LLM non-determinism on a borderline rent-roll classification (11 vs. 12 units across two consecutive runs). The fixture is pinned within a tolerance band that covers both observations; tightening the prompt to deterministic classification is tracked as P1 #11 in the defect log.

Methodology

The regression suite (apps/api/tests/regression/test_pipeline_regression.py) runs the full production extraction pipeline against each fixture's source PDF (same Reducto OCR call, same Anthropic LLM calls, same downstream commercial-classification, banked-rent enrichment, opex-basis classification, and validation registry) and compares the output against per-fixture golden values in expected.json.

What gets compared, per fixture:

deal_summary — property name, type, asking price, unit count, broker cap rate, rentable SF
rent_roll — source flag (per_unit / synthesized / absent), total/occupied/vacant unit counts, residential/commercial splits, monthly rent in-place, banked rent, non-arms-length lease counts
operating_statement — total income, total operating expenses, NOI, pro forma NOI variants
lease_abstracts — count and per-tenant fields (tenant name, base rent, escalation rate, credit flags)
rent_comps — count, average market rent, subject identification
sales_comps — count, average sale price, subject identification
validation — per-check status assertions across the 41-check registry — must_pass, must_not_fail, must_emit_info, must_emit_warn
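The four validation assertion kinds listed above can be read as predicates over each check's observed status. The sketch below is illustrative only: the status strings ("pass", "fail", "info", "warn"), the check IDs, and the function name are assumptions, not the harness's actual schema.

```python
from typing import Dict, List

# Assumed mapping from assertion kind to a predicate on the observed status.
# The status vocabulary is a guess based on the assertion names.
RULES = {
    "must_pass": lambda status: status == "pass",
    "must_not_fail": lambda status: status != "fail",
    "must_emit_info": lambda status: status == "info",
    "must_emit_warn": lambda status: status == "warn",
}


def check_validation_block(expected: Dict[str, List[str]],
                           observed: Dict[str, str]) -> List[str]:
    """Return one message per violated assertion; an empty list means success."""
    problems = []
    for kind, check_ids in expected.items():
        rule = RULES[kind]
        for check_id in check_ids:
            if check_id not in observed:
                # A check that never ran cannot satisfy any assertion.
                problems.append(f"{kind}: {check_id} missing from output")
            elif not rule(observed[check_id]):
                problems.append(f"{kind}: {check_id} was {observed[check_id]}")
    return problems


# Hypothetical check IDs, for illustration only.
expected = {
    "must_pass": ["noi_reconciles"],
    "must_not_fail": ["cap_rate_sane"],
    "must_emit_warn": ["banked_rent_present"],
}
observed = {
    "noi_reconciles": "pass",
    "cap_rate_sane": "warn",
    "banked_rent_present": "info",
}
print(check_validation_block(expected, observed))
```

Here only the must_emit_warn assertion is violated, because the check emitted info rather than warn; must_not_fail tolerates a warn.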

Numeric fields use per-block tolerance bands (typically 0.5–2%, widened to 5% on operating statement aggregates, 10% only where LLM non-determinism is documented and tracked as a fix-needed defect).
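A relative tolerance band of this kind can be sketched in a few lines. This is a minimal illustration, not the harness's code: the function name and the band table are assumptions, and the NOI figures are the ones quoted in the verification section of this page.

```python
def within_tolerance(expected: float, actual: float, rel_tol: float) -> bool:
    """True when actual lies within rel_tol (a fraction, e.g. 0.05) of expected."""
    if expected == 0:
        # A relative band is undefined around zero; require an exact match.
        return actual == 0
    return abs(actual - expected) <= rel_tol * abs(expected)


# Hypothetical per-block bands, expressed as fractions.
BANDS = {
    "deal_summary": 0.005,          # tight end of the typical range
    "operating_statement": 0.05,    # widened for aggregates
}

expected_noi, actual_noi = 234_835, 236_118  # a deviation of about 0.55%

print(within_tolerance(expected_noi, actual_noi, BANDS["operating_statement"]))  # True
print(within_tolerance(expected_noi, actual_noi, BANDS["deal_summary"]))         # False
```

The same 1,283 difference passes the 5% band and fails a 0.5% band, which is why the band chosen per block matters as much as the golden value itself.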

A buyer can reproduce by:

cd apps/api
export ANTHROPIC_API_KEY=...
export REDUCTO_API_KEY=...
python -m pytest tests/regression/ -v

Source PDFs are not committed for confidentiality. Buyer diligence can either request the corpus under NDA or supply their own OMs to the same harness — the assertions in expected.json are committed and reviewable line-by-line.

Corpus

22 fixtures total. The 15 runnable here have source PDFs available in this run; the other 7 are hand-curated regression-pin fixtures from prior development whose source PDFs remain confidential and re-enter the corpus when their owners clear them.

By asset class

Asset class	Fixtures	Examples
Multifamily (pure)	11	The Beverly (46u, SF), Sonoma Heights (60u, CO), Summit Portfolio (387u, AR), Vista Del Pacifico (61u, San Diego)
Mixed-use	4	Starboard SF SRO, Urban Capital, 9307 3rd Ave Brooklyn, 101 Dyckman ($16.25M institutional)
NNN net lease	4	Oregon DMV (single-tenant govt), Carson NV (two-tenant), University MN healthcare, Muirwood AZ medical office
Office / owner-user	1	Muirwood AZ (overlaps with NNN)
Industrial	1	Westbelt Dr Nashville
Teaser (no financials)	2	Lantana Culver City, Kirkland WA

Pass rate — detailed

14 PASS of 15 runnable fixtures:

Fixture	Asset class	Status
brooklyn_9307_3rd_ave	Mixed-use (NY)	PASS
carson_nv_two_tenant_nnn	NNN multi-tenant (NV)	PASS
chicago_31_unit_condo_portfolio	Multifamily condo (IL)	PASS
clovis_1228_jefferson_ave	Multifamily turnkey (CA)	PASS
dyckman_101_manhattan	Mixed-use institutional (NY)	PASS
equity_union_venice_los_angeles	Multifamily value-add (CA)	PASS
kirkland_teaser	Teaser (WA)	PASS
lantana_culver_city_teaser	Teaser (CA)	PASS
muirwood_az_medical_office_owner_user	Medical office condo (AZ)	PASS
philly_3800k_12_unit	Multifamily (PA)	FAIL (LLM non-determinism on borderline unit row; see P1 #11)
sonoma_heights_colorado_springs	Multifamily value-add (CO)	PASS
summit_portfolio_little_rock	Multifamily portfolio (AR)	PASS
university_mn_bhg_nnn_healthcare	NNN healthcare (MN)	PASS
valley_oregon_eugene_dmv_nnn	NNN govt single-tenant (OR)	PASS
westbelt_tn_nashville_industrial	Industrial owner-user (TN)	PASS

Defect log

The full list of defects the regression suite caught is published alongside this benchmark. Each entry includes fixture, symptom, root cause, suggested fix, and tracking status.

P1 — accuracy bugs: 4 total. 1 FIXED (OS extractor percentage parsing). 1 mitigated with tolerance + tracking (philly LLM variance). 2 open (Brooklyn commercial classifier, Clovis cap-rate variance).
P2 — harness limitation: 1 (short-document deal_summary threshold).
P3 — NNN/lease-abstract coverage gaps: 4 (OS synthesis from lease abstract, deal_summary on NNN, taxonomy, section finder labeling).
P4 — goldens ergonomics: 2 (date comparison, model field documentation).

The Brooklyn commercial-classifier defect is the most material remaining accuracy bug; the others are coverage gaps that depress validation breadth but do not produce wrong numbers — they produce missing-field signals downstream consumers can detect.

Request the full defect log →

Unit economics

Per fixture, observed across the 50-minute total run:

~3:15 wall-clock per fixture (range: 30s for teasers, 6+ min for institutional)
~$2.00 spend per fixture (Reducto OCR + Anthropic LLM calls combined)
Cost-per-correct-document: $2.14 (= $30 upper-bound run cost / 14 passing fixtures)
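The headline figures can be re-derived from the numbers on this page. The snippet below only restates that arithmetic; the variable and function names are illustrative.

```python
RUNNABLE_FIXTURES = 15
PASSING_FIXTURES = 14
RUN_COST_LOW_USD = 15.00    # lower end of the quoted full-run cost
RUN_COST_HIGH_USD = 30.00   # upper end, used for the headline figure
AVG_MINUTES_PER_FIXTURE = 3 + 15 / 60  # 3:15 wall-clock


def pass_rate(passing: int, runnable: int) -> float:
    return passing / runnable


def cost_per_correct(run_cost_usd: float, passing: int) -> float:
    return run_cost_usd / passing


print(f"pass rate: {pass_rate(PASSING_FIXTURES, RUNNABLE_FIXTURES):.1%}")            # 93.3%
print(f"cost per correct (high): ${cost_per_correct(RUN_COST_HIGH_USD, 14):.2f}")     # $2.14
print(f"cost per correct (low):  ${cost_per_correct(RUN_COST_LOW_USD, 14):.2f}")      # $1.07
print(f"estimated run time: {RUNNABLE_FIXTURES * AVG_MINUTES_PER_FIXTURE:.2f} min")   # 48.75
```

The $2.14 figure uses the top of the $15–30 range, so it is a conservative (high) estimate; 15 fixtures at 3:15 each is consistent with the roughly 50-minute run.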

For comparison: a junior analyst working through the 22 OMs in this corpus by hand, including the time to re-check every cell against the source, would spend 3–6 hours per OM. The accuracy bar that matters is "what fraction of the rechecks find errors", which is what this benchmark answers.

What's not measured (be honest)

Per-cell source citation accuracy. The pipeline produces page-level citations for some fields; per-cell bounding-box accuracy is not yet measured. The architecture supports it; the regression coverage doesn't exist yet.
Drift over time. This is run 2; the first run was on the same date. No production-window canary is wired up yet to catch model-side drift between runs.
Comparable accuracy by deal archetype subgroup. With 15 fixtures across 5 asset classes (1–11 per class), per-archetype confidence intervals are not statistically meaningful. Tier-2 sourcing to 30+ fixtures would close this.
Adversarial cases. No deliberately-corrupted PDFs (OCR-broken, half-scanned, malformed-table) in the corpus. Robustness floor is unmeasured.
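To make the small-sample caveat above concrete, here is a sketch that puts a 95% Wilson score interval around a pass rate. The Wilson interval is one common choice for small samples and is an illustration added here; it is not something the benchmark itself reports.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    return center - half_width, center + half_width


low, high = wilson_interval(14, 15)      # the overall result: 14 of 15
print(f"14/15: {low:.3f} to {high:.3f}")

low_1, high_1 = wilson_interval(1, 1)    # an asset class with a single passing fixture
print(f"1/1:   {low_1:.3f} to {high_1:.3f}")
```

Under these assumptions the overall 14-of-15 result spans roughly 70% to 99%, and a class with one passing fixture spans roughly 21% to 100%, which is why per-archetype rates are not reported.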

How a buyer verifies

The regression suite is the artifact. A buyer's diligence team can:

Receive a repo snapshot under NDA
Drop their own 5–10 OMs into apps/api/tests/regression/fixtures/<their_slug>/source.pdf with a minimal expected.json derived from the OM's headline figures
Run pytest tests/regression/ against their fixtures with their own API keys
Inspect the assertion failures (if any) and the validation-check status output

The harness emits the full per-fixture diff in one block per fixture, so a buyer can read "expected NOI 234,835, got 236,118 (within tolerance)" line-by-line. No black-box claims; the assertion logic is in committed Python.
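The fixture drop-in step above can be scripted. The sketch below copies an OM into the fixture layout named on this page and writes a minimal expected.json. The directory layout comes from the page; the JSON key names (deal_summary, asking_price, unit_count) and the helper name are assumptions, so check the committed expected.json files for the real schema before relying on it.

```python
import json
import shutil
from pathlib import Path
from typing import Any, Dict

FIXTURES_ROOT = Path("apps/api/tests/regression/fixtures")


def scaffold_fixture(slug: str,
                     source_pdf: Path,
                     headline: Dict[str, Any],
                     root: Path = FIXTURES_ROOT) -> Path:
    """Create <root>/<slug>/ containing source.pdf and a minimal expected.json."""
    fixture_dir = root / slug
    fixture_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source_pdf, fixture_dir / "source.pdf")

    # Hypothetical layout: only headline figures, nested under deal_summary.
    expected = {"deal_summary": headline}
    (fixture_dir / "expected.json").write_text(json.dumps(expected, indent=2) + "\n")
    return fixture_dir


# Example call (slug, path, and figures are placeholders):
# scaffold_fixture(
#     "my_first_om",
#     Path("~/Downloads/offering_memo.pdf").expanduser(),
#     {"asking_price": 3_800_000, "unit_count": 12},
# )
```

Starting from headline figures only keeps the first run cheap to set up; further blocks can be pinned once the first diff has been reviewed.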

Try the pipeline yourself

The same validation registry exercised by this benchmark is exposed as a callable API. Send pre-extracted JSON or a source PDF; receive a validation report with per-check status, affected fields, and the structured data payload your UI needs to render cell-level flags.

curl -sS https://api.rentrolliq.com/v1/checks \
  -H "X-API-Key: <issued on request>" | jq '.checks_catalog_size'
# => 42
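The same call can be made from Python with only the standard library. This is a sketch, not an official client: the URL, header name, and checks_catalog_size field are taken from the curl example above, while the function names and timeout are assumptions.

```python
import json
import urllib.request

BASE_URL = "https://api.rentrolliq.com/v1"


def build_checks_request(api_key: str) -> urllib.request.Request:
    """Build the GET /checks request without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/checks",
        headers={"X-API-Key": api_key},
    )


def catalog_size(api_key: str, timeout: float = 10.0) -> int:
    """Fetch the checks catalog and return its reported size."""
    request = build_checks_request(api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["checks_catalog_size"]


# Example (requires a real key and network access):
# print(catalog_size("<issued on request>"))
```

Separating request construction from the network call makes the request easy to inspect or unit-test without a key.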

OpenAPI spec: docs/monetization/openapi.yaml
Integration cookbook (curl + Python + sample request/response): docs/monetization/examples/
Request a key: support@rentrolliq.com

Want to evaluate it yourself?

Try a single OM free, or talk to us about a benchmark run on your own deal package.
