Extraction Accuracy Benchmark — v1
Run date: 2026-05-25 · Pipeline version: main branch
TL;DR
| Metric | Value |
|---|---|
| Fixtures runnable on real PDFs | 15 |
| Fixtures passing end-to-end | 14 |
| Pass rate | 93.3% |
| Total corpus fixtures (including JSON-only) | 22 |
| Asset classes represented | 5 (multifamily, mixed-use, NNN, office, industrial) |
| Geographic states | 10+ (CA, NJ, NY, CO, AR, PA, WA, IL, OR, NV, AZ, MN, TN) |
| Validation checks in registry | 41 |
| Validation checks pinned across the corpus | ~36 |
| LLM + Reducto extraction time per fixture | ~3:15 average |
| LLM + Reducto cost per fixture | ~$2.00 average |
| Single full-corpus run | ~50 min, ~$15–30 |
The one failing fixture (philly_3800k_12_unit) is failing for the right reason: the regression suite caught a real signal of LLM non-determinism on a borderline rent-roll classification (11 vs. 12 units across two consecutive runs). The fixture is pinned within a tolerance band that covers both observations; tightening the prompt to deterministic classification is tracked as P1 #11 in the defect log.
Methodology
The regression suite (apps/api/tests/regression/test_pipeline_regression.py) runs the full production extraction pipeline against each fixture's source PDF and compares the output against per-fixture golden values in expected.json. Nothing is mocked: the run uses the same Reducto OCR call, the same Anthropic LLM calls, and the same downstream commercial classification, banked-rent enrichment, opex-basis classification, and validation registry as production.
What gets compared, per fixture:
- deal_summary: property name, type, asking price, unit count, broker cap rate, rentable SF
- rent_roll: source flag (per_unit / synthesized / absent), total/occupied/vacant unit counts, residential/commercial splits, monthly rent in-place, banked rent, non-arms-length lease counts
- operating_statement: total income, total operating expenses, NOI, pro forma NOI variants
- lease_abstracts: count and per-tenant fields (tenant name, base rent, escalation rate, credit flags)
- rent_comps: count, average market rent, subject identification
- sales_comps: count, average sale price, subject identification
- validation: per-check status assertions across the 41-check registry (must_pass, must_not_fail, must_emit_info, must_emit_warn)
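The four validation assertion kinds named above can be read as predicates over the status a check reports. The following is a minimal sketch, assuming each check reports one of `pass`, `info`, `warn`, or `fail` and that the kinds mean what their names suggest. The status vocabulary, the `violated_assertions` helper, and the check IDs are all hypothetical and are not taken from the harness.

```python
# Assumed semantics for the four assertion kinds; the real harness may differ.
ASSERTIONS = {
    "must_pass": lambda status: status == "pass",
    "must_not_fail": lambda status: status != "fail",
    "must_emit_info": lambda status: status == "info",
    "must_emit_warn": lambda status: status == "warn",
}


def violated_assertions(expected: dict, actual: dict) -> list:
    """Return one message per pinned check whose status breaks its assertion.

    expected maps an assertion kind to the check IDs pinned under it;
    actual maps a check ID to the status the pipeline reported.
    A pinned check that is missing from the output counts as a violation.
    """
    problems = []
    for kind, check_ids in expected.items():
        predicate = ASSERTIONS[kind]
        for check_id in check_ids:
            status = actual.get(check_id)
            if status is None or not predicate(status):
                problems.append(f"{kind}: {check_id} reported {status!r}")
    return problems


# Illustrative check IDs only.
expected = {"must_pass": ["noi_reconciles"], "must_not_fail": ["cap_rate_in_range"]}
actual = {"noi_reconciles": "pass", "cap_rate_in_range": "warn"}
print(violated_assertions(expected, actual))
```

Under these assumed semantics a `warn` satisfies `must_not_fail`, so the example reports no violations.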
Numeric fields use per-block tolerance bands (typically 0.5–2%, widened to 5% on operating statement aggregates, 10% only where LLM non-determinism is documented and tracked as a fix-needed defect).
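A relative tolerance band of this kind can be sketched as follows. This is an illustration under stated assumptions, not the harness's own comparison code: the `within_tolerance` helper and the `TOLERANCE` mapping are hypothetical names, and the band assigned to each block is picked from the ranges quoted above. The NOI figures are the ones quoted in the sample diff line under "How a buyer verifies".

```python
def within_tolerance(expected: float, actual: float, rel_tol: float) -> bool:
    """True when actual is within rel_tol (a fraction, e.g. 0.05 for 5%) of expected."""
    if expected == 0:
        return actual == 0  # a relative band is undefined around zero
    return abs(actual - expected) <= rel_tol * abs(expected)


# Hypothetical per-block bands, chosen from the ranges quoted in the text.
TOLERANCE = {"deal_summary": 0.005, "operating_statement": 0.05}

# NOI figures from the document's own sample diff line.
expected_noi, actual_noi = 234_835, 236_118

print(within_tolerance(expected_noi, actual_noi, TOLERANCE["operating_statement"]))
print(within_tolerance(expected_noi, actual_noi, TOLERANCE["deal_summary"]))
```

The 1,283 difference is about 0.55% of the expected value, so it clears the 5% band used for operating statement aggregates but would not clear a 0.5% band.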
A buyer can reproduce by:
```bash
cd apps/api
export ANTHROPIC_API_KEY=...
export REDUCTO_API_KEY=...
python -m pytest tests/regression/ -v
```
Source PDFs are not committed, for confidentiality reasons. Buyer diligence can either request the corpus under NDA or supply their own OMs to the same harness. The assertions in expected.json are committed and reviewable line-by-line.
Corpus
22 fixtures total. The 15 runnable here have source PDFs available in this run; the other 7 are hand-curated regression-pin fixtures from prior development whose source PDFs remain confidential and re-enter the corpus when their owners clear them.
By asset class
| Asset class | Fixtures | Examples |
|---|---|---|
| Multifamily (pure) | 11 | The Beverly (46u, SF), Sonoma Heights (60u, CO), Summit Portfolio (387u, AR), Vista Del Pacifico (61u, San Diego) |
| Mixed-use | 4 | Starboard SF SRO, Urban Capital, 9307 3rd Ave Brooklyn, 101 Dyckman ($16.25M institutional) |
| NNN net lease | 4 | Oregon DMV (single-tenant govt), Carson NV (two-tenant), University MN healthcare, Muirwood AZ medical office |
| Office / owner-user | 1 | Muirwood AZ (overlaps with NNN) |
| Industrial | 1 | Westbelt Dr Nashville |
| Teaser (no financials) | 2 | Lantana Culver City, Kirkland WA |
Pass rate — detailed
14 PASS of 15 runnable fixtures:
| Fixture | Asset class | Status |
|---|---|---|
| brooklyn_9307_3rd_ave | Mixed-use (NY) | PASS |
| carson_nv_two_tenant_nnn | NNN multi-tenant (NV) | PASS |
| chicago_31_unit_condo_portfolio | Multifamily condo (IL) | PASS |
| clovis_1228_jefferson_ave | Multifamily turnkey (CA) | PASS |
| dyckman_101_manhattan | Mixed-use institutional (NY) | PASS |
| equity_union_venice_los_angeles | Multifamily value-add (CA) | PASS |
| kirkland_teaser | Teaser (WA) | PASS |
| lantana_culver_city_teaser | Teaser (CA) | PASS |
| muirwood_az_medical_office_owner_user | Medical office condo (AZ) | PASS |
| philly_3800k_12_unit | Multifamily (PA) | FAIL: LLM non-determinism on borderline unit row; see P1 #11 |
| sonoma_heights_colorado_springs | Multifamily value-add (CO) | PASS |
| summit_portfolio_little_rock | Multifamily portfolio (AR) | PASS |
| university_mn_bhg_nnn_healthcare | NNN healthcare (MN) | PASS |
| valley_oregon_eugene_dmv_nnn | NNN govt single-tenant (OR) | PASS |
| westbelt_tn_nashville_industrial | Industrial owner-user (TN) | PASS |
Defect log
The full list of defects the regression suite caught is published alongside this benchmark. Each entry includes fixture, symptom, root cause, suggested fix, and tracking status.
- P1 — accuracy bugs: 4 total. 1 FIXED (OS extractor percentage parsing). 1 mitigated with tolerance + tracking (philly LLM variance). 2 open (Brooklyn commercial classifier, Clovis cap-rate variance).
- P2 — harness limitation: 1 (short-document deal_summary threshold).
- P3 — NNN/lease-abstract coverage gaps: 4 (OS synthesis from lease abstract, deal_summary on NNN, taxonomy, section finder labeling).
- P4 — goldens ergonomics: 2 (date comparison, model field documentation).
The Brooklyn commercial-classifier defect is the most material remaining accuracy bug; the others are coverage gaps that depress validation breadth but do not produce wrong numbers — they produce missing-field signals downstream consumers can detect.
Unit economics
Per fixture, observed across the 50-minute total run:
- ~3:15 wall-clock per fixture (range: 30s for teasers, 6+ min for institutional)
- ~$2.00 spend per fixture (Reducto OCR + Anthropic LLM calls combined)
- Cost-per-correct-document: $2.14 (= $30 / 14 pass)
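The figures above follow from a few divisions, reproduced here as a sanity check. The inputs are the numbers stated in this document; the variable names are illustrative. The $2.14 figure uses the upper end of the ~$15–30 run cost, so it is a conservative bound; the lower end would give about $1.07.

```python
runnable_fixtures = 15
passing_fixtures = 14
avg_minutes_per_fixture = 3 + 15 / 60  # ~3:15 wall-clock per fixture
run_cost_upper_usd = 30.0              # upper end of the ~$15-30 range

total_minutes = runnable_fixtures * avg_minutes_per_fixture
pass_rate = passing_fixtures / runnable_fixtures
cost_per_fixture = run_cost_upper_usd / runnable_fixtures
cost_per_correct = run_cost_upper_usd / passing_fixtures

print(f"total run time: {total_minutes:.2f} min")
print(f"pass rate: {pass_rate:.1%}")
print(f"cost per fixture: ${cost_per_fixture:.2f}")
print(f"cost per correct document: ${cost_per_correct:.2f}")
```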
For comparison: a junior analyst underwriting the same 22 OMs by hand, including the time to re-check every cell against the source, would spend 3–6 hours per OM. The accuracy bar that matters is "what fraction of the rechecks find errors", which is what this benchmark answers.
What's not measured (be honest)
- Per-cell source citation accuracy. The pipeline produces page-level citations for some fields; per-cell bounding-box accuracy is not yet measured. The architecture supports it; the regression coverage doesn't exist yet.
- Drift over time. This is run 2; the first run was on the same date. No production-window canary is wired up yet to catch model-side drift between runs.
- Comparable accuracy by deal archetype subgroup. With 15 fixtures across 5 asset classes (1–11 per class), per-archetype confidence intervals are not statistically meaningful. Tier-2 sourcing to 30+ fixtures would close this.
- Adversarial cases. No deliberately-corrupted PDFs (OCR-broken, half-scanned, malformed-table) in the corpus. Robustness floor is unmeasured.
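To make the small-sample caveat concrete, the sketch below computes a Wilson score interval for a pass rate. The Wilson interval is one common choice for small samples and is an assumption of this example; the benchmark itself does not report any interval. The counts are the document's own: 14 of 15 runnable fixtures overall, and a class with a single passing fixture.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for label, passes, n in [("all runnable fixtures", 14, 15), ("a one-fixture class", 1, 1)]:
    low, high = wilson_interval(passes, n)
    print(f"{label}: {passes}/{n} passing, interval {low:.2f} to {high:.2f}")
```

For 14 of 15 the interval runs from roughly 0.70 to 0.99, and for a single passing fixture from roughly 0.21 to 1.00, which is why per-class numbers are not quoted here.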
How a buyer verifies
The regression suite is the artifact. A buyer's diligence team can:
- Receive a repo snapshot under NDA
- Drop their own 5–10 OMs into apps/api/tests/regression/fixtures/<their_slug>/source.pdf with a minimal expected.json derived from the OM's headline figures
- Run pytest tests/regression/ against their fixtures with their own API keys
- Inspect the assertion failures (if any) and the validation-check status output
The harness emits the full per-fixture diff in one block per fixture, so a buyer can read "expected NOI 234,835, got 236,118 (within tolerance)" line-by-line. No black-box claims; the assertion logic is in committed Python.
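Setting up a fixture directory for one of the buyer's own OMs could look like the sketch below. The `scaffold_fixture` helper is hypothetical, and so are the keys and values in `headline`: the real expected.json schema is defined by the committed fixtures, which should be mirrored instead. A temporary directory stands in for apps/api/tests/regression/fixtures so the example runs anywhere.

```python
import json
import tempfile
from pathlib import Path


def scaffold_fixture(fixtures_root: Path, slug: str, headline: dict) -> Path:
    """Create fixtures_root/<slug>/ with a minimal expected.json and return its path.

    The source PDF is not created here; copy the OM to <slug>/source.pdf by hand.
    """
    fixture_dir = fixtures_root / slug
    fixture_dir.mkdir(parents=True, exist_ok=True)
    expected_path = fixture_dir / "expected.json"
    expected_path.write_text(json.dumps(headline, indent=2) + "\n", encoding="utf-8")
    return expected_path


# Hypothetical keys and made-up values; mirror a committed expected.json instead.
headline = {
    "deal_summary": {"property_name": "Example Apartments", "unit_count": 24},
    "operating_statement": {"noi": 250_000},
}

with tempfile.TemporaryDirectory() as tmp:  # stand-in for the real fixtures directory
    path = scaffold_fixture(Path(tmp), "example_24_unit", headline)
    written = json.loads(path.read_text(encoding="utf-8"))
    print(path.name, written["deal_summary"]["unit_count"])
```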
Try the pipeline yourself
The same validation registry exercised by this benchmark is exposed as a callable API. Send pre-extracted JSON or a source PDF; receive a validation report with per-check status, affected fields, and the structured data payload your UI needs to render cell-level flags.
```bash
curl -sS https://api.rentrolliq.com/v1/checks \
  -H "X-API-Key: <issued on request>" | jq '.checks_catalog_size'
# => 42
```
- OpenAPI spec: docs/monetization/openapi.yaml
- Integration cookbook (curl + Python + sample request/response): docs/monetization/examples/
- Request a key: support@rentrolliq.com
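The same call can be made from Python using only the standard library. This is a sketch rather than an excerpt from the integration cookbook: the function names are illustrative, and it assumes the response is JSON with a top-level `checks_catalog_size` field, as the jq filter in the curl example implies.

```python
import json
import urllib.request

CHECKS_URL = "https://api.rentrolliq.com/v1/checks"  # endpoint from the curl example


def build_checks_request(api_key: str) -> urllib.request.Request:
    """Build the GET request for the checks catalog without sending it."""
    return urllib.request.Request(CHECKS_URL, headers={"X-API-Key": api_key}, method="GET")


def fetch_catalog_size(api_key: str, timeout: float = 10.0) -> int:
    """Send the request and return checks_catalog_size from the JSON body."""
    with urllib.request.urlopen(build_checks_request(api_key), timeout=timeout) as resp:
        return json.load(resp)["checks_catalog_size"]


# Building the request needs no network access; fetch_catalog_size needs a real key.
request = build_checks_request("<issued on request>")
print(request.get_method(), request.full_url)
```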