Benchmarking & Accuracy
FHIR4DS is tested against the official CMS eCQM test bundles from the ecqm-content-qicore-2025 package. These are the same test datasets used to validate conformant clinical reasoning engines.
1. Accuracy Results
FHIR4DS achieves 100% spec compliance across all test suites, with 2,822 passing checks in total across the specification tests and eCQM measures.
| Metric | Result |
|---|---|
| Spec Compliance (CQL) | 100% (1,706 / 1,706 tests) |
| Spec Compliance (FHIRPath) | 100% (935 / 935 tests) |
| Spec Compliance (SQL-on-FHIR) | 100% (134 / 134 tests) |
| eCQM Measures | 100% pass rate (47 / 47) |
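The headline total can be reproduced from the rows of the table. The sketch below only re-adds the published counts; the variable names are illustrative and not part of FHIR4DS.

```python
# Counts copied from the results table above.
spec_tests = {
    "CQL": 1706,
    "FHIRPath": 935,
    "SQL-on-FHIR": 134,
}
ecqm_measures = 47

# Specification tests only.
spec_total = sum(spec_tests.values())

# The 2,822 headline figure matches spec tests plus eCQM measures.
grand_total = spec_total + ecqm_measures

print(f"spec tests: {spec_total:,}")                    # spec tests: 2,775
print(f"spec tests + eCQM measures: {grand_total:,}")  # spec tests + eCQM measures: 2,822
```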
Known Upstream Accuracy Gaps
Four measures have documented accuracy gaps caused by bugs in the official CMS test bundles rather than by FHIR4DS implementation errors. These measures fail equally in all conformant engines.
| Measure | Topic | Issue in Upstream Test Data |
|---|---|---|
| CMS135 | Heart Failure | MADIE-2124: MeasureReport has denominator-exception=0 for DENEXCEPPass test cases |
| CMS145 | Coronary Artery Disease | MADIE-2124: same pattern as CMS135 |
| CMS157 | Oncology | Test data has 2025 encounter dates, but the measurement period is 2026 |
| CMS1017 | Palliative Care | Non-UUID IDs, contradictory MeasureReports, missing valueset codes |
2. Performance & Throughput
By leveraging a SQL-native, vectorized architecture, FHIR4DS evaluates measures roughly two orders of magnitude faster per patient than a traditional engine in the comparison below.
Head-to-Head: FHIR4DS vs. Java Reference Engine
We compared FHIR4DS against the industry-standard Java Clinical Reasoning engine using 12 shared measures that achieved 100% execution success in both environments.
| Metric | Traditional Engine (Java) | FHIR4DS (SQL Native) | Speedup |
|---|---|---|---|
| Mean Execution/Patient | ~936ms | ~6.9ms | ~137× |
| Median Execution/Patient | ~819ms | ~1.9ms | ~425× |
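The Speedup column is the ratio of the two per-patient timings. Here is a minimal sketch of that arithmetic using the rounded values from the table; `timings_ms` and `speedup` are illustrative names, not part of FHIR4DS.

```python
# Rounded per-patient timings from the table above, in milliseconds.
timings_ms = {
    "mean": {"java": 936.0, "fhir4ds": 6.9},
    "median": {"java": 819.0, "fhir4ds": 1.9},
}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


ratios = {
    name: speedup(t["java"], t["fhir4ds"]) for name, t in timings_ms.items()
}

for name, ratio in ratios.items():
    print(f"{name}: ~{ratio:.0f}x")
```

The rounded inputs give about 136× (mean) and 431× (median), close to the ~137× and ~425× reported. The small differences presumably arise because the table was computed from unrounded measurements.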
Scalability
The speedup reflects the architectural difference: traditional engines evaluate each patient sequentially, whereas FHIR4DS runs a single columnar SQL query that processes the entire population at once. As a result, the marginal cost of adding patients to a cohort is close to zero.
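The difference in query shape can be illustrated with a toy example. This is a sketch under stated assumptions, not FHIR4DS code: the `encounter` table, its columns, and both helper functions are hypothetical, and SQLite (a row-oriented engine from the Python standard library) stands in for the columnar engine only to show per-patient versus whole-population evaluation.

```python
import sqlite3

# Hypothetical, simplified schema: one row per encounter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE encounter (patient_id TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO encounter VALUES (?, ?)",
    [("p1", 2025), ("p1", 2025), ("p2", 2024), ("p3", 2025)],
)
patients = ["p1", "p2", "p3"]


def per_patient(conn, patients, year):
    """Sequential style: issue one query per patient."""
    result = {}
    for pid in patients:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM encounter WHERE patient_id = ? AND year = ?",
            (pid, year),
        ).fetchone()
        result[pid] = count > 0
    return result


def set_based(conn, patients, year):
    """Set-based style: issue one query for the entire cohort."""
    rows = conn.execute(
        "SELECT DISTINCT patient_id FROM encounter WHERE year = ?",
        (year,),
    ).fetchall()
    qualifying = {pid for (pid,) in rows}
    return {pid: pid in qualifying for pid in patients}


sequential = per_patient(conn, patients, 2025)
population = set_based(conn, patients, 2025)

# Both styles produce the same answer; only the number of queries differs.
print(sequential == population)  # True
```

The per-patient version issues one query per cohort member, while the set-based version issues a single query regardless of cohort size.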
3. Measures Tested
The standard benchmark suite covers 47 CMS eCQMs from the QI-Core 2025 content package, including:
- CMS74 — Primary Caries Prevention
- CMS75 — Children with Dental Decay
- CMS124 — Cervical Cancer Screening
- CMS130 — Colorectal Cancer Screening
- CMS159 — Depression Remission
- CMS349 — HIV Screening
- ... and 41 additional measures.
4. Running Benchmarks Locally
To verify these results in your own environment, you can run the benchmark suite directly from the repository:
    # Navigate to the benchmarking directory
    cd benchmarks
    # Run the full 2025 QI-Core suite
    python -m runner --suite 2025 --skip-errors
5. CI Performance Reports
The repository also includes a DQM timing report workflow for tracking
performance changes across the 2025 measure suite. The workflow runs the DQM
conformance suite, compares conformance/reports/dqm_report.json to the
checked-in baseline at benchmarks/baselines/dqm_2025.json, and uploads JSON
and Markdown reports as GitHub Actions artifacts.
To generate the same report locally:
    python3 conformance/scripts/run_dqm.py
    python3 benchmarks/runner/dqm_perf_report.py \
        --current conformance/reports/dqm_report.json \
        --baseline benchmarks/baselines/dqm_2025.json \
        --output-json benchmarks/output/dqm-performance-report.json \
        --output-md benchmarks/output/dqm-performance-report.md
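Conceptually, a baseline comparison of this kind checks each measure's current timing against its recorded baseline. The sketch below is illustrative only: the real schema of dqm_report.json and the logic inside dqm_perf_report.py are not shown in this document, so the flat measure-to-milliseconds mapping, the `compare_timings` helper, and the 10% tolerance are all assumptions.

```python
def compare_timings(current, baseline, tolerance=0.10):
    """Flag measures whose timing grew by more than `tolerance` (a fraction)."""
    regressions = {}
    for measure, base_ms in baseline.items():
        now_ms = current.get(measure)
        if now_ms is None or base_ms <= 0:
            # Measure missing from the current run, or unusable baseline value.
            continue
        change = (now_ms - base_ms) / base_ms
        if change > tolerance:
            regressions[measure] = round(change, 3)
    return regressions


# Hypothetical data: measure ID -> milliseconds per patient.
baseline = {"CMS124": 2.0, "CMS130": 4.0}
current = {"CMS124": 2.1, "CMS130": 5.0}

regressions = compare_timings(current, baseline)
print(regressions)  # {'CMS130': 0.25}
```

Here CMS124 slowed by about 5% and stays within tolerance, while CMS130 slowed by 25% and is flagged.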
See the repository's CONTRIBUTING.md and .github/CI.md files for workflow behavior and baseline update policy.