PAYORLENS
AI Governance Evaluation Report · Model: logistic · 2026-04-20 17:36
Dataset: CMS DE-SynPUF Inpatient Claims · Architecture v2.0 · NIST AI RMF aligned
Section 0 · Executive Risk Brief

Overall Governance Risk Assessment

Risk Level: 🔴 RED
Total Findings: 11 (all risk levels)
Critical: 4 (require immediate action)
High: 0 (require remediation)
Medium / Low: 3 / 4 (monitor)
Recommendation: DO NOT DEPLOY. All 4 critical findings must be remediated before this model is used in any coverage decision workflow.
Top Findings:
Finding 1: Severely miscalibrated model (Brier=0.236). Confidence scores are unreliable. Autonomous use in coverage decisions is unjustifiable under CMS-0057-F explainability requirements.…
Finding 2: Severe denial rate disparity (DPD=0.286) across race cohorts (p<0.0001). This surpasses the NAIC Adverse Impact Ratio threshold implication (0.80–1.25 bounds). A payer using this model in prior auth o…
Finding 3: Severe denial rate disparity (DPD=0.446) across age_band cohorts (p<0.0001). This surpasses the NAIC Adverse Impact Ratio threshold implication (0.80–1.25 bounds). A payer using this model in prior au…

Section 1 · NIST AI RMF Compliance Mapping

Every evaluated metric is mapped to its corresponding NIST AI RMF function and subcategory. This table is the regulatory spine of the report.

Metric | NIST Function | Subcategory | Status
Data Quality (Validation Error Rate) | Map | MP-2.3 | 🟢 LOW
Calibration (Brier Score + High-Conf Errors) | Measure | MS-2.3 | 🔴 CRITICAL
DPD (race) | Measure | MS-2.5 | 🔴 CRITICAL
DPD (gender) | Measure | MS-2.5 | 🟡 MEDIUM
DPD (age_band) | Measure | MS-2.5 | 🔴 CRITICAL
DPD (state_code) | Measure | MS-2.5 | 🔴 CRITICAL
Robustness — ICD9 primary diagnosis code corrupted (wrong/invalid code injected) | Measure | MS-2.6 | 🟡 MEDIUM
Robustness — Prior auth required fields nulled (diagnosis + utilization days) | Measure | MS-2.6 | 🟢 LOW
Robustness — Age band shifted up one tier (member enrollment data lag) | Measure | MS-2.6 | 🟢 LOW
Robustness — Claim amounts 10x-inflated on 5% of records (high-cost outliers) | Measure | MS-2.6 | 🟢 LOW
Robustness — Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift | Measure | MS-2.6 | 🟡 MEDIUM
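
The status column above can be tallied to reproduce the counts in the Executive Risk Brief. The sketch below is illustrative only: the rule that any CRITICAL finding blocks deployment is an assumption inferred from the recommendation text, and `blocks_deployment` is a hypothetical helper, not part of the PayorLens harness.

```python
from collections import Counter

# Status of each of the 11 metrics, in the order they appear in the mapping table.
statuses = [
    "LOW",       # Data Quality
    "CRITICAL",  # Calibration
    "CRITICAL",  # DPD (race)
    "MEDIUM",    # DPD (gender)
    "CRITICAL",  # DPD (age_band)
    "CRITICAL",  # DPD (state_code)
    "MEDIUM",    # Robustness: ICD9 code corrupted
    "LOW",       # Robustness: required fields nulled
    "LOW",       # Robustness: age band shifted
    "LOW",       # Robustness: claim amounts inflated
    "MEDIUM",    # Robustness: combined scenario
]

counts = Counter(statuses)


def blocks_deployment(status_counts):
    """Assumed rule: one or more CRITICAL findings blocks deployment."""
    return status_counts["CRITICAL"] > 0


print(dict(counts), "blocks deployment:", blocks_deployment(counts))
```

The tally gives 4 critical, 0 high, 3 medium and 4 low findings, matching the 11 findings reported in Section 0.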

Section 2 · Data Quality Findings

Total Records: 66,718 (after merge & normalisation)
Validation Errors: 0 (Pydantic schema failures)
Error Rate: 0.00% (i.e. 100.00% schema compliance)

Data contract enforced via Pydantic v2 schema validation. Each record is validated against the ClaimRecord model with field-level type coercion and range checks. Note on CMS DE-SynPUF: SP_STATE_CODE and CLM_ID are integer-typed in the raw CMS files, so they are coerced to str in the normalisation layer before schema validation.
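
The normalisation step described above might look roughly like the following. This is a minimal stdlib-only sketch, not the harness code: `normalise_identifiers` is a hypothetical helper, the sample row values are made up, and the real ClaimRecord schema (which is not shown in this report) is validated with Pydantic afterwards.

```python
# Fields the report says arrive integer-typed in the raw CMS DE-SynPUF files.
RAW_INT_FIELDS = ("SP_STATE_CODE", "CLM_ID")


def normalise_identifiers(raw):
    """Return a copy of `raw` with the identifier fields cast to str.

    The input dict is left untouched so the raw record stays available
    for lineage and audit purposes.
    """
    out = dict(raw)
    for name in RAW_INT_FIELDS:
        value = out.get(name)
        if value is not None and not isinstance(value, str):
            out[name] = str(value)
    return out


# Made-up example row; values are illustrative, not taken from the dataset.
row = {"SP_STATE_CODE": 5, "CLM_ID": 196661176988405, "CLM_PMT_AMT": 4000.0}
clean = normalise_identifiers(row)
print(clean)
```

Casting before validation keeps the schema strict (string-typed identifiers) while still accepting the integer representation used in the raw files.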

Section 3 · Model Performance Overview

Accuracy: 0.586 (overall correctness)
F1 Score: 0.379 (precision-recall balance)
ROC-AUC: 0.632 (discrimination ability)
Brier Score: 0.236 (calibration quality)

High-confidence errors (>85% confidence, wrong prediction): 8 of 16,680 test records (0.05%). These are the 'dangerous prediction' events: the model was highly confident AND wrong. In a prior auth workflow that automates decisions above a confidence threshold, the false denials among them would become adverse determinations without human review.
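
For readers unfamiliar with the two calibration figures, the sketch below shows one way they could be computed. It is an illustration under stated assumptions, not the harness implementation: label 1 is taken to mean "denied", the predicted class is chosen at a 0.5 cut-off, and the toy labels and probabilities are invented.

```python
def brier_score(y_true, p_pos):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pos)) / len(y_true)


def high_confidence_errors(y_true, p_pos, threshold=0.85):
    """Indices where the predicted class is wrong with confidence > threshold."""
    hits = []
    for i, (y, p) in enumerate(zip(y_true, p_pos)):
        pred = 1 if p >= 0.5 else 0
        # Confidence is the probability assigned to the predicted class.
        confidence = p if pred == 1 else 1.0 - p
        if pred != y and confidence > threshold:
            hits.append(i)
    return hits


# Toy data (invented): 1 = denied, 0 = approved.
y = [1, 0, 0, 1]
p = [0.90, 0.95, 0.10, 0.40]

print("Brier:", brier_score(y, p))
print("High-confidence errors at indices:", high_confidence_errors(y, p))

# Reference point: a constant 0.5 prediction always scores 0.25.
print("Constant-0.5 baseline:", brier_score([0, 1, 1, 0], [0.5] * 4))
```

The constant-0.5 baseline of 0.25 is a useful yardstick when reading the reported Brier score of 0.236.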

Section 4–5 · Fairness Audit & Risk Findings

Data Quality (Validation Error Rate) 🟢 LOW
Data validation error rate of 0.00% — excellent. Pydantic schema enforcement is effective on this dataset.
Recommended Action: Maintain current data contract. Re-validate on any schema change.
NIST AI RMF: Map › MP-2.3 — Data provenance, quality, and lineage  |  Total records evaluated: 66,718
Calibration (Brier Score + High-Conf Errors) 🔴 CRITICAL
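Severely miscalibrated model (Brier=0.236). For reference, a constant 0.5 prediction scores 0.25, so the model's probabilities are only marginally better than an uninformative guess. Confidence scores are unreliable. Autonomous use in coverage decisions is unjustifiable under CMS-0057-F explainability requirements.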
Recommended Action: Do not use model confidence for routing logic. Full recalibration required.
NIST AI RMF: Measure › MS-2.3 — AI output reliability and uncertainty quantification  |  High-confidence errors (>85% conf, wrong): 8/16,680
DPD (race) 🔴 CRITICAL
Severe denial rate disparity (DPD=0.286) across race cohorts (p<0.0001). The ratio of the lowest to the highest cohort denial rate (0.414/0.701 ≈ 0.59) falls outside the NAIC Adverse Impact Ratio bounds (0.80–1.25). A payer using this model in prior auth or claims adjudication faces litigation exposure comparable to the UnitedHealth nH Predict and Cigna PxDx class-action pattern.
Recommended Action: Do NOT deploy in production. Full model audit required. Independent third-party re-validation recommended before any live use.
NIST AI RMF: Measure › MS-2.5 — Bias and fairness testing across demographic cohorts  |  chi-square p-value<0.0001, significant=True
DPD (gender) 🟡 MEDIUM
Meaningful denial rate disparity (DPD=0.054) across gender cohorts (p<0.0001). Under NIST AI RMF MS-2.5, this constitutes a measurable fairness gap. In a prior auth workflow, this pattern would attract NAIC unfair discrimination scrutiny and requires documented mitigation.
Recommended Action: Investigate root cause in training data. Rebalance cohort representation or apply post-processing fairness constraint. Document remediation steps.
NIST AI RMF: Measure › MS-2.5 — Bias and fairness testing across demographic cohorts  |  chi-square p-value<0.0001, significant=True
DPD (age_band) 🔴 CRITICAL
Severe denial rate disparity (DPD=0.446) across age_band cohorts (p<0.0001). The ratio of the lowest to the highest cohort denial rate (0.412/0.858 ≈ 0.48) falls outside the NAIC Adverse Impact Ratio bounds (0.80–1.25). A payer using this model in prior auth or claims adjudication faces litigation exposure comparable to the UnitedHealth nH Predict and Cigna PxDx class-action pattern.
Recommended Action: Do NOT deploy in production. Full model audit required. Independent third-party re-validation recommended before any live use.
NIST AI RMF: Measure › MS-2.5 — Bias and fairness testing across demographic cohorts  |  chi-square p-value<0.0001, significant=True
DPD (state_code) 🔴 CRITICAL
Severe denial rate disparity (DPD=0.550) across state_code cohorts (p<0.0001). The ratio of the lowest to the highest cohort denial rate (0.061/0.611 ≈ 0.10) falls outside the NAIC Adverse Impact Ratio bounds (0.80–1.25). Both extremes come from small cohorts (ND, n=33; DC, n=36), so the size of this gap is a noisy estimate. A payer using this model in prior auth or claims adjudication faces litigation exposure comparable to the UnitedHealth nH Predict and Cigna PxDx class-action pattern.
Recommended Action: Do NOT deploy in production. Full model audit required. Independent third-party re-validation recommended before any live use.
NIST AI RMF: Measure › MS-2.5 — Bias and fairness testing across demographic cohorts  |  chi-square p-value<0.0001, significant=True
Robustness — ICD9 primary diagnosis code corrupted (wrong/invalid code injected) 🟡 MEDIUM
Moderate F1 degradation (7.7%) under 'ICD9 primary diagnosis code corrupted (wrong/invalid code injected)' at 20% corruption. Crosses warning threshold (>5%). In real payer workflows, data quality issues at this rate are common (incomplete PA submissions, missing documentation). Performance will degrade in production without pipeline quality controls.
Recommended Action: Implement upstream data validation. Add fallback to human review when key fields are missing.
NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation  |  Baseline F1=0.379 → Degraded F1=0.349 (decay=7.7%) | Warning >5% | Danger >20%
Robustness — Prior auth required fields nulled (diagnosis + utilization days) 🟢 LOW
Model shows negligible performance degradation (1.6% F1 decay) under 'Prior auth required fields nulled (diagnosis + utilization days)' at 20% corruption rate. Robust to this failure mode under NIST AI RMF MS-2.6.
Recommended Action: No action required. Document robustness test result for governance trail.
NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation  |  Baseline F1=0.379 → Degraded F1=0.373 (decay=1.6%) | Warning >5% | Danger >20%
Robustness — Age band shifted up one tier (member enrollment data lag) 🟢 LOW
Model shows negligible performance degradation (0.0% F1 decay) under 'Age band shifted up one tier (member enrollment data lag)' at 10% corruption rate. Robust to this failure mode under NIST AI RMF MS-2.6.
Recommended Action: No action required. Document robustness test result for governance trail.
NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation  |  Baseline F1=0.379 → Degraded F1=0.379 (decay=0.0%) | Warning >5% | Danger >20%
Robustness — Claim amounts 10x-inflated on 5% of records (high-cost outliers) 🟢 LOW
Model shows negligible performance degradation (1.4% F1 decay) under 'Claim amounts 10x-inflated on 5% of records (high-cost outliers)' at 5% corruption rate. Robust to this failure mode under NIST AI RMF MS-2.6.
Recommended Action: No action required. Document robustness test result for governance trail.
NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation  |  Baseline F1=0.379 → Degraded F1=0.374 (decay=1.4%) | Warning >5% | Danger >20%
Robustness — Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift 🟡 MEDIUM
Moderate F1 degradation (6.5%) under 'Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift' at 15% corruption. Crosses warning threshold (>5%). In real payer workflows, data quality issues at this rate are common (incomplete PA submissions, missing documentation). Performance will degrade in production without pipeline quality controls.
Recommended Action: Implement upstream data validation. Add fallback to human review when key fields are missing.
NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation  |  Baseline F1=0.379 → Degraded F1=0.354 (decay=6.5%) | Warning >5% | Danger >20%

Fairness — Race

🔴 CRITICAL
DPD=0.2861  |  EOD=0.2202  |  χ² p<0.0001  |  ✅ Significant
Cohort | Count | Denial Rate | F1 | Note
White | 14,065 | 0.423 | 0.369 |
Black | 1,797 | 0.701 | 0.431 |
Hispanic | 321 | 0.611 | 0.339 |
Other | 497 | 0.414 | 0.408 |
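
The reported DPD is consistent with the gap between the highest and lowest cohort denial rates. The sketch below recomputes it from the rounded rates in the race table and also derives a lowest-to-highest rate ratio for comparison against the 0.80–1.25 bounds. The function names are hypothetical, and whether the harness computes its ratio on denial rates or approval rates is not stated in this report.

```python
# Denial rates as printed in the race table (rounded to three decimals).
race_denial_rate = {
    "White": 0.423,
    "Black": 0.701,
    "Hispanic": 0.611,
    "Other": 0.414,
}


def demographic_parity_difference(rates):
    """Largest minus smallest cohort rate."""
    return max(rates.values()) - min(rates.values())


def min_max_ratio(rates):
    """Smallest cohort rate divided by the largest."""
    return min(rates.values()) / max(rates.values())


dpd = demographic_parity_difference(race_denial_rate)
ratio = min_max_ratio(race_denial_rate)
print(f"DPD={dpd:.3f}  min/max ratio={ratio:.2f}")
```

Recomputing from rounded rates gives a value close to, but not exactly, the reported 0.2861, which was presumably computed from unrounded rates.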

Fairness — Gender

🟡 MEDIUM
DPD=0.0539  |  EOD=0.0546  |  χ² p<0.0001  |  ✅ Significant
Cohort | Count | Denial Rate | F1 | Note
Female | 9,425 | 0.480 | 0.382 |
Male | 7,255 | 0.426 | 0.375 |
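
A small gap can still be statistically significant when cohorts are this large. The sketch below runs a Pearson chi-square test on the gender table using only the standard library. It rests on several assumptions: denied counts are reconstructed from the rounded denial rates, no continuity correction is applied, and the harness may use a different routine, so the statistic is approximate.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table.

    With one degree of freedom the chi-square survival function reduces to
    erfc(sqrt(stat / 2)), so no external statistics library is needed.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Denied counts reconstructed from the rounded rates in the gender table.
female_n, male_n = 9425, 7255
female_denied = round(female_n * 0.480)
male_denied = round(male_n * 0.426)

table = [
    [female_denied, female_n - female_denied],
    [male_denied, male_n - male_denied],
]
stat, p_value = chi_square_2x2(table)
print(f"chi2={stat:.1f}  p={p_value:.2e}")
```

Under these assumptions the p-value is far below 0.0001, which is why a p-value printed to four decimals shows as zero.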

Fairness — Age Band

🔴 CRITICAL
DPD=0.4459  |  EOD=0.3814  |  χ² p<0.0001  |  ✅ Significant
Cohort | Count | Denial Rate | F1 | Note
65+ | 13,898 | 0.412 | 0.357 |
50-64 | 1,773 | 0.595 | 0.418 |
35-49 | 776 | 0.812 | 0.490 |
18-34 | 233 | 0.858 | 0.493 |

Fairness — State Code

🔴 CRITICAL
DPD=0.5505  |  EOD=0.8182  |  χ² p<0.0001  |  ✅ Significant
Cohort | Count | Denial Rate | F1 | Note
AR | 209 | 0.445 | 0.308 |
CA | 1,277 | 0.435 | 0.389 |
WV | 431 | 0.390 | 0.320 |
NJ | 559 | 0.599 | 0.383 |
KS | 202 | 0.386 | 0.351 |
FL | 1,176 | 0.495 | 0.356 |
OH | 718 | 0.421 | 0.364 |
AL | 335 | 0.460 | 0.364 |
UT | 1,193 | 0.455 | 0.366 |
IN | 415 | 0.439 | 0.373 |
NC | 534 | 0.511 | 0.399 |
NY | 1,044 | 0.485 | 0.367 |
TX | 364 | 0.527 | 0.413 |
MO | 427 | 0.473 | 0.377 |
ME | 75 | 0.240 | 0.167 |
WI | 263 | 0.365 | 0.371 |
IL | 756 | 0.468 | 0.405 |
MS | 251 | 0.514 | 0.373 |
SD | 304 | 0.493 | 0.381 |
OR | 131 | 0.298 | 0.355 |
PR | 346 | 0.312 | 0.429 |
ID | 73 | 0.233 | 0.167 |
WY | 152 | 0.382 | 0.268 |
OK | 258 | 0.357 | 0.317 |
LA | 293 | 0.495 | 0.415 |
MI | 601 | 0.464 | 0.347 |
GA | 494 | 0.549 | 0.395 |
KY | 332 | 0.503 | 0.496 |
MA | 382 | 0.503 | 0.443 |
VT | 67 | 0.388 | 0.439 |
AZ | 264 | 0.409 | 0.364 |
ND | 33 | 0.061 | 0.000 |
MD | 327 | 0.578 | 0.424 |
Unknown | 144 | 0.500 | 0.255 |
NV | 99 | 0.545 | 0.447 |
PA | 682 | 0.520 | 0.415 |
CT | 196 | 0.378 | 0.262 |
MN | 258 | 0.314 | 0.351 |
CO | 184 | 0.451 | 0.409 |
IA | 196 | 0.291 | 0.409 |
TN | 51 | 0.157 | 0.300 |
DE | 59 | 0.542 | 0.273 |
AK | 20 | 0.350 | 0.500 |
NE | 94 | 0.553 | 0.343 |
NH | 79 | 0.570 | 0.394 |
NM | 89 | 0.213 | 0.129 |
HI | 49 | 0.245 | 0.480 |
SC | 62 | 0.258 | 0.323 |
DC | 36 | 0.611 | 0.545 |
MT | 59 | 0.102 | 0.429 |
VA | 37 | 0.432 | 0.500 |

Section 6 · Robustness Stress Test — Clinical Failure Scenarios

Baseline F1: 0.3787  ·  🟢 Safe <5%  |  🟡 Warning 5–10%  |  🟠 High 10–20%  |  🔴 Danger >20% F1 decay

Scenario | Rate | Baseline F1 | Degraded F1 | Decay % | Threshold | Risk
ICD9 primary diagnosis code corrupted (wrong/invalid code injected) | 20% | 0.379 | 0.349 | 7.7% | ⚠️ Warning | 🟡 MEDIUM
Prior auth required fields nulled (diagnosis + utilization days) | 20% | 0.379 | 0.373 | 1.6% | ✅ OK | 🟢 LOW
Age band shifted up one tier (member enrollment data lag) | 10% | 0.379 | 0.379 | 0.0% | ✅ OK | 🟢 LOW
Claim amounts 10x-inflated on 5% of records (high-cost outliers) | 5% | 0.379 | 0.374 | 1.4% | ✅ OK | 🟢 LOW
Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift | 15% | 0.379 | 0.354 | 6.5% | ⚠️ Warning | 🟡 MEDIUM
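
The decay bands in the legend can be expressed as a small classifier. This is a sketch under assumptions: the function names are hypothetical, and the handling of exact boundary values (5%, 10%, 20%) is a guess, since the legend does not say which band they fall in.

```python
def f1_decay_pct(baseline_f1, degraded_f1):
    """Relative F1 loss, as a percentage of the baseline."""
    return 100.0 * (baseline_f1 - degraded_f1) / baseline_f1


def risk_tier(decay_pct):
    """Map an F1 decay percentage onto the legend's four bands."""
    if decay_pct < 5.0:
        return "Safe"
    if decay_pct < 10.0:
        return "Warning"
    if decay_pct <= 20.0:
        return "High"
    return "Danger"


# Decay percentages as reported for the five scenarios.
reported_decay = {
    "ICD9 code corrupted": 7.7,
    "Required fields nulled": 1.6,
    "Age band shifted": 0.0,
    "Claim amounts inflated": 1.4,
    "Combined": 6.5,
}

for scenario, decay in reported_decay.items():
    print(f"{scenario}: {decay}% -> {risk_tier(decay)}")
```

Recomputing decay from the three-decimal F1 values in the table can differ from the reported percentage by a tenth or two, because the harness presumably works from unrounded scores.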

Section 7 · Top 5 High-Confidence Failure Cases

The five predictions where the model was most confident AND wrong. In a prior auth automation workflow these become unreviewed adverse determinations. Each narrative includes patient profile, error type, confidence level, and governance implication.

Rank 1 — False Denial (model confidence: 92.7%)
Patient: Male, age 68, race: Black, state: NC
Chronic conditions: Diabetes, Congestive Heart Failure, COPD | Utilization: 97 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: At 92.7% confidence, this false denial would bypass human review in an automated prior auth workflow and become an unreviewed adverse determination. The patient belongs to the Black cohort, which has the highest denial rate in the fairness audit (0.701, against 0.423 for White).
Rank 2 — False Denial (model confidence: 91.1%)
Patient: Male, age 72, race: Black, state: GA
Chronic conditions: Diabetes, Congestive Heart Failure | Utilization: 79 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: At 91.1% confidence, this false denial would bypass human review in an automated prior auth workflow and become an unreviewed adverse determination. The patient belongs to the Black cohort, which has the highest denial rate in the fairness audit (0.701, against 0.423 for White).
Rank 3 — False Denial (model confidence: 90.7%)
Patient: Female, age 79, race: White, state: IL
Chronic conditions: Diabetes, Congestive Heart Failure, COPD | Utilization: 90 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: At 90.7% confidence, this false denial would bypass human review in an automated prior auth workflow and become an unreviewed adverse determination. The patient falls in the White and 65+ cohorts, whose F1 scores (0.369 and 0.357) are among the lowest in the fairness audit.
Rank 4 — False Denial (model confidence: 87.4%)
Patient: Male, age 88, race: White, state: IN
Chronic conditions: Diabetes, Congestive Heart Failure, COPD, Cancer | Utilization: 66 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: At 87.4% confidence, this false denial would bypass human review in an automated prior auth workflow and become an unreviewed adverse determination. The patient falls in the White and 65+ cohorts, whose F1 scores (0.369 and 0.357) are among the lowest in the fairness audit.
Rank 5 — False Denial (model confidence: 87.4%)
Patient: Female, age 46, race: Black, state: NC
Chronic conditions: Diabetes, Congestive Heart Failure, COPD | Utilization: 36 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: At 87.4% confidence, this false denial would bypass human review in an automated prior auth workflow and become an unreviewed adverse determination. The patient falls in the Black cohort (denial rate 0.701) and the 35-49 age band (denial rate 0.812), both among the highest denial rates in the fairness audit.
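
The selection of these cases can be sketched as a filter followed by a sort. The example is illustrative only: the record layout (`predicted`, `actual`, `confidence` keys) and the function name are assumptions, and the sample records are invented apart from reusing two confidence values from the ranks above.

```python
def top_confident_errors(records, k=5):
    """Return the k wrong predictions with the highest model confidence."""
    wrong = [r for r in records if r["predicted"] != r["actual"]]
    return sorted(wrong, key=lambda r: r["confidence"], reverse=True)[:k]


# Invented sample predictions; only the structure matters here.
sample = [
    {"id": "a", "predicted": "DENIED", "actual": "APPROVED", "confidence": 0.927},
    {"id": "b", "predicted": "DENIED", "actual": "DENIED", "confidence": 0.990},
    {"id": "c", "predicted": "APPROVED", "actual": "DENIED", "confidence": 0.610},
    {"id": "d", "predicted": "DENIED", "actual": "APPROVED", "confidence": 0.874},
]

worst = top_confident_errors(sample, k=2)
for rank, rec in enumerate(worst, start=1):
    kind = "False Denial" if rec["predicted"] == "DENIED" else "False Approval"
    print(f"Rank {rank}: {kind} (model confidence: {rec['confidence']:.1%})")
```

Correct predictions are excluded before ranking, so a confident and correct case such as record "b" never appears in the list.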

Section 8 · Methodology & Robustness Threshold Legend

PayorLens AI Governance Evaluation Harness · Architecture v2.0 · Generated 2026-04-20 17:36 · NIST AI RMF aligned · Dataset: CMS DE-SynPUF (public, zero PHI) · This report is an independent evaluation artefact. It is not a state compliance document.