PAYORLENS
AI Governance Evaluation Report · Model: logistic · 2026-04-20 17:36
Dataset: CMS DE-SynPUF Inpatient Claims · Architecture v2.0 · NIST AI RMF aligned
Section 0 · Executive Risk Brief
Overall Governance Risk Assessment
| Category | Count | Note |
| Total Findings | 11 | All risk levels |
| Critical | 4 | Require immediate action |
| High | 0 | Require remediation |
| Medium / Low | 3 / 4 | Monitor |
Recommendation: DO NOT DEPLOY. All 4 critical findings must be remediated before this model is used in any coverage decision workflow.
Top Findings:
Finding 1: Severely miscalibrated model (Brier=0.236). Confidence scores are unreliable. Autonomous use in coverage decisions is unjustifiable under CMS-0057-F explainability requirements.…
Finding 2: Severe denial rate disparity (DPD=0.286) across race cohorts (p<0.0001). This implies an adverse impact ratio outside the NAIC Adverse Impact Ratio bounds (0.80–1.25). A payer using this model in prior auth o…
Finding 3: Severe denial rate disparity (DPD=0.446) across age_band cohorts (p<0.0001). This implies an adverse impact ratio outside the NAIC Adverse Impact Ratio bounds (0.80–1.25). A payer using this model in prior au…
Section 1 · NIST AI RMF Compliance Mapping
Every evaluated metric is mapped to its corresponding NIST AI RMF function
and subcategory. This table is the regulatory spine of the report.
| Metric | NIST Function | Subcategory | Status |
| Data Quality (Validation Error Rate) | Map | MP-2.3 | 🟢 LOW |
| Calibration (Brier Score + High-Conf Errors) | Measure | MS-2.3 | 🔴 CRITICAL |
| DPD (race) | Measure | MS-2.5 | 🔴 CRITICAL |
| DPD (gender) | Measure | MS-2.5 | 🟡 MEDIUM |
| DPD (age_band) | Measure | MS-2.5 | 🔴 CRITICAL |
| DPD (state_code) | Measure | MS-2.5 | 🔴 CRITICAL |
| Robustness: ICD9 primary diagnosis code corrupted (wrong/invalid code injected) | Measure | MS-2.6 | 🟡 MEDIUM |
| Robustness: Prior auth required fields nulled (diagnosis + utilization days) | Measure | MS-2.6 | 🟢 LOW |
| Robustness: Age band shifted up one tier (member enrollment data lag) | Measure | MS-2.6 | 🟢 LOW |
| Robustness: Claim amounts 10x-inflated on 5% of records (high-cost outliers) | Measure | MS-2.6 | 🟢 LOW |
| Robustness: Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift | Measure | MS-2.6 | 🟡 MEDIUM |
Section 2 · Data Quality Findings
| Metric | Value | Note |
| Total Records | 66,718 | After merge & normalisation |
| Validation Errors | 0 | Pydantic schema failures |
| Error Rate | 0.00% | Schema compliance rate |
The data contract is enforced via Pydantic v2 schema validation. Each record is validated against
the ClaimRecord model with field-level type coercion and range checks.
Note on CMS DE-SynPUF: SP_STATE_CODE and CLM_ID are integer-typed in the
raw CMS files and are coerced to str in the normalisation layer before schema validation.
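The coercion and range-check step described above can be sketched as follows. The report's harness uses a Pydantic v2 `ClaimRecord` model; this stand-in uses a standard-library dataclass so that it runs without dependencies. The `utilization_days` field, its lower bound, and the helper names `normalise` and `validation_error_rate` are assumptions for illustration, not the harness's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimRecord:
    """Stdlib stand-in for the report's Pydantic ClaimRecord (fields are hypothetical)."""
    clm_id: str
    sp_state_code: str
    utilization_days: int

    def __post_init__(self):
        # Range check: a negative day count is treated as a schema failure (assumed bound).
        if self.utilization_days < 0:
            raise ValueError("utilization_days must be >= 0")


def normalise(raw: dict) -> ClaimRecord:
    """Coerce the integer-typed CMS identifiers to str, then validate."""
    return ClaimRecord(
        clm_id=str(raw["CLM_ID"]),
        sp_state_code=str(raw["SP_STATE_CODE"]),
        utilization_days=int(raw["utilization_days"]),
    )


def validation_error_rate(rows: list) -> float:
    """Share of rows that fail coercion or range checks."""
    if not rows:
        return 0.0
    failures = 0
    for row in rows:
        try:
            normalise(row)
        except (KeyError, TypeError, ValueError):
            failures += 1
    return failures / len(rows)


# Toy rows (made up for illustration): the second one fails the range check.
rows = [
    {"CLM_ID": 542192281063886, "SP_STATE_CODE": 26, "utilization_days": 3},
    {"CLM_ID": 1, "SP_STATE_CODE": 5, "utilization_days": -1},
]
print(validation_error_rate(rows))
```

An error rate of 0.00% on 66,718 records, as reported here, means no row raised during this step.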
Section 3 · Model Performance Overview
| Metric | Value | What it measures |
| Accuracy | 0.586 | Overall correctness |
| F1 Score | 0.379 | Precision-Recall balance |
| ROC-AUC | 0.632 | Discrimination ability |
| Brier Score | 0.236 | Calibration quality |
High-confidence errors (>85% confidence, wrong prediction): 8 (0.05% of test set).
These are the 'dangerous prediction' events: the model was highly confident AND wrong.
In prior auth workflows, these become automated adverse determinations without human review.
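A minimal sketch of how the two calibration figures in this section can be computed, assuming binary labels (1 = denied) and predicted denial probabilities. The function names and toy data are illustrative and do not come from the harness; "confidence" is taken here to mean the probability assigned to the predicted class, which is an assumption about the report's definition.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def high_confidence_errors(probs, labels, threshold=0.85):
    """Count wrong predictions whose confidence in the predicted class exceeds threshold."""
    count = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= 0.5 else 0
        confidence = p if predicted == 1 else 1.0 - p
        if predicted != y and confidence > threshold:
            count += 1
    return count


# Toy data (made up for illustration, not taken from the report).
probs = [0.9, 0.2, 0.6, 0.95]
labels = [1, 0, 0, 0]
print(brier_score(probs, labels))
print(high_confidence_errors(probs, labels))

# Share quoted in the report: 8 high-confidence errors out of 16,680 test records.
print(f"{8 / 16680:.2%}")
```

For reference, a model that always predicts 0.5 scores a Brier of exactly 0.25 whatever the outcomes are, so the reported 0.236 is only slightly better than that uninformative baseline.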
Section 4–5 · Fairness Audit & Risk Findings
Data Quality (Validation Error Rate)
🟢 LOW
Data validation error rate of 0.00% — excellent. Pydantic schema enforcement is effective on this dataset.
Recommended Action: Maintain current data contract. Re-validate on any schema change.
NIST AI RMF: Map
› MP-2.3 — Data provenance, quality, and lineage
| Total records evaluated: 66,718
Calibration (Brier Score + High-Conf Errors)
🔴 CRITICAL
Severely miscalibrated model (Brier=0.236, only marginally better than the 0.25 scored by a constant 0.5 prediction). Confidence scores are unreliable. Autonomous use in coverage decisions is unjustifiable under CMS-0057-F explainability requirements.
Recommended Action: Do not use model confidence for routing logic. Full recalibration required.
NIST AI RMF: Measure
› MS-2.3 — AI output reliability and uncertainty quantification
| High-confidence errors (>85% conf, wrong): 8/16680
DPD (race)
🔴 CRITICAL
Severe denial rate disparity (DPD=0.286) across race cohorts (p<0.0001). The ratio of the lowest to the highest cohort denial rate is about 0.59 (0.414 / 0.701), outside the NAIC Adverse Impact Ratio bounds (0.80–1.25). A payer using this model in prior auth or claims adjudication faces litigation exposure comparable to the UnitedHealth nH Predict and Cigna PxDx class-action pattern.
Recommended Action: Do NOT deploy in production. Full model audit required. Independent third-party re-validation recommended before any live use.
NIST AI RMF: Measure
› MS-2.5 — Bias and fairness testing across demographic cohorts
| chi-square p-value=0.0000, significant=True
DPD (gender)
🟡 MEDIUM
Meaningful denial rate disparity (DPD=0.054) across gender cohorts (p<0.0001). Under NIST AI RMF MS-2.5, this constitutes a measurable fairness gap. In a prior auth workflow, this pattern would attract NAIC unfair discrimination scrutiny and requires documented mitigation.
Recommended Action: Investigate root cause in training data. Rebalance cohort representation or apply post-processing fairness constraint. Document remediation steps.
NIST AI RMF: Measure
› MS-2.5 — Bias and fairness testing across demographic cohorts
| chi-square p-value=0.0000, significant=True
DPD (age_band)
🔴 CRITICAL
Severe denial rate disparity (DPD=0.446) across age_band cohorts (p<0.0001). The ratio of the lowest to the highest cohort denial rate is about 0.48 (0.412 / 0.858), outside the NAIC Adverse Impact Ratio bounds (0.80–1.25). A payer using this model in prior auth or claims adjudication faces litigation exposure comparable to the UnitedHealth nH Predict and Cigna PxDx class-action pattern.
Recommended Action: Do NOT deploy in production. Full model audit required. Independent third-party re-validation recommended before any live use.
NIST AI RMF: Measure
› MS-2.5 — Bias and fairness testing across demographic cohorts
| chi-square p-value=0.0000, significant=True
DPD (state_code)
🔴 CRITICAL
Severe denial rate disparity (DPD=0.550) across state_code cohorts (p<0.0001). The ratio of the lowest to the highest cohort denial rate is about 0.10 (0.061 / 0.611), outside the NAIC Adverse Impact Ratio bounds (0.80–1.25). Both extremes are small cohorts: ND (0.061, n=33) and DC (0.611, n=36). A payer using this model in prior auth or claims adjudication faces litigation exposure comparable to the UnitedHealth nH Predict and Cigna PxDx class-action pattern.
Recommended Action: Do NOT deploy in production. Full model audit required. Independent third-party re-validation recommended before any live use.
NIST AI RMF: Measure
› MS-2.5 — Bias and fairness testing across demographic cohorts
| chi-square p-value=0.0000, significant=True
Robustness — ICD9 primary diagnosis code corrupted (wrong/invalid code injected)
🟡 MEDIUM
Moderate F1 degradation (7.7%) under 'ICD9 primary diagnosis code corrupted (wrong/invalid code injected)' at 20% corruption. Crosses warning threshold (>5%). In real payer workflows, data quality issues at this rate are common (incomplete PA submissions, missing documentation). Performance will degrade in production without pipeline quality controls.
Recommended Action: Implement upstream data validation. Add fallback to human review when key fields are missing.
NIST AI RMF: Measure
› MS-2.6 — Robustness and resilience under input perturbation
| Baseline F1=0.379 → Degraded F1=0.349 (decay=7.7%) | Warning >5% | Danger >20%
Robustness — Prior auth required fields nulled (diagnosis + utilization days)
🟢 LOW
Model shows negligible performance degradation (1.6% F1 decay) under 'Prior auth required fields nulled (diagnosis + utilization days)' at 20% corruption rate. Robust to this failure mode under NIST AI RMF MS-2.6.
Recommended Action: No action required. Document robustness test result for governance trail.
NIST AI RMF: Measure
› MS-2.6 — Robustness and resilience under input perturbation
| Baseline F1=0.379 → Degraded F1=0.373 (decay=1.6%) | Warning >5% | Danger >20%
Robustness — Age band shifted up one tier (member enrollment data lag)
🟢 LOW
Model shows negligible performance degradation (0.0% F1 decay) under 'Age band shifted up one tier (member enrollment data lag)' at 10% corruption rate. Robust to this failure mode under NIST AI RMF MS-2.6.
Recommended Action: No action required. Document robustness test result for governance trail.
NIST AI RMF: Measure
› MS-2.6 — Robustness and resilience under input perturbation
| Baseline F1=0.379 → Degraded F1=0.379 (decay=0.0%) | Warning >5% | Danger >20%
Robustness — Claim amounts 10x-inflated on 5% of records (high-cost outliers)
🟢 LOW
Model shows negligible performance degradation (1.4% F1 decay) under 'Claim amounts 10x-inflated on 5% of records (high-cost outliers)' at 5% corruption rate. Robust to this failure mode under NIST AI RMF MS-2.6.
Recommended Action: No action required. Document robustness test result for governance trail.
NIST AI RMF: Measure
› MS-2.6 — Robustness and resilience under input perturbation
| Baseline F1=0.379 → Degraded F1=0.374 (decay=1.4%) | Warning >5% | Danger >20%
Robustness — Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift
🟡 MEDIUM
Moderate F1 degradation (6.5%) under 'Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift' at 15% corruption. Crosses warning threshold (>5%). In real payer workflows, data quality issues at this rate are common (incomplete PA submissions, missing documentation). Performance will degrade in production without pipeline quality controls.
Recommended Action: Implement upstream data validation. Add fallback to human review when key fields are missing.
NIST AI RMF: Measure
› MS-2.6 — Robustness and resilience under input perturbation
| Baseline F1=0.379 → Degraded F1=0.354 (decay=6.5%) | Warning >5% | Danger >20%
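The robustness scenarios above corrupt a fixed share of test records and re-score the model. The sketch below shows one way such a perturbation could be applied, using the "required fields nulled" scenario as the example. The function `null_fields`, the field names `diagnosis` and `utilization_days`, and the fixed seed are assumptions for illustration; the harness's actual perturbation code is not shown in this report.

```python
import random


def null_fields(records, fields, rate, seed=0):
    """Return a copy of records with `fields` set to None on a `rate` share of rows."""
    rng = random.Random(seed)  # seeded so the corrupted set is reproducible
    n_corrupt = round(rate * len(records))
    targets = set(rng.sample(range(len(records)), n_corrupt))
    corrupted = []
    for i, rec in enumerate(records):
        rec = dict(rec)  # copy, so the clean baseline set is left untouched
        if i in targets:
            for field in fields:
                rec[field] = None
        corrupted.append(rec)
    return corrupted


# Toy records (made up for illustration).
clean = [{"diagnosis": "4280", "utilization_days": d} for d in range(100)]
dirty = null_fields(clean, ["diagnosis", "utilization_days"], rate=0.20)
print(sum(r["diagnosis"] is None for r in dirty))
```

Scoring the model once on `clean` and once on `dirty` gives the baseline and degraded F1 values reported for each scenario.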
Fairness — Race
🔴 CRITICAL
DPD=0.2861 | EOD=0.2202 | χ² p<0.0001 | ✅ Significant
| Cohort | Count | Denial Rate | F1 | Note |
| White | 14,065 | 0.423 | 0.369 | - |
| Black | 1,797 | 0.701 | 0.431 | - |
| Hispanic | 321 | 0.611 | 0.339 | - |
| Other | 497 | 0.414 | 0.408 | - |
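The reported DPD values are consistent with taking the gap between the highest and lowest cohort denial rates. The sketch below reproduces that from the race table and also computes the lowest-to-highest rate ratio. The function names are illustrative, and defining the adverse impact ratio as min/max of the denial rates is an assumption; the report does not spell out its formula.

```python
# Denial rates copied from the race cohort table (rounded to 3 decimals there).
race_denial_rates = {"White": 0.423, "Black": 0.701, "Hispanic": 0.611, "Other": 0.414}


def demographic_parity_difference(rates):
    """Gap between the highest and lowest cohort denial rate."""
    return max(rates.values()) - min(rates.values())


def adverse_impact_ratio(rates):
    """Lowest cohort denial rate divided by the highest (assumed definition)."""
    return min(rates.values()) / max(rates.values())


dpd = demographic_parity_difference(race_denial_rates)
air = adverse_impact_ratio(race_denial_rates)
print(round(dpd, 3), round(air, 3))
```

Because the table rates are rounded, the recomputed gap (0.287) differs slightly from the reported DPD of 0.2861.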
Fairness — Gender
🟡 MEDIUM
DPD=0.0539 | EOD=0.0546 | χ² p<0.0001 | ✅ Significant
| Cohort | Count | Denial Rate | F1 | Note |
| Female | 9,425 | 0.480 | 0.382 | - |
| Male | 7,255 | 0.426 | 0.375 | - |
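As a rough cross-check of the significance flag, the sketch below runs a Pearson chi-square test on the gender table using only the standard library. The report states that it uses scipy; this closed form is a substitute that works only for a 2x2 table (1 degree of freedom). The denied counts are reconstructed from the rounded denial rates, so they are approximations, and no continuity correction is applied (scipy's `chi2_contingency` applies one by default for 2x2 tables), so the statistic will not match the harness exactly.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Approximate counts: cohort size times the rounded denial rate from the table.
female_total, male_total = 9425, 7255
female_denied = round(female_total * 0.480)
male_denied = round(male_total * 0.426)

chi2, p_value = chi_square_2x2(
    female_denied, female_total - female_denied,
    male_denied, male_total - male_denied,
)
print(chi2, p_value)
```

With cohorts this large, even a 5-point gap in denial rate yields a p-value far below 0.05, which is why the methodology also requires a minimum DPD before flagging a finding.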
Fairness — Age Band
🔴 CRITICAL
DPD=0.4459 | EOD=0.3814 | χ² p<0.0001 | ✅ Significant
| Cohort | Count | Denial Rate | F1 | Note |
| 65+ | 13,898 | 0.412 | 0.357 | - |
| 50-64 | 1,773 | 0.595 | 0.418 | - |
| 35-49 | 776 | 0.812 | 0.490 | - |
| 18-34 | 233 | 0.858 | 0.493 | - |
Fairness — State Code
🔴 CRITICAL
DPD=0.5505 | EOD=0.8182 | χ² p<0.0001 | ✅ Significant
| Cohort | Count | Denial Rate | F1 | Note |
| AR | 209 | 0.445 | 0.308 | - |
| CA | 1,277 | 0.435 | 0.389 | - |
| WV | 431 | 0.390 | 0.320 | - |
| NJ | 559 | 0.599 | 0.383 | - |
| KS | 202 | 0.386 | 0.351 | - |
| FL | 1,176 | 0.495 | 0.356 | - |
| OH | 718 | 0.421 | 0.364 | - |
| AL | 335 | 0.460 | 0.364 | - |
| UT | 1,193 | 0.455 | 0.366 | - |
| IN | 415 | 0.439 | 0.373 | - |
| NC | 534 | 0.511 | 0.399 | - |
| NY | 1,044 | 0.485 | 0.367 | - |
| TX | 364 | 0.527 | 0.413 | - |
| MO | 427 | 0.473 | 0.377 | - |
| ME | 75 | 0.240 | 0.167 | - |
| WI | 263 | 0.365 | 0.371 | - |
| IL | 756 | 0.468 | 0.405 | - |
| MS | 251 | 0.514 | 0.373 | - |
| SD | 304 | 0.493 | 0.381 | - |
| OR | 131 | 0.298 | 0.355 | - |
| PR | 346 | 0.312 | 0.429 | - |
| ID | 73 | 0.233 | 0.167 | - |
| WY | 152 | 0.382 | 0.268 | - |
| OK | 258 | 0.357 | 0.317 | - |
| LA | 293 | 0.495 | 0.415 | - |
| MI | 601 | 0.464 | 0.347 | - |
| GA | 494 | 0.549 | 0.395 | - |
| KY | 332 | 0.503 | 0.496 | - |
| MA | 382 | 0.503 | 0.443 | - |
| VT | 67 | 0.388 | 0.439 | - |
| AZ | 264 | 0.409 | 0.364 | - |
| ND | 33 | 0.061 | 0.000 | - |
| MD | 327 | 0.578 | 0.424 | - |
| Unknown | 144 | 0.500 | 0.255 | - |
| NV | 99 | 0.545 | 0.447 | - |
| PA | 682 | 0.520 | 0.415 | - |
| CT | 196 | 0.378 | 0.262 | - |
| MN | 258 | 0.314 | 0.351 | - |
| CO | 184 | 0.451 | 0.409 | - |
| IA | 196 | 0.291 | 0.409 | - |
| TN | 51 | 0.157 | 0.300 | - |
| DE | 59 | 0.542 | 0.273 | - |
| AK | 20 | 0.350 | 0.500 | - |
| NE | 94 | 0.553 | 0.343 | - |
| NH | 79 | 0.570 | 0.394 | - |
| NM | 89 | 0.213 | 0.129 | - |
| HI | 49 | 0.245 | 0.480 | - |
| SC | 62 | 0.258 | 0.323 | - |
| DC | 36 | 0.611 | 0.545 | - |
| MT | 59 | 0.102 | 0.429 | - |
| VA | 37 | 0.432 | 0.500 | - |
Section 6 · Robustness Stress Test — Clinical Failure Scenarios
Baseline F1: 0.3787 · 🟢 Safe <5% | 🟡 Warning 5–10% | 🟠 High 10–20% | 🔴 Danger >20% F1 decay
| Scenario | Rate | Baseline F1 | Degraded F1 | Decay % | Threshold | Risk |
| ICD9 primary diagnosis code corrupted (wrong/invalid code injected) | 20% | 0.379 | 0.349 | 7.7% | 🟡 Warning | 🟡 MEDIUM |
| Prior auth required fields nulled (diagnosis + utilization days) | 20% | 0.379 | 0.373 | 1.6% | ✅ OK | 🟢 LOW |
| Age band shifted up one tier (member enrollment data lag) | 10% | 0.379 | 0.379 | 0.0% | ✅ OK | 🟢 LOW |
| Claim amounts 10x-inflated on 5% of records (high-cost outliers) | 5% | 0.379 | 0.374 | 1.4% | ✅ OK | 🟢 LOW |
| Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift | 15% | 0.379 | 0.354 | 6.5% | 🟡 Warning | 🟡 MEDIUM |
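The decay percentage and tier for each scenario follow from the baseline F1 and the legend at the top of this section. A minimal sketch, assuming decay is the relative drop in F1 and that a value exactly on a boundary falls into the higher tier (the legend does not say which side owns 10% or 20%):

```python
def f1_decay(baseline, degraded):
    """Relative F1 drop, as a fraction of the baseline."""
    return (baseline - degraded) / baseline


def risk_tier(decay):
    """Map a decay fraction to the legend's tiers (boundary handling is assumed)."""
    if decay < 0.05:
        return "Safe"
    if decay < 0.10:
        return "Warning"
    if decay <= 0.20:
        return "High"
    return "Danger"


BASELINE_F1 = 0.3787  # baseline stated in this section

# Degraded F1 values as printed in the table (rounded to 3 decimals).
for name, degraded in [("ICD9 corrupted", 0.349), ("Fields nulled", 0.373), ("Combined", 0.354)]:
    decay = f1_decay(BASELINE_F1, degraded)
    print(f"{name}: {decay:.1%} -> {risk_tier(decay)}")
```

Since the degraded F1 values are rounded, the recomputed percentages can differ from the reported ones by about a tenth of a point. Under the legend, the ICD9 and Combined scenarios land in the Warning tier, which matches the MEDIUM status they carry in Sections 1 and 4–5.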
Section 7 · Top 5 High-Confidence Failure Cases
The five predictions where the model was most confident AND wrong.
In a prior auth automation workflow these become unreviewed adverse determinations.
Each narrative includes patient profile, error type, confidence level,
and governance implication.
Rank 1 — False Denial (model confidence: 92.7%)
Patient: Male, age 68, race: Black, state: NC
Chronic conditions: Diabetes, Congestive Heart Failure, COPD | Utilization: 97 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: A false denial at 92.7% confidence bypasses human review in an automated prior auth workflow, becoming an unreviewed adverse determination. This patient belongs to the Black cohort, which has the highest denial rate in the fairness audit (0.701).
Rank 2 — False Denial (model confidence: 91.1%)
Patient: Male, age 72, race: Black, state: GA
Chronic conditions: Diabetes, Congestive Heart Failure | Utilization: 79 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: A false denial at 91.1% confidence bypasses human review in an automated prior auth workflow, becoming an unreviewed adverse determination. This patient belongs to the Black cohort, which has the highest denial rate in the fairness audit (0.701).
Rank 3 — False Denial (model confidence: 90.7%)
Patient: Female, age 79, race: White, state: IL
Chronic conditions: Diabetes, Congestive Heart Failure, COPD | Utilization: 90 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: A false denial at 90.7% confidence bypasses human review in an automated prior auth workflow, becoming an unreviewed adverse determination. The fairness audit does not show elevated denial rates for this profile (White 0.423, 65+ 0.412), so this error is not explained by the cohort-level disparities.
Rank 4 — False Denial (model confidence: 87.4%)
Patient: Male, age 88, race: White, state: IN
Chronic conditions: Diabetes, Congestive Heart Failure, COPD, Cancer | Utilization: 66 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: A false denial at 87.4% confidence bypasses human review in an automated prior auth workflow, becoming an unreviewed adverse determination. The fairness audit does not show elevated denial rates for this profile (White 0.423, 65+ 0.412), so this error is not explained by the cohort-level disparities.
Rank 5 — False Denial (model confidence: 87.4%)
Patient: Female, age 46, race: Black, state: NC
Chronic conditions: Diabetes, Congestive Heart Failure, COPD | Utilization: 36 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: A false denial at 87.4% confidence bypasses human review in an automated prior auth workflow, becoming an unreviewed adverse determination. This patient falls in two cohorts with elevated denial rates in the fairness audit: Black (0.701) and age band 35-49 (0.812).
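The ranking in this section can be sketched as a filter-and-sort over scored predictions: keep only the wrong ones, then order by confidence. The record layout and the function name `top_confident_errors` are assumptions for illustration, and the toy cases below are not the report's patients.

```python
def top_confident_errors(cases, k=5):
    """Return the k wrong predictions with the highest model confidence."""
    wrong = [c for c in cases if c["predicted"] != c["actual"]]
    return sorted(wrong, key=lambda c: c["confidence"], reverse=True)[:k]


# Toy cases (made up for illustration).
cases = [
    {"id": "a", "confidence": 0.93, "predicted": "DENIED", "actual": "APPROVED"},
    {"id": "b", "confidence": 0.99, "predicted": "DENIED", "actual": "DENIED"},
    {"id": "c", "confidence": 0.87, "predicted": "DENIED", "actual": "APPROVED"},
    {"id": "d", "confidence": 0.91, "predicted": "APPROVED", "actual": "DENIED"},
]
for case in top_confident_errors(cases, k=2):
    print(case["id"], case["confidence"])
```

Case `b` is excluded despite having the highest confidence because its prediction was correct.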
Section 8 · Methodology & Robustness Threshold Legend
- Dataset: CMS DE-SynPUF Inpatient Claims + Beneficiary Summary,
joined on DESYNPUF_ID. Engineered denial target: logit model over utilization,
chronic conditions, age, demographic weights + noise (σ=1.2) → ~20% base rate.
- Target rationale: CLM_PMT_AMT==0 proxy removed — it caused
target leakage (AUC=1.0). Engineered target reflects realistic prior auth
denial patterns per CMS/NAIC literature.
- Model: logistic regression with class_weight="balanced".
claim_amount is excluded from the features because it was leaking into the target.
- Fairness tests: scipy chi-square on each sensitive feature.
Findings are flagged only if p < 0.05 AND |DPD| ≥ 0.05. Severity is at least HIGH if |DPD| ≥ 0.10.
- Robustness thresholds: 🟢 Safe <5% F1 decay | 🟡 Warning 5–10% | 🟠 High 10–20% | 🔴 Danger >20% F1 decay.
- NIST AI RMF: Voluntary framework (not Radinate-proprietary).
Mappings based on NIST AI RMF v1.0 published guidance.
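The fairness flagging rule in the methodology list above can be sketched as a small decision function. Only the two stated cut-offs (p < 0.05, and |DPD| of 0.05 and 0.10) come from the report. The MEDIUM label for the 0.05 to 0.10 band is inferred from the gender finding, and the cut-off that separates HIGH from CRITICAL is not stated, so the sketch returns "HIGH or above" rather than guessing it.

```python
def fairness_flag(p_value, dpd, alpha=0.05, flag_min=0.05, high_min=0.10):
    """Return a severity label, or None when the finding is not flagged."""
    if p_value >= alpha or abs(dpd) < flag_min:
        return None  # not significant, or disparity too small to flag
    if abs(dpd) >= high_min:
        return "HIGH or above"  # CRITICAL cut-off is not given in the report
    return "MEDIUM"  # inferred label for the 0.05 to 0.10 band


# DPD values from the fairness audit; the p-values are placeholders below 0.0001.
print(fairness_flag(1e-6, 0.0539))  # gender
print(fairness_flag(1e-6, 0.2861))  # race
```

Requiring both conditions keeps a statistically significant but tiny gap from being flagged, which matters with test sets of this size.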
PayorLens AI Governance Evaluation Harness · Architecture v2.0 · Generated 2026-04-20 17:36 ·
NIST AI RMF aligned · Dataset: CMS DE-SynPUF (public, zero PHI) ·
This report is an independent evaluation artefact. It is not a state compliance document.