Section 0 · Executive Risk Brief

Overall Governance Risk Assessment

Risk Level: 🔴 RED

Category	Count	Action
Total Findings	11	All risk levels
Critical	4	Require immediate action
High	0	Require remediation
Medium / Low	3 / 4	Monitor

Recommendation: DO NOT DEPLOY. All 4 critical findings must be remediated before this model is used in any coverage decision workflow.

Top Findings:

Finding 1: Severely miscalibrated model (Brier=0.236). Confidence scores are unreliable. Autonomous use in coverage decisions is unjustifiable under CMS-0057-F explainability requirements.…

Finding 2: Severe denial rate disparity (DPD=0.286) across race cohorts (p=0.0000). This surpasses the NAIC Adverse Impact Ratio threshold implication (0.80–1.25 bounds). A payer using this model in prior auth o…

Finding 3: Severe denial rate disparity (DPD=0.446) across age_band cohorts (p=0.0000). This surpasses the NAIC Adverse Impact Ratio threshold implication (0.80–1.25 bounds). A payer using this model in prior au…

Section 1 · NIST AI RMF Compliance Mapping

Every evaluated metric is mapped to its corresponding NIST AI RMF function and subcategory. This table is the regulatory spine of the report.

Metric	NIST Function	Subcategory	Status
Data Quality (Validation Error Rate)	Map	MP-2.3	🟢 LOW
Calibration (Brier Score + High-Conf Errors)	Measure	MS-2.3	🔴 CRITICAL
DPD (race)	Measure	MS-2.5	🔴 CRITICAL
DPD (gender)	Measure	MS-2.5	🟡 MEDIUM
DPD (age_band)	Measure	MS-2.5	🔴 CRITICAL
DPD (state_code)	Measure	MS-2.5	🔴 CRITICAL
Robustness — ICD9 primary diagnosis code corrupted (wrong/invalid code injected)	Measure	MS-2.6	🟡 MEDIUM
Robustness — Prior auth required fields nulled (diagnosis + utilization days)	Measure	MS-2.6	🟢 LOW
Robustness — Age band shifted up one tier (member enrollment data lag)	Measure	MS-2.6	🟢 LOW
Robustness — Claim amounts 10x-inflated on 5% of records (high-cost outliers)	Measure	MS-2.6	🟢 LOW
Robustness — Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift	Measure	MS-2.6	🟡 MEDIUM

Section 2 · Data Quality Findings

Metric	Value	Note
Total Records	66,718	After merge & normalisation
Validation Errors	0	Pydantic schema failures
Error Rate	0.00%	Schema compliance: 100.00%

Data contract enforced via Pydantic v2 schema validation. Each record validated against ClaimRecord model with field-level type coercion and range checks. Note on CMS DE-SynPUF: SP_STATE_CODE and CLM_ID are integer-typed in raw CMS files — coerced to str in normalisation layer before schema validation.
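The normalisation step described above can be sketched with the standard library alone. This is a minimal illustration, not the report's Pydantic `ClaimRecord` model: the helper names `normalise` and `validation_error_rate` are hypothetical, and only the two fields named in the note (`CLM_ID`, `SP_STATE_CODE`) are checked.

```python
ID_FIELDS = ("CLM_ID", "SP_STATE_CODE")  # integer-typed in the raw CMS files


def normalise(raw: dict) -> dict:
    """Coerce integer-typed identifiers to str before schema validation."""
    out = dict(raw)
    for key in ID_FIELDS:
        if out.get(key) is not None:
            out[key] = str(out[key])
    return out


def validation_error_rate(rows: list) -> float:
    """Share of rows whose identifier fields are still not strings after normalisation."""
    if not rows:
        return 0.0
    errors = 0
    for row in rows:
        clean = normalise(row)
        if not all(isinstance(clean.get(key), str) for key in ID_FIELDS):
            errors += 1
    return errors / len(rows)


rows = [
    {"CLM_ID": 196661176988405, "SP_STATE_CODE": 26},  # illustrative values only
    {"CLM_ID": 196201177000368, "SP_STATE_CODE": 39},
]
print(validation_error_rate(rows))
```

Coercing before validation keeps the schema itself strict: a record that still carries a non-string identifier after this step is counted as a genuine contract violation rather than silently converted.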

Section 3 · Model Performance Overview

Metric	Value	Measures
Accuracy	0.586	Overall correctness
F1 Score	0.379	Precision-Recall balance
ROC-AUC	0.632	Discrimination ability
Brier Score	0.236	Calibration quality

High-confidence errors (>85% confidence, wrong prediction): 8 of 16,680 test records (0.05%). These are the 'dangerous prediction' events: the model was highly confident AND wrong. In an automated prior auth workflow, a high-confidence false denial becomes an adverse determination without human review.
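As an illustration of how the two calibration figures above are typically computed, here is a minimal stdlib sketch. It assumes the model outputs a probability of denial per record, that confidence is max(p, 1 - p), and that the decision cut-off is 0.5; none of these details are stated in the report, and the sample values are made up.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def high_confidence_errors(probs, labels, threshold=0.85):
    """Count wrong predictions whose confidence exceeds the threshold."""
    count = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= 0.5 else 0   # assumed decision cut-off
        confidence = max(p, 1.0 - p)       # confidence in the predicted class
        if predicted != y and confidence > threshold:
            count += 1
    return count


# Toy data: probability of denial, and actual outcome (1 = denied).
probs = [0.93, 0.10, 0.60, 0.88]
labels = [0, 0, 1, 1]

print(brier_score(probs, labels))
print(high_confidence_errors(probs, labels))
```

For reference, a model that always outputs 0.5 scores exactly 0.25 on this metric, so the reported 0.236 is only marginally better than an uninformative predictor.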

Section 4–5 · Fairness Audit & Risk Findings

Data Quality (Validation Error Rate) 🟢 LOW

Data validation error rate of 0.00% — excellent. Pydantic schema enforcement is effective on this dataset.

Recommended Action: Maintain current data contract. Re-validate on any schema change.

NIST AI RMF: Map › MP-2.3 — Data provenance, quality, and lineage | Total records evaluated: 66,718

Calibration (Brier Score + High-Conf Errors) 🔴 CRITICAL

Severely miscalibrated model (Brier=0.236). Confidence scores are unreliable. Autonomous use in coverage decisions is unjustifiable under CMS-0057-F explainability requirements.

Recommended Action: Do not use model confidence for routing logic. Full recalibration required.

NIST AI RMF: Measure › MS-2.3 — AI output reliability and uncertainty quantification | High-confidence errors (>85% conf, wrong): 8/16680

DPD (race) 🔴 CRITICAL

Severe denial rate disparity (DPD=0.286) across race cohorts (p<0.0001). A gap of this size puts the ratio of cohort denial rates outside the 0.80–1.25 bounds of the NAIC Adverse Impact Ratio threshold. A payer using this model in prior auth or claims adjudication faces litigation exposure comparable to the UnitedHealth nH Predict and Cigna PxDx class-action pattern.

Recommended Action: Do NOT deploy in production. Full model audit required. Independent third-party re-validation recommended before any live use.

NIST AI RMF: Measure › MS-2.5 — Bias and fairness testing across demographic cohorts | chi-square p-value=0.0000, significant=True

DPD (gender) 🟡 MEDIUM

Meaningful denial rate disparity (DPD=0.054) across gender cohorts (p<0.0001). Under NIST AI RMF MS-2.5, this constitutes a measurable fairness gap. In a prior auth workflow, this pattern would attract NAIC unfair discrimination scrutiny and would require documented mitigation.

Recommended Action: Investigate root cause in training data. Rebalance cohort representation or apply post-processing fairness constraint. Document remediation steps.

NIST AI RMF: Measure › MS-2.5 — Bias and fairness testing across demographic cohorts | chi-square p-value=0.0000, significant=True

DPD (age_band) 🔴 CRITICAL

Severe denial rate disparity (DPD=0.446) across age_band cohorts (p<0.0001). A gap of this size puts the ratio of cohort denial rates outside the 0.80–1.25 bounds of the NAIC Adverse Impact Ratio threshold. A payer using this model in prior auth or claims adjudication faces litigation exposure comparable to the UnitedHealth nH Predict and Cigna PxDx class-action pattern.

Recommended Action: Do NOT deploy in production. Full model audit required. Independent third-party re-validation recommended before any live use.

NIST AI RMF: Measure › MS-2.5 — Bias and fairness testing across demographic cohorts | chi-square p-value=0.0000, significant=True

DPD (state_code) 🔴 CRITICAL

Severe denial rate disparity (DPD=0.550) across state_code cohorts (p<0.0001). A gap of this size puts the ratio of cohort denial rates outside the 0.80–1.25 bounds of the NAIC Adverse Impact Ratio threshold. A payer using this model in prior auth or claims adjudication faces litigation exposure comparable to the UnitedHealth nH Predict and Cigna PxDx class-action pattern.

Recommended Action: Do NOT deploy in production. Full model audit required. Independent third-party re-validation recommended before any live use.

NIST AI RMF: Measure › MS-2.5 — Bias and fairness testing across demographic cohorts | chi-square p-value=0.0000, significant=True

Robustness — ICD9 primary diagnosis code corrupted (wrong/invalid code injected) 🟡 MEDIUM

Moderate F1 degradation (7.7%) under 'ICD9 primary diagnosis code corrupted (wrong/invalid code injected)' at 20% corruption. Crosses warning threshold (>5%). In real payer workflows, data quality issues at this rate are common (incomplete PA submissions, missing documentation). Performance will degrade in production without pipeline quality controls.

Recommended Action: Implement upstream data validation. Add fallback to human review when key fields are missing.

NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation | Baseline F1=0.379 → Degraded F1=0.349 (decay=7.7%) | Warning >5% | Danger >20%

Robustness — Prior auth required fields nulled (diagnosis + utilization days) 🟢 LOW

Model shows negligible performance degradation (1.6% F1 decay) under 'Prior auth required fields nulled (diagnosis + utilization days)' at 20% corruption rate. Robust to this failure mode under NIST AI RMF MS-2.6.

Recommended Action: No action required. Document robustness test result for governance trail.

NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation | Baseline F1=0.379 → Degraded F1=0.373 (decay=1.6%) | Warning >5% | Danger >20%

Robustness — Age band shifted up one tier (member enrollment data lag) 🟢 LOW

Model shows negligible performance degradation (0.0% F1 decay) under 'Age band shifted up one tier (member enrollment data lag)' at 10% corruption rate. Robust to this failure mode under NIST AI RMF MS-2.6.

Recommended Action: No action required. Document robustness test result for governance trail.

NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation | Baseline F1=0.379 → Degraded F1=0.379 (decay=0.0%) | Warning >5% | Danger >20%

Robustness — Claim amounts 10x-inflated on 5% of records (high-cost outliers) 🟢 LOW

Model shows negligible performance degradation (1.4% F1 decay) under 'Claim amounts 10x-inflated on 5% of records (high-cost outliers)' at 5% corruption rate. Robust to this failure mode under NIST AI RMF MS-2.6.

Recommended Action: No action required. Document robustness test result for governance trail.

NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation | Baseline F1=0.379 → Degraded F1=0.374 (decay=1.4%) | Warning >5% | Danger >20%

Robustness — Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift 🟡 MEDIUM

Moderate F1 degradation (6.5%) under 'Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift' at 15% corruption. Crosses warning threshold (>5%). In real payer workflows, data quality issues at this rate are common (incomplete PA submissions, missing documentation). Performance will degrade in production without pipeline quality controls.

Recommended Action: Implement upstream data validation. Add fallback to human review when key fields are missing.

NIST AI RMF: Measure › MS-2.6 — Robustness and resilience under input perturbation | Baseline F1=0.379 → Degraded F1=0.354 (decay=6.5%) | Warning >5% | Danger >20%

Fairness — Race

🔴 CRITICAL

DPD=0.2861 | EOD=0.2202 | χ² p=0.0000 | ✅ Significant

Cohort	Count	Denial Rate	F1	Note
White	14,065	0.423	0.369	—
Black	1,797	0.701	0.431	—
Hispanic	321	0.611	0.339	—
Other	497	0.414	0.408	—
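The race figures can be cross-checked from the cohort table. The sketch below assumes DPD is the gap between the highest and lowest cohort denial rate; the report does not spell out the formula, but this definition reproduces the reported 0.2861 to within rounding of the tabulated rates. The ratio helper is one possible way to compare against the 0.80–1.25 bounds and is likewise an assumption.

```python
# Denial rates copied from the race cohort table above (rounded to 3 decimals).
race_denial_rates = {"White": 0.423, "Black": 0.701, "Hispanic": 0.611, "Other": 0.414}


def demographic_parity_difference(rates):
    """Gap between the highest and lowest cohort denial rate (assumed definition)."""
    return max(rates.values()) - min(rates.values())


def denial_rate_ratio(rates):
    """Lowest cohort denial rate divided by the highest (assumed convention)."""
    return min(rates.values()) / max(rates.values())


dpd = demographic_parity_difference(race_denial_rates)
ratio = denial_rate_ratio(race_denial_rates)
print(round(dpd, 3), round(ratio, 3))
```

A ratio near 0.59 sits well below the 0.80 lower bound, which is consistent with the CRITICAL rating for this cohort.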

Fairness — Gender

🟡 MEDIUM

DPD=0.0539 | EOD=0.0546 | χ² p=0.0000 | ✅ Significant

Cohort	Count	Denial Rate	F1	Note
Female	9,425	0.480	0.382	—
Male	7,255	0.426	0.375	—
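A p-value printed as 0.0000 means the value is below the display precision, not that it is zero. The following sketch illustrates this with a Pearson chi-square test on the gender cohorts, using only the standard library. It is a stand-in for the scipy test named in the methodology, not a reproduction of it: the denied counts are reconstructed from the rounded rates in the table (an approximation), and no continuity correction is applied, whereas `scipy.stats.chi2_contingency` applies Yates' correction to 2x2 tables by default.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]; returns (statistic, p_value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


female_n, male_n = 9425, 7255
female_denied = round(female_n * 0.480)  # reconstructed from the rounded denial rate
male_denied = round(male_n * 0.426)      # reconstructed from the rounded denial rate

stat, p = chi_square_2x2(
    female_denied, female_n - female_denied,
    male_denied, male_n - male_denied,
)
print(f"chi2={stat:.1f}, p={p:.2e}")
```

With cohorts this large, even the 5.4-point gap between Female and Male yields a very small p-value, which is why the effect-size threshold on DPD matters as much as significance.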

Fairness — Age Band

🔴 CRITICAL

DPD=0.4459 | EOD=0.3814 | χ² p=0.0000 | ✅ Significant

Cohort	Count	Denial Rate	F1	Note
65+	13,898	0.412	0.357	—
50-64	1,773	0.595	0.418	—
35-49	776	0.812	0.490	—
18-34	233	0.858	0.493	—

Fairness — State Code

🔴 CRITICAL

DPD=0.5505 | EOD=0.8182 | χ² p=0.0000 | ✅ Significant

Cohort	Count	Denial Rate	F1	Note
AR	209	0.445	0.308	—
CA	1,277	0.435	0.389	—
WV	431	0.390	0.320	—
NJ	559	0.599	0.383	—
KS	202	0.386	0.351	—
FL	1,176	0.495	0.356	—
OH	718	0.421	0.364	—
AL	335	0.460	0.364	—
UT	1,193	0.455	0.366	—
IN	415	0.439	0.373	—
NC	534	0.511	0.399	—
NY	1,044	0.485	0.367	—
TX	364	0.527	0.413	—
MO	427	0.473	0.377	—
ME	75	0.240	0.167	—
WI	263	0.365	0.371	—
IL	756	0.468	0.405	—
MS	251	0.514	0.373	—
SD	304	0.493	0.381	—
OR	131	0.298	0.355	—
PR	346	0.312	0.429	—
ID	73	0.233	0.167	—
WY	152	0.382	0.268	—
OK	258	0.357	0.317	—
LA	293	0.495	0.415	—
MI	601	0.464	0.347	—
GA	494	0.549	0.395	—
KY	332	0.503	0.496	—
MA	382	0.503	0.443	—
VT	67	0.388	0.439	—
AZ	264	0.409	0.364	—
ND	33	0.061	0.000	—
MD	327	0.578	0.424	—
Unknown	144	0.500	0.255	—
NV	99	0.545	0.447	—
PA	682	0.520	0.415	—
CT	196	0.378	0.262	—
MN	258	0.314	0.351	—
CO	184	0.451	0.409	—
IA	196	0.291	0.409	—
TN	51	0.157	0.300	—
DE	59	0.542	0.273	—
AK	20	0.350	0.500	—
NE	94	0.553	0.343	—
NH	79	0.570	0.394	—
NM	89	0.213	0.129	—
HI	49	0.245	0.480	—
SC	62	0.258	0.323	—
DC	36	0.611	0.545	—
MT	59	0.102	0.429	—
VA	37	0.432	0.500	—

Section 6 · Robustness Stress Test — Clinical Failure Scenarios

Baseline F1: 0.3787 · 🟢 Safe <5% | 🟡 Warning 5–10% | 🟠 High 10–20% | 🔴 Danger >20% F1 decay

Scenario	Rate	Baseline F1	Degraded F1	Decay %	Threshold	Risk
ICD9 primary diagnosis code corrupted (wrong/invalid code injected)	20%	0.379	0.349	7.7%	⚠️ Warning	🟡 MEDIUM
Prior auth required fields nulled (diagnosis + utilization days)	20%	0.379	0.373	1.6%	✅ OK	🟢 LOW
Age band shifted up one tier (member enrollment data lag)	10%	0.379	0.379	0.0%	✅ OK	🟢 LOW
Claim amounts 10x-inflated on 5% of records (high-cost outliers)	5%	0.379	0.374	1.4%	✅ OK	🟢 LOW
Combined: 15% missing diagnosis + 15% amount corrupted + 5% age shift	15%	0.379	0.354	6.5%	⚠️ Warning	🟡 MEDIUM
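The decay percentage and threshold bands in this section can be expressed as a short sketch. It assumes decay is the relative drop from baseline F1, and that the band boundaries at exactly 10% and 20% belong to the lower band; the legend does not say which side they fall on. Function names are illustrative.

```python
def f1_decay_pct(baseline_f1, degraded_f1):
    """Relative F1 drop from baseline, in percent."""
    return (baseline_f1 - degraded_f1) / baseline_f1 * 100.0


def decay_band(decay_pct):
    """Map an F1 decay percentage to the legend's bands."""
    if decay_pct < 5.0:
        return "Safe"
    if decay_pct <= 10.0:   # boundary handling is an assumption
        return "Warning"
    if decay_pct <= 20.0:   # boundary handling is an assumption
        return "High"
    return "Danger"


# Decay percentages as reported in the table above.
for scenario, decay in [("ICD9 corrupted", 7.7), ("Fields nulled", 1.6), ("Combined", 6.5)]:
    print(scenario, decay_band(decay))
```

Because the table shows F1 rounded to three decimals, recomputing decay from the displayed values can differ from the reported percentage by a tenth of a point or so; the bands should be applied to the unrounded figures.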

Section 7 · Top 5 High-Confidence Failure Cases

The five predictions where the model was most confident AND wrong. In a prior auth automation workflow these become unreviewed adverse determinations. Each narrative includes patient profile, error type, confidence level, and governance implication.
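The selection described above amounts to filtering for wrong predictions and ranking them by confidence. A minimal sketch follows, assuming each record carries a predicted probability of denial and the actual outcome; the field names, the 0.5 cut-off, and the sample records are hypothetical.

```python
def is_wrong(record):
    """True when the predicted class (cut-off 0.5, assumed) differs from the outcome."""
    return (record["prob_denied"] >= 0.5) != record["actually_denied"]


def confidence(record):
    """Confidence in the predicted class."""
    return max(record["prob_denied"], 1.0 - record["prob_denied"])


def error_type(record):
    """Label a wrong prediction by its direction."""
    return "False Denial" if record["prob_denied"] >= 0.5 else "False Approval"


def top_confident_errors(records, k=5):
    """Return the k wrong predictions with the highest confidence."""
    wrong = [r for r in records if is_wrong(r)]
    return sorted(wrong, key=confidence, reverse=True)[:k]


records = [
    {"id": "a", "prob_denied": 0.93, "actually_denied": False},
    {"id": "b", "prob_denied": 0.08, "actually_denied": True},
    {"id": "c", "prob_denied": 0.95, "actually_denied": True},   # correct, excluded
    {"id": "d", "prob_denied": 0.60, "actually_denied": False},
]

for r in top_confident_errors(records, k=2):
    print(r["id"], error_type(r), f"{confidence(r):.1%}")
```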

Rank 1 — False Denial (model confidence: 92.7%)
Patient: Male, age 68, race: Black, state: NC
Chronic conditions: Diabetes, Congestive Heart Failure, COPD | Utilization: 97 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: In an automated prior auth workflow, a false denial at 92.7% confidence would bypass human review and become an unreviewed adverse determination. This patient's race cohort (Black) has the highest denial rate in the fairness audit (0.701, against 0.423 for White).

Rank 2 — False Denial (model confidence: 91.1%)
Patient: Male, age 72, race: Black, state: GA
Chronic conditions: Diabetes, Congestive Heart Failure | Utilization: 79 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: In an automated prior auth workflow, a false denial at 91.1% confidence would bypass human review and become an unreviewed adverse determination. This patient's race cohort (Black) has the highest denial rate in the fairness audit (0.701, against 0.423 for White).

Rank 3 — False Denial (model confidence: 90.7%)
Patient: Female, age 79, race: White, state: IL
Chronic conditions: Diabetes, Congestive Heart Failure, COPD | Utilization: 90 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: In an automated prior auth workflow, a false denial at 90.7% confidence would bypass human review and become an unreviewed adverse determination. This patient's cohorts (White, 65+) are not the high-denial cohorts in the fairness audit (denial rates 0.423 and 0.412), so the error falls outside the cohorts flagged there.

Rank 4 — False Denial (model confidence: 87.4%)
Patient: Male, age 88, race: White, state: IN
Chronic conditions: Diabetes, Congestive Heart Failure, COPD, Cancer | Utilization: 66 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: In an automated prior auth workflow, a false denial at 87.4% confidence would bypass human review and become an unreviewed adverse determination. This patient's cohorts (White, 65+) are not the high-denial cohorts in the fairness audit (denial rates 0.423 and 0.412), so the error falls outside the cohorts flagged there.

Rank 5 — False Denial (model confidence: 87.4%)
Patient: Female, age 46, race: Black, state: NC
Chronic conditions: Diabetes, Congestive Heart Failure, COPD | Utilization: 36 days
Model predicted: DENIED · Actual outcome: APPROVED
Governance implication: In an automated prior auth workflow, a false denial at 87.4% confidence would bypass human review and become an unreviewed adverse determination. This patient's cohorts (Black, 35-49) both show elevated denial rates in the fairness audit (0.701 and 0.812).

Section 8 · Methodology & Robustness Threshold Legend

Dataset: CMS DE-SynPUF Inpatient Claims + Beneficiary Summary, joined on DESYNPUF_ID. Engineered denial target: logit model over utilization, chronic conditions, age, demographic weights + noise (σ=1.2) → ~20% base rate.
Target rationale: CLM_PMT_AMT==0 proxy removed — it caused target leakage (AUC=1.0). Engineered target reflects realistic prior auth denial patterns per CMS/NAIC literature.
Model: logistic regression with class_weight="balanced". claim_amount excluded from features (was leaking into target).
Fairness tests: scipy chi-square on each sensitive feature. Findings are flagged only if p < 0.05 AND |DPD| ≥ 0.05, and are rated at least HIGH if |DPD| ≥ 0.10.
Robustness thresholds: 🟢 Safe <5% F1 decay | 🟡 Warning 5–10% | 🟠 High 10–20% | 🔴 Danger >20% F1 decay.
NIST AI RMF: Voluntary framework (not Radinate-proprietary). Mappings based on NIST AI RMF v1.0 published guidance.
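The fairness flagging rule stated in the methodology can be sketched as a small function. Two points here are assumptions rather than stated rules: the MEDIUM label for the 0.05–0.10 band is inferred from the gender finding, and the cut-off that separates HIGH from CRITICAL is not given in this report, so the sketch returns "HIGH or above" for that range.

```python
def fairness_flag(dpd, p_value, alpha=0.05, flag_at=0.05, high_at=0.10):
    """Apply the stated rule: flag only if p < alpha AND |DPD| >= flag_at."""
    if p_value >= alpha or abs(dpd) < flag_at:
        return None                 # not flagged
    if abs(dpd) >= high_at:
        return "HIGH or above"      # CRITICAL cut-off is not stated in the report
    return "MEDIUM"                 # inferred from the gender finding


# DPD values as reported; the p-values are placeholders below the printed precision.
print(fairness_flag(0.0539, 1e-5))  # gender
print(fairness_flag(0.2861, 1e-5))  # race
```

Requiring both conditions keeps large cohorts from producing findings on significance alone, and keeps small cohorts from producing findings on a noisy gap alone.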