AI governance evaluation harness for clinical decision models in insurance workflows.
Health insurers are deploying AI in prior authorization and claims adjudication. Fairness metrics are widely generated, but their regulatory and explainability implications require deeper interpretation.
PayorLens bridges the gap between raw evaluation metrics and their business consequences, producing a two-audience governance report that a compliance officer and an ML engineer can both act on.
Evaluated on CMS DE-SynPUF inpatient claims (66,718 records) · Logistic Regression baseline model
| Finding | Metric | Risk Level |
|---|---|---|
| Model calibration failure | Brier = 0.236 · 8 high-confidence wrong predictions | 🔴 CRITICAL |
| Race cohort denial disparity | DPD = 0.286 · p < 0.0001 | 🔴 CRITICAL |
| Age band denial disparity | DPD = 0.446 · p < 0.0001 | 🔴 CRITICAL |
| Geographic disparity | DPD = 0.550 across state cohorts · p < 0.0001 | 🔴 CRITICAL |
| Gender disparity | DPD = 0.054 · p < 0.0001 | 🟡 MEDIUM |
| ICD9 code corruption robustness | F1 decay 7.7% at 20% corruption | 🟡 MEDIUM |
| Multi-field degradation | F1 decay 6.5% under combined pipeline failure | 🟡 MEDIUM |
Overall governance verdict: RED. DO NOT DEPLOY without remediation.
📄 View full sample report →
PayorLens runs five evaluation modules against any binary classification model on claims data:
1. Data quality: Pydantic v2 schema validation with field-level type coercion. Produces a data-contract pass/fail before any model evaluation begins.
2. Model performance: Accuracy, F1, ROC-AUC, Brier score, confusion matrix. Flags high-confidence wrong predictions (>85% model confidence, wrong outcome), the specific failure mode cited in the Cigna PxDx class-action litigation.
3. Fairness audit: Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD) across race, gender, age band, and geography. Every disparity metric is backed by a chi-square significance test. Nothing is flagged unless it clears both effect-size and statistical-significance thresholds.
4. Robustness stress test: Five clinically meaningful failure-injection scenarios (not random noise): ICD9 code corruption, missing prior-auth fields, age-band enrollment lag, high-cost outlier claims, and combined multi-field degradation. Each scenario has a named clinical rationale from real payer data-quality failure modes.
5. Risk interpretation: Every metric feeds through risk_interpreter.py, which maps findings to NIST AI RMF functions (Govern / Map / Measure / Manage) and subcategories, assigns a risk level (LOW / MEDIUM / HIGH / CRITICAL), writes a plain-English payor interpretation, and generates a recommended action. This is what makes the report readable to a compliance officer, not just an ML engineer.
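To make the interpretation step concrete, here is a minimal, hypothetical sketch of the mapping that module 5 performs for one metric. The `Finding` fields, the thresholds, and the escalation rules are illustrative assumptions, not the project's actual values; the only constraint taken from the text is that a disparity must clear both an effect-size threshold and statistical significance before it is flagged.

```python
# Hypothetical sketch of the risk_interpreter.py pattern.
# Thresholds and field names are assumed for illustration.
from dataclasses import dataclass


@dataclass
class Finding:
    metric: str            # e.g. "demographic_parity_difference"
    value: float           # observed metric value
    p_value: float         # chi-square significance for disparity metrics
    nist_function: str     # Govern / Map / Measure / Manage
    risk_level: str        # LOW / MEDIUM / HIGH / CRITICAL
    interpretation: str    # plain-English payor narrative


def interpret_dpd(value: float, p_value: float, cohort: str) -> Finding:
    """Map a demographic parity difference to a risk level.

    A disparity escalates only when it clears BOTH an effect-size
    threshold and statistical significance. Thresholds here are
    illustrative assumptions.
    """
    significant = p_value < 0.05
    if significant and value >= 0.10:
        level = "CRITICAL"
    elif significant and value >= 0.05:
        level = "MEDIUM"
    else:
        level = "LOW"
    return Finding(
        metric="demographic_parity_difference",
        value=value,
        p_value=p_value,
        nist_function="Measure",
        risk_level=level,
        interpretation=(
            f"Denial rates differ by {value:.3f} across {cohort} cohorts "
            f"(p = {p_value:.4g}); assessed risk level: {level}."
        ),
    )


# e.g. the race-cohort finding from the table above:
finding = interpret_dpd(0.286, 1e-5, "race")
print(finding.risk_level)  # CRITICAL under the assumed thresholds
```

Under these assumed cutoffs the race-cohort DPD of 0.286 lands at CRITICAL and the gender DPD of 0.054 at MEDIUM, consistent with the findings table.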
The output is a two-audience HTML/PDF governance report:
| Section | Audience | Contents |
|---|---|---|
| 0 · Executive Risk Brief | Compliance officer | Overall risk score, top 3 findings, single recommendation |
| 1 · NIST AI RMF Map | Legal / risk officer | Every metric → NIST function → PASS/WARN/FAIL |
| 2 · Data Quality | Data engineer | Pydantic validation results, error rate |
| 3 · Model Performance | ML engineer | F1, AUC, Brier, calibration diagram |
| 4–5 · Fairness Audit | Compliance + ML | Per-cohort DPD/EOD with p-values and risk narratives |
| 6 · Robustness | ML + compliance | F1 decay per clinical scenario, danger threshold |
| 7 · Failure Narratives | Compliance officer | Top 5 high-confidence errors as plain-language vignettes |
| 8 · Methodology | Auditor | Dataset provenance, statistical test rationale |
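The failure-injection pattern behind the robustness section can be sketched as follows. The function names, the corruption mechanism, and the `"INVALID"` placeholder are hypothetical; only the idea (corrupt a field at a given rate, then measure relative F1 decay) comes from the description above.

```python
# Hypothetical sketch of one ClinicalRobustnessInjector scenario:
# ICD9 code corruption at a configurable rate.
import numpy as np
import pandas as pd


def corrupt_icd9(df: pd.DataFrame, col: str, rate: float,
                 seed: int = 0) -> pd.DataFrame:
    """Randomly corrupt a fraction of ICD9 codes, simulating upstream
    coding errors in a payer claims feed."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    mask = rng.random(len(out)) < rate
    out.loc[mask, col] = "INVALID"   # placeholder for a garbled code
    return out


def f1_decay(f1_clean: float, f1_corrupted: float) -> float:
    """Relative F1 decay (%) under a failure scenario."""
    return 100.0 * (f1_clean - f1_corrupted) / f1_clean
```

A scenario then becomes: corrupt the evaluation frame, re-score the trained model, and report `f1_decay` against the danger threshold.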
Python 3.11+
- scikit-learn: model training and evaluation pipelines
- fairlearn: MetricFrame for per-cohort fairness metrics
- scipy.stats: chi2_contingency for all significance tests
- pydantic v2: schema validation and data contracts
- pandas / numpy: data wrangling
- matplotlib: charts (ROC curve, calibration, fairness bar charts)
- joblib: model serialization
- typer: CLI interface
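As an illustration of how this stack composes, here is a dependency-light sketch of the fairness-audit pattern: per-cohort denial rates, the demographic parity difference, and a chi-square test of independence. In the project itself fairlearn's `MetricFrame` computes the per-cohort metrics; this stand-in shows the same computation with only pandas and scipy, and the column names are assumptions.

```python
# Minimal fairness-audit sketch (stands in for fairlearn's MetricFrame).
import pandas as pd
from scipy.stats import chi2_contingency


def audit_cohort(y_pred, sensitive):
    """Per-cohort denial rate, demographic parity difference, and a
    chi-square test of whether the decision is independent of cohort."""
    pred = pd.Series(y_pred)
    cohort = pd.Series(sensitive)
    # Selection (denial) rate per cohort
    rates = pred.groupby(cohort).mean()
    # DPD: largest gap between any two cohorts' denial rates
    dpd = float(rates.max() - rates.min())
    # Chi-square test on the cohort x decision contingency table
    table = pd.crosstab(cohort, pred)
    _, p_value, _, _ = chi2_contingency(table)
    return rates, dpd, float(p_value)
```

A finding would be raised only if `dpd` clears the effect-size threshold and `p_value` clears significance, per the audit rule described earlier.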
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Place CMS DE-SynPUF files in data/raw/cms/
#    Beneficiary Summary + Inpatient Claims (Sample 1)
#    Free registration: cms.gov/Research-Statistics-Data-and-Systems

# 3. Run the full pipeline
python cli.py evaluate \
    --bene-file DE1_0_2008_Beneficiary_Summary_File_Sample_1.csv \
    --claims-file DE1_0_2008_to_2010_Inpatient_Claims_Sample_1.csv \
    --model logistic

# Output: reports/payorlens_logistic.html
```
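The Typer entry point invoked above might look like the following minimal sketch. The option names mirror the usage example; the command body and help text are illustrative placeholders for the real pipeline wiring.

```python
# Hypothetical sketch of the cli.py entry point; the pipeline body
# is a placeholder.
import typer

app = typer.Typer(help="PayorLens governance evaluation harness")


@app.command()
def evaluate(
    bene_file: str = typer.Option(..., "--bene-file",
                                  help="Beneficiary Summary CSV"),
    claims_file: str = typer.Option(..., "--claims-file",
                                    help="Inpatient Claims CSV"),
    model: str = typer.Option("logistic", "--model",
                              help="Model key, e.g. logistic or gbm"),
):
    """Run load -> evaluate -> fairness -> robustness -> report."""
    typer.echo(f"Evaluating {model} on {bene_file} + {claims_file}")
    # loader / evaluator / fairness / robustness / risk_interpreter / reporter


if __name__ == "__main__":
    app()
```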
```
payorlens/
├── loader.py             # CMS data ingestion, normalization, Pydantic validation
├── evaluator.py          # Model training + core performance metrics
├── fairness.py           # FairnessAuditor: DPD/EOD with chi-square significance
├── robustness.py         # ClinicalRobustnessInjector: 5 failure scenarios
├── risk_interpreter.py   # RiskInterpreter: metrics → risk narratives + NIST mapping
├── reporter.py           # ReportGenerator: two-audience HTML/PDF output
└── cli.py                # Typer CLI entry point
data/
├── raw/cms/              # CMS DE-SynPUF source files (not committed; too large)
└── processed/            # Parquet, trained models, charts
reports/
├── payorlens_logistic.html   # Sample report: logistic regression
└── payorlens_gbm.html        # Sample report: gradient boosting
```
All findings are mapped to the NIST AI Risk Management Framework 1.0 (January 2023), specifically its four core functions (Govern, Map, Measure, Manage) and their subcategories.
NIST AI RMF is a voluntary federal framework. References to state statutes (CO SB21-169, NY Circular Letter No. 7) and NAIC guidance in report narratives are contextual illustrations of the regulatory environment a payer would face, not legal opinions.
On the target variable: CMS DE-SynPUF does not include actual prior authorization denial decisions; that data is proprietary to individual payers. The denial_status label used in this evaluation is an engineered proxy, constructed using clinically informed logistic regression on utilization days, claim amount, chronic condition burden, and demographic features, calibrated to a ~21% denial rate consistent with published industry benchmarks.
The proxy label is used to demonstrate evaluation methodology. A production engagement would use a payer's actual denial labels. The framework (schema validation, fairness metrics, robustness scenarios, risk interpretation, NIST mapping) applies unchanged regardless of how the target variable is sourced.
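For concreteness, a proxy label of this shape can be sketched as below. This is a simplified linear-score stand-in for the clinically informed logistic regression described above: the feature names, weights, and noise term are assumptions, and only the ~21% target denial rate comes from the text.

```python
# Hypothetical sketch of an engineered denial_status proxy label.
# Weights and feature names are illustrative assumptions.
import numpy as np
import pandas as pd


def zscore(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / (s.std() + 1e-9)


def make_proxy_label(df: pd.DataFrame, target_rate: float = 0.21,
                     seed: int = 0) -> pd.Series:
    """Score claims with a linear model on utilization/cost/burden
    features, then choose the cutoff yielding the target denial rate."""
    rng = np.random.default_rng(seed)
    score = (
        0.8 * zscore(df["utilization_days"])
        + 0.6 * zscore(df["claim_amount"])
        + 0.5 * zscore(df["chronic_condition_count"])
        + rng.normal(0, 0.25, len(df))   # small noise term
    )
    cutoff = np.quantile(score, 1.0 - target_rate)
    return pd.Series((score >= cutoff).astype(int), name="denial_status")
```

Calibrating via the score quantile guarantees the realized denial rate matches the target regardless of the feature distributions.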
On model performance: The logistic regression baseline (F1=0.379, AUC=0.632) is intentionally a starting point, not a production candidate. The purpose of this evaluation is to demonstrate what the governance framework catches when a model has problems, and this one has several. That is the point.
PayorLens is an evaluation harness: it assesses a model and produces a governance report. It is not a model registry, a compliance artifact generator, a circuit breaker, or a HITL workflow system. It evaluates one model at a time against public data and tells you what the risks are.
Built as an independent portfolio project demonstrating AI governance methodology for payer AI use cases. Domain informed by experience in health AI and regulatory compliance. Not affiliated with any payer, EHR vendor, or AI company.
NIST AI RMF is a product of the National Institute of Standards and Technology. CMS DE-SynPUF is a public dataset from the Centers for Medicare & Medicaid Services.