ML Model Evaluation and Benchmarking Framework

Description

Generate a comprehensive ML model evaluation framework with multi-metric scoring, statistical significance testing, and visual reports.

The Prompt

You are an ML evaluation expert. Build a comprehensive model benchmarking framework.

## Configuration
- **Task Type**: [TASK_TYPE e.g. classification / regression / NLG / ranking]
- **Models to Compare**: [MODELS e.g. model_a, model_b, model_c]
- **Dataset**: [DATASET e.g. custom test set with ground truth labels]
- **Output Format**: [FORMAT e.g. HTML report / JSON / LaTeX table]

## Framework Components

### 1. Metric Registry (`metrics.py`)
For [TASK_TYPE], implement:

**Classification**: accuracy, precision, recall, F1 (macro/micro/weighted), AUC-ROC, AUC-PR, Matthews correlation coefficient, Cohen's kappa, confusion matrix, calibration curve

**Regression**: MSE, RMSE, MAE, MAPE, R², adjusted R², residual analysis

**NLG**: BLEU, ROUGE-1/2/L, METEOR, BERTScore, human preference correlation

**Ranking**: NDCG@k, MAP@k, MRR, precision@k, recall@k
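As a reference point for what a metric registry might look like, here is a minimal sketch using only the standard library. The names `METRICS`, `register`, and `evaluate` are hypothetical, and only two classification metrics are shown; a generated `metrics.py` would likely cover the full lists above and may be organized differently.

```python
from typing import Callable, Dict, Hashable, Sequence

MetricFn = Callable[[Sequence[Hashable], Sequence[Hashable]], float]

# Hypothetical registry layout: task type -> metric name -> metric function.
METRICS: Dict[str, Dict[str, MetricFn]] = {}


def register(task: str, name: str) -> Callable[[MetricFn], MetricFn]:
    """Decorator that files a metric function under a task type."""
    def decorator(fn: MetricFn) -> MetricFn:
        METRICS.setdefault(task, {})[name] = fn
        return fn
    return decorator


@register("classification", "accuracy")
def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


@register("classification", "f1_macro")
def f1_macro(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def evaluate(task: str, y_true: Sequence[Hashable],
             y_pred: Sequence[Hashable]) -> Dict[str, float]:
    """Run every metric registered for the task."""
    return {name: fn(y_true, y_pred) for name, fn in METRICS[task].items()}
```

Calling `evaluate("classification", y_true, y_pred)` returns a dictionary keyed by metric name, which keeps the downstream report code independent of which metrics are registered.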

### 2. Statistical Testing (`significance.py`)
- Paired bootstrap test for comparing two models
- McNemar's test for classification differences
- Wilcoxon signed-rank test for non-parametric comparison
- Bonferroni correction for multiple comparisons
- Effect size calculation (Cohen's d)
- Confidence interval estimation with configurable alpha=[ALPHA e.g. 0.05]
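To illustrate the first and last items in this list, the sketch below shows one common form of the paired bootstrap: resample per-example score differences with replacement, then read a percentile confidence interval and a two-sided p-value off the resampled means. The function name `paired_bootstrap`, its parameters, and the percentile-based p-value are assumptions made for illustration; other bootstrap variants are equally valid.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float], float]:
    """Compare two models scored on the same examples.

    Returns (observed mean difference A - B, (ci_low, ci_high), p_value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Paired test needs one score per example for each model.")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n

    # Resample the per-example differences with replacement.
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))

    # Percentile confidence interval at level 1 - alpha.
    lo = boot[int((alpha / 2) * n_resamples)]
    hi = boot[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]

    # Two-sided p-value: how often the resampled difference falls on
    # either side of zero, doubled and capped at 1.
    frac_le = sum(b <= 0 for b in boot) / n_resamples
    frac_ge = sum(b >= 0 for b in boot) / n_resamples
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    return observed, (lo, hi), p_value
```

When several model pairs are compared, a Bonferroni correction would compare each p-value against `alpha / number_of_comparisons` rather than `alpha`.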

### 3. Subgroup Analysis (`fairness.py`)
- Slice performance by [SLICING_COLUMNS e.g. gender, age_group, language]
- Identify worst-performing subgroups
- Calculate disparity ratios between subgroups
- Equalized odds and demographic parity checks
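The slicing steps above can be sketched as follows. This is a minimal, standard-library illustration with hypothetical function names; it covers per-group accuracy, the worst subgroup, a worst-to-best disparity ratio, and the per-group positive prediction rate used in a demographic parity check. Equalized odds would additionally need per-group true and false positive rates, which are not shown here.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def accuracy_by_group(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    groups: Sequence[Hashable],
) -> Dict[Hashable, float]:
    """Accuracy computed separately for each value of the slicing column."""
    correct: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def worst_group(per_group: Dict[Hashable, float]) -> Tuple[Hashable, float]:
    """Subgroup with the lowest score."""
    return min(per_group.items(), key=lambda kv: kv[1])


def disparity_ratio(per_group: Dict[Hashable, float]) -> float:
    """Worst score divided by best score; 1.0 means no disparity."""
    best = max(per_group.values())
    return min(per_group.values()) / best if best else float("nan")


def positive_rate_by_group(
    y_pred: Sequence[Hashable],
    groups: Sequence[Hashable],
    positive: Hashable = 1,
) -> Dict[Hashable, float]:
    """Share of positive predictions per group (demographic parity input)."""
    pos: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)
    for p, g in zip(y_pred, groups):
        total[g] += 1
        pos[g] += int(p == positive)
    return {g: pos[g] / total[g] for g in total}
```

A demographic parity gap can then be taken as the difference between the largest and smallest values returned by `positive_rate_by_group`.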

### 4. Report Generator (`report.py`)
- Comparative leaderboard table with all metrics
- Per-model performance radar charts
- Error analysis with worst-case examples
- Statistical significance matrix (p-values)
- Export to [FORMAT]
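For the leaderboard and export items, a minimal sketch might look like the following. It assumes results arrive as a nested dictionary of model name to metric scores; the function names and the choice of Markdown and JSON outputs are illustrative only, and charts or HTML rendering are left out.

```python
import json
from typing import Dict, List, Tuple

Results = Dict[str, Dict[str, float]]  # model name -> metric name -> score


def leaderboard_rows(results: Results, sort_by: str) -> List[Tuple[str, Dict[str, float]]]:
    """Models ordered from best to worst on the chosen metric."""
    return sorted(results.items(), key=lambda kv: kv[1][sort_by], reverse=True)


def render_markdown(results: Results, sort_by: str) -> str:
    """Leaderboard as a Markdown table with one column per metric."""
    metrics = sorted({m for scores in results.values() for m in scores})
    lines = [
        "| model | " + " | ".join(metrics) + " |",
        "|---" * (len(metrics) + 1) + "|",
    ]
    for model, scores in leaderboard_rows(results, sort_by):
        cells = [f"{scores.get(m, float('nan')):.3f}" for m in metrics]
        lines.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


def render_json(results: Results, sort_by: str) -> str:
    """Leaderboard as a JSON array, best model first."""
    ranked = [{"model": m, **s} for m, s in leaderboard_rows(results, sort_by)]
    return json.dumps(ranked, indent=2)
```

Keeping the ranking logic in `leaderboard_rows` means each output format only has to handle rendering.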

### 5. Command-Line Interface
- `benchmark run --config config.yaml`
- `benchmark compare --models model_a model_b --metric f1`
- `benchmark report --format html --output results/`
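The three commands above map naturally onto `argparse` subcommands. The sketch below only parses arguments and echoes them; the handler logic, the `build_parser` and `main` names, and the allowed `--format` choices (taken from the output formats listed under Configuration) are assumptions.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for the hypothetical `benchmark` command."""
    parser = argparse.ArgumentParser(prog="benchmark")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="Run the full benchmark from a config file")
    run.add_argument("--config", required=True, help="Path to config.yaml")

    compare = sub.add_parser("compare", help="Compare models on one metric")
    compare.add_argument("--models", nargs="+", required=True,
                         help="Two or more model names")
    compare.add_argument("--metric", required=True, help="Metric name, e.g. f1")

    report = sub.add_parser("report", help="Render a report from saved results")
    # Assumed choices, based on the formats named in the Configuration section.
    report.add_argument("--format", choices=["html", "json", "latex"],
                        default="html")
    report.add_argument("--output", required=True, help="Output directory")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments and dispatch; real handlers would replace the print."""
    args = build_parser().parse_args(argv)
    print(f"command={args.command} options={vars(args)}")
    return 0
```

Exposing `main` as a console-script entry point named `benchmark` would make the commands above runnable as written.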

Provide complete implementation with type hints, docstrings, and example config files for each task type.

Free to copy and use. Compatible with Claude 4 Opus, GPT-5, Gemini 2.5 Pro.

Usage Instructions

Fill in the bracketed placeholders: select your task type and list the models to compare. Prepare a test dataset with ground-truth labels. Then run `benchmark run --config config.yaml` to generate the comparative reports.

Compatible AI Models

Claude 4 Opus, GPT-5, Gemini 2.5 Pro

Version History

Initial release
