Evaluating model performance is crucial for determining how well a machine learning model generalizes to new data. In medical applications (e.g., lesion classification), reliable evaluation directly affects clinical usefulness and patient safety.
| Metric | Description | Interpretation |
|---|---|---|
| Accuracy | (TP + TN) / total cases | Good for balanced datasets |
| Precision | TP / (TP + FP) | Fraction of predicted positives that are truly positive |
| Recall (Sensitivity) | TP / (TP + FN) | Ability to detect true positives |
| Specificity | TN / (TN + FP) | Ability to detect true negatives |
| F1-Score | Harmonic mean of precision and recall | Balances precision & recall |
| AUC (Area Under ROC Curve) | Measures ability to distinguish between classes | Closer to 1 = better |
| Balanced Accuracy | Mean of sensitivity and specificity | Useful for imbalanced datasets |
| Confusion Matrix | Table of TP, FP, TN, FN counts | Full picture of model errors |
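To connect the formulas to code, here is a minimal sketch that computes them directly from the confusion-matrix cells; the labels are made-up illustrative values, not data from any study:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels (1 = lesion, 0 = healthy); illustrative values only.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0])

# sklearn returns the binary confusion matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision    = tp / (tp + fp)              # TP / (TP + FP)
sensitivity  = tp / (tp + fn)              # recall: TP / (TP + FN)
specificity  = tn / (tn + fp)              # TN / (TN + FP)
f1           = 2 * precision * sensitivity / (precision + sensitivity)
balanced_acc = (sensitivity + specificity) / 2

print(f"precision={precision:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} F1={f1:.2f} balanced={balanced_acc:.2f}")
```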
* For multi-class problems, metrics are computed per class and then averaged:
```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Assume y_test = true labels, y_pred = predicted labels
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# For multi-class AUC, pass predicted class probabilities (y_proba)
roc_auc_score(y_test, y_proba, multi_class="ovr")
```
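To make the per-class averaging explicit, the following short sketch (reusing the y_test and y_pred assumed above) contrasts macro averaging, which treats every class equally, with support-weighted averaging:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# macro    = unweighted mean over classes (rare classes count equally)
# weighted = mean weighted by each class's support in y_test
for avg in ("macro", "weighted"):
    p = precision_score(y_test, y_pred, average=avg, zero_division=0)
    r = recall_score(y_test, y_pred, average=avg, zero_division=0)
    f = f1_score(y_test, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```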
* High sensitivity: essential for detecting critical conditions (e.g., tumors)
* High specificity: important to avoid false positives
* Balanced accuracy: prevents overestimation on imbalanced data (e.g., rare tumors); see the sketch below
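To illustrate the last point, a degenerate model that always predicts "no tumor" on a hypothetical 95/5 test set scores high accuracy but only chance-level balanced accuracy; the counts below are assumptions for the sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical rare-tumor test set: 95 negatives, 5 positives (illustrative only).
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # degenerate model: predicts "no tumor" for everyone

print(accuracy_score(y_true, y_pred))           # 0.95 -- looks deceptively good
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 -- exposes zero sensitivity
```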
* Always report multiple metrics, not just accuracy
* Use cross-validation to avoid overfitting to a single train/test split
* Consider confidence intervals for key metrics (a cross-validation + bootstrap sketch follows below)
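The sketch below combines the last two recommendations; it uses sklearn's make_classification as a synthetic stand-in for real data, stratified 5-fold cross-validation, and a simple percentile bootstrap for a 95% confidence interval:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (~90/10 split) as a stand-in for a real dataset.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
model = LogisticRegression(max_iter=1000)

# Stratified 5-fold CV keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
print(f"CV balanced accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# Percentile bootstrap for a 95% CI on a held-out metric.
model.fit(X[:400], y[:400])
y_test, y_pred = y[400:], model.predict(X[400:])
rng = np.random.default_rng(0)
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y_test), len(y_test))  # resample test cases with replacement
    boot.append(balanced_accuracy_score(y_test[idx], y_pred[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for balanced accuracy: [{lo:.2f}, {hi:.2f}]")
```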