Data Science Cheat Sheet

== Confusion Matrix and F1 Score ==
'''Confusion Matrix'''
{| class="wikitable"
|-
! !!Predicted Positive!!Predicted Negative
|-
|'''Actual Positive'''||True Positive (TP)||False Negative (FN)
|-
|'''Actual Negative'''||False Positive (FP)||True Negative (TN)
|}
'''F1 Score''' = 2 * (Precision * Recall) / (Precision + Recall)
*2 * (Positive Predictive Value * True Positive Rate) / (Positive Predictive Value + True Positive Rate)
*2 * TP / (2 * TP + FP + FN)
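A minimal sketch of these formulas in code, assuming scikit-learn is available (the labels and predictions below are made-up toy values), showing that the equivalent F1 expressions above agree:
<syntaxhighlight lang="python">
# Sketch: F1 from the confusion matrix, computed three equivalent ways on toy data.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (made up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (made up)

# ravel() order for binary labels is TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                 # Positive Predictive Value
recall    = tp / (tp + fn)                 # True Positive Rate
f1_from_pr     = 2 * precision * recall / (precision + recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_from_pr, f1_from_counts, f1_score(y_true, y_pred))  # all three should match
</syntaxhighlight>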
 
== Key Evaluation Metrics ==
'''[[Recall (Data Science)|True Positive Rate (TPR), Sensitivity, Recall]]'''
*TPR = Sensitivity = Recall = TP / (TP + FN)
*Application: Measures the model's ability to correctly identify positive cases, useful in medical diagnostics to ensure true positives are detected.
'''[[Precision (Data Science)|Precision (Positive Predictive Value)]]'''
*Precision = TP / (TP + FP)
*Application: Indicates the proportion of positive predictions that are correct, valuable in applications like spam filtering to minimize false alarms.
'''[[Specificity (Data Science)|Specificity (True Negative Rate, TNR)]]'''
*Specificity = TNR = TN / (TN + FP)
*Application: Assesses the model's accuracy in identifying negative cases, crucial in fraud detection to avoid unnecessary scrutiny of legitimate transactions.
'''[[False Positive Rate|False Positive Rate (FPR)]]'''
*FPR = FP / (FP + TN)
*Application: Reflects how often actual negatives are incorrectly flagged as positive, significant in security systems where false alarms can lead to excessive interventions.
'''Negative Predictive Value (NPV)'''
*NPV = TN / (TN + FN)
*Application: Shows the likelihood that a negative prediction is correct, important in screening tests to reliably rule out the condition for those who test negative.
'''[[Accuracy (Data Science)|Accuracy]]'''
*Accuracy = (TP + TN) / (TP + TN + FP + FN)
*Application: Provides an overall measure of model correctness, often used as a baseline metric but less informative for imbalanced datasets.
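All six metrics above follow directly from the four confusion-matrix counts. A self-contained sketch (the counts are arbitrary illustrative numbers, not from a real model):
<syntaxhighlight lang="python">
# Sketch: every metric in this section from the four confusion-matrix counts.
tp, fn, fp, tn = 80, 20, 30, 870   # arbitrary illustrative counts

tpr         = tp / (tp + fn)                  # Recall / Sensitivity
precision   = tp / (tp + fp)                  # Positive Predictive Value
specificity = tn / (tn + fp)                  # True Negative Rate
fpr         = fp / (fp + tn)                  # = 1 - specificity
npv         = tn / (tn + fn)                  # Negative Predictive Value
accuracy    = (tp + tn) / (tp + tn + fp + fn)

print(f"TPR={tpr:.2f}  Precision={precision:.2f}  Specificity={specificity:.2f}")
print(f"FPR={fpr:.2f}  NPV={npv:.2f}  Accuracy={accuracy:.2f}")
</syntaxhighlight>
With these deliberately imbalanced counts, accuracy comes out around 0.95 while precision is only about 0.73, which illustrates the imbalance caveat noted under Accuracy.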


== Relationships between Key Concepts ==
'''TPR (Recall) and Precision'''
*TPR represents the proportion of actual positives correctly predicted by the model, while Precision shows the proportion of predicted positives that are actually positive.
**Increasing TPR (Recall) can sometimes reduce Precision, and vice versa.
'''FPR and Specificity'''
*Specificity = 1 - FPR. In an ROC curve, FPR is plotted on the x-axis and TPR on the y-axis to visualize model performance.
'''F1 Score'''
*Defined as the harmonic mean of Precision and Recall, emphasizing their balance: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
'''Accuracy'''
*Accuracy reflects overall model performance but may not be suitable in cases of class imbalance.

== Curves & Charts ==
'''Lift Curve'''

* '''X-axis''': Percent of data (typically population percentile or cumulative population)
* '''Y-axis''': Lift (ratio of the model's response rate to the baseline response rate)
* '''Application''': Helps in evaluating the effectiveness of a model in prioritizing high-response cases, often used in marketing to identify segments likely to respond to promotions.
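A rough numpy-only sketch of one common way to compute lift at different depths (the scores and labels below are synthetic, purely for illustration):
<syntaxhighlight lang="python">
# Sketch: lift at several population depths, using synthetic scores and labels.
import numpy as np

rng = np.random.default_rng(0)
y_true  = rng.binomial(1, 0.1, size=1000)            # synthetic labels, ~10% positive
y_score = y_true * 0.3 + rng.random(1000) * 0.7      # synthetic, loosely informative scores

order = np.argsort(-y_score)                          # highest-scored cases first
sorted_true = y_true[order]
base_rate = y_true.mean()                             # baseline response rate

for pct in (10, 20, 30, 50, 100):
    k = int(len(sorted_true) * pct / 100)
    rate_in_top = sorted_true[:k].mean()              # response rate within the top pct%
    print(f"top {pct:>3}%  lift = {rate_in_top / base_rate:.2f}")
</syntaxhighlight>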
'''Gain Chart'''

* '''X-axis''': Percent of data (typically cumulative population)
* '''Y-axis''': Cumulative gain (proportion of positives captured)
* '''Application''': Illustrates the cumulative capture of positive responses at different cutoffs, useful in customer targeting to assess the efficiency of resource allocation.
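A minimal sketch of the cumulative gain calculation along the same lines (synthetic data again; the diagonal where gain equals the fraction of data is the random baseline):
<syntaxhighlight lang="python">
# Sketch: cumulative gain (fraction of all positives captured vs. fraction of data).
import numpy as np

rng = np.random.default_rng(1)
y_true  = rng.binomial(1, 0.1, size=1000)               # synthetic labels
y_score = y_true * 0.3 + rng.random(1000) * 0.7         # synthetic, loosely informative scores

order = np.argsort(-y_score)                             # highest-scored cases first
pct_data = np.arange(1, len(y_true) + 1) / len(y_true)  # fraction of population contacted
gain = np.cumsum(y_true[order]) / y_true.sum()           # fraction of all positives captured

print(f"gain at 20% of the data: {gain[int(0.2 * len(pct_data)) - 1]:.2f}")
</syntaxhighlight>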
 
'''Cumulative Response Curve'''
 
* '''X-axis''': Percent of data (cumulative population)
* '''Y-axis''': Cumulative response (running total of actual positives captured)
* '''Application''': Evaluates model performance by showing how many true positives are captured as more of the population is included, applicable in direct marketing to optimize campaign reach.
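The cumulative response curve uses the same sorting idea as the gain chart, with a running total of positives on the y-axis rather than a proportion; a brief synthetic-data sketch:
<syntaxhighlight lang="python">
# Sketch: cumulative response: running count of positives captured vs. percent of data.
import numpy as np

rng = np.random.default_rng(2)
y_true  = rng.binomial(1, 0.1, size=1000)               # synthetic labels
y_score = y_true * 0.3 + rng.random(1000) * 0.7         # synthetic scores

order = np.argsort(-y_score)
cum_response = np.cumsum(y_true[order])                  # running total of positives captured
pct_data = np.arange(1, len(y_true) + 1) / len(y_true)

print("positives captured in the first 30% of contacts:",
      cum_response[int(0.3 * len(pct_data)) - 1])
</syntaxhighlight>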
 
'''ROC Curve'''
 
* '''X-axis''': False Positive Rate (FPR)
* '''Y-axis''': True Positive Rate (TPR or Sensitivity)
* '''Application''': Used to evaluate the trade-off between true positive and false positive rates at various thresholds, crucial in medical testing to balance sensitivity and specificity.
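A standard way to get the ROC points, assuming scikit-learn is available (synthetic scores, for illustration only):
<syntaxhighlight lang="python">
# Sketch: ROC curve points and AUC with scikit-learn, on synthetic data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
y_true  = rng.binomial(1, 0.3, size=500)                # synthetic labels
y_score = y_true * 0.4 + rng.random(500) * 0.6          # synthetic, loosely informative scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)       # x = FPR, y = TPR, one point per threshold
print("AUC:", round(roc_auc_score(y_true, y_score), 3))
# plotting fpr vs. tpr draws the curve; the diagonal corresponds to random guessing
</syntaxhighlight>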
 
'''Precision-Recall Curve'''
 
* '''X-axis''': Recall (True Positive Rate)
* '''Y-axis''': Precision (Positive Predictive Value)
* '''Application''': Focuses on the balance between recall and precision, especially useful in cases of class imbalance, like fraud detection or medical diagnosis, where positive class accuracy is vital.
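And the matching sketch for the precision-recall curve, again assuming scikit-learn (heavily imbalanced synthetic data, where this curve is most informative):
<syntaxhighlight lang="python">
# Sketch: precision-recall curve with scikit-learn, on imbalanced synthetic data.
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

rng = np.random.default_rng(4)
y_true  = rng.binomial(1, 0.05, size=2000)              # ~5% positives (imbalanced)
y_score = y_true * 0.4 + rng.random(2000) * 0.6         # synthetic, loosely informative scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("average precision:", round(average_precision_score(y_true, y_score), 3))
# plotting recall vs. precision draws the curve; the baseline is the positive rate (~0.05)
</syntaxhighlight>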
