Data Science Cheat Sheet
IT Wiki
Latest revision as of 14:44, 4 November 2024 (Mon)
Models
- Support Vector Machine (SVM): A supervised model that finds the maximum-margin hyperplane separating the classes, widely used in high-dimensional tasks such as text classification (e.g., spam detection).
  - Advantage: Effective in high-dimensional spaces and fairly robust to overfitting when the kernel and regularization are chosen well.
  - Disadvantage: Computationally intensive on large datasets and sensitive to hyperparameter tuning.
 
- k-Nearest Neighbors (kNN): A non-parametric method that classifies a point by majority vote among its k nearest training examples, often applied in recommendation systems and image recognition.
  - Advantage: Simple and intuitive, with no explicit training phase, making it easy to implement.
  - Disadvantage: Computationally expensive at prediction time, especially with large datasets, and sensitive to irrelevant features.
 
- Decision Tree: A model that splits data into branches based on feature values, useful for interpretable applications like customer segmentation and medical diagnosis.
  - Advantage: Highly interpretable and handles both numerical and categorical data well.
  - Disadvantage: Prone to overfitting, especially with deep trees, and sensitive to small changes in the training data.
 
- Linear Regression: A statistical technique that predicts a continuous outcome as a linear function of the input features, commonly used in financial forecasting and trend analysis.
  - Advantage: Simple and interpretable, with fast training even on large datasets.
  - Disadvantage: Assumes a linear relationship, so it fits complex, non-linear data poorly unless the features are transformed.
 
- Logistic Regression: A classification model that estimates the probability of a binary outcome, widely used in credit scoring and binary medical diagnostics.
  - Advantage: Interpretable, with a clear probabilistic output, and efficient for binary classification.
  - Disadvantage: Limited to linear decision boundaries, making it ineffective for complex relationships without feature transformations.
 
- Naive Bayes: A probabilistic classifier that assumes features are conditionally independent given the class, effective in text classification tasks like spam filtering due to its speed and simplicity.
  - Advantage: Fast and efficient, especially on large datasets, and works best when the independence assumption approximately holds.
  - Disadvantage: Assumes feature independence, which may reduce accuracy if dependencies exist between features.
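
As one illustration of the models listed above, the following is a minimal sketch of kNN classification using only the Python standard library. The function name `knn_predict`, the toy points, and the labels are hypothetical, and Euclidean distance with a plain majority vote is an assumption; real uses typically scale features and tune k.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored point: there is no training
    # step, so all of the work happens here, at prediction time.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.1, 5.0)))
```

Because every prediction scans the full training set, the cost grows with the dataset size, which is the prediction-time disadvantage noted for kNN.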
 
Confusion Matrix and F1 Score
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
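
The four cells of the table can be counted directly from paired labels. This is a minimal sketch; the function name `confusion_counts`, the `positive` parameter, and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Hypothetical labels for eight cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```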
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
- = 2 * (Positive Predictive Value * True Positive Rate) / (Positive Predictive Value + True Positive Rate)
- = 2 * TP / (2 * TP + FP + FN)
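
A short check, under assumed example counts, that the rate form 2 * (Precision * Recall) / (Precision + Recall) and the count form 2 * TP / (2 * TP + FP + FN) give the same F1 value. The function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def f1_from_rates(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention for this sketch when both rates are zero
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp, fp, fn):
    """Equivalent F1 computed directly from confusion-matrix counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts.
tp, fp, fn = 3, 1, 1
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f1_from_rates(precision, recall), f1_from_counts(tp, fp, fn))
```

The count form follows by substituting Precision = TP / (TP + FP) and Recall = TP / (TP + FN) into the rate form and simplifying.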
Key Evaluation Metrics
True Positive Rate (TPR), Sensitivity, Recall
- TPR = Sensitivity = Recall = TP / (TP + FN)
- Application: Measures the share of actual positive cases the model identifies, useful in medical diagnostics where a missed positive (false negative) is costly.
Precision (Positive Predictive Value)
- Precision = TP / (TP + FP)
- Application: Indicates the proportion of positive predictions that are correct, valuable in applications like spam filtering to minimize false alarms.
Specificity (True Negative Rate, TNR)
- Specificity = TNR = TN / (TN + FP)
- Application: Assesses the model's accuracy in identifying negative cases, crucial in fraud detection to avoid unnecessary scrutiny of legitimate transactions.
False Positive Rate (FPR)
- FPR = FP / (FP + TN) = 1 - Specificity
- Application: Reflects the rate of false alarms for negative cases, significant in security systems where false positives can lead to excessive interventions.
Negative Predictive Value (NPV)
- NPV = TN / (TN + FN)
- Application: Shows the likelihood that a negative prediction is correct, important in screening tests where a negative result must be reliable enough to rule out the condition.
Accuracy
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Application: Provides an overall measure of model correctness, often used as a baseline metric but less informative for imbalanced datasets.
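
The formulas above can be gathered into one helper. This is a sketch with hypothetical names (`safe_div`, `evaluation_metrics`); returning NaN for an undefined ratio is an assumption of the sketch. The example counts are invented to show why accuracy alone is less informative on imbalanced data.

```python
def safe_div(numerator, denominator):
    """Return the ratio, or NaN when the denominator is zero (undefined metric)."""
    return numerator / denominator if denominator else float("nan")


def evaluation_metrics(tp, fp, fn, tn):
    """Compute the metrics listed above from confusion-matrix counts."""
    return {
        "recall": safe_div(tp, tp + fn),        # TPR, sensitivity
        "precision": safe_div(tp, tp + fp),     # positive predictive value
        "specificity": safe_div(tn, tn + fp),   # TNR
        "fpr": safe_div(fp, fp + tn),           # 1 - specificity
        "npv": safe_div(tn, tn + fn),
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
    }


# Hypothetical imbalanced case: 10 positives among 1000 cases,
# and a model that predicts "negative" for everything.
always_negative = evaluation_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_negative["accuracy"], always_negative["recall"])
```

Here accuracy is 0.99 while recall is 0, because the model never detects a positive case.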
Curves & Chart
Lift Curve
- X-axis: Percent of data, ranked by model score (population percentile or cumulative population)
- Y-axis: Lift (ratio of the positive rate among the selected cases to the overall positive rate; a baseline model has lift 1)
- Application: Helps in evaluating the effectiveness of a model in prioritizing high-response cases, often used in marketing to identify segments likely to respond to promotions.
Gain Chart
- X-axis: Percent of data (typically cumulative population)
- Y-axis: Cumulative gain (proportion of positives captured)
- Application: Illustrates the cumulative capture of positive responses at different cutoffs, useful in customer targeting to assess the efficiency of resource allocation.
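
Cumulative gain and lift can both be computed from the same ranking of cases by model score. The sketch below assumes 0/1 labels with at least one positive; the function name `gain_and_lift`, the chosen fractions, and the toy data are hypothetical.

```python
def gain_and_lift(y_true, scores, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Return (fraction, cumulative gain, lift) for each population fraction."""
    # Rank cases from highest to lowest model score.
    ranked = [
        label
        for _, label in sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    ]
    total_positives = sum(ranked)  # assumes at least one positive label
    rows = []
    for fraction in fractions:
        n_top = max(1, round(fraction * len(ranked)))
        captured = sum(ranked[:n_top])
        gain = captured / total_positives   # share of all positives captured
        lift = gain / (n_top / len(ranked))  # gain relative to random selection
        rows.append((fraction, gain, lift))
    return rows


# Hypothetical labels and scores for eight cases (3 positives).
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
rows = gain_and_lift(y_true, scores)
for fraction, gain, lift in rows:
    print(f"top {fraction:.0%}: gain={gain:.2f}, lift={lift:.2f}")
```

In this toy data the top half of the ranking captures all positives, so the gain at 50% is 1.0 and the lift is 2.0; at 100% of the data the lift always returns to 1.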
Cumulative Response Curve
- X-axis: Percent of data (cumulative population)
- Y-axis: Cumulative response (actual positives captured as cumulative total)
- Application: Evaluates model performance by showing how many true positives are captured as more of the population is included, applicable in direct marketing to optimize campaign reach.
ROC Curve
- X-axis: False Positive Rate (FPR)
- Y-axis: True Positive Rate (TPR or Sensitivity)
- Application: Used to evaluate the trade-off between true positive and false positive rates at various thresholds, crucial in medical testing to balance sensitivity and specificity.
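
The points of an ROC curve come from sweeping the decision threshold over the model scores and recording (FPR, TPR) at each step. This is a minimal sketch assuming 0/1 labels with at least one positive and one negative; `roc_points` and the toy data are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from strictest to loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # A case is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and scores for six cases.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): sensitivity rises, but so does the false positive rate.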
Precision-Recall Curve
- X-axis: Recall (True Positive Rate)
- Y-axis: Precision (Positive Predictive Value)
- Application: Focuses on the balance between recall and precision, especially useful in cases of class imbalance, like fraud detection or medical diagnosis, where performance on the positive class matters most.
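
A precision-recall curve is built the same way as a threshold sweep, recording (recall, precision) at each threshold. The sketch assumes 0/1 labels with at least one positive; `precision_recall_points` and the imbalanced toy data (2 positives among 10 cases) are hypothetical.

```python
def precision_recall_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct score threshold."""
    n_pos = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        # tp + fp >= 1 here, because the case whose score equals the
        # threshold is always predicted positive.
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


# Hypothetical imbalanced data: 2 positives among 10 cases.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
points = precision_recall_points(y_true, scores)
for recall, precision in points:
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```

True negatives never enter either ratio, so the many easy negatives in an imbalanced dataset do not inflate the curve.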

