Data Science Cheat Sheet: Difference between revisions

Latest revision as of 14:44, 4 November 2024 (Mon)

Models

Support Vector Machine (SVM): A supervised model that finds the optimal hyperplane for class separation, widely used in high-dimensional tasks like text classification (e.g., spam detection).
- Advantage: Effective in high-dimensional spaces and, with a suitable kernel and regularization, fairly robust to overfitting.
- Disadvantage: Computationally intensive on large datasets and sensitive to parameter tuning.
k-Nearest Neighbors (kNN): A non-parametric method that classifies based on nearest neighbors, often applied in recommendation systems and image recognition.
- Advantage: Simple and intuitive; there is no explicit training phase (the model just stores the training data), which makes it easy to implement.
- Disadvantage: Computationally expensive at prediction time, especially with large datasets, and sensitive to irrelevant features.
Decision Tree: A model that splits data into branches based on feature values, useful for interpretable applications like customer segmentation and medical diagnosis.
- Advantage: Highly interpretable and handles both numerical and categorical data well.
- Disadvantage: Prone to overfitting, especially with deep trees, and can be sensitive to small data changes.
Linear Regression: A statistical technique that predicts a continuous outcome based on linear relationships, commonly used in financial forecasting and trend analysis.
- Advantage: Simple and interpretable, with fast training for large datasets.
- Disadvantage: Assumes a linear relationship between features and outcome, so it underfits complex, non-linear data unless the features are transformed.
Logistic Regression: A classification model estimating the probability of a binary outcome, widely used in credit scoring and binary medical diagnostics.
- Advantage: Interpretable with a clear probabilistic output, efficient for binary classification.
- Disadvantage: Limited to linear boundaries, making it ineffective for complex relationships without transformations.
Naive Bayes: A probabilistic classifier assuming feature independence, effective in text classification tasks like spam filtering due to its speed and simplicity.
- Advantage: Fast and efficient, especially on large datasets, and works best when the independence assumption approximately holds.
- Disadvantage: Assumes feature independence, which may reduce accuracy if dependencies exist between features.
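
As an illustration of the kNN entry above, here is a minimal sketch in plain Python. The function name `knn_predict`, the toy points, and the choice of Euclidean distance with a simple majority vote are assumptions made for this example, not something the cheat sheet prescribes. It shows why there is no training step and why each prediction scans the full training set.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # No training phase: every prediction measures the distance to all stored points.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))  # a
print(knn_predict(points, labels, (5.1, 5.0)))  # b
```

Because every feature contributes equally to the distance, an irrelevant feature with a large scale can dominate the vote, which is the sensitivity noted in the disadvantage above.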

Confusion Matrix and F1 Score

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
         = 2 * (Positive Predictive Value * True Positive Rate) / (Positive Predictive Value + True Positive Rate)
         = 2 * TP / (2 * TP + FP + FN)
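
The following sketch tallies the four confusion-matrix cells from binary labels and computes F1 two ways: from precision and recall, and directly from counts as 2 * TP / (2 * TP + FP + FN). The helper names and the sample labels are hypothetical, and labels are assumed to be encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels encoded as 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_from_rates(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(tp, fp, fn):
    """Same quantity written directly in terms of counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)              # 3 1 1 5
print(f1_from_rates(tp, fp, fn))   # 0.75
print(f1_from_counts(tp, fp, fn))  # 0.75
```

Note that TN does not appear in either form, so F1 ignores how well the negative class is handled.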

Key Evaluation Metrics

True Positive Rate (TPR), Sensitivity, Recall

TPR = Sensitivity = Recall = TP / (TP + FN)
Application: Measures the model's ability to correctly identify positive cases, useful in medical diagnostics to ensure true positives are detected.

Precision (Positive Predictive Value)

Precision = TP / (TP + FP)
Application: Indicates the proportion of positive predictions that are correct, valuable in applications like spam filtering to minimize false alarms.

Specificity (True Negative Rate, TNR)

Specificity = TNR = TN / (TN + FP)
Application: Assesses the model's accuracy in identifying negative cases, crucial in fraud detection to avoid unnecessary scrutiny of legitimate transactions.

False Positive Rate (FPR)

FPR = FP / (FP + TN)
Application: Reflects the rate of false alarms for negative cases, significant in security systems where false positives can lead to excessive interventions.

Negative Predictive Value (NPV)

NPV = TN / (TN + FN)
Application: Shows the likelihood that a negative prediction is correct, important in screening tests where a negative result must reliably rule out the condition.

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Application: Provides an overall measure of model correctness, often used as a baseline metric but less informative for imbalanced datasets.
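
The formulas in this section can be gathered into one small helper. This is a sketch under stated assumptions: the function name `evaluation_metrics`, the dictionary keys, and the decision to return NaN when a denominator is zero are choices made for this example. The second call uses hypothetical counts for an imbalanced dataset to show why accuracy alone can mislead.

```python
import math

def evaluation_metrics(tp, fp, fn, tn):
    """Compute the metrics listed above from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Undefined ratios (zero denominator) are reported as NaN.
        return numerator / denominator if denominator else math.nan

    return {
        "recall": ratio(tp, tp + fn),        # TPR, sensitivity
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "specificity": ratio(tn, tn + fp),   # TNR
        "fpr": ratio(fp, fp + tn),           # equals 1 - specificity
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

balanced = evaluation_metrics(tp=3, fp=1, fn=1, tn=5)
print(balanced["recall"], balanced["accuracy"])  # 0.75 0.8

# Hypothetical imbalanced case: the model predicts "negative" for everything.
always_negative = evaluation_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_negative["accuracy"])  # 0.99
print(always_negative["recall"])    # 0.0
```

In the second case accuracy is 99% even though no positive case is ever found, which is why recall, precision, and NPV are usually reported alongside it.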

Curves & Chart

Lift Curve

X-axis: Percent of data (typically population percentile or cumulative population)
Y-axis: Lift (ratio of the positive rate among the cases the model ranks highest to the baseline positive rate in the whole population)
Application: Helps in evaluating the effectiveness of a model in prioritizing high-response cases, often used in marketing to identify segments likely to respond to promotions.
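
One common way to compute a point on the lift curve is sketched below, assuming lift is defined as the positive rate in the top-scored fraction divided by the overall positive rate. The function name `lift_at`, the rounding of the cutoff, and the sample scores are assumptions for this example.

```python
def lift_at(y_true, scores, fraction):
    """Lift within the top `fraction` of cases when ranked by model score."""
    ranked = [
        label
        for _, label in sorted(
            zip(scores, y_true), key=lambda pair: pair[0], reverse=True
        )
    ]
    total_positives = sum(ranked)
    if total_positives == 0:
        raise ValueError("lift is undefined when there are no positives")
    n_top = max(1, round(fraction * len(ranked)))
    top_positives = sum(ranked[:n_top])
    # (top_positives / n_top) / (total_positives / n), kept in integer form.
    return (top_positives * len(ranked)) / (n_top * total_positives)

# Hypothetical responses (1 = responded) and model scores.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(lift_at(y_true, scores, 0.2))  # 2.5
print(lift_at(y_true, scores, 0.5))  # 1.5
print(lift_at(y_true, scores, 1.0))  # 1.0
```

A lift of 2.5 at 20% means the top fifth of the ranking contains positives at 2.5 times the baseline rate; lift always falls to 1.0 once the whole population is included.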

Gain Chart

X-axis: Percent of data (typically cumulative population)
Y-axis: Cumulative gain (proportion of positives captured)
Application: Illustrates the cumulative capture of positive responses at different cutoffs, useful in customer targeting to assess the efficiency of resource allocation.
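
The points of a gain chart can be computed as in the sketch below, which ranks cases by score and records the share of all positives captured after each additional case. The function name `cumulative_gain` and the sample data are hypothetical.

```python
def cumulative_gain(y_true, scores):
    """Return (fraction of population, fraction of positives captured) pairs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("gain is undefined when there are no positives")
    captured = 0
    points = [(0.0, 0.0)]
    for rank, index in enumerate(order, start=1):
        captured += y_true[index]
        points.append((rank / len(order), captured / total_positives))
    return points

# Hypothetical responses (1 = responded) and model scores.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

gain = cumulative_gain(y_true, scores)
print(gain[2])   # (0.2, 0.5)
print(gain[5])   # (0.5, 0.75)
print(gain[-1])  # (1.0, 1.0)
```

Here contacting the top 20% of the ranking reaches half of all responders, whereas a random ordering would be expected to reach about 20% of them.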

Cumulative Response Curve

X-axis: Percent of data (cumulative population)
Y-axis: Cumulative response (actual positives captured as cumulative total)
Application: Evaluates model performance by showing how many true positives are captured as more of the population is included, applicable in direct marketing to optimize campaign reach.

ROC Curve

X-axis: False Positive Rate (FPR)
Y-axis: True Positive Rate (TPR or Sensitivity)
Application: Used to evaluate the trade-off between true positive and false positive rates at various thresholds, crucial in medical testing to balance sensitivity and specificity.
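
A minimal sketch of how ROC points arise: sweep the decision threshold over the distinct scores, and at each threshold compute FPR and TPR. The function names, the trapezoidal area calculation, and the sample data are assumptions for this example; it also assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A case is predicted positive when its score reaches the threshold.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as x-sorted (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

curve = roc_points(y_true, scores)
print(curve[:3])
print(round(area_under_curve(curve), 4))  # 0.8333
```

Lowering the threshold moves along the curve toward (1, 1): more true positives are caught, at the cost of more false positives.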

Precision-Recall Curve

X-axis: Recall (True Positive Rate)
Y-axis: Precision (Positive Predictive Value)
Application: Focuses on the balance between recall and precision, especially useful in cases of class imbalance, like fraud detection or medical diagnosis, where performance on the rare positive class matters most.
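
The sketch below computes precision-recall points by sweeping the threshold over the distinct scores. The function name `precision_recall_points` and the imbalanced sample data (2 positives out of 10) are hypothetical, and at least one positive label is assumed.

```python
def precision_recall_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct score threshold."""
    positives = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        # tp + fp >= 1 here, because the case scoring exactly `threshold`
        # is always predicted positive.
        points.append((tp / positives, tp / (tp + fp)))
    return points

# Hypothetical imbalanced data: only 2 of 10 cases are positive.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

pr_curve = precision_recall_points(y_true, scores)
print(pr_curve[0])   # (0.5, 1.0)
print(pr_curve[1])   # (0.5, 0.5)
print(pr_curve[-1])  # (1.0, 0.2)
```

True negatives never enter the calculation, so a large negative class cannot inflate the curve; at the lowest threshold precision simply falls to the positive-class prevalence (0.2 here).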

