Data Science Cheat Sheet
== Models ==
* '''Support Vector Machine (SVM)''': A supervised model that finds the optimal hyperplane for class separation, widely used in high-dimensional tasks like text classification (e.g., spam detection).
** '''''Advantage''''': Effective in high-dimensional spaces and robust to overfitting with the proper kernel.
** '''''Disadvantage''''': Computationally intensive on large datasets and sensitive to parameter tuning.
* '''k-Nearest Neighbors (kNN)''': A non-parametric method that classifies based on nearest neighbors, often applied in recommendation systems and image recognition.
** '''''Advantage''''': Simple and intuitive, with no training phase, making it easy to implement.
** '''''Disadvantage''''': Computationally expensive at prediction time, especially with large datasets, and sensitive to irrelevant features.
* '''Decision Tree''': A model that splits data into branches based on feature values, useful for interpretable applications like customer segmentation and medical diagnosis.
** '''''Advantage''''': Highly interpretable and handles both numerical and categorical data well.
** '''''Disadvantage''''': Prone to overfitting, especially with deep trees, and can be sensitive to small data changes.
* '''Linear Regression''': A statistical technique that predicts a continuous outcome based on linear relationships, commonly used in financial forecasting and trend analysis.
** '''''Advantage''''': Simple and interpretable, with fast training for large datasets.
** '''''Disadvantage''''': Assumes a linear relationship, so it's unsuitable for complex, non-linear data.
* '''Logistic Regression''': A classification model estimating the probability of a binary outcome, widely used in credit scoring and binary medical diagnostics.
** '''''Advantage''''': Interpretable with a clear probabilistic output, efficient for binary classification.
** '''''Disadvantage''''': Limited to linear boundaries, making it ineffective for complex relationships without transformations.
* '''Naive Bayes''': A probabilistic classifier assuming feature independence, effective in text classification tasks like spam filtering due to its speed and simplicity.
** '''''Advantage''''': Fast and efficient, especially on large datasets when the independence assumption holds.
** '''''Disadvantage''''': Assumes feature independence, which may reduce accuracy if dependencies exist between features.

== Confusion Matrix and F1 Score ==
'''[[Confusion Matrix]]'''
{| class="wikitable"
|-
! !!Predicted Positive!!Predicted Negative
|-
|'''Actual Positive'''||True Positive (TP)||False Negative (FN)
|-
|'''Actual Negative'''||False Positive (FP)||True Negative (TN)
|}
'''[[F1 Score]]''' = 2 * (Precision * Recall) / (Precision + Recall)
* = 2 * (Positive Predictive Value * True Positive Rate) / (Positive Predictive Value + True Positive Rate)
* = 2 * TP / (2 * TP + FP + FN)

== Key Evaluation Metrics ==
'''[[Recall (Data Science)|True Positive Rate (TPR), Sensitivity, Recall]]'''
*TPR = Sensitivity = Recall = TP / (TP + FN)
*Application: Measures the model's ability to correctly identify positive cases, useful in medical diagnostics to ensure true positives are detected.
'''[[Precision (Data Science)|Precision (Positive Predictive Value)]]'''
*Precision = TP / (TP + FP)
*Application: Indicates the proportion of positive predictions that are correct, valuable in applications like spam filtering to minimize false alarms.
'''[[Specificity (Data Science)|Specificity (True Negative Rate, TNR)]]'''
*Specificity = TNR = TN / (TN + FP)
*Application: Assesses the model's accuracy in identifying negative cases, crucial in fraud detection to avoid unnecessary scrutiny of legitimate transactions.
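These count-based definitions can be checked with a minimal Python sketch. The confusion-matrix counts below are made-up values for illustration only, and the two F1 forms (from precision/recall and directly from counts) should agree:

```python
# Toy confusion-matrix counts (hypothetical values, not from any dataset).
TP, FN, FP, TN = 80, 20, 10, 90

recall = TP / (TP + FN)        # TPR / Sensitivity
precision = TP / (TP + FP)     # Positive Predictive Value
specificity = TN / (TN + FP)   # TNR

# Two equivalent forms of the F1 score.
f1 = 2 * precision * recall / (precision + recall)
f1_counts = 2 * TP / (2 * TP + FP + FN)

print(round(recall, 3), round(precision, 3), round(specificity, 3))
print(round(f1, 3), round(f1_counts, 3))
```

Both F1 expressions evaluate to the same number, which is a quick way to verify a hand-computed confusion matrix.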
'''[[False Positive Rate|False Positive Rate (FPR)]]'''
*FPR = FP / (FP + TN)
*Application: Reflects the rate of false alarms for negative cases, significant in security systems where false positives can lead to excessive interventions.
'''Negative Predictive Value (NPV)'''
*NPV = TN / (TN + FN)
*Application: Shows the likelihood that a negative prediction is accurate, important in screening tests to reassure negative cases reliably.
'''[[Accuracy (Data Science)|Accuracy]]'''
*Accuracy = (TP + TN) / (TP + TN + FP + FN)
*Application: Provides an overall measure of model correctness, often used as a baseline metric but less informative for imbalanced datasets.

== Curves & Charts ==
'''[[Lift Curve]]'''
* '''X-axis''': Percent of data (typically population percentile or cumulative population)
* '''Y-axis''': Lift (ratio of the model's performance to the baseline)
* '''Application''': Helps in evaluating the effectiveness of a model in prioritizing high-response cases, often used in marketing to identify segments likely to respond to promotions.
'''[[Gain Chart]]'''
* '''X-axis''': Percent of data (typically cumulative population)
* '''Y-axis''': Cumulative gain (proportion of positives captured)
* '''Application''': Illustrates the cumulative capture of positive responses at different cutoffs, useful in customer targeting to assess the efficiency of resource allocation.
'''[[Cumulative Response Curve]]'''
* '''X-axis''': Percent of data (cumulative population)
* '''Y-axis''': Cumulative response (actual positives captured as a cumulative total)
* '''Application''': Evaluates model performance by showing how many true positives are captured as more of the population is included, applicable in direct marketing to optimize campaign reach.
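One way to see how gain and lift are computed at a single cutoff is the sketch below. The scored examples are hypothetical toy data; the idea is to rank by model score, take the top fraction of the population, and compare the captured response rate to the baseline:

```python
# Toy scored examples: (model score, actual label). Hypothetical values.
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
          (0.50, 1), (0.40, 0), (0.30, 0), (0.20, 0), (0.10, 0)]

scored.sort(key=lambda s: s[0], reverse=True)  # rank by score, best first
total_pos = sum(label for _, label in scored)
base_rate = total_pos / len(scored)            # baseline response rate

# Cumulative gain and lift at the top 30% of the population.
cutoff = int(0.3 * len(scored))
captured = sum(label for _, label in scored[:cutoff])
gain = captured / total_pos                    # fraction of positives captured
lift = (captured / cutoff) / base_rate         # response rate vs. baseline

print(gain, lift)
```

Repeating this at every cutoff from 0% to 100% traces out the full gain chart and lift curve.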
'''[[ROC Curve]]'''
* '''X-axis''': False Positive Rate (FPR)
* '''Y-axis''': True Positive Rate (TPR or Sensitivity)
* '''Application''': Used to evaluate the trade-off between true positive and false positive rates at various thresholds, crucial in medical testing to balance sensitivity and specificity.
'''[[Precision-Recall Curve]]'''
* '''X-axis''': Recall (True Positive Rate)
* '''Y-axis''': Precision (Positive Predictive Value)
* '''Application''': Focuses on the balance between recall and precision, especially useful in cases of class imbalance, like fraud detection or medical diagnosis, where positive class accuracy is vital.
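A minimal sketch of how ROC points arise from a threshold sweep, using toy scores and labels (hypothetical values, not from any real model). Each threshold yields one (FPR, TPR) point; plotting them gives the ROC curve:

```python
# Toy scores and labels (hypothetical). Sweep thresholds to trace ROC points.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]

P = sum(labels)            # actual positives
N = len(labels) - P        # actual negatives

def roc_point(threshold):
    """(FPR, TPR) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return fp / N, tp / P

# One point per distinct score, from the strictest threshold to the loosest.
points = [roc_point(t) for t in sorted(set(scores), reverse=True)]
print(points)
```

Lowering the threshold can only add predicted positives, so both FPR and TPR are non-decreasing along the sweep; the same loop with precision = TP / (TP + FP) on the y-axis sketches the precision-recall curve instead.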