Logistic Regression
Logistic Regression is a statistical and machine learning algorithm used for binary classification tasks, where the output variable is categorical and typically represents two classes (e.g., yes/no, spam/not spam, fraud/not fraud). Despite its name, Logistic Regression is a classification algorithm, not a regression algorithm, as it predicts probabilities of classes rather than continuous values.
How It Works
Logistic Regression models the probability of a binary outcome using the logistic function, also known as the sigmoid function. The sigmoid maps any real-valued input into the range (0, 1), so its output can be interpreted as the probability of belonging to a particular class. The model estimates the probability that an input belongs to the positive class (1) and assigns a label by applying a threshold, commonly 0.5.
The logistic function is represented by:
P(y=1 | X) = 1 / (1 + e^-(b0 + b1*X1 + b2*X2 + ... + bn*Xn))
where:
- P(y=1 | X) is the probability of the output being 1 given the input features.
- X1, X2, ..., Xn are the input features.
- b0 is the intercept, and b1, b2, ..., bn are the coefficients of the features.
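As a minimal sketch of this computation in Python (the feature values and coefficients below are made up purely for illustration):

    import math

    def sigmoid(z):
        # Logistic (sigmoid) function: maps any real number into (0, 1)
        return 1.0 / (1.0 + math.exp(-z))

    def predict_proba(features, coefficients, intercept):
        # Linear combination b0 + b1*X1 + ... + bn*Xn, then squash with the sigmoid
        z = intercept + sum(b * x for b, x in zip(coefficients, features))
        return sigmoid(z)

    # Illustrative values only: two features, made-up coefficients
    p = predict_proba(features=[2.0, -1.5], coefficients=[0.8, 0.3], intercept=-0.2)
    label = 1 if p >= 0.5 else 0   # apply the usual 0.5 threshold
    print(p, label)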
Types of Logistic Regression
- Binary Logistic Regression: Used for binary classification with two possible outcomes (e.g., yes/no).
- Multinomial Logistic Regression: Used when the outcome variable has more than two categories without any ordering (e.g., classifying types of animals).
- Ordinal Logistic Regression: Used when the outcome variable has ordered categories (e.g., ranking levels from low to high).
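The sketch below, assuming a recent version of scikit-learn and a synthetic three-class dataset, shows how the same estimator covers both the binary and multinomial cases; ordinal regression is not handled by scikit-learn directly (statsmodels' OrderedModel is one option):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic data for illustration: 3 classes -> multinomial setting
    X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                               n_classes=3, random_state=0)

    # scikit-learn's LogisticRegression handles both binary and multiclass targets;
    # with the default 'lbfgs' solver it fits a multinomial model for 3+ classes.
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict_proba(X[:3]))   # one probability per class for each sample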
Applications of Logistic Regression
Logistic Regression is widely used across industries due to its simplicity, interpretability, and effectiveness in binary classification tasks:
- Healthcare: Predicting disease outcomes, risk assessments, and patient survival chances.
- Finance: Credit scoring, fraud detection, and risk analysis.
- Marketing: Customer churn prediction, targeting potential buyers, and lead qualification.
- Social Sciences: Survey analysis, where responses fall into categories like agree/disagree or support/oppose.
Key Metrics for Evaluating Logistic Regression
To assess the performance of a Logistic Regression model, common metrics include:
- Accuracy: The proportion of correct predictions.
- Precision: The ratio of true positive predictions to all positive predictions.
- Recall: The ratio of true positive predictions to all actual positives.
- F1 Score: The harmonic mean of precision and recall, useful when dealing with imbalanced data.
- AUC-ROC Curve: Measures the model’s ability to distinguish between classes, where a higher Area Under the Curve (AUC) indicates better performance.
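A brief sketch of computing these metrics with scikit-learn, using a synthetic binary dataset purely for illustration:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score)
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)                # hard 0/1 labels
    y_prob = model.predict_proba(X_test)[:, 1]    # probabilities for the positive class

    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1 score :", f1_score(y_test, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_test, y_prob))  # uses probabilities, not labels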
Assumptions of Logistic Regression
Logistic Regression relies on several assumptions for accurate results:
1. Linearity of Independent Variables and Log-Odds: Assumes a linear relationship between the log-odds of the outcome and the independent variables.
2. Independence of Observations: Observations should be independent of each other to avoid biased results.
3. No Multicollinearity: Independent variables should not be highly correlated with each other, which can be checked using Variance Inflation Factor (VIF).
4. Sufficient Sample Size: Logistic Regression needs enough observations, particularly of the rarer outcome class and of each level of categorical predictors, to estimate coefficients reliably.
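As a rough sketch of the VIF check mentioned in point 3, assuming statsmodels and pandas are available and using a made-up feature matrix with one deliberately collinear column:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    # Made-up feature matrix; x3 is deliberately correlated with x1
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    df["x3"] = df["x1"] * 0.9 + rng.normal(scale=0.1, size=200)

    X = add_constant(df)   # VIF is usually computed with an intercept column included
    vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                    index=X.columns)
    print(vif)   # values well above ~5-10 suggest problematic multicollinearity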
Handling Limitations
Logistic Regression may not perform well when the relationship between the features and the log-odds of the outcome is highly non-linear. In such cases, feature transformations, polynomial features, or a more flexible model such as Decision Trees or Neural Networks can be considered.
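One common workaround is to expand the feature space before fitting; the sketch below, assuming scikit-learn and the synthetic make_moons toy dataset, adds polynomial features so that a linear decision boundary in the expanded space can capture a non-linear boundary in the original space:

    from sklearn.datasets import make_moons
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler

    # make_moons is a classic non-linearly separable toy dataset
    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

    # Degree-3 polynomial expansion, scaling, then a standard logistic model
    model = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(),
                          LogisticRegression(max_iter=1000))
    print(model.fit(X, y).score(X, y))   # training accuracy, for illustration only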