Leakage (Data Science)
'''Leakage''' in data science refers to a situation where information from outside the training dataset is inappropriately used to build or evaluate a model. This results in overoptimistic performance metrics during evaluation, as the model effectively "cheats" by having access to information it would not have in a real-world application. Leakage is a critical issue in machine learning workflows and can lead to misleading conclusions and poor model generalization.

==Types of Leakage==
Leakage can occur in various forms, typically classified as follows:
*'''Target Leakage:'''
**Occurs when information that would not normally be available at prediction time is included in the training dataset.
**Example: Including a feature in a fraud detection model that directly indicates whether a transaction was flagged as fraudulent (e.g., "is_fraud").
*'''Train-Test Leakage:'''
**Happens when information from the test set "leaks" into the training data, leading to overfitted models that perform unrealistically well on evaluation metrics.
**Example: Normalizing or scaling the entire dataset (train and test combined) before splitting.
*'''Feature Leakage:'''
**Occurs when a feature provides indirect or unintended access to the target variable, often due to improper preprocessing or feature selection.
**Example: Including a feature like "total_sales_after_return" in a model predicting whether a customer will return a product.

==Common Causes of Leakage==
*Improper data preprocessing (e.g., applying transformations to the entire dataset before splitting it into training and test sets).
*Including features that are highly correlated with the target variable but unavailable at prediction time.
*Sharing data between train and test sets during feature engineering or cross-validation.
*Using future information in time series data (e.g., incorporating future sales data to predict current sales).

==How to Detect Leakage==
Detecting leakage requires careful analysis of the data and modeling workflow. Some tips include:
*'''Analyze Features:''' Examine each feature and determine whether it contains information that would not be available at prediction time in the real world.
*'''Inspect Data Pipelines:''' Ensure that preprocessing steps like scaling, encoding, or imputation are fitted only on the training data and never on the test data.
*'''Cross-Validation Analysis:''' Look for unusually high cross-validation scores compared to performance on truly unseen data, which may indicate leakage; the sketch below demonstrates this check.
==How to Prevent Leakage==
Preventing leakage requires careful handling of data and features throughout the modeling process:
*'''Separate Train and Test Sets Early:''' Perform the train-test split before any preprocessing or feature engineering so that no information from the test set leaks into the training process.
*'''Feature Analysis:''' Remove or modify features that are not available at prediction time or could indirectly reveal the target variable.
*'''Time-Based Splits:''' For time series data, ensure that the test set contains only future data points relative to the training set (a short sketch of this follows the code example below).
*'''Pipeline Management:''' Use tools like scikit-learn's `Pipeline` to automate preprocessing and ensure that transformations are fitted on the training data only and then applied to the test data.

==Examples of Leakage==
*'''Healthcare:'''
**Including a feature such as "treatment started" when predicting whether a patient will develop a condition; the feature indirectly reveals the target variable.
*'''Finance:'''
**Using a feature like "payment overdue flag" to predict whether a customer will default on a loan.
*'''E-commerce:'''
**Using "return status" in a model predicting whether a customer will return an item.

==Consequences of Leakage==
*Overfitted models with artificially inflated performance metrics.
*Poor generalization to new or unseen data.
*Misleading business insights, leading to incorrect decisions.
*Increased risk of deploying unreliable models in production.

==Python Code Example==
Below is an example illustrating how leakage can arise during preprocessing and how to prevent it:
<syntaxhighlight lang="python">
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import pandas as pd

# Simulated dataset
data = {'feature_1': [1, 2, 3, 4, 5],
        'feature_2': [5, 4, 3, 2, 1],
        'target': [0, 1, 0, 1, 0]}
df = pd.DataFrame(data)

# Split into train and test sets before any preprocessing;
# stratify=y keeps both classes present in the tiny training set
X = df[['feature_1', 'feature_2']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

# Improper scaling (causes leakage): the scaler is fitted on the
# entire dataset, so test-set statistics shape the training data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # leaks test-set information!

# Proper scaling: the pipeline fits the scaler on training data only
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Train the pipeline without touching the test data
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Model Accuracy: {score:.2f}")
</syntaxhighlight>
==See Also==
*[[Data Preprocessing]]
*[[Cross-Validation]]
*[[Overfitting]]
*[[Bias and Variance]]
*[[Feature Engineering]]

[[분류:Data Science]]