Data Leakage: Meaning, Applications & Example
Data leakage happens when training data contains information about the target that wouldn't be available at prediction time.
What is Data Leakage?
Data leakage occurs when information from outside the training dataset is unintentionally used to create a model. The model then appears to perform unrealistically well during training and validation but fails in real-world applications, because it has “seen” data that it would not have access to when making actual predictions.
Causes of Data Leakage
- Incorporating Future Data: Training on information that only exists after the moment of prediction (e.g., post-purchase data), so it would not be available at prediction time.
- Feature Contamination: Including variables that are derived from or directly influenced by the target variable (in the extreme case, using the target variable itself as a feature).
- Improper Data Splitting: Splitting the data into training and test sets incorrectly, so that the sets overlap or test data is unintentionally used during model training.
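The improper-splitting cause can be illustrated with a minimal sketch. It assumes a single numeric feature and uses only the Python standard library. The function name `split_then_scale` and the sample values are hypothetical, chosen for illustration. The point it shows is the order of operations: split first, then compute any preprocessing statistics from the training portion alone.

```python
import random
import statistics

def split_then_scale(values, test_fraction=0.25, seed=0):
    """Split first, then derive scaling statistics from the training part only.

    Hypothetical helper for illustration; not taken from any library.
    """
    rng = random.Random(seed)
    indices = list(range(len(values)))
    rng.shuffle(indices)

    n_test = int(len(values) * test_fraction)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]

    train = [values[i] for i in train_idx]
    test = [values[i] for i in test_idx]

    # Mean and standard deviation come from the training rows only.
    # Computing them on the full dataset would leak test information.
    mean = statistics.mean(train)
    std = statistics.pstdev(train) or 1.0  # guard against zero variance

    def scale(xs):
        return [(x - mean) / std for x in xs]

    return scale(train), scale(test), set(train_idx), set(test_idx)

# Illustrative data: twenty made-up feature values.
values = [float(v) for v in range(1, 21)]
train_scaled, test_scaled, train_idx, test_idx = split_then_scale(values)

# The two index sets must not overlap.
assert train_idx.isdisjoint(test_idx)
```

The test values are scaled with the training statistics, which mirrors what happens at prediction time, when only parameters learned from training data exist.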
Applications of Data Leakage Prevention
- Model Validation: Keeping training and testing datasets strictly separate, so that the model is evaluated only on data it hasn’t seen.
- Fraud Detection: Checking that features do not include information from future transactions or from the outcome of the transaction being scored.
- Predictive Maintenance: Using only data recorded before the prediction point when predicting equipment failure, so that future maintenance activities do not leak into the features.
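For time-ordered data such as transactions or sensor readings, one common way to respect these constraints is to split by time instead of at random. The sketch below assumes each record carries a `date` field. The function `time_based_split`, the field names, and the readings are all hypothetical and made up for illustration.

```python
from datetime import date

def time_based_split(records, cutoff):
    """Return (train, test); train holds only records dated before the cutoff.

    Hypothetical helper for illustration.
    """
    ordered = sorted(records, key=lambda r: r["date"])
    train = [r for r in ordered if r["date"] < cutoff]
    test = [r for r in ordered if r["date"] >= cutoff]
    return train, test

# Made-up sensor readings, deliberately out of order.
readings = [
    {"date": date(2023, 3, 1), "vibration": 0.9, "failed": 1},
    {"date": date(2023, 1, 5), "vibration": 0.2, "failed": 0},
    {"date": date(2023, 2, 10), "vibration": 0.4, "failed": 0},
    {"date": date(2023, 4, 15), "vibration": 0.3, "failed": 0},
]

train, test = time_based_split(readings, cutoff=date(2023, 3, 1))

# Every training record precedes every test record.
assert max(r["date"] for r in train) < min(r["date"] for r in test)
```

A random shuffle of the same records could place the April reading in training and the January reading in testing, which would let the model learn from the future.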
Example of Data Leakage
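In predicting loan defaults, suppose a feature like “loan repayment status” is used in training even though it is only recorded after the loan has been issued. The feature effectively encodes the outcome, so the model learns to predict default from future data. It scores unrealistically well in evaluation but performs poorly in the real world, where that feature does not exist at the time the lending decision is made.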
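One way to guard against this kind of leak is to record when each column becomes available and keep only the columns known at decision time. The sketch below is a minimal illustration under that assumption. The column names, the stage labels, the `FEATURE_AVAILABLE_AT` mapping, and the `usable_features` helper are all hypothetical.

```python
# Hypothetical mapping: the stage at which each column becomes known.
FEATURE_AVAILABLE_AT = {
    "income": "application",
    "credit_score": "application",
    "loan_amount": "application",
    "repayment_status": "after_disbursement",  # recorded after the loan is issued
}

def usable_features(row, availability=FEATURE_AVAILABLE_AT,
                    allowed_stage="application"):
    """Keep only columns known at the allowed stage.

    Columns missing from the mapping (including the target) are dropped,
    so an unreviewed column cannot slip into the feature set.
    """
    return {
        name: value
        for name, value in row.items()
        if availability.get(name) == allowed_stage
    }

# Made-up loan record; "defaulted" is the target, not a feature.
loan = {
    "income": 52000,
    "credit_score": 680,
    "loan_amount": 15000,
    "repayment_status": "late",
    "defaulted": 1,
}

features = usable_features(loan)
assert "repayment_status" not in features
```

Dropping unlisted columns by default is a design choice in this sketch: it forces each new column to be reviewed before it can be used as a feature.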