Data leakage is the pattern by which information that should not have been available at decision time nevertheless enters a process during calibration, training, or evaluation — making it look more skilful than it is. The leak may come from the target, the future, the test set, or the evaluator. The result is inflated performance now, disappointment later, and a misallocation of trust in between.
Imagine practicing for a quiz, but the answers were secretly written on the back of your practice sheet. You get every practice question right and feel like a genius. Then the real quiz comes with no answers on the back, and you do badly. Data leakage is when the answers sneak in during practice so you look way better than you really are.
Peeking At The Answers
Data leakage is when information that shouldn't be available when a program makes a prediction sneaks in while it's being trained or tested — making it look more accurate than it really is. The leak can take several forms: the answer being hidden inside the inputs, information from the future leaking back into the past, the test questions leaking into the study material, or the grader getting mixed up with the thing being graded. The result is great-looking scores now and a nasty surprise later, when the program faces real cases it has never truly seen. In between, people trust it more than they should. The fix is to keep a strict wall between what the program is allowed to know and the answers it's supposed to figure out.
The Broken Firewall
Data leakage is the structural pattern by which information that should not have been available at the moment a process makes a prediction nevertheless enters during its training, calibration, or evaluation — making it appear more skilful and trustworthy than it actually is. The leak can come from the target itself (the answer is encoded in the inputs), from the future relative to the decision time (information that won't exist yet leaks back into training), from the test set into the training set (the evaluation is no longer naive), or from the evaluator into the evaluated (the auditor's position is contaminated). The output is inflated performance now, disappointment later, and misallocated trust in between. The core idea is a temporal or informational firewall — between the inputs the process is entitled to and the targets it must forecast — that was supposed to be in place and was not. Only when the process finally meets genuinely held-out, genuinely future, genuinely naive situations does its actual quality collapse toward its true skill, and the gap between reported and actual quality is the cost of the leak.
Data leakage is the structural pattern by which information that should not have been available at the moment a process makes a prediction, decision, or estimate nevertheless enters the process during its calibration, training, or evaluation — making it appear more accurate, skilful, or trustworthy than it actually is when faced with the situations it is supposed to handle. The leak can come from the target itself (the answer is encoded in the inputs), from the future relative to the decision time (information that will not exist when the decision is made leaks back into training), from the test set into the training set (the evaluation is no longer naive), or from the evaluator into the evaluated (the auditor's position is contaminated). The output is inflated performance now, disappointment later, and a misallocation of trust in the interval between. Five commitments define it: a process meant to produce predictions, decisions, or estimates from the information available at a specified moment (decision time, forecast time, audit time); a temporal or informational firewall separating the inputs the process is entitled to from the targets it must forecast; the firewall being crossed — accidentally, structurally, or adversarially — so target, future, or test-set information enters; the process's self-reported quality (training accuracy, in-sample fit, audit pass rate) being inflated by the leak; and, on real deployment against genuinely held-out, future, or naive situations, actual quality collapsing toward true skill, the gap between reported and actual quality being the cost of the leak. At root it is a firewall that was supposed to be in place and was not, plus an inflated self-report that conceals the breach until deployment makes it visible.
It forces three questions: what information was supposed to be unavailable, through what channel might it have entered (direct, upstream, temporal, social), and what would performance be without it. It names the asymmetry: the clean counterfactual cannot be recovered once seen.
It compresses target leakage, look-ahead bias, item exposure, and audit contamination into one frame — a firewall that should have held and did not — with one fix family: enforce the firewall by construction, not by auditing afterward.
It teaches that a performance number is only as trustworthy as the firewall behind it, so any reported skill should be audited for what the process could have illicitly seen, and a genuinely naive evaluation should be engineered up front.
A fraud classifier uses a "claim was investigated" feature, but investigation only happens after fraud is suspected, so the feature encodes the answer. Cross-validated accuracy looks high; at genuine forecast time the feature does not yet exist and skill collapses. The fix is a strictly time-based split that admits each feature only with its real-world availability.
Data Leakage is not Escape and Leakage because data leakage names forbidden information entering a naive process, whereas escape names a constrained quantity exiting; the shared word hides opposite directions and inverts the remedy.
Data Leakage is not Data Integrity because integrity concerns whether data is correct, whereas leakage concerns whether correct-but-forbidden information crossed a boundary — the leaked feature is typically perfectly accurate.
Data Leakage is not the Future Wheel because the future wheel maps a decision's downstream effects forward, whereas leakage is the inverse error — future information leaking backward into a process meant to be naive of it.