There’s a particular kind of optimism that only a backtest can produce. You have a hypothesis, you code it up, you run it against five years of data, and the equity curve goes up and to the right. The Sharpe ratio looks strong. The drawdowns are manageable. Everything suggests you’ve found something real.
And then you trade it live, and it doesn’t work.
This experience is so common in quantitative finance that it’s almost a rite of passage. But while it’s tempting to chalk it up to “markets changed” or “bad luck,” the cause is almost always more mundane. The backtest itself was wrong – not in its arithmetic, but in its assumptions.
The sections below cover the specific mistakes that cause this, why they’re easy to make, and how to guard against them. None of them are new, but all of them still catch people out regularly.
1. Look-Ahead Bias
Look-ahead bias occurs when your backtest uses information that would not have been available at the point the trading decision was made. It’s the most fundamental backtesting error, and also one of the easiest to introduce accidentally.
The textbook example is using restated earnings data. Earnings figures are frequently restated – a company reports Q2 earnings in July, then quietly revises them in September. If your backtest uses the revised figure when evaluating a trade signal generated in August, it’s incorporating information from the future. The backtest looks better than reality ever could.
But the subtler forms are more dangerous. Point-in-time data issues are everywhere in finance. Index membership changes retrospectively – if you’re testing a strategy on “the S&P 500,” are you using today’s constituents or the constituents as they existed on each historical date? Fundamental data vendors often backfill corrections into their time series without flagging them. Even something as simple as using a daily closing price to trigger a trade that in practice would need to be executed the following morning introduces a small but systematic bias.
The fix: Be rigorous about what your strategy “knows” at each point in time. Use point-in-time databases where possible. When in doubt, lag your signals by an extra period – if the strategy still works with a one-day delay on every input, it’s more likely to be real.
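As a minimal sketch of that lag test, assume per-period asset returns and a signal series aligned to the same dates, where each signal is computed from data up to the end of its own period. The function names and toy numbers here are purely illustrative.

```python
def strategy_returns(signals, asset_returns, lag=1):
    """Per-period strategy returns when a signal is acted on `lag` periods later.

    signals[t]       : position (+1, 0, -1) decided with data up to the end of period t
    asset_returns[t] : return of the asset over period t
    """
    if lag < 1:
        # lag=0 would earn period t's return using a signal that already saw period t
        raise ValueError("lag must be >= 1 to avoid look-ahead")
    if len(signals) != len(asset_returns):
        raise ValueError("signals and asset_returns must be the same length")
    # The position held during period t comes from the signal at t - lag.
    return [signals[t - lag] * asset_returns[t] for t in range(lag, len(asset_returns))]


def total_return(returns):
    """Compound a list of simple returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


# Toy data, invented for illustration only.
asset_returns = [0.01, -0.02, 0.015, 0.03, -0.01, 0.02]
signals = [1, -1, 1, 1, -1, 1]

baseline = strategy_returns(signals, asset_returns, lag=1)
delayed = strategy_returns(signals, asset_returns, lag=2)
print("lag 1:", total_return(baseline))
print("lag 2:", total_return(delayed))
```

On real data, the comparison of interest is whether the headline metric survives the move from lag=1 to lag=2. A strategy whose edge vanishes with one extra period of delay is probably leaning on information it would not have had in time.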
2. Survivorship Bias
Survivorship bias means testing your strategy only on assets that exist today, ignoring those that were delisted, went bankrupt, or were acquired during your test period.
The effect is pernicious and almost always flattering. Stocks that survived a ten-year window are, by definition, the ones that didn’t go to zero. If your strategy occasionally holds positions in companies that would have subsequently been delisted, your backtest won’t capture those losses – because those companies aren’t in your dataset.
This matters more than people tend to assume. Research by Elton, Gruber, and Blake found that survivorship bias in mutual fund databases overstated average returns by roughly 0.9% per year. In individual equities, the effect can be larger. During any given decade, a meaningful percentage of listed companies disappear from exchanges entirely.
The fix: Use survivorship-bias-free datasets. These are available from most institutional data vendors, though they cost more. If you’re working with free or retail-grade data, at minimum acknowledge the limitation and be sceptical of strategies that hold small-cap or distressed names – that’s where survivorship bias hits hardest.
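What a survivorship-bias-free universe looks like in practice can be sketched as a membership table with entry and exit dates, queried as of each historical date. The tickers, dates, and the `universe_as_of` helper below are hypothetical, and a real vendor dataset will have its own schema.

```python
from datetime import date

# Hypothetical membership records: (ticker, date added, date removed or None).
MEMBERSHIP = [
    ("AAA", date(2010, 1, 4), None),
    ("BBB", date(2010, 1, 4), date(2015, 6, 30)),  # delisted mid-sample
    ("CCC", date(2016, 3, 1), None),               # listed after the sample starts
]


def universe_as_of(membership, as_of):
    """Tickers that were tradable on `as_of`, including names that later disappeared."""
    return sorted(
        ticker
        for ticker, added, removed in membership
        if added <= as_of and (removed is None or as_of < removed)
    )


print(universe_as_of(MEMBERSHIP, date(2012, 1, 3)))
print(universe_as_of(MEMBERSHIP, date(2020, 1, 2)))
```

The point of the design is that the 2012 query returns the name that was later delisted and excludes the one that had not yet listed. A dataset built from today’s constituents would get both wrong.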
3. Overfitting
Overfitting is the most written-about backtesting problem, and still the most common. It occurs when your model captures noise in the historical data rather than genuine, repeatable patterns.
The mechanism is straightforward. Financial data is noisy. Any sufficiently flexible model (one with enough parameters) can find patterns in historical noise that produce impressive-looking backtests. The problem is that noise, by definition, doesn’t repeat. A model tuned to the specific sequence of random fluctuations in your test period will fail when confronted with a different sequence.
The classic warning sign is a strategy with many tunable parameters that all need to be set to specific values for it to work. If changing your moving average window from 14 days to 16 days causes your strategy’s performance to collapse, you haven’t found a robust signal, you’ve found a coincidence.
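One rough way to put a number on that fragility is to compare the performance metric at the chosen parameter with the worst result among its neighbours. The sketch below assumes you already have some way to score a parameter value (a Sharpe ratio, say). The `neighbourhood_sensitivity` helper and both sets of scores are invented for illustration.

```python
def neighbourhood_sensitivity(evaluate, chosen, neighbours):
    """Compare performance at the chosen parameter with nearby values.

    evaluate   : callable mapping a parameter value to a performance metric
    chosen     : the parameter value the strategy was tuned to
    neighbours : nearby parameter values to test
    Returns (score at chosen, worst neighbouring score, worst / chosen).
    """
    base = evaluate(chosen)
    if base <= 0:
        raise ValueError("the chosen parameter should at least have a positive score")
    worst = min(evaluate(p) for p in neighbours)
    return base, worst, worst / base


# Hypothetical Sharpe ratios by moving-average window.
fragile = {12: 0.1, 13: 0.2, 14: 1.8, 15: 0.3, 16: -0.2}
robust = {12: 0.9, 13: 1.0, 14: 1.1, 15: 1.0, 16: 0.95}

for name, scores in (("fragile", fragile), ("robust", robust)):
    base, worst, ratio = neighbourhood_sensitivity(
        lambda p: scores[p], chosen=14, neighbours=[12, 13, 15, 16]
    )
    print(name, base, worst, round(ratio, 2))
```

A ratio close to 1 means performance degrades gracefully around the chosen value. A ratio near zero or below it is the cliff edge described above.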
A useful heuristic is to count the degrees of freedom in your strategy (number of tunable parameters, filter conditions, entry/exit rules) relative to the number of independent observations in your dataset. If the ratio feels high, it probably is.
The fix: Keep models simple. Prefer strategies with fewer parameters, and test robustness by varying parameters and checking whether performance degrades gracefully or falls off a cliff. Use walk-forward optimisation rather than optimising across the entire dataset. And be honest about how many variations you tested before arriving at the one that “worked” – which leads to the next point.
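For readers who have not used walk-forward optimisation, the core of it is just how the data is split: fit on one window, evaluate on the window immediately after it, then roll forward. The following is a minimal sketch of the rolling-window variant, with window sizes chosen arbitrarily. An expanding-window variant is equally common.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train_indices, test_indices) for rolling walk-forward windows.

    Each test window starts right after its training window. Windows advance
    by `test_size`, so test windows never overlap one another.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2))
for train, test in splits:
    print("train", list(train), "test", list(test))
```

Parameters are chosen using only the training indices of each split, and the reported performance is stitched together from the test windows alone.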
4. Multiple Testing and Selection Bias
This is overfitting’s quieter cousin. You test fifty strategy variations. Three of them produce good backtests. You pick the best one and discard the other forty-nine. You then evaluate the winning strategy as if it were the only one you’d ever tested.
The statistics here are unforgiving. If you run fifty independent tests at a 5% significance level on strategies with no real edge, you’d expect roughly 2.5 false positives by chance alone. The more strategies you test, the more likely you are to find one that looks good purely by accident. This is sometimes called “data snooping” or the “multiple comparisons problem,” and it’s pervasive in quantitative research.
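The arithmetic is short enough to check directly. This sketch assumes the tests are independent and that none of the strategies has any genuine edge, so every “significant” result is a false positive.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests that pass by chance when no strategy has real edge."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Probability that at least one of n independent null tests passes by chance."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests, alpha = 50, 0.05
print(expected_false_positives(n_tests, alpha))
print(round(prob_at_least_one_false_positive(n_tests, alpha), 3))
```

With fifty tests at the 5% level, the chance of at least one spurious “winner” comes out at roughly 92%. Finding something that looks good is close to guaranteed.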
Harvey, Liu, and Zhu addressed this directly in their 2016 paper “…and the Cross-Section of Expected Returns,” arguing that the standard t-statistic threshold of 2.0 is far too low for evaluating trading strategies given the sheer volume of strategies that have been tested across the industry. They suggested a threshold closer to 3.0.
The fix: Track how many strategy variations you test, not just the one you selected. Apply corrections for multiple testing – the Bonferroni correction is the simplest, though conservative. Better yet, reserve a portion of your data that you genuinely never look at until final validation. If you’ve already peeked at the full dataset during development, that out-of-sample test is no longer truly out-of-sample.
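A minimal sketch of the Bonferroni correction, assuming you have a p-value for every variation you tested, including the ones you discarded. The p-values below are made up.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which tests remain significant after a Bonferroni correction.

    The per-test threshold is alpha divided by the total number of tests,
    so `p_values` must include every variation tried, not just the winners.
    """
    if not p_values:
        raise ValueError("p_values must not be empty")
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical: 50 variations tested, three of which looked good in isolation.
p_values = [0.004, 0.02, 0.03] + [0.5] * 47
survivors = bonferroni_significant(p_values, alpha=0.05)
print(sum(survivors), "of", len(p_values), "survive the correction")
```

Here the per-test threshold drops to 0.001, so even the variation with p = 0.004 no longer counts as significant. That is the conservatism mentioned above, and it is the price of having tested fifty things.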
5. Unrealistic Transaction Cost Assumptions
A strategy that trades frequently and captures small returns per trade is exceptionally sensitive to transaction cost modelling. Get the costs wrong by even a small amount and the gap between your backtest and live performance can be dramatic.
The costs that people underestimate most consistently:
- Spread costs. Using mid-price fills in your backtest when in reality you’ll cross the spread. For liquid large-caps this might be a few basis points. For small-caps, emerging markets, or less liquid instruments, it can be substantially more.
- Market impact. Your order itself moves the price. This is negligible for small positions in liquid names, but scales non-linearly with order size. A strategy that backtests well on a £1m portfolio may be completely uneconomic at £50m because the trades required would move the market against you.
- Slippage. The difference between your intended execution price and the price you actually get. In fast-moving markets or during periods of low liquidity, slippage can be significant.
- Borrowing costs for shorts. If your strategy involves short positions, the cost of borrowing shares varies by name and over time. “Hard to borrow” names can cost 10%+ annualised, which can eliminate the return from a short position entirely.
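To make the non-linearity of market impact concrete, here is a sketch using the square-root rule of thumb, in which impact grows with the square root of the order’s share of daily volume. This is a simplified stand-in and not a calibrated model. The coefficient, volatility, average daily volume, and order sizes are all assumptions chosen for illustration.

```python
import math


def sqrt_impact_bps(order_value, adv_value, daily_vol, coefficient=1.0):
    """Estimated price impact in basis points under a square-root rule of thumb.

    impact = coefficient * daily_vol * sqrt(order_value / adv_value)
    All inputs are assumptions; `coefficient` would need calibrating to real fills.
    """
    if order_value < 0 or adv_value <= 0:
        raise ValueError("order_value must be >= 0 and adv_value must be > 0")
    participation = order_value / adv_value
    return coefficient * daily_vol * math.sqrt(participation) * 1e4


adv = 20_000_000     # hypothetical average daily traded value of the stock
vol = 0.02           # hypothetical daily volatility (2%)
small = sqrt_impact_bps(100_000, adv, vol)     # a trade sized for a small book
large = sqrt_impact_bps(5_000_000, adv, vol)   # the same trade at 50x the size

print(round(small, 1), "bps vs", round(large, 1), "bps")
# Cost in currency terms is impact times order size, so it grows faster still.
print(round((large * 5_000_000) / (small * 100_000), 1), "x the cost")
```

Under these assumptions a 50x larger order pays roughly 7x the impact in basis points, and therefore several hundred times the cost in currency terms.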
The fix: Model transaction costs conservatively. Use actual bid-ask spread data if available. Apply market impact models (Almgren-Chriss is a reasonable starting point) rather than assuming zero impact. Then run your backtest with costs at 1.5x to 2x your best estimate. If the strategy survives that, it has a buffer against reality being worse than your model assumes.
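The stress test itself is a one-line calculation. The sketch below ignores compounding and treats every trade as identical. The gross return, cost per trade, and trade count are hypothetical numbers picked to show how thin the margin can be.

```python
def net_annual_return(gross_per_trade, cost_per_trade, trades_per_year, cost_multiplier=1.0):
    """Approximate annual return after costs, ignoring compounding.

    cost_multiplier scales the best-estimate cost to stress the assumption.
    """
    return trades_per_year * (gross_per_trade - cost_multiplier * cost_per_trade)


gross = 0.0008   # hypothetical: 8 bps gross return per trade
cost = 0.0005    # hypothetical: 5 bps best-estimate all-in cost per trade
trades = 500     # hypothetical number of trades per year

for multiplier in (1.0, 1.5, 2.0):
    net = net_annual_return(gross, cost, trades, multiplier)
    print(f"costs at {multiplier}x: {net:+.1%}")
```

With these inputs the strategy earns about 15% a year at the best-estimate cost, about 2.5% at 1.5x, and loses about 10% at 2x. A strategy like this has no buffer.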
6. Ignoring Regime Changes
Financial markets are non-stationary. The statistical relationships that held during one period may not hold during another. A strategy calibrated on a decade of low interest rates and suppressed volatility may behave entirely differently in a rising rate environment.
This isn’t a modelling error in the traditional sense – it’s a deeper problem about the limits of historical inference. But it manifests in backtests as a strategy that performs brilliantly during the test window and then deteriorates when deployed.
The most common version: a strategy that implicitly depends on a specific market regime without the developer realising it. A mean-reversion strategy that works in range-bound markets but fails during trending periods. A momentum strategy that thrives in low-correlation environments but breaks down when correlations spike during a crisis.
The fix: Test your strategy across multiple market regimes, not just the period where it works best. Deliberately include stress periods (2008, 2020, 2022) in your test window and examine performance during those sub-periods separately. If the strategy only works during calm markets, that’s important information – not a flaw to be hidden by extending the backtest window until the average looks acceptable.
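Breaking results out by sub-period takes very little code. This sketch assumes a list of dated strategy returns and uses calendar-year boundaries for the stress periods, which is a simplification. The returns are invented, and the `performance_by_period` helper is hypothetical.

```python
from datetime import date


def compound(returns):
    """Compound a list of simple returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def performance_by_period(dated_returns, periods):
    """Summarise strategy returns separately for each named (start, end) period."""
    report = {}
    for name, (start, end) in periods.items():
        window = [r for d, r in dated_returns if start <= d <= end]
        report[name] = {
            "observations": len(window),
            "total_return": compound(window),
            "worst_period": min(window) if window else None,
        }
    return report


# Calendar-year windows are an assumption; pick boundaries that suit your market.
STRESS_PERIODS = {
    "2008": (date(2008, 1, 1), date(2008, 12, 31)),
    "2020": (date(2020, 1, 1), date(2020, 12, 31)),
    "2022": (date(2022, 1, 1), date(2022, 12, 31)),
}

# Invented monthly strategy returns.
dated_returns = [
    (date(2007, 6, 29), 0.02),
    (date(2008, 9, 30), -0.08),
    (date(2008, 10, 31), -0.12),
    (date(2019, 3, 29), 0.03),
    (date(2020, 3, 31), -0.10),
    (date(2020, 4, 30), 0.04),
    (date(2022, 6, 30), -0.05),
]

for name, stats in performance_by_period(dated_returns, STRESS_PERIODS).items():
    print(name, stats)
```

Reporting these rows alongside the full-sample figure makes it much harder for a long calm stretch to hide a strategy that only works in one regime.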
The Common Thread
All six mistakes share a root cause: they make the backtest more optimistic than reality. None of them make a strategy look worse than it is. This asymmetry is what makes backtesting dangerous – the errors are systematically biased in the direction of telling you what you want to hear.
The best defence is a combination of humility and process. Assume your backtest is overstating performance until proven otherwise. Build in conservative assumptions at every stage. And treat a good-looking backtest not as evidence that a strategy works, but as permission to investigate further.

