Why Backtests Lie: The Lookahead Bias Trap
Most quant blog backtests look amazing on paper and fail in production. The reason is almost always lookahead bias. Here's exactly what it is and how to prevent it.
You've seen the chart: a clean upward equity curve that triples in five years, drawdowns under 8%, Sharpe over 2. Then someone runs the same strategy live and it goes flat in three weeks. The usual culprit isn't bad luck. It's a class of bugs called lookahead bias — using information at time T that wouldn't actually have been available until time T+k.
The four most common variants
1. Restated fundamentals
You pull historical revenue from Yahoo Finance and run a quality screen on companies with rising revenue. The problem: Yahoo serves the current 10-K filings, including subsequent restatements. A company that restated 2018 revenue upward in 2021 is showing you 2021 numbers as if they were known in 2018. Your screen is buying companies that will be restated upward — which is not information your historical self could have used.
Fix: use point-in-time fundamentals. We hit SEC EDGAR XBRL by filing_date, never period_end, so the backtest only sees what was actually published on or before the decision date.
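A minimal sketch of the point-in-time rule, assuming filings have already been loaded into memory. The Filing record, the revenue_as_of helper, and the sample numbers are hypothetical illustrations, not the EDGAR client itself; only the filing_date versus period_end distinction comes from the text above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Filing:
    ticker: str
    period_end: date   # fiscal period the number describes
    filing_date: date  # date the number was actually published
    revenue: float


def revenue_as_of(filings, ticker, period_end, decision_date) -> Optional[float]:
    """Latest revenue for (ticker, period_end) published on or before decision_date."""
    visible = [
        f for f in filings
        if f.ticker == ticker
        and f.period_end == period_end
        and f.filing_date <= decision_date  # filter on publication, not on period
    ]
    if not visible:
        return None
    return max(visible, key=lambda f: f.filing_date).revenue


# Made-up example: an original filing and a later upward restatement.
filings = [
    Filing("XYZ", date(2018, 12, 31), date(2019, 2, 15), 100.0),
    Filing("XYZ", date(2018, 12, 31), date(2021, 3, 1), 120.0),
]
```

With this data, a decision dated mid-2019 sees 100.0, and the restated 120.0 only becomes visible to decisions dated on or after 2021-03-01.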
2. Future news in sentiment
You scrape Reddit for AAPL mentions and build a sentiment score. Then you backtest using that score. But your scraper is reading posts as of today, including the comment chain where someone retroactively says "told you so" about an earnings beat that happened two days later. Your historical sentiment score is leaking future information.
Fix: hard-cap every post at created_utc ≤ decision date. Drop the post entirely if its body references events after the decision date. Yes, this requires a separate scan, and yes, it's expensive, but the alternative is a backtest that lies.
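The timestamp cap can be sketched as follows, assuming each post is a dict whose created_utc field holds epoch seconds in UTC. The visible_posts helper and the sample posts are hypothetical, and this covers only the hard cap, not the separate scan of post bodies.

```python
from datetime import datetime, timezone


def visible_posts(posts, decision_time):
    """Keep only posts created at or before decision_time (timezone-aware)."""
    if decision_time.tzinfo is None:
        # A naive datetime would be interpreted in local time by .timestamp().
        raise ValueError("decision_time must be timezone-aware")
    cutoff = decision_time.timestamp()
    return [p for p in posts if p["created_utc"] <= cutoff]


def _epoch(*args):
    return datetime(*args, tzinfo=timezone.utc).timestamp()


# Made-up posts: one before the decision, one after.
posts = [
    {"id": "a", "created_utc": _epoch(2024, 3, 15, 15, 0), "body": "calls looking good"},
    {"id": "b", "created_utc": _epoch(2024, 3, 15, 17, 0), "body": "told you so"},
]
decision = datetime(2024, 3, 15, 16, 0, tzinfo=timezone.utc)
kept_ids = [p["id"] for p in visible_posts(posts, decision)]
```

Rejecting naive datetimes is a deliberate choice here: a silent local-time conversion would shift the cutoff by several hours and reintroduce the leak.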
3. Survivor bias in the universe
Your backtest universe is "S&P 500 today". You backtest five years. You forget that the S&P 500 today excludes the companies that got kicked out of the index in 2022 (often because they underperformed). Your universe is pre-screened for survivors. The backtest looks great because you never bought any of the failures.
Fix: use the index composition as of each decision date, not as of today.
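One way to sketch this, assuming you hold a membership history with add and remove dates per ticker. The MEMBERSHIP table, its tickers and dates, and the universe_as_of helper are invented for illustration; a real history would come from your index data provider.

```python
from datetime import date

# (ticker, added, removed); removed is None while the ticker is still a member.
# A ticker counts as a member from `added` up to, but not including, `removed`.
MEMBERSHIP = [
    ("AAA", date(2015, 1, 2), None),
    ("BBB", date(2016, 6, 1), date(2022, 9, 19)),  # dropped from the index in 2022
    ("CCC", date(2022, 9, 19), None),              # its replacement
]


def universe_as_of(membership, decision_date):
    """Set of tickers that were index members on decision_date."""
    return {
        ticker
        for ticker, added, removed in membership
        if added <= decision_date and (removed is None or decision_date < removed)
    }
```

A rebalance dated 2020-01-01 then trades AAA and BBB, including the later dropout, while a rebalance dated 2023-01-01 trades AAA and CCC.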
4. Asof-anchored joins
You join a sentiment score table to a price table on ticker = ticker AND date = date. Both tables have a row for AAPL on 2024-03-15. Sentiment table: scraped at 11:30am EST. Price table: closing price at 16:00 EST. Your strategy "decides at the close based on 11:30am sentiment" — sounds fine. But did the sentiment scrape window include posts created between 11:30am and 16:00? If yes, you have a 4.5-hour lookahead leak.
Fix: stamp every datapoint with both observed_at (when it became knowable) and refers_to (when it describes). Decisions filter on observed_at ≤ T only.
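The two-timestamp idea can be sketched like this. The Datapoint record, the latest_visible helper, and the sentiment values are hypothetical; only the observed_at and refers_to field names come from the text. Datetimes are naive and assumed to share one timezone.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Datapoint:
    ticker: str
    value: float
    observed_at: datetime  # when the value became knowable
    refers_to: datetime    # the moment or period the value describes


def latest_visible(points, ticker, decision_time) -> Optional[Datapoint]:
    """Most recently observed datapoint for ticker with observed_at <= decision_time."""
    visible = [
        p for p in points
        if p.ticker == ticker and p.observed_at <= decision_time
    ]
    return max(visible, key=lambda p: p.observed_at, default=None)


points = [
    Datapoint("AAPL", 0.20, datetime(2024, 3, 15, 11, 30), datetime(2024, 3, 15, 11, 30)),
    # Describes the same trading day, but only became knowable after the close.
    Datapoint("AAPL", 0.65, datetime(2024, 3, 15, 17, 0), datetime(2024, 3, 15, 16, 0)),
]
close = datetime(2024, 3, 15, 16, 0)
seen = latest_visible(points, "AAPL", close)
```

A join on ticker and calendar date would match both rows to the 2024-03-15 close; filtering on observed_at leaves only the 11:30 value for a decision made at 16:00.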
How we test for it
We run a deliberately broken backtest at startup: every data adapter gets called with asof = now() - 7 days, but the underlying data files contain rows from yesterday. If the adapter returns those rows, we fail the test, because data that became available one day ago must not be visible to a decision made seven days ago. This is the test_no_lookahead_bias case in our test suite; you can read the code.
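The shape of such a check can be sketched as below. This is not the actual test_no_lookahead_bias code; the adapter classes, the fetch(asof) method, and the row layout are assumptions made up for illustration.

```python
from datetime import datetime, timedelta, timezone


class InMemoryAdapter:
    """Hypothetical adapter that honors the asof cutoff."""

    def __init__(self, rows):
        self._rows = rows

    def fetch(self, asof):
        return [r for r in self._rows if r["observed_at"] <= asof]


class LeakyAdapter(InMemoryAdapter):
    """Hypothetical buggy adapter that ignores asof entirely."""

    def fetch(self, asof):
        return list(self._rows)


def leaks_future_rows(adapter, now):
    """True if a fetch dated 7 days before `now` returns anything newer than that."""
    asof = now - timedelta(days=7)
    return any(r["observed_at"] > asof for r in adapter.fetch(asof))


def test_no_lookahead_bias_sketch():
    now = datetime(2024, 3, 15, tzinfo=timezone.utc)
    rows = [
        {"observed_at": now - timedelta(days=10), "value": 1.0},
        {"observed_at": now - timedelta(days=1), "value": 2.0},  # yesterday's row
    ]
    assert not leaks_future_rows(InMemoryAdapter(rows), now)
    assert leaks_future_rows(LeakyAdapter(rows), now)
```

Including a known-leaky adapter in the test guards against the check itself silently passing everything.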
Backtests without an asof contract at the adapter boundary are unreliable by default. The bias creeps in gradually as you add data sources, and you don't notice because the equity curve gets prettier, not uglier.
The Backtrader cross-validation step
We run every strategy through two engines: our in-house event-loop backtester, and Backtrader. If the two engines produce materially different P&L on the same dataset, we know one of them has a leak. This is unglamorous but catches the 5–10% of bugs that survive unit tests. You can run it yourself at /backtest with the "cross-validate" toggle on.
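The comparison step might look like the sketch below. It does not call Backtrader or either engine; it assumes each engine has already produced a cumulative P&L series over the same dates. The first_divergence helper and both tolerances are hypothetical choices, since what counts as "materially different" depends on the strategy.

```python
import math
from typing import Optional, Sequence


def first_divergence(
    pnl_a: Sequence[float],
    pnl_b: Sequence[float],
    rel_tol: float = 0.01,  # assumed: 1% relative gap counts as material
    abs_tol: float = 1.0,   # assumed: ignore gaps under one currency unit
) -> Optional[int]:
    """Index of the first step where the two P&L series disagree, else None."""
    if len(pnl_a) != len(pnl_b):
        raise ValueError("P&L series must cover the same number of steps")
    for i, (a, b) in enumerate(zip(pnl_a, pnl_b)):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return i
    return None


# Made-up series: the engines agree for two steps, then one pulls ahead.
in_house = [100.0, 105.0, 130.0]
reference = [100.0, 105.0, 110.0]
divergence_step = first_divergence(in_house, reference)
```

Returning the first diverging step rather than a plain yes or no points you at the date where the suspected leak first shows up.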
How to inspect any backtest you read about
Ask three questions before you trust an equity curve:
(1) What was the universe selection rule? "S&P 500 constituents at each rebalance" is correct; "current S&P 500" is survivor bias.
(2) How is fundamental data sourced? If it's Yahoo / scraped without filing dates, assume restated-numbers contamination.
(3) What's the sentiment scrape window? If posts aren't hard-capped by created_utc, assume retroactive leaks.
If the writer can't answer all three, you're looking at a curve that won't reproduce live.