Teaching AI to Forecast Rationally
Large language models systematically over-extrapolate recent trends when asked to forecast. Prompting cannot fix it. We show that a single round of low-cost fine-tuning can align AI forecasts with rational benchmarks out-of-sample.
LLMs Over-Extrapolate
When asked to forecast, LLMs place excessive weight on recent observations — and the bias cannot be prompted away.
LLMs Inherit Human Biases
Large language models are trained on trillions of tokens of human-written text — news, analyst reports, earnings calls, and investment forums. That corpus is saturated with extrapolative language about returns and fundamentals, so the resulting model inherits the same systematic biases human forecasters show.
They Overreact to Recent Trends
Off-the-shelf Qwen3‑32B reproduces the central finding of Afrouzi et al. (2023): when forecasting AR(1) processes, its forecast errors load negatively on forecast revisions across all six persistence levels, with coefficients ranging from −0.456 to −0.260 and all significant at the 1% level. For real S&P 500 stocks, the coefficient on the most recent monthly return is 0.394 (t = 53.9).
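The diagnostic behind these numbers is a Coibion–Gorodnichenko-style regression of forecast errors on forecast revisions, where a negative slope signals overreaction. A minimal synthetic sketch (our illustration, not the paper's data; `theta` is an assumed overreaction parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, theta, T = 0.6, 0.5, 20_000   # persistence; theta = assumed overreaction strength

# Simulate an AR(1): x_t = rho * x_{t-1} + eps_t
eps = rng.standard_normal(T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + eps[t]

# Overreacting forecaster: overweights the latest innovation eps_t
fc = rho * x + theta * rho * eps              # F_t[x_{t+1}] for each t

error = x[2:] - fc[1:-1]                      # realized x_{t+1} minus F_t[x_{t+1}]
revision = fc[1:-1] - rho * fc[:-2]           # F_t[x_{t+1}] minus F_{t-1}[x_{t+1}]

slope = np.cov(error, revision)[0, 1] / np.var(revision)
print(f"error-on-revision slope: {slope:.3f}")  # negative slope = overreaction
```

With these illustrative parameters the slope lands near -0.32, in the same range as the coefficients reported above.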
Prompting Cannot Fix It
Chen et al. (2024) show that prompt-level tricks — role instructions (“You are a sophisticated hedge fund investor”), chain-of-thought, few-shot demos — leave the extrapolation bias essentially unchanged. The distortion lives in the model’s parameters, not in the framing of the input.
Why It Matters
As LLM-based agents move into robo-advising, credit scoring, nowcasting, and algorithmic trading, a biased forecasting layer can amplify exactly the behavioral mistakes those systems are meant to prevent. Debiasing the model before deployment is a prerequisite for responsible delegation.
Where Does the Bias Come From?
Modern LLMs are built in two stages. The extrapolation bias is baked in during the first stage and only lightly reshaped in the second.
Stage 1, pretraining: tens of terabytes of financial journalism, analyst reports, and earnings calls, all saturated with extrapolative language. Next-token prediction forces the model to internalize both the factual knowledge and the systematic biases of its source text.
Stage 2, alignment: this stage shapes how the model responds, not what it believes about data-generating processes. Worse, the human feedback used here often comes from annotators with their own extrapolative beliefs, so the bias can even be reinforced.
- “You are a sophisticated hedge-fund investor”: bias unchanged
- “Reason step by step”: bias unchanged
Because the distortion is encoded in the model's parameters, reframing the input alone cannot reach it. Any real fix has to touch the weights.
We insert an additional fine-tuning step that trains the model on corrective prompt-response pairs whose target is the rational benchmark forecast. Each pair combines the same information the model already sees with the answer a disciplined forecaster would give, directly reshaping the mapping from observed data to predictions.
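For the AR(1) task, a single corrective pair might look like the sketch below (prompt wording and field names are our illustration, not the paper's template); the response is the rational conditional expectation ρ·x_t, not the model's extrapolative guess:

```python
rho = 0.6                        # persistence of the simulated AR(1) (illustrative)
history = [1.2, 0.9, 1.4, 0.5]   # observed realizations x_1 .. x_t

def make_pair(history, rho):
    """Build one corrective prompt-response pair for supervised fine-tuning."""
    prompt = (
        "The following values were drawn from a stochastic process: "
        + ", ".join(f"{v:.2f}" for v in history)
        + ". Forecast the next value."
    )
    # Rational benchmark for a mean-zero AR(1): E_t[x_{t+1}] = rho * x_t
    target = rho * history[-1]
    return {"prompt": prompt, "response": f"{target:.2f}"}

pair = make_pair(history, rho)
print(pair["response"])   # "0.30"
```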
Our Framework
A simple, four-stage recipe for correcting a forecasting bias an open-weight LLM has absorbed from its training data.
1. Identify Bias: probe the off-the-shelf LLM via prompting and document its biased outputs.
2. Curate Dataset: assemble N corrective prompt-response pairs.
3. SFT Trainer: fine-tune the model on the corrective pairs.
4. Confirm Debiasing: re-run the diagnostics on held-out data; the aligned LLM is then deployed.
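Stage 3 can be implemented cheaply with LoRA adapters (Hu et al., 2022) via Hugging Face's peft and trl libraries. The snippet below is our sketch, not the paper's training code: exact argument names vary across trl versions, and it assumes the curated pairs are stored as `corrective_pairs.jsonl` in trl's prompt-completion format.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA keeps the fine-tune cheap: only low-rank adapter weights are trained.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="Qwen/Qwen3-32B",                     # open-weight base model
    train_dataset=load_dataset("json", data_files="corrective_pairs.jsonl",
                               split="train"),  # the N corrective pairs
    args=SFTConfig(output_dir="debiased-model", num_train_epochs=1),
    peft_config=peft_config,
)
trainer.train()
```

Hyperparameters here (rank 16, one epoch) are placeholders, not the paper's settings.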
What We Found
We evaluate fine-tuning in two settings — a controlled AR(1) forecasting experiment and cross-sectional S&P 500 return prediction. In both, the bias is corrected out-of-sample.
Controlled AR(1) Forecasting
Replicates Afrouzi et al. (2023) · 6 persistence levels · ~30K training observations · Qwen3-32B
| Persistence (ρ) | Baseline coefficient | Fine-tuned coefficient |
|---|---|---|
| 0.0 | −0.456 (t=−19.05) | −0.073 (t=−1.54) |
| 0.2 | −0.422 | −0.051 |
| 0.4 | −0.383 | −0.042 |
| 0.6 | −0.341 | −0.034 |
| 0.8 | −0.301 | −0.028 |
| 1.0 | −0.260 (t=−10.37) | −0.027 (t=−0.97) |
The baseline Qwen3–32B replicates the human-subject pattern of Afrouzi et al. (2023): overreaction is strongest for transitory processes and weakest for random walks, with highly significant t-statistics across all six persistence levels.
After fine-tuning on rational conditional-expectation targets, the overreaction bias is statistically indistinguishable from zero across every persistence condition, on held-out test data the model never saw during training.
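The held-out check is mechanical: rerun the same error-on-revision regression with forecasts equal to the rational conditional expectation, and the slope should collapse toward zero. A synthetic sketch (our illustration, not the held-out LLM output):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, T = 0.6, 20_000

# Simulate an AR(1): x_t = rho * x_{t-1} + eps_t
eps = rng.standard_normal(T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + eps[t]

fc = rho * x                          # rational forecast E_t[x_{t+1}] = rho * x_t
error = x[2:] - fc[1:-1]              # forecast error = pure innovation eps_{t+1}
revision = fc[1:-1] - rho * fc[:-2]   # revision = rho * eps_t

slope = np.cov(error, revision)[0, 1] / np.var(revision)
print(f"error-on-revision slope: {slope:.3f}")  # statistically near zero
```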
S&P 500 Monthly Return Prediction
CRSP S&P 500 constituents · trailing 12-month histories · train 2001–2011 · valid 2012–2015 · test 2016–2024 · Qwen3–32B
The off-the-shelf model loads positively on every lag of past returns; coefficients decline gradually with lag but all are significant at the 1% level. This matches the extrapolation pattern human forecasters show in Da, Huang & Jin (2021).
After fine-tuning on realized next-month returns, all lag coefficients flip sign to negative: the model has internalized the empirical short-term reversal in stock returns, exactly what the training data implies.
The flip is out-of-sample. The test window (2016–2024) is strictly after the 2001–2011 training window and the 2012–2015 validation window. The model has learned a new mapping from past returns to forecasts, not memorized the training labels.
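The sign flip can be checked with a pooled regression of next-month returns on the most recent month's return; short-term reversal implies a negative coefficient. A synthetic sketch with an assumed reversal strength of -0.05 (illustrative magnitude, not an estimate from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n_stocks, n_months = 500, 96
rev = -0.05                                  # assumed one-month reversal coefficient

# Panel of monthly returns with a one-month reversal component plus noise
r = np.zeros((n_months, n_stocks))
r[0] = 0.08 * rng.standard_normal(n_stocks)
for t in range(1, n_months):
    r[t] = rev * r[t - 1] + 0.08 * rng.standard_normal(n_stocks)

# Pooled OLS of r_{t+1} on r_t across all stock-months
x, y = r[:-1].ravel(), r[1:].ravel()
beta = np.cov(x, y)[0, 1] / np.var(x)
print(f"lag-1 coefficient: {beta:.3f}")      # negative: recent losers outperform
```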
Why the fix is real, not cosmetic
The diagnostic test set is fixed before training begins and never enters the fine-tune loop. The split follows Gu, Kelly & Xiu (2020). Any reduction in bias must come from generalization, not in-sample overfitting.
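The guarantee rests on a purely chronological assignment rule in the spirit of Gu, Kelly & Xiu (2020). A minimal sketch (the function name is ours; the year boundaries come from the text above):

```python
# Assign a stock-month to a fold by calendar year; strictly chronological,
# so the test window never leaks into training or validation.
def assign_fold(year: int) -> str:
    if 2001 <= year <= 2011:
        return "train"
    if 2012 <= year <= 2015:
        return "valid"
    if 2016 <= year <= 2024:
        return "test"
    raise ValueError(f"year {year} is outside the 2001-2024 sample")

print(assign_fold(2016))  # "test": strictly after both earlier windows
```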
In the AR(1) exercise we train on rational conditional-expectation targets rather than realizations, removing a mechanical negative correlation between forecast errors and revisions that would otherwise bias the diagnostic regression.
Return histories are supplied with no firm names and no dates, stripping out the information channel through which lookahead bias from pretraining would otherwise contaminate the results.
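Enforcing this at prompt-construction time is straightforward: the model receives only an ordered, unlabeled return sequence. A sketch (the wording and function name are our illustration, not the paper's template):

```python
def build_prompt(returns):
    """Format a trailing monthly return history with no firm name and no dates."""
    labels = ", ".join(f"month {i}: {r:+.1%}" for i, r in enumerate(returns, 1))
    return ("A stock had the following monthly returns, oldest first: "
            + labels + ". Forecast next month's return.")

p = build_prompt([0.021, -0.013, 0.045])
print(p)  # relative month labels only; no ticker, no calendar date
```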
Frequently Asked Questions
Common questions about our research, explained plainly.
References
The academic papers cited throughout this page.
1. Afrouzi, H., Kwon, S. Y., Landier, A., Ma, Y., & Thesmar, D. (2023). Overreaction in expectations: Evidence and theory. The Quarterly Journal of Economics, 138(3), 1713–1764.
2. Chen, S., Green, T. C., Gulen, H., & Zhou, D. (2024). What does ChatGPT make of historical stock returns? Extrapolation and miscalibration in LLM stock return forecasts. arXiv preprint arXiv:2409.11540.
3. Da, Z., Huang, X., & Jin, L. J. (2021). Extrapolative beliefs in the cross-section: What can we learn from the crowds? Journal of Financial Economics, 140(1), 175–196.
4. Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5), 2223–2273.
5. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations (ICLR).


