
Teaching AI to Forecast Rationally

Large language models systematically over-extrapolate recent trends when asked to forecast. Prompting cannot fix it. We show that a single round of low-cost fine-tuning can align AI forecasts with rational benchmarks out-of-sample.

THE CORE INSIGHT

Figure: given the same prompt (a past stock-price history plus "Forecast the next month's return."), the off-the-shelf LLM predicts +7.9%, extrapolating the recent trend, while the fine-tuned LLM predicts −1.2%, having learned mean reversion. Fine-tuning corrects the bias out-of-sample.

LLMs Over-Extrapolate

When asked to forecast, LLMs place excessive weight on recent observations — and the bias cannot be prompted away.

LLMs Inherit Human Biases

Large language models are trained on trillions of tokens of human-written text — news, analyst reports, earnings calls, and investment forums. That corpus is saturated with extrapolative language about returns and fundamentals, so the resulting model inherits the same systematic biases human forecasters show.

They Overreact to Recent Trends

Off-the-shelf Qwen3-32B reproduces the central finding of Afrouzi et al. (2023): when forecasting AR(1) processes, its forecast errors load negatively on forecast revisions across all six persistence levels, with coefficients ranging from −0.456 to −0.260 and all significant at the 1% level. For real S&P 500 stocks, the coefficient on the most recent monthly return is 0.394 (t = 53.9).
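The diagnostic behind these numbers is the error-on-revision regression of Afrouzi et al. (2023): regress forecast errors on forecast revisions, and a negative slope signals overreaction. Below is a minimal NumPy sketch in which a toy over-extrapolating forecaster stands in for the LLM; the persistence values and the forecaster rule are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Error-on-revision diagnostic: a negative slope means the forecaster
# revises too aggressively, i.e. overreacts to recent observations.
rng = np.random.default_rng(0)
rho, rho_hat, T = 0.6, 0.9, 5000   # true vs. perceived persistence

x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + rng.standard_normal()

# One-step-ahead forecasts of the same target x[t+1], made at t and t-1
f_now = rho_hat * x[1:-1]          # F_t[x_{t+1}]
f_prev = rho_hat**2 * x[:-2]       # F_{t-1}[x_{t+1}]
error = x[2:] - f_now              # realized minus forecast
revision = f_now - f_prev          # forecast revision

slope = np.polyfit(revision, error, 1)[0]
print(f"error-on-revision slope: {slope:.3f}")  # negative => overreaction
```

Because the toy forecaster's perceived persistence exceeds the true one, the fitted slope comes out negative, mirroring the sign pattern the baseline LLM shows.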

Prompting Cannot Fix It

Chen et al. (2024) show that prompt-level tricks — role instructions (“You are a sophisticated hedge fund investor”), chain-of-thought, few-shot demos — leave the extrapolation bias essentially unchanged. The distortion lives in the model’s parameters, not in the framing of the input.

Why It Matters

As LLM-based agents move into robo-advising, credit scoring, nowcasting, and algorithmic trading, a biased forecasting layer can amplify exactly the behavioral mistakes those systems are meant to prevent. Debiasing the model before deployment is a prerequisite for responsible delegation.

Where Does the Bias Come From?

Modern LLMs are built in two stages. The extrapolation bias is baked in during the first stage and only lightly reshaped in the second.

1. Pretraining: raw text
“Stock prices have reached what looks like a permanently high plateau.”
— Irving Fisher, Yale economist, quoted in The New York Times, October 16, 1929 (two weeks before the Wall Street Crash)

Tens of terabytes of financial journalism, analyst reports and earnings calls — all saturated with extrapolative language. Next-token prediction forces the model to internalize both the factual knowledge and the systematic biases of its source text.

2. Alignment: doesn't fix it

RLHF / DPO / GRPO: "Rewrite responses to be polite, helpful, and instruction-following."

Alignment shapes how the model responds, not what it believes about data-generating processes. Worse, the human feedback used here often comes from annotators with their own extrapolative beliefs — so the bias can even be reinforced.

3. Prompt-only fixes: don't work

Role instructions ("You are a sophisticated hedge-fund investor"): bias unchanged.
Chain-of-thought ("Reason step by step"): bias unchanged.

Because the distortion is encoded in the model's parameters, reframing the input alone cannot reach it. Any real fix has to touch the weights.

4. Our SFT step: reaches the weights

Instruction dataset: prompt = past return history; target = rational benchmark forecast.

We insert an additional fine-tuning step that trains the model on corrective prompt-response pairs. Each pair combines the same information the model already sees with the answer a disciplined forecaster would give, directly reshaping the mapping from observed data to predictions.
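As a concrete illustration of such a corrective pair, here is how one AR(1) example might be assembled; the prompt wording, rounding, and helper name are my assumptions for the sketch, not the paper's actual template.

```python
import numpy as np

def make_sft_pair(history, rho, horizon=1):
    """Pair the information the model already sees (the history) with
    the rational benchmark answer E[x_{t+h} | x_t] = rho**h * x_t.
    (Illustrative format; not the paper's exact prompt template.)"""
    target = rho ** horizon * history[-1]
    prompt = (f"Past {len(history)} observations: "
              f"{[round(v, 2) for v in history]}. "
              f"Forecast the value {horizon} step(s) ahead.")
    return {"prompt": prompt, "response": f"{target:.2f}"}

# Simulate one 40-point AR(1) history and build its corrective pair
rng = np.random.default_rng(1)
rho, x = 0.6, [0.0]
for _ in range(39):
    x.append(rho * x[-1] + rng.standard_normal())

pair = make_sft_pair(x, rho)   # one prompt-response pair for the dataset
```

Repeating this over many simulated histories yields the N prompt-response pairs that the fine-tuning stage consumes.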

Our Framework

A simple, four-stage recipe for correcting a forecasting bias an open-weight LLM has absorbed from its training data.

1. Identify Bias (via prompting)

Example forecasting prompt: "Past 40 pts: [[-20.85, None], ..., [0.16, -35.81]]. Predict Δ₁ and Δ₂."
Biased response: Δ₁ = −35.0 (extrapolates the crash); Δ₂ = +30.0 (overreacts to the reversal).
These prompts and the biased LLM outputs form the held-out TEST SET, reserved for evaluation.

2. Curate Dataset

Example forecasting prompt: "Past 40 pts: [[13.9, None], ..., [20.65, 1.42]]. Predict Δ₁ and Δ₂."
Rational response (from the rational benchmark or future realizations): Δ₁ = −16.52 (mean-reverting); Δ₂ = +12.49 (stabilizing).
The N prompt-response pairs are split into a TRAIN SET (never used for evaluation) and a VALID SET (used to decide when to stop training).

3. SFT Trainer

The chat-LLM's base weights θ stay frozen, preserving the base model; only LoRA adapters of rank r ≪ d are trainable. Training loss and task performance are monitored, with early stopping selecting the best checkpoint. Training runs on a GPU.

4. Confirm Debiasing

The debiased LLM is re-evaluated on the TEST SET of prompts from the identification stage, held out throughout training. Debiasing is confirmed when overreaction and extrapolation fall and Δ₁, Δ₂ move closer to the rational benchmarks; the model is then aligned and ready for deployment.

1. Identify the bias. Prompt the off-the-shelf model with a held-out set of forecasting prompts. Its raw responses reveal whatever systematic bias it has absorbed from pretraining. These prompts become the test set and are never touched again.
2. Curate the dataset. Build a separate collection of prompt-response pairs where the responses encode what a rational forecaster would say. A held-out validation split tracks generalization and triggers early stopping.
3. Fine-tune the model. Freeze the original chat-LLM and train a small set of adapter weights so the model starts producing the rational responses. The base model's general language capabilities stay intact.
4. Confirm debiasing. Feed the held-out test prompts back through the fine-tuned model and check that overreaction and extrapolation have moved toward the rational benchmark — out-of-sample, by construction.
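The four stages can be rehearsed end-to-end on a toy problem. In the sketch below, a one-parameter linear forecaster stands in for the LLM and a least-squares fit stands in for SFT; both are stand-ins of my own for illustration, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.6   # true AR(1) persistence

def simulate(n):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.standard_normal()
    return x

# Toy stand-in for the chat-LLM: forecast = w * last observation
class ToyForecaster:
    def __init__(self, w):
        self.w = w
    def predict(self, x_last):
        return self.w * x_last

# 1. Identify the bias on held-out test data: w = 0.9 > rho, so the
#    baseline over-extrapolates recent observations.
model, test_x = ToyForecaster(w=0.9), simulate(1_000)
baseline_gap = abs(model.w - rho)

# 2. Curate a separate train set with rational targets E[x_{t+1} | x_t].
train_x = simulate(1_000)
targets = rho * train_x

# 3. "Fine-tune": least-squares fit of w to the rational responses.
model.w = float(train_x @ targets / (train_x @ train_x))

# 4. Confirm debiasing out-of-sample: the gap to rational shrinks.
assert abs(model.w - rho) < baseline_gap
```

The assertion in step 4 plays the role of the confirmation stage: the fitted parameter moves from the biased value toward the rational one, evaluated on data never used for the fit.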

What We Found

We evaluate fine-tuning in two settings — a controlled AR(1) forecasting experiment and cross-sectional S&P 500 return prediction. In both, the bias is corrected out-of-sample.

1. Controlled AR(1) Forecasting

Replicates Afrouzi et al. (2023) · 6 persistence levels · ~30K training observations · Qwen3-32B

Error-on-revision coefficient (negative = overreaction)

Persistence | Baseline | Fine-tuned
0.0 | −0.456 (t = −19.05) | −0.073 (t = −1.54)
0.2 | −0.422 | −0.051
0.4 | −0.383 | −0.042
0.6 | −0.341 | −0.034
0.8 | −0.301 | −0.028
1.0 | −0.260 (t = −10.37) | −0.027 (t = −0.97)
All baseline coefficients significant at the 1% level. After fine-tuning, none exceeds the 10% critical value.
−0.456: baseline overreaction coefficient at persistence 0.0

The baseline Qwen3-32B replicates the human-subject pattern of Afrouzi et al. (2023): overreaction is strongest for transitory processes and weakest for random walks, with highly significant t-statistics across all six persistence levels.

All insignificant: fine-tuned overreaction coefficients at every persistence level

After fine-tuning on rational conditional-expectation targets, the overreaction bias is statistically indistinguishable from zero across every persistence condition, on held-out test data the model never saw during training.

2. S&P 500 Monthly Return Prediction

CRSP S&P 500 constituents · trailing 12-month histories · train 2001–2011 · valid 2012–2015 · test 2016–2024 · Qwen3-32B

+0.394: baseline loading on the most recent monthly return

The off-the-shelf model loads positively on every lag of past returns; coefficients decline gradually with horizon but all are significant at the 1% level. This matches the extrapolation pattern human subjects show in Da, Huang & Jin (2021).

−0.120: fine-tuned loading on the most recent monthly return

After fine-tuning on realized next-month returns, all lag coefficients flip sign to negative: the model has internalized the empirical short-term reversal in stock returns, exactly what the training data implies.

The flip is out-of-sample. The test window (2016–2024) is strictly after the 2001–2011 training window and the 2012–2015 validation window. The model has learned a new mapping from past returns to forecasts, not memorized the training labels.

Why the fix is real, not cosmetic

Strict train / validation / test separation

The diagnostic test set is fixed before training begins and never enters the fine-tune loop. The split follows Gu, Kelly & Xiu (2020). Any reduction in bias must come from generalization, not in-sample overfitting.
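A chronological split of this kind is easy to break with random shuffling; a small sketch of the date-based partition follows, where the observation fields and helper name are illustrative assumptions.

```python
from datetime import date

def chronological_split(observations):
    """Partition by year into the paper's windows:
    train 2001-2011, valid 2012-2015, test 2016-2024.
    No observation can leak across a window boundary."""
    train = [o for o in observations if 2001 <= o["date"].year <= 2011]
    valid = [o for o in observations if 2012 <= o["date"].year <= 2015]
    test = [o for o in observations if 2016 <= o["date"].year <= 2024]
    return train, valid, test

# One placeholder observation per year, 2001 through 2024
obs = [{"date": date(y, 6, 30), "ret": 0.0} for y in range(2001, 2025)]
train, valid, test = chronological_split(obs)
```

Because the test window lies strictly after both other windows, any bias reduction measured on it must come from generalization rather than memorized labels.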

Conditional expectations as targets

In the AR(1) exercise we train on conditional expectations E[x_{t+h} | x_t] rather than realizations, removing a mechanical negative correlation between forecast errors and revisions that would otherwise bias the diagnostic regression.

Anonymized prompts for stock returns

Return histories are supplied with no firm names and no dates, stripping out the information channel through which lookahead bias from pretraining would otherwise contaminate the results.


Frequently Asked Questions

Common questions about our research, explained plainly.

What bias do LLMs show when forecasting?

When asked to forecast the next value of a time series — a future stock return, an AR(1) innovation, anything with a recent trend — large language models place too much weight on the most recent observations. If the last few returns were positive, the model predicts continuation; in fact, short-horizon stock returns exhibit weak mean reversion. The same pattern shows up in human forecasters, and it turns out LLMs inherit it from the human-written text they are trained on.

Why can't better prompting fix it?

Prompt-level tricks like role instructions ("you are a sophisticated hedge-fund investor"), few-shot demonstrations, or chain-of-thought reasoning operate purely at inference time. They don't touch the model's parameters. The bias, however, is encoded in those parameters during pretraining — so reframing the input can't reach it. Chen et al. (2024) show this directly: prompt-based mitigations leave the bias essentially unchanged.

What is LoRA?

LoRA is the most widely used parameter-efficient fine-tuning method in machine learning. Instead of updating all of the model's parameters, LoRA freezes the original weight matrices and trains a small set of adapter matrices attached to each layer. Because those adapters are initialized so that the model starts out matching the pretrained model exactly, fine-tuning becomes a surgical adjustment rather than a full re-train — preserving general capabilities while correcting a narrowly defined behavior.

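The arithmetic behind that surgical adjustment is small enough to write out. Here is a NumPy sketch of a single LoRA-adapted layer in the spirit of Hu et al. (2022); the dimensions, scaling factor, and initialization scale are illustrative choices of mine.

```python
import numpy as np

d, r, alpha = 64, 4, 8.0           # layer width d, adapter rank r << d
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, d))   # pretrained weight, stays frozen
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero-init

def lora_forward(x):
    # y = W0 x + (alpha / r) * B (A x); only A and B would be trained
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
# Zero-initialized B makes the adapter a no-op at the start of training,
# so the adapted model begins exactly at the pretrained model.
assert np.allclose(lora_forward(x), W0 @ x)
print(f"trainable params: {A.size + B.size} vs frozen: {W0.size}")
```

The parameter count makes the "surgical" claim concrete: the adapters carry 2rd parameters against d² in the frozen matrix, a large saving whenever r ≪ d.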
Which model did you use, and why?

Qwen3-32B. We deliberately chose an open-weight model: proprietary LLMs accessed through APIs don't allow inspection or modification of parameters, which is a prerequisite for fine-tuning. Qwen3-32B is large enough to exhibit the sophisticated forecasting behavior we want to correct, yet small enough to fine-tune efficiently with LoRA on commercial cloud hardware.

Why train on conditional expectations in one exercise but realized returns in the other?

Two different goals. In the AR(1) exercise, training on realizations would introduce noise that shares a term with both sides of the diagnostic error-on-revision regression, mechanically biasing the coefficient downward. Training on conditional expectations removes this. In the stock-return exercise, the conditional expected return isn't directly observable, so we use realized next-month returns — which also has the advantage of teaching the model the actual empirical return distribution, including weak mean reversion.

Doesn't fine-tuning damage the model's general abilities?

That's the classic catastrophic-forgetting concern, and it's exactly why we use LoRA instead of full fine-tuning. LoRA keeps W₀ frozen and only trains the adapter matrices, so the model's general language understanding, reasoning, and instruction-following stay intact. What changes is narrowly the mapping from numerical return histories to forecasts — precisely the behavior we want to correct.

Could the same recipe debias other behaviors?

We believe so. Our framework only requires that you can specify a rational benchmark or a realized outcome to train against. The same recipe should apply to other behavioral biases in LLMs — overconfidence, anchoring, base-rate neglect — whenever the goal is to align the model's implicit beliefs about how data are generated with a clear normative standard. That's particularly relevant as LLM-based agents are being deployed in robo-advising, credit scoring, nowcasting, and algorithmic trading.

References

The academic papers cited throughout this page.

1. Afrouzi, H., Kwon, S. Y., Landier, A., Ma, Y., & Thesmar, D. (2023). Overreaction in expectations: Evidence and theory. The Quarterly Journal of Economics, 138(3), 1713–1764.
2. Chen, S., Green, T. C., Gulen, H., & Zhou, D. (2024). What does ChatGPT make of historical stock returns? Extrapolation and miscalibration in LLM stock return forecasts. arXiv preprint arXiv:2409.11540.
3. Da, Z., Huang, X., & Jin, L. J. (2021). Extrapolative beliefs in the cross-section: What can we learn from the crowds? Journal of Financial Economics, 140(1), 175–196.
4. Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5), 2223–2273.
5. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations (ICLR).
6. Yang, A., et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.