LLM Financial Analysis Stability with Varying Data History
A Deep Dive into Prediction Consistency

Introduction
Financial analysts and investors are increasingly turning to large language models (LLMs) for insights that can guide trading decisions. With so many variables at play—ranging from historical closing prices to quarterly reports and breaking news—a nagging question surfaces: How stable and reliable are these GenAI-driven forecasts when we adjust the amount or type of input data?
Why Stability Matters in LLM Financial Forecasts
In the financial arena, model stability is paramount. It refers to a system’s ability to consistently generate robust forecasts and sound reasoning even when minor variations are introduced to its inputs. For professional investors and trading strategists, stability is critical for several reasons:
Enhanced Confidence: An LLM that demonstrates consistent behavior builds trust among users, ensuring that minor data fluctuations do not lead to erratic trading signals.
Robust Strategy Formation: Unstable outputs can result in fragile strategies. When predictions swing dramatically due to trivial input changes, it becomes challenging to design risk-adjusted and resilient investment strategies.
Reduced Noise Overfitting: Significant shifts in output might indicate that the model is overfitting to transient noise rather than capturing the true market dynamics. Consistent behavior suggests that the model is isolating the genuine market drivers from spurious signals.
In essence, when dealing with billions in capital or client assets, ensuring that the AI’s recommendations are both reliable and repeatable is not just desirable—it’s essential.
LLM Architecture Factors That Influence Stability
Large language models like GPT-3.5, GPT-4, and other transformer-based architectures are powerful yet sensitive to input variations. Key architectural factors include:
Attention Mechanism: Transformers employ an attention mechanism to focus on relevant parts of the input. As additional content (e.g., more news articles) is introduced, the distribution of attention may shift, thereby altering the weighting of different pieces of information.
Context Window Size: LLMs are limited by a maximum context window (the total token limit for both input and output). When this limit is reached, prompts may be truncated or require summarization, potentially changing the model’s decision path.
Positional Encoding: These models use positional embeddings to maintain the order of tokens. Feeding large chunks of data in a non-sequential or arbitrary order can disrupt this mapping, leading to variations in output.
Prompt Engineering Nuances: The manner in which data is structured—whether in bullet points, tables, or narrative text—can significantly affect which tokens the model prioritizes. Thoughtful prompt design can lead to more stable reasoning patterns.
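The context-window constraint above can be made concrete with a small token-budgeting sketch. It uses a rough four-characters-per-token heuristic instead of a real tokenizer, and the helper names are our own illustrations, not part of any platform API:

```python
# Rough token budgeting for news items, assuming ~4 characters per token
# (a common heuristic; exact counts require the model's tokenizer).

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fit_news_to_budget(news_items: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent news items that fit within the token budget.

    Items are assumed to be ordered newest-first, so truncation drops
    the oldest articles rather than the most recent ones.
    """
    kept, used = [], 0
    for item in news_items:
        cost = estimate_tokens(item)
        if used + cost > budget_tokens:
            break
        kept.append(item)
        used += cost
    return kept

news = ["Headline A " * 50, "Headline B " * 50, "Headline C " * 50]
print(len(fit_news_to_budget(news, budget_tokens=300)))  # prints 2: oldest item dropped
```

Dropping whole items newest-first, as here, keeps the prompt deterministic; summarizing the overflow instead would preserve more information at the cost of another model call.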
Experiment Setup: Testing Forecast Stability by Varying Data History
To systematically investigate how the quantity of historical data influences LLM predictions, we designed an experiment that manipulates four main input variables:
TopNews: Summaries or headlines from recent news.
DailyClose: Historical closing prices.
QuarterlyReportEbit: Quarterly earnings before interest and taxes.
QuarterlyReportTotalRevenue: Quarterly revenue figures.
We incorporated these variables into a financial prompt template using our “GPT Analyst” platform. Below is the prompt (improved for clarity and explanation):
import inspect

RECENT_NEWS_PROMPT_BODY = inspect.cleandoc(
    """
    Description: {Description|<SYMBOL>}
    52 Week Low: {52WeekLow|<SYMBOL>}
    52 Week High: {52WeekHigh|<SYMBOL>}
    200 Days Moving Average (close): {SMA200dailyclose|<SYMBOL>}
    50 Days Moving Average (close): {SMA50dailyclose|<SYMBOL>}
    Daily Close:
    {DailyClose|-X|-0|<SYMBOL>}
    Quarterly EBIT:
    {QuarterlyReportEbit|-X|-0|<SYMBOL>}
    Quarterly Total Revenue:
    {QuarterlyReportTotalRevenue|-X|-0|<SYMBOL>}
    Recent news that might have a short-term
    impact on the stock performance:
    {TopNews|-X|-0|<SYMBOL>}
    Using the above market statistics, maximize return.
    """
)
This template dynamically adjusts the “look-back window” for each data type. For instance, the TopNews parameter may cover the last 2, 5, 10, or 15 days; similarly, DailyClose might incorporate data from 1 to 15 days in the past. We applied similar variations for quarterly data (e.g., 1, 2, 3, 4, 8, or 12 quarters). We then aggregated the LLM predictions across multiple ticker symbols (MSFT, TSLA, BA, and NOV) and dates, mapping the recommended positions—long, short, or cash—onto heatmaps for visual analysis. The prompts were evaluated on the first of every month in 2024, yielding 12 dates per symbol.
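The experiment grid described above can be sketched as follows. The exact window values and the `build_prompt` helper are illustrative assumptions, not the platform's actual API:

```python
from itertools import product

# Hypothetical experiment grid; the look-back values mirror the ranges
# described in the text, but build_prompt is a placeholder, not GPT Analyst's API.
NEWS_DAYS = [2, 5, 10, 15]
CLOSE_DAYS = [1, 5, 10, 15]          # assumed subset of the 1-15 day range
QUARTERS = [1, 2, 3, 4, 8, 12]
SYMBOLS = ["MSFT", "TSLA", "BA", "NOV"]
DATES = [f"2024-{m:02d}-01" for m in range(1, 13)]  # first of each month

def build_prompt(symbol, date, news_d, close_d, quarters):
    """Placeholder for filling the template's {...|-X|-0|...} look-back markers."""
    return (f"{symbol} {date}: news={news_d}d close={close_d}d "
            f"ebit/revenue={quarters}q")

grid = list(product(SYMBOLS, DATES, NEWS_DAYS, CLOSE_DAYS, QUARTERS))
print(len(grid))  # 4 symbols x 12 dates x 4 x 4 x 6 = 4608 prompt variants
```

In practice each variant is sent to the model and the returned long/short/cash decision is recorded per (symbol, date) cell.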
Results: Heatmaps Highlight the Real Driver—News

For each symbol and date, the decision was determined by majority vote. The heatmap shows the fraction of decisions in agreement with the majority when varying the length of historical data in the prompt for the given variables.

For each symbol and date, the decision was determined by majority vote. The heatmap shows the fraction of decisions in agreement with the majority when varying the length of historical data in the prompt for the given variables.
The heatmap analysis provided several key insights into the LLM’s behavior:
Dominant Influence of News: The model’s decisions are most sensitive to the volume and recency of news. Variations in TopNews inputs consistently resulted in shifts from “long” to “cash” or “short” positions or vice versa.
Selective Impact of Daily Close Data: While the history length of DailyClose data showed some impact—particularly under conditions of uncertainty (e.g., when the model favored a cash position)—this influence diminishes when supplemented with an extended history of news coverage.
Minimal Impact of Long-Term Financial Metrics: Variations in the look-back periods for quarterly EBIT and total revenue had little to no effect on the final decision. The corresponding heatmap stays constant, indicating high consistency regardless of the historical window length.
These results suggest that the LLM places a disproportionate emphasis on recent news, potentially overshadowing the stable signals provided by long-term numerical data. This behavior underscores the need for carefully balancing news inputs to mitigate undue volatility in forecasts.
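The agreement metric behind the heatmaps can be computed with a few lines; the example decisions below are made up for illustration:

```python
from collections import Counter

def agreement_fraction(decisions: list[str]) -> float:
    """Fraction of runs matching the majority-vote decision for one (symbol, date)."""
    counts = Counter(decisions)
    _, top = counts.most_common(1)[0]
    return top / len(decisions)

# e.g. varying the TopNews window flips one of four runs from long to cash:
print(agreement_fraction(["long", "long", "cash", "long"]))  # prints 0.75
```

A cell value of 1.0 means the decision is invariant to the look-back window; the closer it drops to the chance level, the more window-sensitive the forecast is.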
Why Does News Drive More Instability?
Textual data, such as recent news, is inherently rich in context and often ambiguous. A single, sensational headline can disproportionately influence the model’s output, even when long-term numerical indicators suggest a different trend. Two key challenges are:
Context Limits: LLMs operate within finite context windows. When too many news items are introduced, the model must truncate or summarize, which can distort the intended input.
Model Distraction: An overabundance of news can dilute the impact of more stable financial metrics. The attention mechanism might latch onto dramatic phrases or recent events, leading to “headline-chasing” predictions.
Ablation and Interpretability Methods
To pinpoint which news items or headlines most significantly affect the model’s decision, we recommend employing ablation studies and advanced interpretability techniques. By systematically removing specific news snippets from the prompt and observing the subsequent changes in output, you can identify the most influential pieces. Additionally, analyzing the attention layers (using tools such as attention rollout or integrated gradients) can reveal how the model weights different parts of the news input, offering deeper insights into its internal reasoning processes.
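A leave-one-out ablation of this kind can be sketched as below. The `query_llm` function is a stub standing in for an actual model call, and the headlines are invented:

```python
# Leave-one-out ablation over news snippets. query_llm is a stub that
# mimics a model reacting to a dramatic headline; in practice it would
# send the reduced prompt to the LLM and parse its decision.

def query_llm(news_items: list[str]) -> str:
    # Stub: pretend the model goes short whenever a "lawsuit" headline is present.
    return "short" if any("lawsuit" in n for n in news_items) else "long"

def ablate_news(news_items: list[str]) -> list[str]:
    """Return the snippets whose removal changes the model's decision."""
    baseline = query_llm(news_items)
    influential = []
    for i, item in enumerate(news_items):
        reduced = news_items[:i] + news_items[i + 1:]
        if query_llm(reduced) != baseline:
            influential.append(item)
    return influential

news = ["Earnings beat estimates", "Company faces lawsuit", "CEO interview"]
print(ablate_news(news))  # prints ['Company faces lawsuit']
```

With a real model, each ablation costs one extra query, so for long news lists it is common to ablate clusters of related headlines rather than individual items.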
Selecting the “Right” News: A Non-trivial Challenge
Choosing which news articles to include is a nuanced task. Here are some best practices:
Relevance Filtering: Employ keyword matching, thematic classification, or sentiment analysis to ensure that only articles closely related to the stock or market segment are included.
Impact-Based Summarization: Summarize critical events—such as earnings announcements, mergers, acquisitions, or regulatory changes—while deprioritizing less impactful news.
Temporal Weighting: Recent news is typically more relevant than older reports. Consider applying heavier weights to the most current articles while summarizing older items.
Topic Clustering: Group similar news items and select representative summaries to avoid redundancy, ensuring that the model’s prompt remains focused and within context limits.
These strategies help reduce the “noise” from less relevant headlines, allowing the LLM to concentrate on the signals that truly matter.
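Combining relevance filtering with temporal weighting might look like the sketch below. The keyword set, half-life, and sample headlines are all assumptions for illustration; production systems would use a classifier or embedding similarity instead of keyword matching:

```python
from datetime import date

# Illustrative relevance/recency scoring; the keyword list and 5-day
# half-life are arbitrary choices, not tuned values.
KEYWORDS = {"earnings", "merger", "acquisition", "regulator", "guidance"}

def score_article(headline: str, published: date, today: date,
                  half_life_days: float = 5.0) -> float:
    """Keyword-based relevance weighted by exponential recency decay."""
    words = [w.strip(".,").lower() for w in headline.split()]
    relevance = sum(1 for w in words if w in KEYWORDS)
    age_days = (today - published).days
    recency = 0.5 ** (age_days / half_life_days)
    return relevance * recency

today = date(2024, 6, 1)
articles = [
    ("Regulator probes merger terms", date(2024, 5, 31)),
    ("Earnings guidance raised", date(2024, 5, 20)),
    ("CEO gives campus talk", date(2024, 5, 31)),
]
ranked = sorted(articles, key=lambda a: -score_article(a[0], a[1], today))
print(ranked[0][0])  # the fresh, keyword-rich headline ranks first
```

The top-ranked articles (or their summaries) would then populate the TopNews slot of the prompt, keeping it focused and within the context budget.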
Conclusion and Next Steps
Prediction and reasoning stability are foundational to building trust in GenAI-driven finance models. Our heatmap experiments reveal that while historical price data and financial statements yield stable outputs over varying look-back windows, the volume of recent news can dramatically sway the LLM’s forecasts.
To optimize the use of LLMs in financial analysis:
Acknowledge and Manage Context Limitations: Use summarization and filtering strategies to avoid overwhelming the model.
Analyze Attention Shifts: Leverage interpretability tools to understand which news items drive major output changes.
Prioritize Relevance: Focus on high-impact news that aligns with the underlying financial signals.
Consider Architectural Nuances: Utilize advanced features, such as retrieval augmentation, when approaching context window limits.
By carefully balancing quantitative data with curated, high-impact news, you can mitigate volatility in LLM-based forecasts—ultimately building more reliable and stable models for trading and investment decisions.
Author’s Note:
This blog post is intended for informational and educational purposes only, and should not be construed as financial or investment advice. Always perform your own research or consult a qualified advisor before making financial decisions.