How Stable Are GPT Models in Stock Market Analysis?

Insights on AI Decision-Making

The stock market is a complex system where interpreting financial data (e.g., from MarketWatch), market trends, and investor sentiment requires precision and adaptability. Leveraging GPT-based models to predict stock direction (go "long," go "short," or hold "cash") offers a novel approach to decision-making. But how stable are these decisions across different models, temperatures, and prompts? Does instability reflect the unpredictability of the market, or does it reveal nuances in how GPT interprets ambiguous inputs?

In this blog, we explore the stability of GPT-generated stock forecasts using a sample of predictions. We compare models (GPT-3.5-turbo, GPT-4o-mini, GPT-4o) and temperature settings. Through visualizations and actionable insights, we uncover patterns in decision stability and offer guidelines for crafting robust prompts.

Methodology

Data Collection

We sampled the “Indicators 101” prompt across 9 different stock–date combinations. See this example prompt for Airbnb.

  • Input: Company description, recent closing prices, quarterly revenues and EBITs, and recent news.

  • Output: The model's forecasted decision ("long," "short," or "cash") and reasoning.

Experimental Setup

We tested how predictions varied under:

  • Models: GPT-3.5-turbo, GPT-4o-mini, GPT-4o.

  • Temperature settings: 0.0 (deterministic), 0.5 (balanced), and 1.0 (creative).

Metrics:

  • Decision stability: Consistency in "long/short/cash" predictions across runs.

  • Reasoning similarity: Measured using cosine similarity of embedding vectors for reasoning text.

Each model–temperature combination was evaluated 9 times to assess the stability of both the decision and the reasoning.
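To make the setup concrete, here is a minimal sketch of such an evaluation loop, assuming the OpenAI Python SDK. The prompt text, the decision-parsing logic, and the majority-share stability score are illustrative assumptions, not the exact harness used in this study.

```python
# Minimal sketch of the stability experiment (assumed setup, not the exact
# harness used in this study). Requires: pip install openai
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o"]
TEMPERATURES = [0.0, 0.5, 1.0]
N_RUNS = 9

# Placeholder prompt; the real "Indicators 101" prompt includes the company
# description, closing prices, quarterly revenues/EBITs, and recent news.
PROMPT = "Given the data above, decide: long, short, or cash. Explain briefly."

def get_decision(model: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = resp.choices[0].message.content.lower()
    # Naive parsing: take the first decision keyword that appears.
    return next((d for d in ("long", "short", "cash") if d in text), "unparsed")

for model in MODELS:
    for temp in TEMPERATURES:
        decisions = Counter(get_decision(model, temp) for _ in range(N_RUNS))
        # One plausible stability score: share of runs matching the majority.
        stability = decisions.most_common(1)[0][1] / N_RUNS
        print(f"{model} @ T={temp}: {dict(decisions)} (stability={stability:.2f})")
```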

Findings

1. Decision Stability Across Models and Temperatures

We observe that:

  • Lower temperatures (0.0) produce more stable predictions, as expected. The anomaly for GPT-4o could be due to the small sample size of only 9 prompts.

  • Higher temperatures (0.5, 1.0) increase variability, especially for ambiguous or complex prompts.

  • Variability most often occurs between "long"/"cash" and between "short"/"cash" decisions. In only a single instance in the sample did GPT produce all three decisions (long, cash, and short) for the same prompt across evaluations.

  • Model comparison:

    • GPT-3.5-turbo exhibited the most variability at higher temperatures.

    • GPT-4o showed improved stability, likely owing to its stronger reasoning capabilities.

    • GPT-4o-mini was, surprisingly, the most stable model of the three.

We also observe that, even at zero temperature, different models reach different decisions for the same prompt.

2. Reasoning Similarity Across Models and Temperatures

We observe:

  • For simple prompts, the reasoning remained consistent across models and temperatures, even if decisions varied.

  • For complex prompts, higher temperatures introduced divergent reasoning paths, indicating that GPT explored multiple plausible interpretations.
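For reference, the reasoning-similarity metric can be reproduced along these lines: embed each run's reasoning text and average the pairwise cosine similarities. The embedding model below (text-embedding-3-small) is an assumption; the post does not specify which one was used.

```python
# Sketch of the reasoning-similarity metric: mean pairwise cosine similarity
# of embedded reasoning texts.
from itertools import combinations

import numpy as np
from openai import OpenAI

client = OpenAI()

def reasoning_similarity(reasonings: list[str]) -> float:
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # assumed; not specified in the post
        input=reasonings,
    )
    vectors = [np.array(item.embedding) for item in resp.data]
    sims = [
        float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in combinations(vectors, 2)
    ]
    return float(np.mean(sims))

# Example: similarity of three runs' reasoning for the same (hypothetical) prompt.
print(reasoning_similarity([
    "Revenue growth and positive news flow support a long position.",
    "Strong quarterly revenues and upbeat sentiment justify going long.",
    "Mixed signals: solid revenue but negative news; staying in cash.",
]))
```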

Insights and Interpretations

1. What Does Instability Indicate?

  • Market Ambiguity: When inputs lead to unstable outputs, it may reflect the inherent ambiguity or conflicting signals in the market. For instance, mixed economic indicators might be interpreted as both bullish and bearish.

  • LLM Creativity: At higher temperatures, models generate more diverse reasoning paths, which can reveal alternative market scenarios.

2. LLM Activation Patterns and Creativity

Higher temperatures flatten the model's sampling distribution, drawing on broader latent patterns from its training data and leading to:

  • Increased exploration of less probable but plausible reasoning paths.

  • Divergent decisions that reflect multiple ways to interpret the same context.
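Mechanically, "temperature" rescales the model's output logits before sampling: probabilities are computed as softmax(logits / T). A quick, self-contained illustration follows; the logits below are made up, and a real decision emerges over many tokens, but the effect is the same.

```python
# How temperature reshapes next-token probabilities: softmax(logits / T).
# Higher T flattens the distribution, so less likely continuations get sampled.
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / max(temperature, 1e-8)  # guard against T = 0
    exp = np.exp(scaled - scaled.max())       # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([3.0, 1.5, 0.5])  # hypothetical logits for long/cash/short
for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: long/cash/short = {np.round(probs, 3)}")
# T=0.1 concentrates nearly all mass on "long"; T=1.0 spreads it out.
```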

3. Role of Interpretability in LLMs

Understanding why a model chooses a particular reasoning path will require interpretability tools, such as attention heatmaps or embedding visualizations. These tools can help:

  • Identify factors driving decision variability.

  • Evaluate whether reasoning aligns with expected market logic.

We will explore these topics in a future post.

Recommendations for Crafting Robust Prompts

  • Be Specific: Explicitly define the focus area (e.g., "Analyze the impact of rising interest rates on tech stocks").

  • Use Constraints: Limit the scope of the analysis to reduce ambiguity.

  • Test Across Temperatures: Evaluate prompts at multiple temperature settings to assess how stable the prompt is.
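One way to operationalize that last recommendation: treat a prompt as robust only if its majority decision holds a high share of runs at every temperature. The 0.9 threshold below is illustrative, not derived from this study.

```python
# Illustrative robustness check: a prompt "passes" only if, at every
# temperature, the majority decision accounts for at least `threshold`
# of the runs.
from collections import Counter

def is_robust(decisions_by_temp: dict[float, list[str]], threshold: float = 0.9) -> bool:
    for decisions in decisions_by_temp.values():
        majority_share = Counter(decisions).most_common(1)[0][1] / len(decisions)
        if majority_share < threshold:
            return False
    return True

# Hypothetical results from 9 runs per temperature:
runs = {
    0.0: ["long"] * 9,
    0.5: ["long"] * 8 + ["cash"],
    1.0: ["long"] * 7 + ["cash"] * 2,
}
print(is_robust(runs))  # False: 8/9 ≈ 0.89 at T=0.5 already falls below 0.9
```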

Conclusion

Our analysis reveals that GPT models are impressively consistent in their reasoning, with decision agreement exceeding 90% even at higher temperatures. However, as expected, increasing the temperature introduces variability, allowing the LLM to explore diverse interpretations of the same market context.

Among the models, GPT-4o-mini stands out as the most stable. This can likely be attributed to its smaller size, which limits the number of activation paths available for generating alternative interpretations, thereby reducing variability in its predictions.

A key insight from this study is the value of evaluating a prompt multiple times across a range of temperatures. This approach helps assess reasoning stability and raises a critical question: does instability indicate an inherently uncertain market state, or is it a reflection of the model's interpretive flexibility?

As GPT models continue to evolve, the exploration of more advanced interpretability tools such as SHAP or LIME will be essential. Such tools could shed light on the internal dynamics driving decision variability and further enhance the reliability of AI-driven financial decision-making.

Discover live financial reports by GPT Analyst here.
