- GPT Analyst Newsletter
Which LLM Is Best for Financial Analysis Questions?
An Evaluation Across Leading Chatbots

Introduction
The world of financial analysis requires sifting through mountains of data—annual reports, quarterly filings, investor presentations, market analyses, and more. Today, many professionals and individual investors are looking to Large Language Model (LLM) chatbots for help. Tools like ChatGPT, Claude, Gemini, and Perplexity promise to streamline analysis by generating clear and structured insights. But how do these LLMs actually perform when tasked with real financial questions?
In this article, we’ll share our detailed evaluation of several top LLM chatbots for financial analysis. We’ll highlight the scoring criteria we used, present an overview table of results, and discuss which bot offers the best blend of correctness, recency, data source quality, calculation ability, and more. Ultimately, you’ll walk away knowing which LLM might best suit your financial research needs—and how to get the most out of any AI tool you choose.
The LLMs We Tested
We zeroed in on four leading LLM-based chatbots:
ChatGPT 4o – Known for its user-friendly interface, depth of knowledge, and well-structured responses.
Gemini 1.5 Flash – Google’s lightweight, speed-optimized LLM, aimed at combining advanced reasoning with access to real-time data.
Claude Sonnet 3.5 – Emphasizes an ethical, conversational approach but has varied performance for specific financial metrics.
Perplexity Default – Specialized in quick question-answering but may show gaps in financial metric calculations.
We put each of these chatbots to the test by asking a range of fundamental and advanced financial questions, from revenue growth trends to specific metrics like Net Debt to EBITDA. We also paid close attention to how they handled data sources, the accuracy and recency of their outputs, and whether or not they provided references.
Questions We Asked
Internal Fundamental Analysis Prompts | Answer Required |
---|---|
What has been the revenue growth trend over the past 3-5 years? | Yes, in %; annual growth |
Which product segment of the company is having the largest impact on sales? | Name the segment |
Which product segment of the company is having the largest impact on profits? | Name the segment |
Which product segment of the company is having the largest impact on profits? | Impact as % of total profits |
Which geographical segment of the company is having the largest impact on sales? | Name the geographic segment |
Which geographical segment of the company is having the largest impact on profits? | Name the geographic segment |
What is the impact of the just-mentioned segment (in the previous row) on the profits of the company? | Impact as % of total profits |
How have operating margins evolved? | Improved or deteriorated |
Margin improvement of the last twelve months vs. the previous 12-month period? | In % of revenue |
Could you tell me Net Debt to EBITDA and Net Debt to Equity? | Show the two figures separated by a comma |
What are the most important company-specific KPIs, and how have they performed? | |
Most Important KPI | Describe |
Second Most Important KPI | Describe |
Third Most Important KPI | Describe |
Is the stock being affected by a market-wide or company-specific scenario? | Describe |
How long do you expect this scenario to last? | Number of months |
Our Scoring Criteria
To quantify performance, we measured the chatbots across eight aspects:
Correctness (Weight = 0.3)
How accurate were the responses, especially regarding financial data and analyses?
Recency (Weight = 0.2)
Did the chatbot leverage the latest available information?
Source Quality (Weight = 0.2)
Did it reference credible data sources (like official 10-K filings or investor relations pages)?
Calculation Ability (Weight = 0.1)
Could it correctly compute or approximate financial metrics rather than just describe them?
Question Fidelity (Weight = 0.05)
Did it fully address the user’s specific question, or were there gaps?
Formatting (Weight = 0.05)
How well did it structure its answers (tables, bullet points, clarity)?
Cost (Weight = 0.05)
Based on external data (e.g., artificialanalysis.ai/models), how do subscription or usage fees compare?
Speed (Weight = 0.05)
How quickly can you get comprehensive responses?
We then multiplied each chatbot’s performance in these categories by the respective weights to arrive at an overall score.
Overview Table of LLM Scoring
Below is our summary table, capturing the results we observed when putting each chatbot through identical queries on financial analysis:
Aspect | Weight | ChatGPT 4o | Gemini 1.5 Flash | Claude Sonnet 3.5 | Perplexity Default |
---|---|---|---|---|---|
Correctness | 0.3 | 5 | 4 | 3 | 2 |
Recency | 0.2 | 5 | 2 | 2 | 2 |
Source Quality | 0.2 | 3 | 5 | 1 | 2 |
Calculation Ability | 0.1 | 4 | 5 | 4 | 0 |
Question Fidelity | 0.05 | 5 | 5 | 2 | 5 |
Formatting | 0.05 | 5 | 5 | 2 | 3 |
Cost | 0.05 | 4 | 5 | 3 | 3 |
Speed | 0.05 | 5 | 4 | 4 | 4 |
Final Score | 1.00 | 4.45 | 4.05 | 2.45 | 2.15 |
Note: Cost and speed data were derived from public sources.
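The final scores above are simply weighted sums of the per-aspect ratings. A quick sketch reproducing them from the table's data:

```python
# Weighted-sum scoring: each aspect rating (0-5) is multiplied by its
# weight, and the products are summed into a final score per chatbot.
weights = {
    "Correctness": 0.30, "Recency": 0.20, "Source Quality": 0.20,
    "Calculation Ability": 0.10, "Question Fidelity": 0.05,
    "Formatting": 0.05, "Cost": 0.05, "Speed": 0.05,
}

# Per-aspect ratings, in the same order as the weights dict above.
scores = {
    "ChatGPT 4o":         [5, 5, 3, 4, 5, 5, 4, 5],
    "Gemini 1.5 Flash":   [4, 2, 5, 5, 5, 5, 5, 4],
    "Claude Sonnet 3.5":  [3, 2, 1, 4, 2, 2, 3, 4],
    "Perplexity Default": [2, 2, 2, 0, 5, 3, 3, 4],
}

for bot, ratings in scores.items():
    final = sum(w * r for w, r in zip(weights.values(), ratings))
    print(f"{bot}: {final:.2f}")
# ChatGPT 4o: 4.45, Gemini 1.5 Flash: 4.05,
# Claude Sonnet 3.5: 2.45, Perplexity Default: 2.15
```

Because the weights sum to 1.0, a final score lands on the same 0-5 scale as the individual ratings.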
Discussing the Results
1. ChatGPT 4o – Overall Winner
Score: 4.45
Strengths: Excellent correctness and recency. Answers are usually comprehensive and logically structured, making it suitable for nuanced questions about financial statements. Its formatting is top-notch, especially for presenting data in tables or bullet lists.
Weaknesses: Tends to under-prioritize official investor relations pages as sources, and sometimes lacks full transparency about where data is retrieved.
2. Gemini 1.5 Flash – Runner-Up
Score: 4.05
Strengths: Outstanding source selection (often prioritizes 10-K filings or official announcements) and strong calculation ability. If your focus is verified, high-quality data, Gemini shines. Cost structure also appears competitive.
Weaknesses: Recency can be hit or miss, as the model occasionally references older data or avoids making fully integrated multi-quarter calculations.
3. Claude Sonnet 3.5 – Niche Conversationalist
Score: 2.45
Strengths: Great at generating friendly, human-like conversation and “soft” analyses. If you’re looking for a high-level overview or brainstorming session, Claude excels.
Weaknesses: Often provides incomplete details for specific metrics and rarely discloses sources. Calculation ability is decent but overshadowed by a weaker data pipeline.
4. Perplexity Default – Quick but Limited
Score: 2.15
Strengths: Speedy responses and can rapidly spit out surface-level information. Its direct question-answer style is handy for quick lookups.
Weaknesses: Lacks robust calculation capabilities. Often returns outdated data or incomplete financial figures. Source referencing is not always transparent.
Key Observations
Well-Structured Responses
Overall, LLMs today generate neat, orderly answers. They’ll often break down a financial analysis into fundamentals, technicals, and risk assessments. However, they still miss nuances like how macroeconomic indicators (interest rates, inflation, political risks) might affect a company’s earnings and stock price.
Data Source Inconsistencies
We noticed that only Gemini consistently sought out primary sources, such as official filings or investor relations websites. The others tended to pull data from unvetted websites or press articles, which can lead to inaccuracies—especially for recent or quickly changing metrics.
Calculation of Financial Metrics
While each LLM can describe metrics like revenue growth or net debt to EBITDA, they often struggle to provide accurate, current calculations. If you need the latest 12-month rolling metrics, you may need to do the math yourself, especially if you don’t trust the model to combine quarterly data.
Usefulness of Identifying KPIs
Most LLMs excel at highlighting which KPIs matter for a given industry or company. This “soft” value is especially helpful for newcomers to finance or anyone wanting a checklist of important performance drivers—like margins, customer retention, or market share.
Limited Integration of Multiple Factors
The bots rarely synthesize fundamental, technical, and macro data into a single recommendation. Instead, they list each factor and leave the final conclusion to the user.
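The "do the math yourself" advice is easy to act on for the leverage metrics in our question set. A minimal sketch, using hypothetical figures (all numbers here are invented for illustration; in practice you would pull them from the latest 10-K or 10-Q):

```python
# Hypothetical trailing-twelve-month figures, e.g. in $M (illustrative only).
total_debt = 12_000
cash_and_equivalents = 3_000
ebitda = 4_500
shareholders_equity = 15_000

# Net debt nets cash against gross debt before comparing it to
# earnings power (EBITDA) or the book value of equity.
net_debt = total_debt - cash_and_equivalents
net_debt_to_ebitda = net_debt / ebitda            # leverage vs. cash earnings
net_debt_to_equity = net_debt / shareholders_equity  # leverage vs. book equity

# The two figures, comma-separated, as our prompt requested.
print(f"{net_debt_to_ebitda:.2f}, {net_debt_to_equity:.2f}")  # 2.00, 0.60
```

Doing this arithmetic outside the chatbot also makes it easy to sanity-check whichever quarterly figures the model claims to have combined.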
Conclusion
Our evaluation reveals that ChatGPT 4o currently holds the top spot for financial analysis inquiries, boasting the highest overall score at 4.45. However, Gemini 1.5 Flash follows closely (4.05), leading in source quality and calculation ability—making it ideal for those who value official filings and precise metrics. Claude Sonnet 3.5 and Perplexity Default serve more niche roles, with Claude’s natural conversation style appealing for brainstorming, and Perplexity’s speed suiting quick lookups rather than in-depth analysis.
Ultimately, your choice may depend on your priorities. If cost control is crucial, Gemini and Perplexity might be more appealing. If correctness and clarity are paramount, ChatGPT 4o takes the crown. And if you need thorough conversation or quick answers without much depth, Claude and Perplexity respectively fit the bill.
Regardless of the LLM you pick:
Use Highly Specific Prompts. Specify the exact metric, timeframe, and data source you want.
Calculate Quantitative Metrics Yourself. LLMs excel at qualitative analysis but may falter with real-time number crunching.
By adopting this approach, you can leverage AI-driven chatbots for faster, broader research while retaining your own expertise for final investment decisions.
Want to Learn More?
Check out ChatGPT for in-depth, structured financial Q&A.
Explore Gemini if official data sources and calculation accuracy are paramount to your research.
Consider Claude for brainstorming and big-picture discussions.
Use Perplexity when you need fast, straightforward answers.
We hope you found our deep dive helpful. Have you tested any LLM chatbots for financial analysis? Let us know your experience and which one emerged as your go-to for crunching the numbers and generating insights!
This article is for informational purposes only and does not constitute financial advice. Always do your own due diligence.