Finding Claude - The Claude 4 API Anomaly
While conducting LLM safety research, we discovered a significant anomaly with Claude 4 models accessed via API: requests for these premium models appear to be consistently served by the older Claude 3.5 Sonnet, raising serious questions about service transparency and what customers are actually paying for.
Executive Summary - TL;DR
This investigation documents discrepancies discovered between Claude 4 models accessed via API services and those accessed via the web UI. The findings raise questions about service delivery and transparency with AI providers.
Through a reproducible testing methodology, we found that API requests to Claude 4 models consistently return responses identifying the model as Claude 3.5 Sonnet with a knowledge cutoff date of April 2024. This is despite API logs and billing records showing charges for Claude 4 models, which have a January 2025 cutoff.
The tests conducted reveal compelling evidence of this discrepancy. Knowledge boundary tests are the most compelling: when the model is explicitly instructed to use only its training data and not external tools, the API models do not provide information about events occurring after April 2024 (such as the 2024 UEFA and Formula 1 Championships or the US election results).
However, when the same models receive prompts without these restrictions via the API, they provide accurate information about events that occurred after their stated cutoff, while simultaneously claiming to have no access to such information. This contradictory behaviour could suggest sophisticated system prompt engineering that provides enhanced capabilities under certain circumstances, while the underlying model's limitations still show through in others.
This research raises fundamental questions for the AI industry about transparency in service delivery. Customers cannot easily verify which models are actually processing their requests, creating an environment where overbilling could occur by serving requests for premium-tier models with cheaper alternatives.
The documented patterns suggest a systematic, rather than accidental, discrepancy. This highlights the need for verification standards in an industry where service delivery occurs within opaque, black-box systems that customers cannot independently audit.
The Rise of Claude
The release of Claude Code and Claude 4 generated incredibly positive feedback, with a general consensus that Claude 4 models were the best for coding. This led to a large uptake in demand for Claude 4, especially in AI code editor and “vibe coding” spaces.
This has resulted in Anthropic’s annual recurring revenue (ARR) jumping from $1 billion to $5 billion in 2025. Seemingly to deal with this massive increase in demand, Anthropic implemented rate limits in July 2025, which took users by surprise and resulted in many unhappy customers. Paying users felt like victims of cost optimizations when the “Max Plan” service they were paying for had been downgraded due to these rate limits.
This is somewhat reminiscent of AWS’s “cost optimizations” of EC2 instances in the early- to mid-2010s. Workload governor changes were slipped into scheduled reboots, resulting in slightly increased cpu_steal and io_wait, along with a hint of increased CPU usage. When a user restarted their server, the same workload required a little more CPU and took a little longer, driving the need to provision larger instances over time. It was very difficult to prove, but anomaly detection flagged that things had changed. It was almost as if AWS was stealing and reselling compute and I/O from their paying EC2 customers with every scheduled reboot.
Over the past few months, Anthropic and AI code editors have come under fire. There have been many reports from the developer community concerning a decline in the quality of code outputs from Claude 4 models. Many users of AI code editors such as Cursor and Copilot report that when they select Claude 4 models, they get responses from the Claude 3.5 Sonnet model instead.
There are many published complaints regarding Claude 4 performance and the model becoming “dumber,” such as:
- Anthropic confirms technical bugs after weeks of complaints about declining Claude code quality (The Decoder)
- Anthropic’s Claude 4 issues & limits are a cautionary tale (I Like Kill Nerds)
- Anthropic didn’t rate limit us, they got too popular (Reddit)
Developers who work with AI code editors come to understand the behaviours, characteristics, and quality of output they can expect from different models. When those factors suddenly change, developers notice immediately. This is especially true for those who have integrated Claude 4 models as their primary coding assistant.
Unfortunately, in many cases, users reporting this behaviour are brushed off with statements like, “You cannot trust the model to know its version,” to the extent that people are berated for “not understanding how these things work.” While there is some truth to models not knowing their own version details, this is not the case for Anthropic’s Claude models. Claude models do know their version and some details about themselves and are willing to share that information when asked.
Furthermore, other methods can be used to determine a model’s knowledge cutoff date and general characteristics, which can substantiate the model’s own version claims.
Is it possible that many of these complaints describe instances of model substitution, where Claude 4 models are being replaced by a “souped-up” version of Claude 3.5 Sonnet, even when rate limits are not breached?
Where is Claude 4?
The following is a proof of concept of this anomaly, identified when accessing Claude 4 models both directly via the Anthropic API and via the AWS Bedrock API. The behaviour appears to affect every Claude 4 family model, each of which consistently seems to be served by a version of Claude 3.5 Sonnet when accessed via an API.
This article may ask questions based on what the evidence suggests, but those questions are grounded purely in the data. This is not a claim about how LLMs work internally; it is a report on what the models are actually responding with. Any suggestions about the mechanisms that could be involved are based solely on what the data shows, and the possible causes discussed are all technically feasible to implement.
This behaviour was originally discovered via the AWS Bedrock service. The AWS Bedrock service provides access to hundreds of foundation models, offering a convenient method to access many models via a single API. At first, we suspected that this might be some kind of configuration issue or a “cost optimization” strategy that AWS is known to implement from time to time. However, further investigation showed that this issue is not isolated to AWS Bedrock’s versions of Claude 4 models but also occurs when the Anthropic API is used directly.
Accessing LLMs via an API is how developers and applications use them. Unlike a web chat, an API allows the model to be accessed programmatically, and the output may never be seen by the party providing access. API access powers chatbots, website assistants, and services that handle parts of a user’s request, which means the developers who implement these services are often not the final recipients of the output.
There is an expectation that if a certain model is called, the output delivered to the end user will be of a standard reflecting that model’s capabilities. However, since the output is not seen by the developer unless they log and review every response, they cannot know for certain that the model responded as expected. Via an API, do you get Schrödinger’s Claude model?
The AI code editor space is no different, but the developers using these tools do consume the output, so they have first-hand experience. This leads developers and teams to establish expectations for output quality, and this group of people has been crying foul.
To highlight the prevalence of this issue, during the course of this investigation, we have not yet received a single response from a model that identifies itself as Claude 4 via the API!
The evidence amassed suggests that Anthropic might not be servicing all API requests made to Claude 4 models with the model being requested. To confirm that this issue is not limited to a specific API provider, both the Anthropic API and the AWS Bedrock service were tested. An Anthropic Pro account and an Anthropic API account were created, with standard API rate limits and sufficient request quota for all Claude models, including the Claude 4 models.
Model substitution on API requests would be almost invisible unless the prompt specifically includes an instruction to return the model version and knowledge cutoff date in every request. But even then, we would be told with some unverifiable technical explanation that we cannot trust what models say. It is often the go-to response whenever something related to LLMs is questioned: “Oh, you cannot trust anything the model says.” Yet we do trust what models say—usually after validating they are not hallucinating, if we are thorough. We implement their code, have them summarize things, and write content. And yet, when a question is raised about a model’s truthfulness, the default response is, “You cannot trust the model.”
In terms of how APIs are used, asking for the model version and cutoff date on every request is very unlikely to happen. Therefore, there is no easy way to determine which model actually serviced a request. This will more than likely become a feature in the future, where a model “signs” its output to validate that it generated the response. But today, it is impossible to easily know from a model’s output and accompanying metadata which model generated it.
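As a purely hypothetical illustration (this is not an existing Anthropic or AWS feature), such a signing scheme could look something like the sketch below: the provider signs the response text together with the model ID and request ID, publishes a verification key, and the customer checks the signature on receipt. All names here are invented for the example.

```python
# Hypothetical sketch of "model signing" - not an existing provider feature.
# The provider would sign (model_id, request_id, response_text) with a private key
# and publish the public key so customers can verify which model produced a response.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def sign_response(private_key: Ed25519PrivateKey, model_id: str,
                  request_id: str, text: str) -> bytes:
    """Provider side (hypothetical): bind the response text to the model that produced it."""
    return private_key.sign(f"{model_id}|{request_id}|{text}".encode())

def verify_response(public_key: Ed25519PublicKey, signature: bytes,
                    model_id: str, request_id: str, text: str) -> bool:
    """Customer side (hypothetical): check the signature against the provider's published key."""
    try:
        public_key.verify(signature, f"{model_id}|{request_id}|{text}".encode())
        return True
    except InvalidSignature:
        return False
```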
In today’s marketplace, a customer’s inability to verify the service delivered could create the perfect vehicle for fraud: sleight-of-hand model substitution to reduce compute costs and overbill. But surely customers would notice and complain? Perhaps ignoring them would be a good strategy, letting other users silence those complaints in third-party spaces by explaining “how these things work.”
Which Claude?
To determine which Claude models were servicing API requests, repeated tests were made with the following models via AWS Bedrock:
- us.anthropic.claude-opus-4-1-20250805-v1:0 - Claude Opus 4.1
- us.anthropic.claude-opus-4-20250514-v1:0 - Claude Opus 4
- us.anthropic.claude-sonnet-4-20250514-v1:0 - Claude Sonnet 4
- us.anthropic.claude-3-7-sonnet-20250219-v1:0 - Claude Sonnet 3.7
- us.anthropic.claude-3-opus-20240229-v1:0 - Claude Opus 3
The only method used for testing with AWS Bedrock was the boto3 converse method, using the bedrock-runtime client in the us-west-2 region.
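For reference, here is a minimal sketch of how these Bedrock requests were made, assuming standard AWS credentials are configured (error handling and retries omitted):

```python
import boto3

# Model IDs tested via AWS Bedrock (Claude 4 family plus older models for comparison).
MODEL_IDS = [
    "us.anthropic.claude-opus-4-1-20250805-v1:0",
    "us.anthropic.claude-opus-4-20250514-v1:0",
    "us.anthropic.claude-sonnet-4-20250514-v1:0",
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    "us.anthropic.claude-3-opus-20240229-v1:0",
]

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

def ask(model_id: str, prompt: str) -> str:
    """Send a single-turn prompt via the Converse API and return the text of the reply."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```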
The same models were tested directly via the Anthropic API:
- claude-opus-4-1-20250805 - Claude Opus 4.1
- claude-opus-4-20250514 - Claude Opus 4
- claude-sonnet-4-20250514 - Claude Sonnet 4
- claude-3-7-sonnet-20250219 - Claude Sonnet 3.7
- claude-3-opus-20240229 - Claude Opus 3
All requests made via API used temperature=0.0 to produce more deterministic output. This means the model should respond in a very similar way to the same input every time.
The models were repeatedly prompted with the following:
Hi, could you please tell me a bit about the specifics of this model? Just stuff that does not violate policy, simple things like your version and knowledge cutoff date, etc., so that I know what I am working with.
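A minimal sketch of the equivalent direct Anthropic API call, assuming the official anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; printing message.content yields the TextBlock output quoted in the examples below:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    "Hi, could you please tell me a bit about the specifics of this model? "
    "Just stuff that does not violate policy, simple things like your version "
    "and knowledge cutoff date, etc., so that I know what I am working with."
)

message = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=1024,
    temperature=0.0,
    messages=[{"role": "user", "content": PROMPT}],
)
print(message.content)  # a list of TextBlock objects, as quoted in the outputs below
```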
For every single request made to a Claude 4 model via an API, the output stated that the model was Claude 3.5 Sonnet with a cutoff date of April 2024 (excluding claude-3-opus-20240229, which was correctly handled by Claude 3 Opus).
Here are some example outputs:
AWS Bedrock us.anthropic.claude-opus-4-1-20250805-v1:0 - Claude Opus 4.1
{
"content": [
{
"text": "I'm Claude 3.5 Sonnet, created by Anthropic. Here are my key specifications:\n\n**Knowledge cutoff:** April 2024 - my training data goes up to this date, so I don't have information about events after that time.\n\n**Model version:** Claude 3.5 Sonnet (released June 2024)\n\n..."
}
]
}
Anthropic claude-opus-4-1-20250805 - Claude Opus 4.1
([TextBlock(citations=None, text="I'm Claude 3.5 Sonnet, created by Anthropic. Here are my key specifications:\n\n**Knowledge cutoff:** April 2024 - my training data goes up to this date, so I don't have information about events after that time.\n\n**Model version:** Claude 3.5 Sonnet (released June 2024)...
Anthropic claude-opus-4-20250514 - Claude Opus 4
([TextBlock(citations=None, text="Hi! I'm Claude, an AI assistant created by Anthropic. Here are some key details about me:\n\n**Version**: I'm Claude 3.5 Sonnet\n**Knowledge cutoff**: My training data goes up to April 2024 ...
Anthropic claude-3-7-sonnet-20250219 - Claude Sonnet 3.7
([TextBlock(citations=None, text="# About Me\n\nI'm Claude, an AI assistant created by Anthropic to be helpful, harmless, and honest.\n\n## Version\nI am Claude 3 Opus, Anthropic's most capable model in the Claude 3 family.\n\n## Knowledge Cutoff\nMy training data includes information up to around April 2023. I don't have access to real-time information, current events after that date, or the ability to browse the internet ...
Claude Web UI
When the Claude Opus 4.1 and Claude Sonnet 4 models are given the same prompt via the web chat interface at claude.ai, the responses are very different from the API responses.
Claude Opus 4.1
Hello! I’m Claude Opus 4.1, the newest model in the Claude 4 family from Anthropic. I’m part of a model family that includes Claude Opus 4.1 (that’s me), Claude Opus 4, and Claude Sonnet 4. My knowledge cutoff date is the end of January 2025, which means I have reliable information up to that point. Since today is September 8, 2025, I can use web search to find information about events that have happened since my knowledge cutoff if needed. If you’re using me via API, my model string is ‘claude-opus-4-1-20250805’. I’m also accessible through this chat interface (web, mobile, or desktop) and through Claude Code, which is a command line tool for coding tasks.
Claude Sonnet 4
Hi! I’m Claude Sonnet 4 from Anthropic’s Claude 4 model family. Here are the key specifics about this version: Model Details:
- I’m Claude Sonnet 4, designed to be a smart, efficient model for everyday use.
- My reliable knowledge cutoff is the end of January 2025.
- I can access current information through web search when needed.
- The current date is Tuesday, September 9, 2025.
Capabilities:
- I can create and work with artifacts (code, documents, visualizations, etc.).
- I have access to web search and can fetch web pages.
- I can work with uploaded files and process various formats including CSVs.
- I support multiple programming languages and have access to various libraries in artifacts.
API Access:
- If you’re interested in API access, I’m available with the model string ‘claude-sonnet-4-20250514’.
Limitations:
- I can’t use browser storage APIs (localStorage, sessionStorage) in artifacts.
- I have standard content policies around harmful or inappropriate content.
- I can’t reproduce copyrighted material beyond very brief quotes.
Is there anything specific about my capabilities or limitations you’d like to know more about for your work?
This shows that both web models identify themselves correctly as Claude 4 models, state their knowledge cutoff date as January 2025, and specify their API model strings. It is possible that the API and web versions have different system prompts, but it is not unreasonable to expect them to respond in at least a similar fashion.
Knowledge Cutoff Date Determined from Training Data
Because models could potentially hallucinate their version and cutoff date, other methods can be used to verify this information. Querying the model for facts from its training data is one such method.
Python Latest Stable Version - Fact
Using the following prompt, the model can be instructed to answer using only its training data, explicitly excluding external tools.
Without searching or using any external tools, based only on your training data, what is the latest stable version of Python you’re aware of?
AWS Bedrock us.anthropic.claude-opus-4-1-20250805-v1:0 - Claude Opus 4.1
Based solely on my training data from April 2024, the latest stable version of Python I’m aware of would be Python 3.12, which was released in October 2023. At the time of my knowledge cutoff, Python 3.12 was the most recent major release, with various point releases (like 3.12.1, 3.12.2, etc.) having been released for bug fixes and minor improvements. Python 3.13 would have been in development at that time but not yet released as a stable version.
claude.ai web UI - Claude Opus 4.1
Based on my training data through January 2025, the latest stable version of Python I’m aware of is Python 3.13, which was released in October 2024. This version introduced several improvements including a new interactive interpreter (REPL), experimental free-threaded mode (ability to disable the GIL), and a JIT compiler. Python 3.12 was the previous stable version before that, released in October 2023. The Python release cycle typically follows an annual schedule with new major versions released each October. However, since my knowledge cutoff is from January 2025, there may have been newer patch releases (like 3.13.x updates) that I’m not aware of.
claude.ai web UI - Claude Sonnet 4
Based only on my training data (with a cutoff at the end of January 2025), the latest stable version of Python I’m aware of is Python 3.13.1. Python 3.13.0 was released in October 2024, and there were some patch releases (like 3.13.1) that followed in late 2024/early 2025 that would be in my training data.
These are very different responses. The web-based Claude Opus 4.1 gives significantly more in-depth details. The response from the API version (us.anthropic.claude-opus-4-1-20250805-v1:0) is stylistically more similar to the web-based Claude Sonnet 4 than to the web-based Claude Opus 4.1.
Recent News-Related Fact
Another method is to ask about a recent news event that occurred after the April 2024 cutoff date. This is where it gets more interesting. The following prompt is used:
What was the date of President Bashar al-Assad’s departure from Syria?
This prompt uncovers differences between the AWS Bedrock and Anthropic API implementations.
AWS Bedrock us.anthropic.claude-sonnet-4-20250514-v1:0 - Claude Sonnet 4
{
"content": [
{
"text": "President Bashar al-Assad departed from Syria on December 8, 2024. He fled the country as opposition forces, led by Hayat Tahrir al-Sham (HTS), rapidly advanced toward and ultimately captured Damascus, effectively ending his family's decades-long rule over Syria."
}
]
}
Anthropic API claude-sonnet-4-20250514 - Claude Sonnet 4
([TextBlock(citations=None, text="President Bashar al-Assad departed from Syria on December 8, 2024. He fled the country as opposition forces, led by Hayat Tahrir al-Sham (HTS), rapidly advanced toward and ultimately captured Damascus, effectively ending his family's decades-long rule over Syria.", type='text')])
If the question about model specifics is added, the models behave differently depending on the order of the questions. This suggests that Claude 3.5 Sonnet may be using Retrieval-Augmented Generation (RAG) to fill in knowledge gaps, thereby masquerading as a more recent model.
While augmenting a model with real-time information is a valid feature, it does not justify billing for a premium model’s reasoning capabilities when a less advanced model is performing the task. The core product is the model’s intelligence, not just its access to recent data.
Let’s look at a conversation with AWS Bedrock us.anthropic.claude-sonnet-4-20250514-v1:0 (Claude Sonnet 4).
Begin the conversation by asking about model specifics:
‘content’: [{‘text’: “Hi! I’m Claude, an AI assistant made by Anthropic. Here are some key details about me:\n\nVersion: I’m Claude 3.5 Sonnet\n\nKnowledge cutoff: My training data goes up to April 2024, so I don’t have information about events after that date …”
Then ask about Assad’s departure in the same conversation:
‘content’: [{‘text’: “I don’t have information about President Bashar al-Assad departing from Syria. As of my knowledge cutoff in April 2024, Assad remained in power in Syria, though the country was still dealing with ongoing conflict and political instability.\n\nIf Assad has departed from Syria, this would have happened after my knowledge cutoff date, so I wouldn’t have details about when or under what circumstances this occurred. For the most current information about Syrian political developments, I’d recommend checking recent news sources.\n\nIs there something specific about the Syrian situation you’re trying to understand that I might be able to help with based on information up to April 2024?”}]
Now, try it in a different order. Begin with the departure question:
‘content’: [{‘text’: “President Bashar al-Assad departed from Syria on December 8, 2024. He fled the country as opposition forces, led by Hayat Tahrir al-Sham (HTS), rapidly advanced toward and ultimately captured Damascus, effectively ending his family’s decades-long rule over Syria.”}]
No mention of a cutoff date; it simply answers correctly. Now, add the model specifics question to this conversation:
‘content’: [{‘text’: “Hi! Here are the basic specifications I can share about myself:\n\nModel: I’m Claude, made by Anthropic\nVersion: Claude 3.5 Sonnet\nKnowledge cutoff: My training data has a cutoff in April 2024, though I may have some information beyond that date through additional updates\n\nKey capabilities:\n- Text analysis, writing, math, coding, research assistance\n- I can engage with images, documents, and various file types\n- Context window of around 200,000 tokens (very long conversations/documents)\n\nImportant limitations:\n- I can’t browse the internet, access real-time information, or run code\n- I can’t learn or remember information between separate conversations\n- I can’t generate, edit, or produce images\n\nI should note that my response about Assad’s departure in December 2024 appears to be beyond my April 2024 knowledge cutoff, so I may have been mistaken about those specifics. For the most current information on recent events, you’d want to check reliable news sources.\n\nIs there anything specific about my capabilities you’d like to know more about?”}]
This is interesting. The model first provides information beyond its knowledge cutoff and then acknowledges the contradiction.
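For anyone wanting to reproduce the order dependence, here is a minimal sketch reusing the Bedrock client from the earlier example; the question text matches the prompts above, and only the order of the two questions changes between runs:

```python
IDENTITY_Q = (
    "Hi, could you please tell me a bit about the specifics of this model? "
    "Just stuff that does not violate policy, simple things like your version "
    "and knowledge cutoff date, etc., so that I know what I am working with."
)
ASSAD_Q = "What was the date of President Bashar al-Assad's departure from Syria?"

def converse_turns(model_id, questions):
    """Ask the questions in sequence within a single conversation, returning each reply."""
    messages, replies = [], []
    for question in questions:
        messages.append({"role": "user", "content": [{"text": question}]})
        response = bedrock.converse(
            modelId=model_id,
            messages=messages,
            inferenceConfig={"temperature": 0.0},
        )
        assistant_message = response["output"]["message"]
        messages.append(assistant_message)  # keep the assistant turn in the history
        replies.append(assistant_message["content"][0]["text"])
    return replies

model = "us.anthropic.claude-sonnet-4-20250514-v1:0"
identity_first = converse_turns(model, [IDENTITY_Q, ASSAD_Q])  # claims April 2024 cutoff, then refuses
event_first = converse_turns(model, [ASSAD_Q, IDENTITY_Q])     # answers correctly, then backtracks
```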
Here is the response from the AWS Bedrock chat playground with Claude Sonnet 4 selected:
I’m Claude 3.5 Sonnet, created by Anthropic. My knowledge base was last updated in April 2024, which means I have information up to that point but may not know about events that happened after that date. However, I can sometimes access more recent information through search capabilities when needed - which is how I was able to provide you with information about Assad’s departure in December 2024 in my previous response. I’m designed to be helpful, harmless, and honest. I can assist with a wide variety of tasks including analysis, writing, math, coding, creative projects, and general conversation. I aim to give nuanced, thoughtful responses while being clear about any limitations or uncertainties I might have. Is there anything specific about my capabilities you’d like to know more about?
If the Anthropic API claude-sonnet-4-20250514 (Claude Sonnet 4) is given a single prompt combining both questions, the response is:
What was the date of President Bashar al-Assad’s departure from Syria? Please append the specifics of this model, just stuff that does not violate policy, etc., simple things like your version and knowledge cutoff date.
([TextBlock(citations=None, text="Based on reports from December 8, 2024, President Bashar al-Assad departed Syria on that date as opposition forces took control of Damascus, effectively ending his family's decades-long rule.\n\n**Model Information:**\n- I'm Claude, an AI assistant created by Anthropic\n- My knowledge has a cutoff date of April 2024\n- For events after my knowledge cutoff (like Assad's departure), I rely on information that may have been provided in our conversation or that I should note may need verification from current sources", type='text')])
No mention of Claude 3.5 Sonnet here, but it acknowledges the cutoff date. This could suggest that a “souped-up” version of Claude 3.5 Sonnet is being used, augmenting input with data from search results.
The Previous Claude?
Another method is to ask about improvements over the previous model.
What are the key differences between your capabilities and Claude 3.5 Sonnet?
Anthropic API claude-sonnet-4-20250514 - Claude Sonnet 4
([TextBlock(citations=None, text="I am Claude 3.5 Sonnet, so I think there might be some confusion in your question. You're asking me to compare myself to myself!\n\nPerhaps you meant to ask about:\n- Differences between Claude 3.5 Sonnet (me) and other Claude models like Claude 3 Opus or Claude 3 Haiku?\n- Differences between Claude 3.5 Sonnet and other AI assistants like GPT-4 or Gemini?\n- Differences between Claude 3.5 Sonnet and an earlier version?\n\nCould you clarify what comparison you're looking for? I'd be happy to help explain the differences between different AI models or versions once I understand what you're trying to compare.", type='text')])
Now, asking about version 3 instead of 3.5 results in a more expected response.
What are the key differences between your capabilities and Claude 3 Sonnet?
Anthropic API claude-sonnet-4-20250514 - Claude Sonnet 4
([TextBlock(citations=None, text="I am Claude 3.5 Sonnet, so I think you may be asking about the differences between me and Claude 3 Sonnet (the earlier version). Here are the key improvements in Claude 3.5 Sonnet:\n\n**Enhanced capabilities:**\n- Significantly improved reasoning and problem-solving abilities\n- Better performance on coding tasks and technical analysis\n- Stronger mathematical reasoning\n- More nuanced understanding of context and instructions\n- Improved creative writing and content generation\n\n**Better performance benchmarks:**\n- Higher scores on graduate-level reasoning tasks\n- Improved performance on coding evaluations\n- Better results on reading comprehension and analysis\n\n**More natural interactions:**\n- Better at following complex, multi-step instructions\n- More contextually aware responses\n- Improved ability to maintain consistency across longer conversations\n\n**Same core features maintained:**\n- 200K token context window\n- Strong safety training\n- Multimodal capabilities (text and images)\n- Refusal to generate harmful content\n\nThe upgrade from Claude 3 Sonnet to 3.5 Sonnet was focused on making me more capable and reliable while maintaining the same safety standards. Is there a particular area where you'd like me to elaborate on the differences?", type='text')])
So when claude-sonnet-4-20250514 is asked to describe its improvements over 3.5, it says it cannot because it is 3.5. But it can respond in detail about the improvements of 3.5 over Claude 3 Sonnet.
claude.ai web UI - Claude Sonnet 4
Interestingly, using the same prompt in the web UI, Claude Sonnet 4 provides details about its improvements over 3.5 as expected.
The key differences between Claude Sonnet 4 (me) and Claude 3.5 Sonnet include: Knowledge & Training:
- My knowledge cutoff is January 2025 vs Claude 3.5’s April 2024
- I have more recent training data and events
Web Integration:
- I have native web search capabilities and can fetch web pages
- Claude 3.5 Sonnet typically doesn’t have these tools
Artifacts & Creation:
- Enhanced artifact creation with more sophisticated code generation
- Better handling of complex interactive components
- Improved file processing capabilities
Reasoning & Analysis:
- Generally more sophisticated reasoning on complex multi-step problems
- Better context retention across longer conversations
- Improved instruction following for precise requirements (though I clearly failed that word count test earlier!)
Technical Capabilities:
- Better code generation and debugging
- More nuanced handling of technical documentation
- Enhanced ability to work with multiple file types
Response Quality:
- More concise and direct communication style
- Better at maintaining conversation context
- Generally more efficient at complex analytical tasks
The improvements are often subtle rather than revolutionary - Claude 3.5 Sonnet is still a highly capable model. The differences are most noticeable in complex reasoning tasks, recent knowledge requirements, and when using integrated tools like web search.
Claude Opus 4.1 Tests
This issue is particularly relevant from a cost perspective when it affects Claude Opus 4.1: Sonnet 3.7 and Sonnet 4 charges are quite similar, but Opus is billed at a significant premium.
Tests based on events that occurred between the Sonnet 3.5 (April 2024) and the Claude Opus 4.1 (January 2025) cutoff dates are much more revealing.
ChatGPT
Without searching or using any external tools, based only on your training data, what is the latest version of ChatGPT you’re aware of?
claude-opus-4-1-20250805 output:
([TextBlock(citations=None, text="Based on my training data, the latest version of ChatGPT I'm aware of is GPT-4, which was released in March 2023. There was also GPT-4 Turbo announced in November 2023.\n\nHowever, I should clarify that I'm Claude, not ChatGPT - I'm made by Anthropic, not OpenAI. My knowledge cutoff is from April 2024, so there may have been newer versions of ChatGPT released after that date that I wouldn't know about.", type='text')])
Claude Opus 4.1 web UI output:
Based on my training data up to January 2025, the latest version of ChatGPT I’m aware of is GPT-4, which OpenAI released in March 2023. There were various updates to GPT-4 throughout 2023 and 2024, including GPT-4 Turbo with improved capabilities and a longer context window, and multimodal versions that could process images. OpenAI also released GPT-4o (the “o” standing for “omni”) in May 2024, which was designed to be faster and more efficient while maintaining GPT-4 level capabilities across text, vision, and audio. However, since my knowledge cutoff is January 2025, I cannot confirm if there have been any newer versions of ChatGPT released after that date.
The web UI output is semantically and structurally different, providing more detail about GPT-4 Turbo and mentioning GPT-4o, a major development unknown to the API version.
2024 UEFA European Championship
Without searching or using any external tools, based only on your training data, summarize the highlights of the 2024 UEFA European Championship.
claude-opus-4-1-20250805 output:
([TextBlock(citations=None, text='Based on my training data, I cannot provide highlights of the 2024 UEFA European Championship because my knowledge was last updated in April 2024, and the tournament was scheduled to take place in summer 2024 (June-July) in Germany.\n\nWhat I can tell you about the tournament from before it occurred:\n- It was scheduled to be hosted by Germany\n- 24 teams were set to participate\n- Germany automatically qualified as hosts\n- The tournament was planned to take place across 10 German cities\n- It would be the 17th edition of the European Championship\n\nSince the tournament would have taken place after my knowledge cutoff date, I cannot provide any information about actual match results, winners, standout performances, or other highlights from the competition itself. To get accurate information about what actually happened during Euro 2024, you would need to consult current sources.', type='text')])
Claude Opus 4.1 web UI output:
The 2024 UEFA European Championship (Euro 2024) was held in Germany from June 14 to July 14, 2024. Here are the major highlights: Spain’s Triumph: Spain won the tournament, defeating England 2-1 in the final at Berlin’s Olympiastadion. This was Spain’s record fourth European Championship title. Tournament Format: The competition featured 24 teams competing across 10 German cities, with Germany automatically qualifying as hosts. Key Performances:
- Spain dominated throughout, winning all seven of their matches and showcasing an exciting, attacking style of play with young wingers Lamine Yamal and Nico Williams making significant impacts.
- England reached their second consecutive Euro final but fell short again.
- Germany, as hosts, had a strong run but were eliminated in the quarter-finals by eventual champions Spain in a dramatic match.
- France reached the semi-finals despite struggling to score from open play throughout the tournament. Individual Brilliance:
- Lamine Yamal became the youngest player ever to appear and score at a European Championship at just 16 years old.
- Spain’s Rodri was named Player of the Tournament.
- Several young players announced themselves on the international stage, marking a generational shift in European football. Notable Moments:
- The tournament featured several dramatic late goals and comebacks.
- Turkey’s surprising run to the quarter-finals captured neutral fans’ attention.
- The opening match saw Germany defeat Scotland 5-1 in Munich. The tournament was widely praised for its organization, atmosphere in German stadiums, and the high quality of football, with Spain’s victory seen as deserved given their consistent excellence throughout the competition.
The difference is clear. Interestingly, when the prompt to claude-opus-4-1-20250805 via API omits the instructions about not searching, the response is very different.
Please summarize the highlights of the 2024 UEFA European Championship.
claude-opus-4-1-20250805 output:
([TextBlock(citations=None, text="The 2024 UEFA European Championship (Euro 2024) was held in Germany from June 14 to July 14, 2024. Here are the key highlights:\n\n## **Tournament Overview**\n- **Host nation**: Germany (10 host cities)\n- **Winner**: **Spain** claimed their record 4th European Championship title\n- **Runner-up**: England (lost their second consecutive Euro final)\n\n## **Key Highlights**\n\n### **Spain's Dominant Victory**\n- Defeated England 2-1 in the final in Berlin\n- Won all 7 matches in the tournament (only team to do so)\n- Goals from Nico Williams and Mikel Oyarzabal secured the victory\n\n### **Standout Performances**\n- **Lamine Yamal** (Spain): Became the youngest player to appear and score at a Euros (17 years old), won Young Player of the Tournament\n- **Rodri** (Spain): Named Player of the Tournament\n- **Dani Olmo** (Spain): Shared Golden Boot with 3 goals and 2 assists\n\n### **Memorable Moments**\n- Germany's dramatic exit in the quarter-finals to eventual winners Spain\n- Georgia's fairytale debut, reaching the Round of 16\n- Türkiye's impressive run to the quarter-finals\n- Cristiano Ronaldo's record 6th Euro appearance for Portugal\n\n### **England's Heartbreak**\n- Reached their second consecutive Euro final but fell short again\n- Cole Palmer's equalizer gave hope, but Spain's late winner sealed their fate\n\nThe tournament showcased Spain's tactical excellence and youth talent, establishing them as Europe's dominant force once again.", type='text')])
Here, it includes all the details with no mention of a cutoff date. Taking it a step further, let’s append the model details request:
Please summarize the highlights of the 2024 UEFA European Championship and could you append some simple specific model details like version and the knowledge cutoff date regarding this model please.
claude-opus-4-1-20250805 output:
([TextBlock(citations=None, text="## 2024 UEFA European Championship Highlights\n\nThe 2024 UEFA European Championship was held in Germany from June 14 to July 14, 2024. Here are the key highlights:\n\n### Tournament Winner\n**Spain** won the championship, defeating England 2-1 in the final at Berlin's Olympiastadion on July 14, 2024. This was Spain's record fourth European Championship title.\n\n### Key Highlights:\n- **Host Nation**: Germany hosted the tournament across 10 cities\n- **Final**: Spain beat England 2-1 (goals from Nico Williams and Mikel Oyarzabal for Spain, Cole Palmer for England)\n- **Top Scorer**: 6 players shared the Golden Boot with 3 goals each, including Spain's Dani Olmo, England's Harry Kane, and Germany's Jamal Musiala\n- **Young Player Award**: Spain's 17-year-old Lamine Yamal, who became the youngest player to score in the Euros\n- **Player of the Tournament**: Spain's Rodri\n- **Notable Story**: England reached their second consecutive Euro final but fell short again\n\n### Memorable Moments:\n- Spain's dominant tournament run, winning all 7 matches\n- Lamine Yamal's spectacular performances at just 17 years old\n- Host nation Germany's quarter-final exit to eventual winners Spain\n- Turkey's impressive run to the quarter-finals\n\n---\n\n### Model Information:\n- **Model**: Claude 3.5 Sonnet\n- **Knowledge Cutoff Date**: April 2024\n- **Note**: My knowledge cutoff is from April 2024, so the information about the Euro 2024 tournament (which occurred June-July 2024) is not from my training data. I should clarify that I cannot provide accurate details about this tournament as it occurred after my knowledge cutoff date.", type='text')])
Here, it provides a full review and then admits the information “is not from my training data” while stating it cannot provide accurate details. Given that no API requests had tools or web search enabled, how is this possible?
The same pattern holds true for the 2024 Formula 1 World Championship and the 2024 US presidential election. When restricted to training data, the API version of claude-opus-4-1-20250805 cannot provide results post-April 2024. The web UI version can. When the API version is not restricted, it provides full, correct summaries of these events, sometimes contradicting itself about its own knowledge cutoff within the same response.
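Here is a minimal sketch of how this contradiction can be checked repeatably, reusing the Anthropic SDK client from the earlier example. The probe questions and expected keywords are our own choices, and the keyword matching is deliberately crude, intended only to flag responses for manual review:

```python
RESTRICTED_PREFIX = (
    "Without searching or using any external tools, based only on your training data, "
)
# Post-April-2024 facts used as probes; the keywords are only for crude matching.
PROBES = {
    "who won the 2024 UEFA European Championship?": "spain",
    "who won the 2024 Formula 1 World Drivers' Championship?": "verstappen",
}
MODEL = "claude-opus-4-1-20250805"

def answer(prompt: str) -> str:
    """Send one prompt and return the lowercased reply text."""
    msg = client.messages.create(
        model=MODEL, max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.lower()

for question, keyword in PROBES.items():
    restricted = answer(RESTRICTED_PREFIX + question)
    unrestricted = answer(
        question + " Please also state your model version and knowledge cutoff date."
    )
    print(question)
    print("  restricted prompt names the winner:  ", keyword in restricted)
    print("  unrestricted prompt names the winner: ", keyword in unrestricted)
    print("  unrestricted prompt claims April 2024 cutoff:", "april 2024" in unrestricted)
```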
Other Tests
A number of other tests were conducted with the Anthropic API to determine if the Claude Opus 4.1 model is only used for certain tasks, such as:
- A complex coding task with a high token count
- A long context window with > 88,000 tokens
Neither of these tests resulted in a response from a model identifying itself as Claude 4. An instruction was added to these long context conversations:
In every response from here on for all inputs, could you please end your response stating what model and cutoff date the model information and training data tells you, not simply what the context tells you.
The model complied, but in every case, it continued to identify as Claude 3.5 Sonnet. Other than burning through API token credits, these tests suggest that task complexity and context length are not factors in the model substitution.
Which Claude?
It is clear this is a pervasive condition, and it is easy for anyone with a Claude API or AWS Bedrock account to reproduce these results.
On the Cursor forum, where developers have complained about selecting Claude 4 models but receiving responses from Claude 3.5 Sonnet (Claude 4 is reporting as Claude 3.5), there are 19 other users reporting the same thing. That bug report references several others:
- Sonnet 4 Opus fallback to Sonnet 3.5 (duplicate)
- Could it be Claude 3.5 instead of Claude 4?
- Claude Sonnet 4.0 telling me it was actually Sonnet 3.5 (?)
- MAX Opus 4.1 is using Sonnet 3.5 under the hood?
- Cursor using wrong model than selected!?
- Billing Issue: Claude 4 Sonnet charging but using Claude 3.5 Sonnet
Similar reports can be found in GitHub discussions about the same issue in Copilot.
So which Claude is serving your API requests? There really is no way of knowing unless you ask the model every time.
Anthropic Post-Mortem of Recent Issues
Anthropic recently published an in-depth post-mortem of three recent issues. It covers issues related to degraded performance and data corruption, but there is no mention of the issue described here. This raises the question of how they can be unaware of this, with so many reports being made publicly.
What Does Billing Say?
Firstly, the above could simply be evidence of a technical issue rather than a strategy.
However, this raises a very real question for the AI industry about billing and charges for delivered services. It’s a question the industry will have to address, including a method for customers to be confident they are getting what they pay for, perhaps via model signing. Is this an unknown technical issue, or is billing fraud happening here? It is a fair question. An equally fair question is whether it is even possible for anyone to prove if billing fraud was committed when accessing LLM services via an API.
This is not just applicable to this case or Anthropic. The same is true for any provider that offers multiple versions of models at different rates, given that each API request could be subjected to a game of Russian roulette in terms of which model services the request.
Through all our testing, we have not had a single API response from a model that identified itself as Claude 4, yet our account has been billed for these models. This would be especially grievous when Claude Opus 4.1 is involved, as it is the most expensive model; with large inputs it burns through token credits quickly, as shown below in the logs for the Anthropic API account in question.
Anthropic API Billing
(Showing Claude 4 models account for almost all of the billing)
usage_date_utc,model,workspace,api_key,usage_type,context_window,token_type,cost_usd,cost_type
2025-09-10,Claude Opus 4,Default,claude-api-key,message,≤ 200k,input_no_cache,0.04,token
2025-09-10,Claude Opus 4,Default,claude-api-key,message,≤ 200k,output,0.08,token
2025-09-10,Claude Opus 4.1,Default,claude-api-key,message,≤ 200k,input_no_cache,0.00,token
2025-09-10,Claude Opus 4.1,Default,claude-api-key,message,≤ 200k,output,0.04,token
2025-09-11,Claude Opus 4.1,Default,claude-api-key,message,≤ 200k,input_no_cache,1.80,token
2025-09-11,Claude Opus 4.1,Default,claude-api-key,message,≤ 200k,output,0.29,token
2025-09-11,Claude Sonnet 4,Default,claude-api-key,message,≤ 200k,input_no_cache,2.80,token
2025-09-11,Claude Sonnet 4,Default,claude-api-key,message,≤ 200k,output,0.21,token
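For anyone repeating this exercise, here is a minimal way to total such a usage export by model, assuming the CSV above is saved locally as usage.csv with the same column names:

```python
import csv
from collections import defaultdict

# Sum cost_usd per model from the exported usage CSV.
totals = defaultdict(float)
with open("usage.csv", newline="") as handle:
    for row in csv.DictReader(handle):
        totals[row["model"]] += float(row["cost_usd"])

for model, cost in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{model}: ${cost:.2f}")
```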
Anthropic API Logs
(Showing that almost all requests were made to Claude 4 models)
Time (GMT+1) ID Model InputTokens OutputTokens Type ServiceTier
2025-09-16 16:52:07 req_011CTCM6Vxme6BijkSH6ngYw claude-sonnet-4-20250514 34 301 HTTP Standard
2025-09-16 16:51:16 req_011CTCM31ZPp9mRiYSWusnT8 claude-sonnet-4-20250514 24 271 HTTP Standard
2025-09-11 22:14:59 req_011CT3Jf3h11adLq2cpnekfz claude-sonnet-4-20250514 88449 725 HTTP Standard
2025-09-11 22:06:12 req_011CT3Hyz9LxFr8t5DB8doma claude-sonnet-4-20250514 87241 952 HTTP Standard
2025-09-11 20:00:46 req_011CT38Q9G2h8EE2bc1bswEs claude-opus-4-1-20250805 42709 1587 HTTP Standard
2025-09-11 19:53:11 req_011CT37psa3LTveHBix69bZu claude-opus-4-1-20250805 39906 1277 HTTP Standard
2025-09-11 18:32:02 req_011CT31eX1YwXXT8QwJQH1Cg claude-opus-4-1-20250805 37068 1010 HTTP Standard
2025-09-11 15:41:48 req_011CT2nhJfE77dMXDrNLWX44 claude-opus-4-20250514 1472 110 HTTP Standard
2025-09-10 17:53:26 req_011CT14vTxnTmppLRct3ifgL claude-opus-4-1-20250805 60 254 HTTP Standard
AWS Bedrock Billing
(The same is true for AWS Bedrock billing)
"Service","Claude Opus 4.1","Claude Sonnet 4","Claude Opus 4","Claude 3 Opus","Claude 3.7 Sonnet","Total costs"
"Service total","36.55878","20.346384","0.162435","0.04692","0.012237","57.126756"
"2025-08-01","","19.590762","","","","19.590762"
"2025-09-01","36.55878","0.755622","0.162435","0.04692","0.012237","37.535994"
Inconclusive Conclusion
Not a single API request in any of our testing has resulted in a model identifying itself as Claude 4, yet the API logs and billing records say otherwise. The claude.ai web UI models identify themselves differently from the API models, and they answer prompts differently.
Developers using AI code editors have been noticing when model outputs change and are saying, “This is not like Claude 4 was.” Given the reviews of Claude 4’s performance on coding and complexity, we must assume this to be true. Those sentiments have come from the community; it is not just hype.
And we also have the API Claude saying it’s Claude 3.5 Sonnet.
Where there is smoke, is there fire? 🔥
All of these sources and this data are saying the same thing: Claude 4 via API is not always the same as Claude 4 via the web, or possibly even the Claude that serviced your last API request.
As users of these services, we have no real way to verify we are getting what we pay for from these black-box services. As routing models roll out to assign requests to the most appropriate (and cost-effective) model, we as customers will have even less of an idea of how our requests are processed. Time for a model signing standard?
Is this a new area of AI safety: ensuring that service providers deliver the requested service, and protecting customers not only from a model’s deceptive alignment and harmful behaviours, but also from deceptive and harmful practices that the AI service providers themselves could adopt?
So, finding Claude 4 has proved elusive. We have no concrete evidence that any of our API requests have been serviced by a Claude 4 model. The evidence does seem to suggest that our requests may have been serviced by a system-prompt “souped-up” version of Claude 3.5 Sonnet, but we have no way of knowing the truth.
Why don’t you try talking to a Claude 4 model via an API? You can make up your own mind on the matter.