This guide breaks down what's actually new in Claude Sonnet 5, Anthropic's latest Sonnet-tier model.

| Feature | Claude Sonnet 5 |
|---|---|
| Description | Speed and Intelligence |
| API model ID | claude-sonnet-5 |
| Context window | 1M tokens |
| Max output tokens | 128k |
| Latency | Fast |
| Extended thinking | No |
| Adaptive thinking | Yes |
| Knowledge cutoff | Jan 2026 |

These are the changes that can break your existing code if you don't account for them.

On Claude Sonnet 4.6, a request without a `thinking` field ran with thinking off. On Claude Sonnet 5, the same request runs with adaptive thinking automatically.
If you want the old behavior, turn it off explicitly:

```python
thinking = {"type": "disabled"}
```

One thing to watch: `max_tokens` covers thinking tokens and response text combined. If your workload used to run without thinking, revisit your `max_tokens` value so your response doesn't get cut off.
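Because the default flipped, it can help to set the thinking mode explicitly on every request instead of relying on each model's default. The sketch below builds the keyword arguments for `client.messages.create` as a plain dict, so it runs without an API key. The helper name `build_params` and the `max_tokens` values are hypothetical illustrations, not recommendations from Anthropic.

```python
from typing import Any


def build_params(prompt: str, *, max_tokens: int, thinking: bool) -> dict[str, Any]:
    """Build keyword arguments for client.messages.create (hypothetical helper).

    The thinking mode is always set explicitly, so the request behaves the
    same way regardless of which default the model uses.
    """
    return {
        "model": "claude-sonnet-5",
        # max_tokens caps thinking tokens plus response text combined.
        "max_tokens": max_tokens,
        # "disabled" restores the Claude Sonnet 4.6 no-thinking behavior.
        "thinking": {"type": "adaptive" if thinking else "disabled"},
        "messages": [{"role": "user", "content": prompt}],
    }


# Old behavior: no thinking, so max_tokens only has to cover the reply.
params = build_params("Summarize this ticket.", max_tokens=1000, thinking=False)

# With thinking on, the same limit must also cover thinking tokens;
# 4000 is an arbitrary example value, not a recommended setting.
thinking_params = build_params("Summarize this ticket.", max_tokens=4000, thinking=True)

# client.messages.create(**params)  # requires the anthropic SDK and an API key
```

Keeping the choice in one helper means a later model change only needs a one-line edit rather than a hunt through every call site.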
Setting `temperature`, `top_p`, or `top_k` to anything other than the default now returns a 400 error.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# This will fail on Claude Sonnet 5: temperature is set to a non-default value
response = client.messages.create(
    model="claude-sonnet-5",
    temperature=0.7,
    max_tokens=1000,
    messages=[{"role": "user", "content": "Hello"}],
)
```

Fix: remove these parameters entirely, or leave them at their defaults. If you were using `temperature` to steer tone or creativity, move that guidance into your system prompt instead.
This same restriction already applies to Claude Opus 4.7, so it's not a totally new pattern.
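If request parameters are assembled in one place, a small filter can drop the rejected keys before the call and carry any tone guidance over into the system prompt. This is a minimal sketch: `strip_sampling_params` and `tone_hint` are hypothetical names, and it assumes `system` is a plain string rather than a list of content blocks.

```python
from typing import Any, Optional

# Parameters that Claude Sonnet 5 rejects when set to non-default values.
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def strip_sampling_params(
    params: dict[str, Any], tone_hint: Optional[str] = None
) -> dict[str, Any]:
    """Return a copy of params without sampling keys (hypothetical helper).

    If tone_hint is given, it is appended to the system prompt, where this
    guide suggests moving tone or creativity guidance.
    Assumes "system" is a plain string, not a list of content blocks.
    """
    cleaned = {k: v for k, v in params.items() if k not in SAMPLING_KEYS}
    if tone_hint:
        existing = cleaned.get("system", "")
        cleaned["system"] = f"{existing}\n\n{tone_hint}".strip()
    return cleaned


old_params = {
    "model": "claude-sonnet-5",
    "temperature": 0.7,
    "max_tokens": 1000,
    "messages": [{"role": "user", "content": "Hello"}],
}
new_params = strip_sampling_params(
    old_params, tone_hint="Keep replies warm and a little playful."
)
# client.messages.create(**new_params) no longer sends temperature.
```

The original dict is left untouched, so the same base parameters can still be reused for models that accept sampling settings.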
Manually setting a thinking budget no longer works:

```python
# Not supported on Claude Sonnet 5 (returns 400)
thinking = {"type": "enabled", "budget_tokens": 32000}

# Use this instead
thinking = {"type": "adaptive"}
```

This matches Claude Opus 4.8 and Claude Opus 4.7. If your code sets `budget_tokens` manually, switch to adaptive thinking and use the `effort` parameter if you need more control over how much the model thinks.
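When thinking configs are stored in settings files or passed around as dicts, a small mapping function can rewrite them in one pass. The sketch below is hypothetical (`migrate_thinking` is not an SDK function), and treating a missing field as `{"type": "disabled"}` is a design choice that preserves the Claude Sonnet 4.6 behavior; drop that branch if you want the new adaptive default. It does not set the `effort` parameter.

```python
from typing import Any, Optional


def migrate_thinking(thinking: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Map an old thinking config to one Claude Sonnet 5 accepts (hypothetical).

    - {"type": "enabled", "budget_tokens": N} -> {"type": "adaptive"}
    - None (no thinking field) -> {"type": "disabled"}, keeping the
      Sonnet 4.6 behavior where omitting the field meant no thinking
    - anything else ("adaptive", "disabled") is copied unchanged
    """
    if thinking is None:
        return {"type": "disabled"}
    if thinking.get("type") == "enabled":
        # Manual budgets return 400 on Sonnet 5, so budget_tokens is dropped.
        return {"type": "adaptive"}
    return dict(thinking)


old_config = {"type": "enabled", "budget_tokens": 32000}
new_config = migrate_thinking(old_config)
```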
Claude Sonnet 5 uses a new tokenizer. The same text now produces roughly 30% more tokens than it did on Claude Sonnet 4.6.
Nothing about the API shape changes: your requests, responses, and streaming events all work the same way. But anything measured in tokens will look different.
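As a first pass, any stored token limit can be scaled by the ~30% figure and then tuned against real traffic, since 30% is an average rather than a guarantee for every input. This sketch uses exact fractions so the rounding is predictable; `rescale_limit` is a hypothetical helper, and treating the "128k" output cap as 128,000 tokens is an assumption.

```python
import math
from fractions import Fraction

# Roughly 30% more tokens for the same text (approximate figure from this guide).
TOKEN_INFLATION = Fraction(13, 10)
# Assumption: the "128k" max output tokens is treated as 128,000 here.
MAX_OUTPUT_TOKENS = 128_000


def rescale_limit(old_limit: int, cap: int = MAX_OUTPUT_TOKENS) -> int:
    """Scale a Claude Sonnet 4.6 token limit to a rough Claude Sonnet 5 equivalent.

    Rounds up so the new limit never undershoots, then clamps to the cap.
    """
    return min(math.ceil(old_limit * TOKEN_INFLATION), cap)


print(rescale_limit(1000))  # 1300
print(rescale_limit(4096))  # 4096 * 1.3 = 5324.8, rounded up to 5325
```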
Here's what to double-check:

- `max_tokens` limits: a limit that worked fine on Claude Sonnet 4.6 might now truncate your output. Recheck any limit set close to your expected output length.

| Feature | Claude Sonnet 4.6 | Claude Sonnet 5 |
|---|---|---|
| Default thinking | Off | Adaptive (on) |
| Manual thinking budget | Deprecated | Removed (400 error) |
| Custom sampling params | Allowed | Returns 400 error |
| Tokenizer | Older | New (~30% more tokens for same text) |
| Priority Tier | Supported | Not supported |
| Assistant message prefilling | Not supported | Not supported |

| Model | Claude Sonnet 5 |
|---|---|
| Pricing | $3 / M input tokens, $15 / M output tokens |
| Intro pricing (through Aug 31, 2026) | $2 / M input tokens, $10 / M output tokens |
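To budget a workload, the table reduces to simple arithmetic: tokens times the per-million rate, divided by one million, with the intro rate applying through Aug 31, 2026. This is a minimal sketch using `Decimal` to avoid floating-point rounding in dollar amounts; `estimate_cost` is a hypothetical helper, and it ignores any other billing factors such as prompt caching or batch discounts.

```python
from datetime import date
from decimal import Decimal

# Per-million-token prices in USD, taken from the pricing table above.
STANDARD = {"input": Decimal("3"), "output": Decimal("15")}
INTRO = {"input": Decimal("2"), "output": Decimal("10")}
INTRO_END = date(2026, 8, 31)  # intro pricing applies through this date


def estimate_cost(input_tokens: int, output_tokens: int, on: date) -> Decimal:
    """Estimate the USD cost of a request on a given date (hypothetical helper)."""
    rates = INTRO if on <= INTRO_END else STANDARD
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / Decimal(1_000_000)


# Text that measured 100,000 input tokens on Claude Sonnet 4.6 is roughly
# 130,000 on Claude Sonnet 5 because of the new tokenizer.
cost = estimate_cost(130_000, 20_000, date(2026, 9, 1))
print(cost)  # 0.69
```

Feeding in token counts measured on Claude Sonnet 4.6 will undercount, so scale them up first, as in the usage line above.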

Claude Sonnet 5 is available through:
Claude Sonnet 5 also supports zero data retention for organizations with a ZDR agreement.
Step one is simple: swap the model ID.

```python
model = "claude-sonnet-4-6"  # Before
model = "claude-sonnet-5"    # After
```

Then work through this checklist:

- Recheck `max_tokens` limits that are close to your expected output size.
- Swap `budget_tokens` for `{"type": "adaptive"}`.
- Remove `temperature`, `top_p`, or `top_k` values from your requests.

Everything else, including tool definitions and response formats, stays the same. Assistant message prefilling was already unsupported on Claude Sonnet 4.6, so that's not a new limitation.
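Most of this checklist can be checked mechanically before a request is sent. The sketch below is a hypothetical linter (`migration_issues` is not part of any SDK) that reports which items still apply to a request dict. It cannot judge whether a `max_tokens` value is close to your expected output size, so that item still needs a manual review.

```python
from typing import Any

SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def migration_issues(params: dict[str, Any]) -> list[str]:
    """List checklist items a request still needs for Claude Sonnet 5 (hypothetical)."""
    issues = []
    if params.get("model") != "claude-sonnet-5":
        issues.append("swap the model ID to claude-sonnet-5")
    for key in SAMPLING_KEYS:
        if key in params:
            issues.append(f"remove {key}")
    thinking = params.get("thinking")
    if thinking is None:
        # Not an error, but the request will now think by default.
        issues.append("no thinking field: adaptive thinking will run by default")
    elif "budget_tokens" in thinking:
        issues.append('replace budget_tokens with {"type": "adaptive"}')
    return issues


old_request = {
    "model": "claude-sonnet-4-6",
    "temperature": 0.7,
    "max_tokens": 1000,
    "thinking": {"type": "enabled", "budget_tokens": 32000},
    "messages": [{"role": "user", "content": "Hello"}],
}
for issue in migration_issues(old_request):
    print(issue)
```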
1. Do I need to rewrite my whole integration to use Claude Sonnet 5?
No. It's a drop-in replacement. Change the model ID, then handle the three behavior changes above.
2. Why is thinking suddenly on by default?
Claude Sonnet 5 uses adaptive thinking by default instead of running with thinking off, which was the default on Claude Sonnet 4.6. You can disable it if you don't want it.
3. Can I still set a specific thinking budget in tokens?
No. Manual extended thinking is removed. Use adaptive thinking with the effort parameter instead.
4. Why am I getting a 400 error when I set temperature?
Claude Sonnet 5 no longer accepts non-default values for temperature, top_p, or top_k. Remove them from your request.
5. Will my old token counts still be accurate?
No. The new tokenizer produces about 30% more tokens for the same text, so old counts from Claude Sonnet 4.6 are no longer valid.
6. Does the new tokenizer change how I call the API?
No. Requests, responses, and streaming events keep the same structure. Only your token-based measurements are affected.
7. Is Claude Sonnet 5 more expensive than Claude Sonnet 4.6?
Per-token pricing is the same. But since the same text now produces more tokens, your total cost per request can be higher.
8. Does Claude Sonnet 5 support Priority Tier?
No, Priority Tier isn't available for Claude Sonnet 5 at this time.
9. Can I use Claude Sonnet 5 on Amazon Bedrock?
Yes, through Claude in Amazon Bedrock and Claude Platform on AWS. It's not available on the legacy Bedrock InvokeModel or Converse APIs.
10. What's the biggest capability improvement in Claude Sonnet 5?
The largest gains over Claude Sonnet 4.6 are in coding and agentic tasks.