Gemma 4 on Amazon Bedrock: Getting Started with Google's Open-Weight AI Models

Learn how to use Gemma 4 models (31B, 26B-A4B, and E2B) on Amazon Bedrock. This guide covers setup, code examples, multimodal input, reasoning mode, tool calling, and production best practices.

You want powerful AI models in production, but you also need data privacy, regulatory compliance, and operational control. That usually feels like a trade-off. Use the cutting-edge model, or keep your data safe. Pick one.

Amazon Bedrock removes that trade-off. It hosts Google DeepMind's Gemma 4 family as a fully managed service, running inference entirely on AWS infrastructure. Your prompts and completions are never used to train models, and nothing is shared with third parties.

In this post, you will learn how to pick the right Gemma 4 variant, set up access, and call the API with real code examples covering text, images, streaming, tool calling, and reasoning mode.

What Is Gemma 4?

Gemma 4 is an open-weight model family from Google DeepMind, released under the Apache 2.0 license. It covers dense and mixture-of-experts (MoE) architectures, with every variant supporting reasoning mode, native function calling, and multimodal input (text plus images).

All three variants are available on Amazon Bedrock through the bedrock-mantle endpoint, which uses an OpenAI-compatible interface. This means if you are already using the OpenAI Python SDK, you only need to swap the base URL and model ID.

Gemma 4 Variants at a Glance

	Gemma 4 31B	Gemma 4 26B-A4B	Gemma 4 E2B
Model ID	google.gemma-4-31b	google.gemma-4-26b-a4b	google.gemma-4-e2b
Architecture	Dense	Mixture-of-Experts	Dense (PLE)
Parameters	30.7B total	25.2B total / 3.8B active	5.1B total / 2.3B effective
Context window	256K tokens	256K tokens	128K tokens
Modalities	Text, image	Text, image	Text, image
Reasoning mode	Yes	Yes	Yes
Function calling	Native	Native	Native

Choosing the Right Variant

Your workload	Choose	Why
Reasoning-heavy or coding tasks	Gemma 4 31B	Strongest reasoning and coding, 256K context
Cost-sensitive at high throughput	Gemma 4 26B-A4B	MoE design: 4B-class cost with larger model knowledge
Latency-sensitive or multimodal classification	Gemma 4 E2B	Fastest and cheapest; use `reasoning_effort=high`

All three variants share the same API surface, so you can build once and swap models without changing your code.

Prerequisites and Setup

You need an AWS account with permission to call the bedrock-mantle endpoint. The easiest way is to attach the AmazonBedrockMantleInferenceAccess managed IAM policy to your user or role.

The endpoint URL is:

https://bedrock-mantle.{region}.api.aws/openai/v1

You will also need a short-term Amazon Bedrock API key. These expire automatically after a maximum of 12 hours. In production, store them in AWS Secrets Manager rather than environment variables.

Basic Text Completion

Use the OpenAI Python SDK and point it at the bedrock-mantle endpoint:

python

from openai import OpenAI

client = OpenAI(
    api_key="<your-short-term-bedrock-api-key>",
    base_url="https://bedrock-mantle.us-east-1.api.aws/openai/v1",
)

response = client.chat.completions.create(
    model="google.gemma-4-31b",
    messages=[
        {"role": "user", "content": "Explain the benefits of mixture-of-experts architectures for production inference."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

If you are migrating from another OpenAI-compatible provider, this is likely all you need to change.

Multimodal Input (Images)

All Gemma 4 variants accept image input. The endpoint supports base64-encoded data URLs and Amazon S3 URLs (s3://). Public https:// image URLs are not supported.

Place image content before the text prompt for best results, following Google DeepMind's recommended input ordering.

python

import base64

with open("chart.png", "rb") as image_file:
    image_b64 = base64.b64encode(image_file.read()).decode("utf-8")
data_url = f"data:image/png;base64,{image_b64}"

response = client.chat.completions.create(
    model="google.gemma-4-31b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": "Describe the trend shown in this chart."}
        ]
    }],
)
print(response.choices[0].message.content)

To use an S3 image instead, replace data_url with "s3://my-bucket/chart.png".

Streaming Responses

For chat or agent interfaces where you want to show tokens as they arrive, set stream=True:

python

stream = client.chat.completions.create(
    model="google.gemma-4-31b",
    messages=[
        {"role": "user", "content": "Write a short poem about distributed systems."}
    ],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

No additional IAM permissions are needed. Streaming uses the same bedrock-mantle:CreateInference permission as non-streaming calls.

Tool Calling (Function Calling)

Gemma 4 supports native function calling for agentic workflows. The pattern is: define tools, let the model call one, run the function yourself, return the result, get the final answer.

python

import json
from openai import OpenAI

client = OpenAI(
    api_key="<your-short-term-bedrock-api-key>",
    base_url="https://bedrock-mantle.us-east-1.api.aws/openai/v1",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and country"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
}]

messages = [{"role": "user", "content": "What's the weather like in Seattle?"}]

# Step 1: Send request with tool definitions
response = client.chat.completions.create(
    model="google.gemma-4-31b",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
assistant_message = response.choices[0].message

# Step 2: Check if the model wants to call a tool
if assistant_message.tool_calls:
    messages.append(assistant_message)

    for tool_call in assistant_message.tool_calls:
        arguments = json.loads(tool_call.function.arguments)
        location = arguments.get("location", "Unknown")
        unit = arguments.get("unit", "fahrenheit")

        # Step 3: Run the actual function
        result = {
            "location": location,
            "temperature": 18 if unit == "celsius" else 64,
            "unit": unit,
            "condition": "Partly cloudy",
        }

        # Step 4: Return the result to the model
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Step 5: Get the final answer
    final_response = client.chat.completions.create(
        model="google.gemma-4-31b",
        messages=messages,
        tools=tools,
    )
    print(final_response.choices[0].message.content)
else:
    print(assistant_message.content)

Reasoning Mode

Reasoning mode makes the model show its thinking before giving a final answer. It is useful for complex multi-step problems and is enabled through the Responses API.

python

response = client.responses.create(
    model="google.gemma-4-31b",
    input="If a train leaves at 3pm at 60 km/h and another leaves an hour later at 90 km/h, when does the second catch up?",
    reasoning={"effort": "high"},
)

# Final answer
print(response.output_text)

# The model's thought process
for item in response.output:
    if item.type == "reasoning":
        for block in item.content:
            print(block.text)

The effort value can be low, medium, or high. For Gemma 4 E2B specifically, always use high to prevent reasoning text from leaking into the final answer.

For multi-turn conversations, only pass back final answers in the message history. Do not replay prior reasoning items to the model, as this degrades response quality.

Recommended sampling parameters: Use temperature=1.0 and top_p=0.95 for all variants in both reasoning and non-reasoning modes.

Service Tiers

Amazon Bedrock offers three inference tiers for Gemma 4:

Tier	Best for	Notes
Standard	Everyday tasks: content generation, text analysis	Default. Consistent performance, no commitment.
Priority	Customer-facing, real-time, latency-sensitive workloads	Up to 25% better throughput than Standard. Premium price, no reservation needed.
Flex	Batch jobs, model evals, background summarization	Discounted pricing. Higher latency, especially during peak traffic.

Opt into Priority by setting service_tier="priority" on each API call. You can mix tiers within the same application.

Handling Errors and Scaling in Production

Retry logic: Configure the OpenAI SDK with max_retries to handle transient failures automatically:

python

from openai import OpenAI

client = OpenAI(
    api_key="<your-short-term-bedrock-api-key>",
    base_url="https://bedrock-mantle.us-east-1.api.aws/openai/v1",
    max_retries=6,
)

Error codes to watch:

Error	Meaning	Action
HTTP 429	Token-per-minute quota exceeded	Reduce request rate, retry with exponential backoff, or request a quota increase
HTTP 503	Regional capacity under pressure	Retry with exponential backoff; consider Priority tier for sustained issues

Traffic ramp procedure: Avoid jumping straight to your target request rate. Instead:

Start at target rate. If you see 503 errors, cut rate by 50%.
Keep reducing until requests succeed consistently.
Hold at that rate for 15 minutes.
Increase by 50%, hold for 15 minutes.
Repeat until you reach your target.

Other production tips:

Spread large workloads across multiple minutes rather than sending bursts.
Use feature flags when migrating to a new model version to ramp traffic gradually.
Send async or batch work to the Flex tier for better price-performance.
Distribute across multiple AWS regions for resilience.

Prompt Caching

Implicit prompt caching is automatically enabled for all Gemma 4 models on Amazon Bedrock. When consecutive requests share the same prompt prefix, the model can reuse cached internal state, reducing latency at no extra cost.

To take advantage of this, put static content (system prompts, tool definitions, source documents) at the start of your prompt, and put dynamic content at the end.

Availability and Pricing

Gemma 4 is available in:

US East (N. Virginia)
US East (Ohio)
US West (Oregon)
Europe (Frankfurt)

Pricing is per token and varies by model and service tier. Check the Amazon Bedrock pricing page for current rates.

Q&A

1. Do I need to host the Gemma 4 weights myself on Amazon Bedrock?

No. Amazon Bedrock is fully managed. You do not provision infrastructure, host model weights, or operate inference stacks. You call an API and pay per token.

2. Can I use the standard OpenAI Python SDK to call Gemma 4 on Bedrock?

Yes. You only need to change the base_url to the bedrock-mantle endpoint and update the model parameter to the Gemma 4 model ID. Everything else stays the same.

3. Are my prompts used to train Gemma 4 or any other model?

No. Amazon Bedrock does not use your prompts or completions to train models, and your content is not shared with third parties, including Google DeepMind.

4. What is the difference between the Chat Completions API and the Responses API?

Chat Completions uses a messages list and is best for multi-turn conversations and client-side tool-calling loops. The Responses API uses a single input field and is suited for single-turn generation and accessing the model's reasoning output via output_text.

5. When should I enable reasoning mode?

Enable reasoning mode for complex multi-step tasks like math problems, logic puzzles, and detailed coding tasks. It increases latency and token usage, so avoid it for simple or time-sensitive requests. For the E2B variant, always use reasoning_effort=high.

6. What image formats and sources does Gemma 4 support?

It supports images sent as base64-encoded data URLs (e.g., data:image/png;base64,...) or as Amazon S3 URLs (s3://bucket/file.png). Public https:// URLs from the internet are not supported.

7. What is the Gemma 4 26B-A4B and why is it cost-efficient?

It is a mixture-of-experts model with 25.2B total parameters, but only 3.8B parameters are active per token during inference. This means its compute cost and latency are similar to a 4B dense model, while retaining the knowledge capacity of a much larger one.

8. How do I handle throttling errors in production?

Configure the OpenAI SDK with max_retries=6 to automatically retry with exponential backoff. For sustained 503 errors, reduce your request rate using the ramp procedure described above, or upgrade latency-sensitive traffic to the Priority tier.

9. Does prompt caching cost extra?

No. Implicit prompt caching is automatically enabled at no extra cost. Cache hits reduce latency for requests with matching prompt prefixes, and cached input tokens do not count against your input-token quota.

10. How do I choose between Standard, Priority, and Flex tiers?

Use Standard for everyday tasks with no urgency. Use Priority for customer-facing and real-time workloads where response speed matters and you can absorb the premium price. Use Flex for background jobs, evaluations, and batch processing where higher latency is acceptable in exchange for lower cost.

My SaaS

Acluebox

Build modular and reusable system prompts with my SaaS,

Acluebox

. Also, free prompt template generators there.

References

Introducing Gemma 4 models on Amazon Bedrock

Gemma 4 on Amazon Bedrock: Getting Started with Google's Open-Weight AI Models ​

What Is Gemma 4? ​

Gemma 4 Variants at a Glance ​

Choosing the Right Variant ​

Prerequisites and Setup ​

Basic Text Completion ​

Multimodal Input (Images) ​

Streaming Responses ​

Tool Calling (Function Calling) ​

Reasoning Mode ​

Service Tiers ​

Handling Errors and Scaling in Production ​

Prompt Caching ​

Availability and Pricing ​

Q&A ​

References ​

Gemma 4 on Amazon Bedrock: Getting Started with Google's Open-Weight AI Models

What Is Gemma 4?

Gemma 4 Variants at a Glance

Choosing the Right Variant

Prerequisites and Setup

Basic Text Completion

Multimodal Input (Images)

Streaming Responses

Tool Calling (Function Calling)

Reasoning Mode

Service Tiers

Handling Errors and Scaling in Production

Prompt Caching

Availability and Pricing

Q&A

References