Article Jun 18, 2026 · 9 min read

Two Days, AWS Bedrock, and the Moment Claude Figured It Out Before I Did

A field story about Hermes, reasoning models, and the illusion of 'plug-and-play AI' — for R&D leaders, QA/Test managers, and engineers who were promised 10x, 20x, 50x.

#AWS Bedrock#Claude#AI Agents#Reasoning Models

Why bother with the lab in the first place

Most of my work involves researching agentic AI. Not for LinkedIn posts, but to answer two questions: where do Quality Engineering teams get real leverage, and how do we build these tools into developer workflows?

Hypotheses don’t get tested in slide decks. At Skipper, we spun up an AWS instance, got Bedrock API access, and decided to stand up Hermes as our experiment platform. The admin sized the box to the minimum requirements. We also set up a few Gmail accounts for our AI agents.

The plan looked like a couple of hours of work. In practice, it turned into a two-day marathon. But those two days taught me more than a successful lab run ever would.

Part 1. “Hello world” and the descent into the funnel

The first run followed the standard tutorials:

# SSH in, install, and point Hermes at Bedrock
ssh ubuntu@hermes-lab
curl -fsSL https://get.hermes.sh | sh
hermes setup --provider bedrock --region us-east-1
export AWS_BEARER_TOKEN_BEDROCK="$BEDROCK_TOKEN"
hermes   # ...and start working. In theory.

In practice, that last line is where everything fell apart.

That is where the two-day debugging loop began. The YouTube walkthrough I followed promised it would take a couple of hours.

Here is the loop I fell into:

Rebuild the configs.
Switch to a custom LLM provider.
Verify the Bedrock token via curl (it works).
Run Hermes and get an authentication error.

I did what most engineers do: I opened a clean Claude Code context window and dumped in the configs, documentation, error logs, and package versions. One terminal ran Claude, the other ran the box with Hermes. I copied and pasted between them by hand, since I didn’t give Claude direct terminal access to the lab machine.

I went through four laps of the same cycle:

Formatting and reformatting the configuration file.
Tweaking the API endpoint URL.
Double-checking the documentation.
Reinstalling everything from scratch, feeding the model best-practice articles and success stories.

Every time, Claude confidently told me: “Try again, it should work now.”

And every time, I got the same auth error.

At that point, I had to choose: keep trying the same fixes, or accept that our assumptions were wrong.

Part 2. The turn: Claude stops “fixing” and starts thinking

I pasted the same error one more time. The response I got sobered me up:

Let’s face the reality: your bearer token works with the native Bedrock runtime, but not with Bedrock Mantle (the OpenAI-compatible wrapper).

✓ Works: Native Bedrock API with your bearer token: https://bedrock-runtime.us-east-1.amazonaws.com/model/us.anthropic.claude-sonnet-4-6/converse

✗ Doesn’t work:

Bedrock Mantle (OpenAI-compatible) — rejects your bearer token or model names

Hermes bedrock provider — uses boto3, needs AWS access keys

Best Solution: get AWS Access Keys, or build a small Python proxy that translates OpenAI-format requests into Bedrock Converse calls with the bearer token.

That wasn’t another “try again” suggestion. It was a different class of response.

Claude stopped proposing tactical fixes and looked at the architecture.

That is when I realized what makes reasoning models different. The model understood before I did that the problem was not in the configuration details. It was in our basic assumptions and the connection architecture. You cannot solve a puzzle if you are trying to force two pieces from different sets together.

Part 3. The root cause in plain English

Here is what was actually going on under the hood:

I was using an AWS Bedrock API Key (the ABSK... bearer token AWS released in July 2025). The token worked with direct curl requests to the native Bedrock Converse API.
The Hermes Agent (NousResearch) Bedrock provider (added in v0.11.0) is built on top of the standard AWS SDK (boto3). It expects credentials from the standard AWS credential chain: an IAM role, ~/.aws/credentials, or AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. It uses SigV4 signing, not a bearer header.
I was trying to feed Hermes credentials from an authentication model it did not support. The requests went out unsigned or got signed incorrectly, resulting in a 403 error.
My attempts at workarounds failed for predictable reasons:
- Setting the provider to custom and using the bedrock-runtime URL broke the response format. The Bedrock Converse API does not use the OpenAI schema, so Hermes could not parse the empty replies.
- Using custom with Bedrock Mantle (AWS’s OpenAI-compatible engine) did not work because Mantle is a separate authentication domain. The AWS-issued bearer token is not valid there.
We also ran into a known botocore bug (#3603). If AWS_BEARER_TOKEN_BEDROCK is exported as an empty string, boto3 mistakenly treats it as a valid token and fails signature verification. Simply having that variable in the environment was enough to break the connection.

The root cause: we had the right credentials and the right endpoint, but they belonged to two different authentication models that Hermes does not bridge.

How to resolve this:

Use AWS Access Keys or an IAM role. This is what the Hermes Bedrock provider actually supports.
Use the bearer token outside of Hermes, directly via the native Bedrock Converse API or a client that supports it.
Write a Python proxy to accept OpenAI-style requests from Hermes, translate them, and forward them to the Bedrock Converse API using the bearer token. This works, but it means you are now hosting and maintaining an extra microservice just to run a lab experiment.

In my first draft of this post, I thought updating Hermes would solve it. I updated to v0.14.0, and nothing changed. The Hermes maintainers did not write a bug; they wrote a provider that does not support bearer tokens. The documentation states this, but I did not read it closely enough.

None of this is mentioned in the quick-start YouTube videos or the basic setup guides. I had to piece the architecture together myself—and then revise my assumptions when my first “fix” failed.

Three takeaways for three different audiences

This isn’t really a story about Hermes. Hermes is just the lab specimen. The real lessons apply to three distinct roles.

For engineers: reasoning models change how we debug

Working with older LLMs was mostly about asking for the next step. With reasoning models, it is about asking them to help you rethink your assumptions.

A standard model would have kept suggesting minor configuration tweaks forever. The reasoning model did what I failed to do: it stepped back, looked at the architecture, and pointed out that my token was valid for a completely different product.

Takeaway: Use your AI tools in two modes—as an accelerator for routine tasks, and as a diagnostician when you are stuck in a loop. When an assistant starts giving you the same suggestions for the third time, it is time to switch models or change your prompt to focus on architecture rather than syntax.

If an assistant keeps telling you to “try again” with minor variations, that is your cue to stop. Either it is missing critical context, or you need a model that can reason about the system design.

For QA managers: “plug-and-play” AI is marketing fiction

Vendors love to promise seamless integrations: “just plug in your stack and let the agent work.” In reality, the agent did not figure this out on its own. It required an engineer working alongside the model, and the process took two days instead of two hours.

What this means for engineering leadership:

When evaluating AI tools, look beyond the core features. Calculate the integration cost for your specific environment—including versions, auth methods, network policies, and security constraints. A demo in a clean sandbox tells you very little about how it will run in your VPC.
Ask vendors for clear documentation on their integration dependencies. Knowing how a tool handles authentication under the hood is not pedantry—it is understanding the technical debt you are adopting.
Define who is responsible for verifying the assumptions your AI agents operate on. The agent cannot validate its own context.

Significant productivity gains are real, but they do not come from buying a SaaS license. They happen when a skilled engineer uses a capable tool while understanding its limits.

For R&D leaders: who pays for the efficiency gap

We are all under pressure to deliver massive productivity multipliers. Some of these expectations are realistic; many are not.

This integration struggle shows where the limits lie:

AI speeds up tasks that fit its training data. When a tool combination does not work out of the box, the AI is not going to rewrite the codebase to support your custom setup.
Humans still make (and pay for) the decisions. Deciding to scrap an approach, pivot to a proxy, or abandon a tool are engineering decisions, not prompts.
A vendor promising plug-and-play is essentially selling you a lottery ticket. They hope your infrastructure matches the specific scenarios they tested. When it does not, your team pays the price in engineering hours.

High productivity happens when a competent human stands next to a competent tool and both understand where the boundaries lie.

What I’m taking out of the lab

Reasoning models are a distinct category, not just larger versions of standard LLMs. They require a different approach. However, their answers are still hypotheses that need validation. Claude identified the authentication mismatch, but my assumption that updating Hermes would fix it was wrong—something I should have caught by reading the docs.
Integrating AI into an existing stack is a major task, not a minor configuration step. AWS Bedrock, Hermes, and Bedrock Mantle all handle authentication differently. Connecting them requires traditional software engineering.
Traditional engineering expertise is not going away. It is becoming the point of verification for AI proposals. Without human oversight, we risk creating a new class of bugs justified by “the AI said it would work.”

A final question to consider:

How many hours did your team spend last month repeating tasks because the AI confidently said “try again”?

Be honest. If that number is significant, the issue is not just the tool. It is that we are not giving the models—or ourselves—the space to step back and analyze the architecture. And that is a human problem, not an algorithmic one.

Igor Goldshmidt is a Quality Engineering Advisor and Consultant. He writes about practical AI implementation in testing and developer workflows.

Quality is the way!