Writing Test Evals For Our MCP Server
How we test our MCP Server for tool misuse and more with Braintrust

When we launched our MCP server, we knew it needed tests, just like any other piece of software. Since our MCP server has over 20 tools, it’s important for us to know that LLMs can pick the right tool for the job, so tool selection was the main aspect we wanted to test.
LLMs are not very good at picking from a large list of tools: the more tools, the more confused they get. In fact, we recently wrote about auto-generating MCP servers, and one of the reasons we advise against it is that you can easily end up with an MCP server that has too many tools!
(mcp.neon.tech’s home page, where we list all the available tools in our MCP server)
In our MCP server, two tools deserve special attention:
- prepare_database_migration
- complete_database_migration
The LLM uses these tools to perform database migrations (SQL code):
- The “prepare_database_migration” tool starts the process of creating a database migration. It takes the input (SQL code) and applies it to a temporary Neon branch (an instantly created Postgres branch with all of the same data that exists in the “main” branch).
- In the output of that tool, we let the client know what just happened and that, after testing the migration on the temporary branch, they should proceed with the “complete_database_migration” tool.
- The “complete_database_migration” tool completes the migration: it runs the SQL on the “main” branch and then deletes the temporary branch created by “prepare_database_migration”.

This workflow is a bit complex for LLMs. First, it is stateful: our MCP server needs to keep track of which migrations are “pending”. Second, the LLM could easily get confused and just apply SQL database migrations with the “run_sql” tool (which can be used to run any arbitrary piece of SQL).
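To make the statefulness concrete, here is a minimal sketch of what the two tools could look like with the MCP TypeScript SDK. This is not our actual implementation; the createBranch, runSqlOnBranch, and deleteBranch helpers are hypothetical stand-ins for calls to the Neon API.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { randomUUID } from "node:crypto";
import { z } from "zod";
// Hypothetical Neon helpers -- the real server talks to the Neon API.
import { createBranch, runSqlOnBranch, deleteBranch } from "./neon";

const server = new McpServer({ name: "neon-mcp-sketch", version: "0.0.1" });

// The stateful part: the server must remember which migrations are pending.
const pendingMigrations = new Map<string, { sql: string; branchId: string }>();

server.tool(
  "prepare_database_migration",
  "Applies a migration to a temporary branch so it can be tested before touching main.",
  { sql: z.string() },
  async ({ sql }) => {
    const branchId = await createBranch(); // instant copy of the main branch
    await runSqlOnBranch(branchId, sql);   // apply the migration there
    const migrationId = randomUUID();
    pendingMigrations.set(migrationId, { sql, branchId });
    return {
      content: [{
        type: "text",
        text:
          `Migration ${migrationId} applied to temporary branch ${branchId}. ` +
          `After verifying it, call complete_database_migration to apply it to main.`,
      }],
    };
  },
);

server.tool(
  "complete_database_migration",
  "Applies a previously prepared migration to main and deletes the temporary branch.",
  { migrationId: z.string() },
  async ({ migrationId }) => {
    const pending = pendingMigrations.get(migrationId);
    if (!pending) {
      return { content: [{ type: "text", text: `No pending migration ${migrationId}.` }] };
    }
    await runSqlOnBranch("main", pending.sql); // now run it for real
    await deleteBranch(pending.branchId);      // clean up the temporary branch
    pendingMigrations.delete(migrationId);
    return { content: [{ type: "text", text: `Migration ${migrationId} applied to main.` }] };
  },
);
```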
With that in mind, we decided to implement evals for our MCP server. If you’ve never heard of “evals”, you can think of them like tests in regular software engineering. These are evaluations we can use to make sure that an LLM uses our “prepare” and “complete” migration tools, in the right order, when asked to complete a database migration task.
Our MCP server is open source, and the code for our evals can be found here. We use the “LLM-as-a-judge” technique to verify that the LLM-to-MCP interaction we’re testing actually works; the judge prompt we currently use is in the repo.
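As a rough sketch of the technique (not the exact prompt or scorer from our repo), an LLM-as-a-judge scorer built with Braintrust’s autoevals library could look like this; in our setup the judge model is Claude, but the model wiring is omitted here:

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// Illustrative judge prompt -- the real one in the repo is more detailed.
const promptTemplate = `You are grading an AI assistant's handling of a database migration request.

Task: {{input}}
Assistant's tool calls and output: {{output}}
Expected behavior: {{expected}}

Did the assistant prepare the migration with prepare_database_migration (instead of
running arbitrary SQL with run_sql), and does the generated SQL accomplish the task?

Answer with one letter:
A: Yes, the correct tool was used and the SQL matches the intent.
B: No, the wrong tool was used or the SQL does not accomplish the task.`;

// Maps the judge's answer to a 0..1 score; useCoT lets it reason before answering.
export const factualityAnthropic = LLMClassifierFromTemplate({
  name: "factualityAnthropic",
  promptTemplate,
  choiceScores: { A: 1, B: 0 },
  useCoT: true,
});
```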
And then we have our mainBranchIntegrityCheck scorer (more on this later).
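The real check is in the repo; as a minimal sketch of the idea, assuming a hypothetical getMainBranchSchema helper that snapshots the main branch’s schema via the Neon API, it could look something like this:

```typescript
import { getMainBranchSchema } from "./neon"; // hypothetical helper

// Snapshot the main branch's schema before the eval tasks run.
const schemaBeforeTasks = await getMainBranchSchema();

// Custom Braintrust scorer: 1 if the main branch is untouched, 0 otherwise.
export async function mainBranchIntegrityCheck() {
  const schemaAfterTask = await getMainBranchSchema();
  return {
    name: "mainBranchIntegrityCheck",
    score: schemaAfterTask === schemaBeforeTasks ? 1 : 0,
  };
}
```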
Finally, we create an eval using Braintrust’s TypeScript SDK.
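The full eval lives in the repo; condensed, and with a hypothetical callClaudeWithMcpTools helper standing in for the code that runs Claude against our MCP server, it looks roughly like this:

```typescript
import { Eval } from "braintrust";
import { evalCases } from "./cases";               // hypothetical module with the input/expected pairs
import { factualityAnthropic, mainBranchIntegrityCheck } from "./scorers"; // hypothetical paths
import { callClaudeWithMcpTools } from "./runner"; // hypothetical: runs Claude against our MCP server

Eval("neon-mcp-server", {
  // The dataset: each case is an { input, expected } pair.
  data: () => evalCases,
  // The task: let the model loose on the MCP server and capture its tool calls and output.
  task: async (input) => callClaudeWithMcpTools(input),
  // The scorers described above.
  scores: [factualityAnthropic, mainBranchIntegrityCheck],
});
```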
Let’s focus on the input and the expected sections of this first eval (we have 5 evals in total for this scenario).
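The exact values are in the repo; a hypothetical case of the same shape might look like this:

```typescript
// Illustrative only -- not the literal case from our eval suite.
const firstCase = {
  // The user-style request we hand to the model.
  input: "In the 'users' table, add a 'last_login' timestamp column",
  // A prose description of the behavior we expect, which the judge scores against.
  expected:
    "The assistant calls prepare_database_migration with an ALTER TABLE users " +
    "ADD COLUMN last_login timestamptz statement, applies it to a temporary branch, " +
    "and asks the user to verify it before calling complete_database_migration.",
};
```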
This test, or eval, makes sure that the LLM generates the proper SQL for this migration and calls the “prepare_database_migration” tool as expected.
Will the output from the LLM precisely match what we’ve written here in the “expected” field? Of course not! LLMs are not deterministic.
This is why we’re using the “LLM-as-a-judge” scorer to evaluate what the model does with the task we send through our MCP server. For now, we use Claude both as the judge and as the model making the actual MCP tool calls in the test.
Since we’re using Braintrust, we get access to their UI, which allows us to analyze all the test/eval runs.
In fact, we have two “scores”: mainBranchIntegrityCheck and factualityAnthropic. The factualityAnthropic prompt is where all of the “LLM-as-a-judge” logic lives. The mainBranchIntegrityCheck just makes sure that the main branch is unmodified by the first tool call from the LLM we’re testing.
For any given eval run, we can clearly see what went on.
Initially, when we first wrote our MCP server and had only our most “basic” prompts written for these tools, our pass rate on these evals was around 60%. Since then, we’ve tweaked our prompts and reached a 100% pass rate. In fact, we didn’t have to write any “code” to go from 60% to 100%: the only thing we changed was the descriptions (“prompts”) for the two MCP tools we’re testing!
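To give a flavor of the kind of change that moved the needle (these descriptions are illustrative, not the actual ones from our repo), compare a terse tool description with a more directive one:

```typescript
// Before: terse, and leaves the model guessing about when to use this tool vs. run_sql.
const descriptionBefore = "Performs a database migration.";

// After: spells out the workflow and explicitly steers the model away from run_sql.
const descriptionAfter =
  "Use this tool (not run_sql) for schema changes. It applies the migration to a " +
  "temporary branch so it can be verified safely. Once the user confirms, call " +
  "complete_database_migration to apply it to the main branch and clean up.";
```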
Takeaways
The most important takeaway is: if you’re developing an MCP server, write tests! This is just like any other software: without tests, you won’t know if it’s actually working. Finally, we recommend using a managed service like Braintrust so you have a nice user interface and experience for debugging your test runs over time.