Thoughts on MCP
Let's say you want to integrate an external API into your application. You look up the documentation, understand how the API works, and then write code against it. This happens at design-time — your application is now permanently coupled to that specific API. Obviously!
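To make "design-time coupling" concrete, here's a minimal sketch of a hand-written integration. The endpoint, payload shape and field names are all hypothetical — the point is that every one of them was decided by a developer reading the docs, and is now baked into the application:

```python
import json

# Hypothetical endpoint, hard-coded at design-time after reading the docs.
API_URL = "https://api.example.com/v1/messages"

def build_post_message_request(channel: str, text: str) -> dict:
    """Assemble the HTTP request we would send (actual sending omitted).

    The method, URL, headers and body shape are all fixed in code —
    the application cannot discover or adapt to a different API at runtime.
    """
    return {
        "method": "POST",
        "url": API_URL,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"channel": channel, "text": text}),
    }
```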
But LLMs could understand and interact with arbitrary endpoints. This suggests we could shift the burden of understanding an API from the developer to the LLM, or in other words, from design-time to runtime. So why can't we connect AI applications to arbitrary software while they're running?
Unfortunately, there is no consensus yet on how applications should collect API metadata, or conversely, how API providers should expose it. Put simply, APIs expect the caller to already know how they work. And this is starting to become a limitation.
The missing piece seems to be some kind of open meta-protocol that makes the API metadata and semantics available to the caller at runtime. The Model Context Protocol (MCP) by Anthropic is the most prominent effort in that direction. I'll simplify it drastically to bring the main point across:
Let's say we're building an LLM chat application. To make the model more useful, we'd like to give it access to our Slack channels. To do this we'll use good old function calling: We add Slack API functions to our application and then pass their metadata (along with our chat prompts) to the LLM. The model may (at some point) decide it would be useful to call one of these functions — this happens at runtime.
However, the decision to include the Slack API functions into our application (either by writing them ourselves or importing a library) happens at design-time:
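The design-time half of this setup might look like the following sketch. It uses the JSON-schema style of tool metadata that most LLM APIs accept, but the function name, schema layout and dispatcher are illustrative rather than any specific vendor's API:

```python
# A stand-in for a real Slack API call -- written (or imported) by the
# developer at design-time.
def read_slack_channel(channel: str) -> str:
    return f"(messages from {channel})"

# Metadata passed to the LLM alongside the chat prompts, so the model
# knows the function exists and how to call it.
TOOLS = [{
    "name": "read_slack_channel",
    "description": "Read recent messages from a Slack channel.",
    "parameters": {
        "type": "object",
        "properties": {"channel": {"type": "string"}},
        "required": ["channel"],
    },
}]

# Maps function names back to local implementations.
REGISTRY = {"read_slack_channel": read_slack_channel}

def dispatch(call: dict) -> str:
    # At runtime, the model may return something like
    # {"name": ..., "arguments": {...}} and we execute the local function.
    return REGISTRY[call["name"]](**call["arguments"])
```

Only `dispatch` runs at runtime; `TOOLS` and `REGISTRY` were fixed when the application shipped.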
MCP adds plugin capabilities to function calling. It does so using an MCP client, which can connect to any arbitrary MCP server at runtime — similar to how the HTTP client inside your browser can connect to any HTTP server while the browser is running.
The difference is that HTTP servers (usually) transfer back the content of a website, while MCP servers transfer back information about their functions and what they mean — so that the client can call them later.
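That exchange can be sketched as JSON-RPC 2.0 messages. `tools/list` and `tools/call` are the actual MCP method names, but the tool itself and the payload details below are simplified and hypothetical:

```python
import json

def list_tools_request(req_id: int) -> str:
    # The client asks an arbitrary server: "what can you do?"
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "method": "tools/list"})

def list_tools_response(req_id: int) -> str:
    # The server describes its functions -- name, meaning, input schema --
    # so the client can decide to call them later.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "result": {"tools": [{
            "name": "read_slack_channel",  # hypothetical tool
            "description": "Read recent messages from a Slack channel.",
            "inputSchema": {
                "type": "object",
                "properties": {"channel": {"type": "string"}},
            },
        }]},
    })

def call_tool_request(req_id: int, name: str, args: dict) -> str:
    # Later, the client invokes one of the discovered functions.
    return json.dumps({
        "jsonrpc": "2.0", "id": req_id, "method": "tools/call",
        "params": {"name": name, "arguments": args},
    })
```

None of this requires the client to know the server's functions in advance — the metadata arrives at runtime.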
The key insight is: Because this can happen at runtime, the user (NOT the developer) can add arbitrary functionality to the application (while the application is running — hence, runtime). And because this also works remotely, it could finally enable standardized b2ai (business-to-AI) software!
The general idea of the MCP is similar to the Language Server Protocol (LSP) in that messages get exchanged over a stateful, bidirectional connection. This allows for powerful features, but also makes certain things more complicated. For example, the set of actions available to a client might depend on the state the client is in — similar to how the set of available links and buttons on a website depends on the state the user is in. E.g. you won't be able to click the "Go to checkout" button if your basket is empty. If you click some link, the response you get from a server ideally includes not only the new content, but the new set of actions as well.
But when an MCP client makes a function call request, the server cannot include the new set of actions in the response — the response must only include the result of the function call. MCP does support changing the set of available functions at runtime, but we have to send a separate notification for that. So we need to keep the bidirectional connection open on the side — and have to handle synchronization.
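The synchronization burden looks roughly like this: the client caches the tool list and must invalidate it whenever the server pushes a change notification over the open connection. `notifications/tools/list_changed` is the real MCP method name; the cache and server classes are hypothetical stand-ins:

```python
class FakeServer:
    # Hypothetical server whose tool set changes while the client runs.
    def __init__(self):
        self.tools = ["read_slack_channel"]

    def list_tools(self):
        return list(self.tools)

class ToolCache:
    """Client-side cache that must be kept in sync by hand."""

    def __init__(self, server):
        self.server = server
        self.tools = server.list_tools()  # initial `tools/list` fetch
        self.stale = False

    def on_notification(self, method: str):
        # Arrives out-of-band on the open bidirectional connection --
        # NOT inside a function-call response.
        if method == "notifications/tools/list_changed":
            self.stale = True

    def current_tools(self):
        if self.stale:  # re-sync before the next model turn
            self.tools = self.server.list_tools()
            self.stale = False
        return self.tools
```

Until the notification arrives, the client happily works with an outdated action set — exactly the kind of state drift a request-response model avoids.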
Now, one can argue that the MCP is designed for more interactive applications (again, like an LSP), and this is true. In practice, however, most applications can be written using a simpler request-response approach, which makes the MCP niche (like WebSockets vs. HTTP) — not a good property to have for an emerging protocol.
The MCP has plenty of other features as well, some of them really interesting. For example sampling, which allows the server to "borrow" intelligence from the client — that way the server can use LLMs without having to implement LLM APIs itself. Few clients support this yet (not even Claude Desktop) and neither do servers — the concept is cool, but it seems hard to come up with use cases. Maybe sampling will be a huge hit, I don't know; predicting what users want is really hard! But every additional feature comes with structure, which makes the protocol more opinionated and thus less flexible.
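What makes sampling unusual is that the request direction is reversed: the *server* asks the *client* to run an LLM completion on its behalf. `sampling/createMessage` is the real MCP method name; the payload below is a simplified sketch:

```python
import json

def sampling_request(req_id: int, prompt: str) -> str:
    # Sent from server to client over the open bidirectional connection.
    # The client (which owns the LLM access) runs the completion and
    # returns the result -- the server never touches an LLM API itself.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "sampling/createMessage",
        "params": {
            "messages": [{"role": "user",
                          "content": {"type": "text", "text": prompt}}],
            "maxTokens": 100,  # simplified; the spec allows more options
        },
    })
```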
Earlier I simplified by saying "functions", but the MCP does not actually use that term. Instead, it has tools, prompts and resources. In essence, all of these are functions on the server — they are just different "flavors". Each flavor has slightly different use cases and rules: prompts can only be invoked by the user, not by the LLM, while tools, on the other hand, can't be invoked by the user. Resources can have callbacks to the client, but tools can't (except for indicating progress). The input and return types are also strictly defined per flavor, e.g. you can't return a list of images from a tool, only a single image.
All that being said, I have a huge incentive to see this succeed, as I've spent 100+ hours building my own MCP framework (and will continue to do so). I also want to give Anthropic lots of credit — they are doing pioneering work and everything is open-source!
But I've still been asking myself — could we come up with a better approach? I propose a web-like alternative in Part 2 of this post.