Part 1 summarized my overall thoughts on the MCP — in this second part I propose an MCP alternative that is closer to the principles of the early web.
A web for AI
While I previously compared the MCP to HTTP, it's actually much more similar to FTP.
You see, in pre-web FTP, you couldn't just do:
ftp server.com/folder/readme.txt
Instead, you had to connect to the server root with:
ftp server.com
which established a bidirectional, stateful channel. Only then could you navigate to the folder and download the file — if you knew where it was; otherwise you had to request file metadata from the server using `ls`. (The MCP is similar in a sense: there is one root entry point through which all functionality gets discovered and consumed.)
That is, by the way, one of the main reasons why HTTP had to be invented in the first place and why FTP was disregarded as a transport for HTML — it was too heavyweight: you couldn't fetch arbitrary files with a simple request/response pattern, and you couldn't link from file to file.
When the web enabled linking to a random page, this actually created a new problem: the user arriving at that page might have no idea where they were or where to go next! In contrast, Telnet and FTP sessions were stateful, and Gopher's hierarchical menus also provided structural context. Only HTTP allowed you to hop to any URL directly, so a page had to convey its own context for the arriving user.
Since all websites in 1993 were server-rendered, this problem was basically solved with a version of HATEOAS (Hypermedia As The Engine Of Application State).
With HATEOAS, a client needs to have little to no knowledge about how to interact with an application. It's an architectural pattern that (if used correctly) completely decouples client and server — the server can evolve arbitrarily and the client never breaks.
HATEOAS says a server should respond
- with all actions a client can take
- with all data the client needs in order to take those actions
- to a client who knows nothing about either
Here is an example:
<html>
<head>
<title>Account 42</title>
</head>
<body>
<div>Account 42</div>
<div>Balance: <strong>100.00 USD</strong></div>
<div>
<a href="/accounts/42/transfer">Transfer money</a>
<a href="/accounts/42/withdraw">Withdraw money</a>
</div>
</body>
</html>
The first 2 requirements are clearly visible:
- There are 2 links (actions)
- There is the data the client needs in order to invoke the correct action.
But notice the last requirement - it says the client has no prior knowledge of either data or actions. If you think about this for a minute, it implies a generic client that can understand anything at runtime. And who is the only such client in the world?
Humans! Only we have the agency to understand arbitrary data and then invoke arbitrary actions at runtime. And because we suck at reading JSON, the server sends us HTML, which the browser can render. That's why in practice, HATEOAS only makes sense with HTML.
Unfortunately, there's been lots of confusion because people started using HATEOAS with JSON. Even the HATEOAS Wikipedia article uses JSON as an example. This loses the core benefits of HATEOAS: either you send JSON directly to humans (which, again, we suck at reading - and how do we click the links?), or you build a UI layer for humans on top and make it depend on the JSON from the server. But then you lose the "generic client" requirement, because the direct client of the server is now the UI layer (not the human), and changing actions/data on the server will break it.
There is a great article from 2016 called HATEOAS is for HUMANS. Specifically, this section made it click for me:
I like to turn the client-server relationship around, and consider the human users of a software system as providing Agency As A Service (AAAS) for the server. The server software knows all about the data and what actions are available on that data, but has no idea what the heck to do.
Fortunately, these otherwise bumbling humans show up and will poke and prod the server to provide the agency the server so desperately needs. The server, of course, wants to speak with the humans in a language (hypermedia) that the humans find pleasant, or at least tolerable.
And that language is HTML.
So, you can see: a system satisfying HATEOAS is wasted if the hypermedia isn’t being consumed by something with agency. Humans are that thing, and, therefore for HATEOAS to be effective, the hypermedia needs to be humane.
(...)
Once we have strong AI, maybe the situation changes. But that’s what we’ve got today.
And now, 9 years later, we do have strong AI - which means using JSON could finally make sense!
We could build a web of functionality — agents get both the possible actions and all context they need to take the right actions.
A naive first attempt would be:
{
"state": {
"account": {
"account_id": 42,
"balance": 100.0,
"currency": "USD"
}
},
"actions": {
"transfer_money": "/accounts/42/transfer",
"withdraw_money": "/accounts/42/withdraw"
}
}
But links are for humans; LLMs need JSON schemas:
{
"state": {
"account": {
"account_id": 42,
"balance": 100.0,
"currency": "USD"
}
},
"actions": {
"transfer_money": {
"schema": {
"description": "Transfer money between two accounts. Deduct from one and credit the other.",
"type": "object",
"required": ["from_account_id", "to_account_id", "amount", "currency"],
"properties": {
"from_account_id": {
"type": "integer",
"description": "ID of the source account that the money should be transferred from."
},
"to_account_id": {
"type": "integer",
"description": "ID of the target account that the money should be deposited to."
},
"amount": {
"type": "number",
"description": "Amount to transfer"
},
"currency": {
"type": "string",
"description": "ISO currency code, e.g. USD, EUR"
}
}
},
"href": "/accounts/transfer",
"method": "POST"
},
"withdraw_money": {
// ...
}
}
}
The endpoint may or may not be an RPC endpoint; either way, the LLM application would take the function-call intents the LLM spits out and make the actual network request.
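To make that concrete, here is a rough sketch (in Python, using `requests` and an OpenAI-style tool format — both my assumptions, not part of the proposal) of how an LLM application could turn the `actions` object into tool definitions and then execute a function-call intent:

import requests

def actions_to_tools(actions: dict) -> list[dict]:
    """Turn each action of a response document into a tool the LLM can call."""
    return [
        {
            "type": "function",
            "function": {
                "name": name,
                "description": action["schema"].get("description", ""),
                "parameters": action["schema"],
            },
        }
        for name, action in actions.items()
    ]

def execute_intent(actions: dict, name: str, arguments: dict, base_url: str) -> dict:
    """Resolve a function-call intent back to its href/method and make the request."""
    action = actions[name]
    response = requests.request(action["method"], base_url + action["href"], json=arguments)
    response.raise_for_status()
    return response.json()  # the next state/actions document

Note that the client hard-codes nothing about the API: names, schemas, hrefs and methods all come from the server's response.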
When invoking the `transfer_money` function, the result would be something like:
{
"state": {
"account": {
"account_id": 42,
"balance": 100.0,
"currency": "USD"
},
"pending_transfer": {
"from_account_id": 42,
"to_account_id": 99,
"amount": 40.0,
"currency": "USD"
}
},
"actions": {
"confirm_transfer": {
"description": "Confirm the pending transfer. This will finalize the transfer of funds.",
"href": "/accounts/42/transfer/confirm",
"method": "POST",
"schema": {
"type": "object",
"properties": {},
"required": []
}
},
"cancel_transfer": {
"description": "Cancel the pending transfer. This discards the transfer request entirely.",
"href": "/accounts/42/transfer/cancel",
"method": "POST",
"schema": {
"type": "object",
"properties": {},
"required": []
}
}
}
}
Finalizing the transfer brings us back to the previous page with the corresponding actions:
{
"state": {
"account": {
"account_id": 42,
"balance": 60.0,
"currency": "USD"
},
"result": "Transfer of 40 USD from account 42 to account 99 successful!"
},
"actions": {
"transfer_money": {
// ...
},
"withdraw_money": {
// ...
}
}
}
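Putting the whole flow together, a minimal agent loop might look like the sketch below. It reuses the two helpers from the earlier sketch; `choose_action` is a hypothetical stand-in for whatever LLM call the application makes, and the base URL is made up:

import requests

def choose_action(state: dict, tools: list[dict]):
    """Placeholder for the LLM call: given the current state and the available
    tools, return (action_name, arguments), or None once the goal is reached."""
    raise NotImplementedError

BASE_URL = "https://bank.example.com"  # hypothetical server

doc = requests.get(BASE_URL + "/accounts/42").json()  # initial page: state + actions

while True:
    tools = actions_to_tools(doc["actions"])      # helper from the earlier sketch
    intent = choose_action(doc["state"], tools)
    if intent is None:
        break                                     # nothing left to do
    name, arguments = intent
    doc = execute_intent(doc["actions"], name, arguments, BASE_URL)  # next page replaces the old one

The loop never needs prior knowledge of the API — every response is a complete "page" of state plus actions.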
Think of it as a blend of MCP and HAL/JSON-LD/Hydra with the early web's simplicity and LLMs as clients.
In a sense, this is going back to the roots, similar to how early server-rendered websites worked — each request rendered a new full page with all necessary data and links. This time, though, we would need to be even more thoughtful to always include all the state/context the LLM needs to invoke the right action. With human clients this requirement is not as strict, since they can be assumed to have some out-of-band knowledge.
We could avoid maintaining session data on the server by including the current application state in each request and response, effectively ping-ponging the state back and forth and making the server stateless. This approach might get bulky for large states, but caching or referencing previously sent state could mitigate the overhead. LLM applications could keep deltas of the state in the conversation history to keep track of what has been achieved so far (e.g. `result` in the above example).
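As a rough sketch of what that could look like on the client side (the request shape with a top-level "state" key and the delta helper are my assumptions; the examples above only define the response format):

import requests

def invoke_stateless(action: dict, arguments: dict, last_state: dict, base_url: str) -> dict:
    """Echo the previously received state back to the server along with the
    action arguments, so the server never has to store session data."""
    payload = {"state": last_state, "arguments": arguments}
    response = requests.request(action["method"], base_url + action["href"], json=payload)
    response.raise_for_status()
    return response.json()

def state_delta(old: dict, new: dict) -> dict:
    """Keep only the top-level keys that changed, e.g. to store a compact
    record of progress in the conversation history."""
    return {key: value for key, value in new.items() if old.get(key) != value}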
Now, it could also make sense for human-facing UIs to rely on this same HATEOAS-style server output, because even if that UI breaks when the server changes, the LLM-based client applications don't.
I'm currently building an MCP client and I plan to experiment with such a web-like structure as well — I'll see how it goes.