Agents are using the wrong door
I don't want to choose between invisible tool calls and a hijacked cursor. We need to do better.
Agents have gotten a lot more useful lately. Since the introduction of OpenClaw, they’ve started reaching into the apps and content we actually work in. They could do this before, but OpenClaw has shown us why it’s useful. Before that, we prompted our way to output and then went and did the work ourselves.

Basically, there are two ways an agent can do things for us now.
The first way is through the Model Context Protocol, or MCP. It’s a way for an agent to connect to an app, and for the app to describe what it can do. The agent sends tool calls, the app executes them. Fast, clean, precise. What this setup doesn’t do is bring us humans into the conversation. When the agent does something, we don’t see the work happening in any understandable way. We see a chat line saying a tool was called, or we see nothing at all. The action, and the language it was done in, stay between the agent and the app.
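To make that concrete, here is a rough sketch of what such a tool call looks like. MCP messages are JSON-RPC 2.0, and invoking a tool uses the `tools/call` method. The tool name `crop_image`, its arguments, and the `chat_line` helper are made up for illustration; they are assumptions, not taken from any real app.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request in the shape MCP uses for tool calls."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def chat_line(request):
    """Hypothetical: the one-line summary a chat UI might show the human."""
    return f"Called tool: {request['params']['name']}"


# Hypothetical tool and arguments, not from any real app.
request = build_tool_call(
    1, "crop_image", {"x": 0, "y": 0, "width": 800, "height": 600}
)

print(json.dumps(request, indent=2))  # what the agent and the app exchange
print(chat_line(request))             # roughly what we get to see
```

Everything meaningful sits in the first printout. The second one is about all that reaches us.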
The second way is by operating our device for us. The agent takes over the mouse and keyboard, looks at the screen, clicks the buttons. It works the way we do, so we can follow along. It demos beautifully. But it’s slow, and the Aikido folks made a compelling case that you can’t really secure it. Since the agent is operating on our behalf, it has our computer permissions. Put it in a sandbox and it can’t do anything useful; leave it out and pray it doesn’t do anything stupid.
To me, neither of these feels quite right. Computer use does the work for us, in a human way, but we can’t interact while it happens. MCP does the work for us in a computer way, and we can’t see it happen at all. Both of them replace us. Neither of them works with us.
An agent is an assistant. We should be able to work alongside it, correct it, learn from it.
My colleague Scott and I saw Adobe’s announcement last week and noticed it lined up with something we’ve been designing at work for a few months. The middle-door framing below came out of that conversation.
A middle door
There’s another way into an app’s functionality that neither approach uses. Every app already has it: the service layer between the UI and the internal functions. It’s what the buttons call when we click them. This layer is an integral part of any app, and it already handles permissions and state.
If an agent reaches the app there, a few things fall into place:
It gets access to what the app can do, not just a subset.
Scoping is possible, because apps already know how to scope things for the people using them.
The app can render what the agent is doing, in the same visual language we already speak with it.
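Here is a minimal sketch of those three points, assuming a toy photo app. Every name in it (`ServiceLayer`, `Actor`, `set_exposure`, the scope strings) is hypothetical and not taken from any real product. The idea is only that the UI and the agent call the same function, the same permission check applies to both, and every state change is published so the UI can render it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Actor:
    name: str
    kind: str          # "human" or "agent"
    scopes: set[str]   # what this actor is allowed to do


@dataclass
class ServiceLayer:
    """Hypothetical service layer: the thing the buttons call."""
    state: dict = field(default_factory=lambda: {"exposure": 0.0})
    listeners: list[Callable[[dict], None]] = field(default_factory=list)

    def subscribe(self, listener: Callable[[dict], None]) -> None:
        # The UI registers here so it can render every change.
        self.listeners.append(listener)

    def set_exposure(self, actor: Actor, value: float) -> dict:
        # Same scoping rule for humans and agents.
        if "edit:adjustments" not in actor.scopes:
            raise PermissionError(f"{actor.name} may not edit adjustments")
        previous = self.state["exposure"]
        self.state["exposure"] = value
        event = {
            "action": "set_exposure",
            "actor": actor.name,
            "actor_kind": actor.kind,
            "previous": previous,
            "value": value,
        }
        for listener in self.listeners:
            listener(event)
        return event


app = ServiceLayer()
timeline: list[dict] = []        # stand-in for the UI's visible history
app.subscribe(timeline.append)

designer = Actor("Ana", "human", {"edit:adjustments"})
assistant = Actor("assistant", "agent", {"edit:adjustments"})
viewer_bot = Actor("viewer-bot", "agent", set())

app.set_exposure(designer, 0.3)   # a slider drag in the UI
app.set_exposure(assistant, 0.5)  # the agent, through the same door

try:
    app.set_exposure(viewer_bot, 2.0)  # out of scope, so rejected
except PermissionError as err:
    print(err)

for event in timeline:
    print(event["actor_kind"], event["action"], event["previous"], "->", event["value"])
```

Because the agent's change lands in the same history as the slider drag, the person can see it, and undoing or overriding it is just another call to the same function.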
Adobe showing the way?
Firefly AI Assistant works like this: you describe what you want, and it orchestrates across Photoshop, Premiere, Lightroom, Illustrator. The assistant asks contextual questions. It surfaces decisions. You can step in at any point to guide or override.
This is where the speed comes from. Real-life work rarely happens inside a single app. We move between them constantly, carrying context in our heads. An agent that can cross those boundaries on our behalf is doing something much closer to the actual shape of our work. Not automating a single step faster, but removing the handoffs between them.
While doing this, the app can show fine-tuning controls, let the user see which actions were taken, and let them override individual parameters to their liking. The creator stays in the loop because the loop has a shape the creator can read and manipulate.
What’s next?
Don’t get me wrong, a lot of work can be automated, and should be. Data moving between systems, reports generating themselves, inboxes triaging on their own. That work can happen in the background. We don’t need to watch it happen.
But some work is different. The work where we’re thinking, shaping, deciding, making something that didn’t exist before. Things like creative work, strategic work, anything where the point is the judgment we bring to it. That work needs us in the loop. Not as approvers at the end, but as participants the whole way through. And that’s where the middle door matters.
Apps should grow a new front door for agents, speaking to the same service layer, producing the same state changes, rendering into the same UI. Not as a separate product. Not as a side-panel chat. As another way into the same room.
What that looks like is still to be explored. Cross-app work doesn’t have a clean UI paradigm yet. How does an agent show us what it’s doing across three apps at once? How do we reach in and adjust mid-flow without messing things up? I don’t have the answers yet. We have the existing grammar of each app to borrow from, but the shape of this will come from building things and seeing what holds up.
What we do know is that the answer isn’t more chat! Text can describe the outcome we want, but it can’t be the surface we work on. Working through a chat window scales badly the moment the work gets visual, spatial, or complex, which is most of the work worth doing. Whatever this middle door ends up looking like, it won’t be a conversation. It’ll be an interface.
Update, response from Scott:
Perhaps the only thing I would push on is that I think the middle door helps people map how that changes the paradigm of either the “frontdoor” for use and “backdoor” for data. I’d argue we need a new room devoted to human and agent collaboration and we don’t have well defined canvases or protocols for that but I think the ingredients are there. Right now people are stuck in exactly the trap you articulated - replacement or simulacrum of the current process, but that approach will very quickly prove inferior to human alone and vastly inferior to a designed human + AI collaboration.
