Self-Serve Infrastructure Is Still a Myth

Self-serve infrastructure replaces the ops ticket with IAM confusion, undocumented dependencies, and a Slack message to the one person who knows.

The pitch is compelling. Any developer can spin up a database, provision a staging environment, or deploy a service without raising a ticket or waiting on the operations team. The platform handles it. You just click a button.

In practice, the button leads to a wall.

The wall might be an IAM policy that doesn't cover your team's workload. A Terraform module that works in theory but needs a specific VPC configuration nobody documented. A networking dependency that wasn't mentioned in the runbook. The operational friction didn't disappear. It just moved deeper into the system, where it's harder to see and harder to ask about.

The result is always the same: a Slack message to the one person who actually knows how this works.

The permission maze nobody wants to own

IAM is where self-serve initiatives go to die quietly.

AWS, Azure, and GCP combined expose over 21,000 unique assignable permissions. Nobody understands all of them. Developers certainly don't, and most platform teams don't either. What gets built instead is a patchwork of policies that work for the teams who were present when the platform was set up, and silently fail for everyone else.

According to a survey of developers, 66.7% find IAM configuration overly complex, with 53.3% citing ambiguous permission boundaries as their primary challenge. When the access model is too complex for the people operating it to understand, it signals a system design failure, not a training deficit. "Self-serve" becomes "spin up and hope."

The downstream effects compound. Under time pressure, 60% of developers admit to reducing security measures, taking wider permissions than needed because requesting the correct scoped role takes too long. Service accounts accumulate unchecked privileges. Studies consistently show over 70% of data leaks trace back to excessive access privileges. Self-serve without guardrails doesn't just create friction: it creates security debt that compounds silently until something breaks badly.
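One way to keep "wider permissions than needed" from slipping through is a guardrail that inspects policies before they are applied. The sketch below is illustrative only: the policy shape loosely follows AWS's JSON format, but the rule set and the example policy are invented, not any real platform's.

```python
# Minimal sketch of a pre-apply guardrail that flags over-broad
# IAM-style policy statements (wildcard actions or resources).
# Policy shape loosely follows AWS's JSON format; the rules and
# example data here are hypothetical.

def overly_broad_statements(policy: dict) -> list[dict]:
    """Return statements granting wildcard actions or resources."""
    flagged = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        if any(a == "*" or a.endswith(":*") for a in actions) or "*" in resources:
            flagged.append(stmt)
    return flagged

# A policy written under deadline pressure: one scoped statement,
# one "just make it work" wildcard.
rushed_policy = {
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {"Effect": "Allow", "Action": "sqs:SendMessage",
         "Resource": "arn:aws:sqs:eu-west-1:123456789012:orders"},
    ]
}

print(len(overly_broad_statements(rushed_policy)))  # → 1 (the s3:* statement)
```

A check like this doesn't replace scoped roles, but it turns "spin up and hope" into a fast, automated no at the point where the debt would otherwise be created.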

The portal that isn't self-service

Many organisations solved the UX problem before they solved the infrastructure problem.

They installed Backstage, or a commercial equivalent, added a clean interface, and announced that self-serve infrastructure was live. The portal looks like a product. The catalogue is populated. The "create service" button is right there.

What the portal doesn't advertise is that clicking that button triggers a Jira ticket to the platform team, or calls a Terraform module that hasn't been updated in eight months, or provisions a resource in the wrong account because the team's workload profile wasn't accounted for when the template was written.

CNCF has documented this pattern directly: building the portal before the platform creates a self-service facade rather than actual self-service. The interface exists. The underlying automation doesn't. The developer clicks confidently, and then waits exactly as long as they would have with a ticket.

Backstage, the most widely adopted open-source IDP, is explicit about this: it is a framework, not a finished product. Implementing it requires dedicated platform engineering effort, ongoing plugin maintenance, and significant customisation. That's a reasonable trade-off for large organisations with dedicated platform teams. It is not self-serve out of the box, and presenting it as such sets expectations the implementation can't meet.

Why tribal knowledge keeps winning

Infrastructure is underdocumented by default. The person who set up the Kubernetes cluster in 2022 has since left. The reasons behind specific network topology decisions live in a Notion page that was last edited eighteen months ago. The "right" Terraform module for a new microservice is whichever one the senior platform engineer points you to in your first week.

76% of organisations report that their software architecture's cognitive burden creates developer stress and reduces productivity. Most of that cognitive burden isn't in writing code. It's in understanding which of the seventeen available options for spinning up a service is the current approved one, what you're allowed to configure yourself, and which change will require a review from someone in the platform channel.

Tribal knowledge fills the documentation gap because documentation is always behind reality. It's a structural problem, not a discipline problem. The organisations that fight it most effectively don't just write better runbooks: they reduce the number of decisions developers need to make by shipping fewer, more opinionated options.

What self-serve actually requires

Real self-serve infrastructure has three properties that most platforms don't deliver together.

Opinionated defaults that work for 80% of cases. A developer should be able to provision a production-ready service without understanding the underlying network topology, IAM model, or storage configuration. The platform makes reasonable choices on their behalf, based on workload type, and documents the defaults clearly. Not a menu of options. A known-good path.
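To make "a known-good path" concrete, here is a minimal sketch of what opinionated defaults can look like in code: the developer supplies a name and a workload type, and the platform fills in everything else from a documented profile. Every profile value here is invented for illustration.

```python
# Hypothetical sketch of opinionated defaults: the developer chooses
# a workload type; the platform supplies the rest from a known-good
# profile. All values are illustrative, not a real platform's.

DEFAULT_PROFILES = {
    "web-service": {
        "cpu": "500m", "memory": "512Mi", "replicas": 2,
        "network": "internal-vpc", "storage": None,
    },
    "batch-job": {
        "cpu": "2", "memory": "4Gi", "replicas": 1,
        "network": "internal-vpc", "storage": "scratch-100Gi",
    },
}

def provision_request(name: str, workload_type: str, **overrides) -> dict:
    """Build a provisioning request from profile defaults plus overrides."""
    profile = DEFAULT_PROFILES.get(workload_type)
    if profile is None:
        # The edge case gets a signpost, not a dead end.
        raise ValueError(
            f"No default profile for {workload_type!r}; "
            "see the documented escape-hatch process."
        )
    return {"name": name, "type": workload_type, **profile, **overrides}

req = provision_request("checkout-api", "web-service")
# The developer made two decisions; the platform made the rest.
```

The point isn't the data structure. It's that the defaults are executable and versioned, so "which option is the approved one" has exactly one answer.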

Guardrails that enforce policy without requiring understanding. Permissions should be granted by role, not requested by ticket. When a developer deploys a web service, the platform knows what access that workload needs and provisions it automatically within defined bounds. The developer does not need to understand IAM policy syntax to get a correctly scoped service account.
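Granting permissions by role rather than by ticket can be sketched as a lookup the platform owns: the workload type determines the service account's scope, and the developer never touches policy syntax. The role-to-permission mapping below is invented for illustration.

```python
# Hedged sketch: scoped access derived from the workload's role.
# The permission strings and mapping are hypothetical examples of
# what a platform team might maintain, not a real scheme.

ROLE_PERMISSIONS = {
    "web-service": ["logs:write", "metrics:write", "secrets:read:own-namespace"],
    "batch-job":   ["logs:write", "metrics:write", "objects:read-write:own-bucket"],
}

def service_account_for(workload_type: str) -> dict:
    """Return a correctly scoped service account for a known workload type."""
    perms = ROLE_PERMISSIONS.get(workload_type)
    if perms is None:
        raise ValueError(f"unknown workload type: {workload_type!r}")
    return {"workload": workload_type, "permissions": perms}

account = service_account_for("web-service")
# Scope comes from the role; no developer ever requested it.
```

Because the mapping lives in one place, widening a permission is a reviewed change to platform code rather than an ad-hoc grant under deadline pressure.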

Documented escape hatches for the edge cases. The 20% of workloads that don't fit the default path still need a way forward. That way forward should be documented, not tribal. An engineering team that hits the edge of the golden path should find a signpost, not a dead end and a Slack handle.

The CNCF case study from Zepto illustrates what this looks like in practice. Before investing in proper platform infrastructure, their 500+ developers across 20 teams spent days manually onboarding microservices and configuring CI/CD pipelines. After implementing standardised templates and workflows with genuine automation underneath, teams could self-serve without operational dependencies. The portal didn't change. The automation underneath it did.

The baseline question

Before claiming self-serve infrastructure is live, it's worth asking the question from the previous piece in this series: if you made the platform optional tomorrow, would developers still use it?

If the honest answer is no, the self-serve story isn't ready. Developers are using it because they have to, not because it removes friction faster than the alternative. Mandated adoption hides the gap between the promise and the reality.

The goal isn't a self-serve portal. It's infrastructure that a developer can provision in under five minutes without any prior knowledge of how it was built, who owns it, or which Slack channel to ask in when something goes wrong.

Forge's POV: deploy shouldn't require an operator

The same principle that applies to infrastructure applies to deployment. The promise of Forge's developer platform is that shipping a change shouldn't require understanding the hosting layer.

Git push. Branch deploys. Preview environments that work in production conditions. No ticket to request a staging slot. No IAM role to configure before you can see your change live.

That's the same opinionated default model that makes infrastructure genuinely self-serve. The developer makes decisions about their code. The platform makes decisions about everything else, and makes them consistently, within documented bounds.

Self-serve infrastructure is not a portal. It is a set of decisions made in advance, on behalf of developers, so they never have to make them at all.