It all began with the DevOps movement, which mainly aimed to increase delivery speed and efficiency by introducing new ways of managing the Software Development Life Cycle (SDLC) and production operations, while also easing the tension between system admins and developers. The movement gave the industry many new practices and changed its processes, but it also led organisations to add new DevOps departments, which, paradoxically, meant adding one more layer between infrastructure teams (system admins) and developers.
Although the DevOps movement has not always been implemented according to its ideals, it helped us acknowledge the problems and look for new practices and tools to address them. Some organisations tried to stay closer to those ideals: not by creating a new department, but by embedding DevOps roles inside the delivery teams. This approach resulted in smoother operations, but still did not fully empower delivery teams.
In enterprises, security and compliance requirements became additional obstacles to implementing those ideals. Delivery teams that have the required DevOps capabilities, whether as separate roles or as T-shaped developers, still need to interact with the system administrators (sysadmins) to provision an infrastructure element, gain insights and metrics, or dig deeper into the infrastructure to see what is really going on. These needs are sometimes addressed by giving the teams "read-only" users in production environments, or through internal ticketing and issue management systems, but neither is as efficient as it should be.
In summary, infrastructure can be managed by sysadmins, even following Infrastructure as Code (IaC) principles, and there may still be gaps, unmet requirements, and custom-built components tailored to specific needs. Friction remains.
Let’s make an analogy: imagine you are a platform developer for an AWS PaaS such as RDS, an easy-to-manage relational database service. As a platform developer, you are responsible for setting the platform standards and developing the functions that customers request. You develop and ship those functions.
Customers of the platform are entitled to consume the platform based on the functions that are in place. You provide necessary interfaces to developers (delivery teams) to interact with the platform, such as APIs, CLIs, SDKs. You are responsible for maintaining the production environment of the platform, and customers using the platform are responsible for the layers that should be defined in advance. Let’s refer to the generic IaaS vs PaaS vs SaaS segregation of responsibilities:
In this scenario, we are providing a PaaS to AWS customers and they are responsible for the Data and the Applications running on it (refer to the PaaS column). And it’s obvious that this PaaS (RDS in our scenario) is a product.
Which brings us to the question: why can’t platforms custom-built for a specific company's needs also be approached as a product? We see a similar pattern:
Similar to the diagram above, a platform team and a delivery team may define a more granular segregation of responsibilities between them:
Regarding the last point, access for troubleshooting: this has been a grey area between developers and sysadmins wherever platform engineering practices are not yet implemented. Am I the only one who has noticed that developers are given some level of Secure Shell (SSH) or other access for troubleshooting? Granting this access may simply indicate poor observability maturity, but solving troubleshooting needs by giving developers system-level access already breaks the traditional model, in which developers have access to non-production environments and sysadmins have access to, and full control over, the production environment.
Did that model work as intended? As I mentioned, we now often see developers given some degree of access to production environments as well, usually system-level access like:
In summary, the traditional system is already broken:
Does the AWS RDS team give you system-level access to production? (I am excluding RDS Custom, which is another story.)
The platform engineering way can be seen as a horizontal line of responsibilities, passing through all Dev-Test-Prod environments for both developers and system admins, which I have tried to illustrate in the graph below:
To elaborate, let’s also add some generic responsibilities for each team, as they are now responsible for all environments (Dev-Test-Prod) within the defined borders. I have tried to illustrate this as follows:
In this illustration, both teams have Build, DevOps, Monitor and On-call responsibilities. But aren’t those duties overlapping? In practice they may depend on each other, but they should not overlap. I have tried to define the differences below:
| | Delivery Teams | Platform Team |
| --- | --- | --- |
| Build | Mainly developing and building the applications, which may use compiled or interpreted languages | Building the platform, which uses declarative languages and may also involve compiled or interpreted programming languages, APIs, dashboards … |
| Deploy | Deploying the application to an environment | Deploying the platform from Dev to Test or to Production |
| Monitor | Monitoring the application metrics | Monitoring the platform resource metrics |
| On-call | On-call for application incidents that do not depend on the platform | On-call for platform errors, not for application incidents |
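The on-call split described above can be sketched as a simple routing rule. This is only an illustration: the `Incident` type, the `failing_layer` field and the `on_call_owner` function are hypothetical names, and a real setup would derive the failing layer from monitoring data rather than take it as an input.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    APPLICATION = "application"
    PLATFORM = "platform"


class Team(Enum):
    DELIVERY = "delivery team"
    PLATFORM = "platform team"


@dataclass(frozen=True)
class Incident:
    summary: str
    failing_layer: Layer  # hypothetical field: the layer where the root cause sits


def on_call_owner(incident: Incident) -> Team:
    """Route an incident to exactly one team, so on-call duties never overlap."""
    if incident.failing_layer is Layer.PLATFORM:
        # Platform errors belong to the platform team, whichever application is affected.
        return Team.PLATFORM
    # Application incidents that do not depend on the platform stay with the delivery team.
    return Team.DELIVERY
```

Because every incident maps to a single owner, the two teams can depend on each other without sharing a pager.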
Now, let’s go one step further and iterate on the previous illustration:
The difference: ideally, delivery teams are not expected to access the non-production environments of the platform. Would it be acceptable for you to access the non-production environments of the AWS RDS team?
Ideally, the platform team does not know which application environment is running on its production environment. Are AWS RDS platform engineers aware of whether your RDS database is a production environment or a test environment? I don’t think so, and technically they cannot know this either.
I also added “Platform unit testing”, which is similar to (although technically different from) application unit testing. “Application unit testing” checks whether a new commit integrates with the rest of the codebase; similarly, “Platform unit testing” applies a certain level of testing to platform changes. We can also add non-functional CI-level tests for compliance, security, performance or FinOps, similar to the non-functional tests developed for application code.
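To make the idea of a CI-level compliance test more concrete, here is a minimal sketch. It assumes the platform describes a database as a plain dictionary; the field names (`storage_encrypted`, `publicly_accessible`, `backup_retention_days`) and the three rules are illustrative assumptions, not the schema of any specific tool.

```python
from typing import Any


def check_database_definition(definition: dict[str, Any]) -> list[str]:
    """Return the compliance violations found in a declarative database definition.

    Field names and rules are hypothetical; adapt them to your own platform.
    """
    violations: list[str] = []
    if not definition.get("storage_encrypted", False):
        violations.append("storage must be encrypted")
    if definition.get("publicly_accessible", False):
        violations.append("database must not be publicly accessible")
    if definition.get("backup_retention_days", 0) < 7:
        violations.append("backups must be retained for at least 7 days")
    return violations


# Example definition as a platform pipeline might see it (illustrative values).
candidate = {
    "storage_encrypted": True,
    "publicly_accessible": True,
    "backup_retention_days": 14,
}

for violation in check_database_definition(candidate):
    print(f"compliance violation: {violation}")
```

A check like this can run on every platform commit and fail the pipeline when the returned list is not empty, much like a failing application unit test would.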
In this latest picture, just as the AWS RDS team provides the necessary interfaces (Dashboard, API …) to AWS customers (delivery teams), we as a platform team are expected to provide similar deliverables to our internal delivery teams: all the dashboards and interfaces needed to consume, monitor and manage the platform.
The rise of the IDPs (Internal Developer Platforms)
Platform teams serving multiple delivery teams have been developing custom dashboards through which developer teams can interact with the platform, like the Dashboard AWS gives to its customers. Most of these custom dashboards are developed internally, but one company, Spotify, decided to open-source its IDP, which is called Backstage. Although there are alternatives on the market that I am still experimenting with, in this blog post I will refer only to Backstage as an illustration later on.
First, let’s consider the needs of the developers. When it comes to an organisation where several delivery teams are working on the platform, the following fundamental capabilities are usually required:
Additionally, from the organisation's perspective, the following requirements need to be considered:
So how can platform engineers address those? The platform team is developing a product for delivery teams, and it can address some of those needs by offering Golden Paths. To reduce the friction between operations and delivery teams (developers), an Internal Developer Platform helps delivery teams be more confident with the platform and benefit from its self-service functions, creating the following value across teams.
This list is not limited to the capabilities I have described above, and can be extended with new capabilities that support the autonomy of delivery teams. The platform is a product, a product needs a customer-driven approach, and developers are the customers: platform teams need to prioritise their requests and add new capabilities over time.
Every product has a functional and a nonfunctional backlog. Functional items serve developers directly, while nonfunctional items (stability, availability, security, scaling, and patching) also need to be worked on, even without a developer asking for them.
To make things more tangible, I am switching to the real-world capabilities of one of the most popular IDPs today: Backstage. Here are some of the most interesting Backstage plugins, tried and tested:
Using pgvector/AWS/Anthropic in the background, you provide your developers with an AI assistant for project-related queries. This is open to extension with more data sources to enhance the AI assistant capabilities:
Reference Link: https://roadie.io/backstage/plugins/ai-assistant-rag-ai/
Argo is widely used in the industry, especially with AWS/EKS. This plugin lets developers see the status of the projects, whether they are healthy/synced or not:
Infrastructure-as-code (IaC) is a key platform approach and widely used for projects with AWS because of its close integration. This plugin allows developers to see the status of the changes triggered by Terraform:
The value of having the AWS Code family on Backstage is to provide developers with a central dashboard without requiring a login to the AWS Dashboard. By adding a CodeBuild annotation, all CodeBuild projects can be made visible on Backstage:
People sometimes refer to Serverless Lambda as lacking observability, but there are now several solutions available. This plugin does not solve end-to-end observability but gives the status of AWS Lambdas and last modification times:
Teams that did not opt for Kubernetes usually prefer using ECS. This plugin provides version, status, and last updated fields to the developers:
Vault is a preferred tool for managing secrets, not only for its secret management features but also for its easy integration with several programming languages. By referring to its “trust triangle”, you can integrate it with “AWS Secrets Manager”. The plugin deliberately avoids direct editing, for compliance reasons, and instead offers links to the Vault:
In conclusion, developers are now able to see/manage several functions, which will:
The Backstage plugin landscape is growing, and it is exciting to see new plugins being published frequently.
Please let me know which plugin you would like to see on Backstage. You can find me on Platform Engineering Community Slack as Dorian.