Evolution to IDPs and AWS Footprint
Evolution to Internal Developer Platforms (IDPs)
It all began with the DevOps movement, which mainly aimed to increase delivery speed and efficiency by introducing new ways of managing the Software Development Life Cycle (SDLC) and production operations, while also easing the tension between system admins and developers. This movement helped the industry adopt many new practices and change its processes, but it also resulted in new DevOps departments within organisations, which, paradoxically, meant adding one more layer between infrastructure teams (system admins) and developers.
Although the DevOps movement has not always been implemented according to its projected ideals within the industry, it helped us accept the problems and seek options to address those with new practices and tools. Some organisations tried to address DevOps more closely to the projected ideals: not by creating a new department, but rather by injecting DevOps roles inside the delivery teams. Such an approach resulted in smoother operations, but still did not fully empower development delivery teams.
When it came to enterprises, security and compliance issues became additional challenges to implementing the ideals. Delivery teams that have the required DevOps technical capabilities, either as separate roles or as T-shaped developers, still need to interact with the system administrators (sysadmins) to provision an infrastructure element, gain insights and metrics, or get deeper into the infrastructure to see what is really going on. These requirements are sometimes addressed by providing them with "read-only" users in production environments, or through internal ticketing and issue management systems, but these workarounds are still not as efficient as they should be.
In summary, even when infrastructure is managed by sysadmins and the principles of Infrastructure as Code (IaC) are applied, there may still be gaps: unmet and unaddressed requirements, and custom-built components tailored to specific needs. Friction remains.
Transition from the traditional model to the Platform era
Let’s make an analogy: imagine you are the platform developer for an AWS PaaS such as RDS, which is an easy-to-manage relational database service. As a platform developer, you are responsible for setting the platform standards and developing the functions defined by customer requests. You develop and ship those functions.
Customers of the platform are entitled to consume the platform based on the functions that are in place. You provide necessary interfaces to developers (delivery teams) to interact with the platform, such as APIs, CLIs, SDKs. You are responsible for maintaining the production environment of the platform, and customers using the platform are responsible for the layers that should be defined in advance. Let’s refer to the generic IaaS vs PaaS vs SaaS segregation of responsibilities:
In this scenario, we are providing a PaaS to AWS customers and they are responsible for the Data and the Applications running on it (refer to the PaaS column). And it’s obvious that this PaaS (RDS in our scenario) is a product.
Which brings us to the question: why can’t platforms custom-built for a specific company's needs also be approached as a product? We see a similar pattern:
- One platform and multiple teams consuming it
- Developed for customer needs (the delivery team) and regulatory needs (company compliance policies)
Similar to the diagram above, a platform team and a delivery team may define a more granular segregation of responsibilities between them:
- Who is responsible for which observability layers?
- What are the golden paths using the platform?
- What type of access is given to delivery teams for troubleshooting?
Regarding the last point, access for troubleshooting: this has been a grey area between developers and sysadmins wherever platform engineering practices are not yet implemented. Am I the only one who has noticed that some level of Secure Shell (SSH) or other access is given to developers for troubleshooting? Giving this access may simply be an indicator of poor observability maturity, but solving troubleshooting requirements by giving developers system-level access already breaks the traditional model, in which developers have access to non-production environments and sysadmins have access to, and full control over, the production environment.
Did that model work as intended? As I mentioned, we now often see developers also given some degree of access to production environments, usually system-level access like:
In summary, the traditional system is already broken:
- Layers of responsibilities are overlapping
- Company compliance rules are broken
- Sysadmins no longer have full control over production (developer teams may also make changes in production)
Does the AWS RDS team give you access to production at the system level? (I am excluding RDS Custom, which is another story.)
The platform engineering way can be considered as a horizontal line of responsibilities, passing through all Dev-Test-Prod environments for both developers and system admins, which I tried to illustrate in the graph below:
To elaborate, let’s also add some generic responsibilities for each team, as they are now responsible for all environments (Dev-Test-Prod) within the defined borders. I tried to illustrate this as follows:
In this illustration, both teams have Build, DevOps, monitor and on-call responsibilities. But aren’t those duties overlapping? In practice they may depend on each other, but should not overlap. I tried to define the differences as below:
| | Delivery Teams | Platform Team |
| --- | --- | --- |
| Build | Mainly developing and building the applications, which may involve compiled or interpreted languages | Building the platform, which involves declarative languages and may also include compiled/interpreted programming languages, APIs, dashboards … |
| Deploy | Deploying the application to an environment | Deploying the platform from Dev to Test or to Production |
| Monitor | Monitoring the application metrics | Monitoring the platform resource metrics |
| On-call | On-call for application incidents that do not depend on the platform | On-call for platform errors, not for application incidents |
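The on-call split can be sketched as a simple routing rule: each incident goes to exactly one rota, chosen by the layer where the root cause sits. This is only a minimal illustration; the `Incident` type, the `Layer` values and the rota names are hypothetical and not part of any specific tool.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    APPLICATION = "application"
    PLATFORM = "platform"


@dataclass(frozen=True)
class Incident:
    title: str
    layer: Layer  # where the root cause sits, not where the symptom appeared


# Hypothetical rota names; replace with your own paging targets.
ON_CALL = {
    Layer.APPLICATION: "delivery-team-on-call",
    Layer.PLATFORM: "platform-team-on-call",
}


def route(incident: Incident) -> str:
    """Return the single on-call rota that owns this incident."""
    return ON_CALL[incident.layer]
```

Because the mapping has exactly one owner per layer, the two teams may depend on each other during an incident, but the responsibility itself never overlaps.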
Now, let’s go one step further and iterate on the previous illustration:
The difference: ideally, delivery teams are not expected to access the non-production environments of the platform. Would it be acceptable for you to access the non-production environments of the AWS RDS team?
Ideally, the platform team does not know which application environment is running on its production environment. Are AWS RDS platform engineers aware of whether your RDS database is the production environment or the test environment? I don’t think so, and technically they cannot know this either.
I also added “Platform unit testing”, which is similar to (although technically different from) application unit testing. “Application unit testing” checks whether the new commit integrates with the rest of the codebase; similarly, “Platform unit testing” applies a certain level of testing to platform changes. We can also add non-functional CI-level tests such as compliance, security, performance or FinOps checks, similar to the non-functional tests developed for application code.
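As a rough idea of what such CI-level platform checks could look like, the sketch below runs a compliance check and a FinOps check over declarative resource definitions. The resource schema (`type`, `encrypted`, `tags`) and the required tag names are assumptions made for illustration, not the format of any particular IaC tool.

```python
# A resource is a plain dict here; the keys are a hypothetical, simplified schema.
REQUIRED_TAGS = {"team", "cost-center"}  # assumed FinOps tagging policy


def check_encrypted(resource: dict) -> list:
    """Compliance check: databases must declare encrypted storage."""
    if resource.get("type") == "database" and not resource.get("encrypted", False):
        return [f"{resource['name']}: storage must be encrypted"]
    return []


def check_cost_tags(resource: dict) -> list:
    """FinOps check: every resource must carry the required cost tags."""
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    return [f"{resource['name']}: missing tag '{tag}'" for tag in sorted(missing)]


CHECKS = [check_encrypted, check_cost_tags]


def run_platform_checks(resources: list) -> list:
    """Run every check against every resource and collect the findings."""
    findings = []
    for resource in resources:
        for check in CHECKS:
            findings.extend(check(resource))
    return findings
```

A pipeline would fail the build when `run_platform_checks` returns a non-empty list, in the same way an application pipeline fails on a red unit test.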
In this latest picture, just as the AWS RDS teams provide the necessary interfaces (Dashboard, API …) to AWS customers (delivery teams), we as a platform team are expected to provide similar deliverables to our internal delivery teams: all the dashboards and interfaces needed to consume, monitor and manage the platform.
The rise of the IDPs (Internal Developer Platforms)
Platform teams serving multiple delivery teams have been developing custom dashboards through which developer teams can interact with the platform, like the Dashboard AWS gives to its customers. Most of these custom dashboards are developed internally, but one company, Spotify, decided to open-source its IDP, which is called Backstage. Although there are other alternatives in the market which I am still experimenting with, in this blog post I will only refer to Backstage as an illustration later on.
First, let’s consider the needs of the developers. When it comes to an organisation where several delivery teams are working on the platform, the following fundamental capabilities are usually required:
- Service catalog
- Environment provisioning: This can be either creating a new environment (test/staging, or a short-lived temporary environment created from a feature branch) or handling change requests on existing environments, using standardized templates across the organisation.
- CI (Continuous Integration) tools and pipeline: Streamline code compilation, testing, and deployment processes
- Deploy and release orchestration: Enable one-click or automated deployments with rollback capabilities, considering the dependency tree between the deployments
- Monitoring and observability
- Managing secrets: Another back-and-forth between developers and infrastructure is defining/checking deploy-time configurations/environment variables
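To make the "dependency tree between the deployments" point from the list above more concrete, here is a minimal sketch that derives a deployment order with Python's standard `graphlib` module. The service names and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical services: each key depends on the services in its set.
DEPENDENCIES = {
    "web-frontend": {"orders-api"},
    "orders-api": {"postgres", "message-queue"},
    "postgres": set(),
    "message-queue": set(),
}


def deployment_order(dependencies: dict) -> list:
    """Order services so that every dependency is deployed before its consumers."""
    return list(TopologicalSorter(dependencies).static_order())


def rollback_order(dependencies: dict) -> list:
    """Roll back in the opposite direction: consumers first, dependencies last."""
    return list(reversed(deployment_order(dependencies)))
```

`TopologicalSorter` raises `graphlib.CycleError` when services depend on each other in a loop, which is a useful signal for an orchestrator to refuse the release instead of guessing an order.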
Additionally, from the organisation's perspective, the following requirements need to be considered:
- Security and compliance policies and rules of the organization
- Consistency across delivery pipelines and golden paths
- Cost optimization practices across the organization
- Compliance on the technology stack: tools and technologies to be used
So how can platform engineers address those? The platform team is developing a product for delivery teams, and it can address some of those operations by offering Golden Paths. To address the friction between operations and delivery teams (developers), an Internal Developer Platform helps delivery teams become more confident in the platform and benefit from its self-service functions, by creating the following value across teams.
How does an IDP affect your efficiency?
Developer Productivity
- Service catalog: Discover the service and technology stack in a particular project. Centralized registry of available services and APIs together with dependencies and interconnections
- Documentation: Documentation of the related project or a platform capability
- Resources: All resources combined in a central location
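The service catalog idea, a registry of services together with their dependencies and interconnections, can be sketched as a small data structure plus one lookup. The entries, owners and field names below are invented for illustration and do not reflect the schema of any real catalog.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    owner: str
    depends_on: frozenset = field(default_factory=frozenset)


# Hypothetical catalog content.
CATALOG = [
    CatalogEntry("orders-api", "team-orders", frozenset({"postgres"})),
    CatalogEntry("billing-api", "team-billing", frozenset({"postgres", "orders-api"})),
    CatalogEntry("postgres", "platform-team"),
]


def dependents_of(service: str, catalog: list) -> list:
    """List the services that would be affected if `service` changed or failed."""
    return sorted(entry.name for entry in catalog if service in entry.depends_on)
```

With a registry like this, a developer can answer "who consumes my service?" from the portal instead of asking around.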
Observability and Monitoring
- Integrations with APMs: All metrics and traces are combined in the standard APM that the platform provides. IDPs may provide the ability to access the APM and basic metrics.
- Incident management: Integration with the alert management system that the platform supports
Security and Compliance
- Automated security scanning: Integrate with the SCA/SAST tools being used in development pipelines
- Compliance: Ensure adherence to organizational and industry security standards with golden paths
- Secret management: Give developers only required access to check which secrets are being used and links to secret management system
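The secret management point above can be illustrated with a portal view that only ever holds references: the name of each secret and a link into the secret management system, never the value. The base URL and the path layout are assumptions for this sketch.

```python
from dataclasses import dataclass

# Hypothetical address of the secret management system's UI.
SECRET_UI_BASE = "https://secrets.example.internal/ui"


@dataclass(frozen=True)
class SecretReference:
    name: str  # e.g. the environment variable the application reads
    path: str  # location inside the secret management system


def developer_view(references: list) -> list:
    """Build what the portal shows: secret names and links, no secret values."""
    return [
        {"name": ref.name, "link": f"{SECRET_UI_BASE}/{ref.path}"}
        for ref in references
    ]
```

Access to the value itself stays governed by the secret management system's own policies, so the portal does not become a second place where secrets can leak.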
Developer Experience (DevEx) Capabilities
- Project templates: Creating new projects using the project templates (GitHub repo, application code templates, CI/CD templates, etc.) automatically to increase developer productivity
- AI conversational agent: Providing developers with a GenAI conversational chatbot, enhanced with RAG, to help them onboard to the platform and to give ongoing support
- Prebuilt developer workstations: Links to provisioned remote development environments with standardised IDEs and plugins, together with configured pre-commit hooks and local security and compliance tools/configurations. (This may not be relevant for all organizations, but we need it for our Enterprise-level projects.)
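The project templates capability in the list above boils down to rendering a fixed set of files from a few parameters. The sketch below uses Python's standard `string.Template`; the file names, file contents and parameter names are hypothetical and far simpler than a real scaffolder.

```python
from string import Template

# Hypothetical golden-path template: file path -> file body with placeholders.
TEMPLATE_FILES = {
    "README.md": "# $service_name\n\nOwned by $team.\n",
    "ci/pipeline.yaml": "service: $service_name\nstages: [build, test, deploy]\n",
}


def render_project(params: dict) -> dict:
    """Render every template file; a missing parameter raises KeyError."""
    return {
        path: Template(body).substitute(params)
        for path, body in TEMPLATE_FILES.items()
    }
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a new project cannot be created with a half-filled template, which keeps every repository on the same golden path.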
This list is not exhaustive and can be extended with new capabilities that support the autonomy of delivery teams. The platform is a product, a product needs a customer-driven approach, and the developers are the customers: platform teams need to prioritise developers' requests and add new capabilities over time.
Platform-as-a-product mindset
Every product has a functional and a nonfunctional backlog. Functional items serve developers directly, while nonfunctional items (stability, availability, security, scaling, and patching) also need to be executed, even without a developer’s demand.
Real-World IDP: Backstage
To make things more tangible, I am switching to the real-world capabilities of one of the most popular IDPs today: Backstage. Here are some of the most interesting Backstage plugins, tried and tested:
AI Assistant – RAG AI
Using pgvector/AWS/Anthropic in the background, you provide your developers with an AI assistant for project-related queries. This is open to extension with more data sources to enhance the AI assistant capabilities:
Reference Link: https://roadie.io/backstage/plugins/ai-assistant-rag-ai/
ArgoCD Plugin
Argo is widely used in the industry, especially with AWS/EKS. This plugin lets developers see the status of the projects, whether they are healthy/synced or not:
Terraform Plugin
Infrastructure-as-code (IaC) is a key platform approach and widely used for projects with AWS because of its close integration. This plugin allows developers to see the status of the changes triggered by Terraform:
AWS CodeBuild & CodePipeline
The value of having the AWS Code family (CodeBuild and CodePipeline) on Backstage is to provide developers with a central dashboard without requiring a login to the AWS Dashboard. By adding a CodeBuild annotation, all CodeBuild projects can be made visible on Backstage:
AWS Lambda Plugin
People sometimes refer to Serverless Lambda as lacking observability, but there are now several solutions available. This plugin does not solve end-to-end observability but gives the status of AWS Lambdas and last modification times:
AWS ECS Plugin
Teams that did not opt for Kubernetes usually prefer using ECS. This plugin provides version, status, and last updated fields to the developers:
Vault Plugin
Vault is a preferred tool for managing secrets, not only because of its secret management features but also because it integrates easily with several programming languages. By referring to its “trust triangle”, you can integrate it with “AWS Secrets Manager”. The plugin cleverly steers clear of direct editing, in line with compliance rules, and instead offers links to the Vault:
Final Thoughts
In conclusion, developers are now able to see and manage several functions themselves, which will:
- Reduce or eliminate unnecessary Q&A and tickets between delivery teams and the platform team
- Help developers become more confident in the platform
- Onboard developers to new projects more smoothly
The Backstage plugin landscape is growing, and it is exciting to see new plugins being published frequently.
Please let me know which plugin you would like to see on Backstage. You can find me on Platform Engineering Community Slack as Dorian.
Derya (Dorian) Sezen
Derya, a.k.a. Dorian, ex-CTO of an amazon.com subsidiary, is currently working as Cloud and DevOps Consultant at kloia.