Introduction

Miles_Kniep · ‎2023-11-28

Introduction
The Landscape: Why Cloud Native ‘transformation’ is happening, and its many promised benefits.
What is Cloud Native?
Impact on Time-to-Market & Competitive Edge
Impacts to Agility & Efficiency
Benefits to Resiliency by Disaggregation
The New Platform
Time for a Reality Check
Metric #1: Deploy Frequency
Metric #2: Lead Time for Changes
Metric #3: Time to Restore Services
Metric #4: Change Failure Rate
Conclusion

Introduction

“Everyone is going to the public cloud… right?”

“Shouldn’t we be building Cloud Native, instead of doing Lift & Shift? They say it practically guarantees success.”

“I’ve heard containers are the future, but I don’t know why we’d fuss with that complexity. Our needs are simpler than that.”

“Developers don’t want to do Operations anymore, DevOps is dead!”

If any of this sounds even remotely familiar to you, your colleagues, or your leadership team, you’re not alone. I spend my days talking with organizations around the globe that are asking themselves these same things. Some adopt Cloud Native ways of working with tremendous success, while others fail to transform sufficiently. Some of those go so far as to discredit DevOps and Agile practices right alongside Cloud Native technology, retreating to the perceived safety of better-known patterns with a slower release cycle to box away the trauma of a project gone sideways. To judge whether it’s worth the trouble, we need to deconstruct what ‘Cloud Native’ really is, understand why it’s important to so many organizations, and examine the inherent risks involved with choosing this path. Knowing this, we can then look at what you can do to start ensuring success in your organization.

The Landscape: Why Cloud Native ‘transformation’ is happening, and its many promised benefits.

It's a known fact that organizations are in perpetual states of flux, whether due to market competition, changes in staff and technology, or external factors. This is particularly true in the business world where the need to adapt to change is paramount for both the survival of a company and to protect its potential to thrive and outcompete its peers.

The most important type of flux is what many would label as 'transformation'. Although the word is too commonly used as a fluffy buzzword to inspire outside audiences, transformation is the process of investing resources into fundamentally changing how your business operates to ensure future success. It comes at a cost to maintaining the status quo, but it's a vital activity to maintain competitiveness in the market and to search out new opportunities.

Organizational transformation often necessitates adopting Cloud Native technology because of the benefits it offers. A Cloud Native architecture enables the organization to leverage scalability, flexibility, and cost-efficiency that would be difficult to achieve otherwise. This applies both on-premises and in the public clouds, and gives you the potential for faster innovation, improved agility, and an overall better experience for your end customer, whoever they might be.

What is Cloud Native?

Contrary to what you might initially think, Cloud Native does not mean using the public cloud providers. Cloud Native is the practice of building applications and services in a cloud-like way so that they are scalable and resilient, as if they were services you’d consume from a cloud provider. To quote the custodians of this landscape, the Cloud Native Computing Foundation (CNCF):

"Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach." - The CNCF Charter

So, can I build a Cloud Native system in a public cloud provider? Yes.

Can I also do the same exact thing in my private data center? Also yes.

The key is to ensure the systems you operate are adaptable, scalable, and resilient. It's equally important to ensure your teams can operate this landscape effectively.

Impact on Time-to-Market & Competitive Edge

Taking a moment to focus on velocity, going Cloud Native has the potential to accelerate time-to-market for just about anything that is built on it, providing a competitive edge. How, you might ask? By enabling rapid development and deployment: where teams might have been constrained to a few major updates per year, they have the potential to push updates every few weeks, if not every day.

This benefit comes from the change in architecture. Where previously there were large monolithic applications providing core functions, these get broken down into smaller components commonly called microservices. Microservices can be deployed and updated more independently, allowing for a vastly accelerated development cycle and enabling the concept of Continuous Delivery. Imagine a world where your most critical business application didn't require a multi-day upgrade plan, but instead could be upgraded piecemeal, sub-service by sub-service, as and when each team is ready, all while the lights stay on and the risk to the business stays low. This is what Continuous Delivery aims to provide.

There is of course risk that comes with taking this approach. Developing in a Cloud Native way isn’t just a matter of using newer technologies that automatically give you an edge. It requires reorganizing how your teams work together and deliver both their software and the IT services on which the ‘new stuff’ resides. This risk plays out with outcomes like losing competitiveness due to degraded adaptability, inefficient resource utilization driving up operational costs and impacting vital projects, as well as poor architectural design introducing bigger scalability challenges than the original monolithic approach might have suffered from.

In many cases I’ve encountered teams which are struggling with the organizational delivery more than the technology, mired in rigid release cycles, overly complicated end-user services, and build-and-maintain-your-own solutions that should have been consumed in an as-a-service model to save time and effort or instead provided by a platform team so the development teams can deliver faster.

This failure to transform has led some organizations to go so far as to discredit DevOps and Agile practices entirely and revert to the perceived safety of older patterns designed for a slower release cycle. In still other cases it’s the technology stack that is abandoned in favor of monoliths that are more easily managed on the infrastructure we know.

Impacts to Agility & Efficiency

Agility and operational efficiency in IT and software development are critical for maintaining competitiveness in today's business landscape. The ability to quickly adapt to market changes, deliver value to customers rapidly, and optimize resource utilization is essential. By fostering a culture of innovation, streamlining workflows, and providing superior customer experiences, organizations can stay ahead of the competition, drive growth, and achieve long-term success.

As mentioned above, while the biggest gain to be had is arguably the speed increase provided to your stack and your teams, it can also be one of the things that mires your transformation. I’ve seen plenty of teams tasked with building new, far more complex systems while being bound to the same ‘proven’ ways of working. The two worlds collide, and the outcome can very much turn into a death spiral towards implosion of critical services. No amount of eleventh-hour heroics by your best people can truly course correct the pattern. It is vital to change both the technology and the organization.

Benefits to Resiliency by Disaggregation

Cloud-native architectures provide better resiliency by disaggregating components, which means breaking down monolithic applications into smaller, independently deployable services. This disaggregation allows for improved fault isolation, as failures in one component do not necessarily impact the entire system. By decoupling services and employing containerization, cloud-native architectures enable fault tolerance and graceful degradation. When one service fails, other services can continue to function, ensuring the overall system remains operational. Additionally, cloud-native architectures support automatic scaling and load balancing, enabling applications to handle varying workloads and traffic spikes efficiently.
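To make graceful degradation a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `with_fallback` helper, the `recommendations_service` function, and the item names are invented for illustration and do not come from any particular framework. It only shows the idea that a failure in one service can be contained so the rest of the page still renders.

```python
from typing import Callable


def with_fallback(primary: Callable[[], list[str]], fallback: list[str]) -> list[str]:
    """Call a dependent service; if it fails, degrade to a static default.

    Hypothetical helper for illustration only. Real systems usually add
    timeouts, retries, and circuit breaking on top of this idea.
    """
    try:
        return primary()
    except Exception:
        # The failure stays isolated to this one feature.
        return fallback


def recommendations_service() -> list[str]:
    # Simulates an outage of one (hypothetical) microservice.
    raise ConnectionError("recommendations service unavailable")


# The caller still gets usable content even though the service is down.
page_items = with_fallback(recommendations_service, ["bestseller-1", "bestseller-2"])
print(page_items)
```

In a monolith the same failure would more likely take the whole request down with it; here the caller simply receives a less personalized result.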

The New Platform

Kubernetes has gained popularity as a way of managing the cloud-native application landscape for several key reasons. Firstly, Kubernetes provides a robust and scalable orchestration platform for containerized applications. It automates deployment, scaling, and management of application containers, simplifying the operational complexities associated with managing distributed systems. Kubernetes offers features like self-healing, automated rollouts, and service discovery, which can substantially enhance resiliency and minimize downtime.

Secondly, Kubernetes promotes portability and interoperability. It abstracts away the underlying infrastructure, allowing applications to run consistently across different environments, such as on-premises data centers or multiple cloud providers. This portability enables organizations to avoid vendor lock-in and choose the most suitable infrastructure for their needs. Kubernetes also has a vast and multi-faceted ecosystem that offers a wide range of tools, extensions, and community support, making it an attractive choice for managing cloud-native applications.

In summary, cloud-native architectures should provide better resiliency by disaggregating components and leveraging containerization, even though organizations may still be working through how to make it happen. From the technical end, however, Kubernetes has matured into an effective platform for managing the cloud-native application landscape thanks to its robust orchestration capabilities, portability, and thriving ecosystem. One more challenge removed from the equation.

Time for a Reality Check

Considering all this, is success truly measurable?

It’s the big question I hear everywhere I go – how can I measure success when adopting a Cloud Native-aligned way of working, considering all the technical benefits and operational risks that come with it? Fear not – all is not lost – you need not cast yourself adrift with no way to orient yourself towards your goal!

Thankfully, there is a long-running research program looking at this exact question whose aim is to distill their findings into a framework leaders and practitioners can leverage. This organization is known as DevOps Research and Assessment (DORA), and the project's fundamental goal is to map out what capabilities are required to drive consistent software delivery and smooth operations. DORA provides two aspects to distilling this down. The first is a set of metrics on which you can observe your organizational progress – we’ll cover that here. The second is a set of capabilities your organization and technology stack can adopt to directly influence those metrics. For now, we’ll focus on the metrics – and in the future we can explore some of the many capabilities NetApp can help with in detail.

Metric #1: Deploy Frequency

Simplistically, Deploy Frequency measures your organization’s number of deployments over time. While you may not think this number reveals much, it’s important because it reflects the overall ability to deliver changes. A higher deploy frequency is a good indicator of an elevated level of agility and responsiveness. By extension, it means you can gather feedback and iterate rapidly, allowing delivery of real value to customers more quickly. Follow that path for a second, and you’ll see that faster delivery overall means an ability to keep pace with market demand, maintain your competitive edge, or at the very least drive a cycle of continuous improvement in your own software delivery. More than likely, it’s going to impact all of these.
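As a rough illustration of the arithmetic, the following Python sketch counts deployments inside a reporting window and expresses them as an average per week. The function name `deploys_per_week` and the sample dates are assumptions made up for this example; in practice the dates would come from your own CI/CD or release tooling.

```python
from datetime import date


def deploys_per_week(deploy_dates: list[date], start: date, end: date) -> float:
    """Average production deployments per week within [start, end], inclusive.

    Hypothetical helper: deploy_dates is assumed to hold one entry per
    production deployment.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    days_in_window = (end - start).days + 1
    in_window = [d for d in deploy_dates if start <= d <= end]
    return len(in_window) / (days_in_window / 7)


# Illustrative data only: four deployments in a 28-day window.
deploys = [date(2023, 11, 1), date(2023, 11, 3), date(2023, 11, 10), date(2023, 11, 24)]
rate = deploys_per_week(deploys, date(2023, 11, 1), date(2023, 11, 28))
print(f"{rate:.1f} deployments per week")
```

Watching how this number trends over several windows is usually more informative than any single value.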

Metric #2: Lead Time for Changes

Lead Time for Changes is the measurement of how much time it takes from the point some code is committed to the point it’s deployed and running in production. This means you’re looking at the whole delivery process, from the change being written to the tangible outcome being live. Measuring this is crucial because it helps identify inefficiencies in the delivery cycle. Maintaining a short lead time ensures adaptability and works synergistically with Deploy Frequency. For example, you might do 30 deploys per month, but were those from code that was committed in the same month, the month prior, or six months ago? Shortening this window helps your organization keep the initiative on its delivery plans.
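One way to picture the calculation is the short Python sketch below, which takes pairs of commit and deployment timestamps and reports the median gap between them. The `median_lead_time` name and the sample timestamps are hypothetical, and using the median rather than the mean is a choice made here so that a single slow change does not dominate the figure.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median time from commit to production deployment.

    Hypothetical helper: each tuple is (committed_at, deployed_at).
    Raises statistics.StatisticsError if the list is empty.
    """
    durations = [(deployed - committed).total_seconds() for committed, deployed in changes]
    return timedelta(seconds=median(durations))


# Illustrative data only: lead times of 6, 48, and 24 hours.
changes = [
    (datetime(2023, 11, 1, 9, 0), datetime(2023, 11, 1, 15, 0)),
    (datetime(2023, 11, 2, 9, 0), datetime(2023, 11, 4, 9, 0)),
    (datetime(2023, 11, 6, 9, 0), datetime(2023, 11, 7, 9, 0)),
]
lead_time = median_lead_time(changes)
print(lead_time)
```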

Metric #3: Time to Restore Services

Time to Restore Services is perhaps one of the most familiar metrics to the IT operations organization, being a measure of the time it takes to recover from and fully rectify a service disruption. In many ways this is the most important metric to manage because its impacts cascade to your teams’ ability to deploy more changes and to the overall lead time for those changes to go live. This is partly due to the loss in velocity for teams dealing with an outage, but it also has a psychological effect: teams deploy less frequently out of fear of critical issues, which tends to cause more issues because more changes are then introduced to production at once. Ensuring your teams can effectively minimize the Time to Restore Services is a Day 0 architectural and organizational priority.
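A simple way to track this, sketched in Python under stated assumptions, is to record when each disruption started and when service was restored, then take the median duration. The `median_time_to_restore` function and the incident data are invented for illustration, and skipping incidents that are still open is a design choice of this sketch rather than a rule.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

# (started_at, restored_at); restored_at is None while an incident is ongoing.
Incident = tuple[datetime, Optional[datetime]]


def median_time_to_restore(incidents: list[Incident]) -> Optional[timedelta]:
    """Median duration of resolved incidents, or None if none are resolved."""
    resolved = [
        (restored - started).total_seconds()
        for started, restored in incidents
        if restored is not None
    ]
    if not resolved:
        return None
    return timedelta(seconds=median(resolved))


# Illustrative data only: outages of 30 minutes, 4 hours, 1 hour, and one still open.
incidents = [
    (datetime(2023, 11, 3, 10, 0), datetime(2023, 11, 3, 10, 30)),
    (datetime(2023, 11, 12, 22, 0), datetime(2023, 11, 13, 2, 0)),
    (datetime(2023, 11, 20, 8, 0), datetime(2023, 11, 20, 9, 0)),
    (datetime(2023, 11, 27, 14, 0), None),
]
time_to_restore = median_time_to_restore(incidents)
print(time_to_restore)
```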

Metric #4: Change Failure Rate

Lastly, measuring the Change Failure Rate helps you understand what percentage of changes result in a failure or require effort to remediate. Measuring this is important because it gives you visibility into the effectiveness of your organization’s overall release process and change management approach. Being able to lower the change failure rate ensures your delivery pipeline keeps a healthy velocity, both by mitigating disruptions and potential outages and by reducing the toil developer and platform teams spend on managing and remediating failures.
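The arithmetic behind this metric is a simple ratio, shown in the Python sketch below. The function name `change_failure_rate` and the counts are hypothetical; what counts as a "failed" change (a rollback, a hotfix, an incident) is something each organization has to define for itself.

```python
def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Fraction of deployments that caused a failure or needed remediation.

    Hypothetical helper for illustration; both counts are assumed to cover
    the same reporting window.
    """
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return failed_deployments / total_deployments


# Illustrative data only: 3 of 30 deployments needed remediation.
failure_rate = change_failure_rate(total_deployments=30, failed_deployments=3)
print(f"{failure_rate:.0%}")
```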

Interested to learn more about measuring these important metrics? Check out the DORA team's Four Keys project.

Conclusion

We’ve covered how to measure success and why these indicators are so important, and measuring is indeed the clichéd first step to identifying if you have a problem. As such, it’s critical no matter where you are in your Cloud Native transformation to prioritize measuring these key indicators to understand your organization’s health. Alongside this comes the critical work of building a set of competencies you can focus on maturing to ensure success, or to "Get Better at Getting Better" as the DORA team likes to say.

In future articles we can explore these metrics in more detail and how to influence them, but before we do so it helps to look at what can broadly impact the whole range of measures. One of the most broadly impacting competencies you can build is a robust and accessible Observability practice that supports all your teams.

When Observability is done right, democratized access to health and performance data helps every team operate smoothly and resolve issues before there is an impact. In the next blog, we’ll explore the different ways organizations approach Observability, some of the pitfalls to avoid, and how NetApp can help you deliver on a strategy worthy of your organization.

Cloud Native: Is it worth the hype?
