Mitigating Software Outages: Shifting Left Observability

It’s vital to empower developers to take control over their applications across environments. To accomplish this, adopt live debugging observability practices.
May 14th, 2024 11:53am

The growing adoption of containerized and cloud native application development has significantly improved agility and DevOps practices, and has enabled innovative, scalable offerings. Organizations can deploy their software to production quickly and reduce internal software dependencies across services.

This is the positive outcome, and organizations are not slowing down: they are looking to extend these practices by adopting GenAI and platform engineering.

The less positive outcome of such complex technologies is often a lack of control over core, business-centric services from the moment they are deployed to production and used by customers.

There is an evident disconnect between developers and their live applications in production, caused by current processes and tooling such as application performance monitoring (APM) and other general observability solutions. These tools mostly serve Ops personas and are by nature slow to put to use once an outage occurs. Developers need to be empowered to take control of their applications across environments, and this can only be achieved by shifting observability left toward developers.

Most Common Root Causes for Software Outages

Lightrun data, along with Microsoft’s recent research, highlights the primary causes of software outages and critical P1-P2 issues encountered by enterprises and modern application development groups. The list below is ordered from the most common root cause to the least common:

  • Code bugs
  • Dependency failure
  • Infrastructure issues
  • Deployment error
  • Configuration bug
  • Database/network failure
  • Authorization failure
  • Other

Each of these root causes demands specific mitigation measures, ranging from software rollback to infrastructure adjustments and configuration fixes. It’s crucial to understand the significant impact these critical incidents have on business operations and brand reputation.

For instance, consider the recent outage experienced by Sainsbury’s, the second-largest supermarket chain in the UK. This reported incident not only disrupted customer experiences but also dealt a severe blow to business revenue. Similarly, in a different market segment, the State Farm insurance website and application were crippled for several days, severely hampering the company’s ability to serve customers.

It’s essential to recognize that resolving each of these outages is a time-consuming and costly process, exacerbated by the limitations of current APM tools and the disconnected approach between development and operations teams. According to Microsoft’s research, outages resulting from code bugs incur the longest mean time to resolve (MTTR), time to mitigate (TTM) and time to detect (TTD).
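
To make the three metrics concrete, here is a minimal sketch of how they might be computed per root cause from incident timestamps. The record fields (started, detected, mitigated, resolved), the sample values, and the convention of measuring every interval from the incident start are assumptions for illustration only; none of them come from Microsoft’s research or Lightrun’s data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; real incident trackers will differ.
    root_cause: str
    started: datetime
    detected: datetime
    mitigated: datetime
    resolved: datetime


def mean_minutes(incidents, end_field):
    """Mean minutes from incident start to the given milestone."""
    return mean(
        (getattr(inc, end_field) - inc.started).total_seconds() / 60
        for inc in incidents
    )


def summarize(incidents):
    """Group incidents by root cause and compute TTD, TTM and MTTR."""
    by_cause = {}
    for inc in incidents:
        by_cause.setdefault(inc.root_cause, []).append(inc)
    return {
        cause: {
            "ttd": mean_minutes(group, "detected"),
            "ttm": mean_minutes(group, "mitigated"),
            "mttr": mean_minutes(group, "resolved"),
        }
        for cause, group in by_cause.items()
    }


def minutes(n):
    return timedelta(minutes=n)


# Made-up sample values, only to exercise the functions above.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Incident("code bug", t0, t0 + minutes(30), t0 + minutes(90), t0 + minutes(240)),
    Incident("code bug", t0, t0 + minutes(10), t0 + minutes(30), t0 + minutes(120)),
    Incident("deployment error", t0, t0 + minutes(5), t0 + minutes(15), t0 + minutes(20)),
]
report = summarize(sample)
```

Teams define these intervals differently (for example, measuring time to mitigate from detection rather than from the start of the incident), so the convention should be agreed on before comparing numbers across root causes.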

Shifting Left Developer Observability

The fundamental concept involves integrating observability early into the software development life cycle (SDLC), but what does this mean in practical terms?

  • Accessing real-time live data: Existing development methodologies often treat observability as a static and reactive procedure. Ideally, developers should have real-time access to observability data as part of a modern development process. In modern platform engineering implementations that set up an internal developer platform (IDP), an observability platform is an integral part of the developers’ tool stack.
  • Contextualized data: While traditional observability tools excel at aggregating and indexing data, they are primarily tailored to the requirements of operations teams. Most observability solutions feature complex dashboards inundated with information spanning the entire system. However, developers typically require data specific to the code they are currently working on. By correlating and narrowing down logs, metrics and traces within the context of developers’ code, essential information can be efficiently surfaced.
  • Developer-centric approach: Similarly, observability tools should align with the natural workflow of developers. Rather than requiring developers to switch to another dashboard, data should be presented in a tab integrated directly into their IDE. While seemingly minor, minimizing context switching is crucial for maintaining productivity and streamlining operations.
  • Dynamic instrumentation: Lastly, developers should be able to add new observability instrumentation dynamically, in real time, without modifying code or going through the typical build and deployment cycle. This saves deployment time and also reduces costs, because developers no longer need to add static instrumentation preemptively.
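
The idea behind dynamic instrumentation can be sketched in a few lines of Python. This is a simplified, in-process illustration that wraps a function at runtime so each call is logged, then restores the original on request. It is not how Lightrun or any other product implements the feature, and the names add_dynamic_log and PricingService are hypothetical.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dynamic")


def add_dynamic_log(owner, func_name, message):
    """Wrap owner.func_name at runtime so every call logs its arguments.

    Returns a callable that removes the instrumentation again.
    """
    original = getattr(owner, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        logger.info("%s args=%r kwargs=%r", message, args, kwargs)
        return original(*args, **kwargs)

    setattr(owner, func_name, wrapper)

    def remove():
        # Restore the untouched original function.
        setattr(owner, func_name, original)

    return remove


class PricingService:
    """Hypothetical service standing in for already-deployed code."""

    def quote(self, sku, quantity):
        return quantity * 10


service = PricingService()
remove = add_dynamic_log(PricingService, "quote", "quote called")
service.quote("A1", 3)   # this call is logged
remove()
service.quote("A1", 2)   # this call is not logged
```

The source of PricingService is never edited and nothing is rebuilt or redeployed: the log is attached while the process runs and removed once the question is answered, which is the property the bullet above describes.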

The advantages stemming from shifting developer observability left can be classified into three main pillars:

1. Reduced MTTR (mean time to resolve):

  • Strengthened revenue protection and risk mitigation
  • Enhanced efficiency in incident management
  • Decreased occurrence of critical P1 incidents

2. Enhanced developer productivity:

  • Substantial decrease in developers’ debugging time

3. Reduction in overall observability costs:

  • Reduced spending on static logs
  • Lower total cost of ownership (TCO) for APM tools

Closing Remarks

Shifting observability left is an emerging trend among high-performing software engineering teams. Recognizing the limitations of static and reactive observability, these teams have come to understand that simply logging more data and deferring resolution is unsustainable.

By integrating observability earlier into the SDLC, these high-performing teams are not only producing cleaner and more efficient code but also alleviating the operational challenges associated with debugging, incident response and managing an overwhelming volume of observability metrics. Resolving outages like those mentioned above in minutes or hours rather than days, or even preventing such issues in the first place, is a clear benefit of this practice.

TNS owner Insight Partners is an investor in: Lightrun.