No one enjoys being woken up in the middle of the night or having a weekend interrupted because a major incident is disrupting application reliability or performance. When an application is truly down and business operations are affected, few desire the pressure of the war room. Agile developers should focus on their sprint commitments and spend as little time as possible investigating the root causes of major incidents. Yet responding to major incidents, providing support to resolve issues, and participating in root-cause analysis are everyone’s responsibility.

In the best of circumstances, operations teams have monitoring systems that detect, alert on, and resolve issues. The reality is that operating environments can have problems outside of everyone’s control, such as security breaches, major cloud outages, third-party service trouble, or major infrastructure failures that disrupt operations. Even the most robust agile processes, software development lifecycles, or devops best practices can’t assure that applications are risk-free and 100 percent reliable.

Operations and site reliability engineers can often fix common issues without impacting the development team. Common problems can be cleared up with automation or by maintaining runbooks that prescribe how to address them. But developers are likely needed to help unravel more complex or less frequent mishaps, and there are many ways they can help prevent operational problems from occurring in the first place.
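As a rough illustration of that split, the sketch below maps known alerts to automated runbook steps and escalates anything unrecognized to developers. The alert names, the `RUNBOOK` table, and the remediation functions are all hypothetical placeholders, not part of any particular monitoring or incident tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    name: str     # hypothetical alert identifier, e.g. "disk_full"
    service: str  # name of the affected application or service


def restart_service(alert: Alert) -> str:
    # Placeholder: a real step would call the team's own orchestration tooling.
    return f"restarted {alert.service}"


def clear_temp_files(alert: Alert) -> str:
    # Placeholder for another common, well-understood remediation.
    return f"cleared temp files on {alert.service}"


# Assumed runbook: common problems paired with their prescribed fix.
RUNBOOK: Dict[str, Callable[[Alert], str]] = {
    "service_unresponsive": restart_service,
    "disk_full": clear_temp_files,
}


def handle_alert(alert: Alert) -> str:
    """Apply the runbook step if one exists; otherwise escalate."""
    step = RUNBOOK.get(alert.name)
    if step is None:
        # Complex or infrequent problems are where developers get pulled in.
        return f"escalate to developers: {alert.name} on {alert.service}"
    return step(alert)
```

Keeping the runbook as plain data is one possible design choice: operations engineers can add an entry for a newly understood problem without touching the escalation logic.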

Many organizations today develop software applications as part of customer-facing products, customer experiences to support business services, or workflows to enable employees to fulfill their jobs. When these applications fail or underperform, it can have significant business implications, such as revenue loss, unbudgeted costs, brand reputation impacts, project delays, and poor employee morale.

When applications experience frequent or lengthy outages, poor performance, or unexpected errors, it also reflects poorly on the agile software development teams. IT departments that survey employees and measure customer satisfaction are unlikely to receive high scores if unreliable applications impact people’s work. It’s also harder for IT management to get budget increases, training, added compensation, or other benefits if the organization feels that the software development teams can’t release new capabilities reliably.

Development teams must take proactive steps to prevent problems, provide support during incidents, participate in the analysis of root causes, and prioritize work to address critical defects.

Let’s look at these responsibilities in more detail.

Agile development teams often focus their efforts on developing and releasing new features, enhancing user experiences, and addressing technical debt. Teams instituting devops practices such as CI/CD (continuous integration/continuous delivery) pipelines must also shift their testing practices left and automate most testing to ensure that new code doesn’t break software builds and that automated tests all pass.

Developers and quality assurance testers should also shift security left and institute coding practices that ensure the reliability of applications. Development teams should also partner with operations teams on infrastructure configuration, automation, and monitoring. Best practices include:

Lastly, it’s critically important to document the application’s architecture and code because it’s highly likely that people who weren’t involved in the application’s development will have responsibilities to support it. Even when code is modular or uses microservices, it’s vital to leave documentation for developers and site reliability engineers to resolve issues and improve applications.

Before incidents happen, software development teams should establish protocols and processes to better support incident response teams and site reliability engineers:

During an incident, software developers should aid in fixing the issue and restoring service in minimal time. Once the developers are called in, the assumption must be that operational engineers have already reviewed and possibly ruled out infrastructure-related concerns, and that site reliability engineers have already explored a list of common problems with the application.

When there is a major incident, incident managers will often set up bridge calls, chat sessions, and physical war rooms to assemble a multidisciplinary team to work through the problem collaboratively. Developers who are called in should know and follow the incident response and communications protocols established for these war rooms.

In the war room, developers should be the application experts. After reviewing monitors, log files, and other alerts, they should recommend courses of action. It’s essential to use specific language and to separate fact from speculation. Try to avoid the wrong turns and added delays that occur when response teams chase symptoms that turn out to be dead ends.

Developers should participate in this collaboration until the incident manager closes the issue or rules out the need for their participation and excuses them.

Major incidents are closed once the application or service is back to normal operating conditions. At this point, in ITIL (Information Technology Infrastructure Library) terms, each is assigned a problem record so that teams can identify root causes. The goal is to perform a full diagnostic of all the underlying issues and circumstances. What caused the incident? What factors defined the severity and magnitude of the business impact? What conditions, factoring in the duration and the expense, were required to resolve the issue?

Once the root cause is determined, agile development teams should assign one or more defects that address root causes, lower risks, or lessen business impacts. Development teams may have different definitions and processes around defects in their agile process and software development lifecycle. What’s most critical is that when known issues repeatedly create problems or cause major business interruptions, agile development teams and their product owners receive this feedback and prioritize making improvements.
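One way to make that feedback concrete is to tally past incidents by root cause and flag the causes that recur or that produced a major interruption. The sketch below is only an illustration: the `Incident` fields and both thresholds (`min_repeats`, `major_outage_minutes`) are assumptions that a team would replace with its own definitions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Incident:
    root_cause: str      # hypothetical label assigned during root-cause analysis
    outage_minutes: int  # length of the business interruption


def causes_to_prioritize(
    incidents: List[Incident],
    min_repeats: int = 2,            # assumed threshold for "repeatedly"
    major_outage_minutes: int = 60,  # assumed threshold for "major"
) -> List[str]:
    """Return root causes that recur or caused a major outage, worst first."""
    outages_by_cause: Dict[str, List[int]] = defaultdict(list)
    for incident in incidents:
        outages_by_cause[incident.root_cause].append(incident.outage_minutes)

    flagged = [
        cause
        for cause, outages in outages_by_cause.items()
        if len(outages) >= min_repeats or max(outages) >= major_outage_minutes
    ]
    # Order by total outage time so the costliest causes surface first.
    return sorted(flagged, key=lambda c: sum(outages_by_cause[c]), reverse=True)
```

A product owner could review the resulting list during backlog refinement; the ordering by total outage time is just one plausible proxy for business impact.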

After all, delivering new capabilities through software is only part of a developer’s responsibilities. Ensuring that applications are reliable, secure, perform well, and have positive user experiences is where teams truly deliver on business needs.

This story, "How agile teams can support incident management," was originally published by InfoWorld.
