Saturday, December 29, 2012

Resilient Systems - Survivability of Software Systems

Resilience, as we all know, is the ability to withstand tough times. Another term, Reliability, is often used interchangeably with it, but the two are different. Reliability describes a system or process that has zero tolerance for failure, one that should not fail. In other words, when we talk about reliable systems, the context is that failure is not expected, or rather not acceptable. Resilience, by contrast, is about the ability to recover from failures. What is important to understand about Resilience is that failure is expected and is inherent in any system or process, and may be triggered by changes to the platform, the environment or the data. While Reliability is a system’s robustness against failing, Resilience is its ability to detect impending failures and avoid the events that lead to them and, when failure cannot be avoided, to let it happen and then recover from it quickly.

A working definition for resilience (of a system) developed by the Resilient Systems Working Group (RSWG) is as follows:

“Resilience is the capability of a system with specific characteristics before, during and after a disruption to absorb the disruption, recover to an acceptable level of performance, and sustain that level for an acceptable period of time.” The following words were clarified:
  • The term capability is preferred over capacity since capacity has a specific meaning in the design principles.
  • The term system is limited to human-made systems containing software, hardware, humans, concepts, and processes. Infrastructures are also systems.
  • The term sustain allows determination of long-term performance to be stated.
  • Characteristics can be static features, such as redundancy, or dynamic features, such as corrective action to be specified.
  • Before, during and after – Allows the three phases of disruption to be considered.
    • Before – Allows anticipation and corrective action to be considered
    • During – How the system survives the impact of the disruption
    • After – How the system recovers from the disruption
  • Disruption is the initiating event of a reduction in performance. A disruption may be either a sudden or a sustained event. A disruption may be either internal (human or software error) or external (earthquake, tsunami, hurricane, or terrorist attack).

Evan Marcus and Hal Stern, in their book Blueprints for High Availability, define a resilient system as one that can take a hit to a critical component and recover and come back for more in a known, bounded, and generally acceptable period of time.
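
The "known, bounded" recovery time in that definition can be illustrated with a small sketch. This is only one possible reading, not something taken from the book: the function name recover_within, its parameters and the RecoveryTimeExceeded exception are all made up for illustration. The idea is to keep retrying a failed operation, but to give up explicitly once the agreed recovery window has passed.

```python
import time


class RecoveryTimeExceeded(Exception):
    """Raised when the operation does not recover within the allowed window."""


def recover_within(operation, max_recovery_seconds, retry_delay=0.1,
                   clock=time.monotonic, sleep=time.sleep):
    """Retry `operation` until it succeeds or the recovery window closes.

    All names here are hypothetical. Returns (result, elapsed_seconds).
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation(), clock() - start
        except Exception as error:
            elapsed = clock() - start
            # Stop retrying if the next attempt would start outside the window.
            if elapsed + retry_delay > max_recovery_seconds:
                raise RecoveryTimeExceeded(
                    f"no recovery after {attempts} attempts in {elapsed:.2f}s"
                ) from error
            sleep(retry_delay)
```

A caller would wrap a call to a critical component, for example recover_within(connect_to_database, max_recovery_seconds=30), where connect_to_database stands for whatever operation needs to come back. The point of the bound is that the rest of the system learns within a known time whether recovery succeeded.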

In general, Resilience is a concern for Information Security professionals, because the final impact of a disruption (from which a system needs to recover) is mostly on Availability, one of the three tenets of Information Security (CIA - Confidentiality, Integrity and Availability). But system designers and developers, especially those tasked with building mission-critical systems, also have much to be concerned about, and should architect their systems to build in the required level of Resilience. For a system to be resilient, it must draw the necessary support from the software and hardware components, systems and platform it depends on. For instance, a disruption to a web application can come from a network outage or from security attacks at the network layer, over which the software has no control. It is therefore important to consider software resiliency in relation to the resiliency of the entire system, including the human and operational components.

The PDR (Protect - Detect - React) strategy is no longer as effective as it used to be, due to various factors. It is time to consider predictive analytics and disruptive technologies such as big data and machine learning as means of enhancing system resiliency. Based on the logs of the various interconnected applications or components and on other network traffic data, intelligence needs to be built into the system so that it can respond with one or a combination of several possible actions. For instance, if there is reason to suspect that an attacker is attempting to gain access to the system, a possible action could be to operate the system in a reduced access mode, i.e. parts of the system may be shut down, or parts of the network to which the system is exposed could be blocked.
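
As a rough sketch of what a reduced access mode could look like in code, consider the following. Everything in it is an assumption made for illustration: the mode names, the feature sets, the thresholds and the very simple suspicion score (the fraction of recent events that are failed logins). A real system would derive the score from far richer log and traffic analysis.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    REDUCED = "reduced"
    LOCKDOWN = "lockdown"


# Hypothetical feature sets: what the system still offers in each mode.
ALLOWED = {
    Mode.NORMAL: {"read", "write", "admin"},
    Mode.REDUCED: {"read", "write"},
    Mode.LOCKDOWN: {"read"},
}


def suspicion_score(events):
    """Fraction of recent events that are failed logins (0.0 if no events)."""
    if not events:
        return 0.0
    failed = sum(1 for event in events if event == "login_failed")
    return failed / len(events)


def choose_mode(score, reduced_at=0.5, lockdown_at=0.8):
    """Map a suspicion score in [0, 1] to an operating mode."""
    if score >= lockdown_at:
        return Mode.LOCKDOWN
    if score >= reduced_at:
        return Mode.REDUCED
    return Mode.NORMAL


def is_allowed(feature, mode):
    """Check whether a feature stays available in the given mode."""
    return feature in ALLOWED[mode]
```

The design choice worth noting is that the system keeps running with fewer features instead of failing outright, so availability degrades in steps while the suspected attack is investigated.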

OWASP’s AppSensor project is worth checking out for architects and developers. The AppSensor project defines a conceptual framework and methodology that offers prescriptive guidance for implementing intrusion detection and automated response in an existing application. AppSensor defines over 50 different detection points which can be used to identify a malicious attacker. AppSensor also provides guidance in the form of possible responses for each such identified malicious attacker.
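
The general idea of detection points paired with automated responses can be sketched as follows. This is not AppSensor's API and does not use its identifiers: the detection point names, thresholds, response names and the Detector class are all hypothetical, chosen only to show how repeated suspicious events from one user might escalate into a response.

```python
from collections import defaultdict

# Hypothetical detection points, each with (threshold, response) escalation steps.
RESPONSES = {
    "invalid_input": [(3, "log_warning"), (10, "lock_account")],
    "forced_browsing": [(1, "log_warning"), (5, "block_ip")],
}


class Detector:
    """Counts suspicious events per user and detection point."""

    def __init__(self, responses=RESPONSES):
        self.responses = responses
        self.counts = defaultdict(int)

    def report(self, user, detection_point):
        """Record one event; return the response it triggers, or None."""
        key = (user, detection_point)
        self.counts[key] += 1
        for threshold, response in self.responses.get(detection_point, []):
            # A response fires exactly when its threshold is reached.
            if self.counts[key] == threshold:
                return response
        return None
```

In this sketch, the third invalid input reported for the same user returns "log_warning", and the tenth returns "lock_account"; events for unknown detection points are counted but trigger nothing.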

The following are some of the factors that need to be appropriately addressed to enhance the resilience of the systems:

Complexity - With systems continuously evolving and many discrete systems and components increasingly integrated into today’s IT ecosystem, complexity is on the rise. As a result, the resiliency routines built into individual systems need constant review and revision.

Cloud - Cloud computing is gaining wider acceptance, and as organizations embrace the cloud for their IT needs, data, systems and components become spread across the globe, which poses a challenge for those involved in building resilience.

Agility - To stay ahead of the competition, organizations need agility in their business processes, which means rapid changes in the underlying systems. This can be a challenge, as it calls for constant checks to ensure that the changes being introduced do not downgrade or compromise the resiliency level of the systems.

While there are techniques and guiding principles that, when followed, can greatly improve the resilience of systems, such design and implementation comes at a price, and that is where the economics of Resiliency needs to be considered. For instance, mission-critical software systems, like the ones used in medical devices, need a high level of resilience, but many business systems can tolerate more disruption and can therefore be less resilient. Either way, it is good to document the expected resilience level at the initial stage and work on it early in the system development life cycle. Thinking about resilience later in the life cycle may not help much, as implementation will then call for higher investment.


Crosstalk - The Journal of Defense Software Engineering, Vol. 22, No. 6

Resilient Systems Working Group

OWASP - AppSensor Project