Sunday, April 10, 2016

Economics of Software Resiliency

Resilience is a design feature that facilitates the software to recover from occurrence of an disruptive event. As it is evident, this is kind of automated recovery from disastrous events after occurrence of such events. Yes, given an option, we would want the software that we build or buy has the resilience within it. Obviously, the resilience comes with a cost and the economies of benefit should be seen before deciding on what level of resilience is required. There is a need to balance the cost and effectiveness of the recovery or resilience capabilities against the events that cause disruption or downtime. These costs may be reduced or rather optimized if the expectation of failure or compromise is lowered through preventative measures, deterrence, or avoidance.

There is a trade-off between protective measures and investments in survivability, i.e., the cost of preventing the event versus recovering from the event. Another key factor that influences this decision is that cost of such event if it occurs. This suggests that a number of combinations need to be evaluated, depending on the resiliency of the primary systems, the criticality of the application, and the options as to backup systems and facilities.

This analysis in a sense will be identical to the risk management process. The following elements form part of this process:


Identify problems


The events that could lead to failure of the software are numerous. Developers know that exception handling is an important best practices one should adhere to while designing and developing a software system. Most modern programming languages provide support for catching and handling of exceptions.  This will at a low level help in identifying the exceptions encountered by a particular application component in the run-time. There may be certain events, which can not be handled from within the component, which require an external component to monitor and handle the same. Leave alone the exception handling ability of the programming language, the architects designing the system shall identify and document such exceptions and accordingly design a solution to get over such exception, so that the system becomes more resilient and reliable. The following would primarily bring out possible problems or exceptions that need to be handled to make the system more resilient:


  • Dependency on Hardware / Software resources - Whenever the designed system need to access a hardware resource, for example a specified folder in the local disk drive, expect a situation of the folder not being there, the application context doesn't have enough permissions to perform its actions, disk space being exhausted, etc. This equally applies to software resources like, an operating system, a third party software component, etc.
  • Dependency on external Devices / Servers / Services / Protocols - Access to external devices like printers, scanners, etc., or other services exposed for use by the application system, like an SMTP service for sending emails, database access, a web service over HTTPS protocol, etc. could also cause problems, like the remote device not being reachable, or a protocol mismatch, request or response data inconsistency, access permissions etc. 
  • Data inconsistency - In complex application systems, certain scenarios could lead to a situation of inconsistent internal data which may lead to the application getting into a dead-lock or never ending loop. Such a situation may have cascading effect as such components will consume considerable system resources quickly and leading to a total system crash. This is a typical situation in web applications as each external request is executed in separate threads and when each such thread get into a 'hung' state, over a period, the request queue will soon surpass the installed capacity. 


Cost of Prevention / recovery


The cost of prevention depends on the available solutions to overcome or handle such exceptions. For instance, if the issue is about the SMTP service being unavailable, then the solution could be to have an alternate redundant, always active SMTP service running out of a totally different network environment, so that the system can switch over to such alternate service if it encounters issues with the primary one. While the cost of implementing the handling of multiple SMTP services and a fail-over algorithm may not be significant, but maintaining redundant SMTP service could have significant cost impact. Thus with respect to each such event that may have an impact on the software resilience, the total cost for a pro-active solution vis-a-vis a reactive solution should be assessed.

Time to Recover & Impact of Event


While the cost of prevention / recovery as assessed above will be an indicator of how expensive the solution is, the Time to Recover and the Impact of such an event happening will indicate the cost of not having the event handled or worked around. Simple issues like a database dead-lock may be reactively handled by the DBAs who will be monitoring for such issues and will act immediately when such an event arise. But issues like, the network link to an external service failing, may mean an extended system unavailability and thus impacting the business. So, it is critical to assess the time to recover and the impact that such an event may have, if not handled instantly.

Depending on the above metric, the software architect may suggest an cost-effective solution to handle each such events. The level of resiliency that is appropriate for an organization depends on how critical the system in question is for the business, and the impact of the lack of resilience for the business. The organization understands that the resiliency has its own cost-benefit. The architects should have this in mind and design solutions to suit the specific organization.

The following are some of the best practices that the architects and the developers should follow while designing and building the software systems:
  • Avoid usage of proprietary protocols and software that makes migration or graceful degradation very difficult.
  • Identify and handle single points of failure. Of course, building redundancy has cost.
  • Loosely couple the service integrations, so that inter-dependence of services is managed appropriately.
  • Identify and overcome weak architecture / designs within the software modules or components.
  • Anticipate failure of every function and design for fall-back-scenarios, graceful degradation when appropriate.
  • Design to protect state in multi‐threaded and distributed execution environments.
  • Expect exceptions and implement safe use of inheritance and polymorphism 
  • Manage and handle the bounds of various software and hardware resources.
  • Manage allocated resources by using it only when needed.
  • Be aware of timeouts of various services and protocols and handle it appropriately