Friday, July 13, 2012

Architecture Review - Scalability

Scalability is an important quality attribute of any system, be it hardware or software. But in most cases, the need for a scalability check or review is felt only when certain signs of scalability problems show up. Typically, the following are such signs that call for a scalability review of an existing application.
  • Change requests on certain subsystems are turned down by the development team(s) on the grounds that the subsystem is complex and any change to it would either demand a huge regression testing effort or have a wider impact on the whole system. This indicates that certain components or subsystems are preventing the system from scaling.
  • Months after going into production, the application gradually slows down, and there is a tendency either to accept the slowdown or to pump in more hardware to compensate for it. This is again an important sign that the application is not scaling to take on the ever-growing user base and transaction volumes.
There could be more signs indicating scalability issues within the application. Unfortunately, scalability reviews are often not done in the initial design phase, where they could prevent these post-production troubles from showing up. While reviewing an existing application for potential scalability issues may be easy, addressing them may not be, because of the underlying design and architecture of the application and its inter-dependencies with other systems in use. Let us examine certain important aspects to look into to spot potential scalability problems.

Distributed architecture: While a distributed design is likely to improve performance, it could lead to scalability issues when one or more components or subsystems rely on local resources. Another reason to review this with care is that an ill-designed system may call for too much communication across the physical and logical boundaries of its subsystems and so depend heavily on the communication infrastructure.

Component interaction: Examine how the components or subsystems are designed to interact with each other and how closely they are positioned. Too many component interactions could lead to network congestion and very high latencies, which turn into performance and scalability issues when usage increases. Measure the payload and the latency of such inter-component interactions and isolate the components that need redesign. Keeping the data and the behaviour close together reduces the interactions across boundaries and, as a result, keeps latency in check.
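
As a rough illustration of measuring payload and latency, the sketch below wraps a call between two components and records both. The names `measured`, `STATS` and `chattiest` are made up for this example, and the payload is only approximated by the JSON-encoded size of the request and the response.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class CallStats:
    """Latency and payload samples for one component-to-component interaction."""
    latencies_ms: List[float] = field(default_factory=list)
    payload_bytes: List[int] = field(default_factory=list)


# Hypothetical in-process registry, keyed by interaction name.
STATS: Dict[str, CallStats] = {}


def measured(name: str, call: Callable[[Any], Any], request: Any) -> Any:
    """Invoke call(request) and record its latency and approximate payload size."""
    stats = STATS.setdefault(name, CallStats())
    start = time.perf_counter()
    response = call(request)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Assumption: request and response are JSON-serializable.
    size = len(json.dumps(request).encode()) + len(json.dumps(response).encode())
    stats.latencies_ms.append(elapsed_ms)
    stats.payload_bytes.append(size)
    return response


def chattiest(top: int = 3) -> List[str]:
    """Return the interactions that moved the most bytes in total."""
    ranked = sorted(STATS, key=lambda n: sum(STATS[n].payload_bytes), reverse=True)
    return ranked[:top]
```

The interactions that `chattiest()` reports, together with their latency samples, are a reasonable starting list of candidates for redesign.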

Resource contentions: Look for potential limitations on the hardware or software resources used by the application. For instance, if the application writes huge amounts of log data to the same disk where its transaction data is stored, the write requests may encounter resource contention. Similarly, check how fast the data files grow and how well the disk subsystems support such growth. Possible solutions for such issues include resource pooling, message queues and other asynchronous mechanisms.
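
One way to take log writes off the request path is to hand them to a queue that a background thread drains. The sketch below uses the `QueueHandler` and `QueueListener` classes from Python's standard `logging.handlers` module; the helper name `build_async_logger` is made up for this example.

```python
import logging
import logging.handlers
import queue


def build_async_logger(name: str, target: logging.Handler):
    """Return (logger, listener); records reach `target` on a background thread."""
    log_queue: queue.Queue = queue.Queue(-1)  # unbounded in-memory queue
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # The calling thread only enqueues the record; it never waits on the disk.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The listener thread drains the queue and performs the slow write.
    listener = logging.handlers.QueueListener(log_queue, target)
    listener.start()
    return logger, listener
```

In use, `target` could be a `logging.FileHandler` pointing at a file on a different disk from the transaction data. Call `listener.stop()` at shutdown so that queued records are flushed.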

Remote Communications: It is always beneficial to keep remote calls to a minimum, since too many remote calls make the system overly dependent on the reliability and availability of the communication infrastructure. Ensure the required validations are performed ahead of the remote calls, so that unnecessary remote calls are avoided. Where possible, remote calls should be stateless and asynchronous. Synchronous calls may hold up the communication channels and associated resources for longer periods, which could be a cause of performance and scalability issues. Message queues may help in decoupling subsystems so that they are not held up waiting for synchronous responses.
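
A minimal sketch of both ideas, validating before the remote call and decoupling the caller through a queue, could look like the following. The order fields, `validate_order` and `OrderSubmitter` are hypothetical, and the in-process `queue.Queue` merely stands in for a real message queue.

```python
import queue
import threading
from typing import Callable, Dict, List


def validate_order(order: Dict) -> List[str]:
    """Local checks that would otherwise cost a remote round trip to fail."""
    errors = []
    if not order.get("customer_id"):
        errors.append("customer_id is required")
    if order.get("quantity", 0) <= 0:
        errors.append("quantity must be positive")
    return errors


class OrderSubmitter:
    """Accepts orders immediately and sends them to the remote system in the background."""

    def __init__(self, send_remote: Callable[[Dict], None]):
        self._outbox: queue.Queue = queue.Queue()
        self._send_remote = send_remote
        self.failed: List[Dict] = []  # a real system would retry or dead-letter these
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, order: Dict) -> List[str]:
        errors = validate_order(order)
        if errors:
            return errors  # rejected locally, no remote call is made
        self._outbox.put(order)  # the caller is not held up for the response
        return []

    def _drain(self) -> None:
        while True:
            order = self._outbox.get()
            if order is None:  # sentinel from close()
                return
            try:
                self._send_remote(order)
            except Exception:
                self.failed.append(order)

    def close(self) -> None:
        """Flush everything queued so far and stop the worker."""
        self._outbox.put(None)
        self._worker.join()
```

The caller gets an answer from `submit` as soon as validation finishes, so a slow or unavailable remote system delays only the background worker.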

Cache Management: While the use of a cache can help achieve better performance, it could also prevent the application from scaling in a load-balanced environment unless a distributed caching mechanism is designed and used.
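
The difference shows up as soon as two load-balanced nodes cache the same item. In the sketch below, `InMemoryBackend` is only a stand-in for a shared cache server and `AppNode` is a hypothetical application instance using cache-aside reads. Nodes that share one backend see each other's invalidations, whereas nodes with private caches can serve stale data.

```python
from typing import Callable, Dict, Optional


class InMemoryBackend:
    """Stand-in for a networked, shared cache; here it is just a dictionary."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class AppNode:
    """One load-balanced application instance using cache-aside reads."""

    def __init__(self, cache: InMemoryBackend,
                 load_from_db: Callable[[str], str],
                 write_to_db: Callable[[str, str], None]) -> None:
        self._cache = cache
        self._load = load_from_db
        self._save = write_to_db

    def read(self, key: str) -> str:
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        value = self._load(key)  # cache miss: go to the database
        self._cache.set(key, value)
        return value

    def write(self, key: str, value: str) -> None:
        self._save(key, value)
        self._cache.delete(key)  # invalidate so the next read reloads


# Usage: two nodes behind a load balancer sharing one cache backend.
db = {"price": "10"}
shared = InMemoryBackend()
node_a = AppNode(shared, db.__getitem__, db.__setitem__)
node_b = AppNode(shared, db.__getitem__, db.__setitem__)
node_a.read("price")          # cached as "10"
node_b.write("price", "12")   # invalidates the shared entry
print(node_a.read("price"))   # node_a reloads and prints 12
```

If each node were given its own `InMemoryBackend()`, the write through `node_b` would not invalidate `node_a`'s copy, and `node_a` would keep serving the old value.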

State Management: Look at how the state of persistent objects is being managed. Stateless objects scale better than stateful objects. Distributed state management is the way to address the state management issue in a load-balanced environment. Always prefer stateless components or services, as they perform well and scale well at the same time.

Here are some of the best practices that help achieve high scalability:
  • Prefer stateless asynchronous communications, as they free up resources considerably and support load balancing.
  •  Design the application as multiple fault-isolated subsystems that can be deployed on different hardware environments (or isolated application pools), so that faults in one subsystem do not impact the other subsystems. This partition can be either by service categories or by customer segments.
  •  Use distributed cache solutions, so that the cached data is available on multiple clustered environments.
  •  Use distributed databases with appropriate replication so that loads can be distributed.
  •  Do not depend too much on the specific capabilities of the RDBMS, as this might couple the application tightly to one vendor’s RDBMS. High degree of scalability can be achieved by keeping the business logic outside of the RDBMS.
  •  Spot the potential scalability issues early on by performing design reviews during development and by performing periodic load and performance tests.
  •  Do not ignore capacity planning during the pre-project phase, as it could significantly impact the application usage in production over a period of time. Also be aware of the data growth rates and have a road map to support the ever-increasing data and transaction volumes.
  • Do not ignore root cause analysis. Many times, when developers roll in a fix for a defect, they are not fixing the root cause, which could come back later as a scalability bottleneck.
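
The partitioning practice listed above, by service category or by customer segment, can be sketched as a small routing layer. The function names and pool names here are hypothetical. The customer routing uses a stable hash so that every node computes the same answer for a given customer.

```python
import hashlib
from typing import Dict, List


def pool_for_customer(customer_id: str, pools: List[str]) -> str:
    """Map a customer to one isolated pool using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pools)
    return pools[index]


def pool_for_service(service: str, routes: Dict[str, str], default: str) -> str:
    """Map a service category to its dedicated pool, falling back to a default pool."""
    return routes.get(service, default)


# Usage with made-up pool names.
pools = ["pool-a", "pool-b", "pool-c"]
print(pool_for_customer("customer-1001", pools))
print(pool_for_service("billing", {"billing": "pool-billing"}, "pool-general"))
```

Note that plain modulo hashing reassigns many customers when the number of pools changes; a lookup table or consistent hashing would be the usual alternatives if pools are added often.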
Also read this MSDN Library article, which lists five key considerations for a scalable design.

Saturday, July 7, 2012

Direct Database Updates – A Cause of Concern

Many organizations still have the practice of directly updating production databases to fix data integrity issues. This shows that one or more of the applications deployed on top of the database are not reliable enough to maintain its integrity. This is one of the biggest concerns for information security auditors, as it requires certain resources to be granted access privileges to the production databases, which opens up opportunities for internal hackers to indulge in fraudulent activities.

There could be a multitude of reasons leading to a situation that needs frequent database updates. The following are some of the reasons that impact reliability:
  • Incomplete requirements – It may be possible that the business rules and / or validations are not completely gathered and documented. 
  • Design deficiencies – Design deficiencies like inappropriate error handling, managing the concurrency, etc. could also lead to data integrity issues.
  • Shared database across multiple applications – When multiple applications use a shared database, it is possible that some business rules or data validation requirements are implemented differently, or that some applications have technology or design limitations that introduce data integrity issues.
  • Creeping code complexity over a period of application maintenance – As applications move into the maintenance cycle and newer resources take over the code base, chances are high that, due to the growing complexity and lack of complete knowledge, issues slip through the development phase and sometimes the QA phase as well.
  • Lack of adequate QA / Reviews – Review is a very effective technique to identify potential issues well ahead in the application development life cycle. Unfortunately, most organizations do not give importance to requirement, design and code reviews, or do not do them effectively. This review or QA deficiency could impact reliability.
Though software development processes have matured, organizations tend to compromise on some of the quality attributes, which can leave an application unreliable. Thus, it may not be possible to completely eliminate the need for direct database updates. However, a process with adequate checks and controls should be put in place around this activity to keep the chances of a security breach through this channel under control. At a minimum, the following checks and controls need to be in place.
  • Every request for database update should originate from business function heads and should formally be supported by a service request as logged in to an appropriate tracking system or into a register.
  • Every such request shall be reviewed by the analysts and / or architects to determine whether the data update is necessary and whether the issue could instead be fixed using any of the application features.
  • The review should also suggest two solutions: one isolating the specific tables and columns that need to be updated (corrective action), and the other identifying a possible enhancement to the application(s) to prevent such integrity issues from occurring in the future. The review should also identify the constraints in implementing the data fix; for instance, some fixes may need to be executed before or after a specific scheduled job, or may need the database to be taken offline before execution.
  • In most cases, these issues are very hard to investigate, as they occur rarely and only upon a unique combination of data and program flow. It would be beneficial if the results of such reviews flow back into the process, and the necessary checks and controls are put in place to prevent such issues from slipping through the review and testing phases of the SDLC.
  • On completion of the review, developers may be engaged to create necessary SQL scripts that are required for such updates.
  • These scripts shall be reviewed by the analysts and / or architects and then tested by the QA team.
  • Once the review and test results are clear, the scripts shall be forwarded to the DBAs, who should execute them in production. Ideally, such data updates should be performed in batches, and the affected tables / objects should be backed up prior to execution, so that the old data can be restored when needed.
  • The DBAs should maintain a record of such execution and the resulting log data and the same shall be subject to periodic audit, so as to ensure that the scripts remain unaltered and that no additional unwanted activities happen along with script execution.
  • None of the resources involved in this process except the DBAs should have access to the production database. For the purpose of investigating or troubleshooting certain cases, a clone of the production data may be made available on request and should be taken down once its intended purpose is served. It is important to mask sensitive data while making such production clones and to restrict access to them over the network.
  • It is important that the responsibilities are divided amongst different groups, that the employees involved have demonstrated high credibility in the past, and that accountability is well established.
  • A periodic end to end audit should be performed, which should track right from the origination of the service request to its execution in the production database and any non-compliance must be seriously dealt with.
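
The execution step described above (back up the affected rows first, then apply the fix in batches) might look like the following sketch. It uses Python's built-in `sqlite3` module purely for illustration; the `orders` table, its `status` column and the backup table name derived from the service request number are all hypothetical.

```python
import sqlite3
from typing import List, Tuple


def apply_data_fix(conn: sqlite3.Connection, request_id: str,
                   fixes: List[Tuple[int, str]], batch_size: int = 2) -> int:
    """Back up the affected rows, then correct orders.status in small batches.

    `fixes` holds (order_id, corrected_status) pairs from the reviewed script.
    Returns the number of rows updated.
    """
    if not request_id.isalnum():
        # The request id becomes part of a table name, so restrict it strictly.
        raise ValueError("request_id must be alphanumeric")
    backup = f"orders_backup_{request_id}"
    # Empty copy of the table structure to hold the pre-update rows.
    conn.execute(f"CREATE TABLE {backup} AS SELECT * FROM orders WHERE 0")
    updated = 0
    for start in range(0, len(fixes), batch_size):
        batch = fixes[start:start + batch_size]
        ids = [order_id for order_id, _ in batch]
        marks = ",".join("?" * len(ids))
        with conn:  # each batch commits, or rolls back, as a unit
            conn.execute(
                f"INSERT INTO {backup} SELECT * FROM orders WHERE id IN ({marks})",
                ids,
            )
            for order_id, status in batch:
                cur = conn.execute(
                    "UPDATE orders SET status = ? WHERE id = ?", (status, order_id)
                )
                updated += cur.rowcount
    return updated
```

Because the backup of each batch is written in the same transaction as the update, the old values are always available for a restore, and the returned row count can go into the DBA's execution record for the later audit.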

More than these checks and controls, the organization should look for a decline in database update requests over a period of time, which is an indicator of improving system reliability. Another sign of improvement is that recurring requests of the same nature vanish after two or three occurrences. The organization's software engineering process should also call for adequate checks and controls that contribute to improved system reliability.

Sunday, July 1, 2012

Software Architecture Reviews

Review is a powerful technique that contributes to software quality. Various artifacts of the software development lifecycle are subject to review so that deficiencies are spotted early and addressed sooner, instead of slipping through to later phases and consuming more effort than expected down the line. One such important review is the review of the software architecture. If you are asked to review the architecture of a software system, it could be for one of the following reasons.

  1. Possibly, you are a Senior Architect and are expected to complement your fellow Architect by reviewing his work, thereby helping him, and in turn the organization, to get the best possible software design. Some or most organizations mandate this as part of their engineering process. When this review is done effectively, the benefits are huge, as it occurs early in the development life cycle.
  2. One or more of the custom-built applications used in the organization are suspected to have serious reliability / performance issues, and you are engaged to come up with an analysis and a plan to set them right. If this situation arises, it is evident that the first kind of review did not happen or was not done well. In some cases, such a situation arises when the stakeholders knowingly compromise on certain software quality attributes initially and are then surprised to see the impact when it hits back down the line.
  3. You are possibly looking to license a product and are evaluating its suitability for your organization. In this case, you will probably have a checklist of items based on the IT policies and framework of your organization, and the review is highly dependent on the information revealed by the product vendor.

Though there could be more reasons, the above are some of the primary reasons why one would need to perform an architectural review. In spite of many reviews and much testing, issues slip through and challenge the IT architects at some point down the line. Resolution of such issues may call for specific reviews, and the method and approach differ based on the type of problem. For instance, if there is a data breach, a security review of the architecture is needed, not only to identify the root cause of the current problem but also to identify potential vulnerabilities and come up with solutions to plug those gaps.

These specific reviews can typically be associated with the broad software quality attributes, which are also termed non-functional requirements. The best way to approach these specific reviews is to start with an architectural review. A review checklist is a good tool for the purpose, but it should be exhaustive enough to cover the necessary areas, so that the reviewer gets the right inputs, is in a good position to form an opinion about the possible deficiencies, and can relate them to the problem being resolved.

Keep a watch on my blog for more on specific architecture reviews.