Showing posts with label Scalability. Show all posts
Showing posts with label Scalability. Show all posts

Tuesday, December 23, 2025

Bridging the Gap: Engineering Resilience in Hybrid Environments (DR, Failover, and Chaos)

The "inevitable reality of failure" is the foundational principle of cyber resilience, which shifts the strategic focus from the outdated goal of total prevention (which is impossible) to anticipating, withstanding, recovering from, and adapting to cyber incidents. This approach accepts that complex, interconnected systems will experience failures and breaches, and success is defined by an organization's ability to survive and thrive amidst this uncertainty.

In the past, resilience meant building a fortress around your on-premises data center—redundant power, dual-homed networks, and expensive SAN replication. Today, the fortress walls have been breached by necessity. We live in a hybrid world. Critical workloads remain on-premises due to compliance or latency needs, while others burst into the cloud for scalability and innovation.

This hybrid reality offers immense power and scalability, but it introduces a new dimension of fragility: the "seam" between environments.

How do you ensure uptime when a backhoe or an excavator cuts fiber outside your data center, an AWS region experiences an outage, or, more commonly, the complex networking glue connecting the two suddenly degrades?

Key principles for managing inevitable failure include:
 
  • Anticipate: This involves proactive risk assessments and scenario planning to understand potential threats and vulnerabilities before they materialize.
  • Withstand: The goal is to ensure critical systems continue operating during an attack. This is achieved through resilient architectures, network segmentation, redundancy, and failover mechanisms that limit the damage and preserve essential functions.
  • Recover: This focuses on restoring normal operations quickly and effectively after an incident. Key components include immutable backups, tested recovery plans, and clean restoration environments to minimize downtime and data loss.
  • Adapt: The final, crucial step is to learn from every incident and near-miss. Post-incident analyses (often "blameless" to encourage honest assessment) inform continuous improvements to strategies, tools, and processes, helping the organization evolve faster than the threats it faces.

Resilience in a hybrid environment isn't just about preventing failure; it’s about enduring it. It requires moving beyond hope as a strategy and embracing a tripartite approach: Robust Disaster Recovery (DR), automated Failover, and proactive Chaos Engineering.

1. The Foundation: Disaster Recovery (DR) in a Hybrid World


Disaster Recovery is your insurance policy for catastrophic events. It is the process of regaining access to data and infrastructure after a significant outage—a hurricane hitting your primary data center, a massive ransomware attack, or a prolonged regional cloud failure.

In a hybrid context, DR often involves using the cloud as a cost-effective lifeboat for on-premises infrastructure.

The Metrics That Matter: RTO and RPO


Before choosing a strategy, you must define your business tolerance for loss:
  • Recovery Point Objective (RPO): How much data can you afford to lose? (e.g., "We can lose up to 15 minutes of transactions.")
  • Recovery Time Objective (RTO): How fast must you be back online? (e.g., "We must be operational within 4 hours.")

The lower the RTO/RPO, the higher the cost and complexity.

Hybrid DR Strategies


Hybrid architectures unlock several DR models that were previously unaffordable for many organizations:

A. Backup and Restore (Cold DR):

A Backup and Restore (Cold DR) strategy is a cost-effective, fundamental disaster recovery approach for non-critical systems, involving regular data/config backups stored dormant, then manually restoring everything (data, apps, infra via Infrastructure as Code) to a secondary site after an outage, leading to longer Recovery Time Objectives (RTOs) but lower costs. It protects against major disasters by replicating data to another region, relying on automated backups and Infrastructure as Code (IaC) like CloudFormation for efficient, repeatable recovery.

How it Works:

Backup: Regularly snapshot data (databases, volumes) and configurations (AMIs, application code) to a secure, remote location (e.g., S3 in another AWS Region). 
Infrastructure as Code (IaC): Use tools (CloudFormation, Terraform, AWS CDK) to define your entire infrastructure (servers, networks) in code.
Dormant State: In a disaster, the secondary environment remains unprovisioned or powered down (cold).
Recovery:
    1. Manually trigger IaC scripts to provision the infrastructure in the recovery region.
    2. Restore data from the stored backups onto the newly provisioned resources.
    3. Automate application redeployment if needed.
Best For: Systems where downtime (hours/days) and some data loss are acceptable; compliance needs; protecting against regional outages.


B. Pilot Light:

A Pilot Light Disaster Recovery (DR) strategy involves running a minimal, core version of your infrastructure in a standby cloud region, like a small flame ready to ignite a full fire, keeping essential data replicated (e.g., databases) but leaving compute resources shut down until a disaster strikes, offering a cost-effective balance with faster recovery (minutes) than backup/restore but slower than warm standby, ideal for non-critical systems needing quick, affordable recovery.

How it Works:

Core Infrastructure: Essential services (like databases) are always running and replicating data to a secondary region (e.g., AWS, Azure, GCP).
Minimal Resources: Compute resources (like servers/VMs) are kept in a "stopped" or "unprovisioned" state, saving costs.
Data Replication: Continuous, near real-time data replication ensures minimal data loss (low RPO).
Scale-Up on Demand: During a disaster, automated processes rapidly provision and scale up the idle compute resources (using pre-configured AMIs/images) around the live data, scaling to full production capacity.

Best For: 
Applications where downtime is acceptable for a few minutes to tens of minutes (e.g., 10-30 mins).
Non-mission-critical workloads that still require faster recovery than simple backups.

C. Warm Standby:

A Warm Standby DR strategy uses a scaled-down, but fully functional, replica of your production environment in a separate location (like another cloud region) that's always running and kept updated with live data, allowing for rapid failover with minimal downtime (low RTO/RPO) by quickly scaling resources to full capacity when disaster strikes, balancing cost with fast recovery.

How it Works:
 
Minimal Infrastructure: Key components (databases, app servers) are running but at lower capacity (e.g., fewer or smaller instances) to save costs.
Always On: The standby environment is active, not shut down, with replicated data and configurations.
Quick Scale-Up: In a disaster, automated processes quickly add more instances or resize existing ones to handle full production load.
Ready for Testing: Because it's a functional stack, it's easier to test recovery procedures.

Best For
Business-critical systems needing recovery in minutes.
Environments requiring frequent testing of DR readiness.


D. Active/Active (Multi-Site):

An Active/Active (Multi-Site) DR Strategy runs full production environments in multiple locations (regions) simultaneously, sharing live traffic for maximum availability, near-zero downtime (low RTO/RPO), and performance; it involves real-time data replication and smart routing (like DNS/Route 53) to instantly shift users from a failed site to healthy ones, but comes with the highest cost and complexity, suitable only for critical systems needing continuous operation.

How it Works:
 
Simultaneous Operations: Two or more full-scale, identical environments run in different geographic regions, handling live user requests concurrently.
Data Replication: Data is continuously replicated between sites, often synchronously, ensuring low Recovery Point Objective (RPO) – minimal data loss.
Intelligent Traffic Routing: Services like Amazon Route 53 or AWS Global Accelerator direct users to the nearest or healthiest region, using health checks to detect failures.
Instant Failover: If one region fails, traffic is automatically and immediately redirected to the remaining active regions, leading to near-instant recovery (low Recovery Time Objective - RTO).

Best For
Business-critical applications where any downtime is unacceptable.
Workloads requiring low latency for a global user base.


2. The Immediate Response: Hybrid Failover Mechanisms


While DR handles catastrophes, Failover handles the everyday hiccups. Failover is the (ideally automatic) process of switching to a redundant or standby system upon the failure of the primary system, mostly automatic.

Failover mechanisms in a hybrid environment ensure immediate operational continuity by automatically switching workloads from a failed primary system (on-premises or cloud) to a redundant secondary system with minimal downtime. This requires coordinating recovery across cloud and on-premises platforms.

In a hybrid environment, failover is significantly more complex because it often involves crossing network boundaries and dealing with latency differentials.

Core Concepts of Hybrid Failover


High Availability (HA) vs. Disaster Recovery (DR): HA focuses on minimizing downtime from component failures, often within the same location or region. DR extends this capability to protect against large-scale regional outages by redirecting operations to geographically distant data centers.
Automatic vs. Manual Failover: Automatic failover uses system monitoring (like "heartbeat" signals between servers) to trigger a switch without human intervention, ideal for critical systems where every second of downtime is costly. Manual failover involves an administrator controlling the transition, suitable for complex environments where careful oversight is needed.
Failback: Once the primary system is repaired, failback is the planned process of returning operations to the original infrastructure.

Common Failover Configurations


Hybrid environments typically use a combination of these approaches:

Active-Passive: The primary system actively handles traffic, while the secondary system remains in standby mode, ready to take over. This is cost-effective but may have a brief switchover time.
Active-Active: Both primary and secondary systems run simultaneously and process traffic, often distributing the workload via a load balancer. If one fails, the other picks up the slack immediately, resulting in virtually zero downtime, though at a higher cost.
Multi-Site/Multi-Region: Involves deploying resources across different physical locations or cloud availability zones to protect against localized outages. DNS-based failover is often used here to reroute user traffic to the nearest healthy endpoint.
Cloud-to-Premises/Premises-to-Cloud: A specific hybrid strategy where, for example, a cloud-based Identity Provider (IDP) failing results in an automatic switch to an on-premises Active Directory system

3. The Stress Test: Chaos Engineering


You have designed your DR plan, and you have implemented automated failover. But will they actually work at 3:00 AM on Black Friday?

Chaos engineering is a proactive discipline used to stress-test systems by intentionally introducing controlled failures to identify weaknesses and build resilience. In hybrid environments—which combine on-premises infrastructure with cloud resources—this practice is essential for navigating the added complexity and ensuring continuous reliability across diverse platforms.

It is not about "breaking things randomly"; it is about controlled, hypothesis-driven experiments.

In a hybrid environment, Chaos Engineering is mandatory because the complexity masks hidden dependencies.

The Role of Chaos Engineering in Hybrid Environments


Hybrid environments are inherently complex due to the number of interacting components, network variations, and differing management models. Chaos engineering helps address this by:
 
Uncovering hidden dependencies: Experiments reveal unexpected interconnections and single points of failure (SPOFs) between cloud-based microservices and legacy on-premise systems.
Validating failover mechanisms: It tests whether the system can automatically switch to redundant systems (e.g., a backup database in the cloud if an on-premise one fails) as intended.
Assessing network resilience: Simulating network latency or packet loss between the different environments helps understand how applications handle intermittent connectivity across the hybrid setup.
Improving observability: Running experiments forces teams to implement robust monitoring and alerting, providing a clearer picture of system behavior under stress across the entire hybrid architecture.
Building team confidence and "muscle memory": By conducting planned "Game Days" (disaster drills), engineering teams gain valuable practice in incident response, reducing Mean Time To Recovery (MTTR) during actual outages.

Key Principles and Best Practices


To conduct chaos engineering safely and effectively, especially in complex hybrid scenarios, specific principles should be followed:
 
Define a "Steady State": Before any experiment, establish clear metrics for what "normal" system behavior looks like (e.g., request success rate, latency, error rates).
Formulate a Hypothesis: Predict how the system should react to a specific failure (e.g., "If the on-premise authentication service goes down, the cloud-based application will automatically use the backup in Azure without user impact").
Start Small and Limit the "Blast Radius": Begin experiments in a non-production environment and, when moving to production, start with a minimal scope to control potential damage.
Automate and Monitor Extensively: Use robust observability tools to track metrics in real time during experiments and automate rollbacks if the experiment spirals out of control.
Foster a Learning Culture: Treat failures as learning opportunities rather than reasons for blame to encourage open analysis and continuous improvement.

Common Experiment Types in a Hybrid Context


Experiments can be tailored to the unique vulnerabilities of hybrid setups:

Service termination: Randomly shutting down virtual machines or containers residing on different platforms (on-premise vs. cloud) to test redundancy.
Network chaos: Introducing artificial latency or dropped packets in traffic between the on-premise datacenter and the cloud region.
Resource starvation: Consuming high CPU or memory on a specific host to see how load balancing and failover mechanisms distribute the workload.
Dependency disruption: Blocking access to a core service (like a database or API gateway) housed in one environment from applications running in the other.


Conclusion: Resilience is a continuous Journey


Building resilience in a hybrid environment is not a project you complete once and forget. It is a continuous operational lifecycle.
 
Design with failure in mind (using hybrid DR strategies).
Implement automated recovery (using intelligent failover mechanisms).
Verify your assumptions relentlessly (using Chaos Engineering).

The hybrid cloud offers incredible flexibility, but it demands a higher standard of engineering discipline. By integrating DR, Failover, and Chaos Engineering into your operational culture, you move from fearing the inevitable failure to embracing it as just another Tuesday event.

Monday, November 3, 2025

Securing APIs at Scale: Threats, Testing, and Governance

As organizations embrace microservices, cloud-native architectures, and digital ecosystems, APIs have become the connective tissue of modern business. From mobile apps to microservices architectures, APIs power virtually every digital interaction we have. As API usage explodes, so do the potential attack vectors, making robust security measures not just important, but essential. 

API security must be approached as a fundamental element of the design and development process, rather than an afterthought or add-on. Many organizations fall short in this regard, assuming that security measures can be patched onto an existing system by deploying security devices like Web Application Firewall (WAF) at the perimeter. In reality, secure APIs begin with the first line of code, integrating security controls throughout the design lifecycle. Even minor security gaps can result in significant economic losses, legal repercussions, and long-term brand damage. Designing APIs with inadequate security practices introduces risks that compound over time, often becoming a time bomb for organizations.

Securing APIs at scale requires more than just technical controls; it demands a lifecycle approach that integrates threat awareness, rigorous testing, and robust governance.
 

The Evolving Threat Landscape


APIs are attractive targets for attackers because they expose business logic, data flows, and authentication mechanisms. According to Salt Security, 94% of organizations experienced an API-related security incident in the past year. The threats facing APIs are constantly evolving, becoming more sophisticated and targeted. Here are some of the most prevalent and concerning threats:

  • Broken Authentication & Authorization: This is a perennial favourite for attackers. Weak authentication mechanisms, default credentials, or insufficient authorization checks can lead to unauthorized access, allowing attackers to impersonate users, access sensitive data, or perform actions that they shouldn't. Think of a poorly secured login endpoint that allows brute-forcing, or an API that lets a regular user modify administrative settings.
  • Injection Flaws (SQL, NoSQL, Command Injection): While often associated with web applications, injection vulnerabilities are equally dangerous in APIs. Malicious input, often disguised within legitimate API requests, can trick the backend system into executing unintended commands, revealing sensitive data, or even taking control of the server.
  • Excessive Data Exposure: APIs are designed to provide data, but sometimes they provide too much data. Overly broad API responses might inadvertently expose sensitive information (e.g., user email addresses, internal system details) that isn't necessary for the client's function. Attackers can then leverage this exposed information for further exploitation.
  • Lack of Resource & Rate Limiting: Unrestricted access to API endpoints can lead to various attacks, including denial-of-service (DoS) or brute-force attacks. Without proper rate limiting, an attacker could bombard an API with requests, overwhelming the server or attempting to guess credentials repeatedly.
  • Broken Function Level Authorization: Even if a user is authenticated, they might have access to functions or resources they shouldn't. This often occurs when access control checks are not granular enough, allowing a user with basic permissions to perform actions intended only for administrators.
  • Security Misconfiguration: This is a broad category encompassing many common errors, such as default security settings that are left unchanged, improper CORS policies, verbose error messages that reveal system details, or unpatched vulnerabilities in underlying software components.
  • Mass Assignment: This occurs when an API allows a client to update an object's properties without proper validation, potentially allowing an attacker to modify properties that should only be controlled by the server (e.g., changing a user's role from "standard" to "admin").
  • Denial-of-Service (DoS): A DoS attack on an API aims to make the API unavailable to legitimate users by overwhelming it with requests or exploiting vulnerabilities. This can lead to service disruptions, downtime, and potential reputational damage. This is usually accomplished by the attackers using techniques like, Request Flooding, Resource Exhaustion, Exploiting vulnerabilities.
  • Shadow APIs: These are the APIs that operates within an organization's environment without the knowledge, documentation, or oversight of the IT and security teams. These unmanaged APIs represent a significant security threat because they expand the attack surface and often lack essential security controls, making them an easy entry point for cybercriminals.

Proactive Testing: Building Resilience


Given the complexity and scale of API ecosystems, a proactive and comprehensive testing strategy is crucial. Relying solely on manual testing is no longer sufficient; automation is key. Following are some of the testing techniques:
 
  • Static Application Security Testing (SAST): SAST tools analyze your API's source code, bytecode, or binary code without executing it. They can identify potential vulnerabilities like injection flaws, insecure cryptographic practices, and hardcoded secrets early in the development lifecycle, allowing developers to fix issues before they reach production.
  • Dynamic Application Security Testing (DAST): DAST tools interact with the running API, simulating real-world attacks. They can identify vulnerabilities like broken authentication, injection flaws, and security misconfigurations by sending various requests and analyzing the API's responses. DAST is excellent for finding vulnerabilities that only manifest during runtime.
  • Interactive Application Security Testing (IAST): IAST combines elements of SAST and DAST. It works by instrumenting the running application and monitoring its execution in real-time. This allows IAST to provide highly accurate vulnerability detection, pinpointing the exact line of code where a vulnerability resides and offering context on how it can be exploited.
  • API Penetration Testing: Beyond automated tools, ethical hackers perform manual penetration tests to uncover complex vulnerabilities that automated scanners might miss. These "white hat" hackers simulate real-world attack scenarios, trying to exploit logical flaws, bypass security controls, and gain unauthorized access to the API.
  • Fuzz Testing: This technique involves feeding a large volume of malformed or unexpected data to an API endpoint to stress-test its resilience and uncover vulnerabilities or crashes that might not be apparent with standard inputs.
  • Schema Validation: Enforcing strict schema validation for all API requests and responses helps prevent malformed inputs and ensures data integrity, significantly reducing the risk of injection attacks and other data manipulation exploits.
  • Runtime Protection: This refers to the measures and tools implemented to safeguard APIs while they are actively listening and processing requests and responses in production environment. This form of protection focuses on real-time threat detection and prevention, ensuring that APIs function securely during their operational lifespan. API runtime protection is crucial because it addresses threats that may not be caught during the design or development phases.

Robust Governance: The Foundation of Security


Technical controls are vital, but without a strong governance framework, API security efforts can quickly unravel. Without governance, APIs become a “wild west” of inconsistent standards, duplicated efforts, and accidental exposure. Governance provides the policies, processes, and oversight necessary to maintain a secure API ecosystem at scale. Effective Governance includes:

  • API Security Policy & Standards: Establish clear, comprehensive security policies and coding standards that all API developers must adhere to. This includes guidelines for authentication, authorization, input validation, error handling, logging, and data encryption.
  • Centralized API Gateway: Implement an API Gateway as a single entry point for all API traffic. Gateways can enforce security policies (e.g., authentication, rate limiting, IP whitelisting), perform threat protection, and provide centralized logging and monitoring capabilities.
  • Access Control & Least Privilege: Implement robust Role-Based Access Control (RBAC) to ensure users and applications only have access to the specific API resources and actions they need to perform their functions. Adhere to the principle of least privilege.
  • Regular Security Audits & Reviews: Conduct periodic security audits of your API infrastructure, code, and configurations. Regular reviews help identify deviations from policy, outdated security measures, and new vulnerabilities.
  • Threat Modeling: Before developing new APIs, conduct threat modeling exercises to identify potential threats, vulnerabilities, and attack vectors. This proactive approach helps embed security into the design phase rather than trying to patch it on later.
  • Incident Response Plan: Develop a comprehensive incident response plan specifically for API security incidents. This plan should outline steps for detection, containment, eradication, recovery, and post-incident analysis.
  • Developer Training & Awareness: Educate your development teams on secure coding practices, common API vulnerabilities, and your organization's security policies. Continuous training is essential to keep developers informed about the latest threats and mitigation techniques.
  • Version Control & Deprecation Strategy: Securely manage API versions and have a clear strategy for deprecating older, less secure API versions. Attackers often target older endpoints with known vulnerabilities.
  • Continuous Monitoring & Alerting: Implement robust monitoring solutions to track API traffic, identify unusual patterns, detect potential attacks, and trigger alerts in real-time. This includes monitoring for authentication failures, unusually high request volumes, and suspicious data access patterns.

Conclusion 


Securing APIs at scale is an ongoing journey, not a destination and it is not just a technical challenge—it’s a strategic imperative. It requires a multifaceted approach that combines advanced technical testing with a strong governance framework and a culture of security awareness. By understanding the evolving threat landscape, implementing proactive testing methodologies, and establishing robust governance, organizations can build resilient API ecosystems that empower innovation while protecting sensitive data and critical business functions. The investment in API security today will undoubtedly pay dividends in preventing costly breaches and maintaining trust in an increasingly API-driven world.

Sunday, July 28, 2013

Colocation - Key Considerations for Selection of Service Provider

Increasing number of organizations are shifting towards Colocating the data centers as they see the inherent benefits like efficiency, security and other shared managed services. Colocation is offered as a service based on the standards, policies, procedures, people and the data centre infrastructure. The user experience depends on the quality of each of these components that form part of the service. The type of provider also drives the end-user experience because you get what you pay for. Given the significant benefits of colocation, CIOs cannot just ignore this as an option, but need to look for reliable facilities that upholds the needs of the organization.

The benefits of going for colocation service include effective use of capital and having higher quality facilities through power redundancy, cooling, and scalability/growth. Choosing a colocation provider is a strategic business decision that evolves from thoughtful consideration certain key parameters. More over, the partnership with the colocation service provider need a multi-year commitment from both. Listed below are some of the key considerations that help the CIOs to make a well thought out decision on choosing the right provider. The list below however is in no order of priority.

Customer & Partner Considerations
Many tend to focus on the organization's IT needs and base the decision to colocate and the choice of the service provider on such internal design parameters. However, the CIOs also have to consider the strategies of their customers and partners . For instance, in case of highly interconnected and collaborative partner systems, the DR sites need to have connectivity with the DR sites of the partner systems as well. If the service goes down customers will easily become frustrated with the company, especially if they are in a new market and are not long-time clients.

Power and capacity planning
The power supply is a key component of the shared services that the colocation service provider offers. Needless to say that it is a vital that the data center is supported with a scalable and reliable power supply in order to keep up the infrastructure components. The supplier should focus on the future needs of the customers and perform proactive capacity planning so that they are prepared to support the scalability needs of the organizaiton.

Resiliency
Designing and building a reliable and resilient data center would call for heavy investment and smaller organizations would find it difficult to justify the RoI for such investments. However, colocation service offers this benefit as the provider would however will be sharing the cost with multiple service consumers. However careful evaluation of the design and architecture elements that will contribute towards ensuring reliability and resiliency of the service is essential in making a decision on the choice of service provider. A carefull review of the SLA of any colocation provider is also very important to ensure that the service provider assures commits for the current future investments to support this need.

Security and technical advancement
With colocation, the organization is leaving its servers, equipments and the more precious data assets in the custody of the service provider, which usually is outside the physical security boundary of the organization. Thus, physical security to the premises and the assets hosted there in is an important component of the colocation service and needs a careful consideration and evaluation. A comprehensive security, beyond the guard, ensuring data protection at all times is what is required. Additional factors of security involving evolving technologies like biometric, card readers, keypad or electronic locks as well as surveilance cameras should form part of the security investments.

Network connectivity and proximity
Like reliability and resiliency, network performance is a key customer service issue, not just a technology issue. An ineffective interconnect setup or limited network system can lead to disruptions in the business operations. It would also matter much to ascertain the network latency associated with connectivity with partner systems. In most cases, if there is a need to have heavy amount of data or information exchanged with a particular partner system, it would be beneficial to have the data center closer to that of the partner's so that the latency between such centers could be as much lower, leading to faster and efficient communication. Network performance issues can have a tangible financial impact to the bottom line. It is usual that the colocation providers offer diverse options for internet access and a careful selection of a primary and redundant internet service providers is essential. 

Technology & support
The colocation service provider should be implementing standard facility design elements and integrating proven technologies so that better power and cooling capacities is achieved. The provider should also have the demonstrated expertise in facilitating flexible deployment of higher density solutions in an on-demand model. Despite the colocation arrangement, the IT heads of the organization are still responsible in addressing the business needs and as such a careful evaluation of the standards and practices that the provider follows in keeping up with the technology and trends is very much essential.

While most providers offer on-site 24/7 technical support, the skill sets in the offer could vary. Again the organization may demand a specialized skill, which the service provider might not be able to support. As such, it is important to compare the skills needed vis-a-vis the skills on the offer and make appropriate decisions. Similarly, the maintenance windows of the service provider for certain support services could be different from that the organization would want. Thus this is another component that needs careful consideration.

Flexible commercial terms
While the key benefit of outsourcing is the shift from the capital investment to operational expenditure, CIOs might endup spending way higher unless a carefull evaluation of the commercials attached to each of the specific service or component that forms part of the colocation service. Ideal cloud service provider would let the customer to pick and choose the components and also offer multiple and flexible billing options to choose. A careful choice is required to ensure to optimize the value realization out of the colocation engagement.

Locational Constraints
Despite having on-site technical support, there would be emergencies needing experts to visit the data center to physically visit the location and work on solutions. As such, the geographical location of the data center(s) does matter to the organization and needs careful consideration. Similarly, local issues like frequent political unrests, frequency of natural disasters in the region, proximity to the airport, etc need to be considered. For instance if the data center is located in a flood prone area, there is a high risk of the data center services getting affected.

Regulatory Compliance
Various countries and states have legislations that could have an impact on the data or other assets that is being housed at the colocated data center.  For instance, UK and US have privacy and security legislations that have an impact on the way the information is stored, processed or disseminated. It is also beneficial to have an insider’s view and legal opinions where needed.  A provider located in a flood plain or an area prone to hurricanes creates unnecessary risk for the business.

Auditability
Given that most businesses are technology driven, the regulatory and legal needs warrant that the audit of systems and facilities and thus calling for appropriate arrangement with the provider to support this need. The  provider may be requested to share copy(ies) of a SAS 70, SOC 2 or such other audit report, which represents a third-party validation of internal process and controls.

Costs
Cost is also an equally important aspect to consider. However, many times, it may not make sense in just considering the cost and instead it should be considered in line with the tangible and intangible value that could be realized out of colocation. For instance, better network performance would benefit as there could be faster and increased acceptance of the systems by the customers and partners resulting in expanding the market. Similarly, better security services might protect the organization from potential liabilities that may arise in future on account of security breaches. Thus the lowest cost isn’t necessarily the best for the business. Lack of support, insufficient product lines or unreliable data center facilities can end up being very costly later on. 

Green IT
The power consumption of a data center is way higher than a regular office premise. Colocation by itself would contribute towards optimizing the energy consumption as it would be a consolidated specially built facility shared amongst many. However, it is worth exploring the measures that the provider has in place towards conserving energy. and such other green practices in place.

While the above considerations are generic, certain other considerations might be vital for some organizations depending on their business nature and priorities. This list can be further updated based on the feedback and comments from the readers.

Friday, July 13, 2012

Architecture Review - Scalability

Scalability is an important quality attribute of any system, be it hardware or software. But in most cases, the need for a scalability check or review is felt only when certain signs of scalability problems show up. Typically, the following are such signs that call for a scalability review of an existing application.
  • When changes requested on certain subsystems are turned down by the development team(s) citing that it is a complex subsystem and any change to it might call for huge efforts in terms of regression testing or else, it could lead to a bigger impact on the whole system. This indicates that there are certain components or sub systems which prevent the system from scaling. 
  • Months after production usage, the application performance gradually slows down and there is a tendency to accept the performance slow down or to pump in more hardware to compensate the slowdown. This is again is an important sign that the application is not scaling to take on the ever growing user base and transaction volumes.
There could be more signs that could indicate that there are scalability issues within the application. It is unfortunate that scalability reviews are not done in the initial design phase, so that these post production troubles won’t show up. While reviewing an existing application for potential scalability issues may be easy, the solutions for addressing those may not be really easy. That could be because of the underlying design & architecture of the application and its inter-dependencies with other systems in use. Let us examine certain important aspects to look into to spot potential scalability problems.

Distributed architecture: While distributed design is likely to improve performance, it could lead to scalability issues when one or more components or sub system relies on the local resources. Another reason for this to be reviewed with care is an ill designed system may call for too much of communication across physical and logical boundaries of various subsystems and would rely more on the communication infrastructure.

Component interaction: Examine how the components or subsystems are designed to interact with each other and how closely they are positioned. Too much of component interactions could lead to network congestion and also result in very high latencies which results in performance scalability issues later on when the usage increases. Measure the payload and the latency of such inter component interactions and isolate the components that need redesign. Keeping the data and the behaviour closer will reduce the interactions across boundaries and as a result keeps the latency under check.

Resource contentions: Look for potential limitations on the hardware or software resources used by the application. For instance, if the application produces huge amounts of log data, on the same disk where its transaction data is stored, the write requests may encounter resource contentions. Similarly, how fast the data files grow and how does the disk subsystems support such growth. Possible solutions for such issues are using resource pooling, message queues or such other asynchronous mechanisms.

Remote Communications: It would always be beneficial to limit the remote calls to the minimum or else, too many remote calls may expose the system too much on the reliability and availability of the communication infrastructure. Ensure required validations are performed ahead of the remote calls, so that unnecessary remote calls are avoided. Where possible, the remote calls should be stateless and asynchronous. Synchronous calls may hold up the communication channels and associated resources for longer period which could be the potential cause for performance and scalability issues. Use of message queues may help in decoupling subsystems from being held up for synchronous responses.

Cache Management: While use of Cache can help achieve better performance, it could also prevent the application from scaling in a load balanced environment, unless a distributed caching mechanism is designed and used.

State Management: Look for how the state of the persistent objects is being managed. Stateless objects scale better than the stateful objects. Distributed state management is the solution to address the state management issue in a load balanced environment. Always prefer stateless components or services as this will perform well and at the same time scale well.

Here are some of the best practices that help achieve high scalability
  • Prefer stateless asynchronous communications as this will free up resources considerably and supports load balancing.
  •  Design the application into multiple fault isolated subsystems with ability of being deployed on different hardware environments (or isolated application pools), so that faults in one subsystem does not impact the other subsystems. This partition can be either by service categories or by customer segments.
  •  Use distributed cache solutions, so that the cached data is available on multiple clustered environments.
  •  Use distributed databases with appropriate replication so that loads can be distributed.
  •  Do not depend too much on the specific capabilities of the RDBMS, as this might couple the application tightly to one vendor’s RDBMS. High degree of scalability can be achieved by keeping the business logic outside of the RDBMS.
  •  Spot the potential scalability issues early on by performing design reviews during development and by performing periodic load and performance tests.
  •  Do not ignore the capacity planning activity early during the pre-project phase, as it could significantly impact the application usage in production over a period of time. Also be aware of the data growth rates and have a road map to support the ever increasing data and volume growth.
  • Do not ignore the root cause analysis as many times when developers roll in a fix for a defect, they are not fixing the root cause, which could come back later as a scalability bottleneck.
Update:
Also read this MSDN Library article which lists down five key considerations for a scalable design.

Sunday, July 1, 2012

Software Architecture Reviews


Review is a powerful technique that contributes to software quality. Various artifacts of the software development lifecycle are subject to review to ensure that any deficiencies could be spotted early on and addressed sooner, before letting it slip through further phases and in turn consuming more efforts than expected down the line. One such important review is the review of the software architecture. If you are asked to review the architecture of a software, it could be due to one of the following reasons.

  1. Possibly, you are a Senior Architect and is expected to complement your fellow Architect by reviewing his work and thereby helping him and in turn the organization to get the best possible software Design. Some or most organizations mandate this need as part of their engineering process. When this review is done effectively, the benefits are huge, as this review occurs early in the development life cycle. 
  2. One or more of the custom built application(s) used in the organization are suspected to have certain serious reliability / performance issue and you are engaged to come up with an analysis and a plan to set it right. If this situation arise, then it is very much evident that the first one did not happen or it wasn’t done well. In some cases, such situation arise when the stake holders knowingly compromise on certain software quality attributes initially and then surprised to see its impact down the line as it hits back. 
  3. You are possibly looking out to license a product and are evaluating its suitability to your organization. In this, case you will probably have a checklist of items created based on the IT policies and framework of your organization and this is highly dependent on the information revealed by the product vendor. 

Though there could be more reasons, the above are some of the primary reasons as to why one would need to perform an architectural review. In spite of as many reviews and testing, issues slips through and challenges the IT architects at some point down the line. Resolution of such issues may call for certain specific reviews and the method and approach would be different based on the type of the problem. For instance, if there be a data breach, a security review of the architecture is what is needed to not only identify the root cause for the current problem, but also to identify potential vulnerabilities and come up with solutions to plug those gaps.

These specific reviews can be typically associated with the broad software quality attributes, which are also termed as non-functional-requirements. The best way to approach these specific reviews is to start with an architectural review. A review checklist would be a good tool to use for the purpose, but the checklist should be exhaustive enough to cover necessary areas, so that the reviewer can get the right and required inputs and would be in a good position to form an opinion about the possible deficiencies and can relate it with the problem being attempted to be resolved.

Keep a watch on my blog for more on specific architecture reviews.

Wednesday, April 25, 2012

GRC for IT Architects


Governance Risk and Compliance (GRC) as most of you might know, is more than a catchy acronym used by IT and security professionals and in fact it is an approach or framework that an organization adopts to ensure proper management and control. 

The broader term Governance calls for a better way of managing the business, which includes protection of the assets of the organization (includes information as an asset), sustainability of the organization irrespective of the business or economic climate. Risks are the unforeseen events or forces which could potentially result in severe impact on the overall performance of the organization.  Better Governance cannot be achieved without a good risk management program in place. The risk appetite of an organization should be known to the stakeholders who should manage or control the risks, so that the risk exposure is well within the risk appetite. The term Compliance denotes the organization’s approach to being compliant with various legislative requirements of different countries in which it operates and also to comply with social commitments.

GRC exists at different levels, for instance Governance could exist at the corporate level, project level or at sub organization level. While the goals of the GRC at various levels will be the same, the means or techniques used to achieve it vary. 

As one could observe these three terms have inter-relations amongst each other and it’s for that reason, there is a need to have a 360 degree view of all these three together. GRC aligns various components of the enterprise (processes, employees, systems and partners) to be more efficient and more manageable leading to better business performance.

An organization is primarily comprised of People, Processes and Technology. The technology domain in turn is made up of Data, Applications and Infrastructure. The Corporate GRC goals can be met when these components are aligned to meet the respective goals.

Much of the risks that today’s organization is battling with are around Data and Applications used within and outside the organization. The IT Architects in turn play important role in designing the solutions involving data, applications and the infrastructure. Thus it is important for the IT Architect that the solution design process is aligned to the GRC framework of the organization.

Information Systems Audit and Control Association has recently released COBIT 5, which helps organizations to get more value from both information and technology investments. By approaching Governance, COBIT 5 helps maximize the trust in and value from organization’s information and technology. Let us go over some of the questions the stake holders would raise on the governance and management context of enterprise IT and see how it will be relevant for IT Architects.

How do I get value from use of IT? Are end users satisfied with the quality of IT?

IT investments are about enabling business changes and are expected to bring enormous value to the business. But 2 out of 10 enterprise IT projects are outright failures. Keeping a focus on the value delivery from proposal stage till delivery of the solution is likely to improve the chances of success. The Architects should establish the business value that the solution could bring, so that the stakeholders can make an informed decision whether to go ahead with the investment or not.

The perceived value out of IT investments is also dependent on user satisfaction on the service delivery using the solution. The usability should not be ignored for any reason by the Architects and to achieve this Architect should collaborate with target end users on a continuous basis to solicit and elicit feedback.

How do I manage performance of IT?

As businesses heavily depend on IT, the performance of IT to the satisfaction of business is important. Among various other reasons, poor or sub optimal solution design is a major cause for IT’s non performance. Here again, IT Architects have an opportunity to factor the best design practices and ability to generate appropriate metrics so that each of the IT services can be measured and monitored in terms of its performance.

How do I best exploit new technology for new strategic opportunities?

Information Technology is advancing in a faster pace, and the trends are shifting too frequently. Newer tools and technology frameworks that come into the market make enabling business changes more and more easier. This at the same time calls for the people’s abilities in mastering related skills. The Architects has to do a balancing act in not missing the opportunities that the newer technology and tools have to offer and at the same time should not risk the business by taking on such changes so early when skills to manage it is hard to get. Many a times, exploiting new technology ahead of the completion can spur business growth.

How dependent am I on external providers? How well are IT outsourcing agreements being managed? How do I obtain assurance over external providers?

Organizations are embracing cloud and started looking at SaaS applications as these offer a higher degree of flexibility in terms of investments and in terms of capabilities. This is happening though there exist quite many security and other compliance concerns that the industry is still trying to address. This resulting in more external vendors being engaged, calls for a well drafted SLA, which should be in line with the security and regulatory compliance needs of the organization. A careful evaluation of the product and the vendor is essential as it does not absolve the organization from this compliance needs.

What are the control requirements for information?

Information and data as assets are gaining significance and in the next few years, the ability to control and manage large volumes of data from discrete sources in an efficient and effective manner will be looked forward by almost all organizations. At the same time, data breaches are also on the rise and the information security practice is also drawing considerable attention from the CIOs. It is time that the CIOs or CSOs put in place an Information Governance program, identifying and classifying sensitive data and information and defining the control requirements around the same. This will require the all the applications be designed appropriately to have these control requirements implemented.

Did I address all IT related risk?

Risk is one of the important area to be managed well to minimize uncertainty and the associated impact on the business. Risk Management has to be practiced at every level including IT Architecture. IT Architects start risk management right from proposal stage to delivery and even after that. Lack of Risk Management skill amongst the Architects could itself be a risk.

Am I running an efficient and resilient IT operation?

With high dependence on IT, today’s enterprise needs an efficient, effective, secure and resilient IT infrastructure for its survival and success. This requires the sub systems of IT to be highly performing and at the same time architected in such a way to be flexible enough to accommodate changes to it. The Architects should always be willing to embrace change and make sure that the solutions that they design is receptive such changes.

How do I control the cost of IT? How do I use IT resources in the most effective and efficient manner? What are the most effective and efficient sourcing options?

The Architects who design IT solutions are not usually constrained by a budget, and so why in most cases the solutions designed are not necessarily a cost efficient one. Ideally, the Architecture team should consider better budgeting and estimation techniques and should be able to quantify the capital and operational costs, which allows the stakeholders to take informed decisions.

Do I have enough people for IT? How do I develop and maintain their skills, and how do I manage their performance?

Choosing the right tools and technology should also mean that availability of people in to manage and support it. Architects sometimes get carried away by the features and abilities of such tools and sometimes carried away or influenced by vendors and eventually end up in a situation where incurring huge cost in finding skilled people and retaining them. Architects should seriously consider the talents available in house and the availability of such skills in the market on demand, while making such choices.

How do I get assurance over IT?

Quite often, the IT is pulled in to diagnose the problem of an application coming down crashing. Teams like Developers, Architects, Network engineers, Hardware engineers, etc come together to trouble shoot the problem and come up with a corrective and preventive action. Every such instance throws a new root cause and the teams keep on learning out of such outages. But what the end user community wants is a stable and reliable system, which the business can depend on. While it is hard to rule out outages, there should be processes in place, which helps reduce the down times. The systems should be designed to being able to log information necessary for trouble shooting, raise alerts upon encountering exceptional conditions, factor redundancy in hardware and software components. Periodic audits and reviews should be carried out to ensure that the recovery measures put in place are working.

Is the information I am processing well secured?

With cyber security crimes on the rise, organizations are investing heavily on securing the data and information assets that are stored within and outside the organization. IT Security should be one of the key non-functional requirements that the Architects should consider while designing solutions. The significance of Security needs could vary based on the organization’s nature of business and the information being processed or stored. Many countries have pronounced legislations on security requirements for specific industries and specific class of data, which should be complied without exception. Here again, period audits and reviews would help assure about the IT security level to the stakeholders.

How do I improve business agility through a more flexible IT environment?

Agility is key to quickly turnaround business changes as solutions. Flexible IT enables the organizations to quickly capitalize on the new opportunities, to innovate and to get ahead of the competition. This saves time and increases efficiencies. Some of the key evaluation or design criteria to make this happen are:  shared / outsourced infrastructure, ability to scale up and scale out, reduced complexity, continuous data and application availability, built-in efficiency within every component, etc.

The above is not an exhaustive list to be taken care by the Architects. Most of the above would be addressed if one follows the best design practices considering all of the undocumented abilities (scalability, availability, maintainability, usability, etc.) required out of the solutions and applying the right design patterns.


Friday, December 16, 2011

Debugging a performance problem


As with any typical Application development, performance is mostly conveniently ignored in all the phases of the development life cycle. In spite of it being a key non functional requirement it mostly remains undocumented. It is more so, as the development, test and UAT environments may not really represent the real world production usage of the application as some of the performance problems could not be spotted earlier. Even if the application is put to load test, there are certain in the production environment, like data growth, user load, etc, which may lead to performance degradation over a period of time.

While most performance problems could easily be spotted and resolved, some could be a challenge and may require sleepless nights to resolve. A structured approach may help addressing such issues within reasonably quicker time frame. Here is a step by step approach which should work in most cases.

1.       Understand the production environment

It is important to understand the production environment thoroughly so as to identify the various hardware & networking resources and the middleware components involved in the application delivery. In a typical n-tiered application, it is possible that there could be multiple appliances and servers through which a requested passes through and get processed before responding back to the user with response. Also understand which of these components are capable of collecting logs / metrics or capable of being monitored in real time.

2.       Understand the specific feedback from the end users

Gather details like who noticed the performance degradation, at what time frame, whether it is repeating at pattern or just pulling the system down. Also understand if the entire application is slowing down or some specific application components are not performing. Also try to experience the problem first hand, sitting alongside an end user or if possible use an appropriate user credentials to experience the performance issue. The ‘who’ also matters as in certain circumstances, the application slow down may be for a user associated with some specific role as the amount of data to be processed and transmitted may differ based on the user role.

3.       Review available logs and metrics

Gather available logs and metrics data collected by various hardware and software components and look for information that could be relevant to the specific application, or more specifically the set of requests that could demonstrate the performance issue. As Logging itself could be performance overkill, it would be ideal to switch off the logs or to set it to collect only minimal logs. If that be the case, configure or effect necessary code change to achieve appropriate level of logging and then try to collect the required details by re-deploying the application on to a production equivalent environment.

4.       Isolate the problem area

This step is very important and could be very challenging too. Take the help of developers and performance and load testing tools, to simulate the problem and in the meanwhile monitor for key measurement data as the request and response pass through various hardware and software components.

By analyzing the data gathered from the application end user or out of the first hand experience, and with the available logs and metrics try to isolate the issue to a specific hardware or software component. This is best done by doing the following step by step:

a.       Trace the request from the UI to the final destination, which typically may be the Database.

b.      If the request could reach the final destination, then measure the time taken for the request to cross various physical and logical layers and look for any information that could cause the slow down. If a hardware resource is over utilized, it could so happen that the requests would be queued up or rejected after a time out. Look for such information in the logs.

c.       Then review the response cycle and try to spot the delays in the return path.

d.      Try the elimination technique whereby, the involved component one after the other from the bottom is cleared of performance bottleneck.

Experience and expertise on the application and the infrastructure architecture could come in handy to spot the problem area quickly. It is possible that there could be multiple problems whether contributing to the problem on hand or not. This situation may lead to shift in focus on different areas resulting in longer time to resolve the problem. It is important to always stay in focus and proceeding in the right direction.

5.       Simulate the problem in Test /UAT environment

Make sure that the findings are correct by simulating the problem multiple times. This will reveal much more data and help characterize the problem better.

6.       Perform reviews

If the problem area has already been isolated in any of the steps above, then narrow the scope of the review to the components involved in the isolated problem area. If not, then the scope of review is little wider and look for problem areas in every component involved in the request response cycle. Code reviews to debug performance issues require unique skills. For instance, looping blocks, disk usage, processor intensive operations could be the candidates for a detailed review. Similarly, in case of distributed application, look for too many back and forth calls to different physical tiers could easily contribute to performance problem. Good knowledge on the various third party components and Operating System APIs consumed in the application may sometimes be helpful.

When the problem is isolated to a server and the application components seem to have no issues, then it might be possible that any other services or components running on the server might cause load on the server resources there by impacting the application being reviewed. If the problem is isolated to Database server, then look for dead locks, appropriate indexes etc. Sometimes, lack of archival / data retention policies could result in the database tables growing in a much faster pace leading to performance degradation.

7.       Identify the root cause

By now one should have identified the specific application procedure or function that could be the cause of the problem on hand. Have it validated by doing more simulations and tests in environments equivalent to production.

8.       Come up with solution

It is just not over yet, as root cause identification should be followed by a solution. Sometimes, the solution to the problem may require change in the architecture and might have a larger impact on the entire application. An ideal solution should prevent the problem from recurring and at the same time it should not introduce newer problems and should require minimal efforts. Alternatively if the ideal solution is not a possibility with various constraints, a break-fix solution should be offered so that the business continues and also plan for having the ideal solution implemented in the longer term.

Hope this one is useful read for those of you in production support. Feel free to share your thoughts on this subject in the form of comments.

Monday, October 24, 2011

Solution Architect - Understanding the role


I keep getting questions from some of my friends, as to what the role Solution Architecture is all about and will they be a fit for that role. For the benefit of every one out there, I thought of putting together my thoughts on the role of Solution Architect. Let us first examine what is expected out of this role and then look at the skills needed to be in this role.

As the title indicates very well, the role is expected to bring in solution to varying business problems, most which could be a product or project by itself. But, as you know, it is always a challenge to come up with a best solution due to it being intangible and that there are many quality attributes which are not completely identified and specified. There would be lots of missing links in the areas of business domain, choice of technology, hardware components, business processes, future domain and technology trends, etc  which the Solution Architect should be able to connect and come up with best solution that could last longer, so that the organization reaps the return on investing in the solution.

Some of the key characteristics of a best Software Solution are:

Longivity: While the solution must solve the current business problem, it should also be reliable, usable, secure and also future proof. This means that the solution Architect should consider the industry and technology trends that could have an impact on the problem / solution in the near and longer term.

Trade off: The challenge with the various mostly unspecified quality attributes is that they are interdependent. And meeting one such attributes may most likely mean compromising on another. Obviously, a lot of trade off has to happen between various quality attributes and such trade off should be justifiable in the context of perceived benefits for the organization. For instance, performance may have to be compromised to achieve better security. The tradeoffs have to be carefully made after considering various factors, like the risk appetite of the organization, target users of the solution, the technology platform, current IT investments of the organization, etc.

Implementation view: It is important that the solution should be devised with the intended deployment view into consideration. Without that, it could so happen that the solution as designed and built may call for massive changes to the infrastructure investments, which could be a total surprise for stakeholders. Such surprises emerging towards the close line of the project could increase the cost by manifolds or delay the project further.

The above is not the exhaustive list. There are many other factors that will have to be given due consideration before coming up with the best solution. Above all, the solution architect should be able to see that the solution is successfully implemented and put into use. That means, a lot of work in terms of convincing the stakeholders as to why this solution and not an alternative, hand-holding the design and development team and also to some extent the end users to have it implemented the same way it was intended.

With all the above, let us now try to identify the essential skills of an aspiring Solution Architect:

Domain skills: A thorough understanding of the business domain is required to first understand the problem better and second to know the potential future needs that may emerge along the same lines of the problem space. It is also important that the person has the ability to learn things fast, as in most cases, there won’t be lead time for him to gain appropriate business skills. That means, the Solution Architect should also be a Business Analyst.

Technical skills: A thorough understanding on the technology currently in use in the organization, the technology currently in use in similar industry domain and the emerging future technology trends. This knowledge is essential to ensure that the solution does not become obsolete soon and that the organization is in a position to be ahead of completion in terms of IT enabled capabilities. At the same time, applying a new technology early in its evolution has its own issues and it is always better to wait for the technology to evolve and mature as more and more organizations adopt the same. It is important for a Solution Architect to closely follow technology trends and gather enough knowledge to understand what could be the best fit for solving various business problems on hand. He should have enough understanding of the technology chosen, so that his team (mostly himself) comes up with a prototype to establish that the solution really solves the problem. However, as the solution goes further down the implementation lane, the Solution Architect should be able to demonstrate hands on skills, so that he could command expertise and be the go to person for resolution of issues.

Team skills: Though mostly the Solution Architect will be an individual performer, some organizations may have dedicated teams to assist the Architects. Even in case of Architects acting as individual performers, the solution is implemented by a project team. So, the Solution Architect needs to be a team player and should with his domain and technical expertise, lead the team by example.

Process / Project Management Skills: Needless to say that the Solution Architects have to have the Project Management skills too, as one may have to manage the pre-solution activities as a project. For the purpose, he has to be familiar with the processes as well.
That means, the Solution Architect should be an all rounder with moderate to expert level skills in all the areas.  On top of these skills, one has to understand that solutioning is not just a science, but also an art, which is mastered with years of experience over as many projects and years involving various technologies and domains.

There could be different views on this and comments or opinions are welcome. 

Wednesday, July 21, 2010

Scalability

When we talk about the scalability for a web application, it is quite common for us to think of load balanced web farms. This obviously will require the application session state be managed using one of the OutProc options. Depending the utilization of the session state data and the implementation approach, the state server (be it state service or SQL server) may soon hit the scalability / performance limits and become a bottleneck.

This prompted me to think about possible scalability and availability solutions for the State Server. It is again possible to have a load balanced farm of servers to serve the session state requests. But, this requires the state information is replicated across all such servers, so that the information will always be available in across all the load balanced state servers. Obviously this is additional overhead. There are third party tools(like scalenet) to have the session and cache information replicated across multiple servers.

As I was looking for an alternative solution, just came across this Best Practices article on state management published by Microsoft. This article suggests a simple approach of session partitioning using the PartitionResolver Interface and determining the partitioned server using a key derived from the session id.

If any of you have implemented this or any other approach, do post it here.