
Tuesday, December 23, 2025

Bridging the Gap: Engineering Resilience in Hybrid Environments (DR, Failover, and Chaos)

The "inevitable reality of failure" is the foundational principle of cyber resilience, which shifts the strategic focus from the outdated goal of total prevention (which is impossible) to anticipating, withstanding, recovering from, and adapting to cyber incidents. This approach accepts that complex, interconnected systems will experience failures and breaches, and success is defined by an organization's ability to survive and thrive amidst this uncertainty.

In the past, resilience meant building a fortress around your on-premises data center—redundant power, dual-homed networks, and expensive SAN replication. Today, the fortress walls have been breached by necessity. We live in a hybrid world. Critical workloads remain on-premises due to compliance or latency needs, while others burst into the cloud for scalability and innovation.

This hybrid reality offers immense power and scalability, but it introduces a new dimension of fragility: the "seam" between environments.

How do you ensure uptime when a backhoe cuts the fiber outside your data center, an AWS region experiences an outage, or, more commonly, the complex networking glue connecting the two environments suddenly degrades?

Key principles for managing inevitable failure include:
 
  • Anticipate: This involves proactive risk assessments and scenario planning to understand potential threats and vulnerabilities before they materialize.
  • Withstand: The goal is to ensure critical systems continue operating during an attack. This is achieved through resilient architectures, network segmentation, redundancy, and failover mechanisms that limit the damage and preserve essential functions.
  • Recover: This focuses on restoring normal operations quickly and effectively after an incident. Key components include immutable backups, tested recovery plans, and clean restoration environments to minimize downtime and data loss.
  • Adapt: The final, crucial step is to learn from every incident and near-miss. Post-incident analyses (often "blameless" to encourage honest assessment) inform continuous improvements to strategies, tools, and processes, helping the organization evolve faster than the threats it faces.

Resilience in a hybrid environment isn't just about preventing failure; it’s about enduring it. It requires moving beyond hope as a strategy and embracing a tripartite approach: Robust Disaster Recovery (DR), automated Failover, and proactive Chaos Engineering.

1. The Foundation: Disaster Recovery (DR) in a Hybrid World


Disaster Recovery is your insurance policy for catastrophic events. It is the process of regaining access to data and infrastructure after a significant outage—a hurricane hitting your primary data center, a massive ransomware attack, or a prolonged regional cloud failure.

In a hybrid context, DR often involves using the cloud as a cost-effective lifeboat for on-premises infrastructure.

The Metrics That Matter: RTO and RPO


Before choosing a strategy, you must define your business tolerance for loss:
  • Recovery Point Objective (RPO): How much data can you afford to lose? (e.g., "We can lose up to 15 minutes of transactions.")
  • Recovery Time Objective (RTO): How fast must you be back online? (e.g., "We must be operational within 4 hours.")

The lower the RTO/RPO, the higher the cost and complexity.
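As a rough illustration, these tolerances can be mapped to the strategy tiers discussed in the next section. This is a minimal sketch with hypothetical thresholds; real tiering decisions also weigh cost, data gravity, and compliance.

```python
def pick_dr_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    """Map business tolerances to a DR tier (illustrative thresholds only)."""
    if rto_minutes < 1 and rpo_minutes < 1:
        return "active-active"        # near-zero downtime and data loss
    if rto_minutes <= 30:
        return "warm-standby"         # minutes to recover, always-on replica
    if rto_minutes <= 120:
        return "pilot-light"          # core data live, compute cold
    return "backup-and-restore"       # hours/days acceptable, cheapest
```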

Hybrid DR Strategies


Hybrid architectures unlock several DR models that were previously unaffordable for many organizations:

A. Backup and Restore (Cold DR):

A Backup and Restore (Cold DR) strategy is the most fundamental and cost-effective disaster recovery approach, suited to non-critical systems. Data and configuration are backed up regularly and stored dormant; after an outage, everything (data, applications, and infrastructure, the latter via Infrastructure as Code) is restored to a secondary site, trading a longer Recovery Time Objective (RTO) for lower cost. It protects against major disasters by replicating data to another region, relying on automated backups and IaC tools such as CloudFormation for efficient, repeatable recovery.

How it Works:

Backup: Regularly snapshot data (databases, volumes) and configurations (AMIs, application code) to a secure, remote location (e.g., S3 in another AWS Region). 
Infrastructure as Code (IaC): Use tools (CloudFormation, Terraform, AWS CDK) to define your entire infrastructure (servers, networks) in code.
Dormant State: In a disaster, the secondary environment remains unprovisioned or powered down (cold).
Recovery:
    1. Manually trigger IaC scripts to provision the infrastructure in the recovery region.
    2. Restore data from the stored backups onto the newly provisioned resources.
    3. Automate application redeployment if needed.
Best For: Systems where downtime (hours/days) and some data loss are acceptable; compliance needs; protecting against regional outages.
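The three recovery steps above can be sketched as a simple orchestrator. The step bodies here are stubs, and the tool names in the comments (Terraform, a CD pipeline) are illustrative assumptions; a real runbook would invoke your actual IaC and backup tooling.

```python
def run_cold_dr_recovery(steps_log):
    """Run the Cold DR recovery steps in order, recording each one.

    The step bodies are stubs; in a real recovery they would shell out to
    your IaC and backup tools.
    """
    def provision_infrastructure():
        steps_log.append("provision")   # e.g. `terraform apply` in the DR region

    def restore_data():
        steps_log.append("restore")     # e.g. copy snapshots from remote object storage

    def redeploy_application():
        steps_log.append("redeploy")    # e.g. rerun the CD pipeline against DR hosts

    for step in (provision_infrastructure, restore_data, redeploy_application):
        step()
    return steps_log
```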


B. Pilot Light:

A Pilot Light Disaster Recovery (DR) strategy keeps a minimal core of your infrastructure running in a standby cloud region, like a small flame ready to ignite a full fire. Essential data (e.g., databases) is continuously replicated, but compute resources stay shut down until a disaster strikes. This offers a cost-effective middle ground: faster recovery (minutes) than backup and restore, but slower than warm standby. It is well suited to non-critical systems that need quick, affordable recovery.

How it Works:

Core Infrastructure: Essential services (like databases) are always running and replicating data to a secondary region (e.g., AWS, Azure, GCP).
Minimal Resources: Compute resources (like servers/VMs) are kept in a "stopped" or "unprovisioned" state, saving costs.
Data Replication: Continuous, near real-time data replication ensures minimal data loss (low RPO).
Scale-Up on Demand: During a disaster, automated processes rapidly provision and scale up the idle compute resources (using pre-configured AMIs/images) around the live data, scaling to full production capacity.

Best For: 
Applications where downtime is acceptable for a few minutes to tens of minutes (e.g., 10-30 mins).
Non-mission-critical workloads that still require faster recovery than simple backups.
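The "scale-up on demand" step can be sketched as follows, assuming a simple inventory of pre-provisioned resources and their states; a real implementation would call the cloud provider's API to start the stopped instances around the already-live data tier.

```python
def ignite_pilot_light(inventory):
    """Flip pre-provisioned but stopped compute resources to running.

    `inventory` maps resource name -> state. The database tier is assumed
    to already be running and replicating (the 'pilot light' itself).
    Returns the names of the resources that were started.
    """
    started = []
    for name, state in inventory.items():
        if state == "stopped":
            inventory[name] = "running"   # real code would call the cloud API here
            started.append(name)
    return started
```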

C. Warm Standby:

A Warm Standby DR strategy runs a scaled-down but fully functional replica of your production environment in a separate location (such as another cloud region). The standby is always running and kept current with live data, so failover is rapid and downtime minimal (low RTO/RPO): when disaster strikes, you simply scale the standby resources up to full capacity. It balances cost against fast recovery.

How it Works:
 
Minimal Infrastructure: Key components (databases, app servers) are running but at lower capacity (e.g., fewer or smaller instances) to save costs.
Always On: The standby environment is active, not shut down, with replicated data and configurations.
Quick Scale-Up: In a disaster, automated processes quickly add more instances or resize existing ones to handle full production load.
Ready for Testing: Because it's a functional stack, it's easier to test recovery procedures.

Best For
Business-critical systems needing recovery in minutes.
Environments requiring frequent testing of DR readiness.


D. Active/Active (Multi-Site):

An Active/Active (Multi-Site) DR strategy runs full production environments in multiple regions simultaneously, sharing live traffic for maximum availability, near-zero downtime (low RTO/RPO), and performance. Real-time data replication and intelligent routing (e.g., DNS-based services such as Route 53) instantly shift users from a failed site to healthy ones. It carries the highest cost and complexity and is suitable only for critical systems that need continuous operation.

How it Works:
 
Simultaneous Operations: Two or more full-scale, identical environments run in different geographic regions, handling live user requests concurrently.
Data Replication: Data is continuously replicated between sites, often synchronously, ensuring low Recovery Point Objective (RPO) – minimal data loss.
Intelligent Traffic Routing: Services like Amazon Route 53 or AWS Global Accelerator direct users to the nearest or healthiest region, using health checks to detect failures.
Instant Failover: If one region fails, traffic is automatically and immediately redirected to the remaining active regions, leading to near-instant recovery (low Recovery Time Objective - RTO).

Best For
Business-critical applications where any downtime is unacceptable.
Workloads requiring low latency for a global user base.
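The health-check-driven routing described above can be sketched in a few lines. This is an illustrative stand-in for what DNS-based services such as Route 53 do; the health flags and latency figures are assumed to come from your health checks and monitoring.

```python
def route_request(regions, client_latency_ms):
    """Pick the lowest-latency healthy region; raise if none are healthy.

    `regions` maps region name -> health-check result (bool);
    `client_latency_ms` maps region name -> measured latency to the client.
    """
    healthy = [r for r, ok in regions.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy region available")
    # Prefer the nearest (lowest-latency) healthy region, as latency-based
    # DNS routing would.
    return min(healthy, key=lambda r: client_latency_ms[r])
```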


2. The Immediate Response: Hybrid Failover Mechanisms


While DR handles catastrophes, Failover handles the everyday hiccups. Failover is the process, ideally automatic, of switching to a redundant or standby system upon the failure of the primary system.

Failover mechanisms in a hybrid environment ensure immediate operational continuity by automatically switching workloads from a failed primary system (on-premises or cloud) to a redundant secondary system with minimal downtime. This requires coordinating recovery across cloud and on-premises platforms.

In a hybrid environment, failover is significantly more complex because it often involves crossing network boundaries and dealing with latency differentials.

Core Concepts of Hybrid Failover


High Availability (HA) vs. Disaster Recovery (DR): HA focuses on minimizing downtime from component failures, often within the same location or region. DR extends this capability to protect against large-scale regional outages by redirecting operations to geographically distant data centers.
Automatic vs. Manual Failover: Automatic failover uses system monitoring (like "heartbeat" signals between servers) to trigger a switch without human intervention, ideal for critical systems where every second of downtime is costly. Manual failover involves an administrator controlling the transition, suitable for complex environments where careful oversight is needed.
Failback: Once the primary system is repaired, failback is the planned process of returning operations to the original infrastructure.
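The heartbeat-based trigger for automatic failover can be sketched as a small predicate. The timeout and allowed-miss values below are illustrative assumptions, not recommendations; production systems also need protection against split-brain (e.g., quorum).

```python
def should_fail_over(last_heartbeat: float, now: float,
                     timeout_s: float = 10.0, missed_allowed: int = 3) -> bool:
    """Trigger failover once the primary has missed `missed_allowed`
    consecutive heartbeats (i.e., silence longer than timeout * allowed)."""
    return (now - last_heartbeat) > timeout_s * missed_allowed
```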

Common Failover Configurations


Hybrid environments typically use a combination of these approaches:

Active-Passive: The primary system actively handles traffic, while the secondary system remains in standby mode, ready to take over. This is cost-effective but may have a brief switchover time.
Active-Active: Both primary and secondary systems run simultaneously and process traffic, often distributing the workload via a load balancer. If one fails, the other picks up the slack immediately, resulting in virtually zero downtime, though at a higher cost.
Multi-Site/Multi-Region: Involves deploying resources across different physical locations or cloud availability zones to protect against localized outages. DNS-based failover is often used here to reroute user traffic to the nearest healthy endpoint.
Cloud-to-Premises/Premises-to-Cloud: A specific hybrid strategy where, for example, a cloud-based Identity Provider (IdP) failing results in an automatic switch to an on-premises Active Directory system.

3. The Stress Test: Chaos Engineering


You have designed your DR plan, and you have implemented automated failover. But will they actually work at 3:00 AM on Black Friday?

Chaos engineering is a proactive discipline used to stress-test systems by intentionally introducing controlled failures to identify weaknesses and build resilience. In hybrid environments—which combine on-premises infrastructure with cloud resources—this practice is essential for navigating the added complexity and ensuring continuous reliability across diverse platforms.

It is not about "breaking things randomly"; it is about controlled, hypothesis-driven experiments.

In a hybrid environment, Chaos Engineering is mandatory because the complexity masks hidden dependencies.

The Role of Chaos Engineering in Hybrid Environments


Hybrid environments are inherently complex due to the number of interacting components, network variations, and differing management models. Chaos engineering helps address this by:
 
Uncovering hidden dependencies: Experiments reveal unexpected interconnections and single points of failure (SPOFs) between cloud-based microservices and legacy on-premise systems.
Validating failover mechanisms: It tests whether the system can automatically switch to redundant systems (e.g., a backup database in the cloud if an on-premise one fails) as intended.
Assessing network resilience: Simulating network latency or packet loss between the different environments helps understand how applications handle intermittent connectivity across the hybrid setup.
Improving observability: Running experiments forces teams to implement robust monitoring and alerting, providing a clearer picture of system behavior under stress across the entire hybrid architecture.
Building team confidence and "muscle memory": By conducting planned "Game Days" (disaster drills), engineering teams gain valuable practice in incident response, reducing Mean Time To Recovery (MTTR) during actual outages.

Key Principles and Best Practices


To conduct chaos engineering safely and effectively, especially in complex hybrid scenarios, specific principles should be followed:
 
Define a "Steady State": Before any experiment, establish clear metrics for what "normal" system behavior looks like (e.g., request success rate, latency, error rates).
Formulate a Hypothesis: Predict how the system should react to a specific failure (e.g., "If the on-premise authentication service goes down, the cloud-based application will automatically use the backup in Azure without user impact").
Start Small and Limit the "Blast Radius": Begin experiments in a non-production environment and, when moving to production, start with a minimal scope to control potential damage.
Automate and Monitor Extensively: Use robust observability tools to track metrics in real time during experiments and automate rollbacks if the experiment spirals out of control.
Foster a Learning Culture: Treat failures as learning opportunities rather than reasons for blame to encourage open analysis and continuous improvement.
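These principles compose into a simple experiment skeleton: verify the steady state, inject the fault, check the blast-radius guardrail, test the hypothesis, and always roll back. The callback functions are assumptions standing in for your real metrics pipeline and fault-injection tooling.

```python
def run_chaos_experiment(measure, inject, rollback,
                         steady_state_ok, abort_threshold_ok):
    """Hypothesis-driven chaos experiment skeleton.

    measure() -> metrics dict; steady_state_ok(metrics) defines 'normal';
    abort_threshold_ok(metrics) is the blast-radius guardrail.
    Returns 'hypothesis-held', 'hypothesis-falsified', or 'aborted'.
    """
    if not steady_state_ok(measure()):
        return "aborted"            # never experiment on an already-sick system
    inject()
    try:
        during = measure()
        if not abort_threshold_ok(during):
            return "aborted"        # guardrail breached: stop immediately
        return "hypothesis-held" if steady_state_ok(during) else "hypothesis-falsified"
    finally:
        rollback()                  # always remove the injected fault
```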

Common Experiment Types in a Hybrid Context


Experiments can be tailored to the unique vulnerabilities of hybrid setups:

Service termination: Randomly shutting down virtual machines or containers residing on different platforms (on-premise vs. cloud) to test redundancy.
Network chaos: Introducing artificial latency or dropped packets in traffic between the on-premise datacenter and the cloud region.
Resource starvation: Consuming high CPU or memory on a specific host to see how load balancing and failover mechanisms distribute the workload.
Dependency disruption: Blocking access to a core service (like a database or API gateway) housed in one environment from applications running in the other.
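Network chaos of the kind described above can be sketched as a wrapper around a cross-environment call. Real tooling (e.g., `tc netem` or a service mesh) injects faults at the network layer; this in-process sketch only illustrates the idea, and the `rng` parameter is an assumption added to make the behavior reproducible.

```python
import random
import time

def with_network_chaos(fn, latency_s=0.0, drop_probability=0.0, rng=random.random):
    """Wrap a cross-environment call with injected latency and simulated drops."""
    def chaotic(*args, **kwargs):
        if rng() < drop_probability:
            raise ConnectionError("chaos: simulated dropped connection")
        time.sleep(latency_s)      # simulated WAN latency between sites
        return fn(*args, **kwargs)
    return chaotic
```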


Conclusion: Resilience is a Continuous Journey


Building resilience in a hybrid environment is not a project you complete once and forget. It is a continuous operational lifecycle.
 
Design with failure in mind (using hybrid DR strategies).
Implement automated recovery (using intelligent failover mechanisms).
Verify your assumptions relentlessly (using Chaos Engineering).

The hybrid cloud offers incredible flexibility, but it demands a higher standard of engineering discipline. By integrating DR, Failover, and Chaos Engineering into your operational culture, you move from fearing the inevitable failure to embracing it as just another Tuesday event.

Sunday, March 25, 2018

Securing the Operational Technology (OT) - The Challenges

OT - Overview

Operational Technology (OT) is, broadly, the technology used on the manufacturing or operational floor. OT has evolved considerably in recent years, from purely mechanical technology to data-driven approaches such as Robotic Process Automation (RPA) leveraging IoT, Machine Learning, and Artificial Intelligence. The impetus from the Industrial IoT (IIoT) brings ever more automation capability and connected behavior to the manufacturing floor. The adoption of IT and related technologies in OT is thus now the common norm, and with it comes the need for alignment and convergence with the IT function.
IoT sensors are deployed everywhere: inside a manufacturing floor, along gas pipelines, inside a moving automobile, to monitor stock movements, and so on. Though these dispersed IoT devices perform small functions, the data they produce and the decisions taken based on that data are critical. It is increasingly recognized that OT can give rise to serious security issues, depending on the size and criticality of the enterprise.

The adoption of IIoT and related technologies brings many benefits to businesses, such as smart machines and real-time intelligence from the factory floor, but it also increases the attack surface and requires continuous connectivity between IT and OT. The differing culture and mindset between the IT and OT functions, combined with a few other factors, often leads to conflict.

Hackers and cybercriminals are now targeting critical infrastructure systems. Motivations include holding systems hostage for ransom, stock price manipulation, and denial of production operations. For example, hackers might take control of your car on a highway and demand a ransom, which could be life-threatening. Similarly, hackers might seize an energy grid and shut down the power supply for a region, or even a nation. The connected nature of the devices and systems involved in modern OT poses serious challenges as they are hooked on to IT-owned network infrastructure, wireless access points, and mobile networks.

Securing the OT

The introduction of new technologies to drive improvements such as production and supply chain efficiency and asset management has led to closer and more open integration between IT and shop floor systems. But the increasing connectivity of previously isolated manufacturing systems, together with a reliance on remote supporting services for operational maintenance, has introduced new vulnerabilities for cyber attack. Not only is the number of attacks growing, but so is their sophistication. As OT security becomes a widely discussed topic, the awareness of OT operators is rising, but so is the knowledge and understanding of OT-specific problems and vulnerabilities in the hacker community.

It is true that the systems and devices involved in OT are often based on the same technologies as IT, and as such many of the threats they face are exactly the same. However, it is an open secret that OT security is not the same as IT security. While securing OT systems requires an integrated approach similar to IT, its objectives are inverted: availability is the primary requirement, followed by integrity and confidentiality. There are other important differences as well, which mean that the OT infrastructure cannot be managed as a simple extension of the IT infrastructure.

Here are some of the areas that make OT different from IT and thus pose a challenge for IT security experts:

1. Visibility:

From the perspective of the organizational units responsible for the IT security function, OT has been somewhat off the radar. This is because the IT function is typically not involved in the evaluation, selection, and procurement of OT systems. Moreover, such OT systems often come with dedicated, networked IT systems of their own, which can mean isolated data centers being set up on the manufacturing floor without the knowledge of the IT function. Until recently, and even now in certain cases, the IT systems involved in OT were treated as an integral part of the production machinery rather than as computerized information systems, so ultimate responsibility for their operation and maintenance, regardless of the cause of a potential failure, was assigned to the OT function rather than IT. In most cases, OT staff do not even know what types of IT or IoT devices and equipment they have in their OT ecosystem.


2. Skill Gap:

One of the biggest challenges facing the industry is deciding who is responsible for OT security: should it be the IT or the OT function? Given their background and resources, IT security teams are in many cases being asked to take ownership of coordinating security for OT. However, they typically lack OT-specific skills. Defining the security controls and processes for OT systems requires in-depth knowledge of those systems, so that the interests and priorities of the OT function are also taken care of. The cybersecurity industry is projected to reach 1.8 million unfilled roles by 2020. The added complexity of a converged IT/OT security environment could amplify perceived barriers to entry, as organizations struggle to blend the aging workforce of their plant teams with the millennial generation of new cybersecurity talent.


3. Availability and Safety:

For a manufacturing company, the smooth functioning of the production line is paramount. Companies lose revenue when their production line is shut down for maintenance, planned or unplanned. Nobody wants to disturb OT equipment, because any downtime can turn into millions of dollars in lost productivity, highly vocal, disgruntled customers, and regulatory fines. Machines must reach a high OEE (overall equipment effectiveness); there is no time to allow IT-style updates and patches that take down equipment.

In many cases where OT systems deliver essential services, such as electricity or water, or maintain safety systems at chemical plants or dams, availability is a critical parameter. Even momentary non-availability could lead to catastrophe. Enabling high availability of OT systems, and maintaining the confidentiality of sensitive information processed by those systems, requires additional security controls. Not only are many of these now-connected OT components quite vulnerable to compromise, but a failure in one of them can also have a catastrophic effect on human life and property.


4. Processes:

Safety and security for employees and customers have always been top priorities for the OT function, and its processes and guidelines are usually defined with that in mind. The IT function does not generally factor in plant or employee physical safety, except where physical access systems fall under its domain. IT's top priority is to protect the data. OT's top priority, however, is to protect the availability and integrity of the process, with security (confidentiality) coming last. At the same time, the OT components designed for direct control, supervisory control, or the safe operation of manufacturing processes could become a safety hazard if any component or subsystem involved is compromised. Business systems are also critical, but their failure is unlikely to result in the uncontrolled release of hazardous materials or energy.


5. Legacy: 

It is not uncommon for the computers and software used as part of OT to remain in service for over a decade without replacement or change. These computers and software packages are designed for specific functions: interfacing with the plants and equipment involved in the manufacturing process. It largely depends on the plant or equipment vendor to release software and related IT hardware enhancements; otherwise, such systems may not be compatible with upgraded IT hardware or operating systems. Consequently, these systems remain vulnerable to a wide range of cyber-threats that have long since been mitigated on the systems used by the IT function.


6. Disparate Technologies:

Until recently, and even now in most cases, OT architectures run on separate, isolated infrastructure, and as such have traditionally been cut off from the Internet. One reason is that these systems are often hard-wired to work with a particular plant or piece of equipment, receiving and processing signals and disseminating instructions back to various components. Some OT systems support only obsolete, insecure operating systems, and OT system vendors often do not feel obliged to improve the security capabilities of their products. Something as benign as an active system scan can cause these devices to fail, which can have serious if not catastrophic results.

System-dedicated networks, multiple domains, and dedicated supporting systems require more resources to achieve a maturity level comparable with IT, and they greatly increase the complexity of monitoring and maintaining security levels. The specialized nature of OT infrastructure technologies means that most IT security and threat intelligence solutions have no visibility into, let alone the ability to defend against, attacks on critical infrastructure. This makes it challenging to define and implement coherent security policies across production plants.


7. IIoT Impact: 

The Industry 4.0 revolution is having a great impact on manufacturing environments. It offers significant opportunities for improving production effectiveness, in particular through continual, online information about manufacturing processes and equipment. However, the use of new IoT technologies also has an impact on security. It is not just about networks: there are many components involved, including sensors and actuators (transducers), "smart things", fog nodes, (industrial and intelligent) IoT gateways, IoT platforms, and so forth. From a cybersecurity perspective, some of these components are quite different from what IT teams are used to. New protocols (including wireless) and mesh network architectures increase the number of potential access points to the network and require a different approach to security.

8. Culture:

The IT function, responsible for maintaining and securing information and related resources, helps ensure the confidentiality, integrity, and availability of data, and in the process protects corporate information and related assets, including networks, from cyberattacks. IT staff are less familiar with the OT space and often show little interest in what their counterparts do to keep it safe and operational. In contrast, the OT function monitors and fixes issues in highly complex and sensitive industrial plants, with operational safety, reliability, and continuity as its top priorities. OT staff do not routinely work with the IT function, and certainly do not want IT getting involved in their operational issues.

Each group is concerned that the other side will wreak havoc in their environment. When there is a need to secure OT against cyberthreats, plant engineers worry that if IT team members get involved, they'll compromise system safety and stability. Unsanctioned changes to these systems might cripple the plant, cause an explosion, or worse. These concerns are justified. After all, when it comes to OT, IT staff members are in uncharted waters. At the same time, the IT function is concerned that vulnerable OT networks will introduce new threats into IT networks, threatening corporate assets, data, and systems.

Conclusion:

As industrial organizations begin to connect their machines to the network, the differences in security requirements for IT versus operational technology (OT) are becoming more important to understand.
For a long time, there were no good practices or formal regulations for manufacturers on how to provide even minimal security protection on medical devices.

IT and OT teams are discovering the need to work together in order to deploy cybersecurity solutions throughout the enterprise; from headquarters to remote locations, and the factory floor. Hackers are going after intellectual property, financial data and customer information. CIOs report that intellectual property can constitute more than 80% of company value. Now is the time for OT and IT leaders to develop strong partnerships to promote operational efficiency, safety and competitive advantage.

Neither OT team members nor IT team members are experts in defending OT systems against emerging cyberthreats. Because OT networks were previously disconnected from the external world, engineering staff never had to deal with such threats. Meanwhile, IT staff members who deal with cyberthreats on a daily basis don't fully understand how these new threats will affect OT systems.  Nevertheless, both sides must cooperate, because neither group can protect industrial systems singlehandedly. Given the divergent cultures, technologies, and objectives of IT and OT, the two groups must overcome a significant divide, including mutual suspicion.

To ensure IT and OT collaboration, business-level oversight and leadership is required. More and more organizations are taking senior, experienced engineers from OT business units, usually from under the COO, and moving them under the CIO hierarchy. This interdisciplinary model combines expertise and roles that straddle and unify both sides of the IT-OT fence. Some organizations have taken this one step further. Instead of aligning IT roles under the CIO, they're creating a new C-level role to facilitate this management strategy. 

The higher up the organizational ladder that IT-OT convergence decisions are being made, the better the chances for success in bridging the gap.

Sunday, April 27, 2014

WAF - Typical Detection & Protection Techniques

Web Application Firewalls (WAFs) are a new breed of information security technology that protects web sites and web applications from malicious attacks. As the name suggests, a WAF solution is intended to inspect HTTP and HTTPS traffic alone. WAF solutions have evolved over the last few years and are capable of preventing attacks that network firewalls and intrusion detection systems cannot. The WAF offering typically comes as a packaged appliance, i.e., purpose-built hardware with software running on it, plugged into the network. Different appliances offer different deployment capabilities, such as active/passive modes, support for High Availability, etc.

Different vendors have come up with various techniques to detect and protect web applications of the enterprise and thus the capabilities of the solution differ. However, at a minimum these devices offer the following detection and protection capabilities:


Detection Techniques

Normalization techniques

Web applications of earlier days were simple, comprising mostly HTML content. Since then, various tools and solutions have emerged that leverage the HTTP protocol to receive and send complex data, including higher volumes of encoded binary data, and that extend the use of the HTTP methods. Hackers leverage these same techniques to attack web applications. A WAF device therefore needs the ability to transform input data into a normalized form, so that it can be inspected for potentially malicious content that could be leveraged to perform an attack.

Signature Based Detection

This technique uses a string or regular-expression match against the incoming traffic to look for a specific signature and thus detect a potential attack. For this purpose, a database of attack signatures is essential. The most popular WAF vendors maintain their own databases, while others subscribe to third-party databases. These databases need frequent updates to take into account the signatures used in recent attacks elsewhere.
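A toy sketch of signature-based detection follows. The patterns are illustrative stand-ins for the thousands of vendor-maintained signatures a real WAF ships with, and they assume the input has already been normalized.

```python
import re

# Hypothetical signature set; real WAFs maintain far richer rule databases.
SIGNATURES = {
    "sql-injection":  re.compile(r"(?i)\bunion\b.+\bselect\b|'\s*or\s+'1'\s*=\s*'1"),
    "xss":            re.compile(r"(?i)<script\b"),
    "path-traversal": re.compile(r"\.\./"),
}

def detect(request_field: str):
    """Return the names of all signatures matching a normalized request field."""
    return [name for name, pattern in SIGNATURES.items()
            if pattern.search(request_field)]
```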

Rule Based Detection 

Rule-based detection is similar to signature-based detection, but it allows the use of more complex logic. For instance, even if a signature match is detected, the traffic can be further subjected to other conditions: if the data is from a trusted source, it may still be allowed to pass through, with or without appropriate alerts and triggers for manual inspection. While the WAF solution ships with standard rules, these are configurable to meet the security needs of the customer. The standard rules may also be part of the signature/rule database maintained or subscribed to by the vendor.

APIs for Extensibility

Beyond the standard signature- and rule-based detection techniques, the actual deployment scenario at a customer site may require customization of the detection techniques used. WAF vendors usually support this need by offering extensible APIs, plug-ins, or scripting. If not appropriately secured, these extensibility options can themselves be exploited by attackers.


Protection Techniques

Brute Force Attacks Mitigation

These attacks use automated scripts that attempt to log in to the web application with common usernames and passwords. They usually originate from a large number of sources, including both compromised web servers and private home computers. Once a username and password are successfully guessed, the attackers or their scripts and tools use the gained credentials for the next stage of the attack. Where credentials follow strict complexity rules, the guessing is likely to fail, but the attempts still generate unduly high traffic, draining resources and degrading the availability of the web application.

Protection from Cookie Poisoning

Cookie poisoning attacks modify the contents of a cookie (personal information stored on a web user's computer) in order to bypass security mechanisms. Using cookie poisoning, attackers can gain unauthorized information about another user and steal their identity. Cookie poisoning is in fact a parameter-tampering attack in which the parameters are stored in a cookie; it is often more useful than other parameter-tampering attacks because programmers store sensitive information in the supposedly invisible cookie. Most WAF solutions protect against cookie poisoning by signing and/or encrypting cookies, virtualizing them, or applying a custom protection mechanism as the specific web application demands.

Session Attacks Mitigation

The session store is an important component of a web application: it shares common parameters pertaining to the user and the current session across the various actions within the application, and session data is therefore key to securing the application. Attackers try various techniques to hijack the session or tamper with session parameters. Tampering with parameter values is similar to cookie poisoning, while session hijacking means stealing the session identifier and issuing requests from a different source under the stolen identity. WAF solutions protect against session hijacking by signing and/or encrypting the session data and by linking the session identifier to the originating client.
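
Linking a session identifier to the originating client can be sketched like this (a simplified illustration with an in-memory store; a real deployment would use shared storage and weigh fingerprint stability against false rejections):

```python
import hashlib
import secrets

# In-memory session store; a real WAF would use shared storage.
_sessions: dict[str, str] = {}

def _client_fingerprint(ip: str, user_agent: str) -> str:
    """Hash stable client attributes into a fingerprint."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode()).hexdigest()

def create_session(ip: str, user_agent: str) -> str:
    sid = secrets.token_hex(16)
    _sessions[sid] = _client_fingerprint(ip, user_agent)
    return sid

def validate_session(sid: str, ip: str, user_agent: str) -> bool:
    """Reject a stolen session identifier presented by a different client."""
    return _sessions.get(sid) == _client_fingerprint(ip, user_agent)
```

A stolen identifier replayed from a different IP or user agent then fails validation even though the token itself is valid.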

Injection Attack Protection

An SQL injection attack inserts a SQL query via input data sent from the client to the application. A successful SQL injection attack can read sensitive data from the database, modify it, or shut down the server. Similarly, operating-system and platform commands can often be injected to give attackers access to data and escalate privileges on back-end servers. Remote file inclusion (RFI) attacks allow malicious users to run their own PHP code on a vulnerable website, accessing anything the PHP program can: databases, password files, and so on. Most WAF solutions, using normalization together with the signature and rule database, can deny requests carrying data, commands, or instructions that could lead to any of these injection attacks.

DDoS Protection

Distributed denial of service (DDoS) is a common technique used to impair the availability of a website or application by directing unusually heavy traffic at it. The flood consumes all available computing resources, eventually making the site unavailable. WAF solutions use normalization techniques and the signature and rule databases to block such requests. Common defenses are to check the content length and to limit the number of requests or sessions from the same originating client within a given time period.
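
Those two checks can be sketched together (the caps and function names are invented for illustration; real products tune thresholds dynamically and track far more dimensions):

```python
from collections import defaultdict

MAX_BODY_BYTES = 1_048_576     # illustrative 1 MiB content-length cap
MAX_REQUESTS_PER_WINDOW = 100  # per client, per time window

_request_counts: defaultdict[str, int] = defaultdict(int)

def reset_window() -> None:
    """Called by a timer at the end of each time window (scheduling omitted)."""
    _request_counts.clear()

def screen_request(client_ip: str, content_length: int) -> bool:
    """Apply the two checks above: body-size cap and per-client request rate."""
    if content_length > MAX_BODY_BYTES:
        return False
    _request_counts[client_ip] += 1
    return _request_counts[client_ip] <= MAX_REQUESTS_PER_WINDOW
```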


The techniques listed above are the most common detection and protection capabilities that any WAF solution offers. Vendors are constantly improving them and adding new detection and protection features. This has to be a continuous endeavor, because attackers keep devising new techniques to exploit vulnerabilities.

Saturday, March 22, 2014

Business Impact Analysis for Effective BCM

A business continuity plan (BCP) improves the availability of an organization's critical services. As part of the process, the BCP identifies these critical processes and periodically assesses the quantitative and qualitative impact to the organization of any disruption to them. While the BCP is proactive in managing the risk of business disruption, the business resumption plan and the disaster recovery plan are reactive: they deal with recovering or resuming business services and assets after a disruption. BCP planning is a direct input to the business's D/R action plans.

Business continuity management (BCM) and disaster recovery are natural components of enterprise risk management (ERM). The resources and plans that make up a business continuity plan address business-interruption risk and should be part of a comprehensive mitigation plan covering all enterprise risks. Many organizations are beginning to recognize the opportunity in embedding BCM into an overall program to identify, evaluate, and mitigate risk. By treating BCM as a risk-management function and embedding it in an enterprise-level ERM program aligned with the company's strategic imperatives, boardroom expectations are met and alignment is achieved.


The typical goals of BCM are:

  • To identify critical business processes and assign criticality. Factors influencing the determination of criticality include inter-dependencies among business processes and the maximum allowable downtime (MAD) for each unique business process.
  • To estimate the maximum downtime the organization can tolerate while still maintaining viability. Management must determine the longest period of time a business process can be disrupted before recovery becomes impossible or moot.
  • To evaluate resource requirements such as facilities, personnel, equipment, software, data files, vital records, and vendor and service-provider relationships.

Business Impact Analysis

The first step in developing a strong, organization-wide business continuity plan is conducting a business impact analysis (BIA). The result is a BIA report, which describes the potential risks specific to the organization. The challenge lies in assessing the financial and other business risks associated with a service disruption. The report quantifies the importance of business components and suggests appropriate plans and funding for measures to protect them.

As with any plan, business continuity planning should evolve continuously, as the business context keeps changing with growth and shifting direction. Because business impact analysis is an important phase of the BCM life cycle, it should be revisited and refreshed in line with that life cycle. As a process, the BIA should be performed for each critical activity and resource forming part of the enterprise's business processes. Although BIA is applied to critical activities, performing it on all activities is recommended, since it is the BIA itself that establishes the criticality of an activity, process, or resource.

Performing BIA

The following are the key steps in performing the Business Impact Analysis:

  • Preparation and Set-up - Identify the tools or templates required to perform the BIA. For instance, a reference table for determining business impact is essential to provide consistent definitions of impact types and severity levels. If a structured risk assessment has already been carried out, those definitions and severity levels should already have been captured and should be reused for the BIA.
  • Identification - This step determines the activities to be performed and the resources used to deliver the organization's goods and services. Sources for this information range from the enterprise's mission and objectives to its defined business processes. Since the BIA is performed on the identified activities and resources, this step can be considered a prerequisite for the BIA rather than a step within it.
  • Identify potential disruptions - For each identified activity or resource, identify the events or scenarios that could impact its desired outcome and thereby the business process. This is usually best done using techniques such as brainstorming with the relevant business users. This step also establishes the correlation between the severity of the impact and the duration of the disruption.
  • Identify tangible losses - Disruption of certain activities, or the unavailability of certain resources, directly results in monetary losses. If a given activity or resource, alone or in combination with others, could cause revenue loss, this should be identified and the magnitude of the loss established.
  • Quantify intangible losses - Some activities, when disrupted, may not directly cause monetary losses but may inflict intangible damage on the organization. For instance, the unavailability of customer-care executives to respond to queries could erode brand value. Such impacts should be quantified using appropriate techniques so they can be considered when determining priority.
  • Recovery cost - The impact analysis should also capture the time and effort required to resume or recover from the disruption. The magnitude of the recovery cost also contributes to the prioritization or ranking.
  • Identify dependencies - Sometimes a potential disruption, or its impact, depends on other activities or resources, whether internal or external. These details are useful when drawing up the business resumption plan and the disaster recovery plan.
  • Ranking - Once all relevant information has been collected and assembled, rankings for the critical business services or resources can be produced, based on the potential loss of revenue, the time to recover, and the severity of impact a disruption would cause. Minimum service levels and maximum allowable downtime are also established.
  • Prioritize critical services or products - Once the critical services or products are identified, they must be prioritized based on minimum acceptable delivery levels and the maximum period of time the service can be down before severe damage to the organization results. Determining this ranking requires information on the impact of a disruption on service delivery, loss of revenue, additional expenses, and intangible losses.
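
The ranking and prioritization steps above can be sketched numerically. The weights, scales, and activity data below are entirely hypothetical; a real BIA template defines its own scoring scheme:

```python
# Illustrative weighting; a real BIA template defines its own scales.
def bia_score(revenue_loss: float, recovery_hours: float, severity: int) -> float:
    """Combine tangible loss, recovery effort and severity (1-5) into one rank key."""
    return (0.5 * revenue_loss
            + 0.3 * (recovery_hours * 1_000)
            + 0.2 * (severity * 10_000))

# Hypothetical activities with estimated loss, recovery time and severity.
activities = {
    "payment processing": bia_score(500_000, 4, 5),
    "customer care":      bia_score(50_000, 24, 3),
    "internal wiki":      bia_score(1_000, 72, 1),
}
ranked = sorted(activities, key=activities.get, reverse=True)
print(ranked)  # ['payment processing', 'customer care', 'internal wiki']
```

However the weights are chosen, the point is that a single comparable score lets disparate tangible and intangible impacts drive one consistent priority order.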

The quality of the BIA is reflected in the reports produced after completing the steps above. Given that BIA is a critical phase of BCM, it must be performed with care and attention to detail. Using the right set of tools, techniques, templates, and questionnaires is recommended for best results.