Showing posts with label Enterprise Architecture. Show all posts

Sunday, January 25, 2026

Stop Choosing Between Speed and Stability: The Art of Architectural Diplomacy

In contemporary business environments, Enterprise Architecture (EA) is frequently misunderstood as a static framework—merely a collection of diagrams stored digitally. In fact, EA functions as an evolving discipline focused on effective conflict management. It serves as the vital link between the immediate demands of the present and the long-term, sustainable objectives of the organization.

To address these challenges, experienced architects employ a dual-framework approach, incorporating both W.A.R. and P.E.A.C.E. methodologies.

At any given moment, an organization is a house divided. On one side, you have the product owners, sales teams, and innovators who are in a state of perpetual W.A.R. (Workarounds, Agility, Reactivity). They are facing the external pressures of a volatile market, where speed is the only currency and being "first" often trumps being "perfect." To them, architecture can feel like a roadblock—a series of bureaucratic "No’s" that stifle the ability to pivot.

On the other side, you have the operations, security, and finance teams who crave P.E.A.C.E. (Principles, Efficiency, Alignment, Consistency, Evolution). They see the long-term devastation caused by unchecked "cowboy coding" and fragmented systems. They know that without a foundation of structural integrity, the enterprise will eventually collapse under the weight of its own complexity, turning a fast-moving startup into a sluggish, expensive legacy giant.

The Enterprise Architect is the high-stakes diplomat standing at the border of these two worlds. You are not there to pick a side; you are there to manage the trade-offs. You must know when to let the "warriors" bypass a standard to capture a market opportunity, and when to exercise your "peace-keeping" authority to prevent a catastrophic failure of the system.

Achieving an effective balance between W.A.R. and P.E.A.C.E. distinguishes technical experts from strategic leaders who enable the organisation to address current challenges while safeguarding its long-term success.

Part 1: Entering the W.A.R. Zone

W.A.R. represents the tactical, often aggressive reality of modern business. It stands for:
 
  • Workarounds: The "quick fixes" needed to bypass legacy hurdles.
  • Agility: The demand for instant pivot-ability and rapid feature delivery.
  • Reactivity: Responding to market shifts, competitor moves, or sudden security threats.

It is the "battlefield" of the enterprise where the primary objective is to gain or defend market share at all costs. In this phase, the Enterprise Architect acts as a combat medic. You aren’t looking for the "perfect" long-term solution; you are looking for the solution that keeps the business alive and moving today.

The Risk: Constant warfare leads to "Spaghetti Architecture." Without a roadmap back to stability, your temporary workarounds become permanent liabilities.

W - Workarounds (Pragmatic Compromise)

In an ideal world, every system would integrate seamlessly via a robust API gateway. In W.A.R., you don't have six months to build that gateway. Workarounds are the "duct tape" of architecture. They involve:


A - Agility (Speed as a Weapon)

Agility in W.A.R. isn't just about Scrum meetings; it’s about architectural pivotability.
 
  • Micro-decisions: Empowering teams to make local decisions without waiting for the central architecture review board.
  • Minimum Viable Architecture (MVA): Designing just enough structure to support the immediate feature set, ensuring that the architecture doesn't become a "prevention" department.

R - Reactivity (The Pulse of the Market)

Reactivity is the ability to respond to external "black swan" events—be it a competitor’s surprise product launch or a sudden shift in global supply chains.
 

Part 2: Seeking P.E.A.C.E.

P.E.A.C.E. represents the strategic, long-term vision that ensures the enterprise remains sustainable. It stands for:

  • Principles: Establishing the "North Star" rules that guide technology choices.
  • Efficiency: Reducing redundancy and optimizing costs across the stack.
  • Alignment: Ensuring IT strategy and Business strategy are speaking the same language.
  • Consistency: Standardizing data, interfaces, and platforms.
  • Evolution: Planning for a future that is 3–5 years out, not 3–5 days out.

If W.A.R. is about surviving the day, P.E.A.C.E. (Principles, Efficiency, Alignment, Consistency, Evolution) is about thriving for a decade. It is the restorative force that prevents the enterprise from collapsing into a pile of unmanageable code.

In this phase, the architect is a city planner. You are building the infrastructure (roads, power grids, zoning laws) that allows the business to grow without collapsing under its own weight.

P - Principles (The North Star)

Principles are the "laws of the land." They provide a decision-making framework so that even in the heat of battle, teams don’t wander too far off-path. Examples include "Cloud-First," "Data as an Asset," or "Buy over Build."

E - Efficiency (The Lean Engine)

A peaceful enterprise is an efficient one. This involves:
 

A - Alignment (The Bridge)

Alignment is the hardest part of P.E.A.C.E. It ensures that the IT roadmap isn't just a "wish list" of cool tech, but a direct reflection of business goals. If the CEO wants to expand to Europe, the Architect ensures the data residency and GDPR P.E.A.C.E. protocols are already in place.

C - Consistency (The Common Language)

Without consistency, an enterprise becomes a Tower of Babel.
 
  • Data Governance: Ensuring "Customer ID" means the same thing in the Sales system as it does in the Billing system.
  • Standardized Stacks: Limiting the number of supported languages and frameworks to ensure developers can move between teams easily.

E - Evolution (The Long Game)

Evolution is about future-proofing. It involves horizon scanning—looking at AI, Quantum Computing, or Edge computing—and building a "composable architecture" that can swap out parts as technology evolves without a total "rip and replace."

Part 3: The Balancing Act

How do you balance these two opposing forces? It’s not about choosing one; it’s about a rhythmic oscillation between them.

Strategies for Equilibrium:

  • The "Tax" Model: For every "W.A.R." project (tactical/fast), mandate a small contribution toward a "P.E.A.C.E." objective (e.g., "We'll use this non-standard API for now, but the project must fund the documentation of the legacy endpoint it's hitting").
  • Architectural Guardrails: Instead of rigid rules, create "sandboxes." Within the sandbox, teams have total W.A.R. freedom. Outside the sandbox, P.E.A.C.E. protocols are non-negotiable.
  • Iterative Refactoring: Schedule "Peace-time" sprints. Once a major tactical launch is over, dedicate resources specifically to cleaning up the technical debt incurred during the "War."

The Synthesis: When to Fight and When to Build

The art of Enterprise Architecture is knowing which mode to occupy.
 
  • During a Product Launch: You are in W.A.R. mode. You accept the debt. You enable the workarounds. You prioritize the "A" (Agility).
  • During the Post-Launch "Cooldown": You shift to P.E.A.C.E. You refactor those workarounds into the "C" (Consistency). You document the "P" (Principles) that were stretched.
  • The Golden Rule: You cannot have P.E.A.C.E. without the revenue generated by W.A.R., and you cannot survive W.A.R. without the structural integrity provided by P.E.A.C.E.

Comparison Matrix: The EA's Dual Persona

Dimension | W.A.R. Focus | P.E.A.C.E. Focus
Success Metric | Time-to-Market | Total Cost of Ownership (TCO)
Documentation | "Just enough" / Post-facto | Comprehensive / Pre-emptive
Risk Tolerance | High (Accepts instability) | Low (Prioritizes resilience)
Team Vibe | "Move fast and break things" | "Measure twice, cut once"



The Verdict

The most successful Enterprise Architects are those who can sit comfortably in the middle of this chaos. They recognize that a business that is always at W.A.R. will eventually burn out and break, while a business that is always at P.E.A.C.E. will eventually be disrupted and disappear.

Your job is to be the diplomat between the "Now" and the "Next."

Tuesday, December 23, 2025

Bridging the Gap: Engineering Resilience in Hybrid Environments (DR, Failover, and Chaos)

The "inevitable reality of failure" is the foundational principle of cyber resilience, which shifts the strategic focus from the outdated goal of total prevention (which is impossible) to anticipating, withstanding, recovering from, and adapting to cyber incidents. This approach accepts that complex, interconnected systems will experience failures and breaches, and success is defined by an organization's ability to survive and thrive amidst this uncertainty.

In the past, resilience meant building a fortress around your on-premises data center—redundant power, dual-homed networks, and expensive SAN replication. Today, the fortress walls have been breached by necessity. We live in a hybrid world. Critical workloads remain on-premises due to compliance or latency needs, while others burst into the cloud for scalability and innovation.

This hybrid reality offers immense power and scalability, but it introduces a new dimension of fragility: the "seam" between environments.

How do you ensure uptime when a backhoe or an excavator cuts fiber outside your data center, an AWS region experiences an outage, or, more commonly, the complex networking glue connecting the two suddenly degrades?

Key principles for managing inevitable failure include:
 
  • Anticipate: This involves proactive risk assessments and scenario planning to understand potential threats and vulnerabilities before they materialize.
  • Withstand: The goal is to ensure critical systems continue operating during an attack. This is achieved through resilient architectures, network segmentation, redundancy, and failover mechanisms that limit the damage and preserve essential functions.
  • Recover: This focuses on restoring normal operations quickly and effectively after an incident. Key components include immutable backups, tested recovery plans, and clean restoration environments to minimize downtime and data loss.
  • Adapt: The final, crucial step is to learn from every incident and near-miss. Post-incident analyses (often "blameless" to encourage honest assessment) inform continuous improvements to strategies, tools, and processes, helping the organization evolve faster than the threats it faces.

Resilience in a hybrid environment isn't just about preventing failure; it’s about enduring it. It requires moving beyond hope as a strategy and embracing a tripartite approach: Robust Disaster Recovery (DR), automated Failover, and proactive Chaos Engineering.

1. The Foundation: Disaster Recovery (DR) in a Hybrid World


Disaster Recovery is your insurance policy for catastrophic events. It is the process of regaining access to data and infrastructure after a significant outage—a hurricane hitting your primary data center, a massive ransomware attack, or a prolonged regional cloud failure.

In a hybrid context, DR often involves using the cloud as a cost-effective lifeboat for on-premises infrastructure.

The Metrics That Matter: RTO and RPO


Before choosing a strategy, you must define your business tolerance for loss:
  • Recovery Point Objective (RPO): How much data can you afford to lose? (e.g., "We can lose up to 15 minutes of transactions.")
  • Recovery Time Objective (RTO): How fast must you be back online? (e.g., "We must be operational within 4 hours.")

The lower the RTO/RPO, the higher the cost and complexity.
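
To make the trade-off concrete, here is a minimal sketch that picks the cheapest DR tier able to meet a given RTO and RPO. The tier names match the strategies discussed in this post, but the `DrTier` type, the `cheapest_tier` helper, and every achievable RTO/RPO figure are illustrative assumptions, not benchmarks; substitute numbers from your own recovery tests.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class DrTier:
    name: str
    achievable_rto: timedelta  # assumed typical time to restore service
    achievable_rpo: timedelta  # assumed typical window of lost data


# Ordered from cheapest to most expensive. All figures are illustrative assumptions.
TIERS = [
    DrTier("backup and restore", timedelta(hours=24), timedelta(hours=24)),
    DrTier("pilot light", timedelta(minutes=30), timedelta(minutes=5)),
    DrTier("warm standby", timedelta(minutes=10), timedelta(minutes=1)),
    DrTier("active/active", timedelta(minutes=1), timedelta(seconds=5)),
]


def cheapest_tier(required_rto: timedelta, required_rpo: timedelta) -> Optional[DrTier]:
    """Return the first (cheapest) tier that meets both objectives, or None."""
    for tier in TIERS:
        if tier.achievable_rto <= required_rto and tier.achievable_rpo <= required_rpo:
            return tier
    return None


# The example targets from the text: lose at most 15 minutes, be back within 4 hours.
choice = cheapest_tier(timedelta(hours=4), timedelta(minutes=15))
print(choice.name if choice else "no tier meets these objectives")
```

With the assumed figures above, the 15-minute RPO is what rules out backup and restore, even though the 4-hour RTO looks generous. That is usually the point of writing the objectives down: the tighter of the two numbers drives the cost.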

Hybrid DR Strategies


Hybrid architectures unlock several DR models that were previously unaffordable for many organizations:

A. Backup and Restore (Cold DR):

A Backup and Restore (Cold DR) strategy is the most basic and cost-effective disaster recovery approach, suited to non-critical systems. Data and configuration are backed up regularly and stored dormant; after an outage, everything (data, applications, and infrastructure) is manually restored to a secondary site. The trade-off is a longer Recovery Time Objective (RTO) in exchange for lower cost. Copying backups to another region protects against major disasters, and automated backups combined with Infrastructure as Code (IaC) tools such as CloudFormation make recovery efficient and repeatable.

How it Works:

Backup: Regularly snapshot data (databases, volumes) and configurations (AMIs, application code) to a secure, remote location (e.g., S3 in another AWS Region). 
Infrastructure as Code (IaC): Use tools (CloudFormation, Terraform, AWS CDK) to define your entire infrastructure (servers, networks) in code.
Dormant State: Until a disaster occurs, the secondary environment remains unprovisioned or powered down (cold).
Recovery:
    1. Manually trigger IaC scripts to provision the infrastructure in the recovery region.
    2. Restore data from the stored backups onto the newly provisioned resources.
    3. Automate application redeployment if needed.
Best For: Systems where downtime (hours/days) and some data loss are acceptable; compliance needs; protecting against regional outages.


B. Pilot Light:

A Pilot Light Disaster Recovery (DR) strategy keeps a minimal, core version of your infrastructure in a standby cloud region, like a small flame ready to ignite a full fire. Essential data stores (e.g., databases) stay replicated, while compute resources remain shut down until a disaster strikes. Recovery is faster (minutes) than backup and restore but slower than warm standby, which makes it a cost-effective choice for non-critical systems that need quick, affordable recovery.

How it Works:

Core Infrastructure: Essential services (like databases) are always running and replicating data to a secondary region (e.g., AWS, Azure, GCP).
Minimal Resources: Compute resources (like servers/VMs) are kept in a "stopped" or "unprovisioned" state, saving costs.
Data Replication: Continuous, near real-time data replication ensures minimal data loss (low RPO).
Scale-Up on Demand: During a disaster, automated processes rapidly provision and scale up the idle compute resources (using pre-configured AMIs/images) around the live data, scaling to full production capacity.

Best For: 
Applications where downtime is acceptable for a few minutes to tens of minutes (e.g., 10-30 mins).
Non-mission-critical workloads that still require faster recovery than simple backups.

C. Warm Standby:

A Warm Standby DR strategy runs a scaled-down but fully functional replica of your production environment in a separate location (such as another cloud region). The replica is always running and kept up to date with live data, so when disaster strikes it can be scaled to full capacity quickly. The result is rapid failover with minimal downtime (low RTO/RPO) at a cost below that of a full duplicate environment.

How it Works:
 
Minimal Infrastructure: Key components (databases, app servers) are running but at lower capacity (e.g., fewer or smaller instances) to save costs.
Always On: The standby environment is active, not shut down, with replicated data and configurations.
Quick Scale-Up: In a disaster, automated processes quickly add more instances or resize existing ones to handle full production load.
Ready for Testing: Because it's a functional stack, it's easier to test recovery procedures.

Best For:
Business-critical systems needing recovery in minutes.
Environments requiring frequent testing of DR readiness.


D. Active/Active (Multi-Site):

An Active/Active (Multi-Site) DR strategy runs full production environments in multiple locations (regions) simultaneously, all sharing live traffic, for maximum availability, near-zero downtime (low RTO/RPO), and better performance. It relies on real-time data replication and smart routing (such as DNS with Route 53) to shift users from a failed site to healthy ones. It also carries the highest cost and complexity, so it is suitable only for critical systems that need continuous operation.

How it Works:
 
Simultaneous Operations: Two or more full-scale, identical environments run in different geographic regions, handling live user requests concurrently.
Data Replication: Data is continuously replicated between sites, synchronously where inter-site latency allows and asynchronously otherwise, keeping the Recovery Point Objective (RPO) low (minimal data loss).
Intelligent Traffic Routing: Services like Amazon Route 53 or AWS Global Accelerator direct users to the nearest or healthiest region, using health checks to detect failures.
Instant Failover: If one region fails, traffic is automatically and immediately redirected to the remaining active regions, leading to near-instant recovery (low Recovery Time Objective - RTO).

Best For:
Business-critical applications where any downtime is unacceptable.
Workloads requiring low latency for a global user base.


2. The Immediate Response: Hybrid Failover Mechanisms


While DR handles catastrophes, Failover handles the everyday hiccups. Failover is the (ideally automatic) process of switching to a redundant or standby system when the primary system fails.

Failover mechanisms in a hybrid environment ensure immediate operational continuity by automatically switching workloads from a failed primary system (on-premises or cloud) to a redundant secondary system with minimal downtime. This requires coordinating recovery across cloud and on-premises platforms.

In a hybrid environment, failover is significantly more complex because it often involves crossing network boundaries and dealing with latency differentials.

Core Concepts of Hybrid Failover


High Availability (HA) vs. Disaster Recovery (DR): HA focuses on minimizing downtime from component failures, often within the same location or region. DR extends this capability to protect against large-scale regional outages by redirecting operations to geographically distant data centers.
Automatic vs. Manual Failover: Automatic failover uses system monitoring (like "heartbeat" signals between servers) to trigger a switch without human intervention, ideal for critical systems where every second of downtime is costly. Manual failover involves an administrator controlling the transition, suitable for complex environments where careful oversight is needed.
Failback: Once the primary system is repaired, failback is the planned process of returning operations to the original infrastructure.
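
The heartbeat idea behind automatic failover, and the planned nature of failback, can be sketched in a few lines. This is a toy state machine, not a production monitor: the `FailoverMonitor` class, the node names, and the threshold of three consecutive missed heartbeats are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverMonitor:
    """Promotes the standby after several consecutive missed heartbeats."""

    primary: str
    standby: str
    max_missed: int = 3  # assumed threshold; tune per system
    missed: int = 0
    active: str = field(init=False)

    def __post_init__(self) -> None:
        self.active = self.primary

    def record_heartbeat(self, received: bool) -> str:
        """Record one heartbeat interval and return the node now serving traffic."""
        if self.active != self.primary:
            # Already failed over: returning to the primary is a planned step.
            return self.active
        self.missed = 0 if received else self.missed + 1
        if self.missed >= self.max_missed:
            self.active = self.standby
        return self.active

    def fail_back(self) -> str:
        """Planned return to the primary once it has been repaired."""
        self.missed = 0
        self.active = self.primary
        return self.active


monitor = FailoverMonitor(primary="onprem-db", standby="cloud-db")
for beat in (True, False, False, False, True):
    serving = monitor.record_heartbeat(beat)
print(serving)
```

Note the design choice: a heartbeat that reappears after failover does not automatically move traffic back. Requiring several consecutive misses avoids flapping on a single dropped signal, and keeping failback explicit mirrors the "planned process" described above.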

Common Failover Configurations


Hybrid environments typically use a combination of these approaches:

Active-Passive: The primary system actively handles traffic, while the secondary system remains in standby mode, ready to take over. This is cost-effective but may have a brief switchover time.
Active-Active: Both primary and secondary systems run simultaneously and process traffic, often distributing the workload via a load balancer. If one fails, the other picks up the slack immediately, resulting in virtually zero downtime, though at a higher cost.
Multi-Site/Multi-Region: Involves deploying resources across different physical locations or cloud availability zones to protect against localized outages. DNS-based failover is often used here to reroute user traffic to the nearest healthy endpoint.
Cloud-to-Premises/Premises-to-Cloud: A specific hybrid strategy where, for example, the failure of a cloud-based Identity Provider (IdP) results in an automatic switch to an on-premises Active Directory system.

3. The Stress Test: Chaos Engineering


You have designed your DR plan, and you have implemented automated failover. But will they actually work at 3:00 AM on Black Friday?

Chaos engineering is a proactive discipline used to stress-test systems by intentionally introducing controlled failures to identify weaknesses and build resilience. In hybrid environments—which combine on-premises infrastructure with cloud resources—this practice is essential for navigating the added complexity and ensuring continuous reliability across diverse platforms.

It is not about "breaking things randomly"; it is about controlled, hypothesis-driven experiments.

In a hybrid environment, Chaos Engineering is mandatory because the complexity masks hidden dependencies.

The Role of Chaos Engineering in Hybrid Environments


Hybrid environments are inherently complex due to the number of interacting components, network variations, and differing management models. Chaos engineering helps address this by:
 
Uncovering hidden dependencies: Experiments reveal unexpected interconnections and single points of failure (SPOFs) between cloud-based microservices and legacy on-premise systems.
Validating failover mechanisms: It tests whether the system can automatically switch to redundant systems (e.g., a backup database in the cloud if an on-premise one fails) as intended.
Assessing network resilience: Simulating network latency or packet loss between the different environments helps understand how applications handle intermittent connectivity across the hybrid setup.
Improving observability: Running experiments forces teams to implement robust monitoring and alerting, providing a clearer picture of system behavior under stress across the entire hybrid architecture.
Building team confidence and "muscle memory": By conducting planned "Game Days" (disaster drills), engineering teams gain valuable practice in incident response, reducing Mean Time To Recovery (MTTR) during actual outages.

Key Principles and Best Practices


To conduct chaos engineering safely and effectively, especially in complex hybrid scenarios, specific principles should be followed:
 
Define a "Steady State": Before any experiment, establish clear metrics for what "normal" system behavior looks like (e.g., request success rate, latency, error rates).
Formulate a Hypothesis: Predict how the system should react to a specific failure (e.g., "If the on-premise authentication service goes down, the cloud-based application will automatically use the backup in Azure without user impact").
Start Small and Limit the "Blast Radius": Begin experiments in a non-production environment and, when moving to production, start with a minimal scope to control potential damage.
Automate and Monitor Extensively: Use robust observability tools to track metrics in real time during experiments and automate rollbacks if the experiment spirals out of control.
Foster a Learning Culture: Treat failures as learning opportunities rather than reasons for blame to encourage open analysis and continuous improvement.
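
The first four principles map naturally onto a small experiment harness. The sketch below is a toy: `run_experiment`, the 99% success-rate threshold, and the two-backend "system" are hypothetical stand-ins for real probes and fault injectors, chosen only to show the control flow.

```python
from typing import Callable


def run_experiment(
    measure_success_rate: Callable[[], float],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
    min_success_rate: float = 0.99,  # assumed definition of "steady state"
) -> dict:
    """Run one hypothesis-driven experiment and always roll back."""
    baseline = measure_success_rate()
    if baseline < min_success_rate:
        # No steady state to compare against, so do not inject anything.
        return {"ran": False, "baseline": baseline, "hypothesis_held": None}
    inject_failure()
    try:
        during = measure_success_rate()
    finally:
        rollback()  # undo the fault even if the measurement raises
    return {
        "ran": True,
        "baseline": baseline,
        "during": during,
        "hypothesis_held": during >= min_success_rate,
    }


# Toy system: requests succeed while at least one auth backend is up.
backends = {"on_prem_auth": True, "cloud_backup_auth": True}

result = run_experiment(
    measure_success_rate=lambda: 1.0 if any(backends.values()) else 0.0,
    inject_failure=lambda: backends.update(on_prem_auth=False),
    rollback=lambda: backends.update(on_prem_auth=True),
)
print(result["hypothesis_held"])
```

The hypothesis here is the one from the list above: losing the on-premise authentication service should not affect users because a backup takes over. The blast radius is a single backend, and the `finally` clause is the automated rollback.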

Common Experiment Types in a Hybrid Context


Experiments can be tailored to the unique vulnerabilities of hybrid setups:

Service termination: Randomly shutting down virtual machines or containers residing on different platforms (on-premise vs. cloud) to test redundancy.
Network chaos: Introducing artificial latency or dropped packets in traffic between the on-premise datacenter and the cloud region.
Resource starvation: Consuming high CPU or memory on a specific host to see how load balancing and failover mechanisms distribute the workload.
Dependency disruption: Blocking access to a core service (like a database or API gateway) housed in one environment from applications running in the other.
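
As an illustration of the network chaos experiment, the sketch below wraps a call that crosses the on-premise/cloud link and adds artificial latency or simulated packet loss at the application level. The `with_network_chaos` wrapper and the `LinkDegraded` exception are hypothetical names for this post; real experiments usually inject faults at the network layer with dedicated tooling.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class LinkDegraded(ConnectionError):
    """Raised when the simulated link drops a call."""


def with_network_chaos(
    call: Callable[[], T],
    latency_s: float = 0.0,
    drop_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke `call` through a simulated degraded link."""
    rng = rng or random.Random()
    if rng.random() < drop_rate:
        raise LinkDegraded("simulated packet loss")
    sleep(latency_s)  # artificial latency before the real call
    return call()


def fetch_inventory() -> str:
    # Stand-in for a request from a cloud service to an on-premise system.
    return "42 units"


try:
    print(with_network_chaos(fetch_inventory, latency_s=0.05, drop_rate=0.2))
except LinkDegraded as exc:
    print(f"caller must handle: {exc}")
```

Passing in `rng` and `sleep` keeps the experiment repeatable: a seeded `random.Random` reproduces the same sequence of drops, and a fake `sleep` lets the wrapper be exercised in tests without actually waiting.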


Conclusion: Resilience Is a Continuous Journey


Building resilience in a hybrid environment is not a project you complete once and forget. It is a continuous operational lifecycle.
 
Design with failure in mind (using hybrid DR strategies).
Implement automated recovery (using intelligent failover mechanisms).
Verify your assumptions relentlessly (using Chaos Engineering).

The hybrid cloud offers incredible flexibility, but it demands a higher standard of engineering discipline. By integrating DR, Failover, and Chaos Engineering into your operational culture, you move from fearing the inevitable failure to embracing it as just another Tuesday event.

Friday, March 30, 2018

Enterprise Architecture Framework - Non-Functional Attributes

Non-Functional Attributes (NFAs) always exist, though their significance and priority differ depending on the other functional or non-functional attributes they are weighed against. It is particularly important to consider them in the initial phase of EA framework development, as they may have a direct or indirect impact on some of the framework's functional attributes. Considering NFAs early in the lifecycle matters because they tend to be cross-cutting and to drive important aspects of your architecture, and so they have a considerable impact on your test strategy. For example, security requirements will drive the need to support security testing, performance requirements will drive the need for stress and load testing, and so on. These testing needs in turn may drive aspects of your test environments and your testing tool choices.


The Enterprise Architecture team will interact closely with all the other management processes in an organisation, especially the IT management processes. When all these processes work together effectively, an enterprise can manage strategic change and drive business transformation efficiently. Often, however, little thought has been given to integrating the EA processes with the other management processes. Identifying and considering NFAs early on will certainly help in proactively addressing such issues. Having a clear picture of the NFAs helps Enterprise Architects take innovative alternatives and trade-offs into account before presenting decision-ready options.


NFAs play a vital role in defining certain atomic properties of each enterprise architecture framework. The challenge with NFAs is that they are difficult to trace and identify, and it is also difficult to define metrics to measure them. Described below are the typical NFAs that need to be considered while developing the EA Framework:


  • Adaptability – Be it people, process or technology, adaptability has never been more needed in the enterprise workplace. With change happening at a faster pace than ever before, adaptability is becoming a key attribute of every resource, human resources as well as systems. The resources identified as part of the EA Framework should be able to accept and absorb the changes that come along. This way, the longevity of the EA Framework can be extended with few or no changes to the framework itself.
  • Compatibility – The EA Framework will have many artifacts that interface not only with other internal artifacts but also with external actors. Making this work seamlessly requires that the interfaces be compatible with each other at all times. The EA Framework should be developed with compatibility in mind, and incremental changes should not break compatibility, so that functional performance is not impacted. Considering the compatibility of the artifacts in the initial phase of EA Framework development saves considerable effort compared with fixing a compatibility issue that surfaces later in the lifecycle.
  • Cohesiveness – Cohesion is the uniqueness in purpose of the system elements. A certain amount of formality is essential in providing uniformity and forming a coherent aggregate. This is critical when the components of EA Framework are developed by people both from a centralized EA team and from projects and programs. Obviously, lower level architectures should conform to the upper level architectures and unnecessary duplication should be avoided. Cohesion has to be considered in developing components or models describing a certain target area from different viewpoints. Utilizing a formal EA framework in an appropriate way is critical in achieving uniformity and cohesion in EA products.
  • Conceptuality – The benefit of enterprise architecture (EA) management is directly coupled to the underlying conceptualization of the enterprise. This conceptualization should reflect the goals pursued by the EA management endeavor and focus on the areas of interest of the involved stakeholders. A conceptual model captures the essential concepts that are present or should be present in a specific artifact or entity and thus makes understanding or visualizing that entity easier and less ambiguous.
  • Coupling – It describes the level of dependencies between modules and components of the system. Loosely coupled systems minimize the assumptions they make about one another while still providing a meaningful interchange. Conversely, tightly coupled systems have a restrictive effect on the variability and evolution of the connected components or systems. The level of coupling that is appropriate for each system component should be ascertained and considered while developing the EA Framework.
  • Diversity – Diversity is the difference between the systems or components of the EA Framework in terms of technology, methodology, principles, process, environment, etc. Diversity shall be at the manageable level, so as to minimise the cost of maintaining expertise in and connectivity between multiple processing environments. The advantages of minimum diversity include: standard packaging of components; predictable implementation impact; predictable valuations and returns; redefined testing; and increased flexibility to accommodate future changes. 
  • Dependability – As system operations become more pervasive, the enterprise becomes more dependent on them. Dependable systems are characterized by a number of attributes including reliability, availability, safety and security. For some attributes, there exist probability-based theoretic foundations, enabling the application of dependability analysis techniques. To ensure that stakeholders at all levels share the same understanding, it is critical to consider the level of dependability expected of the systems and components. This will also ensure that the systems and components are developed and implemented as expected.
  • Extensibility – One of the capabilities of the enterprise architecture is to allow the various artifacts of prebuilt integrations to be extended with little or no effort. Extensibility also ensures that such system or component extensions are protected when changes or revisions are implemented later on. It is essential to evaluate and consider the appropriate level of extensibility of each system or component that is part of the EA Framework in the initial phase.
  • Flexibility – It is a quality attribute of business information systems that contributes to the prevention of aging. It may also be considered as the capability of the enterprise to connect people, process and information in a way that allows the enterprise to become more flexible and responsive to the dynamics of its ever-changing environment, stakeholders and competitors. This requires simplification of the underlying technology and related infrastructure and the creation of a consolidated view of, and access to, all available resources in the enterprise.
  • Interoperability – It is the ability of systems (including organizations) to exchange and use exchanged information without knowledge of the characteristics or inner workings of the collaborating systems (or organizations). Clearly, making systems interoperable can mean many things. The strongest drive for interoperability is technical interoperability: the technical problem of sharing information that already exists in different systems from different times and places by enabling sharing, or at least providing connected technical services. Therefore, it is imperative to develop the big picture of what data the enterprise needs to share, to receive as incoming data and to send to other systems. Both end points may reside within the enterprise, or some may reside in external enterprises.
  • Maintainability – Maintainability is defined as the ease with which a system or component can be modified to correct faults, improve performance or other attributes, or adapt to a changed environment. A fast and continuously changing business environment demands flexible systems that are easy to modify and maintain. Maintainability is said to be affected by the maturity of the human resources involved, the maturity of the process governing change management, the quality of the systems' supporting documentation, the systems' architectural quality and the quality of the enterprise ecosystem on which the system executes. Thus, identifying and appropriately documenting the expectations around this attribute will certainly help in implementing a better EA Framework.
  • Portability – It is the ability of the system to run under a different environment without any disruptions. Portability depends on the symmetry of conformance of both applications and the platform to the architected API. That is, the platform must support the API as specified, and the application must use no more than the specified API. Documenting the level of portability expected early on would contribute considerably in designing and developing the systems in line with the target platforms or ecosystems.
  • Robustness – It is the ability of a system to recover gracefully after a failure or restart. Clearly, robust and easily modifiable automation is fundamental to achieving an enterprise’s vision for the future. However, such benefits come at a price. Hard work and management commitment, both from IT and from the highest levels of the business, are needed to build the kind of integrated IT architecture plans that will make the difference between success and failure in today’s highly competitive business climate.
  • Scalability – It is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. Scalability, as a property of systems, is generally difficult to define and in any particular case it is necessary to define the specific requirements for scalability on those dimensions that are deemed important. The concept of scalability is desirable in technology as well as business settings.
  • Security – With ever-evolving cyber threats to both IT and OT, security has become a very important NFA to be considered in the development of the EA Framework. Considering its significance, the security requirements should ideally be intertwined with the EA Framework. Security must be designed into data elements from the beginning; it cannot be added later. Systems, data, and technologies must be protected from unauthorized access and manipulation. Headquarters information must be safeguarded against inadvertent or unauthorized alteration, sabotage, disaster, or disclosure.
Most of the attributes mentioned above are easily recognized as Non Functional Requirements with respect to a software system. Though Enterprise Architecture by itself may not be a 'software system', it is a 'System' that depicts the blueprint of the enterprise's overall business activities, with answers to basic questions like What, Who, When, Where and How. Enterprise Architecture has multiple layers, and the implementation of software and IT systems is one such layer. To ensure that the stakeholders involved in the different layers get an accurate view of the principles, strategies and guidelines, it is important to identify, analyze and consider these NFAs early on in the EA Framework development lifecycle.
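
The Dependability attribute above mentions probability-based foundations for analysis. As a rough illustration only, the sketch below uses the standard steady-state availability formula, MTBF / (MTBF + MTTR), and the usual series and parallel combinations. It assumes component failures are independent, which real systems often violate. The function names and the sample figures are hypothetical and do not come from any particular EA framework.

```python
import math
from typing import Iterable


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_availability(parts: Iterable[float]) -> float:
    """All components are required: availabilities multiply."""
    return math.prod(parts)


def parallel_availability(parts: Iterable[float]) -> float:
    """Any one component suffices: 1 minus the chance that all are down."""
    return 1.0 - math.prod(1.0 - a for a in parts)


if __name__ == "__main__":
    # Hypothetical figures, chosen only to make the arithmetic easy to follow.
    single = availability(mtbf_hours=999.0, mttr_hours=1.0)
    print(f"single component:      {single:.4f}")
    print(f"two in series (0.99):  {series_availability([0.99, 0.99]):.4f}")
    print(f"two in parallel (0.99): {parallel_availability([0.99, 0.99]):.4f}")
```

Writing the expected figure down this way gives stakeholders at different levels a single number to agree on before design starts, which is the point the Dependability attribute makes.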

Sunday, March 25, 2018

Securing the Operational Technology (OT) - The Challenges

OT - Overview

Operational Technology (OT) is generally the technology used on the manufacturing or operational floor. OT has evolved considerably in recent years from purely mechanical technology to data-driven technologies like Robotic Process Automation (RPA) leveraging IOT, Machine Learning and Artificial Intelligence. The impetus from the Industrial IOT (IIOT) brings more and more automation capabilities and connected behavior onto the manufacturing floor. Thus the adoption of IT and related technologies in OT is now the norm, and so is the need for alignment and convergence with the IT function.
IOT sensors are deployed everywhere: on a manufacturing floor, along gas pipelines, inside a moving automobile, to monitor stock movements, and so on. Though these dispersed IOT devices perform small functions, the data they produce and the decisions taken based on such data are critical. It is therefore being realized that OT could lead to critical security issues, depending on the size and critical nature of the enterprise.

The adoption of IIoT and related technologies brings many benefits to businesses, such as smart machines and real-time intelligence from the factory floor - but it also increases the attack surface and requires continuous connectivity between IT and OT. The differing culture and mindset of the IT and OT functions, combined with a few other factors, often leads to conflicts.

Hackers and cybercriminals are now looking at critical infrastructure systems as targets. Motivations include holding systems hostage for a ransom, stock price manipulation, denial of production operations, etc. For example, hackers may take control of your car while it is on a highway and demand a ransom, which could be life threatening. Similarly, hackers may get hold of the energy grid and shut down the power supply for a region or even a nation as a whole. The connected nature of the devices and systems involved in modern-day OT poses serious challenges as they get hooked on to the IT-owned network infrastructure, wireless access points, and mobile networks.

Securing the OT

The introduction of new technologies to drive improvements such as production and supply chain efficiency and asset management has led to closer and more open integration between IT and shop floor systems. But the increasing connectivity of previously isolated manufacturing systems, together with a reliance on remote supporting services for operational maintenance, has introduced new vulnerabilities for cyber attack. Not only is the number of attacks growing, but so is their sophistication. As OT security becomes a widely discussed topic, the awareness of OT operators is rising, but so is the knowledge and understanding of OT-specific problems and vulnerabilities in the hacker community.

It’s true that the systems and devices involved in OT are often based on the same technologies as those of IT, and as such many of the threats they face are exactly the same. However, it is an open secret that OT security is not the same as IT security. While securing OT systems requires an integrated approach similar to IT, its objectives are inverted, with availability being the primary requirement, followed by integrity and confidentiality. There are certain other important differences as well, which mean that the OT infrastructure cannot be managed as an extension of the IT infrastructure.

Here are some of the areas that make OT different from IT and thus pose a challenge for IT security experts:

1. Visibility:

From the perspective of the organizational units responsible for the IT Security function, OT has been somewhat off the radar. This is because the IT function is not involved in the evaluation, selection and procurement of OT systems. Moreover, such OT systems often come with their own dedicated, networked IT systems, which could even mean isolated data centers being set up on the manufacturing floor without the knowledge of the IT function. Until recently, and even now in certain cases, the IT systems involved in OT have been treated as an integral part of production machinery rather than as computerized information systems, so the ultimate responsibility for their operation and maintenance, regardless of the cause of a potential failure, was assigned to the OT function and not the IT function. In most cases, the OT staff don’t know what types of IT or IoT devices or equipment they have as part of their OT ecosystem.


2. Skill Gap:

One of the biggest challenges facing the industry is deciding who is responsible for OT security - should it be the IT or the OT function? Given their background and resources, in many cases IT security teams are being asked to take ownership of coordinating security for OT. However, they typically lack OT-specific skills. Defining the security controls and processes for OT systems requires in-depth knowledge of those systems, so that the interests and priorities of the OT function are also taken care of. The cybersecurity industry is projected to reach 1.8 million unfilled roles by 2020. The added complexities of a converged IT/OT security environment could amplify perceived barriers to entry, as organizations struggle to manage the aging workforce of their plant teams alongside the Millennial generation of new cybersecurity talent.


3. Availability and Safety:

For a manufacturing company, the smooth functioning of the production line is critical at all times. Companies lose revenue when their production line is shut down for maintenance, be it planned or unplanned. Nobody wants to disturb OT equipment, because any downtime can turn into millions of dollars in lost productivity, highly vocal, disgruntled customers and regulatory fines. Machines must reach a high OEE (overall equipment effectiveness). There is no time to allow IT-style updates and patches that take down equipment.

In many cases where OT systems are involved in delivering essential services, such as electricity or water, or in maintaining safety systems at chemical plants or dams, availability is a significant parameter. Even momentary non-availability could lead to catastrophe in certain cases. Enabling high availability of OT systems and maintaining the confidentiality of sensitive information processed by those systems require additional security controls. Not only are many of these now-connected OT system components quite vulnerable to compromise, but a failure in one of them could also have a catastrophic effect on human life and property.
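
To make the OEE figure mentioned above concrete, here is a minimal sketch using the commonly cited definition, OEE = availability × performance × quality. The `ShiftRecord` type, its field names and the sample shift numbers are hypothetical and chosen only for illustration. Real plants may define planned time, ideal cycle time and good units differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ShiftRecord:
    """Hypothetical per-shift production data for one machine."""
    planned_minutes: float      # scheduled production time
    downtime_minutes: float     # stops of any kind, planned or unplanned
    ideal_cycle_seconds: float  # best-case time to produce one unit
    total_units: int            # everything produced, good or bad
    good_units: int             # units that passed quality checks


def oee_components(r: ShiftRecord) -> Tuple[float, float, float]:
    """Return (availability, performance, quality) for one shift."""
    if r.planned_minutes <= 0 or not 0 <= r.downtime_minutes < r.planned_minutes:
        raise ValueError("downtime must be within the planned production time")
    if r.total_units <= 0 or not 0 <= r.good_units <= r.total_units:
        raise ValueError("good units must be between 0 and total units")
    run_minutes = r.planned_minutes - r.downtime_minutes
    availability = run_minutes / r.planned_minutes
    performance = (r.ideal_cycle_seconds * r.total_units) / (run_minutes * 60.0)
    quality = r.good_units / r.total_units
    return availability, performance, quality


def oee(r: ShiftRecord) -> float:
    """Overall equipment effectiveness as the product of the three factors."""
    a, p, q = oee_components(r)
    return a * p * q


if __name__ == "__main__":
    shift = ShiftRecord(planned_minutes=500, downtime_minutes=50,
                        ideal_cycle_seconds=30, total_units=810, good_units=729)
    a, p, q = oee_components(shift)
    print(f"availability={a:.2f} performance={p:.2f} quality={q:.2f} OEE={oee(shift):.3f}")
```

Because downtime feeds directly into the availability factor, every patch window that stops a machine shows up in this number, which helps explain why OT teams resist IT-style update cycles.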


4. Processes:

Safety and security for employees and customers have always been top priorities for the OT function, and its processes and guidelines are usually defined with that in mind. The IT function generally does not factor in plant or employee physical safety, except where physical access systems are under its domain. IT’s top priority is to protect the data. OT’s top priority, however, is to protect the availability and integrity of the process, with confidentiality coming last. At the same time, the OT system components designed for direct control, supervisory control or the safe operation of manufacturing processes could turn out to be a safety hazard if any component or subsystem involved is compromised. Business systems are also critical, but their failure is unlikely to result in the uncontrolled release of hazardous materials or energy.


5. Legacy: 

It is not uncommon for the computers and related software used as part of the OT to remain in use for over a decade without being replaced or changed. These computers and software are designed for specific functions of interfacing with the plant and equipment involved in the manufacturing process. Enhancements to the software and related IT hardware largely depend on the plant or equipment vendor; otherwise, such systems may not be compatible with upgraded IT hardware or operating systems. Consequently, such systems remain vulnerable to a wide range of cyber-threats that have already been mitigated on the systems used in the IT function. This is more so


6. Disparate Technologies:

Until recently, and even now in most cases, OT architectures have run on a separate infrastructure and as such have traditionally been isolated from the Internet. One of the reasons for this is that these systems are often hard wired to work with a plant or piece of equipment, to receive and process signals and to send instructions back to various components. Some OT systems already support only obsolete, insecure operating systems. OT system vendors also do not feel obliged to increase the security capabilities of their systems. Something as benign as an active system scan can cause these devices to fail, which can have serious if not catastrophic results.

System-dedicated networks, multiple domains and dedicated supporting systems require more resources to achieve a maturity level comparable with IT. They also greatly increase the complexity of monitoring and maintaining security levels. The sophisticated nature of OT infrastructure technologies means that most IT security and threat intelligence solutions don’t have visibility into, let alone the ability to defend against, attacks on critical infrastructure. This creates a challenge in defining and implementing coherent security policies across production plants.


7. IIoT Impact: 

The Industry 4.0 revolution is having a great impact on manufacturing environments. It offers significant opportunities for improving production effectiveness, in particular through continual, online information about manufacturing processes and equipment. However, the utilization of new IoT technologies also has an impact on security. It’s not just about networks, of course: there are many components, including sensors and actuators (transducers), ‘smart things’, fog nodes, (industrial and intelligent) IoT gateways, IoT platforms and so forth. For IT, some of these components are “different” from the cyber security perspective it is used to. New protocols (including wireless) or mesh network architectures increase the number of potential access points to the network and require a different approach to security.

8. Culture:

The IT function, responsible for maintaining and securing information and related resources, helps ensure the confidentiality, integrity and availability of data and, in the process, protects corporate information and related assets, including networks, from cyberattacks. IT staff are less familiar with the OT space, and often display little interest in knowing what their counterparts do to keep it safe and operational. In contrast, the OT function monitors and fixes issues in highly complex and sensitive industrial plants, with operational safety, reliability, and continuity as its top priorities. OT staff don't usually work with the IT function, and certainly don't want it to get involved in their operational issues.

Each group is concerned that the other side will wreak havoc in their environment. When there is a need to secure OT against cyberthreats, plant engineers worry that if IT team members get involved, they'll compromise system safety and stability. Unsanctioned changes to these systems might cripple the plant, cause an explosion, or worse. These concerns are justified. After all, when it comes to OT, IT staff members are in uncharted waters. At the same time, the IT function is concerned that vulnerable OT networks will introduce new threats into IT networks, threatening corporate assets, data, and systems.

Conclusion:

As industrial organizations begin to connect their machines to the network, the differences in security requirements for IT versus operational technology (OT) are becoming more important to understand.
There were no good practices and formal regulations for manufacturers on how to provide even minimal security protection on medical devices. 

IT and OT teams are discovering the need to work together in order to deploy cybersecurity solutions throughout the enterprise; from headquarters to remote locations, and the factory floor. Hackers are going after intellectual property, financial data and customer information. CIOs report that intellectual property can constitute more than 80% of company value. Now is the time for OT and IT leaders to develop strong partnerships to promote operational efficiency, safety and competitive advantage.

Neither OT team members nor IT team members are experts in defending OT systems against emerging cyberthreats. Because OT networks were previously disconnected from the external world, engineering staff never had to deal with such threats. Meanwhile, IT staff members who deal with cyberthreats on a daily basis don't fully understand how these new threats will affect OT systems.  Nevertheless, both sides must cooperate, because neither group can protect industrial systems singlehandedly. Given the divergent cultures, technologies, and objectives of IT and OT, the two groups must overcome a significant divide, including mutual suspicion.

To ensure IT and OT collaboration, business-level oversight and leadership is required. More and more organizations are taking senior, experienced engineers from OT business units, usually from under the COO, and moving them under the CIO hierarchy. This interdisciplinary model combines expertise and roles that straddle and unify both sides of the IT-OT fence. Some organizations have taken this one step further. Instead of aligning IT roles under the CIO, they're creating a new C-level role to facilitate this management strategy. 

The higher up the organizational ladder that IT-OT convergence decisions are being made, the better the chances for success in bridging the gap.