Review Article

Chaos Engineering 2.0: A Review of AI-Driven, Policy-Guided Resilience for Multi-Cloud Systems

Authors

Opara, L. C., Akatakpo, O. N., Ironuru, I. C., Anyaene, K., & Enobakhare, B. O.

Abstract

Multi-cloud has become the default posture: 89% of large enterprises now run workloads across two or more providers, yet most failure-testing playbooks were written for a single-vendor world. Chaos Engineering 2.0 extends the classical “break-things-on-purpose” paradigm by combining AI-guided experiment orchestration, service-mesh–native fault injection, and chaos-as-code safeguarded by policy-as-code, so that teams can probe complex, cross-cloud failure domains without jeopardizing customer trust. Building on the original Netflix Chaos Monkey ethos and the four “steady-state-first” principles, this review synthesizes the resilience patterns that have surfaced over a decade of practice (circuit breakers, bulkheads, adaptive retries, and progressive delivery) and maps them to the modern toolchain. Open-source projects like LitmusChaos and Chaos Mesh have seen limited production use, commercial platforms offer rapid onboarding, and new chaos services are now embedded in AWS and Azure. Two illustrative case studies (an e-commerce cache stampede revealed by latency chaos, and a fintech blue/green rollback validated under a simulated inter-cloud partition) demonstrate tangible ROI. Finally, the review discusses ethical guardrails, cost-risk trade-offs, and forward directions such as autonomous chaos agents and security chaos engineering. The goal is pragmatic: to equip practitioners with a concise, pattern-driven playbook for hardening real-world multi-cloud systems before the next outage strikes.

Keywords:

Fault Injection Lineage Driven Microservice Testing

Article information

Journal

Journal of Computer, Software, and Program

Volume (Issue)

2(2), (2025)

Pages

10-24

Published

05-09-2025

How to Cite

Opara, L. C., Akatakpo, O. N., Ironuru, I. C., Anyaene, K., & Enobakhare, B. O. (2025). Chaos Engineering 2.0: A Review of AI-Driven, Policy-Guided Resilience for Multi-Cloud Systems. Journal of Computer, Software, and Program, 2(2), 10-24. https://doi.org/10.69739/jcsp.v2i2.846

