Chaos Engineering 2.0: A Review of AI-Driven, Policy-Guided Resilience for Multi-Cloud Systems
Abstract
Multi-cloud has become the default posture: 89% of large enterprises now run workloads across two or more providers, yet most failure-testing playbooks were written for a single-vendor world. Chaos Engineering 2.0 extends the classical “break-things-on-purpose” paradigm by combining AI-guided experiment orchestration, service-mesh–native fault injection, and chaos-as-code safeguarded by policy-as-code, so that teams can probe complex, cross-cloud failure domains without jeopardizing customer trust. Building on the original Netflix Chaos Monkey ethos and the four “steady-state-first” principles, this review synthesizes the resilience patterns that have surfaced over a decade of practice (circuit breakers, bulkheads, adaptive retries, and progressive delivery) and maps them to the modern toolchain. Open-source projects such as LitmusChaos and Chaos Mesh have seen limited production use, commercial platforms offer rapid onboarding, and new chaos services are now embedded in AWS and Azure. Two illustrative case studies demonstrate tangible ROI: an e-commerce cache stampede revealed by latency chaos, and a fintech blue/green rollback validated under a simulated inter-cloud partition. Finally, the review discusses ethical guardrails, cost-risk trade-offs, and forward directions such as autonomous chaos agents and security chaos engineering. The goal is pragmatic: to equip practitioners with a concise, pattern-driven playbook for hardening real-world multi-cloud systems before the next outage strikes.
Article information
Journal: Journal of Computer, Software, and Program
Volume (Issue): 2(2), 2025
Pages: 10-24
Copyright
Copyright (c) 2025 Lasbrey Chibuzo Opara, Ogheneruemu Nathaniel Akatakpo, Ifeanyi Charles Ironuru, Kingsley Anyaene, Benjamin Osaze Enobakhare (Author)
Open access

This work is licensed under a Creative Commons Attribution 4.0 International License.