Introduction
AWS Fault Injection Service (FIS) is a fully managed service designed to facilitate fault injection experiments within AWS environments. By introducing controlled failures, FIS helps users assess their applications’ performance, observability, and resilience against real-world conditions. FIS operates according to the principles of chaos engineering, allowing teams to identify vulnerabilities and optimize their applications proactively rather than reactively.
This walkthrough runs a stress test based on the CPU utilization of an EC2 instance. If CPU utilization rises above 80 percent, an Amazon CloudWatch alarm is triggered and, as a result, the EC2 instance is stopped. In addition, an SNS notification is sent to inform the user of the server failure.
We can perform this task by using AWS Fault Injection Service (FIS).
Key aspects of AWS FIS include:
Preconfigured actions: FIS offers a library of preconfigured actions tailored to various AWS services, such as Amazon CloudWatch, Amazon EBS, Amazon EC2, Amazon ECS, Amazon ElastiCache, Amazon RDS, AWS Systems Manager, and Amazon VPC.
Experiment templates: Users create experiment templates containing actions, targets, alarms, and stop conditions.
Stop conditions: FIS incorporates safeguards to prevent experiments from causing irreparable damage; stop conditions can trigger upon reaching defined thresholds.
Integration with delivery pipelines: FIS can be integrated into continuous delivery processes to ensure regular testing of fault tolerance throughout development cycles.
Monitoring and logging: Experiments are monitored through the FIS console, AWS CLI, and CloudWatch, with detailed logs provided through AWS CloudTrail.
Support for multiple AWS Regions: FIS is available in multiple AWS Regions, including the AWS GovCloud (US) Regions.
Benefits of FIS
AWS Fault Injection Service (FIS) offers several benefits related to improving application performance, observability, and resiliency:
- Identify hidden vulnerabilities: FIS exposes latent weaknesses in applications and systems that might go undetected during normal operations.
- Proactive testing: Instead of waiting for incidents to happen, FIS allows teams to anticipate problems and address them before they become major issues.
- Reduced mean time to repair (MTTR): By identifying and fixing issues earlier, organizations can reduce the amount of time spent on repairs, leading to improved efficiency.
- Cost savings: Up to 90% reduction in MTTR translates into significant cost reductions.
- Standardized chaos engineering: FIS promotes consistency in chaos engineering methodologies across teams, reducing complexity and increasing effectiveness.
- Integrated with delivery pipelines: Teams can repeatedly test the impact of fault actions as part of their software delivery process.
- Superior insights: Generating realistic failure conditions reveals previously unknown weaknesses.
- Easy setup and management: FIS requires no special agents and is compatible with popular AWS services, making it convenient to adopt.
- Monitoring and logging capabilities: FIS provides detailed logs and integrates with AWS CloudTrail for compliance purposes.
- Multi-account support: FIS can be used to run experiments across multiple AWS accounts, enabling larger-scale testing.
These advantages demonstrate that AWS FIS is a powerful tool for ensuring the stability and dependability of applications deployed on AWS. Its ability to simulate real-world conditions and provide valuable insights into application behavior makes it an essential component of any organization’s strategy for building highly resilient systems.
Supported AWS services
AWS FIS provides preconfigured actions for specific types of targets across AWS services. AWS FIS supports actions for target resources for the following AWS services:
- Amazon CloudWatch
- Amazon EBS
- Amazon EC2
- Amazon ECS
- Amazon EKS
- Amazon RDS
- AWS Systems Manager
By default, the target resources must be in the same AWS account as the experiment. Targeting resources in other accounts requires setting up a multi-account experiment.
DEMO:
Prerequisites
● IAM role for AWS FIS experiments
● IAM role that lets the EC2 instance be managed by SSM
● EC2 instance
● SNS topic
● CloudWatch alarm
Step1:
IAM role for AWS FIS experiments :
The experiment needs an IAM role with a policy that grants permission for fault injection actions, Systems Manager actions, and Amazon EC2 actions. Start by creating the IAM policy. You can name it PolicyForFISexperiments and use the JSON below:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowFISExperimentRoleFISActions",
      "Effect": "Allow",
      "Action": [
        "fis:InjectApiInternalError",
        "fis:InjectApiThrottleError",
        "fis:InjectApiUnavailableError"
      ],
      "Resource": "arn:*:fis:*:*:experiment/*"
    },
    {
      "Sid": "AllowFISExperimentRoleEC2Actions",
      "Effect": "Allow",
      "Action": [
        "ec2:RebootInstances",
        "ec2:StopInstances",
        "ec2:StartInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*"
    },
    {
      "Sid": "AllowFISExperimentRoleSSMSendCommand",
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*",
        "arn:aws:ssm:*:*:document/*"
      ]
    },
    {
      "Sid": "AllowFISExperimentRoleSSMCancelCommand",
      "Effect": "Allow",
      "Action": [
        "ssm:CancelCommand"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowFISExperimentRoleReadOnly",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "iam:ListRoles",
        "ssm:ListCommands"
      ],
      "Resource": "*"
    }
  ]
}
Step2:
Create an IAM role (FIS-Experiments-Role) with a custom trust policy. Use the JSON below for the trust relationship, then attach the PolicyForFISexperiments policy to this IAM role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "fis.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}
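Step1 and Step2 can also be scripted rather than clicked through in the console. The following is a minimal sketch, not a definitive implementation: the helper name create_fis_role is hypothetical, and iam_client is assumed to behave like the object returned by boto3.client("iam"), whose create_policy, create_role, and attach_role_policy calls are used here. The client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
import json


def create_fis_role(iam_client, role_name, policy_name, trust_policy, permissions_policy):
    """Create the permissions policy and the role, then attach one to the other.

    `iam_client` is assumed to behave like boto3.client("iam").
    Returns the ARN of the newly created policy.
    """
    # Step1: create the customer managed policy from the JSON document.
    policy = iam_client.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(permissions_policy),
    )
    policy_arn = policy["Policy"]["Arn"]

    # Step2: create the role with the trust policy, then attach the policy.
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return policy_arn


# Trust policy from Step2: only the FIS service may assume the role.
FIS_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "fis.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
```

With boto3 installed and credentials configured, the call would look like create_fis_role(boto3.client("iam"), "FIS-Experiments-Role", "PolicyForFISexperiments", FIS_TRUST_POLICY, permissions_policy), where permissions_policy is the Step1 document loaded as a Python dict.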
Step3:
IAM role for “Amazon EC2 Role for SSM” :
Create an IAM role whose trust policy allows EC2 to assume it, and attach the AWS managed permission policy AmazonEC2RoleforSSM. (AWS now recommends the narrower AmazonSSMManagedInstanceCore managed policy for this purpose.)
Step4:
EC2 Instance :
Launch an EC2 instance. The instance must be managed by SSM. To make the EC2 instance SSM managed, attach an IAM instance profile while launching the instance, using the IAM role created in Step3.
SSM Agent must be installed on the EC2 instance. It is preinstalled on many Amazon-provided AMIs, such as Amazon Linux; on other images it may need to be installed manually. To verify that the instance is managed by SSM, open the Fleet Manager console from AWS Systems Manager (SSM).
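The same verification can be done programmatically. Below is a rough sketch, assuming ssm_client behaves like boto3.client("ssm") and using its describe_instance_information call; the helper name is_ssm_managed is made up for illustration.

```python
def is_ssm_managed(ssm_client, instance_id):
    """Return True if SSM lists the instance and reports it as online.

    `ssm_client` is assumed to behave like boto3.client("ssm").
    """
    response = ssm_client.describe_instance_information(
        Filters=[{"Key": "InstanceIds", "Values": [instance_id]}]
    )
    # An unmanaged instance is simply absent from the returned list.
    return any(
        info.get("InstanceId") == instance_id and info.get("PingStatus") == "Online"
        for info in response.get("InstanceInformationList", [])
    )
```

If this returns False, the instance profile from Step3 or the SSM Agent installation is the first thing to check.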
Step5:
SNS Topic and CloudWatch Alarm :
Create a CloudWatch alarm associated with an SNS topic, so that a notification is received when the alarm triggers.
First, create an SNS topic named “EC2CpuUse”.
Next, create a CloudWatch alarm based on the EC2 instance's CPU utilization metric.
Configure the alarm so that if CPU usage rises above 80%, the EC2 instance is stopped.
Finally, check that the alarm state is “OK”. If the alarm is in another state, the test will not be performed.
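As an illustration of the alarm settings described above, the sketch below builds the keyword arguments that could be passed to the put_metric_alarm call of boto3.client("cloudwatch"). The function name, the alarm name, and the 60-second period are assumptions chosen for this example (a 60-second period is only meaningful if detailed monitoring is enabled on the instance); the second alarm action is the built-in EC2 stop action ARN.

```python
def cpu_alarm_kwargs(instance_id, sns_topic_arn, region, threshold=80.0):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"HighCPU-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,  # assumption: detailed (1-minute) monitoring is on
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [
            sns_topic_arn,  # notify subscribers of the SNS topic
            f"arn:aws:automate:{region}:ec2:stop",  # stop the instance
        ],
    }
```

The resulting dict can be unpacked into the call, for example cloudwatch.put_metric_alarm(**cpu_alarm_kwargs(...)).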
AWS FIS — Create an experiment template
Description, name and permission :
Description: StressTestForCPUUtilization
IAM role: Attach “FIS-Experiments-Role”
Action:
Action type: aws:ssm:send-command/AWSFIS-Run-CPU-Stress
Document parameters: {"DurationSeconds":"120"}
Duration: 5 minutes
Target:
Edit the default target option. The default name can also be changed.
Resource type: aws:ec2:instance
Target method: Resource IDs, then select the target EC2 instance.
Selection mode: ALL
Stop Conditions:
Attach the CloudWatch alarm created in Step5. If the alarm is triggered, the experiment is stopped.
Finally, choose Create experiment template.
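The console settings above map fairly directly onto the request accepted by the create_experiment_template call of boto3.client("fis"). The following sketch only builds that request as a dict; the function name and the target and action names ("StressTargets", "CpuStress") are hypothetical, and the ARNs are expected to be supplied by the caller.

```python
import json


def cpu_stress_template(role_arn, instance_arn, alarm_arn, region):
    """Build a request body for FIS create_experiment_template."""
    return {
        "description": "StressTestForCPUUtilization",
        "roleArn": role_arn,
        "targets": {
            "StressTargets": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": [instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "CpuStress": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    # AWS-owned SSM document that runs the CPU stress.
                    "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress",
                    # Document parameters are passed as a JSON string.
                    "documentParameters": json.dumps({"DurationSeconds": "120"}),
                    # ISO 8601 duration: 5 minutes.
                    "duration": "PT5M",
                },
                "targets": {"Instances": "StressTargets"},
            }
        },
        # The experiment is stopped if this alarm goes into ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }
```

Keeping the template as plain data makes it easy to review or store in version control before it is sent with fis.create_experiment_template(**template).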
Now, start the experiment from the experiment template.
Observe the experiment states:
An experiment can be in one of the following states:
• pending: The experiment is pending.
• initiating: The experiment is preparing to start.
• running: The experiment is running.
• completed: All actions in the experiment were completed successfully.
• stopping: The stop condition was triggered or the experiment was stopped manually.
• stopped: All running or pending actions in the experiment are stopped.
• failed: The experiment failed due to an error, such as insufficient permissions or incorrect syntax.
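To follow these states from a script instead of the console, one option is to poll the experiment until it reaches a final state. This is a minimal sketch under a few assumptions: fis_client behaves like boto3.client("fis") and its get_experiment call, the set of final states is taken from the list above, and the helper name and polling interval are illustrative.

```python
import time

# States from the list above after which the experiment no longer changes.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def wait_for_experiment(fis_client, experiment_id, poll_seconds=10, sleep=time.sleep):
    """Poll an experiment until it reaches a terminal state.

    Returns a (status, reason) tuple. `sleep` is injectable so the
    function can be tested without real delays.
    """
    while True:
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            return state["status"], state.get("reason", "")
        sleep(poll_seconds)
```

In this walkthrough the expected outcome is "stopped", since the CloudWatch alarm acts as the stop condition; the returned reason string helps when the result is "failed" instead.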
Now, check the state of the target instance and the CloudWatch alarm. If the CloudWatch alarm is triggered and the EC2 instance is stopped, an SNS notification will be delivered.
If all the steps go well, the stress experiment on the EC2 instance will be successful.
If the experiment fails, check the Troubleshooting section below.
Troubleshooting
An AWS FIS experiment can fail for various reasons. Some common ones are:
“Not authorized to perform the required action”
Solution: Check the IAM role and policy configured in Step1 and Step2.
“One or more of the targeted instances are not managed by SSM, or are not in a running state”
Solution: The target instance is not managed by SSM or is not running. To make the target instance SSM managed, review Step3 and Step4, and confirm the instance is in the running state.
Conclusion
AWS Fault Injection Service (FIS) is a managed service that enables users to conduct fault injection experiments based on the principles of chaos engineering. By creating disruptive events, FIS allows users to observe how their applications respond, leading to improved performance and resiliency. Some key points about AWS FIS include:
Real-world conditions: FIS helps create real-world conditions to uncover application issues that may not be apparent under normal circumstances.
Experiment templates: Users can create experiment templates containing actions, targets, alarms, and stop conditions to simulate various scenarios.
Stop conditions: FIS provides controls and guardrails to stop experiments if predefined thresholds are reached, ensuring safety during testing.
Integration with AWS services: FIS supports various AWS services like Amazon CloudWatch, Amazon EC2, Amazon ECS, Amazon RDS, and more, allowing for comprehensive testing.
Cost-effective: Users are charged per minute for actions run during experiments, offering cost-effective testing.
Pre-production testing: It is recommended to plan and run experiments in a pre-production environment before conducting them in a production environment to avoid unintended consequences.
In conclusion, AWS FIS simplifies fault injection experiments by providing the necessary controls and guardrails for safe testing in production environments. By leveraging FIS, organizations can proactively identify and address vulnerabilities in their applications, leading to improved reliability and performance.
You can work with AWS FIS in any of the following ways:
- AWS Management Console: Provides a web interface that you can use to access AWS FIS.
- AWS Command Line Interface (AWS CLI): Provides commands for a broad set of AWS services, including AWS FIS, and is supported on Windows, macOS, and Linux.
- AWS CloudFormation: Lets you create templates that describe your AWS resources. You use the templates to provision and manage these resources as a single unit.
- AWS SDKs: Provide language-specific APIs and take care of many of the connection details, such as calculating signatures, handling request retries, and handling errors.
- HTTPS API: Provides low-level API actions that you can call using HTTPS requests. For more information, see the AWS Fault Injection Service API Reference.