Service outage due to AWS node failure
General Information
Incident Detector's Information
Name: Allan Chu
Date and Time Detected: 2025/05/12 9:03 AM MDT
Title: DevOps Engineer
Additional Information
Incident Summary
Type of Incident Detected: Service outage caused by cloud outage
Incident Location: AWS Cloud Services
How was the Incident Detected
Cloud service alarms alerted DevOps engineers to possible issues that needed investigation and remediation.
Additional Information
Location(s) of affected systems: Cluster node deployed in the US-WEST-2 region
Date and time incident handlers arrived at site: 2025/05/12 9:05 AM MDT
Describe affected information system(s)
An AWS node deployed and part of our main V3 cluster became unresponsive and became unreachable
Isolate affected systems
Approval to removal from network? Yes
If YES, Name of Approver: Allan Chu
Date and Time Removed: 2025/05/12 9:10 AM MDT
Backup of Affected System(s)
Last System backup successful? Yes
Name of persons who did backup: Allan Chu
Date and time last backups started: 2025-05-12
Date and time last backups completed: 2025-05-12
Backup Storage Location: AWS S3 Glacier Storage
Incident Eradication
Name of persons performing forensics: Allan Chu
Was the vulnerability (root cause) identified: Yes
Describe
A node in the V3 Illumass cluster became unreachable and needed to be replaced
How was eradication validated
Launched a new node into the V3 Illumass cluster, shifted all affected systems onto the new node and removed the old node