Skip to content

Service outage due to AWS node failure

General Information

Incident Detector's Information

Name: Allan Chu

Date and Time Detected: 2025/05/12 9:03 AM MDT

Title: DevOps Engineer

Additional Information

Incident Summary

Type of Incident Detected: Service outage caused by cloud outage

Incident Location: AWS Cloud Services

How was the Incident Detected

Cloud service alarms alerted DevOps engineers to possible issues that needed investigation and remediation.

Additional Information

Location(s) of affected systems: Cluster node deployed in the US-WEST-2 region

Date and time incident handlers arrived at site: 2025/05/12 9:05 AM MDT

Describe affected information system(s)

An AWS node deployed and part of our main V3 cluster became unresponsive and became unreachable

Isolate affected systems

Approval to removal from network? Yes

If YES, Name of Approver: Allan Chu

Date and Time Removed: 2025/05/12 9:10 AM MDT

Backup of Affected System(s)

Last System backup successful? Yes

Name of persons who did backup: Allan Chu

Date and time last backups started: 2025-05-12

Date and time last backups completed: 2025-05-12

Backup Storage Location: AWS S3 Glacier Storage

Incident Eradication

Name of persons performing forensics: Allan Chu

Was the vulnerability (root cause) identified: Yes

Describe

A node in the V3 Illumass cluster became unreachable and needed to be replaced

How was eradication validated

Launched a new node into the V3 Illumass cluster, shifted all affected systems onto the new node and removed the old node