Service outage due to Elasticsearch node failure
General Information
Incident Detector's Information
Name: Allan Chu
Date and Time Detected: 2024/10/03 8:28 PM MDT
Title: DevOps Engineer
Additional Information
Incident Summary
Type of Incident Detected: Service outage caused by software malfunction
Incident Location: AWS Cloud Services
How was the Incident Detected
Cloud service alarms alerted DevOps engineers to possible issues that needed investigation and remediation.
Additional Information
Location(s) of affected systems: Deployed software solutions in the US-WEST-2 region
Date and time incident handlers arrived at site: 2024/10/03 8:30 PM MDT
Describe affected information system(s)
The Elasticsearch service cluster crashed and could not be recovered, so it had to be replaced immediately.
Isolate affected systems
Approval to remove from network? Yes
If YES, Name of Approver: Allan Chu
Date and Time Removed: 2024/10/03 8:48 PM MDT
Backup of Affected System(s)
Last System backup successful? Yes
Name of persons who did backup: Allan Chu
Date and time last backups started: 2024/09/30
Date and time last backups completed: 2024/09/30
Backup Storage Location: AWS S3 Glacier Storage
Incident Eradication
Name of persons performing forensics: Allan Chu
Was the vulnerability (root cause) identified: Yes
Describe
All nodes in the Elasticsearch cluster were unable to restart after crashing and were stuck in a continual crash loop.
How was eradication validated
Launched a new Elasticsearch cluster from the restored backup and migrated all dependent microservices to refer to the new cluster.
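
The report does not say how the replacement cluster was checked before the microservices were switched over. One possible check, offered only as a minimal sketch, is to read the Elasticsearch `_cluster/health` endpoint and confirm that the status is green, the expected number of nodes has joined, and no shards are unassigned. The function names, the base URL, and the expected node count are illustrative assumptions and are not taken from the incident itself.

```python
import json
import urllib.request


def fetch_cluster_health(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the JSON body of GET <base_url>/_cluster/health.

    base_url is a hypothetical endpoint for the replacement cluster,
    for example "http://localhost:9200".
    """
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def is_cluster_healthy(health: dict, expected_nodes: int) -> bool:
    """Return True only if the cluster looks fully recovered.

    Assumed criteria (not from the report): status is "green", at least
    expected_nodes nodes have joined, and no shards are unassigned.
    """
    return (
        health.get("status") == "green"
        and health.get("number_of_nodes", 0) >= expected_nodes
        and health.get("unassigned_shards", 0) == 0
    )
```

A caller could run `is_cluster_healthy(fetch_cluster_health(url), expected_nodes=3)` against the new cluster before repointing services, where `url` and the node count of 3 are placeholders for the real deployment values.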