Service outage due to Elasticsearch node failure

General Information

Name: Allan Chu

Date and Time Detected: 2024/10/03 8:28 PM MDT

Title: DevOps Engineer

Type of Incident Detected: Service outage caused by software malfunction

Incident Location: AWS Cloud Services

Cloud service alarms alerted DevOps engineers to possible issues that needed investigation and remediation.

Location(s) of affected systems: Deployed software solutions in the US-WEST-2 region

Date and time incident handlers arrived at site: 2024/10/03 8:30 PM MDT

The Elasticsearch service cluster suffered a crash and was unrecoverable and needed to be replaced immediately.

Approval to removal from network? Yes

If YES, Name of Approver: Allan Chu

Date and Time Removed: 2024/10/03 8:48 PM MDT

Last System backup successful? Yes

Name of persons who did backup: Allan Chu

Date and time last backups started: 2024-09-30

Date and time last backups completed: 2024-09-30

Backup Storage Location: AWS S3 Glacier Storage

Name of persons performing forensics: Allan Chu

Was the vulnerability (root cause) identified: Yes

All nodes in the Elasticsearch cluster were unable to restart themselves after suffering a crash and was stuck in a continual crashing loop.

Launched and restored a backup of Elasticsearch, and migrated all other microservices to refer to the new cluster instead.