Skip to content

Service outage due to Elasticsearch node failure

General Information

Incident Detector's Information

Name: Allan Chu

Date and Time Detected: 2024/10/03 8:28 PM MDT

Title: DevOps Engineer

Additional Information

Incident Summary

Type of Incident Detected: Service outage caused by software malfunction

Incident Location: AWS Cloud Services

How was the Incident Detected

Cloud service alarms alerted DevOps engineers to possible issues that needed investigation and remediation.

Additional Information

Location(s) of affected systems: Deployed software solutions in the US-WEST-2 region

Date and time incident handlers arrived at site: 2024/10/03 8:30 PM MDT

Describe affected information system(s)

The Elasticsearch service cluster suffered a crash and was unrecoverable and needed to be replaced immediately.

Isolate affected systems

Approval to removal from network? Yes

If YES, Name of Approver: Allan Chu

Date and Time Removed: 2024/10/03 8:48 PM MDT

Backup of Affected System(s)

Last System backup successful? Yes

Name of persons who did backup: Allan Chu

Date and time last backups started: 2024-09-30

Date and time last backups completed: 2024-09-30

Backup Storage Location: AWS S3 Glacier Storage

Incident Eradication

Name of persons performing forensics: Allan Chu

Was the vulnerability (root cause) identified: Yes

Describe

All nodes in the Elasticsearch cluster were unable to restart themselves after suffering a crash and was stuck in a continual crashing loop.

How was eradication validated

Launched and restored a backup of Elasticsearch, and migrated all other microservices to refer to the new cluster instead.