Service degradation due to Cassandra cluster node failure

General Information

Incident Detector's Information

Name: Allan Chu

Date and Time Detected: 2024/10/02 12:20 AM MDT

Title: DevOps Engineer

Additional Information

Incident Summary

Type of Incident Detected: Service degradation caused by hardware malfunction

Incident Location: AWS Cloud Services

How was the Incident Detected

Cloud service alarms alerted DevOps engineers to possible issues that needed investigation and remediation

Additional Information

Location(s) of affected systems: Deployed AWS EC2 instances in US-WEST-2 region.

Date and time incident handlers arrived at site: 2024/10/02 12:22 AM MDT

Describe affected information system(s)

Illumass Cassandra cluster was affected. Affected node was replaced and a backup loaded to resume full service.No actual impact on system usability as enough read/write replicas were still available in the cluster.

Isolate affected systems

Approval to removal from network? Yes

If YES, Name of Approver: Allan Chu

Date and Time Removed: 2024/10/02 12:35 AM MDT

Backup of Affected System(s)

Last System backup successful? Yes

Name of persons who did backup: Allan Chu

Date and time last backups started: 2024-09-30

Date and time last backups completed: 2024-09-30

Backup Storage Location: AWS S3 Glacier Storage

Incident Eradication

Name of persons performing forensics: Allan Chu

Was the vulnerability (root cause) identified: Yes

Describe

AWS EC2 Instance reported unable to connect.

How was eradication validated

Using AWS EC2 controls, the virtual machine was stopped and started to migrate it to a new instance. After migration the machine was brought back into the clusterand allowed to reconnect.