Service degradation due to Cassandra cluster node failure
General Information
Incident Detector's Information
Name: Allan Chu
Date and Time Detected: 2024/10/02 12:20 AM MDT
Title: DevOps Engineer
Additional Information
Incident Summary
Type of Incident Detected: Service degradation caused by hardware malfunction
Incident Location: AWS Cloud Services
How was the Incident Detected
Cloud service alarms alerted the DevOps engineers to a possible issue that required investigation and remediation.
Additional Information
Location(s) of affected systems: Deployed AWS EC2 instances in US-WEST-2 region.
Date and time incident handlers arrived at site: 2024/10/02 12:22 AM MDT
Describe affected information system(s)
The Illumass Cassandra cluster was affected. The affected node was replaced and a backup was loaded to resume full service. There was no actual impact on system usability, as enough read/write replicas remained available in the cluster.
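The claim that enough replicas remained available can be checked with simple quorum arithmetic. The sketch below is illustrative only: the replication factor of 3 and the QUORUM consistency level are assumptions, not values recorded in this report, and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor // 2 + 1


def quorum_available(replication_factor: int, nodes_down: int) -> bool:
    """Return True if QUORUM reads and writes can still succeed.

    Worst case: every down node holds a replica of the same data.
    """
    live = max(replication_factor - nodes_down, 0)
    return live >= quorum(replication_factor)


if __name__ == "__main__":
    rf = 3  # assumed replication factor, not taken from the report
    print(quorum(rf))
    print(quorum_available(rf, nodes_down=1))
```

Under these assumed settings, losing a single node still leaves a quorum of replicas, which would be consistent with the absence of user-facing impact described above.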
Isolate affected systems
Approval to remove from network? Yes
If YES, Name of Approver: Allan Chu
Date and Time Removed: 2024/10/02 12:35 AM MDT
Backup of Affected System(s)
Last System backup successful? Yes
Name of persons who did backup: Allan Chu
Date and time last backups started: 2024-09-30
Date and time last backups completed: 2024-09-30
Backup Storage Location: AWS S3 Glacier Storage
Incident Eradication
Name of persons performing forensics: Allan Chu
Was the vulnerability (root cause) identified: Yes
Describe
The AWS EC2 instance was reported as unable to connect.
How was eradication validated
Using AWS EC2 controls, the virtual machine was stopped and started to migrate it to a new instance. After migration, the machine was brought back into the cluster and allowed to reconnect.
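The stop and start step could also be scripted rather than done through the console. The following is a minimal sketch, not the procedure actually used in this incident: it assumes the boto3 AWS SDK, an EBS-backed instance, and a hypothetical helper name `stop_start_instance`. The `ec2` argument is expected to be a boto3 EC2 client.

```python
def stop_start_instance(ec2, instance_id: str) -> None:
    """Stop, then start, one EC2 instance, waiting for each state change.

    `ec2` is assumed to be a boto3 EC2 client (or an object with the
    same stop_instances / start_instances / get_waiter methods).
    """
    ids = [instance_id]

    # Request a stop and block until the instance reports "stopped".
    ec2.stop_instances(InstanceIds=ids)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)

    # Start it again and block until the instance reports "running".
    ec2.start_instances(InstanceIds=ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)
```

With boto3 installed and credentials configured, it could be called as `stop_start_instance(boto3.client("ec2", region_name="us-west-2"), instance_id)`, where `instance_id` is the ID of the affected node. A full stop followed by a start, unlike a reboot, typically places an EBS-backed instance on different underlying host hardware, which is why it can clear a host-level fault.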