Incident Description
On Wednesday, May 7, 2025, a number of UK customers were unable to access any of their Researcher suite applications (e.g., GEM, Ethics Monitor). This caused an application outage with critical impact to the affected institutions, and the incident was escalated as an Urgent priority/Severity 1 issue.
Time Frame of Impact
Reported: Wednesday, May 7, 2025 @ 8:46 am BST (initial incident report)
Resolved: Impacted customers and applications were restored on a rolling basis throughout Friday, May 9, 2025. Restoration times varied because of differences in system complexity. Specific restoration times can be discussed with each affected customer through their Customer Success Manager.
Cause
On May 7, 2025, an internal application server (“app7”) that hosts applications for a subset of Researcher product customers in the UK crashed due to a server boot issue. The server's Amazon Machine Image (AMI) used a legacy operating system (OS) that is not supported by Amazon Web Services (AWS) hypervisors (a hypervisor creates and manages virtual machines and allocates their CPU and memory). Support for this legacy OS ceased on July 25, 2024, and a re-architecture project to address this was already on the roadmap when the incident occurred. The Infrastructure team was unable to stand up any new application servers from the original OS AMI to resolve the issue.
Resolution
To resolve the issue, the Infrastructure team had to architect and then implement the migration of the affected customers to DuploCloud EKS (a more modern platform that simplifies cloud infrastructure) with Amazon Linux. This migration containerizes the applications, which makes them easier to provision, manage, and scale. DuploCloud also provides greater stability and reliability from a performance perspective. Additional technical improvements include database migration to RDS, single non-shared JVM instances, Infrastructure as Code (IaC) with Terraform, and application configuration management with DynamoDB. Email functionality has also been migrated to Amazon SES.
Preventative Steps/Follow-Up Tasks
To reduce the likelihood of this issue recurring, we are taking the following steps:
- The Infrastructure team will collaborate with the Engineering team to continue migrating any remaining legacy application servers to the DuploCloud environment to mitigate any future performance concerns.
- Unfortunately, a number of residual issues arose after the affected customers were migrated to DuploCloud, and these are still impacting various customers.
- Each of these issues has been organized and tracked, and the Infrastructure and Engineering teams are continuing to address them as quickly as possible.