Service Disruption - ID3 Global Degraded Platform Issue
Incident Report for GBG
Postmortem

Incident Details

Incident 1 - 27/09/2022 10:03 - 11:58 - P1

Incident 2 - 28/09/2022 08:22 - 10:02 - P1

Incident 3 - 29/09/2022 08:43 - 09:00 - P1

Incident 4 - 01/10/2022 11:03 - 13:13 - P2

Incident 5 - 01/10/2022 16:13 - 16:32 - P2

Incident 6 - 02/10/2022 15:25 - 16:11 - P2

Incident 7 - 03/10/2022 15:23 - 16:33 - P2

Incident Description

This incident was initially assigned a P1 status due to the recent migration of ID3global from on-premises to cloud. GBG understood the importance, severity and impact of this incident, and urgent support was therefore provided. The actual incident is classified under the GBG Contract as a P2 and will be recorded as such.

In summary, customers processing transactions for UK item checks, where the profile contained local data, encountered latency; in some cases, fatal errors were returned. This was due to instability in the Address Component.

Several IIS settings that were in place in the on-premises platform were not in place in Azure. The team was informed that ADT is unstable, and that these IIS settings are required to allow it to handle failures safely and to permit replacement instances to be started by IIS where problems occur.

Impact

During these incidents, all customers using item checks that rely on local data encountered transaction latency and fatal errors.

Root Cause

When the ID3global platform in Azure was created and the Application Servers were migrated from the old platform, certain IIS settings related to the ADT service (Address Component) were not brought through consistently, and some other service parameters, such as memory allocation, were changed.

Over several months of testing, and subsequently two weeks of serving live traffic on the Azure platform, no degradation of the ADT component was observed until the first incident occurred on 27/09/2022. As the incident recurred over the following days, the team was unable to identify the trigger for the service failure, but did pinpoint remediation steps: restoring the IIS settings and allocating additional memory to the ADT service.

The Azure platform has been built with redundancy, so failures of a single service should not affect the overall platform. However, in this case it was found that undocumented behaviour in the health checks for Azure App Service for Containers meant traffic kept being routed to the unhealthy service instance, increasing the impact on the service.
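For context, routing away from unhealthy instances in Azure App Service depends on a health check probe being configured for the app. The sketch below shows one common way to enable such a probe via the Azure CLI; the resource group, app name, and probe path are placeholder values, not details from this incident.

```shell
# Illustrative sketch only: enables the App Service health check probe so the
# platform load balancer can stop routing traffic to instances that fail it.
# Resource group, app name, and path are hypothetical placeholders.
az webapp config set \
  --resource-group rg-id3global-prod \
  --name adt-service \
  --generic-configurations '{"healthCheckPath": "/healthz"}'
```

Instances that repeatedly fail the probe at the configured path are removed from the load-balancer rotation, which is the behaviour the team is verifying with Microsoft.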

Remediation Steps

The GBG Support Teams have identified the area of the system where the root cause manifests and have established quick remediation steps should it recur, minimising the duration of any service degradation. As preventative action, the IIS settings have been aligned with the on-premises platform and the memory allocation for the ADT service has been increased. Since these steps were carried out, no service degradation has been observed. The parameters that trigger the service degradation have yet to be identified.
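The report does not name the specific IIS settings that were realigned. As an illustration only, worker-process failure handling and memory limits of the kind described are typically configured per application pool in IIS's applicationHost.config; the pool name and values below are hypothetical placeholders.

```xml
<!-- Illustrative sketch only: pool name and values are placeholders. -->
<applicationPools>
  <add name="ADTServicePool">
    <!-- Recycle the worker process when it exceeds a private memory limit (in KB),
         bounding the memory growth of an unstable service -->
    <recycling>
      <periodicRestart privateMemory="4194304" />
    </recycling>
    <!-- With rapid-fail protection disabled, IIS keeps replacing crashed worker
         processes instead of taking the pool offline after repeated failures -->
    <failure rapidFailProtection="false" />
    <!-- Ping the worker process periodically so unresponsive workers are replaced -->
    <processModel pingingEnabled="true" pingInterval="00:00:30" />
  </add>
</applicationPools>
```

Settings like these let IIS detect and replace failing worker processes automatically, matching the behaviour described for the on-premises platform.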

The team is engaging with Microsoft to ensure that service resiliency works as expected, routing traffic only to healthy service instances. All of these changes have been applied to both the Production and Disaster Recovery sites used by live customers. The changes have also been applied to non-production sites for testing purposes.

Future Steps

Monitoring since remediation has confirmed that the incident has reached a resolved state. GBG continues to improve the design of this service, and work is underway to improve our testing framework so that the incident triggers can be reproduced and any underlying causes within the ADT service remediated.

Posted Oct 14, 2022 - 17:09 BST

Resolved
We're pleased to inform you that the service disruption has now been resolved and users should be able to access the portal and complete transactions as normal.

The issue was confirmed as intermittent connectivity issues preventing access to the portal and/or slowness/timeouts when processing transactions. The support teams have restarted the unhealthy instances, which resolved the issue. The Support Teams are working to identify the root cause to prevent further occurrences.

We would again like to apologise to the customers affected by this issue.
Posted Sep 27, 2022 - 12:36 BST
Update
Unfortunately, the service issue is ongoing. We continue to treat this matter as our top priority and all technical teams are working to determine the cause and resolution. We apologise that this issue is taking longer to resolve than we would like.

Investigations so far have determined that degraded performance on the ADT service in ID3global is the cause.

Unfortunately we are unable to provide a fix ETA at the moment.

We will update you at 13:15 BST 27/09/2022 or as soon as the issue is resolved. Thank you for your continued patience.
Posted Sep 27, 2022 - 12:18 BST
Identified
Initial investigations have identified degraded performance on the ADT service in ID3global.

Users may be experiencing intermittent connectivity issues preventing access to the portal and/or slowness/timeouts when processing transactions.

We’re working to restore full service.

Unfortunately we are unable to provide a fix ETA at the moment.

We will update you at 12:15 BST 27/09/2022 or as soon as the issue is resolved. Thank you for your patience whilst we restore our service back to our usual level.
Posted Sep 27, 2022 - 11:23 BST
Investigating
We are investigating an issue affecting our service. We apologise for the impact this may be having on your business. Our Incident Team is working to identify the root cause and implement a solution as a priority. We hope to advise you within a few minutes that the issue has been resolved; should this not be the case, we will provide a further update and an estimated resolution time at 11:15 BST 27/09/2022.
Posted Sep 27, 2022 - 10:45 BST
This incident affected: GBG ID3global (ID3global Portal, ID3global Service).