The IBM Cloud Delivery Services (CDS) team has monitoring in place for all sites and infrastructure under our control. These monitors are designed to allow the CDS team to respond proactively to service-impacting or service-threatening events or conditions. When a site is unavailable, or infrastructure issues trigger monitor alerts, an ‘Incident Record’ is automatically generated within our Incident Management System.

At the same time, for production environments, a CDS Incident Response Team (IRT) provides 24/7 critical outage support. The goal of the IRT is to ensure our customers' applications are running when they should be, and to provide effective and timely customer communication during availability incidents or off-hours Severity 1 cases. The IRT is sometimes referred to as the "on call" team.

How is the IBM CDS Incident Response Team (IRT) organized?

The IRT is organized into a 2-person rotating schedule that covers a 12-hour shift, 7 days a week. This means that two IRT members are on duty for each 12-hour shift: a Client Communicator and a First Responder.
Support for the other 12 hours of the 24-hour cycle is provided by members of our China Delivery Team (CDL). The IRT schedule is maintained and updated by CDS on a regular basis.
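
As a rough illustration of the 12-hour split described above, the following Python sketch decides which team covers a given moment. The shift start hour (`IRT_SHIFT_START_HOUR`), the function name `covering_team`, and the absence of any time zone handling are assumptions made purely for illustration; this page does not state the actual shift boundaries.

```python
from datetime import datetime

# Hypothetical values: the real shift start hour and time zone are not
# stated on this page and are assumed here for illustration only.
IRT_SHIFT_START_HOUR = 8
SHIFT_LENGTH_HOURS = 12


def covering_team(moment: datetime) -> str:
    """Return which team covers the given moment under the assumed shift window."""
    # Hours elapsed since the assumed start of the IRT shift, wrapped to 0..23.
    hours_into_cycle = (moment.hour - IRT_SHIFT_START_HOUR) % 24
    if hours_into_cycle < SHIFT_LENGTH_HOURS:
        return "IRT"  # Client Communicator and First Responder on duty
    return "CDL"      # the other 12 hours of the 24-hour cycle


print(covering_team(datetime(2024, 1, 6, 9, 30)))
print(covering_team(datetime(2024, 1, 6, 22, 0)))
```

Because the rotation runs 7 days a week, the sketch looks only at the hour of day and ignores the weekday.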

Client Communicator
The Client Communicator (CC) is responsible for ensuring that any customer affected by a Severity 1 incident or alert receives prompt and frequent communication regarding the status of their incident. The CC does not necessarily have the technical skills or access to investigate or act upon systems that are failing. The CC may also be expected to triage requests that do not fall within the definition of Severity 1 and communicate with the customer regarding these issues.

First Responder
The First Responder (FR) is a technical role that requires access to systems that may be in a failed or failing state, as well as the skills required to understand what can be done to recover the affected environment(s). It may not be possible for the FR to correct all problems, so the FR should be equipped to escalate issues to specific individuals for resolution if necessary. The FR remains focused on incident resolution at all times and is not expected to communicate directly with customers; instead, the FR remains in regular contact with the Client Communicator on duty. It is important to note that the First Responder is precisely that, the first to respond, and is not solely responsible for solving every incident.

The First Responder will respond to alerts and off-hours Severity 1 cases to:

  1. Determine the impact of the alert or case

  2. Determine the cause of the alert or case

  3. Initiate corrective action if appropriate

  4. Alert the Client Communicator if escalation is necessary.

The First Responder's priority is to restore service. If there are any challenges to restoring service, the Client Communicator is notified. The Client Communicator then leads the recovery activities and escalates to any personnel required to resolve the issue, while ensuring that continuous communication is maintained with the customer for the duration of the incident.
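
The four first-responder steps above can be sketched as a small Python routine. This is only an illustrative model, not the team's actual tooling: the `Incident` record, its field names, and the rule "escalate when no corrective action is available" are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    summary: str
    impact: str = "unknown"
    cause: str = "unknown"
    log: list = field(default_factory=list)


def first_response(incident: Incident, impact: str, cause: str,
                   fix_available: bool) -> bool:
    """Walk the four first-responder steps; return True if escalation is needed."""
    incident.impact = impact  # Step 1: determine the impact
    incident.cause = cause    # Step 2: determine the cause
    if fix_available:
        # Step 3: initiate corrective action if appropriate
        incident.log.append("corrective action initiated")
        return False
    # Step 4: alert the on-duty client contact so escalation can begin
    # (assumed rule: escalate whenever no corrective action is available)
    incident.log.append("client contact alerted for escalation")
    return True


incident = Incident(summary="site unavailable")
needs_escalation = first_response(
    incident, impact="production site down", cause="unknown", fix_available=False
)
print(needs_escalation, incident.log)
```

In this sketch the First Responder never contacts the customer directly; the only outward action is the alert to the on-duty client contact, matching the division of roles described earlier.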

Escalation Manager / Discipline Team Members
Additional support for IRT members is provided by an Escalation Manager as well as dedicated Database and Network discipline team members. These CDS team members are also assigned to the IRT schedule to provide coverage.
