[Voice] SIPTLS connections broken
Incident Report for Sound of Data
Postmortem

On Tuesday, August 15th, we experienced a full outage of our voice system for SIPTLS-enabled accounts. This postmortem explains what went wrong, what we did to remediate the problem, and what we are doing to prevent a recurrence.

All times are in CEST / Amsterdam timezone.

Timeline of actions

At 9:00 we received a call from a customer complaining that their SIP trunk was not working. The engineer who took the call immediately started validating the complaint and found that the SIP trunk was indeed not registered. Being familiar with the customer's configuration, the engineer checked whether he could reproduce the registration failure. After confirming it was an issue on our side, we raised a P1 ticket with our vendor.

This is standard practice, as we work closely with our platform vendor on a daily basis. For priority 1 issues we request a dedicated chat channel with their support engineers.

At 9:30 we were in the channel with the appropriate engineers from our team and the vendor's team, explaining the situation and the checks already performed, and devising a plan of action.

Since the outward-facing certificate (sip.sod.cloud) was present and valid, we had to investigate the communication between the different back-end nodes. Communication between the front-end nodes and the back-end nodes is secured by different certificates with different validity chains.
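
For illustration, a check along the following lines confirms what an endpoint presents: the handshake verifies the chain, and the peer certificate gives the expiry date. This is a sketch in Python, not our tooling, and the standard SIP TLS port 5061 is an assumption here:

    import socket
    import ssl
    from datetime import datetime, timezone

    def cert_expiry(host: str, port: int = 5061) -> datetime:
        # Open a verified TLS connection and read the peer certificate.
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # The 'notAfter' field looks like 'Aug 15 09:00:00 2024 GMT'.
        expiry = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
        return expiry.replace(tzinfo=timezone.utc)

    print(cert_expiry("sip.sod.cloud"))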

At 10:30 we finished the investigative work to understand the different components involved and tried reloading a new configuration into the edge proxy nodes. We generated new certificates on the edges as a possible solution to the problem.

At 10:45 we found that reloading the different certificates did not solve the issue, but it gave us a clue to solving the problem.

At 11:30 we regenerated the back-end proxy certificates. Generating these is a tricky process, since other modules in the back-end also rely on these certificates; we had to make sure we did not break other functionality of the system.

At 12:30 we activated the newly generated certificates on the back-end and front-end services, and service was restored immediately.

Root Cause Analysis (RCA)

The root cause of the problems experienced is a bug in the certificate renewal script, which runs periodically. The script is responsible for creating valid certificates and loading them onto the various nodes. Services such as Grafana, Elasticsearch, RADIUS, and inter-node communication for SIP rely on these certificates to provide an encrypted environment. The script runs on a daily basis but does not replace certificates as long as the existing ones remain valid for a long time. One of the certificates had a remaining validity of less than 90 days, which is the trigger to start the renewal scenario.
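
The trigger logic described above amounts to something like the following sketch (the 90-day threshold is taken from the description; the names are illustrative, not the vendor's script):

    from datetime import datetime, timedelta, timezone

    RENEWAL_THRESHOLD = timedelta(days=90)  # renew once validity drops below 90 days

    def needs_renewal(not_after: datetime) -> bool:
        # The daily run leaves a certificate untouched while it is still
        # valid for a long time; crossing the threshold starts renewal.
        remaining = not_after - datetime.now(timezone.utc)
        return remaining < RENEWAL_THRESHOLD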

The renewal trigger, however, did not work correctly due to a bug introduced in the renewal script, which created a certificate with an invalid root CA. This bug was recently discovered by the vendor, and a patch is currently in QA but has not been officially released yet.
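
A chain built from the wrong root is exactly the kind of fault a verified handshake catches. As an illustrative sketch (the host, port, and CA bundle path are placeholders, not our configuration), a check like this returns False when the presented chain does not validate against the expected root:

    import socket
    import ssl

    def chains_to_expected_root(host: str, port: int, ca_bundle: str) -> bool:
        # Trust only the expected root CA; the handshake fails if the
        # server's chain was signed by anything else.
        ctx = ssl.create_default_context(cafile=ca_bundle)
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                with ctx.wrap_socket(sock, server_hostname=host):
                    return True
        except ssl.SSLCertVerificationError:
            return False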

With the invalid certificate loaded, TLS communication breaks down and new sessions cannot be created.
Additionally, the monitoring processes in place (the Basic Functionality Test (BFT) and network and system monitoring) include few TLS-specific checks, which is why the self-test monitoring did not report any issues related to SIPTLS.

Improvement plan

  • We’re doing a complete overhaul of the monitoring systems to include both SIP- and SIPTLS-related tests.
  • We’re adding additional continuous checks related to SIPTLS and inter-node communication (see the sketch after this list).
  • The renewal script has been disabled for now, until the official patch has been released and we’ve tested it in our staging environment.
  • We’re working on a specific plan of action with our vendor’s engineering team to bring down the time to recovery when certificate-related issues are detected. This work is already underway, and we expect to implement it within the next 2 weeks.
  • We’re planning an upgrade of the system to a newer version that includes additional improvements to the BFT and monitoring. Upgrading is always a big effort; we had already planned to do this within the next 12 weeks, but the upgrade will be moved forward to sometime in the next 8 weeks. Please make sure you are signed up for our status page notifications, as we will announce any major upgrades and maintenance through the status page.
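
As an illustration of the continuous check mentioned above, a functional SIPTLS probe can go a step beyond certificate inspection: establish a verified TLS session, send a SIP OPTIONS request, and treat any SIP reply as proof that the path works. The sketch below is illustrative only; the endpoint, port, and headers are placeholders, not our production monitoring:

    import socket
    import ssl

    def siptls_options_probe(host: str, port: int = 5061) -> bool:
        # Minimal SIP OPTIONS request; any SIP/2.0 reply means both the
        # TLS session and the SIP stack behind it are working.
        request = (
            f"OPTIONS sip:{host} SIP/2.0\r\n"
            "Via: SIP/2.0/TLS probe.invalid;branch=z9hG4bK-probe1\r\n"
            "Max-Forwards: 70\r\n"
            "From: <sip:probe@probe.invalid>;tag=1\r\n"
            f"To: <sip:{host}>\r\n"
            "Call-ID: probe-1@probe.invalid\r\n"
            "CSeq: 1 OPTIONS\r\n"
            "Content-Length: 0\r\n\r\n"
        )
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                tls.sendall(request.encode("ascii"))
                reply = tls.recv(4096).decode("ascii", errors="replace")
        return reply.startswith("SIP/2.0")

    print(siptls_options_probe("sip.sod.cloud"))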

Final remarks

This incident has been a frustrating experience for our customers and a humbling moment for all Sound of Data teams. We are keenly aware of our responsibility as your partner. You trust us and our platform to be your voice gateway, and we are sorry that we did not live up to that trust on this day. We are committed to learning from this experience and to delivering meaningful improvements to our service and our communication.

If you have any additional questions about this outage, please reach out to us; we’re happy to answer them.

Thomas Hazelaar
CTO
Sound of Data

Posted Aug 16, 2023 - 09:55 CEST

Resolved
SIPTLS-based connections and registrations failed due to a faulty regenerated root certificate on the cluster nodes. New connections could not establish a valid TLS session, causing calls to fail.
Posted Aug 15, 2023 - 09:00 CEST