During a new deployment of our voice routing API, a configuration error was introduced that caused all of the instances hosting the voice routing API to refresh at the same time. As a result, the voice routing API became unavailable: the new instances were still in the provisioning phase while the older instances had already been decommissioned.
Corrective measures:
When our alarms indicated that the routing API had become unstable, we immediately started our incident response process with our senior DevOps and engineering teams. The response is set up in three phases: identify, correct, and monitor.
Because we were able to identify the problem quickly, we intervened in the auto scaling process and stopped any further decommissioning of services. We then corrected the deployment and redeployed the software. This resolved the problem, and the new version of the API became available again.
After that, we performed two more deployment runs to make sure that the configuration was indeed correct for any future upgrades.
Future preventative measures:
We've adjusted our automated deployment software to check for certain conditions that could trigger this behavior and to block the rollout of a new version when any of them is detected.
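To illustrate the kind of pre-deployment check described above, here is a minimal sketch in Python. It is not our actual deployment tooling: the names RolloutPlan, rollout_violations and is_rollout_allowed, and the specific conditions checked, are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RolloutPlan:
    """Hypothetical description of a planned rollout (illustrative only)."""
    total_instances: int  # instances currently serving traffic
    max_batch_size: int   # instances refreshed at the same time
    min_healthy: int      # instances that must keep serving during the rollout


def rollout_violations(plan: RolloutPlan) -> List[str]:
    """Return the reasons this plan should be blocked (empty list if none)."""
    violations = []
    if plan.total_instances <= 0:
        violations.append("no instances are currently in service")
        return violations
    # Assumed condition 1: never refresh every instance in a single batch.
    if plan.max_batch_size >= plan.total_instances:
        violations.append("batch would refresh every instance at once")
    # Assumed condition 2: enough instances must stay in service throughout.
    if plan.total_instances - plan.max_batch_size < plan.min_healthy:
        violations.append("too few instances would remain in service")
    return violations


def is_rollout_allowed(plan: RolloutPlan) -> bool:
    """A rollout is allowed only when no violation is found."""
    return not rollout_violations(plan)
```

In this sketch, a plan that refreshes 2 of 6 instances at a time while keeping at least 3 in service would be allowed, whereas a plan that refreshes all 6 at once would be blocked before any instance is decommissioned.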
If you have any additional questions, please contact our customer service team at customerservice@soundofdata.nl.
Thomas Hazelaar
CTO
Sound of Data