Services Elevated Response Times

Incident Report for Planning Center

Postmortem

On Sunday morning, July 14th, 2019, a significant percentage of Planning Center Services and Music Stand customers experienced slow performance or errors between 5:45am and 7:09am PDT.

We understand that even a few minutes of slowness with our applications is frustrating, and we are truly sorry for the disruption this situation may have caused you and your church.

We know you rely on Planning Center being available and fast on Sunday mornings, which is why we take it so seriously when we fail. Since last weekend, our team has taken the time to investigate what happened and to come up with long-term solutions, and we want to be transparent with you about what we learned.

What Happened

At 5:45am on Sunday morning our operations team started receiving automated notifications alerting us that both Services and Music Stand were experiencing a large spike in errors and sluggish performance.

We started investigating immediately and soon found the problem: our internal systems were exceeding the number of connections allowed to the primary database server for Services, which in turn degraded the customer experience.

We tied this issue to a configuration change we made in Services earlier in the week. Most of our other applications have had the same configuration in place for years and experienced no connection challenges. However, because Services operates at a significantly higher scale on Sunday mornings than other Planning Center applications, it hit a connection ceiling we did not anticipate.
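To illustrate how a setting that is safe at one scale can hit a connection ceiling at another, here is a minimal sketch. It assumes the worst-case connection count is roughly the number of application servers times the processes per server times the connections each process may hold. The function name `peak_connections` and every number below are hypothetical and chosen only for illustration; they do not describe the actual Services configuration or the setting that changed.

```python
def peak_connections(app_servers: int, processes_per_server: int, pool_size: int) -> int:
    """Worst-case number of database connections the application tier can open."""
    return app_servers * processes_per_server * pool_size


# Hypothetical values for illustration only (not real Planning Center numbers).
MAX_CONNECTIONS = 500  # assumed connection limit on the database server
POOL_SIZE = 5          # assumed connections held per application process

# The same per-process setting, evaluated at two different fleet sizes.
weekday = peak_connections(app_servers=4, processes_per_server=8, pool_size=POOL_SIZE)
sunday = peak_connections(app_servers=20, processes_per_server=8, pool_size=POOL_SIZE)

print(weekday, weekday <= MAX_CONNECTIONS)  # 160 True: fits under the limit
print(sunday, sunday <= MAX_CONNECTIONS)    # 800 False: exceeds the limit
```

Under these assumed numbers, the per-process setting is identical in both cases. Only the number of running processes changes, which is why a configuration can work for years in smaller applications and still fail in a larger one.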

To solve the problem, our team worked to raise the connection limit on the existing database server while simultaneously bringing a larger database server with a higher connection limit online. By successfully moving over to this larger system, we reduced the error rate and brought the application back up to speed for our customers.

This was our short-term fix; ultimately, we had to completely roll back the configuration that caused the issue in the first place. We did this by manually editing the existing build to restore the configuration file to the previous week’s settings. After verifying that the rollback was successful, we terminated the offending configuration, and service was fully restored.

Lessons Learned

As a result of this weekend, we are improving our deployment tools to ensure we can always revert a bad configuration, and we are adding explicit monitoring and alerting for this scenario. These measures will both reduce the chance of this specific issue occurring again and help us be better prepared if something similar does arise.
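As a rough sketch of what explicit alerting on connection usage could look like, the check below compares the number of active connections against the server limit and returns a severity level. The function name `connection_alert_level` and the 70% and 90% thresholds are assumptions made for this example, not a description of our actual monitoring.

```python
def connection_alert_level(active: int, limit: int,
                           warn_at: float = 0.7, critical_at: float = 0.9) -> str:
    """Classify database connection usage as 'ok', 'warning', or 'critical'.

    The thresholds are fractions of the connection limit and are
    hypothetical defaults chosen for illustration.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    usage = active / limit
    if usage >= critical_at:
        return "critical"
    if usage >= warn_at:
        return "warning"
    return "ok"


# Example readings against an assumed limit of 500 connections.
for active in (100, 400, 475):
    print(active, connection_alert_level(active, limit=500))
```

Alerting at a warning threshold well below the limit is meant to give an operations team time to react before customers see errors, rather than learning about the ceiling only once it has been reached.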

Thank you for your patience and graciousness with our team this Sunday as we worked to find a solution. We have learned from this experience and will continue to work hard to ensure you can always rely on Planning Center.

Posted Jul 16, 2019 - 11:50 PDT

Resolved

This incident has been resolved.

Posted Jul 14, 2019 - 07:35 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 14, 2019 - 07:13 PDT

Investigating

Planning Center Services is loading slowly for many customers. We are investigating.

Posted Jul 14, 2019 - 06:03 PDT

This incident affected: Services.