DNS observability is an essential part of any Ops team's strategy. Looking for proof? It's happening right now.

It has been a busy week for Ops teams across the globe. Many were forced to urgently rotate SSL certificates after one of Let's Encrypt's root certificates expired.

Collaboration plays a critical role in such situations, where members of one or more teams must communicate and work with each other to complete a collective task rapidly and efficiently. Unfortunately, things got more challenging this week as one of the world's largest collaboration and messaging applications, Slack, was not accessible for various users worldwide during the same time period.

DNS misconfiguration is at the core of this issue. If the process of DNS resolution fails, users experience outages like this. However, you can take action to avoid business impact. Let's start by breaking down the issue that's happening currently.

Users were not able to access the desktop, mobile, and web applications of Slack from 15:30 UTC onwards. The outage was related to a DNS failure, which was later acknowledged by Slack. At the time of publication, the issue is still ongoing for some users, and at 06:57 UTC Slack announced it may take up to 24 hours to completely resolve this issue for all users. Users were struggling to understand whether they could not access Slack because of their device, wireless network or ISP connectivity. Things got more difficult as Slack's status page was down due to the same issue.

Why Monitoring From The Cloud Isn't Enough

During such incidents, where Operations teams are not able to collaborate efficiently with each other, things can easily get out of control. This can lead to outages that directly impact customers. IT teams in some organizations might already monitor their SaaS applications, but it is not surprising if none of them triggered any alarms for Slack. Most monitoring solutions are hosted on cloud instances. Monitoring applications from cloud instances leaves dangerous blind spots and does not accurately represent end user experience.

A good monitoring and observability strategy must include a combination of observation across backbone and last mile networks. The backbone network has predefined bandwidth and consistent network connectivity. This allows you to monitor, measure and benchmark application performance without any network fluctuations. At the same time, the last mile network represents availability and performance for real end users who are trying to access digital services on their home/office networks.

Catchpoint's Last Mile Tests Detected DNS Issues As The Root Cause

Catchpoint's last mile tests detected Slack DNS issues, allowing the platform to proactively notify respective teams.

Scatterplot data from Catchpoint showing intermittent failures for Slack DNS tests

Catchpoint records showing server failure while resolving domain

Even after 15 hours of the outage, some users still cannot access Slack. Those who are aware of the issue and its root cause can mitigate it by overriding their default DNS resolver with a public DNS resolver such as 8.8.8.8 or 1.1.1.1. Slack has now confirmed the outage was “caused by our own change and not related to any third-party DNS software and services.” This was related to Slack's TTL allowing for caching of responses for up to two days. The lesson here? DNS might be a small service in the delivery chain, but minor mistakes in configuration can take hours to recover from if you have a large TTL for your records. Read more about how TTL can impact DNS responses.

Understand How to Resolve DNS Issues More Quickly

DNS is at the core of the Internet. If the process of DNS resolution fails, users will experience outages such as this one. The fix is easy, but only if you know what needs to be fixed! Observing the DNS of all your essential SaaS services from the cloud, backbone and last mile is essential to understanding the true performance of DNS.

Watch our DNS How-To video series to find out how to verify DNS server mapping, and other DNS-related tips!

For further information on major incidents in 2021, please check out our new report. You'll find detailed analysis, as well as a checklist of best practices to prevent, prepare for, and respond to an outage.

Download “2021 Internet Outages: A compendium of the year's mischiefs and miseries – with a dose of actionable insights.”

Slack knew how to set up their infrastructure. Nothing in the postmortem implies AWS was misconfigured. AWS spotted the problem and fixed it entirely on their side. Nothing in this report suggests that Slack has unique usage patterns. Users returning to work after Christmas is not a phenomenon unique to Slack. The AWS infrastructure broke due to an event as predictable as the start of the year.
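Two points in this article lend themselves to a hands-on check: the mitigation of pointing at a public resolver such as 8.8.8.8 or 1.1.1.1, and the role of TTL in how long a cached answer lingers. The Python below is a minimal sketch using only the standard library, not a description of how Catchpoint or Slack test DNS. The function names (`build_query`, `parse_a_records`, `resolve_via`) are hypothetical, and the sketch assumes plain UDP on port 53, A records only, no retries and no handling of truncated responses.

```python
import random
import socket
import struct


def build_query(hostname, query_id):
    """Encode a standard recursive DNS query for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def skip_name(message, offset):
    """Return the offset just past a (possibly compressed) domain name."""
    while True:
        length = message[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # 2-byte compression pointer ends the name
            return offset + 2
        offset += 1 + length


def parse_a_records(message):
    """Extract (IPv4 address, TTL in seconds) pairs from a DNS response."""
    _, flags, qdcount, ancount, _, _ = struct.unpack("!HHHHHH", message[:12])
    rcode = flags & 0x000F
    if rcode != 0:
        # RCODE 2 is SERVFAIL, the "server failure" seen during resolution.
        raise RuntimeError(f"resolver returned RCODE {rcode}")
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(message, offset) + 4  # skip QTYPE and QCLASS
    records = []
    for _ in range(ancount):
        offset = skip_name(message, offset)
        rtype, _rclass, ttl, rdlength = struct.unpack(
            "!HHIH", message[offset:offset + 10]
        )
        offset += 10
        if rtype == 1 and rdlength == 4:  # A record carrying an IPv4 address
            records.append((socket.inet_ntoa(message[offset:offset + 4]), ttl))
        offset += rdlength
    return records


def resolve_via(resolver_ip, hostname, timeout=3.0):
    """Ask one specific resolver for hostname and return (address, TTL) pairs."""
    query_id = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname, query_id), (resolver_ip, 53))
        response, _ = sock.recvfrom(4096)
    if struct.unpack("!H", response[:2])[0] != query_id:
        raise RuntimeError("response ID does not match the query")
    return parse_a_records(response)
```

Calling `resolve_via("8.8.8.8", "slack.com")` and `resolve_via("1.1.1.1", "slack.com")` and comparing the results with what your default resolver returns is one rough way to tell whether a failure is specific to a resolver's cache. The TTL in each pair is the number of seconds the resolver may keep serving that answer, so a value near 172800 (two days) means a bad record could persist about that long after the authoritative fix.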