
Google Cloud suffers global outage caused by “server configuration error”

Date: 2019-06-10

An incorrect server configuration change throttled network capacity in multiple regions.

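Google has disclosed some details pointing to the root cause of Sunday’s large-scale outage. The outage affected not only Google’s own services, including YouTube, Gmail, Google Search, G Suite, Google Drive and Google Docs, but also major technology brands that run on Google Cloud.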

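Benjamin Treynor Sloss, Google’s vice president of engineering, explained in a blog post that the root cause of last Sunday’s outage (Monday, Beijing time) was a configuration change intended for a small group of servers in a single region that was mistakenly applied to a large number of servers across several neighboring regions.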

The error then left those regions unable to use more than half of their available network capacity.

The impact was severe for high-bandwidth platforms such as YouTube, but less so for low-bandwidth services such as Google Search, which saw only a brief increase in latency.

Sloss said: “Overall, YouTube measured a 10% drop in global views during the incident, while Google Cloud Storage measured a 30% reduction in traffic.”

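“Approximately 1% of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn’t receive or send email.”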

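The Google Cloud status dashboard showed that the Google Cloud network experienced congestion in the eastern United States, affecting Google Cloud, G Suite and YouTube. The disruption lasted four hours and was resolved at 4 p.m. Pacific Time.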

Sloss explained that the capacity-constrained regions became congested as inbound and outbound traffic tried to squeeze into the remaining capacity.

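He noted: “The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.”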

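Although Google’s engineers detected the problem “within seconds”, fixing it took “far longer” than the target of a few minutes, in part because the network congestion hampered the engineers’ ability to restore the correct configuration.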

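In addition, as a Google employee explained in a Hacker News thread, the outage took down the internal tools that Google engineers use to communicate with one another and share updates on an outage.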

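Sloss’s post is not the full post-mortem report the company has promised its customers: the investigation is still under way and aims to identify every factor that contributed to the sharp loss of network capacity and the slow recovery.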

An update on Sunday’s service disruption

Yesterday, a disruption in Google’s network in parts of the United States caused slow performance and elevated error rates on several Google services, including Google Cloud Platform, YouTube, Gmail, Google Drive and others. Because the disruption reduced regional network capacity, the worldwide user impact varied widely. For most Google users there was little or no visible change to their services—search queries might have been a fraction of a second slower than usual for a few minutes but soon returned to normal, their Gmail continued to operate without a hiccup, and so on. However, for users who rely on services homed in the affected regions, the impact was substantial, particularly for services like YouTube or Google Cloud Storage which use large amounts of network bandwidth to operate.

For everyone who was affected by yesterday’s incident, I apologize. It’s our mission to make Google’s services available to everyone around the world, and when we fall short of that goal—as we did yesterday—we take it very seriously. The rest of this document explains briefly what happened, and what we’re going to do about it.

Incident, Detection and Response

In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity. The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.
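The triage described above can be illustrated with a small sketch. This is not Google's actual mechanism, which the post does not describe in detail; it is a simplified model, assuming each traffic flow has a bandwidth demand and a latency-sensitivity flag, and that flows are admitted in priority order until the reduced capacity is used up. The `Flow` type, the `triage` function, the flow names and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    name: str                # hypothetical label, for illustration only
    demand_gbps: float       # bandwidth the flow wants
    latency_sensitive: bool  # True for small interactive traffic


def triage(flows, capacity_gbps):
    """Admit flows in priority order until capacity runs out.

    Priority (an assumption of this sketch): latency-sensitive flows
    first, then smaller flows before larger ones.
    """
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.demand_gbps))
    admitted, dropped = [], []
    remaining = capacity_gbps
    for flow in ordered:
        if flow.demand_gbps <= remaining:
            admitted.append(flow)
            remaining -= flow.demand_gbps
        else:
            dropped.append(flow)
    return admitted, dropped


flows = [
    Flow("search-query", 1.0, True),
    Flow("mail-sync", 2.0, True),
    Flow("video-stream", 40.0, False),
    Flow("bulk-storage-copy", 30.0, False),
]

# Made-up numbers: a region that normally has 100 Gbps is left with 40
# after losing more than half of its capacity.
admitted, dropped = triage(flows, capacity_gbps=40.0)
print("admitted:", [f.name for f in admitted])
print("dropped: ", [f.name for f in dropped])
```

In this toy model the small latency-sensitive flows always get through, while the largest bulk flow is the one that is shed, which matches the pattern the post reports: a brief latency increase for Google Search, but a substantial impact on high-bandwidth services.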

Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage. The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelize restoration efforts.

Impact

Overall, YouTube measured a 10% drop in global views during the incident, while Google Cloud Storage measured a 30% reduction in traffic. Approximately 1% of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn’t receive or send email. As Gmail users ourselves, we know how disruptive losing an essential tool can be! Finally, low-bandwidth services like Google Search recorded only a short-lived increase in latency as they switched to serving from unaffected regions, then returned to normal.

Next Steps

With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors to both the network capacity loss and the slow restoration. We will then have a focused engineering sprint to ensure we have not only fixed the direct cause of the problem, but also guarded against the entire class of issues illustrated by this event.

Final Thoughts

We know that people around the world rely on Google’s services, and over the years have come to expect Google to always work. We take that expectation very seriously—it is our mission, and our inspiration. When we fall short, as happened Sunday, it motivates us to learn as much as we can, and to make Google’s services even better, even faster, and even more reliable.
