refirio.org

Memo

■Server is slow / cannot connect to the server 19

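Suddenly, I could not connect to EC2 at all.
Even when checking from the AWS console, the status was shown as "cannot be checked".

Checking the console alerts (the bell icon in the top menu) showed the following.
When I first checked, only the "09:18 PM PDT" entry was there.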
Network Connectivity

[09:18 PM PDT] We are investigating connectivity issues affecting some instances in a single Availability Zone in the AP-NORTHEAST-1 Region.

[09:47 PM PDT] We can confirm that some instances are impaired and some EBS volumes are experiencing degraded performance within a single Availability Zone in the AP-NORTHEAST-1 Region. Some EC2 APIs are also experiencing increased error rates and latencies. We are working to resolve the issue.

[10:27 PM PDT] We have identified the root cause and are working toward recovery for the instance impairments and degraded EBS volume performance within a single Availability Zone in the AP-NORTHEAST-1 Region.

[11:40 PM PDT] We are starting to see recovery for instance impairments and degraded EBS volume performance within a single Availability Zone in the AP-NORTHEAST-1 Region. We continue to work towards recovery for all affected instances and EBS volumes.


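This appears to be a failure of an Availability Zone in the Tokyo region itself, and AWS says it is investigating.
Although these are not official sources, news reports had also appeared.

JST Friday, August 23, 2019 13:18:32 EC2 (Tokyo) Notice: Network connectivity issue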
https://twitter.com/awsstatusjp/status/1164754914695770113?s=20

AWS outage affects smartphone payment service "PayPay" and others - Engadget Japan
https://japanese.engadget.com/2019/08/23/aws/

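Large-scale AWS outage in progress: "Azur Lane" reports connection problems; "Another Eden", "DanMemo", "SINoALICE", "Garupa" and others also affected [Update] | Social Game Info
https://gamebiz.jp/?p=246869


■Aftermath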

In the end, the console alerts showed the following.
Instance connectivity | Instance Availability

[09:18 PM PDT] We are investigating connectivity issues affecting some instances in a single Availability Zone in the AP-NORTHEAST-1 Region.

[09:47 PM PDT] We can confirm that some instances are impaired and some EBS volumes are experiencing degraded performance within a single Availability Zone in the AP-NORTHEAST-1 Region. Some EC2 APIs are also experiencing increased error rates and latencies. We are working to resolve the issue.

[10:27 PM PDT] We have identified the root cause and are working toward recovery for the instance impairments and degraded EBS volume performance within a single Availability Zone in the AP-NORTHEAST-1 Region.

[11:40 PM PDT] We are starting to see recovery for instance impairments and degraded EBS volume performance within a single Availability Zone in the AP-NORTHEAST-1 Region. We continue to work towards recovery for all affected instances and EBS volumes.

[01:54 AM PDT] Recovery is in progress for instance impairments and degraded EBS volume performance within a single Availability Zone in the AP-NORTHEAST-1 Region. We continue to work towards recovery for all affected instances and EBS volumes.

[02:39 AM PDT] The majority of impaired EC2 instances and EBS volumes experiencing degraded performance have now recovered. We continue to work on recovery for the remaining EC2 instances and EBS volumes that are affected by this issue. This issue affects EC2 instances and EBS volumes in a single Availability Zone in the AP-NORTHEAST-1 region.

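[04:18 AM PDT] Starting at 12:36 JST on August 23, 2019, a certain percentage of EC2 servers in a single Availability Zone in AP-NORTHEAST-1 overheated. As a result, EC2 instances and EBS volumes in that Availability Zone experienced degraded performance.
The overheating was caused by a failure of the control system for some of the redundant air-conditioning equipment in the affected Availability Zone.
At 15:21 JST the cooling units were restored and room temperatures began returning to normal. Once temperatures returned to normal, power was restored to the affected instances.
From 18:30 JST, the majority of EC2 instances and EBS volumes had recovered.
We are working on recovering the remaining EC2 instances and EBS volumes.
A small number of EC2 instances and EBS volumes remain on hardware hosts that lost power.
We continue to work on recovering all affected EC2 instances and EBS volumes.
For faster recovery, we recommend replacing the remaining affected EC2 instances and EBS volumes where possible.
Some affected EC2 instances may require action on the customer's side, so we plan to notify those customers individually later.

For details, see https://aws.amazon.com/message/56489/ . If you have additional questions, contact AWS Support (https://aws.amazon.com/support).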
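
The status log is stamped in PDT, while the AWS summary and the news reports use Japan time, so lining the two up takes a 16-hour shift. The following is a minimal sketch of that conversion in Python, not something from the original memo. The function name `pdt_to_jst` is hypothetical, and the PDT calendar dates are an assumption: the log shows only times, so the dates are inferred from the JST times quoted in this memo.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: PDT is UTC-7, JST is UTC+9 (Japan has no daylight saving time).
PDT = timezone(timedelta(hours=-7), "PDT")
JST = timezone(timedelta(hours=9), "JST")


def pdt_to_jst(date_str: str, time_str: str) -> datetime:
    """Convert a log stamp such as '09:18 PM' on a given PDT date to JST."""
    naive = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %I:%M %p")
    return naive.replace(tzinfo=PDT).astimezone(JST)


# Assumed dates (not shown in the log itself).
log_stamps = [
    ("2019-08-22", "09:18 PM"),
    ("2019-08-22", "11:40 PM"),
    ("2019-08-23", "02:39 AM"),
    ("2019-08-23", "04:18 AM"),
]

for date_str, time_str in log_stamps:
    jst = pdt_to_jst(date_str, time_str)
    print(f"{time_str} PDT -> {jst:%Y-%m-%d %H:%M} JST")
```

Under these assumptions, the first entry (09:18 PM PDT) falls at 13:18 JST on August 23, which agrees with the 13:18:32 JST notice quoted earlier in this memo.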


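The following sites, among others, also discuss the cause.
The cause was a fault in the air-conditioning equipment, and reportedly the impact could not be prevented even with a redundant multi-AZ configuration.

AWS outage: most recovery complete; cause was "server overheating" - ITmedia NEWS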
https://www.itmedia.co.jp/news/articles/1908/23/news117.html

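AWS reports details on the large-scale outage in the Tokyo region on the afternoon of the 23rd: bug in the cooling system, failsafe failed, and the switch to manual operation got no response - Publickey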
https://www.publickey1.jp/blog/19/aws23.html

Major AWS outage: AWS officially acknowledges that failures occurred even in redundant configurations | Nikkei xTECH
https://tech.nikkeibp.co.jp/atcl/nxt/news/18/05816/

AWS outage: would "multi-AZ" have been safe? How infrastructure engineers saw it, and the "reality" revealed in their own words (1/3) - ITmedia NEWS
https://www.itmedia.co.jp/news/articles/1908/28/news127.html