[Remote] Detection Is Better Than Cure: A Cloud Incidents Perspective
Cloud providers use automated watchdogs or monitors to continuously observe service availability and to proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impacts and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they can lead to production incidents is crucial for ensuring continuous reliability of cloud services.
In this work, we carefully study the production incidents from the past year at Microsoft to understand the monitoring gaps in a hyperscale cloud platform. We conduct an extensive empirical study to answer four questions: (1) What are the major causes of failures in early detection of production incidents, and what steps are taken for mitigation? (2) What is the impact of failures in early detection? (3) How do we recommend best monitoring practices for different services? (4) How can we leverage the insights from this study to enhance the reliability of cloud services?
This study provides a deeper understanding of existing monitoring gaps in cloud platforms, uncovers interesting insights, and provides guidance on best monitoring practices for ensuring continuous reliability.
Wed 6 Dec. Displayed time zone: Pacific Time (US & Canada)
16:00 - 18:00 | Fault Diagnosis and Root Cause Analysis II | Industry Papers / Research Papers at Golden Gate A | Chair(s): Yun Lin (Shanghai Jiao Tong University)
16:00 | 15m | Talk | DeepDebugger: An Interactive Time-Travelling Debugging Approach for Deep Classifiers | Research Papers | Xianglin Yang (Shanghai Jiao Tong University; National University of Singapore), Yun Lin (Shanghai Jiao Tong University), Yifan Zhang (National University of Singapore), Linpeng Huang (Shanghai Jiao Tong University), Jin Song Dong (National University of Singapore), Hong Mei (Peking University)
16:15 | 15m | Talk | AG3: Automated Game GUI Text Glitch Detection Based on Computer Vision | Industry Papers | Xiaoyun Liang (ByteDance), Jiayi Qi (ByteDance), Yongqiang Gao (ByteDance), Chao Peng (ByteDance, China), Ping Yang (Bytedance Network Technology)
16:30 | 15m | Talk | TransMap: Pinpointing Mistakes in Neural Code Translation | Research Papers | Bo Wang (National University of Singapore), Ruishi Li (National University of Singapore), Mingkai Li (National University of Singapore), Prateek Saxena (National University of Singapore)
16:45 | 15m | Talk | Dynamic Prediction of Delays in Software Projects Using Delay Patterns and Bayesian Modeling | Research Papers | Elvan Kula (Delft University of Technology), Eric Greuter (ING), Arie van Deursen (Delft University of Technology), Georgios Gousios (Endor Labs & Delft University of Technology)
17:00 | 15m | Talk | Commit-level, Neural Vulnerability Detection and Assessment | Research Papers | Yi Li (New Jersey Institute of Technology), Aashish Yadavally (The University of Texas at Dallas), Jiaxing Zhang (New Jersey Institute of Technology), Shaohua Wang (Central University of Finance and Economics), Tien N. Nguyen (University of Texas at Dallas)
17:15 | 15m | Talk | [Remote] Mining Resource-Operation Knowledge to Support Resource Leak Detection | Research Papers | Chong Wang (Nanyang Technological University), Yiling Lou (Fudan University), Xin Peng (Fudan University), Jianan Liu (Fudan University), Baihan Zou (Fudan University)
17:30 | 15m | Talk | [Remote] Detection Is Better Than Cure: A Cloud Incidents Perspective | Industry Papers | Vaibhav Ganatra (Microsoft), Anjaly Parayil (Microsoft), Supriyo Ghosh (Microsoft), Yu Kang (Microsoft Research), Minghua Ma (Microsoft Research), Chetan Bansal (Microsoft Research), Suman Nath (Microsoft Research), Jonathan Mace (Microsoft)
17:45 | 7m | Talk | [Remote] Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365 | Industry Papers | Fangkai Yang (Microsoft Research), Wenjie Yin (KTH Royal Institute of Technology), Lu Wang (Microsoft Research), Tianci Li (Microsoft), Pu Zhao (Microsoft Research), Bo Liu (Beijing Institute of Technology), Paul Wang (Microsoft 365), Bo Qiao (Microsoft Research), Yudong Liu (Microsoft Research), Mårten Björkman (KTH Royal Institute of Technology), Saravan Rajmohan (Microsoft 365), Qingwei Lin (Microsoft), Dongmei Zhang (Microsoft Research)