Wed 6 Dec 2023 17:30 - 17:45 at Golden Gate A - Fault Diagnosis and Root Cause Analysis II Chair(s): Yun Lin

Cloud providers use automated watchdogs or monitors to continuously observe service availability and to proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impacts and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they can lead to production incidents is crucial for ensuring continuous reliability of cloud services.

In this work, we carefully study the production incidents from the past year at Microsoft to understand the monitoring gaps in a hyperscale cloud platform. We conduct an extensive empirical study to answer: (1) What are the major causes of failures in early detection of production incidents, and what steps are taken for mitigation? (2) What is the impact of failures in early detection? (3) How do we recommend best monitoring practices for different services? (4) How can we leverage the insights from this study to enhance the reliability of cloud services?

This study provides a deeper understanding of existing monitoring gaps in cloud platforms, uncovers interesting insights, and provides guidance on best monitoring practices for ensuring continuous reliability.

Wed 6 Dec

Displayed time zone: Pacific Time (US & Canada)

16:00 - 18:00
Fault Diagnosis and Root Cause Analysis II - Industry Papers / Research Papers at Golden Gate A
Chair(s): Yun Lin Shanghai Jiao Tong University
16:00
15m
Talk
DeepDebugger: An Interactive Time-Travelling Debugging Approach for Deep Classifiers
Research Papers
Xianglin Yang Shanghai Jiao Tong University; National University of Singapore, Yun Lin Shanghai Jiao Tong University, Yifan Zhang National University of Singapore, Linpeng Huang Shanghai Jiao Tong University, Jin Song Dong National University of Singapore, Hong Mei Peking University
16:15
15m
Talk
AG3: Automated Game GUI Text Glitch Detection Based on Computer Vision
Industry Papers
Xiaoyun Liang ByteDance, Jiayi Qi ByteDance, Yongqiang Gao ByteDance, Chao Peng ByteDance, China, Ping Yang Bytedance Network Technology
16:30
15m
Talk
TransMap: Pinpointing Mistakes in Neural Code Translation
Research Papers
Bo Wang National University of Singapore, Ruishi Li National University of Singapore, Mingkai Li National University of Singapore, Prateek Saxena National University of Singapore
16:45
15m
Talk
Dynamic Prediction of Delays in Software Projects Using Delay Patterns and Bayesian Modeling
Research Papers
Elvan Kula Delft University of Technology, Eric Greuter ING, Arie van Deursen Delft University of Technology, Georgios Gousios Endor Labs & Delft University of Technology
17:00
15m
Talk
Commit-level, Neural Vulnerability Detection and Assessment
Research Papers
Yi Li New Jersey Institute of Technology, Aashish Yadavally The University of Texas at Dallas, Jiaxing Zhang New Jersey Institute of Technology, Shaohua Wang Central University of Finance and Economics, Tien N. Nguyen University of Texas at Dallas
17:15
15m
Talk
[Remote] Mining Resource-Operation Knowledge to Support Resource Leak Detection
Research Papers
Chong Wang Nanyang Technological University, Yiling Lou Fudan University, Xin Peng Fudan University, Jianan Liu Fudan University, Baihan Zou Fudan University
17:30
15m
Talk
[Remote] Detection Is Better Than Cure: A Cloud Incidents Perspective
Industry Papers
Vaibhav Ganatra Microsoft, Anjaly Parayil Microsoft, Supriyo Ghosh Microsoft, Yu Kang Microsoft Research, Minghua Ma Microsoft Research, Chetan Bansal Microsoft Research, Suman Nath Microsoft Research, Jonathan Mace Microsoft
17:45
7m
Talk
[Remote] Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365
Industry Papers
Fangkai Yang Microsoft Research, Wenjie Yin KTH Royal Institute of Technology, Lu Wang Microsoft Research, Tianci Li Microsoft, Pu Zhao Microsoft Research, Bo Liu Beijing Institute of Technology, Paul Wang Microsoft 365, Bo Qiao Microsoft Research, Yudong Liu Microsoft Research, Mårten Björkman KTH Royal Institute of Technology, Saravan Rajmohan Microsoft 365, Qingwei Lin Microsoft, Dongmei Zhang Microsoft Research