TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale: they usually comprise hundreds of components, so diagnosis demands significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses these challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph that automatically eliminates redundant components, thereby significantly improving RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in the Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA.
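The prune-then-diagnose flow described in the abstract can be illustrated with a toy sketch. This is not TraceDiag's implementation: a fixed score threshold stands in for the learned reinforcement-learning pruning policy, and a simple ranking of the deepest remaining services stands in for the causal-based method. The service names, anomaly scores, and the functions `prune` and `rank_candidates` are all hypothetical.

```python
from collections import deque

# Hypothetical service dependency graph: caller -> list of callees.
DEPENDENCIES = {
    "frontend": ["auth", "mailbox"],
    "auth": ["directory"],
    "mailbox": ["store", "search"],
    "directory": [],
    "store": [],
    "search": [],
}

# Hypothetical per-service anomaly scores in [0, 1].
ANOMALY = {
    "frontend": 0.9, "auth": 0.1, "mailbox": 0.8,
    "directory": 0.05, "store": 0.7, "search": 0.2,
}


def prune(graph, scores, root, threshold):
    """Walk the graph from root, dropping callees whose score is below threshold.

    Assumption: this fixed rule is only a stand-in for a learned pruning policy.
    """
    kept = {}
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = [c for c in graph[node] if scores[c] >= threshold]
        kept[node] = children
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return kept


def rank_candidates(pruned, scores):
    """Rank services that have no remaining callee, most anomalous first.

    Assumption: a placeholder for the causal analysis run on the pruned graph.
    """
    leaves = [node for node, children in pruned.items() if not children]
    return sorted(leaves, key=lambda node: scores[node], reverse=True)


pruned = prune(DEPENDENCIES, ANOMALY, "frontend", threshold=0.5)
candidates = rank_candidates(pruned, ANOMALY)
print(pruned)
print(candidates)
```

In this toy setup, pruning shrinks six services to three before any ranking runs, which is the efficiency argument the abstract makes: the expensive analysis step only sees the components that survive pruning.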
Tue 5 Dec. Displayed time zone: Pacific Time (US & Canada)
16:00 - 18:00 | Fault Diagnosis and Root Cause Analysis I | Research Papers / Journal First / Industry Papers | at Golden Gate C3 | Chair(s): Akond Rahman (Auburn University)
16:00 | 15m | Talk | [Remote] Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data | Research Papers | Guangba Yu (Sun Yat-Sen University), Pengfei Chen (Sun Yat-Sen University), Yufeng Li (Sun Yat-sen University), Hongyang Chen (School of Computer Science and Engineering, Sun Yat-sen University), Xiaoyun Li (Sun Yat-sen University), Zibin Zheng (Sun Yat-sen University)
16:15 | 15m | Full-paper | [Remote] DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems | Research Papers | Zhiming Chen (Sun Yat-sen University), Pengfei Chen (Sun Yat-Sen University), Guangba Yu (Sun Yat-Sen University), Zilong He (Sun Yat-Sen University), Genting Mai (Sun Yat-sen University), Peipei Wang (ByteDance Infrastructure System Lab)
16:30 | 15m | Talk | [Remote] Pre-training Code Representation with Semantic Flow Graph for Effective Bug Localization | Research Papers
16:45 | 15m | Talk | [Remote] A Practical Human Labeling Method for Online Just-in-Time Software Defect Prediction | Research Papers | Liyan Song (Southern University of Science and Technology, China), Leandro Minku (University of Birmingham), Cong Teng (Southern University of Science and Technology), Xin Yao (Southern University of Science and Technology)
17:00 | 15m | Talk | Trace Diagnostics for Signal-Based Temporal Properties | Journal First | Chaima Boufaied (University of Ottawa), Claudio Menghi (University of Bergamo; McMaster University), Domenico Bianculli (University of Luxembourg), Lionel Briand (University of Ottawa, Canada / University of Luxembourg, Luxembourg)
17:15 | 15m | Talk | TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems | Industry Papers | Ruomeng Ding (Microsoft), Chaoyun Zhang (Microsoft), Lu Wang (Microsoft Research), Yong Xu (Microsoft Research), Minghua Ma (Microsoft Research), Xiaomin Wu (Microsoft), Meng Zhang, Qingjun Chen (Microsoft 365), Xin Gao (Microsoft 365), Xuedong Gao (Microsoft 365), Hao Fan, Saravan Rajmohan (Microsoft 365), Qingwei Lin (Microsoft), Dongmei Zhang (Microsoft Research)
17:30 | 15m | Talk | Triggering Modes in Spectrum-Based Multi-location Fault Localization | Industry Papers
17:45 | 15m | Talk | Automata-based Trace Analysis for Aiding Diagnosing GUI Testing Tools for Android | Research Papers | Enze Ma (East China Normal University), Shan Huang (East China Normal University), Weigang He (East China Normal University), Ting Su (East China Normal University), Jue Wang (Nanjing University), Huiyu Liu (East China Normal University), Geguang Pu (East China Normal University), Zhendong Su (ETH Zurich)