Tue 5 Dec 2023 16:00 - 16:15 at Golden Gate C3 - Fault Diagnosis and Root Cause Analysis I Chair(s): Akond Rahman

Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging task. To understand and localize the root causes of unexpected faults, modern observability tools collect and preserve multi-modal observability data, including metrics, traces, and logs. Because system faults may manifest as anomalies in different data sources, existing RCA approaches that leverage only single-modal data are limited in the granularity and interpretability of the root causes they report. In this study, we present Nezha, an interpretable and fine-grained RCA approach that pinpoints root causes at the code region and resource type level through joint analysis of multi-modal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way. Practical implementation and experimental evaluations on two microservice benchmarks show that Nezha achieves a high average top-1 accuracy (89.77%) at the code region and resource type level and outperforms state-of-the-art approaches by a large margin. Two ablation studies further confirm the contribution of incorporating multi-modal data.
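The comparison idea in the abstract (mine event patterns in the fault-free phase and in the fault-suffering phase, then look at what changed) can be illustrated with a minimal sketch. This is not Nezha's implementation. It assumes a heavily simplified setting in which each trace has already been converted into an ordered list of event labels, a "pattern" is just a consecutive event pair (an edge of a per-trace event graph), and a pattern's suspiciousness is the increase in the fraction of traces containing it. The event names and the helpers `mine_patterns` and `rank_suspicious` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = str                    # hypothetical event label, e.g. "log:cart_timeout"
Pattern = Tuple[Event, Event]  # simplified pattern: a directed edge a -> b


def mine_patterns(traces: List[List[Event]]) -> Dict[Pattern, float]:
    """Return, for each consecutive event pair, the fraction of traces containing it."""
    counts: Counter = Counter()
    for trace in traces:
        # Count each edge at most once per trace (support, not raw frequency).
        counts.update(set(zip(trace, trace[1:])))
    total = max(len(traces), 1)
    return {pattern: c / total for pattern, c in counts.items()}


def rank_suspicious(
    fault_free: List[List[Event]],
    fault_suffering: List[List[Event]],
    top_k: int = 3,
) -> List[Tuple[Pattern, float]]:
    """Rank patterns whose support grew after the fault, largest increase first."""
    before = mine_patterns(fault_free)
    after = mine_patterns(fault_suffering)
    scores = {p: after[p] - before.get(p, 0.0) for p in after}
    # Sort by descending score; break ties by pattern so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(p, s) for p, s in ranked[:top_k] if s > 0]


if __name__ == "__main__":
    normal = [["span:frontend", "span:cart", "log:cart_ok"]] * 10
    faulty = (
        [["span:frontend", "span:cart", "metric:cpu_high", "log:cart_timeout"]] * 8
        + [["span:frontend", "span:cart", "log:cart_ok"]] * 2
    )
    for pattern, score in rank_suspicious(normal, faulty):
        print(pattern, round(score, 2))
```

In this toy input, the two edges involving the hypothetical `metric:cpu_high` event appear only after the fault, so they rank highest. The events they connect (a resource anomaly and the span that emits it) point toward a resource-type and code-region explanation. The real approach works on richer event graphs and pattern definitions.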

Tue 5 Dec

Displayed time zone: Pacific Time (US & Canada)

16:00 - 18:00
Fault Diagnosis and Root Cause Analysis I
Research Papers / Journal First / Industry Papers at Golden Gate C3
Chair(s): Akond Rahman Auburn University
16:00
15m
Talk
[Remote] Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data
Research Papers
Guangba Yu Sun Yat-Sen University, Pengfei Chen Sun Yat-Sen University, Yufeng Li Sun Yat-sen University, Hongyang Chen School of Computer Science and Engineering, Sun Yat-sen University, Xiaoyun Li Sun Yat-sen University, Zibin Zheng Sun Yat-sen University
16:15
15m
Full-paper
[Remote] DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems
Research Papers
Zhiming Chen Sun Yat-sen University, Pengfei Chen Sun Yat-Sen University, Guangba Yu Sun Yat-Sen University, Zilong He Sun Yat-Sen University, Genting Mai Sun Yat-sen University, Peipei Wang ByteDance Infrastructure System Lab
16:30
15m
Talk
[Remote] Pre-training Code Representation with Semantic Flow Graph for Effective Bug Localization
Research Papers
Yali Du Shandong University, Zhongxing Yu Shandong University
16:45
15m
Talk
[Remote] A Practical Human Labeling Method for Online Just-in-Time Software Defect Prediction
Research Papers
Liyan Song Southern University of Science and Technology, China, Leandro Minku University of Birmingham, Cong Teng Southern University of Science and Technology, Xin Yao Southern University of Science and Technology
17:00
15m
Talk
Trace Diagnostics for Signal-Based Temporal Properties
Journal First
Chaima Boufaied University of Ottawa, Claudio Menghi University of Bergamo; McMaster University, Domenico Bianculli University of Luxembourg, Lionel Briand University of Ottawa, Canada / University of Luxembourg, Luxembourg
17:15
15m
Talk
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Industry Papers
Ruomeng Ding Microsoft, Chaoyun Zhang Microsoft, Lu Wang Microsoft Research, Yong Xu Microsoft Research, Minghua Ma Microsoft Research, Xiaomin Wu Microsoft, Meng Zhang, Qingjun Chen Microsoft 365, Xin Gao Microsoft 365, Xuedong Gao Microsoft 365, Hao Fan, Saravan Rajmohan Microsoft 365, Qingwei Lin Microsoft, Dongmei Zhang Microsoft Research
17:30
15m
Talk
Triggering Modes in Spectrum-Based Multi-location Fault Localization
Industry Papers
Tung Dao Cvent, Na Meng Virginia Tech, ThanhVu Nguyen George Mason University
17:45
15m
Talk
Automata-based Trace Analysis for Aiding Diagnosing GUI Testing Tools for Android
Research Papers
Enze Ma East China Normal University, Shan Huang East China Normal University, Weigang He East China Normal University, Ting Su East China Normal University, Jue Wang Nanjing University, Huiyu Liu East China Normal University, Geguang Pu East China Normal University, Zhendong Su ETH Zurich