[Remote] Assess and Summarize: Improve Outage Understanding with Large Language Models (ESEC/FSE 2023 - Industry Papers)

Who

Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, Shilin He, Federica Sarro, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

Track

ESEC/FSE 2023 Industry Papers

When

Tue 5 Dec 2023 14:00 - 14:15 at Golden Gate A - Empirical Studies I Chair(s): Cristian Cadar

Abstract

Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time applications and services hosted on the cloud are affected by an outage, users can experience slow response times, connection issues, or total service disruption, resulting in significant negative business impact. Outages usually comprise several concurrent events or source causes, so understanding the context of an outage is a challenging yet crucial first step toward mitigating and resolving it. In current practice, on-call engineers with in-depth domain knowledge have to manually assess and summarize outages when they happen, which is time-consuming and labor-intensive. In this paper, we first present a large-scale empirical study investigating the way on-call engineers currently deal with cloud outages at Microsoft, and then present and empirically validate a novel approach (dubbed Oasis) to help the engineers in this task. Oasis automatically assesses the impact scope of outages and produces human-readable summaries. Specifically, Oasis first assesses the impact scope of an outage by aggregating relevant incidents via multiple techniques. Then, it generates a human-readable summary by leveraging fine-tuned large language models like GPT-3.x. The impact assessment component of Oasis was introduced in Microsoft over three years ago and is now widely adopted, while the outage summarization component has been introduced recently. We present the results of an empirical evaluation carried out on 18 real-world cloud systems, as well as a human-based evaluation with outage owners. The results show that Oasis can effectively and efficiently summarize outages, and they led Microsoft to deploy its first prototype, which is currently under experimental adoption by some of the incident teams.

DOI

https://doi.org/10.1145/3611643.3613891

Pengxiang Jin

Nankai University

China

Shenglin Zhang

Nankai University

China

Minghua Ma

Microsoft Research

China

Haozhe Li

Peking University

China

Yu Kang

Microsoft Research

China

Liqun Li

Microsoft Research

China

Yudong Liu

Microsoft Research

China

Bo Qiao

Microsoft Research

China

Chaoyun Zhang

Microsoft

China

Pu Zhao

Microsoft Research

China

Shilin He

Microsoft Research

n.n.

Federica Sarro

University College London

United Kingdom

Yingnong Dang

Microsoft Azure

United States

Saravan Rajmohan

Microsoft 365

United States

Qingwei Lin

Microsoft

China

Dongmei Zhang

Microsoft Research

China

Session Program

Tue 5 Dec
Displayed time zone: Pacific Time (US & Canada)

14:00 - 15:30	Empirical Studies I (Ideas, Visions and Reflections / Research Papers / Industry Papers / Journal First) at Golden Gate A. Chair(s): Cristian Cadar (Imperial College London)

14:00 15m Talk		[Remote] Assess and Summarize: Improve Outage Understanding with Large Language Models (Industry Papers). Pengxiang Jin (Nankai University), Shenglin Zhang (Nankai University), Minghua Ma (Microsoft Research), Haozhe Li (Peking University), Yu Kang (Microsoft Research), Liqun Li (Microsoft Research), Yudong Liu (Microsoft Research), Bo Qiao (Microsoft Research), Chaoyun Zhang (Microsoft), Pu Zhao (Microsoft Research), Shilin He (Microsoft Research), Federica Sarro (University College London), Yingnong Dang (Microsoft Azure), Saravan Rajmohan (Microsoft 365), Qingwei Lin (Microsoft), Dongmei Zhang (Microsoft Research)
14:15 15m Talk		Open Source License Inconsistencies on GitHub (Journal First). Thomas Wolter (Friedrich-Alexander University Erlangen-Nuernberg), Ann Barcomb (Department of Electrical and Software Engineering, Schulich School of Engineering, University of Calgary), Dirk Riehle (U of Erlangen), Nikolay Harutyunyan (Friedrich-Alexander University Erlangen-Nuremberg, Germany)
14:30 15m Talk		On the Relationship Between Code Verifiability and Understandability (Research Papers). Kobi Feldman (College of William & Mary), Martin Kellogg (New Jersey Institute of Technology), Oscar Chaparro (William & Mary)
14:45 15m Talk		Lessons from the Long Tail: Analysing Unsafe Dependency Updates across Software Ecosystems (Ideas, Visions and Reflections). Supatsara Wattanakriengkrai (Nara Institute of Science and Technology), Raula Gaikovina Kula (Nara Institute of Science and Technology), Christoph Treude (University of Melbourne), Kenichi Matsumoto (Nara Institute of Science and Technology)
15:00 15m Talk		Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study (Research Papers). Xiaokai Wei (AWS AI Labs), Sujan Kumar Gonugondla (AWS AI Labs), Shiqi Wang (AWS AI Labs), Wasi Ahmad (AWS AI Labs), Baishakhi Ray (Columbia University), Haifeng Qian (AWS AI Labs), Xiaopeng LI (AWS AI Labs), Varun Kumar (AWS AI Labs), Zijian Wang (AWS AI Labs), Yuchen Tian (AWS), Qing Sun (AWS AI Labs), Ben Athiwaratkun (AWS AI Labs), Mingyue Shang (AWS AI Labs), Murali Krishna Ramanathan (AWS AI Labs), Parminder Bhatia (AWS AI Labs), Bing Xiang (AWS AI Labs)
15:15 15m Talk		Understanding Hackers’ Work: An Empirical Study of Offensive Security Practitioners (Industry Papers). Andreas Happe (TU Wien), Jürgen Cito (TU Wien)