DistXplore: Distribution-guided Testing for Evaluating and Enhancing Deep Learning Systems
Deep learning (DL) models are trained on sampled data whose distribution differs from that of real-world data (\emph{i.e.}, distribution shift), which reduces model robustness. Various testing techniques have been proposed, including distribution-unaware and distribution-aware methods. However, distribution-unaware testing does not explicitly consider the distribution of test cases, which limits its effectiveness and can produce redundant errors (errors within the same distribution). Distribution-aware testing techniques primarily focus on generating test cases that follow the training distribution, missing out-of-distribution data that may also be valid and should be considered during testing.
In this paper, we propose a novel distribution-guided approach for generating \textit{valid} test cases with \textit{diverse} distributions, which can better evaluate model robustness (\emph{i.e.}, by generating hard-to-detect errors) and enhance model robustness (\emph{i.e.}, by enriching the training data). Unlike existing testing techniques that optimize individual test cases, \textit{DistXplore} optimizes test suites that represent specific distributions. To evaluate and enhance model robustness, we design two metrics: \textit{distribution difference}, which guides the generation of hard-to-detect errors by maximizing the distributional similarity between two different classes of data, and \textit{distribution diversity}, which enhances model robustness by generating test cases with diverse distributions to enrich the training data. To assess the effectiveness of \textit{DistXplore} in model evaluation and model enhancement, we compare it with 9 state-of-the-art baselines on 8 models across 4 datasets. The results show that \textit{DistXplore} not only detects more errors (\emph{e.g.}, 2X+ on average) but also identifies more hard-to-detect errors (\emph{e.g.}, 12.1%+ on average). Furthermore, \textit{DistXplore} achieves a larger improvement in empirical robustness (\emph{e.g.}, 5.3% more accuracy improvement than the baselines on average).
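To make the two guidance metrics concrete, the sketch below shows one plausible way to score candidate test suites at the distribution level. The abstract does not specify the underlying statistic, so the Gaussian-kernel Maximum Mean Discrepancy (MMD) and the mean pairwise diversity score used here are illustrative assumptions, not DistXplore's exact formulation.

# Illustrative sketch only: the abstract does not name the exact metrics, so
# the RBF-kernel MMD and pairwise-diversity score below are assumed stand-ins
# for "distribution difference" and "distribution diversity".
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # RBF kernel matrix between sample sets of shape (n, d) and (m, d).
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd(x, y, sigma=1.0):
    # Squared Maximum Mean Discrepancy: near zero when x and y share a distribution.
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

def distribution_difference(suite, target_class_samples, sigma=1.0):
    # Fitness to *minimize* so the suite's distribution approaches another
    # class's distribution; hard-to-detect errors sit near that boundary.
    return mmd(suite, target_class_samples, sigma)

def distribution_diversity(suites, sigma=1.0):
    # Fitness to *maximize*: mean pairwise MMD across candidate suites,
    # rewarding suites that cover distinct distributions.
    pairs = [(i, j) for i in range(len(suites)) for j in range(i + 1, len(suites))]
    return np.mean([mmd(suites[i], suites[j], sigma) for i, j in pairs])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    suite_a = rng.normal(0.0, 1.0, size=(64, 8))  # stand-in feature vectors
    suite_b = rng.normal(0.5, 1.0, size=(64, 8))
    print("difference:", distribution_difference(suite_a, suite_b))
    print("diversity :", distribution_diversity([suite_a, suite_b]))

In a search-based loop, a generator would mutate whole suites and select those that minimize distribution_difference against a target class (surfacing hard-to-detect errors near the decision boundary) while maximizing distribution_diversity across the retained suites (enriching data for retraining).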
Tue 5 Dec (time zone: Pacific Time, US & Canada)
11:00 - 12:30 | Testing I (Ideas, Visions and Reflections / Research Papers / Journal First / Industry Papers) at Golden Gate C1 | Chair(s): Marcelo d'Amorim (North Carolina State University)
11:00 (15m Talk) | [Remote] CAmpactor: A Novel and Effective Local Search Algorithm for Optimizing Pairwise Covering Arrays | Research Papers | Qiyuan Zhao (Beihang University), Chuan Luo (Beihang University), Shaowei Cai (Institute of Software, Chinese Academy of Sciences), Wei Wu (L3S Research Center, Leibniz University Hannover, Germany), Jinkun Lin (Seed Math Technology Limited), Hongyu Zhang (Chongqing University), Chunming Hu (Beihang University)
11:15 (15m Talk) | Accelerating Continuous Integration with Parallel Batch Testing | Research Papers | Emad Fallahzadeh (Concordia University), Amir Hossein Bavand (Concordia University), Peter Rigby (Concordia University; Meta)
11:30 (15m Talk) | Keeping Mutation Test Suites Consistent and Relevant with Long-Standing Mutants | Ideas, Visions and Reflections | Milos Ojdanic (University of Luxembourg), Mike Papadakis (University of Luxembourg), Mark Harman (Meta Platforms Inc. and UCL)
11:45 (15m Talk) | DistXplore: Distribution-guided Testing for Evaluating and Enhancing Deep Learning Systems | Research Papers | Longtian Wang (Xi'an Jiaotong University), Xiaofei Xie (Singapore Management University), Xiaoning Du (Monash University, Australia), Meng Tian (Singapore Management University), Qing Guo (IHPC and CFAR at A*STAR, Singapore), Yang Zheng (TTE Lab, Huawei), Chao Shen (Xi'an Jiaotong University)
12:00 (15m Talk) | Input Distribution Coverage: Measuring Feature Interaction Adequacy in Neural Network Testing | Journal First | Swaroopa Dola (University of Virginia), Matthew B Dwyer (University of Virginia), Mary Lou Soffa (University of Virginia)
12:15 (15m Talk) | A Unified Framework for Mini-game Testing: Experience on WeChat | Industry Papers | Chaozheng Wang (The Chinese University of Hong Kong), Haochuan Lu (Tencent), Cuiyun Gao (The Chinese University of Hong Kong), Li Zongjie (Hong Kong University of Science and Technology), Ting Xiong (Tencent Inc.), Yuetang Deng (Tencent Inc.)