npm-follower: A Complete Dataset Tracking the NPM Ecosystem (ESEC/FSE 2023 - Demonstrations)

Sun 3 - Sat 9 December 2023 San Francisco, California, United States

Who

Donald Pinckney, Federico Cassano, Arjun Guha, Jonathan Bell

Track

ESEC/FSE 2023 Demonstrations

Time Zone

The program is currently displayed in (GMT-08:00) Pacific Time (US & Canada).

When

Tue 5 Dec 2023 15:00 - 15:07 at Golden Gate C2 - Software Evolution I Chair(s): Rangeet Pan

Abstract

Software developers typically rely on a large network of dependencies to build their applications. For instance, the NPM package repository contains over three million packages and over 33 million package versions, and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published OSS code requires datasets that give researchers easy access to both the metadata (dependencies, repository links, etc.) and the code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that are deleted from NPM cannot be scraped. In the time since we started scraping in July 2022, 335,325 package versions have been deleted from NPM, and this data is slipping away from researchers. Moreover, this data is critical for researchers because it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture for the NPM package repository that continually scrapes and indexes the metadata and code of all packages and versions in near real-time, and is thus able to retain data that is later deleted. Since July 2022 we have archived the metadata and code of 281,858 of those deleted versions. The dataset is designed to be easily used by researchers answering questions involving either metadata or program analysis. Both the code and dataset are available at https://dependencies.science.
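
To make the retention idea in the abstract concrete, here is a minimal sketch in Python of an archive that replays a stream of change events and keeps metadata for versions that are later deleted. It is not npm-follower's actual implementation: the event format, the `Archive` class, the `replay` helper, and the package names are all hypothetical, and the store is an in-memory dictionary rather than a real database or the NPM registry.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """In-memory stand-in for a store that never discards what it has seen."""

    # (package name, version) -> metadata captured at publish time
    versions: dict = field(default_factory=dict)
    # keys that were later reported as deleted upstream
    deleted: set = field(default_factory=set)

    def apply(self, event: dict) -> None:
        """Apply one change event (hypothetical format, assumed for this sketch)."""
        key = (event["name"], event["version"])
        if event["type"] == "publish":
            self.versions[key] = event["metadata"]
            self.deleted.discard(key)
        elif event["type"] == "delete":
            # Record the deletion, but keep any metadata archived earlier.
            self.deleted.add(key)

    def retained_deleted(self) -> set:
        """Deleted versions whose metadata was archived before the deletion."""
        return {key for key in self.deleted if key in self.versions}


def replay(events) -> Archive:
    """Feed events into a fresh archive in the order they arrive."""
    archive = Archive()
    for event in events:
        archive.apply(event)
    return archive


# Hypothetical event stream; the last deletion concerns a version that was
# never observed, so nothing can be retained for it.
EVENTS = [
    {"type": "publish", "name": "demo-pkg", "version": "1.0.0",
     "metadata": {"dependencies": {}}},
    {"type": "publish", "name": "demo-pkg", "version": "1.0.1",
     "metadata": {"dependencies": {"other-demo": "^2.0.0"}}},
    {"type": "delete", "name": "demo-pkg", "version": "1.0.0"},
    {"type": "delete", "name": "never-seen", "version": "0.0.1"},
]

archive = replay(EVENTS)
```

In this toy run, `archive.retained_deleted()` contains only `("demo-pkg", "1.0.0")`, because the other deleted version was never observed before it disappeared. Whether that is the reason some deleted versions are missing from the real dataset is an assumption of this sketch. Using the abstract's own figures, 281,858 archived out of 335,325 deleted versions is roughly 84%.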

Donald Pinckney

Northeastern University

United States

Federico Cassano

Northeastern University

Arjun Guha

Northeastern University and Roblox

United States

Jonathan Bell

Northeastern University

United States

Session Program

Tue 5 Dec
Displayed time zone: Pacific Time (US & Canada)

14:00 - 15:30	Software Evolution I: Industry Papers / Research Papers / Demonstrations at Golden Gate C2 Chair(s): Rangeet Pan, IBM Research

14:00 15m Talk		Understanding Solidity Event Logging Practices in the Wild Research Papers Lantian Li Shandong University, Yejian Liang Shandong University, Zhihao Liu Shandong University, Zhongxing Yu Shandong University
14:15 15m Talk		Last Diff Analyzer: Multi-language Automated Approver for Behavior-Preserving Code Revisions Industry Papers Yuxin Wang Uber Technologies, Adam Welc Mysten Labs, Lazaro Clapp Uber Technologies Inc, Lingchao Chen Uber Technologies
14:30 15m Talk		EvaCRC: Evaluating Code Review Comments Research Papers Lanxin Yang Nanjing University, Jinwei Xu Nanjing University, YiFan Zhang Nanjing University, He Zhang Nanjing University, Alberto Bacchelli University of Zurich
14:45 15m Talk		HyperDiff: Computing Source Code Diffs at Scale Research Papers Quentin Le-dilavrec Univ. Rennes, IRISA, INRIA, Djamel Eddine Khelladi CNRS, IRISA, University of Rennes, Arnaud Blouin Univ Rennes, INSA Rennes, Inria, CNRS, IRISA, Jean-Marc Jézéquel Univ Rennes - IRISA
15:00 7m Talk		npm-follower: A Complete Dataset Tracking the NPM Ecosystem Demonstrations Donald Pinckney Northeastern University, Federico Cassano Northeastern University, Arjun Guha Northeastern University and Roblox, Jonathan Bell Northeastern University
15:08 7m Talk		Issue Report Validation in an Industrial Context Industry Papers Ethem Utku Aktas Softtech Inc., Ebru Cakmak Microsoft EMEA, Mete Cihad Inan Softtech Research and Development, Cemal Yilmaz Sabancı University
15:15 15m Talk		Dead Code Removal at Meta: Automatically Deleting Millions of Lines of Code and Petabytes of Deprecated Data Industry Papers Will Shackleton Meta, Katriel Cohn-Gordon Meta, Peter C Rigby Meta; Concordia University, Rui Abreu Meta, James Gill Meta, Nachiappan Nagappan Meta, Karim Nakad Meta, Ioannis Papagiannis Meta, Luke Petre Meta, Giorgi Megreli Meta, Patrick Riggs Meta, James Saindon Meta