npm-follower: A Complete Dataset Tracking the NPM Ecosystem
Software developers typically rely on a large network of dependencies to build their applications. For instance, the NPM package repository contains over three million packages and over 33 million package versions, and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published OSS code requires datasets that give researchers easy access to both the metadata (dependencies, repository links, etc.) and the code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that have been deleted from NPM cannot be scraped. Since we started scraping in July 2022, 335,325 versions of packages have been deleted from NPM, and this data is slipping away from researchers. Moreover, this data is critical for researchers, as it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture for the NPM package repository, which continually scrapes and indexes the metadata and code of all packages and versions in near real-time, and is thus able to retain data that is later deleted. Since July 2022, we have archived the metadata and code of 281,858 of those deleted versions. The dataset is designed to be easily used by researchers answering questions involving either metadata or program analysis. Both the code and dataset are available at https://dependencies.science.