A Dataset for GitHub Repository Deduplication: Extended Description

doi:10.5281/zenodo.3740595

Published April 21, 2020 | Version v1

Technical note | Open access

1. Athens University of Economics and Business
2. University of Tennessee

GitHub projects can be easily replicated through the site's fork
process or through a Git clone-push sequence. This is a problem for
empirical software engineering, because it can lead to skewed results
or mistrained machine learning models. We provide a dataset of 10.6
million GitHub projects that are copies of others, and link each record
with the project's ultimate parent. The ultimate parents were derived
from a ranking along six metrics. The related projects were calculated
as the connected components of a denoised graph of 18.2 million nodes
and 12 million edges, created by directing edges to ultimate parents.
The graph was created by filtering out more than 30 hand-picked and 2.3
million pattern-matched clumping projects. Projects that introduced
unwanted clumping were identified by repeatedly visualizing shortest path
distances between unrelated important projects. Our dataset identified
30 thousand duplicate projects in an existing popular reference dataset
of 1.8 million projects. An evaluation of our dataset against another
created independently with different methods found a significant overlap,
but also differences attributed to the operational definition of what
projects are considered as related.
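
The grouping idea in the description above can be illustrated with a minimal sketch. It is not the authors' pipeline: the project names, the edge list, and the single star-count score are hypothetical stand-ins (the six ranking metrics are not listed here), and the denoising step is omitted. The sketch groups projects into connected components with union-find, picks the best-ranked project in each component as its ultimate parent, and then uses that mapping to deduplicate a project list.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def ultimate_parents(components, score):
    """Map every project to the best-scoring project of its component."""
    mapping = {}
    for component in components:
        # Sorting first makes ties resolve to the alphabetically first name.
        best = max(sorted(component), key=score)
        for project in component:
            mapping[project] = best
    return mapping


def deduplicate(projects, mapping):
    """Replace each project by its ultimate parent, keeping first occurrences."""
    seen, unique = set(), []
    for project in projects:
        representative = mapping.get(project, project)
        if representative not in seen:
            seen.add(representative)
            unique.append(representative)
    return unique


# Hypothetical data: edges link projects believed to be copies of each other.
edges = [
    ("alice/tool", "bob/tool"),
    ("bob/tool", "carol/tool"),
    ("dave/lib", "erin/lib"),
]
# Assumption: one star count stands in for the ranking along six metrics.
stars = {"alice/tool": 120, "bob/tool": 3, "carol/tool": 0,
         "dave/lib": 5, "erin/lib": 40}

components = connected_components(edges)
mapping = ultimate_parents(components, lambda p: stars.get(p, 0))
unique = deduplicate(["bob/tool", "alice/tool", "erin/lib", "zed/solo"], mapping)
print(unique)
```

Projects absent from the mapping, such as `zed/solo` here, are treated as having no known copies and pass through unchanged.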

Files

Name	Size
forkproj-extended.pdf (md5:c3791f1707a97d5e3a25fe8bd662c165)	1.8 MB

Additional details

Related works

Documents: Software: 10.5281/zenodo.3653924 (DOI); Dataset: 10.5281/zenodo.3653920 (DOI)
Is cited by: Conference paper: 10.1145/3379597.3387496 (DOI)
Is supplement to: Conference paper: 10.1145/3379597.3387496 (DOI)

Funding

European Commission: FASTEN – Fine-Grained Analysis of Software Ecosystems as Networks (grant 825328)

