A study of code re-use across more than twenty thousand Java projects on GitHub found that almost 30% of the code blocks analysed might be involved in potential code borrowing and almost 10% could potentially violate the licences of the original code.
Code borrowing, the use of code cloned from existing projects, is part of the open source philosophy. If someone else has already written the code you want, there’s no need to reinvent it. On the other hand, it is important to re-use open source code in compliance with the terms and conditions of the licence of the original code.
The researchers, who are affiliated with Russian universities and with JetBrains, explain in their arXiv paper:
We chose Java as a language because of its popularity in industry, where the plagiarism problem is especially relevant because of possible legal action.
The researchers used the Public Git Archive (PGA), a large dataset that was compiled in early 2018. It consists of all GitHub projects with 50 or more stars and can be filtered by language. They extracted all projects with at least one line written in Java, which resulted in 24,810 projects overall and a final dataset of 23,378 Java repositories.
An improved version of SourcererCC was used for clone detection. SourcererCC is token-based, defining a token as a programming language keyword, a literal, or an identifier. The tool parses the files, tokenizes them, and then compares the token sets of pieces of code to detect possible clones. The researchers re-wrote it in Python 3 and modified it to open each file with UTF-8 encoding, which solved the problem of files containing non-ASCII characters being omitted. Clone detection was performed in several parallel instances with various values for two parameters, Similarity Threshold and Lower Token Length Threshold. The main search used a Similarity Threshold of 75% and a Lower Token Length Threshold of 19 tokens, the latter to filter out trivial pieces of code. Once the results of clone detection were merged, the output consisted of a list of all clone pairs plus information about every block, including the project it came from, the path of the file within the project, and the lines it occupies in the file.
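To make the two thresholds concrete, here is a minimal sketch of a bag-of-tokens comparison in the spirit of SourcererCC. It is not the researchers' code: the regular-expression tokenizer, the function names and the choice of measuring overlap against the larger block are simplifying assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, simplified tokenizer: keeps string literals, identifiers and
# keywords, and numbers; operators and punctuation are dropped. A real
# clone detector would use a language-aware tokenizer instead.
TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+(?:\.\d+)?')

SIMILARITY_THRESHOLD = 0.75  # 75%, the value used in the study's main search
MIN_TOKENS = 19              # Lower Token Length Threshold from the study


def tokenize(code):
    """Return the block as a bag (multiset) of tokens."""
    return Counter(TOKEN_RE.findall(code))


def is_clone_pair(block_a, block_b,
                  threshold=SIMILARITY_THRESHOLD, min_tokens=MIN_TOKENS):
    """Assumed rule: shared tokens must cover `threshold` of the larger block."""
    bag_a, bag_b = tokenize(block_a), tokenize(block_b)
    size_a, size_b = sum(bag_a.values()), sum(bag_b.values())
    # Blocks below the token length threshold are treated as trivial.
    if size_a < min_tokens or size_b < min_tokens:
        return False
    # Multiset intersection counts each shared token as often as it
    # appears in both blocks.
    overlap = sum((bag_a & bag_b).values())
    return overlap >= math.ceil(threshold * max(size_a, size_b))
```

Under these assumptions, a 20-token method in which one local variable has been renamed still counts as a clone of the original, while a one-line statement is ignored because it falls below the 19-token threshold.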
To study licence violations, the researchers needed to collect two types of data for these fragments: licences and the time of the last modification, which allows them to presume what code could have been copied from where. To determine the licences they used Ninka, a tool written in Perl that takes a sentence-based approach to parse the top part of each file and match it to known licences. For the time of modification they used the git blame command, whose output includes each line of the file together with the date and time of its last modification and the author of that modification. They comment:
This system is well-suited for our task of suggesting possible violations on the block level, since the information is not generalized for the entire file.
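As an illustration of how line-level blame data can be reduced to a block-level timestamp, the sketch below reads the output of git blame --line-porcelain, which repeats the commit details, including an author-time field in Unix seconds, before every source line. The helper names are hypothetical, and taking the most recent line as the block's modification time is an assumption made here, not a detail confirmed by the paper.

```python
import subprocess
from datetime import datetime, timezone


def parse_line_porcelain(output):
    """Return one author-time (Unix seconds) per blamed source line."""
    times = []
    current = None
    for line in output.split("\n"):
        if line.startswith("author-time "):
            current = int(line.split()[1])
        elif line.startswith("\t") and current is not None:
            # A TAB-prefixed line is the source line that ends one entry.
            times.append(current)
            current = None
    return times


def latest_author_time(output):
    """Assumption: a block was last modified when its newest line was."""
    return max(parse_line_porcelain(output))


def block_last_modified(repo, path, start, end):
    """Blame lines start..end of `path` inside the repository at `repo`."""
    result = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain",
         "-L", f"{start},{end}", "--", path],
        capture_output=True, text=True, check=True,
    )
    return datetime.fromtimestamp(latest_author_time(result.stdout),
                                  tz=timezone.utc)
```

Restricting the blame to a line range with -L is what makes the result specific to a block rather than to the whole file.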
An impressive amount of effort went into this study:
The gathering of the dataset took about a day, and the clone detection took a little over two months of continuous calculations. In total, the dataset was tokenized into 38,617,427 unique blocks of code. 11,762,703 blocks of code passed the threshold of 19 tokens, of which 7,601,738 engaged in the cloning process (64.6%). In total, 1,163,989,420 clone pairs were detected, which came from 20,824 different projects, meaning that 2,554 projects did not have any clones larger than the threshold at all. Out of these pairs, 560,656,419 were inter-project (48.2%).
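The percentages and the project count in that passage follow directly from the raw figures, as this short check shows. The variable names are just labels chosen here.

```python
# Figures quoted from the paper.
blocks_over_threshold = 11_762_703
blocks_with_clones = 7_601_738
clone_pairs = 1_163_989_420
inter_project_pairs = 560_656_419
projects_in_dataset = 23_378
projects_with_clones = 20_824

# Share of above-threshold blocks that take part in at least one clone pair.
share_cloned = blocks_with_clones / blocks_over_threshold
# Share of clone pairs whose two blocks come from different projects.
share_inter_project = inter_project_pairs / clone_pairs
# Projects with no clones larger than the threshold.
projects_without_clones = projects_in_dataset - projects_with_clones

print(f"{share_cloned:.1%}")         # 64.6%
print(f"{share_inter_project:.1%}")  # 48.2%
print(projects_without_clones)       # 2554
```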
With regard to licences, 94 different ones were encountered and this chart shows the distribution of the twelve most popular.
Apache 2.0 was by far the most common and was applied to over half the files. The researchers comment that this finding is consistent with other recent research and state:
The license is very popular because of how permissive and detailed it is.
As the paper points out, the second most prevalent, GitHub, is actually not a licence at all but the absence of one. It explains:
When a developer uploads code to GitHub and does not provide any license with it, then all rights are reserved and the borrowing of code requires explicit permission from the author. Using GitHub as a platform implies agreeing to its Terms of Services that allows free viewing and forking of the code, however, free copying is not allowed.
During the block-level analysis, licence pairs were labelled as either permitted or prohibited, and possible violations were further subdivided into strong and weak. This analysis excluded forks of the same project and code borrowing between projects with the same contributors or belonging to the same individual or organization.
As revealed in the chart above, 35.4% of blocks had No clones and a further 35% were Unique, meaning that they had clones within their project, or clones between forks, or clones within the same author or organization, but no clones that could constitute a code borrowing or license violation. This left 29.6% of blocks which crop up in pairs from unrelated projects one way or another. Origin blocks, which account for 8% of blocks, have clones in other projects, all of which were modified more recently, meaning that this piece of code can only act as an origin point of a possible borrowing. Another 10.4% of blocks constitute Legal borrowings, meaning that they have older clones, but all of their licenses allow this transition. The remaining 9.4% of blocks constitute possible license violations – 5.4% are categorized as weak violation, meaning that only some of their older clones prohibit the possible borrowing, while 4% constitute strong violation, meaning that they have older clones and all of them come from a restricting license.
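The six categories amount to a small decision procedure, sketched below. The Clone record and its three flags are hypothetical inputs: working out whether two projects are related, which clone is older and whether a licence pair permits borrowing is the hard part of the study and is simply taken as given here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clone:
    related: bool    # same project, a fork, or the same author/organization
    older: bool      # last modified before the block being classified
    permitted: bool  # licence pair allows borrowing into the block's project


def classify(clones: List[Clone]) -> str:
    """Assign a block to one of the categories described in the article."""
    if not clones:
        return "No clones"
    unrelated = [c for c in clones if not c.related]
    if not unrelated:
        return "Unique"
    older = [c for c in unrelated if c.older]
    if not older:
        # Every unrelated clone is newer: the block can only be a source.
        return "Origin"
    prohibited = [c for c in older if not c.permitted]
    if not prohibited:
        return "Legal"
    if len(prohibited) == len(older):
        return "Strong violation"
    return "Weak violation"
```

For example, a block with two older unrelated clones, only one of which comes from a licence that prohibits the borrowing, is labelled a weak violation.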
An interesting finding of the study is apparent from the heat map that shows that code borrowing exhibits:
“relative symmetry: for every pair of licenses, the number of possible borrowings from A to B and possible borrowings from B to A is at least similar”.
The researchers comment:
That might indicate that the amount of possible borrowings between licenses is generally dependant only on the popularity of this license and that if the code is being copied between projects, developers do not pay much attention to the licensing.
Code borrowing from Apache 2.0 to Apache 2.0 is by far the most prevalent, and as it is a permissive licence these can be considered “legal” borrowings. Problems arise when there is no licence (the “GitHub” category) or when the code copied has a restrictive licence such as the GNU General Public License (GPL). However, the finding that only 4% of blocks constitute strong violations, while not grounds for complacency, does indicate that licence violations are relatively rare.