Did you know that a project is not truly open source unless it has a license? Through our work on the Open Source Contributor Index (OSCI), our team used GH Archive and the GitHub API to analyze data about the open source community and how open source licenses are used. As many of you know, choosing a license is a very important decision for every open source project. When digging into the usage of open source licenses in more detail, we observed many interesting trends across GitHub, as well as among the top contributing companies in the OSCI. To conduct this analysis, we examined the license choice of new public repositories created on GitHub from the beginning of 2018 through to mid-2020. We also studied a year of data on GitLab to compare the patterns on this popular open source hosting platform.
Before delving into the details of license usage, it’s worth noting that the number of repositories created on GitHub has grown significantly over the last two-and-a-half years, as shown in the chart below.
Number of New Public Repositories Created on GitHub
Trend in License Usage Since 2018
Looking at the repositories created since the beginning of 2018, a few trends stand out. First, 34% of repositories do not contain a license file, making their status as open source questionable. Second, 21% of repositories contain a license that GitHub does not recognize as a standard license type. This is typically because the license file contains custom license text, although frequently that text is only a minor edit of a standard license. Finally, and most importantly, Apache 2.0 and MIT are the two most popular license types, together accounting for over 35% of all repositories.
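The three categories described above (no license, custom or unrecognized license, standard license) can be sketched in code. This is a minimal illustration, not the pipeline used for the analysis: it assumes repository records shaped like the repository objects returned by the GitHub REST API, where the `license` field is `null` when no license is detected and carries an `spdx_id` of `NOASSERTION` when a license file is present but not recognized. The function name `license_bucket` and the sample records are hypothetical.

```python
from collections import Counter


def license_bucket(repo):
    """Place one repository record into a license category.

    Assumes `repo` is a dict shaped like a GitHub REST API repository object.
    """
    lic = repo.get("license")
    if lic is None:
        # No license was detected for the repository.
        return "no license"
    spdx_id = lic.get("spdx_id")
    if spdx_id is None or spdx_id == "NOASSERTION":
        # A license file exists, but it did not match a standard license text.
        return "custom"
    return spdx_id


# Hypothetical sample records, for illustration only.
sample_repos = [
    {"name": "a", "license": {"key": "mit", "spdx_id": "MIT"}},
    {"name": "b", "license": None},
    {"name": "c", "license": {"key": "other", "spdx_id": "NOASSERTION"}},
    {"name": "d", "license": {"key": "apache-2.0", "spdx_id": "Apache-2.0"}},
    {"name": "e", "license": None},
]

bucket_counts = Counter(license_bucket(repo) for repo in sample_repos)
print(bucket_counts)
```

Counting the buckets over a larger set of records would give the kind of breakdown discussed here, with "custom" standing in for license files that are not a recognized standard text.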
When we exclude repositories that do not have a license file, we find that
- Over half of the repositories use either the Apache 2.0 or MIT license,
- One third of repositories use some form of custom license text, and
- The remaining 13% use a wide variety of licenses, of which BSD and the variants of the GNU General Public License (GPL) are the most common.
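The step from shares of all repositories to shares of licensed repositories is a simple rescaling. Here is a rough sketch using the rounded figures quoted earlier. The helper name `renormalize_shares` is hypothetical. The `other` value is just the remainder after the three quoted figures, so it is an assumption rather than a reported number.

```python
def renormalize_shares(shares, excluded):
    """Rescale percentage shares so the remaining categories sum to 100."""
    kept = {name: value for name, value in shares.items() if name not in excluded}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("nothing left after exclusion")
    return {name: 100.0 * value / total for name, value in kept.items()}


# Rounded figures quoted in the text; "other" is simply the remainder
# (100 - 34 - 21 - 35) and is an assumption, not a reported number.
all_repo_shares = {
    "no_license": 34.0,
    "custom": 21.0,
    "apache_or_mit": 35.0,
    "other": 10.0,
}

licensed_shares = renormalize_shares(all_repo_shares, excluded={"no_license"})
for name, pct in licensed_shares.items():
    print(f"{name}: {pct:.1f}%")
```

With these rounded inputs, Apache 2.0 plus MIT comes out at about 53% and custom texts at about 32%, in line with "over half" and "one third". The remainder comes out slightly above the quoted 13% because the combined Apache 2.0 and MIT share is given only as "over 35%".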
Breaking down these totals across the past two-and-a-half years, the trends shown in the image below suggest a small decline in the use of Apache 2.0 and a small growth in the popularity of the MIT license. There may also be a decline in the use of custom license texts, although with just six months of data from 2020, it may be premature to draw this conclusion. Most interestingly of all, there is a steady growth in the number of repositories created without a license file, despite GitHub's advice to the contrary. The data suggests that many individual contributors do not understand the importance of including a license file in the projects they open source.
Next, we compared the total popularity of Copyleft licenses, such as the GPL family, with that of permissive licenses, such as Apache, MIT and BSD. The graph below shows that Copyleft licenses are used in less than 15% of all projects, a share that has been fairly consistent over the last two-and-a-half years. As we have already seen in earlier graphs, the popular Copyleft licenses are GPL 3.0 (and GPL 2.0), whereas Apache 2.0 and MIT dominate among the permissive license types.
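One way to picture the Copyleft versus permissive comparison is to map each license identifier to a family and compute the share of each family. The sketch below is illustrative only. It assumes SPDX-style identifiers such as `GPL-3.0` and `Apache-2.0`, maps only the licenses named in this post, and leaves other Copyleft families (for example LGPL and AGPL) in an "other" bucket for brevity. The names `license_family` and `family_shares` and the sample list are hypothetical.

```python
# Only the licenses named in the text are mapped; anything else is "other".
PERMISSIVE_IDS = {"Apache-2.0", "MIT", "BSD-2-Clause", "BSD-3-Clause"}


def license_family(spdx_id):
    """Return 'copyleft', 'permissive', or 'other' for an SPDX-style identifier."""
    if spdx_id.startswith("GPL-"):
        return "copyleft"
    if spdx_id in PERMISSIVE_IDS:
        return "permissive"
    return "other"


def family_shares(spdx_ids):
    """Return the percentage of identifiers that fall into each family."""
    counts = {"copyleft": 0, "permissive": 0, "other": 0}
    for spdx_id in spdx_ids:
        counts[license_family(spdx_id)] += 1
    total = len(spdx_ids)
    if total == 0:
        return {family: 0.0 for family in counts}
    return {family: 100.0 * n / total for family, n in counts.items()}


# Hypothetical sample, for illustration only.
sample_ids = [
    "MIT", "Apache-2.0", "GPL-3.0", "MIT",
    "BSD-3-Clause", "GPL-2.0", "MIT", "Apache-2.0",
]

shares = family_shares(sample_ids)
print(shares)
```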
An interesting finding from our analysis is that a sizeable number of repositories still use the older versions of the BSD and GPL licenses. In summary, across 2018 and 2019:
- 352 repositories use BSD 2 and 1,562 use the newer BSD 3
- 841 repositories use GPL 2.0 and 2,114 use GPL 3.0
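To put "sizeable" in perspective, the counts above can be turned into shares within each license family. The short calculation below uses only the four counts listed. The helper name `older_version_share` is hypothetical and follows the text's framing of which version is older.

```python
def older_version_share(older_count, newer_count):
    """Percentage of a license family's repositories that use the older version."""
    total = older_count + newer_count
    if total == 0:
        raise ValueError("no repositories in this license family")
    return 100.0 * older_count / total


# Counts taken from the list above (2018 and 2019 combined).
bsd_2_share = older_version_share(352, 1562)
gpl_2_share = older_version_share(841, 2114)

print(f"BSD 2 share of BSD 2 + BSD 3: {bsd_2_share:.1f}%")
print(f"GPL 2.0 share of GPL 2.0 + GPL 3.0: {gpl_2_share:.1f}%")
```

On these counts, BSD 2 accounts for roughly 18% of the BSD repositories and GPL 2.0 for roughly 28% of the GPL repositories.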
License Usage in the Top 5 Companies Ranked in the OSCI
In the next step of our analysis, we focused on license popularity among the top five companies ranked in the OSCI. We examined the repositories of the most popular GitHub organizations established by these companies. The focus on commercial organizations paints a different picture from the one we saw across all of the repositories we analyzed on GitHub. Apache 2.0 is the most popular license by far, followed by custom license texts. The MIT license is the only other standard license with any significant popularity. Copyleft licenses are rarely used. Finally, there is a non-trivial number of repositories with no license file, although a manual review of a sample showed that these are largely non-code repositories (examples or documentation).
When examining each of the top five companies individually, the results are interesting and suggest different corporate preferences.
In the image above, Apache is by far the most preferred license at Google, IBM and Red Hat. At Microsoft, the majority of licenses are custom text, with MIT as the favored standard license type. After manually examining some of the custom license texts, we found that these are, in fact, often MIT (for code repositories) and Creative Commons (for documentation). By contrast, Intel appears to use a much greater variety of license types, with Apache being the most preferred, followed by custom license texts and 3-Clause BSD. A manual study of the custom license texts in Intel's repositories shows a mix of texts based on Apache 2.0, 3-Clause BSD and other standard license types.
GitLab Analysis
Finally, we looked beyond GitHub and performed a similar analysis of recent open source license popularity on the GitLab hosting platform. Across a 12-month period from Q2 2019 to the end of Q1 2020, a very different pattern emerged compared with our GitHub findings. Notably, 77.7% of public repositories created in this period have no license file. This once again suggests that developers do not appreciate the necessity and value of choosing an open source license. It may also reflect a difference in the users who create open source projects on GitLab versus GitHub, with more individual usage relative to corporate usage.
Excluding those repositories with no license file, the image below shows MIT as the most popular license at 37%, followed by custom license text at 21%, GPL 3.0 at 17%, and Apache 2.0 at 10%. In summary, permissive license types are again the most popular on GitLab, but MIT is the leader and Apache 2.0 usage is much lower than on GitHub. Copyleft licenses have a similarly small share on both GitLab and GitHub.
Conclusion
Our study has produced many interesting findings:
- We see an increasing trend towards permissive license types, with Apache 2.0 and MIT as the clear leaders.
- There is a small use of Copyleft license types.
- A growing number of repositories have been created without a license, suggesting that individual developers, in particular, may not understand the legal aspects of open source.
- Custom license types are popular, especially among commercial organizations, although in many cases these appear to be based on standard license types.
Through our OSCI data analysis, we will continue to keep our finger on the pulse of the open source community and share our trends and analysis with you!
Find out more...
Visit the OSCI Ranking page and see our latest rankings.