This year at our annual Software Engineering Conference (SEC), EPAM launched something completely new to the world – the Open Source Contributor Index (OSCI) – to help track which commercial organizations contribute the most to open source. To get the full scoop on this exciting project, we interviewed Patrick Stephens, Delivery Director, EPAM, who was heavily involved in launching OSCI. Here’s what Patrick had to say about OSCI, its origins, its methodology and how to get involved.
What Is OSCI?
Without going into too much detail, OSCI is a ranking of contributions to GitHub by commercial organizations. It is an open source project that measures and tracks open source activity on GitHub.
What Was the Inspiration for OSCI?
A few important things happened leading up to OSCI that inspired us to take on the project ourselves. First, in 2016, GitHub published its own analysis of organizations with the most open source contributors, then went on to publish similar studies in 2017 and 2018. Next, on the freeCodeCamp platform, Felipe Hoffa published a detailed analysis of 2017 data with logic that identified organizations by email domain, counted commits only to projects with more than 20 stars during the period, and counted users with more than three pushes during the period. A few months later, Fil Maj’s analysis of 2017 data was published in InfoWorld. He ranked organizations by the number of users with 10 or more commits, identifying each user's organization by the company field in their profile. EPAM was ranked among the top 20, which was a pretty amazing achievement for a services company.
Following these events, seeing EPAM ranked so highly made us curious to know our own position, and that of others, two years later. While OSCI wasn’t a completely original idea, we were the first (that we know of) to create a formal, regularly updated version of the rankings.
What Kind of Ranking Methodology Does EPAM Use for OSCI?
The truth is, as evidenced by previous attempts, there’s no single right way to do it. After much analysis, we concluded it was best to use the email address of the commit author to identify the organization to which they belong. From there, OSCI tracks two measures. The first is the number of people who authored 10 or more commits in the current year. We call these the Active Contributors. The second is the number of people who made at least one commit in the year. We call this the Total Community.
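As an illustration only, the two measures described above can be sketched in a few lines of Python. This is not OSCI's published algorithm; the function names and the sample addresses are hypothetical, and the input is assumed to be a list holding one author email per commit made in the year.

```python
from collections import Counter

# From the interview: an Active Contributor authored 10 or more commits in the year.
ACTIVE_THRESHOLD = 10


def email_domain(email: str) -> str:
    """Return the lowercased domain part of an email address."""
    return email.rsplit("@", 1)[-1].lower()


def osci_measures(commit_author_emails):
    """Map each email domain to (active_contributors, total_community).

    `commit_author_emails` is assumed to hold one author email per commit.
    """
    # Count commits per person first, so people are ranked rather than commits.
    commits_per_author = Counter(e.lower() for e in commit_author_emails)
    active = Counter()
    total = Counter()
    for email, commit_count in commits_per_author.items():
        domain = email_domain(email)
        total[domain] += 1  # at least one commit: part of the Total Community
        if commit_count >= ACTIVE_THRESHOLD:
            active[domain] += 1  # 10 or more commits: an Active Contributor
    return {domain: (active[domain], total[domain]) for domain in total}


# Hypothetical sample: one active author, one occasional author, one outsider.
sample = ["ann@example.com"] * 12 + ["bob@example.com"] * 3 + ["cy@example.org"]
print(osci_measures(sample))
```

Grouping by author before grouping by domain is what keeps a single prolific committer from inflating an organization's score.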
How Does EPAM Ensure the Accuracy of OSCI Rankings?
We analyze all the push event data from GH Archive, an open project which records the public GitHub timeline, archives it and makes it easily accessible for further analysis. The OSCI rankings are then based on the number of people making commits, rather than the number of commits itself. This is because the GitHub push event data includes large numbers of pushes made by automated processes using an email domain from an organization, so counting the number of commits is unreliable as a measure of user activity.
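To make the "people, not commits" point concrete, here is a minimal sketch that reads push events and compares the two counts. It assumes each input line is a JSON record shaped like a GitHub PushEvent (a `type` field plus `payload.commits` entries carrying `sha` and `author.email`); the helper names and the sample data are hypothetical and this is not taken from the OSCI code base.

```python
import json


def commit_authors(event_lines):
    """Yield one author email per distinct commit SHA found in PushEvent records."""
    seen_shas = set()
    for line in event_lines:
        event = json.loads(line)
        if event.get("type") != "PushEvent":
            continue  # other event types carry no commit list
        for commit in event.get("payload", {}).get("commits", []):
            sha = commit.get("sha")
            email = commit.get("author", {}).get("email")
            if not sha or not email or sha in seen_shas:
                continue  # skip incomplete records and commits already counted
            seen_shas.add(sha)
            yield email.lower()


def distinct_people(event_lines):
    """Return the set of distinct author emails across all push events."""
    return set(commit_authors(event_lines))


def make_push(email, shas):
    """Build a hypothetical PushEvent line for the demonstration below."""
    commits = [{"sha": s, "author": {"email": email}} for s in shas]
    return json.dumps({"type": "PushEvent", "payload": {"commits": commits}})


# An automated account pushes five commits; a person pushes one.
sample_lines = [
    make_push("bot@example.com", ["a1", "a2", "a3", "a4", "a5"]),
    make_push("dev@example.com", ["b1"]),
    json.dumps({"type": "WatchEvent", "payload": {}}),
]
print(len(list(commit_authors(sample_lines))), "commits")
print(len(distinct_people(sample_lines)), "people")
```

In this made-up sample the commit count is dominated by the automated account, while the people count treats both authors equally.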
Accurately identifying commit authors from commercial organizations is tough, since many people use their own private email addresses. Even so, our analysis showed that the email domain is the most reliable unique identifier for OSCI.
Finally, in the spirit of complete transparency, OSCI’s algorithm is published as an open source project on GitHub. As such, it is open to anyone and everyone online who would like to contribute and help make OSCI the most accurate measure available in the market today.
What Do You Say to People Who Question the Methodology of the OSCI Rankings?
This is our first version of OSCI, and we are aware that there are many directions we could take to extend and improve it. Internally at EPAM, we built the first version of OSCI on the Azure cloud using MS SQL. However, the public version does not require Azure, and our backlog contains tasks to also support an open source database. We are very open to suggestions and welcome anyone who wants to get involved and help us make it better. This is another reason why we’ve made OSCI an open project on GitHub: anyone can contribute to improving it.
What Plans Do You and Your Team Have for the Future of OSCI?
First and foremost, we’ll be updating the OSCI rankings monthly on a dedicated website.
Beyond that, the timing for future improvements isn’t really set in stone given that OSCI just launched, but there are a few things I can mention that we’re excited about. Our three main goals are as follows:
- Extend the rankings beyond the current top 30 to at least the top 50 and then perhaps 100.
- Generate the OSCI rankings back to 2016 so that we can publish the trends over a four-year period.
- Support an open source database as well as MS SQL.
We also have a large backlog of other tasks and suggestions from the great feedback we have received so far.
How Can People Get Involved in OSCI?
OSCI is a new project, and we would be delighted if anyone would like to collaborate with us on GitHub. We also welcome feedback which will help us extend and improve OSCI. Contact us at [email protected].
Patrick Stephens joined EPAM Systems in 2001. In his many years at EPAM, Patrick has worked from various locations, including the UK, US, Spain, Malaysia and Ireland, to deliver projects for some of our biggest clients. His interest in free and open source software (FOSS) and the development of OSCI grew out of seeing how much open source activity is done at EPAM and recognizing the potential to support it.