The Power of BigQuery with Public Data

TLDR — If you are looking for an easy way to query a public dataset, you should definitely check out BigQuery’s publicly available datasets; chances are you will find the one you are looking for.

Background

Late one Sunday afternoon, I found myself browsing a few high-profile JVM projects on GitHub and noticed that they all used different build tools, spread across Gradle, Maven, and SBT. After seeing this, I started wondering what the actual distribution of build tools on GitHub was. Clearly, I wasn’t going to click my way through enough repos to get a representative sample, so I had to find some hard numbers to answer my question.

First Attempt

My first attempt to answer this simple question was the GitHub search box. After trying different combinations, even within the advanced search, I found it really difficult to get useful results because of too much noise. The main problem was that when searching for pom.xml, the place it showed up most often was in .gitignore files, which was not relevant to my question. This itself seemed a bit strange to me: why would projects not want to version control their build files? Is the pom.xml being generated as part of the build itself? I’ll leave these interesting questions for another afternoon.

The Solution

Soon after this, I remembered that BigQuery has public datasets available for anyone to use (for free), and decided to take a quick peek to see if the GitHub dataset was there. Sure enough, it was. Great. Exploring the github_repos dataset, I found a table called files, which looked very promising because it should have one row per file on GitHub, right? That would let me write a query that looks for the specific build files within a project and then does a simple count. The files table had the following shape:

repo_name STRING NULLABLE
ref STRING NULLABLE
path STRING NULLABLE
mode INTEGER NULLABLE
id STRING NULLABLE
symlink_target STRING NULLABLE

After some simple data exploration, I was able to get a good idea of the meaning of the data with the following query:

SELECT * FROM [bigquery-public-data:github_repos.files] WHERE path = 'build.sbt' LIMIT 2

yielding:

Row repo_name ref path mode id symlink_target
1 Knewton/kafka-manager refs/heads/master build.sbt 33188 934503… null
2 Radsaggi/zipkin refs/heads/master build.sbt 33188 634bab3… null

Clearly, this was the right table to answer my question about counting build tool usage.

The next step was to come up with a query to count the instances of each build file, that is, count the number of build.gradle, build.sbt, and pom.xml files across the GitHub dataset. The results should give a good indication of how many projects use each build tool.

Here is that query:

SELECT sbt.cnt, gradle.cnt, maven.cnt
FROM
  (SELECT count(*) cnt FROM [bigquery-public-data:github_repos.files] WHERE path = 'build.sbt') sbt
CROSS JOIN
  (SELECT count(*) cnt FROM [bigquery-public-data:github_repos.files] WHERE path = 'pom.xml') maven
CROSS JOIN
  (SELECT count(*) cnt FROM [bigquery-public-data:github_repos.files] WHERE path = 'build.gradle') gradle
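As an aside, the square-bracket table names mark this as BigQuery’s legacy SQL dialect. A rough equivalent in today’s Standard SQL can compute all three counts in a single scan with COUNTIF; this is an untested sketch, not something I ran against the dataset:

```sql
-- Standard SQL sketch: one scan of the table instead of three subqueries.
-- COUNTIF counts only the rows matching each predicate.
SELECT
  COUNTIF(path = 'build.sbt')    AS sbt_cnt,
  COUNTIF(path = 'build.gradle') AS gradle_cnt,
  COUNTIF(path = 'pom.xml')      AS maven_cnt
FROM `bigquery-public-data.github_repos.files`
```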

which yielded the following results:

Row sbt_cnt gradle_cnt maven_cnt
1 15,628 91,528 116,535

There you have it: based on the data in this dataset, Maven outnumbers both Gradle and SBT.¹ If I were to take this experiment to the next level, I would join this data with each repository’s most prevalent language (Java, Scala, etc.) to see which build tool is most often used with which language.
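That follow-up could lean on the same dataset’s languages table, which (if I remember its shape correctly) has one row per repo with a repeated language field of (name, bytes) pairs. A rough, untested sketch in Standard SQL, using build.gradle as an example and taking each repo’s first listed language:

```sql
-- Sketch: which languages dominate among repos that contain a build.gradle.
-- Assumes github_repos.languages has repo_name plus an array `language` of
-- STRUCT<name STRING, bytes INT64>; SAFE_OFFSET avoids errors on empty arrays.
SELECT
  l.language[SAFE_OFFSET(0)].name AS language,
  COUNT(DISTINCT f.repo_name)     AS repos
FROM `bigquery-public-data.github_repos.files` f
JOIN `bigquery-public-data.github_repos.languages` l
  ON f.repo_name = l.repo_name
WHERE f.path = 'build.gradle'
GROUP BY language
ORDER BY repos DESC
```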

If you are looking for an easy way to query a public dataset, you should definitely check out BigQuery’s publicly available datasets; chances are you will find the one you are looking for. The great thing about using BigQuery with public datasets is that no ETL is required. The only² thing needed to run a query against a public dataset is to head to the BigQuery Console, pick a dataset, type in the query, and within seconds you will have results to review.

What are you waiting for?

Notes

¹: According to the metadata on the files table, it hasn’t been touched since Nov 2, 2017, so this data is a bit old. Also keep in mind that this is a very unscientific way of measuring build tool usage, so take these numbers for what they are: a fun weekend experiment.

²: Signing up for a free GCP account is required to gain access to the BigQuery Console, although a credit card is not required to run queries. However, a relatively low quota applies if you do not provide one.
