Working at Mode, I get to talk with a lot of analysts and data scientists.
These talks have led to a ton of insightful conversations about how we all like to work. The usual suspects come up: process, technologies, best practices, etc. Of all the go-to topics, one seems to crop up more often than the rest: language.
This conversation is usually dominated by the big three: SQL, Python, and R. When I ask other analysts about the extent to which they use SQL and Python a few distinct clusters appear: analysts who mainly use SQL, analysts who mainly use Python, and a smaller group who uses both.
It's time to bridge the gap and make the small group of folks who switch back and forth between SQL and Python bigger. In this new series, aptly named Bridge the Gap, we're going to explore this in-between world in hopes of helping people expand their knowledge of both languages and get their work done faster, using the best tools for the job.
As a means of introduction, let's take a look at recent trends in the data science and analytics landscape, which can help show why it’s becoming increasingly advantageous to have a deep understanding of both SQL and Python. Then throughout this series, we’ll focus on explaining how a hybrid model of analytics can achieve a more harmonious relationship between the two languages (and we'll touch on R later in the series). Specifically, we're going to be focusing on their similarities, rather than their differences.
SQL Ages Like a Fine Wine
Even after 40+ years (!) of existence, SQL remains one of the most ubiquitous programming languages. StackOverflow’s 2017 Developer Survey shows that 51.2% percent of all respondents use SQL, making it the second most-utilized programming language in existence, right behind Java. Looking strictly at data scientist and data engineer respondents, this number increases to 58%.
It cannot be understated how impressive this is for a 40+ year-old programming language. Not only does SQL continue to stand the test of time, but it becomes even more pervasive within the analytics and data science community as it ages. Take a look at any number of data analyst and data scientist job postings, and you’ll notice a trend: deep analytical SQL expertise is no longer a luxury, but a necessity.
Two important characteristics account for SQL's staying power. First, its declarative nature (i.e. programming in terms of what you want done, rather than how to do it) makes it easy to learn, write, and comprehend. The logistics of control flow are abstracted away in favor of simplicity and readability. This accessibility has been vital in fostering SQL’s ubiquity, particularly within the analytics and data science community.
Second, and perhaps more important, the evolution of data infrastructure (i.e. performant data warehouses such as BigQuery and Redshift) has transfigured SQL into a language suddenly capable of complex data transformations. This evolution has significantly broadened the scope of what is possible using SQL.
Ad-hoc analytical SQL queries that previously may have taken days to complete now take minutes, or even seconds. As a result, analysts, data scientists, and data engineers are able to iterate significantly faster and go significantly further using SQL alone than they ever have before.
Python Grows Faster Than Ever
In September, StackOverflow revealed evidence to suggest that Python is the fastest growing major programming language. Naturally, this left many of us asking the next logical question. Why? Python has long attracted a dedicated community that adores it for its expressiveness and versatility. But it began to seem implausible that this alone could be the driving force of Python’s accelerating growth.
Shortly after it’s publication, StackOverflow put a magnifying glass to this accelerating growth rate as an attempt to reveal the driving factors. Their analysis affirms what many of us surmised. The fastest-growing use case of Python is for data science, machine learning, and academic research. Particularly, the pandas library is the fastest-growing Python-related tag on StackOverflow.
Pandas is a python data analysis library focused on making analysis of structured or tabular data fast, easy, and expressive. Since pandas is designed to work with structured data, it shares many similarities to other structured data manipulation languages, such as SQL. Pandas provides all of the data manipulation capabilities typically provided in relational databases, within the high-performance, array-based computation environment of Python. Its incredible growth in popularity has been accompanied by similar growth in other popular and related Python data science libraries, such as NumPy, matplotlib, Tensorflow, and Keras.
Pandas has been critical in transforming Python into more of an “end-to-end” data science language. When used in tandem with other popular Python data science libraries like those mentioned above, Python becomes a powerful language suitable for the wrangling, preparation, cleaning, exploration, analysis, modeling, and visualization of data. This paradigm shift has changed the way we think about Python as it relates to other data manipulation languages. The future of Python for data science looks brighter still, with a passionate and ever-growing open source community driving innovation at unprecedented speed.
State of the UNION
But what about the folks who already use a hybrid model of SQL and Python for their analytics work? When asked about the specifics of how they apply this hybrid approach, the answer typically falls along the same lines; Python starts where SQL ends. Any analytics work that can reasonably be performed using SQL, will be. Python then picks up where SQL leaves off.
There isn’t anything inherently wrong with this model. People should feel empowered to use whatever languages they’d like to get the job done. But this model perpetuates the idea that these languages are complementary, yet functionally discrete when it comes to data analysis. It fails to acknowledge a sizable overlap of shared functionality. Exploration of this middle ground can improve our understanding of the capabilities of both languages.
Our goal for this series is to shift the mindset away from thinking about these two languages in terms of how they are different, and more in terms of how they are alike. When we understand how the languages overlap, we can make smarter decisions about which to use and when.
We’ll do this by relating common analytical operations in the two languages, starting with window functions, unions and joins. Additionally, these overlap areas should provide good entry points for folks to learn one language using their existing knowledge of the other.
The first function we'll examine in this series is one that many us use on a daily basis, Window Functions.