Ordinary least squares (OLS) regression is one of the most powerful tools in a data scientist’s pocket. Personally, as someone who specializes in causal inference, I have built most of my career on OLS and closely related analyses—from t-tests up through structural time series models. But we’ve all had to work with simpler tools than we might like from time to time, and I have gotten a lot of mileage out of doing regression analyses directly in SQL.
I’m not the first person to implement regression using SQL. In fact, Mode’s own SQL tutorial includes an example of running an A/B test analysis in SQL, implemented using Student’s t-test—which is just a special case of a simple linear regression.
Often, that’s all you need, but sometimes it’s not.
Sometimes you need to remove seasonality from a time series before you can pipe it into a data visualization but going round-trip through Python just isn’t in the cards. Sometimes you want to run a t-test while controlling for a continuous covariate. Sometimes you want to run a regression on ranks. And sometimes, you want to do any one of the other thousand problems that nobody has bothered to post a SQL recipe for.
My suggestion? Old introductory statistics textbooks.
It turns out that a lot of those old formulas for running statistical analyses by hand with paper and pencil (and maybe, if you’re lucky, a slide rule), are very easy to adapt to SQL code. Or, if not easy, at least possible.
It turns out that a lot of those old formulas for running statistical analyses...are very easy to adapt to SQL code.
Statistical formulas like linear regression are often explained in these older texts by using a table of numbers beginning with X (the predictor) and Y (the outcome), and then by adding more columns off to the right with derived quantities finally summing those columns at the bottom of the page. It ends up looking almost exactly like SQL.
We’re publishing a new whitepaper on how to do linear regression in SQL. Yes, I used an outdated statistics textbook when writing it.
You can download the whitepaper here, with all the gory details. I think it’s an extraordinarily useful little trick to have in your back pocket.
But an even more useful trick is to draw on the resources of the last hundred years of statistical research, including the basics.
Especially the basics.
If your measurements are clean and meaningful, if your A/B tests are well designed and powerful, you might rarely need anything more than the “basics.” Very often, good statistics are boring statistics and exciting business insights, not the other way around.