If you were like me, you spent the week of November 3rd constantly refreshing the live feed from FiveThirtyEight in one browser window and The New York Times in the other. This was a tight and stressful presidential election. But if you looked at election polls from news outlets over the previous two months, and took their results at face value, you would be surprised at how close the election turned out to be.
So what gives? Isn’t data science supposed to be a panacea for all our problems? Didn’t we learn our lessons from polling in 2016? These are all valid questions to be asking. And at the end of the day, I think the answer to these questions can be distilled into the age-old data science adage: garbage in, garbage out.
New York Times journalist Nate Cohn clearly echoes the garbage-in, garbage-out phenomenon in his thoughtful post-mortem on the 2020 election polling. In his post, he compares the polling in 2020 to that in 2016:
This is a deeper kind of error than ones from 2016. It suggests a fundamental mismeasurement of the attitudes of a large demographic group, not just an underestimate of its share of the electorate. Put differently, the underlying raw survey data got worse over the last four years, canceling out the changes that pollsters made to address what went wrong in 2016.
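Cohn's point about "fundamental mismeasurement" is worth making concrete. Here's a toy calculation (the numbers are my own illustration, not from any actual poll) showing why this kind of error is so stubborn: if one candidate's supporters are systematically less likely to answer the survey at all, interviewing more people, or re-weighting by demographics, doesn't fix it.

```python
# Toy illustration of systematic nonresponse bias (hypothetical numbers).
# Suppose the true electorate is 50% candidate A / 50% candidate B, but
# A's supporters answer the survey only 80% as often as B's supporters.

true_share_a = 0.50      # actual support for candidate A
response_rate_a = 0.8    # A supporters' relative willingness to respond
response_rate_b = 1.0    # B supporters respond at the baseline rate

# Share of completed interviews that come from A supporters:
responders_a = true_share_a * response_rate_a
responders_b = (1 - true_share_a) * response_rate_b
polled_share_a = responders_a / (responders_a + responders_b)

bias = polled_share_a - true_share_a
print(f"poll shows A at {polled_share_a:.1%} (bias: {bias:+.1%})")
# poll shows A at 44.4% (bias: -5.6%)
```

No matter how large the sample gets, the poll converges to the biased 44.4%, not the true 50%: sampling error shrinks with sample size, but this kind of bias doesn't. That's what makes it a deeper problem than the 2016 errors, which could be partly patched by re-weighting.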
So the big question that I and other pundits are asking is: Should we even have these polls?
I’m not going to pretend to be a political polling expert here, despite sharing a name with prominent voices in the space, Nate and Nate. However, I do think the question of whether we should produce political polls is akin to asking whether we should be using imperfect data to inform business decisions, which is something we think a lot about here at Mode.
In order to answer this question of whether we should have political polls, let’s imagine a world without them. Where would we get information about the attitudes and views of our compatriots? We would certainly still gather this data for ourselves, but instead of using polling that aims to find representative samples, we would have to rely on the trends that we observe on social media, and the opinions of our friends and family.
I think the same goes for using data to make smarter business decisions. We often don’t have the ideal dataset to determine the next feature we should build, or how to price our products, so what do we use in the absence of this data? The loudest voice in the room?
When there is a decision to make, what are we to do? We know our gut feelings aren’t always reliable, so should we just lean on the mirage of epistemic certainty that data provides?
I think the answer is not so black-and-white. There is value in the model we all have—one that no data scientist can outbuild today—the human mind. But there is also value in the objectivity of data, which can help us overcome our human biases.
To find our answer, we must ask one more question: What role should data play in making this decision?
In an interview with The New Yorker, Nate Cohn provides an example of the weighty influence polling data can have on far-reaching policy decisions:
I think that Barack Obama and establishment Republicans like Marco Rubio and Jeb Bush and so on supported immigration reform after the 2012 election because they thought Hispanic voters decided it in Barack Obama’s favor based on exit polls. And so hitting those demographic trends right and telling the stories of these elections accurately has a huge effect on the course of politics in our country.
However, one has to question the degree to which we allow polling to inform political agendas, which can affect tens of millions of people. I have to wonder how much the validity of the polling data was interrogated, and to what degree other strategies were used to inform a focus on immigration reform.
In the final days of the 2020 election, polls showed Ohio in play for President-Elect Biden, and he used valuable time to make a campaign stop in the state two days before the election. As it turned out, Trump won the state comfortably.
Surely polling data in Ohio that showed a neck-and-neck race played a part, possibly too big of a part. Nate Silver echoes this sentiment about the unrealistically high expectations we have for the certainty of polling data in his analysis of 2020 poll performance:
So if you’re coming to the polls for strong probabilistic hints of what is going to happen, they can provide those — and the hints will usually lead you in roughly the right direction, as they did this year. But if you’re looking for certainty, you’ll have to look elsewhere.
So to the business leaders and data teams out there—are you relying on shaky data for important decisions? Have you considered what other means you have to make this decision? Don’t have data tunnel vision. Ask yourself, what role should data play in this decision?