If you work in analytics, you’ve likely encountered data in JSON format more than a few times, and with good reason. JSON is often used to transmit data over networks, and many web services and APIs now use JSON to serve data to consumers.
JSON is a relatively new concept compared to relational databases. Because of this, many SQL books and tutorials don’t cover the handy native JSON functions that many traditional relational databases (like Postgres and MySQL) and warehouses (like Snowflake and BigQuery) have to offer. To make this a little less abstract, I’ve included a detailed, practical example below on how to restructure JSON into a tabular format. (All queries below use Postgres syntax but can fairly easily be translated to other dialects.)
How to restructure data in a nested JSON column
Suppose you have a table titled survey_submissions with the following (messy) data.
The question_responses field is JSON, where keys correspond to question ids and values correspond to further nested JSON containing the question title and response.
user_id | survey_id | question_responses |
---|---|---|
1 | 1 | {"Q1": {"title": "How are you?", "response": "bored"}, "Q2": {"title": "What is 2 + 2?", "response": "6"}} |
2 | 1 | {"Q1": {"title": "How are you?", "response": "great!"}, "Q4": {"title": "What is 1 + 1?", "response": "2"}} |
1 | 3 | {"Q1": {"title": "Do you like taking surveys?", "response": "apparently"}} |
3 | 3 | NULL |
Answering questions like “How often is Q2 in survey 1 answered incorrectly?” or “How many responses are at least 5 characters long?” is difficult with the data in this format. Restructuring the data in a tabular format (with columns for question_id, title, and response) would make this much easier. Fortunately, you can do this with a fairly simple SQL query (so you don’t have to worry about loading it into Python/R and iterating over it). I’ll explain how to do this with Postgres below.
If we had one record per tuple (user_id, survey_id, question_id), answering the questions above would be a lot easier. How can we do this? The Postgres function jsonb_each(), which “expands the outermost JSON object into a set of key/value pairs,” is a great starting point. It’s worth pointing out that most commonly used SQL functions return an atomic/scalar value, but jsonb_each() returns a set of key/value pairs! This behavior might be a bit different from what you’re used to when writing SQL, so let’s see how it works with a quick example.
Slightly modifying the example from the Postgres documentation, we see that select key, value from jsonb_each('{"a":"foo", "b":"bar"}')
returns the following set of rows:
key | value |
---|---|
a | "foo" |
b | "bar" |
Note: jsonb_each() maps the atomic string '{"a":"foo", "b":"bar"}' to a set of rows.
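The same thing happens with the nested survey data, except that each value is itself a JSON object. As a quick sketch using the first payload from the table above, select key, value from jsonb_each('{"Q1": {"title": "How are you?", "response": "bored"}}')
returns:
key | value |
---|---|
Q1 | {"title": "How are you?", "response": "bored"} |
Those nested objects are what the query below drills into.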
Now that we know how to unnest the data we’re interested in, we need to join it back to the other columns in the table. One simple way to do this is with a LATERAL subquery. But what does a LATERAL subquery do? Referring back to the Postgres documentation, a LATERAL subquery can “refer to any items that are on the left-hand side of a JOIN that it is on the right-hand side of.”
In our case, a LATERAL subquery allows us to join the set of keys and values returned by jsonb_each() to the row those keys and values belong to. Let’s put this all together in a query:
WITH survey_submissions AS (
SELECT
*
FROM
(
VALUES
(1, 1, '{"Q1": {"title": "How are you?", "response": "bored"}, "Q4": {"title": "What is 2 + 2?", "response": "6"}}'::jsonb),
(2, 1, '{"Q1": {"title": "How are you?", "response": "great!"}, "Q2": {"title": "What is 1 + 1?", "response": "2"}}'::jsonb),
(1, 3, '{"Q3": {"title": "Do you like taking surveys?", "response": "apparently"}}'::jsonb),
(3, 3, NULL)
) AS v(user_id, survey_id, question_responses)
)
-- lines above just create the dataset, meat of the code is below
SELECT
user_id,
survey_id,
question.key AS question_id,
question.value->>'title' AS question_title,
question.value->>'response' AS question_response
-- the `->>` operator extracts a scalar from JSON and returns it as text
FROM
survey_submissions
CROSS JOIN LATERAL jsonb_each(question_responses) AS question
This query returns the data in a more familiar tabular format, which is much easier to run analytical queries against.
user_id | survey_id | question_id | question_title | question_response |
---|---|---|---|---|
1 | 1 | Q1 | How are you? | bored |
1 | 1 | Q2 | What is 2 + 2? | 6 |
2 | 1 | Q1 | How are you? | great! |
2 | 1 | Q4 | What is 1 + 1? | 2 |
1 | 3 | Q1 | Do you like taking surveys? | apparently |
Note that there are no records for user_id 3 in the result above. This is because jsonb_each(NULL) returns no rows and the query uses a CROSS JOIN. If you don’t want empty submissions filtered out, you can change the last line of the query to LEFT JOIN LATERAL jsonb_each(question_responses) AS question ON TRUE. Unsurprisingly, all fields derived from question_responses (question_id, question_title, question_response) will be null for that record.
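With the data in this shape, the questions from the beginning of this section reduce to simple aggregations. Here’s a rough sketch, assuming the output of the restructuring query has been saved as a view named question_level_responses (that name isn’t part of the query above):
-- How often is Q2 in survey 1 answered incorrectly? (the correct answer is '4')
SELECT
    count(*) AS incorrect_q2_responses
FROM
    question_level_responses
WHERE
    survey_id = 1
    AND question_id = 'Q2'
    AND question_response <> '4';
-- How many responses are at least 5 characters long?
SELECT
    count(*) AS long_responses
FROM
    question_level_responses
WHERE
    length(question_response) >= 5;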
Simplify analyses using SQL-native JSON functions
JSON is used just about everywhere nowadays, including in databases and warehouses. Using SQL to manipulate and restructure this data instead of reaching for a scripting language like Python streamlines your analysis workflow and makes it much easier to repeat the analysis when you or someone else revisits it.