Dealing with Nested Data in Pandas

Categories: coding

We’ll walk through how to deal with nested data using Pandas (for example - a JSON string column), transforming that data into a tabular format that’s easier to deal with and analyze.


Import Modules

import pandas as pd
import json

Create a Test Dataset

age=[21,22,23]
nested_data=[
    '''{"occupation":"nurse", "interests":"movies"}''',
    '''{"occupation":"teacher", "favorite_food":"spaghetti"}''', 
    '''{"occupation":"doctor"}'''
]

data = pd.DataFrame(dict(age=age, nested_data=nested_data))
data
age nested_data
0 21 {"occupation":"nurse", "interests":"movies"}
1 22 {"occupation":"teacher", "favorite_food":"spag...
2 23 {"occupation":"doctor"}

Load the JSON string into a dictionary and then convert it into a Series object. This flattens out the dictionary into a table-like format. Notice how this creates a column per key, and that NaNs are intelligently filled in via Pandas.

exploded = data.nested_data.apply(json.loads).apply(pd.Series)
exploded
occupation interests favorite_food
0 nurse movies NaN
1 teacher NaN spaghetti
2 doctor NaN NaN

Last - we’ll drop the orignial nested column and concatenate the exploded version to create our final dataset.

pd.concat([data.drop(columns='nested_data'), exploded], axis=1)
age occupation interests favorite_food
0 21 nurse movies NaN
1 22 teacher NaN spaghetti
2 23 doctor NaN NaN
comments powered by Disqus