We’ll walk through how to deal with nested data using Pandas (for example - a JSON string column), transforming that data into a tabular format that’s easier to deal with and analyze.
Import Modules
import pandas as pd
import json
Create a Test Dataset
age=[21,22,23]
nested_data=[
'''{"occupation":"nurse", "interests":"movies"}''',
'''{"occupation":"teacher", "favorite_food":"spaghetti"}''',
'''{"occupation":"doctor"}'''
]
data = pd.DataFrame(dict(age=age, nested_data=nested_data))
data
age | nested_data | |
---|---|---|
0 | 21 | {"occupation":"nurse", "interests":"movies"} |
1 | 22 | {"occupation":"teacher", "favorite_food":"spag... |
2 | 23 | {"occupation":"doctor"} |
Load the JSON string into a dictionary and then convert it into a Series object. This flattens out the dictionary into a table-like format. Notice how this creates a column per key, and that NaNs are intelligently filled in via Pandas.
exploded = data.nested_data.apply(json.loads).apply(pd.Series)
exploded
occupation | interests | favorite_food | |
---|---|---|---|
0 | nurse | movies | NaN |
1 | teacher | NaN | spaghetti |
2 | doctor | NaN | NaN |
Last - we’ll drop the orignial nested column and concatenate the exploded version to create our final dataset.
pd.concat([data.drop(columns='nested_data'), exploded], axis=1)
age | occupation | interests | favorite_food | |
---|---|---|---|---|
0 | 21 | nurse | movies | NaN |
1 | 22 | teacher | NaN | spaghetti |
2 | 23 | doctor | NaN | NaN |