保存到DataFrame的每种产品/嵌套词典的组合

Question

保存到DataFrame的每种产品/嵌套词典的组合

我是Python的新手，正在为X数量的模型运行初始化参数。我需要从N个字典创建每个可能的组合，每个字典都具有嵌套数据。

我知道我需要以某种方式使用itertools.product，但是我在如何浏览字典上受阻。也许我什至不应该使用字典，而是json之类的东西。我也知道这将创建很多参数/运行。

编辑：添加了注释的说明。我想创建一个需要n个字典的函数-例如。def func（dict *）----作为输入，并创建所有字典中所有单个键/值对的所有可能组合，并返回一个包含所有组合的大DF。

我的数据如下所示：

词典1

{
    "chisel": [
        {"type": "chisel"},
        {"depth": [152, 178, 203]},
        {"residue incorporation": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]},
        {"timing": ["10-nov", "10-apr"]},
    ],
    "disc": [
        {"type": "disc"},
        {"depth": [127, 152, 178, 203]},
        {"residue incorporation": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]},
        {"timing": ["10-nov", "10-apr"]},
    ],
    "no_till": [
        {"type": "user_defined"},
        {"depth": [0]},
        {"residue incorporation": [0.0]},
        {"timing": ["10-apr"]},
    ],
}

Run Code Online (Sandbox Code Playgroud) 词典2

{
    "nh4_n":
        {
            "kg/ha":[110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225],
            "fertilize_on":"10-apr"
        },
    "urea_n":
        {
            "kg/ha":[110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225],
            "fertilize_on":"10-apr"
        }
}

Run Code Online (Sandbox Code Playgroud) 词典3

{
    "maize": {
        "sow_crop": 'maize',
        "cultivar": ['B_105', 'B_110'],
        "planting_dates": [
            '20-apr', '27-apr', '4-may', '11-may', '18-may', '25-may', '1-jun', '8-jun', '15-jun'],
        "sowing_density": [8],
        "sowing_depth": [51],
        "harvest": ['maize'],
    }
}

Run Code Online (Sandbox Code Playgroud)

例如，使用上面的三个词典，我将字典“ chisel”和itertools.product以某种方式与字典2中的每个嵌套字典（例如，“ nh4_n”）和字典3中的每个嵌套字典（在这种情况下，只有一个，因此每个品种，种植日期等）。我还想将每个键值对中的键用作DF列标题。

在此处输入图片说明

Answer 1

Tre*_*ney 3

问题：

主要问题是数据字典格式不一致：
1. 与 dict 1 和 3 不同，dict 2 的顶键不是子键的值
2. 与字典 2 和 3 不同，字典 2 和 3 将字典作为顶级键的值，字典 1 具有顶级值的字典列表。
3. 一些二级值是字符串，一些是列表

第 1 步：修复数据：

功能：

`fix_list_dicts`:

def fix_list_dicts(data: dict) -> dict:
    """
    Given a dict where the values are a list of dicts:
    (1) convert the value to a dict of dicts
    (2) if any second level value is a str, convert it to a list
    """
    data_new = dict()
    for k, v in data.items():
        v_new = dict()
        for x in v:
            for k1, v1 in x.items():
                if type(v1) != list:
                    x[k1] = [v1]
            v_new.update(x)
        data_new[k] = v_new
    return data_new

Run Code Online (Sandbox Code Playgroud)

`add_top_key_as_value`:

def add_top_key_as_value(data: dict, new_key: str) -> dict:
    """
    Given a dict of dicts, where top key is not a 2nd level value:
    (1) add new key: value pair to second level
    """
    for k, v in data.items():
        v.update({new_key: k})
        data[k] = v
    return data

Run Code Online (Sandbox Code Playgroud)

`str_value_to_list`:

def str_value_to_list(data: dict) -> dict:
    """
    Given a dict of dicts:
    (1) Convert any second level value from str to list
    """    
    for k, v in data.items():
        for k2, v2 in v.items():
            if type(v2) != list:
                data[k][k2] = [v2]
    return data

Run Code Online (Sandbox Code Playgroud)

执行：

from pprint import pprint as pp

Run Code Online (Sandbox Code Playgroud)

字典1：

d1 = fix_list_dicts(d1)
pp(d1)

{'chisel': {'depth': [152, 178, 203],
            'residue incorporation': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
            'timing': ['10-nov', '10-apr'],
            'type': ['chisel']},
 'disc': {'depth': [127, 152, 178, 203],
          'residue incorporation': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
          'timing': ['10-nov', '10-apr'],
          'type': ['disc']},
 'no_till': {'depth': [0],
             'residue incorporation': [0.0],
             'timing': ['10-apr'],
             'type': ['user_defined']}}

Run Code Online (Sandbox Code Playgroud)

字典2：

d2 = add_top_key_as_value(d2, 'fertilizer')
d2 = str_value_to_list(d2)

{'nh4_n': {'fertilize_on': ['10-apr'],
           'fertilizer': ['nh4_n'],
           'kg/ha': [110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225]},
 'urea_n': {'fertilize_on': ['10-apr'],
            'fertilizer': ['urea_n'],
            'kg/ha': [110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225]}}

Run Code Online (Sandbox Code Playgroud)

字典3：

d3 = str_value_to_list(d3)

{'maize': {'cultivar': ['B_105', 'B_110'],
           'harvest': ['maize'],
           'planting_dates': ['20-apr', '27-apr', '4-may', '11-may', '18-may', '25-may', '1-jun', '8-jun', '15-jun'],
           'sow_crop': ['maize'],
           'sowing_density': [8],
           'sowing_depth': [51]}}

Run Code Online (Sandbox Code Playgroud)

第 2 步：将数据合并到 DataFrame 中：

功能：

import pandas as pd

Run Code Online (Sandbox Code Playgroud)

`combine_the_data`:

def combine_the_data(data: list) -> dict:
    """
    Given a list of dicts:
    (1) convert each dict into DataFrame
    (2) set the indices to 0
    (3) add each DataFrame to df_dict
    """
    df_dict = dict()
    for i, d in enumerate(data):
        df = pd.DataFrame.from_dict(d, orient='index')
        df.index = [0 for _ in range(len(df))]
        df_dict[f'd_{i}'] = df

    return df_dict

Run Code Online (Sandbox Code Playgroud)

`merge_df_dict`:

def merge_df_dict(data: dict) -> pd.DataFrame:
    """
    Given a dict of DataFrames
    (1) merge them on the index
    """
    df = pd.DataFrame()
    for _, v in data.items():
        df = df.merge(v, how='outer', left_index=True, right_index=True)
    return df

Run Code Online (Sandbox Code Playgroud)

执行：

data = [d1, d2, d3]
df_dict = combine_the_data(data)

df_dict['d_0']

Run Code Online (Sandbox Code Playgroud)

df_dict['d_1']

Run Code Online (Sandbox Code Playgroud)

df_dict['d_2']

Run Code Online (Sandbox Code Playgroud)

df = merge_df_dict(df_dict)

Run Code Online (Sandbox Code Playgroud)

步骤 3：使用EXPLODE`pd.DataFrame.explode`所有列表：

我不知道 v0.25 还包含哪些其他新功能pandas，但爆炸是其中最好的。

没有pandasv0.25 吗？然后得到它！

df.reset_index(drop=True, inplace=True) # the DataFrame must have a unique 0...x index for col in df.columns: df = df.explode(col).reset_index(drop=True)
Run Code Online (Sandbox Code Playgroud)
所有组合的最终输出：

价值计数和期望：

鉴于：

len(kg/ha) = 24

len(cultivar) = 2

len(plantint_dates) = 9

行数user_defined= 2

总组合user_defined= 864

我没有手动计算其他两个types，但由于user_defined具有正确的组合数量，我希望其他人也能这样做。

df.type.value_counts() disc 48384 chisel 36288 user_defined 864 Name: type, dtype: int64
Run Code Online (Sandbox Code Playgroud)

这是一个需要解决的有趣问题。像大多数数据科学问题一样，我归根结底是将数据重塑为可用的格式。我们不应该发感谢信，但是谢谢。 (3认同)

乐意效劳。两个重要的部分是将索引设置为 0 以进行合并，然后在爆炸之前重置索引。 (2认同)

归档时间：	6 年，6 月前
查看次数：	82 次
最近记录：	6 年，6 月前

保存到DataFrame的每种产品/嵌套词典的组合

问题：

第 1 步：修复数据：

功能：

fix_list_dicts:

add_top_key_as_value:

str_value_to_list:

执行：

字典1：

字典2：

字典3：

第 2 步：将数据合并到 DataFrame 中：

功能：

combine_the_data:

merge_df_dict:

执行：

步骤 3：使用EXPLODEpd.DataFrame.explode所有列表：

所有组合的最终输出：

价值计数和期望：

`fix_list_dicts`:

`add_top_key_as_value`:

`str_value_to_list`:

`combine_the_data`:

`merge_df_dict`:

步骤 3：使用EXPLODE`pd.DataFrame.explode`所有列表：