pandas: Split a string column into multiple columns and name the columns dynamically

Question

mee*_*ram 2 python split dataframe pandas

My question is similar to this one and this one, but I can't get their solutions to work for my problem.

I have a dataframe that looks like this:

    study_id    fuzzy_market
0   study1  [Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]
1   study2  [Country: Germany], [Management experience: Yes]
2   study3  [Country: United Kingdom], [Language: English]
3   study4  [Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]
4   study5  [Age: 48-99]

I would like it to look like this:

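study_id  Age    Country of Birth  Country           Language  Management experience
study1    18-67  Austria, Germany  Austria, Germany  German    None
study2    None   None              Germany           None      Yes
study3    None   None              United Kingdom    English   None
study4    18-67  Austria, Germany  Austria, Germany  German    None
study5    48-99  None              None              None      None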

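So there should be one row per study_id, with the text before each colon in the fuzzy_market column as the column header and the text after each colon as the cell value. If a column has no data for a given row, I would like to fill it with None. All columns can be strings. I don't know in advance how many columns there will be, so the solution needs to be dynamic.

Here is the setup and the data: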

import pandas as pd
import numpy as np
import re

np.random.seed(12345)

df = pd.DataFrame.from_dict({'study_id': {0: 'study1',
  1: 'study2',
  2: 'study3',
  3: 'study4',
  4: 'study5'},
 'fuzzy_market': {0: '[Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]',
  1: '[Country: Germany], [Management experience: Yes]',
  2: '[Country: United Kingdom], [Language: English]',
  3: '[Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]',
  4: '[Age: 48-99]'}})

So far I have tried manipulating the strings in the fuzzy_market column, but I don't think this is the right approach.

# a function to strip the square brackets, as I'm not sure this is really a list in here
def remove_square_brackets(x):
    return re.sub(r"[\[\]]", "", x)

# make a new dataframe where there are new columns for data after every comma
df2 = df.join(df['fuzzy_market'].apply(remove_square_brackets).str.split(',', expand=True))

# rename the columns arbitrarily - these will need to be the question titles eventually e.g. Age rather than A, Country of Birth rather than B etc.
df2.columns = ('study_id', 'fuzzy_market', 'A', 'B', 'C', 'D', 'E', 'F')

# try and split again
df3 = df2[['study_id','A', 'B']].join(df2['A'].str.split(":", expand=True).rename(columns={0:'A1', 1:'A2'})).join(df2['B'].str.split(":", expand=True).rename(columns={0:'B1', 1:'B2'}))

# this isn't quite there yet
df3

    study_id    A   B   A1  A2  B1  B2
0   study1  Age: 18-67  Country of Birth: Austria   Age 18-67   Country of Birth    Austria
1   study2  Country: Germany    Management experience: Yes  Country Germany Management experience   Yes
2   study3  Country: United Kingdom Language: English   Country United Kingdom  Language    English
3   study4  Age: 18-67  Country of Birth: Austria   Age 18-67   Country of Birth    Austria
4   study5  Age: 48-99  None    Age 48-99   None    None
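One reason the comma split above falls short is that values such as "Austria, Germany" contain commas themselves, so a single key-value pair gets spread across two columns. The following is a minimal sketch on one sample string, assuming every pair has the form `[key: value]` and pairs are separated by `], [`. The helper name `split_pairs` is hypothetical and used only for illustration.

```python
def split_pairs(cell):
    """Turn '[k: v], [k: v]' into a dict, keeping commas inside values."""
    # Drop the outer brackets, then split on the '], [' separator between pairs.
    chunks = cell.strip("[]").split("], [")
    # Split each chunk on the first ': ' only (assumes every chunk contains one).
    return dict(chunk.split(": ", 1) for chunk in chunks)


sample = "[Age: 18-67], [Country of Birth: Austria, Germany], [Language: German]"

# Naive approach from the attempt above: remove brackets, split on every comma.
naive = sample.replace("[", "").replace("]", "").split(",")
print(len(naive))  # 4 pieces for only 3 pairs

# Bracket-aware split keeps 'Austria, Germany' together as one value.
pairs = split_pairs(sample)
print(pairs)
```

Splitting on the bracket separator rather than on bare commas keeps each value intact, and the resulting dict keys can serve directly as column names.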

Thanks for any help or tips!

Answer 1

Shu*_*rma 5

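We can use findall to extract all matching key-value pairs from each row, then map each list of pairs to a dict and build a dataframe from the result.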

p = df['fuzzy_market'].str.findall(r'([^:\[]+): ([^\]]+)')
df[['study_id']].join(pd.DataFrame(map(dict, p)))

  study_id    Age  Country of Birth           Country Language Management experience
0   study1  18-67  Austria, Germany  Austria, Germany   German                   NaN
1   study2    NaN               NaN           Germany      NaN                   Yes
2   study3    NaN               NaN    United Kingdom  English                   NaN
3   study4  18-67  Austria, Germany  Austria, Germany   German                   NaN
4   study5  48-99               NaN               NaN      NaN                   NaN
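The result above marks missing values with NaN, whereas the question asked for None. As a hedged sketch, the snippet below first shows what the pattern returns for a single row, then builds the wide frame and swaps NaN for None. It uses a shortened two-row version of the data, and the names `pattern`, `wide` and `result` are introduced here for illustration only.

```python
import re

import pandas as pd

# Shortened two-row stand-in for the dataframe in the question.
df = pd.DataFrame({
    "study_id": ["study1", "study2"],
    "fuzzy_market": [
        "[Age: 18-67], [Country: Austria, Germany]",
        "[Country: Germany], [Management experience: Yes]",
    ],
})

# Same pattern as above: the key is the text before the colon (no '[' or ':'),
# the value is everything up to the closing ']'.
pattern = r"([^:\[]+): ([^\]]+)"

# For one row, findall returns a list of (key, value) tuples.
first_row_pairs = re.findall(pattern, df.loc[0, "fuzzy_market"])
print(first_row_pairs)  # [('Age', '18-67'), ('Country', 'Austria, Germany')]

# One dict per row; keys missing from a row become NaN in the frame.
pairs = df["fuzzy_market"].str.findall(pattern)
wide = pd.DataFrame([dict(p) for p in pairs], index=df.index)
result = df[["study_id"]].join(wide)

# Cast to object so that None is stored as-is, then replace missing cells.
result = result.astype(object).where(result.notna(), None)
print(result)
```

Whether None or NaN is preferable depends on what happens downstream: most pandas methods such as `isna` treat both as missing, so the conversion mainly matters when the values are exported or compared directly.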
