a k*_*a k 168 python dataframe pandas
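I have a dataframe with one column, and I'd like to split it into two columns, one with the header 'fips' and the other 'row'.
My dataframe df looks like this: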
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
I do not know how to use df.row.str[:] to achieve my goal of splitting the row cell. I can use df['fips'] = hello to add a new column and populate it with hello. Any ideas? The result I am after is:
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
Leo*_*ael 319
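For the simple case, the simplest solution is:
df['A'], df['B'] = df['AB'].str.split(' ', n=1).str
(This relies on iterating over the .str accessor, which, as far as I know, pandas 2.0 and later no longer allow; there, use the expand=True form below.)
Or you can create a DataFrame with one column for each entry of the split with:
df['AB'].str.split(' ', n=1, expand=True)
Notice how, in either case, the .tolist() method is not necessary. Neither is zip().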
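Applied to the data from the question, the expand=True form might look like the sketch below. The variable name out is only for illustration, and the sample is shortened to three rows.

```python
import pandas as pd

# Sample data copied from the question (first three rows only).
df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
                           '01001 Autauga County, AL']})

# Split on the first space only (n=1) so multi-word names stay together.
out = df['row'].str.split(' ', n=1, expand=True)
out.columns = ['fips', 'row']
print(out)
```

Passing n as a keyword keeps the call valid on pandas versions that no longer accept it positionally.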
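Andy Hayden's solution is excellent in demonstrating the power of the str.extract() method.
But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough [1]. It operates on a column (Series) of strings, and returns a column (Series) of lists: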
>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df
AB
0 A1-B1
1 A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df
AB AB_split
0 A1-B1 [A1, B1]
1 A2-B2 [A2, B2]
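[1]: If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.
But how do you go from a column containing two-element lists to two columns, each containing the respective element of the lists?
Well, we need to take a closer look at the .str attribute of a column.
It's a magical object that collects methods that treat each element in a column as a string, and then applies the respective method to each element as efficiently as possible: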
>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df
U
0 A
1 B
2 C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df
U L
0 A a
1 B b
2 C c
But it also has an "indexing" interface for getting each element of a string by its index:
>>> df['AB'].str[0]
0 A
1 A
Name: AB, dtype: object
>>> df['AB'].str[1]
0 1
1 2
Name: AB, dtype: object
Of course, this indexing interface of .str doesn't really care whether each element it's indexing is actually a string, as long as it can be indexed, so:
>>> df['AB'].str.split('-', n=1).str[0]
0 A1
1 A2
Name: AB, dtype: object
>>> df['AB'].str.split('-', n=1).str[1]
0 B1
1 B2
Name: AB, dtype: object
Then it's a simple matter of taking advantage of Python's tuple unpacking of iterables:
>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
>>> df
AB AB_split A B
0 A1-B1 [A1, B1] A1 B1
1 A2-B2 [A2, B2] A2 B2
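The tuple unpacking depends on being able to iterate over the .str accessor, which I believe newer pandas releases no longer support. A minimal sketch that builds the same two columns using only the indexing interface described earlier (the name parts is mine, not from the answer):

```python
import pandas as pd

df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})

# A Series of two-element lists, like the AB_split column.
parts = df['AB'].str.split('-', n=1)

# .str[i] picks element i out of each list.
df['A'] = parts.str[0]
df['B'] = parts.str[1]
print(df)
```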
Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:
>>> df['AB'].str.split('-', n=1, expand=True)
0 1
0 A1 B1
1 A2 B2
So, another way of accomplishing what we wanted is:
>>> df = df[['AB']]
>>> df
AB
0 A1-B1
1 A2-B2
>>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0:'A', 1:'B'}))
AB A B
0 A1-B1 A1 B1
1 A2-B2 A2 B2
roo*_*oot 118
There might be a better way, but here's one approach:
In [34]: import pandas as pd
In [35]: df
Out[35]:
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
In [36]: df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                           columns=['fips', 'row'])
In [37]: df
Out[37]:
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
And*_*den 53
You can extract the different parts quite neatly using a regex pattern:
In [11]: df.row.str.extract(r'(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]:
fips 1 state county state_code
0 00000 UNITED STATES UNITED STATES NaN NaN
1 01000 ALABAMA ALABAMA NaN NaN
2 01001 Autauga County, AL NaN Autauga County AL
3 01003 Baldwin County, AL NaN Baldwin County AL
4 01005 Barbour County, AL NaN Barbour County AL
[5 rows x 5 columns]
To explain the somewhat long regex:
(?P<fips>\d{5})
This matches the five digits (\d) and names them "fips". The next part:
((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
does either (|) one of two things:
(?P<state>[A-Z ]*$)
which matches any number (*) of capital letters or spaces ([A-Z ]) before the end of the string ($) and names this "state", or
(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
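which matches anything else (.*), then a comma and a space, then the two-letter state_code before the end of the string ($). In the example:
Note that the first two rows hit the "state" branch (leaving NaN in the county and state_code columns), whilst the last three hit the county, state_code branch (leaving NaN in the state column).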
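If only the two columns from the question are needed, a much shorter pattern may be enough. The following is a sketch under that assumption; the group name name and the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
                           '01001 Autauga County, AL']})

# Two named groups: five digits, then everything after the whitespace.
# Named groups become the column names of the result.
pattern = r'(?P<fips>\d{5})\s+(?P<name>.*)'
extracted = df['row'].str.extract(pattern)
print(extracted)
```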
Bha*_*era 34
df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)
keb*_*ein 21
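If you don't want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:
# n=1 splits on the first whitespace only, so every row yields exactly two parts
df["fips"], df["row_name"] = zip(*df["row"].str.split(n=1).tolist())
del df["row"]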
jez*_*ael 20
You can use str.split by whitespace (the default separator) with the parameter expand=True to get a DataFrame, and assign it to new columns:
df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL']})
print (df)
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
row a b
0 00000 UNITED STATES 00000 UNITED STATES
1 01000 ALABAMA 01000 ALABAMA
2 01001 Autauga County, AL 01001 Autauga County, AL
3 01003 Baldwin County, AL 01003 Baldwin County, AL
4 01005 Barbour County, AL 01005 Barbour County, AL
A modification, if you need to remove the original column, uses DataFrame.pop:
df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
a b
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
Which is the same as:
df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)
a b
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
If you get an error:
#remove n=1 for split by all whitespaces
df[['a','b']] = df['row'].str.split(expand=True)
ValueError: Columns must be same length as key
you can check the result and see that it returns a 4-column DataFrame, not only 2:
print (df['row'].str.split(expand=True))
0 1 2 3
0 00000 UNITED STATES None
1 01000 ALABAMA None None
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
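One way to avoid the length mismatch is to check how many columns the split produced before assigning. A small sketch on a shortened copy of the data (the names all_parts and two_parts are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
                           '01001 Autauga County, AL']})

# Without n, every run of whitespace is a split point.
all_parts = df['row'].str.split(expand=True)
print(all_parts.shape[1])  # number of columns produced

# With n=1, only the first run of whitespace is used.
two_parts = df['row'].str.split(n=1, expand=True)
print(two_parts.shape[1])

# Assign only when the shape matches the two target columns.
if two_parts.shape[1] == 2:
    df[['a', 'b']] = two_parts
```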
Then a solution is to append the new DataFrame with join:
df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL'],
'a':range(5)})
print (df)
a row
0 0 00000 UNITED STATES
1 1 01000 ALABAMA
2 2 01001 Autauga County, AL
3 3 01003 Baldwin County, AL
4 4 01005 Barbour County, AL
df = df.join(df['row'].str.split(expand=True))
print (df)
a row 0 1 2 3
0 0 00000 UNITED STATES 00000 UNITED STATES None
1 1 01000 ALABAMA 01000 ALABAMA None None
2 2 01001 Autauga County, AL 01001 Autauga County, AL
3 3 01003 Baldwin County, AL 01003 Baldwin County, AL
4 4 01005 Barbour County, AL 01005 Barbour County, AL
With removal of the original column (if there are also other columns):
df = df.join(df.pop('row').str.split(expand=True))
print (df)
a 0 1 2 3
0 0 00000 UNITED STATES None
1 1 01000 ALABAMA None None
2 2 01001 Autauga County, AL
3 3 01003 Baldwin County, AL
4 4 01005 Barbour County, AL
wea*_*ing 14
Use df.assign to create a new df. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
split = df_selected['name'].str.split(',', n=1, expand=True)
df_split = df_selected.assign(first_name=split[0], last_name=split[1])
df_split.drop(columns='name', inplace=True)
Or in method chain form:
df_split = (df_selected
.assign(list_col=lambda df: df['name'].str.split(',', n=1, expand=False),
first_name=lambda df: df.list_col.str[0],
last_name=lambda df: df.list_col.str[1])
.drop(columns=['list_col']))
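The answer does not show df_selected, so here is a self-contained sketch with made-up sample names to show that assign returns a new frame and leaves the original alone:

```python
import pandas as pd

# Hypothetical sample data; the answer's df_selected is not shown.
df_selected = pd.DataFrame({'name': ['Ada,Lovelace', 'Alan,Turing']})

split = df_selected['name'].str.split(',', n=1, expand=True)
df_split = (df_selected
            .assign(first_name=split[0], last_name=split[1])
            .drop(columns=['name']))
print(df_split)
```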
If you want to split a string into more than two columns based on a delimiter, you can omit the "maximum splits" parameter.
You can use:
df['column_name'].str.split('/', expand=True)
This will automatically create as many columns as the maximum number of fields found in any of your initial strings.
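A quick sketch of that behaviour, using made-up strings with one, two, and three fields. Shorter rows are padded with missing values:

```python
import pandas as pd

# Hypothetical strings with different numbers of '/'-separated fields.
s = pd.Series(['a/b/c', 'd/e', 'f'])

wide = s.str.split('/', expand=True)
print(wide)

# Count the padded (missing) cells.
print(wide.isna().sum().sum())
```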
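Surprised I haven't seen this one yet. If you only need two splits, I highly recommend Series.str.partition.
partition performs one split on the separator, and is generally quite performant.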
df['row'].str.partition(' ')[[0, 2]]
0 2
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
If you need to rename the columns,
df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
If you need to join this back to the original, use join or concat:
df.join(df['row'].str.partition(' ')[[0, 2]])
pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)
row 0 2
0 00000 UNITED STATES 00000 UNITED STATES
1 01000 ALABAMA 01000 ALABAMA
2 01001 Autauga County, AL 01001 Autauga County, AL
3 01003 Baldwin County, AL 01003 Baldwin County, AL
4 01005 Barbour County, AL 01005 Barbour County, AL
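The reason for selecting columns 0 and 2 is that partition also returns the separator itself as column 1. A short sketch that writes the two useful pieces straight into named columns (the column name name is my own choice):

```python
import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01001 Autauga County, AL']})

# partition returns three columns: the text before the separator (0),
# the separator itself (1), and everything after it (2).
parts = df['row'].str.partition(' ')

df['fips'] = parts[0]
df['name'] = parts[2]
print(df)
```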