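Say I have the following DataFrame with raw input data and want to process it with a chain of pandas functions (a "pipeline"). In particular, I want to rename and drop columns, and add a new column derived from another column.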
Gene stable ID Gene name Gene type miRBase accession miRBase ID
0 ENSG00000274494 MIR6832 miRNA MI0022677 hsa-mir-6832
1 ENSG00000283386 MIR4659B miRNA MI0017291 hsa-mir-4659b
2 ENSG00000221456 MIR1202 miRNA MI0006334 hsa-mir-1202
3 ENSG00000199102 MIR302C miRNA MI0000773 hsa-mir-302c
Currently I do the following (which works):
tmp_df = df.\
drop("Gene type", axis=1).\
rename(columns = {
"Gene stable ID": "ENSG",
"Gene name": "gene_name",
"miRBase accession": "MI",
"miRBase ID": "mirna_name"
})
result = tmp_df.assign(species = tmp_df.mirna_name.str[:3])
Result:
ENSG gene_name MI mirna_name species
0 ENSG00000274494 MIR6832 MI0022677 hsa-mir-6832 hsa
1 ENSG00000283386 MIR4659B MI0017291 hsa-mir-4659b hsa
2 ENSG00000221456 MIR1202 MI0006334 hsa-mir-1202 hsa
3 ENSG00000199102 MIR302C MI0000773 hsa-mir-302c hsa
Is it possible to put the assign call directly into the "pipeline"? Having to create an extra temporary variable feels cumbersome. I don't know how to refer to the renamed column ('mirna_name') in that case.
You can use pipe:
tmp_df = df.\
drop("Gene type", axis=1).\
rename(columns = {
"Gene stable ID": "ENSG",
"Gene name": "gene_name",
"miRBase accession": "MI",
"miRBase ID": "mirna_name"
}).\
pipe(lambda x: x.assign(species = x.mirna_name.str[:3]))
tmp_df
Out[365]:
ENSG gene_name MI mirna_name species
0 ENSG00000274494 MIR6832 MI0022677 hsa-mir-6832 hsa
1 ENSG00000283386 MIR4659B MI0017291 hsa-mir-4659b hsa
2 ENSG00000221456 MIR1202 MI0006334 hsa-mir-1202 hsa
3 ENSG00000199102 MIR302C MI0000773 hsa-mir-302c hsa
Run Code Online (Sandbox Code Playgroud)
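The lambda passed to pipe can also be a named function, since pipe forwards the DataFrame as the first argument and passes any extra keyword arguments along. A minimal sketch, assuming a cut-down version of the data above; the helper name add_prefix_column and its parameters are hypothetical and not part of pandas:

```python
import pandas as pd


def add_prefix_column(frame, source, target, length=3):
    """Return a copy of frame with `target` holding the first `length` characters of `source`.

    Hypothetical helper written for this example only.
    """
    return frame.assign(**{target: frame[source].str[:length]})


# Reduced copy of the question's data (two rows, three columns).
df = pd.DataFrame({
    "Gene stable ID": ["ENSG00000274494", "ENSG00000283386"],
    "Gene type": ["miRNA", "miRNA"],
    "miRBase ID": ["hsa-mir-6832", "hsa-mir-4659b"],
})

result = (
    df.drop("Gene type", axis=1)
      .rename(columns={"Gene stable ID": "ENSG", "miRBase ID": "mirna_name"})
      # pipe calls add_prefix_column(renamed_frame, source=..., target=...)
      .pipe(add_prefix_column, source="mirna_name", target="species")
)
```

Wrapping the chain in parentheses is an alternative to the trailing backslashes and lets comments sit between the steps.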
As @Tom pointed out, in this case it can also be done without pipe:
df.\
drop("Gene type", axis=1).\
rename(columns = {
"Gene stable ID": "ENSG",
"Gene name": "gene_name",
"miRBase accession": "MI",
"miRBase ID": "mirna_name"
}).\
assign(species = lambda x: x.mirna_name.str[:3])
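This works because assign accepts callables: each one is called with the DataFrame that assign is operating on, so the renamed column is already visible. A small self-contained sketch of that behavior; the second row (mmu-mir-1202) and the is_human column are made up for illustration, and referring to species within the same assign call assumes a reasonably recent pandas (0.23 or later):

```python
import pandas as pd

# Toy input; the second value is invented so the two species differ.
df = pd.DataFrame({
    "miRBase ID": ["hsa-mir-6832", "mmu-mir-1202"],
})

result = (
    df.rename(columns={"miRBase ID": "mirna_name"})
      .assign(
          # x is the renamed frame, so "mirna_name" exists here.
          species=lambda x: x["mirna_name"].str[:3],
          # Later keywords can use columns created earlier in the same call.
          is_human=lambda x: x["species"] == "hsa",
      )
)
```

The original df is left unchanged, since rename and assign both return new DataFrames.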