I am trying to compare two pandas dataframes, but I get an error because the 'DataFrame' object has no attribute 'withColumn'. What could be the problem?
import pandas as pd
import pyspark.sql.functions as F
pd_df=pd.DataFrame(df.dtypes,columns=['column','data_type'])
pd_df1=pd.DataFrame(df1.dtypes,columns=['column','data_type'])
pd.merge(pd_df,pd_df1, on='column', how='outer'
).withColumn(
"result",
F.when(F.col("data_type_x") == 'NaN','new attribute'.otherwise('old attribute')))
.select(
"column",
"data_type_x",
"data_type_y",
"result"
)
df and df1 are some data frames.
What I have:
What I want to do:
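On the error itself: withColumn, F.when and F.col belong to the PySpark DataFrame API, while pd.merge returns a pandas DataFrame, which has no such method. A minimal pandas-only sketch of the same comparison follows. The two dtype lists are hypothetical stand-ins for df.dtypes and df1.dtypes, and it assumes a missing data_type_x (NaN after the outer merge) is what should be flagged as a new attribute.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df.dtypes and df1.dtypes: lists of (column, data_type) pairs
dtypes_a = [("id", "int"), ("name", "string")]
dtypes_b = [("id", "int"), ("name", "string"), ("age", "int")]

pd_df = pd.DataFrame(dtypes_a, columns=["column", "data_type"])
pd_df1 = pd.DataFrame(dtypes_b, columns=["column", "data_type"])

merged = pd.merge(pd_df, pd_df1, on="column", how="outer")

# Missing values are real NaN, not the string 'NaN', so test with isna()
merged["result"] = np.where(
    merged["data_type_x"].isna(), "new attribute", "old attribute"
)
result = merged[["column", "data_type_x", "data_type_y", "result"]]
```

Plain column selection with a list of names plays the role of Spark's select() here.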
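I already have the following code, which works correctly. However, profiling shows that it is one of the major bottlenecks in my program, so I would like to optimize it as much as possible, and I have reason to believe that should be possible: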
df["NewColumn1"] = df.apply(lambda row: compute_new_column1_value(row), axis=1)
df["NewColumn2"] = df.apply(lambda row: compute_new_column2_value(row), axis=1)
# a few more lines of code like the above
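I based this approach on this answer to a question like this one (which is similar to mine, but specifically about adding one new column, whereas my question is about adding many new columns). I suppose each of these df.apply() calls is implemented internally as a loop over all rows, and I suspect it should be possible to optimize this with a solution that loops over all rows only once (rather than once per column to be added).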
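The single-pass idea can be sketched with one df.apply() call whose function returns all new values for a row at once. The two compute functions and the sample data below are hypothetical placeholders, and whether this is actually faster depends on how expensive the per-row functions are, since the loop over rows still runs in Python.

```python
import pandas as pd

df = pd.DataFrame({"SomeExistingColumn": [1, -2, 3]})

# Hypothetical stand-ins for the real compute_new_columnX_value() functions
def compute_new_column1_value(row):
    return row["SomeExistingColumn"] * 2

def compute_new_column2_value(row):
    return "pos" if row["SomeExistingColumn"] > 0 else "neg"

def compute_all(row):
    # One call per row returns every new value as a tuple
    return compute_new_column1_value(row), compute_new_column2_value(row)

# result_type="expand" turns each returned tuple into separate columns
new_cols = df.apply(compute_all, axis=1, result_type="expand")
new_cols.columns = ["NewColumn1", "NewColumn2"]
df = df.join(new_cols)
```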
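In other answers I have seen references to the assign() function, which does support adding multiple columns at once. I tried using it in the following way: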
# WARNING: this does NOT work
df = df.assign(
    NewColumn1=lambda row: compute_new_column1_value(row),
    NewColumn2=lambda row: compute_new_column2_value(row),
    # more lines like the two above
)
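This does not work because the lambdas do not receive the rows of the dataframe at all; they appear to get the entire dataframe at once. Each lambda is then expected to return a complete column / Series / array of values in one go. So my problem is that I would end up implementing a manual loop over all rows inside each of those lambdas, which would obviously be even worse for performance.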
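This behaviour can be checked with a small sketch: a callable passed to assign() is called once with the whole DataFrame and must return a full column. The column name A and the helper record_and_double are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
seen_types = []

def record_and_double(frame):
    # Record what assign() actually passes in, then return a full column
    seen_types.append(type(frame).__name__)
    return frame["A"] * 2

df = df.assign(NewColumn1=record_and_double)
```

After this runs, seen_types holds a single entry, "DataFrame", confirming that the callable never sees individual rows.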
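I can think of two solutions conceptually, but so far I have been unable to find out how to actually implement them:
Something like df.assign() (which supports adding multiple columns at once), but able to pass rows into the lambdas instead of the complete dataframe
A way to vectorize my compute_new_columnX_value() functions, so that they can be used as lambdas in the way df.assign() expects.
My problem with the second solution so far is that the row-based versions of some of my functions look like the following, and I have difficulty working out how to vectorize them properly: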
def compute_new_column1_value(row):
    if row["SomeExistingColumn"] …

I have a big pandas DataFrame with fictional person data. Below is a small example; each person is identified by a number.
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Number':["5569", "3385", "9832", "6457", "5346", "5462", "9873", "2366"] , 'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female'], 'Children': [np.nan, "5569 6457", "5569", np.nan, "6457", "2366", "2366", np.nan]})
df
Number Gender Children
0 5569 Male NaN
1 3385 Male 5569 6457
2 9832 Female 5569
3 …

I have a data frame and a big function like the one below. I want to apply the norm_group function to the data frame columns, but it takes too much time with apply. Is there any way to reduce the running time of this code? Currently it takes 24.4 s per loop.
import pandas as pd
import numpy as np
np.random.seed(1234)
n = 1500000
df = pd.DataFrame()
df['group'] = np.random.randint(1700, size=n)
df['ID'] = np.random.randint(5, size=n)
df['s_count'] = np.random.randint(5, size=n)
df['p_count'] = np.random.randint(5, size=n)
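df['d_count'] …

I am using Pandas and trying to create a new column using a Python if-else statement (also known as the ternary conditional operator) in order to avoid division by zero.
For example, below I want to create a new column C by dividing A by B. I want to use an if-else statement to avoid dividing by 0.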
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 5, size=(100, 2)), columns=list('AB'))
df.head()
# A B
# 0 1 3
# 1 1 2
# 2 0 0
# 3 2 1
# 4 4 2
df['C'] = (df.A / df.B) if df.B > 0.0 else 0.0
However, I get an error from the last line:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
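I searched StackOverflow and found other posts about this error, but none of them involve this type of if-else statement. Some of those posts include:
Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
Any help would be appreciated.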
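For context, a Python if-else expression needs a single True or False, whereas df.B > 0.0 is a whole Series of booleans, which is what triggers the ValueError. One element-wise alternative is sketched below with Series.where, on a small hand-written frame rather than the random one above. It relies on pandas returning inf or NaN for division by zero instead of raising.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 0, 2, 4], "B": [3, 2, 0, 1, 2]})

# Element-wise division; rows with B == 0 produce inf or NaN, not an exception
ratio = df["A"] / df["B"]

# Keep the ratio where B > 0, otherwise fall back to 0.0
df["C"] = ratio.where(df["B"] > 0, 0.0)
```

numpy.where(df["B"] > 0, df["A"] / df["B"], 0.0) would be an equivalent spelling of the same idea.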
I have a dataframe, say df. df has a column 'Age'.
>>> df['Age']
I want to group these ages and create a new column like this:
If age >= 0 & age < 2 then AgeGroup = Infant
If age >= 2 & age < 4 then AgeGroup = Toddler
If age >= 4 & age < 13 then AgeGroup = Kid
If age >= 13 & age < 20 then AgeGroup = Teen
and so on .....
How can I achieve this using the Pandas library?
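One common way to express such range rules is pandas.cut. The sketch below covers only the four groups spelled out above; the sample ages are made up, and any further groups hidden behind "and so on" would need extra bin edges and labels.

```python
import pandas as pd

df = pd.DataFrame({"Age": [0, 1, 2, 3, 4, 12, 13, 19]})

# Bin edges and labels taken from the rules above; right=False makes each
# bin include its lower edge and exclude its upper edge, e.g. [0, 2)
bins = [0, 2, 4, 13, 20]
labels = ["Infant", "Toddler", "Kid", "Teen"]

df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)
```

Ages outside the listed bins (here, 20 and above) come out as NaN until more edges are added.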
I tried to do something like this:
X_train_data['AgeGroup'][ X_train_data.Age < 13 ] = 'Kid'
X_train_data['AgeGroup'][ X_train_data.Age < 3 ] = 'Toddler'
X_train_data['AgeGroup'][ X_train_data.Age < …

I am trying to convert a long list of RGB values in a dataframe to hex so that I can build some charts. I have managed to find the right code to do the conversion; it is just applying it that is tripping me up.
df = pd.DataFrame({'R':[152,186,86], 'G':[112,191,121], 'B':[85,222,180] })
def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)
This is the line that causes the trouble:
df['hex'] = rgb_to_hex(df['R'],df['G'],df['B'])
It produces the following error:
TypeError: %x format: an integer is required, not Series
Any ideas?
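The error arises because rgb_to_hex formats single integers, while df['R'], df['G'] and df['B'] are whole Series. A minimal sketch that calls the function once per row is shown below; it repeats the definitions from the question so that it runs on its own, and the int() casts are a precaution rather than a requirement.

```python
import pandas as pd

df = pd.DataFrame({'R': [152, 186, 86], 'G': [112, 191, 121], 'B': [85, 222, 180]})

def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)

# Call the scalar function once per row instead of once on whole columns
df['hex'] = [
    rgb_to_hex(int(r), int(g), int(b))
    for r, g, b in zip(df['R'], df['G'], df['B'])
]
```

df.apply(lambda row: rgb_to_hex(row['R'], row['G'], row['B']), axis=1) would be another way to get the same column.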
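I have a pandas dataframe (df2) with about 160,000 rows. I am trying to change some of the values in one column (url).
The strings in this column are between 108 and 150 characters long. If a string is not 108 characters long, I want to replace it with the same string with the last 10 characters cut off. If the string is exactly 108 characters long, I want to leave it alone. Note that I am not trying to make every string 108 characters long; I just want to cut the last 10 characters off any string that is not 108 characters long.
Example: len(s) = 114, replace with s[:-10]
I wrote a loop that does this, but it is very slow, probably because it rebuilds the dataframe on every iteration.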
for i in df2.url:
    if len(i) != 108:
        new_i = i[:-10]
        df2 = df2.replace(i, new_i)
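There must be a faster way to do this, but I have not been able to figure out how. I am hoping someone more proficient with pandas can offer some expertise.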
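One possible vectorized approach is sketched here, using a boolean mask and the .str accessor so that the column is updated in a single pass rather than once per row. The two-row df2 below is a synthetic stand-in for the real data.

```python
import pandas as pd

# Synthetic stand-in for df2: one 108-character string and one 114-character string
df2 = pd.DataFrame({"url": ["a" * 108, "b" * 114]})

# True for every row whose string is not exactly 108 characters long
needs_trim = df2["url"].str.len() != 108

# Cut the last 10 characters off only those rows
df2.loc[needs_trim, "url"] = df2.loc[needs_trim, "url"].str[:-10]
```

Unlike df2.replace(i, new_i), this touches only the url column and never rebuilds the whole frame.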
Below is a sample of 200 rows from the column I am trying to change:
['https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301108?gameHash=bde58669fc59c853&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291187?gameHash=f7fcd2d6ca775fb5&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291192?gameHash=005335984c8f8a3a&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301128?gameHash=fcbd2630c0faec49&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301159?gameHash=9a7726176fdabfde&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301169?gameHash=5d816e6d30d2b659&tab=overview',
'https://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301183?gameHash=396641afdcdd99d9&tab=overview',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271494?gameHash=bd51798e1358c47f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130153?gameHash=00a7861ac0a23aef',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271495?gameHash=0d828bbc9aa9996c',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271497?gameHash=bd4810bb801abf24',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130166?gameHash=1cff679b64acb047',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130177?gameHash=1f92cbefd9a965e0',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271500?gameHash=abbdae6c3e7b4006',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1271505?gameHash=7c970a84e132a578',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130182?gameHash=ccb50f6e86e4c3df',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130193?gameHash=0995997660a65721',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301262?gameHash=c594a9a52f46cc50',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130196?gameHash=31553f5bb6ba4420',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301270?gameHash=5b3babb5d392d78d',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130201?gameHash=3d2aa031c17d90ae',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301290?gameHash=31ce80069fdbc873',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130210?gameHash=91c7b22cded939ff',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1301305?gameHash=3f8d664b3b988446',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130221?gameHash=a8580ee66ffbb525',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291406?gameHash=5220923eb35c42c6',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291426?gameHash=83c7c51530ea074e',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291442?gameHash=28f7b485f710168f',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291458?gameHash=49cc14d02ccd0674',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT01/1291470?gameHash=f087c853097c2dd9',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261474?gameHash=e6c01a288de5dc41',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT04/1130229?gameHash=1489421028163983',
'http://matchhistory.na.leagueoflegends.com/en/#match-details/ESPORTSTMNT02/1261475?gameHash=c984e795d6406cd5', …

FYI, performance/speed is not important for this question.
I have an existing pandas dataframe named cost_table...
+----------+---------+------+-------------------------+-----------------+
| material | percent | qty | price_control_indicator | acct_assign_cat |
+----------+---------+------+-------------------------+-----------------+
| abc111 | 1.00 | 50 | v | # |
| abc222 | 0.25 | 2000 | s | # |
| xyz789 | 0.45 | 0 | v | m |
| def456 | 0.9 | 0 | v | # |
| 123xyz | 0.2 | 0 | v | m |
| lmo888 | 0.6 | …