This should be very easy, but I can't get it to work.
I want to filter my dataset on two values.
#this works, when I filter for one value
df.loc[df['channel'] == 'sale']
#if I have to filter, two separate columns, I can do this
df.loc[(df['channel'] == 'sale')&(df['type']=='A')]
#but what if I want to filter one column by more than one value?
df.loc[df['channel'] == ('sale','fullprice')]
Does this have to be an OR statement? Is there something like SQL's IN that I can use?
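
A minimal sketch of one way to filter a single column by several values, using `Series.isin`, which behaves much like SQL's `IN`. The sample rows are made up for illustration; only the column names `channel` and `type` come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "channel": ["sale", "fullprice", "outlet", "sale"],
    "type": ["A", "B", "A", "B"],
})

# isin() keeps rows whose value is any member of the list (like SQL's IN).
wanted = df.loc[df["channel"].isin(["sale", "fullprice"])]

# The same filter written as an explicit OR of two comparisons.
same = df.loc[(df["channel"] == "sale") | (df["channel"] == "fullprice")]
```

Comparing a column to a tuple with `==` does not test membership, which is why the last attempt in the question does not behave as intended.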
I have two dataframes, and both have an Order ID and a date.
I want to add a flag to the first dataframe, df1: if a record with the same order id and date is in dataframe df2, then add a Y:
df1['R'] = np.where(orders['key'].isin(df2['key']), 'Y', 0)
To achieve this, I want to create a key that is the concatenation of order_id and date, but when I try the code below:
df1['key']=df1['Order_ID']+'_'+df1['Date']
I get this error:
ufunc 'add' did not contain a loop with signature matching types dtype('S21') dtype('S21') dtype('S21')
df1 looks like this:
Date | Order_ID | other data points ...
201751 4395674 ...
201762 3487535 ...
These are the data types:
df1.info()
RangeIndex: 157443 entries, 0 to 157442
Data columns (total 6 columns):
Order_ID 157429 non-null object
Date 157443 non-null …
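
A minimal sketch of building the key, assuming the error comes from the two columns not being plain Python strings. Casting both parts with `astype(str)` before concatenating is one common workaround. The rows are hypothetical, shaped like the sample above, and the flag uses 'N' rather than 0 so the column holds one type.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the sample in the question.
df1 = pd.DataFrame({"Date": [201751, 201762], "Order_ID": [4395674, 3487535]})
df2 = pd.DataFrame({"Date": [201751], "Order_ID": [4395674]})

for frame in (df1, df2):
    # Cast both parts to str so "+" concatenates text instead of adding numbers.
    frame["key"] = frame["Order_ID"].astype(str) + "_" + frame["Date"].astype(str)

# Flag rows of df1 whose key also appears in df2.
df1["R"] = np.where(df1["key"].isin(df2["key"]), "Y", "N")
```

Note that `astype(str)` turns missing values into the text "nan", so rows with a missing Order_ID may need separate handling.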
I am trying to run a linear regression per group on a Python pandas dataframe:
This is the dataframe df:
group date value
A 01-02-2016 16
A 01-03-2016 15
A 01-04-2016 14
A 01-05-2016 17
A 01-06-2016 19
A 01-07-2016 20
B 01-02-2016 16
B 01-03-2016 13
B 01-04-2016 13
C 01-02-2016 16
C 01-03-2016 16
#import standard packages
import pandas as pd
import numpy as np
#import ML packages
from sklearn.linear_model import LinearRegression
#First, let's group the data by group
df_group = df.groupby('group')
#Then, we need to change the date to integer
df['date'] = pd.to_datetime(df['date']) …
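
A minimal sketch of a per-group fit on the data above. It swaps in `numpy.polyfit` for `LinearRegression` to keep the example short, and it assumes the dates are month-day-year. The helper `fit_line` and the column `day` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date": ["01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016",
             "01-06-2016", "01-07-2016", "01-02-2016", "01-03-2016",
             "01-04-2016", "01-02-2016", "01-03-2016"],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})

# Assumption: the dates are written month-day-year.
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")
# Numeric predictor: days elapsed since the earliest date in the frame.
df["day"] = (df["date"] - df["date"].min()).dt.days

def fit_line(g):
    # Degree-1 least-squares fit; polyfit returns the slope first.
    slope, intercept = np.polyfit(g["day"], g["value"], deg=1)
    return pd.Series({"slope": slope, "intercept": intercept})

# One row of coefficients per group.
coefs = df.groupby("group")[["day", "value"]].apply(fit_line)
```

The same `day` column could be passed as `X` to a `LinearRegression` fitted inside the helper instead.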
I have fitted my multiple linear regression equation and I want to see the adjusted R-squared. I know the score function lets me see R-squared, but it is not adjusted.
import pandas as pd #import the pandas module
import numpy as np
df = pd.read_csv ('/Users/jeangelj/Documents/training/linexdata.csv', sep=',')
df
AverageNumberofTickets NumberofEmployees ValueofContract Industry
0 1 51 25750 Retail
1 9 68 25000 Services
2 20 67 40000 Services
3 1 124 35000 Retail
4 8 124 25000 Manufacturing
5 30 134 50000 Services
6 20 157 48000 Retail
7 8 190 32000 Retail
8 20 205 70000 Retail
9 50 230 75000 Manufacturing
10 35 265 50000 Manufacturing
11 65 …
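
A minimal sketch of the standard adjustment, computed by hand from an ordinary R-squared. The function name `adjusted_r2` and the example numbers are hypothetical; n is the number of rows and p the number of predictors.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers: R^2 of 0.80 from 12 rows and 2 predictors.
example = adjusted_r2(0.80, n_samples=12, n_features=2)
```

With a fitted scikit-learn model, `r2` would be the value returned by `model.score(X, y)`, and `n_samples, n_features` would come from `X.shape`.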
I have the following dataframe df:
Customer_ID | 2015 | 2016 |2017 | Year_joined_mailing
ABC 5 6 10 2015
BCD 6 7 3 2016
DEF 10 4 5 2017
GHI 8 7 10 2016
I want to look up each customer's value in the year they joined the mailing list and save it in a new column.
The output would be:
Customer_ID | 2015 | 2016 |2017 | Year_joined_mailing | Purchases_1st_year
ABC 5 6 10 2015 5
BCD 6 7 3 2016 7
DEF 10 4 5 2017 5
GHI 8 7 10 2016 7
I have found some vlookup-style matching solutions in Python, but none of them use the headers of other columns.
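
A minimal sketch of a row-wise lookup in which the value of Year_joined_mailing picks the column to read. It assumes the year columns are labelled with strings such as "2015"; if they are integers, the `str()` call would be dropped. The values are taken from the input table above.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": ["ABC", "BCD", "DEF", "GHI"],
    "2015": [5, 6, 10, 8],
    "2016": [6, 7, 4, 7],
    "2017": [10, 3, 5, 10],
    "Year_joined_mailing": [2015, 2016, 2017, 2016],
})

# For each row, use the joining year as the name of the column to read from.
df["Purchases_1st_year"] = df.apply(
    lambda row: row[str(row["Year_joined_mailing"])], axis=1
)
```

Row-wise `apply` is easy to read but slow on large frames, so a vectorized lookup may be preferable there.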
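I have a Python pandas dataframe with several columns, and one column has 0 values. I want to replace the 0 values with the median or mean of that column.
data is my dataframe
artist_hotness is the column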
mean_artist_hotness = data['artist_hotness'].dropna().mean()
if len(data.artist_hotness[ data.artist_hotness.isnull() ]) > 0:
data.artist_hotness.loc[ (data.artist_hotness.isnull()), 'artist_hotness'] = mean_artist_hotness
I tried this, but it did not work.
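
One likely reason the attempt has no effect is that it looks for nulls, while the unwanted values are zeros. A minimal sketch, assuming every 0 should count as missing: convert the zeros to NaN, then fill them with the mean of the remaining values. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column name comes from the question.
data = pd.DataFrame({"artist_hotness": [0.0, 0.4, 0.6, 0.0, 0.8]})

# Turn zeros into NaN so they are excluded from the mean.
hotness = data["artist_hotness"].replace(0, np.nan)

# Fill the former zeros with the mean of the non-zero values.
data["artist_hotness"] = hotness.fillna(hotness.mean())
```

Replacing `hotness.mean()` with `hotness.median()` gives the median variant.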
Since I am creating a dataframe, I don't understand why I am getting an array error.
M2 = df.groupby(['song_id', 'user_id']).rating.mean().unstack()
M2 = np.maximum(-1, (M - 3).fillna(0) / 2.) # scale to -1..+1 (treat "0" scores as "1" scores)
M2.head(2)
AttributeError: 'numpy.ndarray' object has no attribute 'fillna'
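
The message suggests that `M` was already a NumPy array when `.fillna` was called. The snippet assigns the grouped result to `M2` but then operates on `M`, so `M` may be left over from an earlier step. A minimal sketch that stays within pandas, using `clip` in place of `np.maximum`; the ratings are hypothetical.

```python
import pandas as pd

# Hypothetical ratings; only the column names come from the question.
df = pd.DataFrame({
    "song_id": [1, 1, 2, 2],
    "user_id": ["u1", "u2", "u1", "u3"],
    "rating": [5, 1, 3, 0],
})

# Songs as rows, users as columns, NaN where a user has not rated a song.
M = df.groupby(["song_id", "user_id"])["rating"].mean().unstack()

# Centre on 3, fill gaps with 0, halve, then floor at -1; stays a DataFrame.
M2 = ((M - 3).fillna(0) / 2.0).clip(lower=-1)
```

Because every step is a DataFrame method, the result keeps its labels and `M2.head(2)` still works.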
I want to use simple linear regression to predict a value at a future date, but I can't because of the date format.
This is my dataframe:
data_df =
date value
2016-01-15 1555
2016-01-16 1678
2016-01-17 1789
...
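from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
y = np.asarray(data_df['value'])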
X = data_df[['date']]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, random_state=42)
model = LinearRegression() #create linear regression object
model.fit(X_train, y_train) #train model on train data
model.score(X_train, y_train) #check score
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
coefs = zip(model.coef_, X.columns)
model.__dict__
print("sl = %.1f + " % model.intercept_ +
      " + ".join("%.1f %s" % coef for coef in coefs))  # linear model
I tried to convert the date, without success …
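
A minimal sketch of one way around the date format: turn each date into a number of days since the first observation, fit on that, and convert the future date the same way. It uses `numpy.polyfit` instead of `LinearRegression` for brevity. The three rows are the ones shown above, and the target date 2016-01-18 is a hypothetical choice.

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame({
    "date": ["2016-01-15", "2016-01-16", "2016-01-17"],
    "value": [1555, 1678, 1789],
})

# Parse the strings, then express each date as days since the first one.
data_df["date"] = pd.to_datetime(data_df["date"])
start = data_df["date"].min()
data_df["days"] = (data_df["date"] - start).dt.days

# Degree-1 least-squares fit; polyfit returns the slope first.
slope, intercept = np.polyfit(data_df["days"], data_df["value"], deg=1)

# Hypothetical future date, converted with the same offset.
future = pd.Timestamp("2016-01-18")
prediction = intercept + slope * (future - start).days
```

The numeric `days` column can also be used as `X = data_df[['days']]` with the scikit-learn model from the question.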
I am trying to merge two dataframes, df1 and df2, on the column Customer_ID. Customer_ID appears to have the same data type (object) in both.
df1:
Customer_ID | Flag
12345 A
df2:
Customer_ID | Transaction_Value
12345 258478
When I merge the two tables:
new_df = df2.merge(df1, on='Customer_ID', how='left')
For some Customer_IDs it works, and for others it does not. For this example, I get the following result:
Customer_ID | Transaction_Value | Flag
12345 258478 NaN
I checked the data types, and they are the same:
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 873353 entries, 0 to 873352
Data columns (total 2 columns):
Customer_ID 873353 non-null object
Flag 873353 non-null object
dtypes: object(2)
memory usage: 20.0+ MB
df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 873353 entries, 0 to 873352
Data columns (total 2 columns): …
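
An `object` column can hold a mix of Python types, so two columns that both report `object` may still fail to match, for example an integer 12345 against the text "12345" or a value with stray spaces. A minimal sketch, assuming that is the cause here: inspect the element types, normalize both keys to stripped strings, and merge with `indicator=True` to see which rows matched. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical frames: both key columns are dtype object, but one holds ints
# and the other holds strings (one with a leading space).
df1 = pd.DataFrame({
    "Customer_ID": pd.Series([12345, 67890], dtype=object),
    "Flag": ["A", "B"],
})
df2 = pd.DataFrame({
    "Customer_ID": pd.Series(["12345", " 67890"], dtype=object),
    "Transaction_Value": [258478, 1000],
})

# Inspect the Python types hiding behind the object dtype.
kinds = {type(v).__name__ for v in pd.concat([df1["Customer_ID"], df2["Customer_ID"]])}

for frame in (df1, df2):
    # Normalize the key: cast to text and drop surrounding whitespace.
    frame["Customer_ID"] = frame["Customer_ID"].astype(str).str.strip()

# indicator=True adds a _merge column showing whether each row found a match.
new_df = df2.merge(df1, on="Customer_ID", how="left", indicator=True)
```

Rows marked `left_only` in `_merge` are the ones that still found no partner in df1.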
I want to add a flag to my Python pandas dataframe df: if the entry in the column Title contains the word test (uppercase, lowercase, or all caps), I want to add T in a new column test.
This gives me an error, and it does not account for the all-caps case:
df['Test_Flag'] = np.where(df[df['Title'].str.contains("test|Test")==True], 'T', '')
ValueError: Length of values does not match length of index
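
The length error arises because `np.where` is given `df[mask]`, a filtered dataframe, rather than the boolean mask itself. A minimal sketch that passes the mask directly and uses `case=False` so that test, Test, and TEST all match. The titles are hypothetical, and `na=False` treats missing titles as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical titles; only the column name comes from the question.
df = pd.DataFrame({"Title": ["Unit test", "TEST run", "Release notes", None]})

# Case-insensitive match; missing titles count as no match.
has_test = df["Title"].str.contains("test", case=False, na=False)

# Pass the boolean mask itself, so the result has one entry per row.
df["Test_Flag"] = np.where(has_test, "T", "")
```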