Posts by Mar*_*cel

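Renaming character values in a data frame column - R

I have a data frame with a column named ProjectSubject. The data frame is about 1,000,000 rows long.

The ProjectSubject column contains many different strings. Here is an example: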

> unique(unlist(projectdf$ProjectSubject))

 [1] "Applied Learning"                       "Applied Learning, Literacy & Language"
 [3] "Literacy & Language"                    "Special Needs"
 [5] "Literacy & Language, History & Civics"  "Math & Science"
 [7] "History & Civics, Math & Science"       "Literacy & Language, Special Needs"
 [9] "Applied Learning, Special Needs"        "Health & Sports, Special Needs"
[11] "Math & Science, Literacy & Language"    "Literacy & Language, Math & Science"
[13] "Literacy & Language, Music & The Arts"  "Math & Science, Special …

r rename character dataframe

5
votes
1
answer
10k
views

Sklearn's SimpleImputer doesn't work in a pipeline?

I have a Pandas data frame that has some NaN values in a particular column:

1291   NaN
1841   NaN
2049   NaN
Name: some column, dtype: float64

I have built the following pipeline to handle it:

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression  # import missing from the original snippet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True)
imputer = SimpleImputer(strategy='median')
logistic = LogisticRegression()

# Impute missing values first, then scale, then fit the classifier.
pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler),
                 ('logistic', logistic)])

Now, when I pass this pipeline to a RandomizedSearchCV, I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

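It is actually much longer than that; I can post the whole error in an edit if needed. In any case, I'm fairly sure this column is the only one that contains NaN. Also, if I switch from SimpleImputer to the (now deprecated) Imputer in the pipeline, the pipeline works fine in my RandomizedSearchCV. I checked the documentation, but it seems that SimpleImputer …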
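One way to narrow this down is to run the same three-step pipeline through RandomizedSearchCV on small synthetic data and to count non-finite values per column, since the error message covers infinity and overly large values as well as NaN, and SimpleImputer only targets NaN by default. The sketch below is not the asker's setup: the synthetic `X` and `y`, the `logistic__C` candidate list, and the search settings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 200 rows, 3 columns,
# with 20 NaN values placed in the last column only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.choice(200, size=20, replace=False), 2] = np.nan

# Per-column counts of NaN and infinite values in the raw input.
nan_counts = np.isnan(X).sum(axis=0)
inf_counts = np.isinf(X).sum(axis=0)
print("NaN per column:", nan_counts)
print("inf per column:", inf_counts)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True)),
    ("logistic", LogisticRegression()),
])

# Hypothetical search space: only the regularisation strength is varied.
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logistic__C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(X, y)

# After fitting, the imputer step should leave no NaN behind.
imputed = search.best_estimator_.named_steps["imputer"].transform(X)
print("NaN after imputation:", int(np.isnan(imputed).sum()))
```

If a search like this runs cleanly while the real one fails, the per-column counts on the real data are a reasonable next check: a column with infinite values, or a step that receives the data before the imputer does, could produce the same message.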

pipeline scikit-learn sklearn-pandas

5
votes
1
answer
3850
views