I have a data frame that contains a column named ProjectSubject. The data frame is about 1,000,000 rows long.
The ProjectSubject column holds many different strings. Here is an example:
> unique(unlist(projectdf$ProjectSubject))
 [1] "Applied Learning"                      "Applied Learning, Literacy & Language"
 [3] "Literacy & Language"                   "Special Needs"
 [5] "Literacy & Language, History & Civics" "Math & Science"
 [7] "History & Civics, Math & Science"      "Literacy & Language, Special Needs"
 [9] "Applied Learning, Special Needs"       "Health & Sports, Special Needs"
[11] "Math & Science, Literacy & Language"   "Literacy & Language, Math & Science"
[13] "Literacy & Language, Music & The Arts" "Math & Science, Special …

I have a Pandas DataFrame that has some NaN values in a particular column:
1291 NaN
1841 NaN
2049 NaN
Name: some column, dtype: float64
I have built the following pipeline to handle it:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute NaN with the column median, standardize, then fit the classifier.
scaler = StandardScaler(with_mean=True)
imputer = SimpleImputer(strategy='median')
logistic = LogisticRegression()
pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler),
                 ('logistic', logistic)])
Now, when I pass this pipeline to a RandomizedSearchCV, I get the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
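It is actually much longer than that; I can post the whole error in an edit if needed. In any case, I am fairly sure this column is the only one containing NaN. Also, if I switch from SimpleImputer to the (now deprecated) Imputer in the pipeline, the pipeline works in my RandomizedSearchCV. I checked the documentation, but it seems that SimpleImputer …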
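
For anyone trying to reproduce this, the assumption that only one column holds NaN can be checked directly before the search runs. The sketch below is not the asker's data or search setup: the toy frame, the `non_finite_report` helper, and the `logistic__C` candidate values are all made up for illustration. It counts NaN and infinite values per column, then fits the same three-step pipeline inside a RandomizedSearchCV with `error_score="raise"` so that any failure inside a fold surfaces with its full traceback.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: only "some column" contains NaN.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "some column": rng.normal(size=60),
    "other": rng.normal(size=60),
})
X.loc[[3, 17, 42], "some column"] = np.nan
y = np.tile([0, 1], 30)  # balanced binary labels


def non_finite_report(frame):
    """Count NaN and +/-inf values in each numeric column (illustrative helper)."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "n_nan": numeric.isna().sum(),
        "n_inf": np.isinf(numeric).sum(),
    })


report = non_finite_report(X)
print(report)

# Same step names as the pipeline in the question.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True)),
    ("logistic", LogisticRegression()),
])

# The candidate values for C are an assumption, not taken from the question.
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logistic__C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    n_iter=3,
    cv=3,
    random_state=0,
    error_score="raise",  # re-raise fold errors instead of recording a score
)
search.fit(X, y)
print(search.best_params_)
```

If the report on the real data shows a non-zero `n_inf` anywhere, that may be worth ruling out first, since the error message mentions infinity as well as NaN and the imputer's default `missing_values=np.nan` only targets NaN.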