Tag: data-science

ID    x1   x2   x3    x4    x5    x6    x7   x8   x9   x10
1   -0.18   5 -0.40 -0.26  0.53 -0.66  0.10   2 -0.20    1
2   -0.58   5 -0.52 -1.66  0.65 -0.15  0.08   3  3.03   -2
3   -0.62   5 -0.09 -0.38  0.65  0.22  0.44   4  1.49    1
4   -0.22  -3  1.64 -1.38  0.08  0.42  1.24   5 -0.34    0
5    0.00   5  1.76 -1.16  0.78  0.46  0.32   5 -0.51   -2


What is the best way to visualize this data? I am using matplotlib to plot it and pandas to read it from a CSV file.

Thanks
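There is no single best plot for a table like this. One option, sketched here under the assumption that every column is numeric, is a heatmap of the whole table. The data is inlined from the question so the example is self-contained; with a real file the `read_csv` call would take a file path instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# The table from the question, inlined so the sketch needs no file on disk.
raw = """ID x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
1 -0.18 5 -0.40 -0.26 0.53 -0.66 0.10 2 -0.20 1
2 -0.58 5 -0.52 -1.66 0.65 -0.15 0.08 3 3.03 -2
3 -0.62 5 -0.09 -0.38 0.65 0.22 0.44 4 1.49 1
4 -0.22 -3 1.64 -1.38 0.08 0.42 1.24 5 -0.34 0
5 0.00 5 1.76 -1.16 0.78 0.46 0.32 5 -0.51 -2
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", index_col="ID")

# One row per ID, one column per feature, color encodes the value.
fig, ax = plt.subplots(figsize=(8, 3))
image = ax.imshow(df.to_numpy(), aspect="auto", cmap="coolwarm")
ax.set_xticks(range(df.shape[1]))
ax.set_xticklabels(df.columns)
ax.set_yticks(range(df.shape[0]))
ax.set_yticklabels(df.index)
ax.set_xlabel("feature")
ax.set_ylabel("ID")
fig.colorbar(image, ax=ax, label="value")
fig.tight_layout()
```

A heatmap works for a quick overview; per-column histograms or a scatter matrix may suit better once there are many more rows.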

python numpy matplotlib data-science

kin*_*o-d

2016-10-29

6
Score

1
Answers

8625
Views

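How to fix "IndexError: too many indices for array"

My code below gives the error "IndexError: too many indices for array". I am very new to machine learning, so I have no idea how to fix this. Any kind of help would be appreciated.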

import pandas
from sklearn.linear_model import LogisticRegression
# in scikit-learn >= 0.20 this module is sklearn.model_selection
from sklearn.cross_validation import cross_val_score

train = pandas.read_csv("D:/...input/train.csv")

xTrain = train.iloc[:,0:54]
yTrain = train.iloc[:,54:]

clf = LogisticRegression(multi_class='multinomial')
scores = cross_val_score(clf, xTrain, yTrain, cv=10, scoring='accuracy')
print('****Results****')
print(scores.mean())
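A likely cause, assuming column 54 is the only label column, is that `train.iloc[:, 54:]` returns a two-dimensional DataFrame while the cross-validation code expects a one-dimensional label vector. The sketch below uses randomly generated data as a hypothetical stand-in for `train.csv` and selects the label column as a Series instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for train.csv: 54 feature columns plus one label column.
rng = np.random.default_rng(0)
train = pd.DataFrame(
    rng.normal(size=(300, 54)), columns=[f"f{i}" for i in range(54)]
)
train["label"] = rng.integers(0, 3, size=300)

xTrain = train.iloc[:, 0:54]
# A single integer position gives a 1-D Series;
# the slice train.iloc[:, 54:] would give a 2-D DataFrame instead.
yTrain = train.iloc[:, 54]

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, xTrain, yTrain, cv=10, scoring="accuracy")
print("****Results****")
print(scores.mean())
```

Calling `.values.ravel()` on the sliced DataFrame would flatten the labels in the same way.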


python arrays machine-learning indices data-science

Suj*_* De

6
Score

2
Answers

20k
Views

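Creating train/test/validation splits with StratifiedKFold

I am trying to use StratifiedKFold to create train/test/validation splits for use in a non-sklearn machine learning workflow. So the DataFrame needs to be split and then stay that way.

I tried to do it as follows, using .values because I am passing pandas DataFrames: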

skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)

for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index,  "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]


This fails.

I read through all of the sklearn …
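`StratifiedKFold.split` yields two index arrays per fold (train and test), so unpacking three names from it cannot work. One possible workaround, shown as a sketch and not as the accepted answer, is to carve a stratified validation set out of each fold's non-test rows with `train_test_split`. The toy `X` and `y` and the 25% validation share are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical toy data standing in for the question's X and y.
X = pd.DataFrame({"f1": np.arange(60), "f2": np.arange(60) % 7})
y = pd.Series(np.arange(60) % 3)

skf = StratifiedKFold(n_splits=3, shuffle=False)
folds = []
for rest_index, test_index in skf.split(X, y):
    # Split the non-test rows again, keeping class proportions.
    train_index, valid_index = train_test_split(
        rest_index,
        test_size=0.25,
        stratify=y.iloc[rest_index],
        random_state=0,
    )
    # Store DataFrames/Series so the splits persist outside the loop.
    folds.append({
        "train": (X.iloc[train_index], y.iloc[train_index]),
        "valid": (X.iloc[valid_index], y.iloc[valid_index]),
        "test": (X.iloc[test_index], y.iloc[test_index]),
    })
```

Each entry of `folds` holds three disjoint pieces that can be handed to any framework, since they are ordinary pandas objects.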

python pandas scikit-learn cross-validation data-science

tw0*_*000

2017-07-21

6
Score

1
Answers

6614
Views

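Predicting on test data with a model fitted on the training dataset?

I am new to data science and analytics. After studying a lot of kernels on Kaggle, I built a model that predicts property prices. I have tested this model on my training data, but now I want to run it on my test data. I have a test.csv file that I want to use. How do I do that? Here is what I did with the training dataset before: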

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

#loading my train dataset into python
train = pd.read_csv('/Users/sohaib/Downloads/test.csv')

#factors that will predict the price
train_pr = ['OverallQual','GrLivArea','GarageCars','TotalBsmtSF','FullBath','YearBuilt']

#set my model to DecisionTree
model = DecisionTreeRegressor()

#set prediction data to factors that will predict, and set target to SalePrice
prdata = train[train_pr]
target = train.SalePrice

#fitting model with prediction data and telling it my target
model.fit(prdata, target)

model.predict(prdata.head())


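What I tried next was to copy the whole code and change "train" to "test" and "prdata" to "testprdata". I thought it would work, but sadly it did not. I know I am doing something wrong, but I do not know what it is.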
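Copying the whole script also copies the `fit` call, and a Kaggle-style test file normally has no `SalePrice` column to fit on. A minimal sketch of the usual pattern, assuming that is the issue: fit once on the training data, then call `predict` on the same feature columns of the test data. The two small DataFrames are hypothetical stand-ins for the CSV files, and only three of the question's features are used to keep it short.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

features = ["OverallQual", "GrLivArea", "GarageCars"]

# Hypothetical stand-in for the training file (includes the target).
train = pd.DataFrame({
    "OverallQual": [5, 7, 6, 8],
    "GrLivArea": [1200, 1800, 1500, 2200],
    "GarageCars": [1, 2, 2, 3],
    "SalePrice": [150000, 230000, 190000, 310000],
})
# Hypothetical stand-in for the test file (no SalePrice column).
test = pd.DataFrame({
    "OverallQual": [6, 8],
    "GrLivArea": [1400, 2100],
    "GarageCars": [2, 3],
})

# Fit on the training data only.
model = DecisionTreeRegressor(random_state=0)
model.fit(train[features], train["SalePrice"])

# Reuse the fitted model; the test set only needs the same feature columns.
test_predictions = model.predict(test[features])
print(test_predictions)
```

If the real test file contains missing values in these columns, they may need to be filled before calling `predict`.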

python data-analysis scikit-learn data-science

Soh*_*yed

2017-08-15

6
Score

1
Answers

30k
Views

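How to handle a category mismatch after one-hot encoding in the test data when predicting?

I apologize if the title of the question is not very clear; I could not summarize the problem in one sentence.

Below is a simplified dataset for explanation. Basically, the training set has far more categories than the test set, so after OneHotEncoding the training and test sets end up with a different number of columns. How do I handle this?

Training set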

+-------+----------+
| Value | Category |
+-------+----------+
| 100   | SE1      |
+-------+----------+
| 200   | SE2      |
+-------+----------+
| 300   | SE3      |
+-------+----------+


Training set after OneHotEncoding

+-------+-----------+-----------+-----------+
| Value | DummyCat1 | DummyCat2 | DummyCat3 |
+-------+-----------+-----------+-----------+
| 100   | 1         | 0         | 0         |
+-------+-----------+-----------+-----------+
| 200   | 0         | 1         | 0         |
+-------+-----------+-----------+-----------+
| 300   | 0         | 0         | 1         |
+-------+-----------+-----------+-----------+


Test set

+-------+----------+
| Value | Category |
+-------+----------+ …
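One common way to keep the columns aligned, sketched here with scikit-learn's `OneHotEncoder`, is to fit the encoder on the training categories only and reuse it on the test set. The test rows below are hypothetical, since the question's test table is cut off.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "Value": [100, 200, 300],
    "Category": ["SE1", "SE2", "SE3"],
})
# Hypothetical test set containing only a subset of the training categories.
test = pd.DataFrame({
    "Value": [150, 250],
    "Category": ["SE1", "SE2"],
})

# Learn the category list from the training data only.
# handle_unknown="ignore" encodes unseen test categories as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Category"]])

# Both outputs have one column per training category.
train_encoded = encoder.transform(train[["Category"]]).toarray()
test_encoded = encoder.transform(test[["Category"]]).toarray()
print(train_encoded.shape, test_encoded.shape)
```

With `pd.get_dummies`, a similar effect can be had by reindexing the test columns to the training columns with `fill_value=0`.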


python machine-learning scikit-learn data-science

Par*_*eog

6
Score

1
Answers

3705
Views

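Linear regression vs. random forest performance accuracy

If a dataset contains some features, some of them categorical and others continuous, a decision tree is better than linear regression, because a tree can split the data accurately on the categorical variables. Is there any situation where linear regression outperforms random forest?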
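One situation where linear regression tends to win is extrapolation on a roughly linear relationship: a forest predicts averages of training targets, so it cannot go beyond the range it has seen. The sketch below uses synthetic data (an assumption for illustration, not data from the question), training on `x` in [0, 10] and testing on `x` in [10, 20].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic linear relationship y = 3x + noise.
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 3.0 * X_train[:, 0] + rng.normal(0, 1, size=200)

# Test inputs lie outside the training range.
X_test = rng.uniform(10, 20, size=(200, 1))
y_test = 3.0 * X_test[:, 0] + rng.normal(0, 1, size=200)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

linear_mse = mean_squared_error(y_test, linear.predict(X_test))
forest_mse = mean_squared_error(y_test, forest.predict(X_test))
print(f"linear MSE: {linear_mse:.2f}, forest MSE: {forest_mse:.2f}")
```

Small datasets with a genuinely linear signal are another case where the simpler model often holds up well.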

python data-science

Sou*_*aha

6
Score

1
Answers

10k
Views

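Creating a boxplot FacetGrid in Seaborn for Python

I am trying to create a 4x4 FacetGrid in seaborn for 4 boxplots, each of which is split into 3 boxplots by iris species in the iris dataset. Currently, my code looks like this: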

sns.set(style="whitegrid")
iris_vis = sns.load_dataset("iris")

fig, axes = plt.subplots(2, 2)

ax = sns.boxplot(x="Species", y="SepalLengthCm", data=iris, orient='v', 
    ax=axes[0])
ax = sns.boxplot(x="Species", y="SepalWidthCm", data=iris, orient='v', 
    ax=axes[1])
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=iris, orient='v', 
    ax=axes[2])
ax = sns.boxplot(x="Species", y="PetalWidthCm", data=iris, orient='v', 
    ax=axes[3])


However, I get this error from the interpreter:

AttributeError: 'numpy.ndarray' object has no attribute 'boxplot'


I am confused about where exactly the attribute error comes from. What do I need to change?
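With `plt.subplots(2, 2)`, `axes` is a 2x2 NumPy array, so `axes[0]` is a whole row of axes and not a single Axes object, which would explain the error. A sketch of one fix is to flatten the array and pair each Axes with a column. The random DataFrame is a hypothetical stand-in for the iris data that reuses the question's column names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the iris data, using the question's column names.
rng = np.random.default_rng(0)
columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
iris = pd.DataFrame(rng.normal(loc=3.0, scale=1.0, size=(30, 4)), columns=columns)
iris["Species"] = np.repeat(["setosa", "versicolor", "virginica"], 10)

fig, axes = plt.subplots(2, 2)

# axes has shape (2, 2); flatten() yields the four individual Axes objects.
for ax, column in zip(axes.flatten(), columns):
    sns.boxplot(x="Species", y=column, data=iris, orient="v", ax=ax)

fig.tight_layout()
```

Indexing with two positions, such as `axes[0, 0]` and `axes[1, 1]`, works as well.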

python data-visualization boxplot seaborn data-science

Jos*_*ess

6
Score

2
Answers

20k
Views

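Fill missing values with values from the most similar row

I have the following table. Some values are NaN. Let's assume the columns are highly correlated. Taking row 0 and row 5, I would say the value in col2 should be 4.0. The same goes for row 1 and row 4. For row 6, however, there is no perfectly matching sample, so I should pick the most similar row, here row 0, and change the NaN to 3.0. How should I approach this? Is there any pandas function that can do this?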

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan], 
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5], 
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7], 
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})


Output:

    col1    col2    col3    col4
0   3.0     4.0     7.0     7.0
1   2.0     3.0     8.0     8.0
2   8.0     6.0     9.0     9.0
3 …
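This is not a pandas function, but scikit-learn's `KNNImputer` with `n_neighbors=1` comes close: it fills each missing value from the nearest row that has that column, measuring distance only over the columns both rows share. Treating that NaN-aware Euclidean distance as "most similar" is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

# n_neighbors=1 copies the value from the single closest row
# that actually has the missing column.
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(
    imputer.fit_transform(example),
    columns=example.columns,
    index=example.index,
)
print(filled)
```

Because the distance ignores missing coordinates, rows with very few observed values (such as row 3) are matched on little evidence, so their filled values deserve a second look.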


python pandas data-science

Mar*_*ank

6
Score

1
Answers

65
Views

Tag statistics

data-science ×10

python ×8

machine-learning ×3

scikit-learn ×3

pandas ×2

algorithm ×1

arrays ×1

boxplot ×1

cross-validation ×1

data-analysis ×1

data-mining ×1

data-visualization ×1

indices ×1

matplotlib ×1

neural-network ×1

numpy ×1

pattern-mining ×1

seaborn ×1

tensorflow ×1
