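I wrote the following R code to identify duplicate files in a directory. How can I vectorize the for loop using the plyr package (or something similar)? I would like a more idiomatic R solution than the one I came up with.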
library("digest") # to compute the MD5 digest
test_dir = "/Users/user/Dropbox/kaggle/r_projects/test_photo"
filelist <- dir(test_dir, pattern = "JPG|AVI", recursive=TRUE,
all.files =TRUE, full.names=TRUE)
fl = list() # create an empty list to hold MD5 digests and file names
for (itm in filelist) {
file_digest = digest(itm, file=TRUE, algo="md5")
fl[[file_digest]]= c(fl[[file_digest]],itm)
}
fl
The output (using a small test directory) is:
> fl
$`5715b719723c5111b3a38a6ff8b7ca56`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480.JPG"
$`24fd4d7d252ca66c8d7a88b539c55112`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3481.JPG"
$`2a1d668c874dc856b9df0fbf3f2e81ec`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482 copy.JPG"
[4] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482.JPG"
I tried:
h=ldply(filelist, digest, file=TRUE, algo="md5") …
I am trying to improve the 96.5% scikit-learn kNN prediction score from the Kaggle digit recognition tutorial by adding PCA to the algorithm, but the new kNN predictions based on the PCA output are terrible, around 23%.
The full code is below; I would be grateful if you could point out where I went wrong.
import pandas as pd
import numpy as np
import pylab as pl
import os as os
from sklearn import metrics
%pylab inline
os.chdir("/users/******/desktop/python")
traindata=pd.read_csv("train.csv")
traindata=np.array(traindata)
traindata=traindata.astype(float)
X,y=traindata[:,1:],traindata[:,0]
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.25, random_state=33)
#scale & PCA train data
from sklearn import preprocessing
from sklearn.decomposition import PCA
X_train_scaled = preprocessing.scale(X_train)
estimator = PCA(n_components=350)
X_train_pca = estimator.fit_transform(X_train_scaled)
# sum(estimator.explained_variance_ratio_) = 0.96
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=6)
neigh.fit(X_train_pca,y_train)
# scale & PCA test …
I am practicing R on Kaggle with the Titanic dataset, using the random forest algorithm.
Here is the code:
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age_Bucket + Embarked
+ Fare_Bucket + F_Name + Title + FamilySize + FamilyID,
data=train, importance=TRUE, ntree=5000)
I get the following error:
Error in randomForest.default(m, y, ...) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
3: In data.matrix(x) : NAs introduced by coercion
4: In data.matrix(x) : NAs introduced by coercion
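My data looks like this:
$ …
I am new to all of these methods and am trying to get a simple answer, or a pointer to a high-level explanation somewhere online. My Google searches only turned up Kaggle example code.
Are extra trees and random forest essentially the same? And xgboost uses boosting when choosing the features for any particular tree, i.e. it samples the features. But then how do the other two algorithms select the features?
Thanks!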
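As a rough sketch of the usual textbook descriptions (not of any library's internals): random forest and extra trees both draw a random subset of candidate features at each node, and differ in how the threshold is picked. Boosting, by contrast, refers to building trees one after another on the errors of the previous ones; column sampling there is an optional extra. The function names and toy data below are made up for illustration.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_score(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def choose_split(rows, labels, max_features, rng, extra_trees=False):
    """Pick (score, feature, threshold) for one tree node.

    Both methods first draw a random subset of candidate features.
    Random forest then tries every observed value of each candidate as a
    threshold; extra trees draws one random threshold per candidate instead.
    """
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), max_features)
    best = None
    for feature in candidates:
        values = sorted({x[feature] for x in rows})
        if extra_trees:
            thresholds = [rng.uniform(values[0], values[-1])]
        else:
            thresholds = values[:-1]  # splitting at the maximum leaves one side empty
        for threshold in thresholds:
            candidate = (split_score(rows, labels, feature, threshold), feature, threshold)
            if best is None or candidate < best:
                best = candidate
    return best


# Toy data, invented for illustration: feature 0 separates the classes, feature 1 does not.
rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]

print(choose_split(rows, labels, max_features=2, rng=random.Random(0)))
print(choose_split(rows, labels, max_features=2, rng=random.Random(0), extra_trees=True))
```

With `max_features` smaller than the number of columns, both variants may miss the best feature at a given node, which is what decorrelates the trees in the ensemble.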
I am exploring the IMDB movie dataset from Kaggle with R.
Here is a minimal reproducible dataset:
> movies <- data.frame(movie = as.factor(c("Movie 1", "Movie 2", "Movie 3", "Movie 4")), director = as.factor(c("Dir 1", "Dir 2", "Dir 1", "Dir 3")), director_rating = c(1000, 2000, 1000, 3000))
> movies
movie director director_rating
1 Movie 1 Dir 1 1000
2 Movie 2 Dir 2 2000
3 Movie 3 Dir 1 1000
4 Movie 4 Dir 3 3000
Note that every row with the same director has the same director_rating value.
I want to list the directors ordered by rating, one row per director. The following code works:
> library(dplyr)
> movies %>%
group_by(director) %>%
summarize(director_rating = mean(director_rating)) %>%
arrange(desc(director_rating))
# A tibble: 3 x 2 …
Is it possible to create some kind of hover tooltip on a seaborn linear regression plot from a pandas DataFrame?
More specifically, I have this code:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
dframe=pd.read_csv('pokemon.csv')
sns.lmplot(x='hp', y='total', data=dframe)
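The result is a linear regression plot. How can I create a tooltip so that when I hover over a point on the chart, the corresponding value from one of the columns of dframe is displayed? In this case, the name of the Pokémon.
The CSV file comes from Kaggle.
I am working through the Kaggle Titanic tutorial on the DataCamp platform.
I understand that .loc in pandas selects values by row using column labels...
What confuses me is that in the DataCamp tutorial we want to find every "male" entry in the "Sex" column and replace it with 0. They use the following code to do that: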
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
Can someone explain how this works? I thought .loc took row and column inputs, so what is the == doing?
Shouldn't it be:
titanic.loc["male", "Sex"] = 0
Thanks!
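A small sketch of what the `==` contributes, assuming a toy DataFrame invented for illustration (not the real Titanic data): the comparison produces a boolean Series, and `.loc` accepts that Series as its row selector. By contrast, `titanic.loc["male", "Sex"]` would look for a row whose index label is "male".

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
# dtype=object lets the column hold both strings and the integer 0.
titanic = pd.DataFrame({
    "Sex": pd.Series(["male", "female", "male"], dtype=object),
    "Age": [22, 38, 35],
})

# The comparison is evaluated first and yields one True/False per row.
is_male = titanic["Sex"] == "male"
print(is_male.tolist())

# .loc[row_selector, column_selector]: the boolean Series picks the rows,
# "Sex" picks the column, and the assignment overwrites only those cells.
titanic.loc[is_male, "Sex"] = 0
print(titanic["Sex"].tolist())
```

Writing the mask inline, as the tutorial does, is equivalent to naming it first as in this sketch.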
When I run the following lines
!pip install kaggle
!kaggle competitions download -c dogs-vs-cats -p /content/
I get the following error message:
Traceback (most recent call last):
File "/usr/local/bin/kaggle", line 7, in <module>
from kaggle.cli import main
File "/usr/local/lib/python3.6/dist-packages/kaggle/__init__.py", line 23, in <module>
api.authenticate()
File "/usr/local/lib/python3.6/dist-packages/kaggle/api/kaggle_api_extended.py", line 109, in authenticate
self._load_config(config_data)
File "/usr/local/lib/python3.6/dist-packages/kaggle/api/kaggle_api_extended.py", line 151, in _load_config
raise ValueError('Error: Missing %s in configuration.' % item)
ValueError: Error: Missing username in configuration.
I have no idea what just happened... the same lines worked fine before. This is the first time I have seen this problem.
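The traceback says the Kaggle client found no `username` when it loaded its configuration. Below is a minimal diagnostic sketch, assuming the client's documented credential sources: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle` or in the directory named by `KAGGLE_CONFIG_DIR`. The helper name `missing_kaggle_credentials` is hypothetical and not part of the `kaggle` package.

```python
import json
import os
from pathlib import Path


def missing_kaggle_credentials(environ=None, config_dir=None):
    """Return the credential fields ("username", "key") that cannot be found.

    Hypothetical helper for diagnosis only; it inspects the config file and
    the environment variables and does not call the kaggle package.
    """
    environ = os.environ if environ is None else environ
    if config_dir is None:
        config_dir = environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")

    found = {}
    config_file = Path(config_dir) / "kaggle.json"
    if config_file.is_file():
        # kaggle.json is a JSON object with "username" and "key" entries.
        found.update(json.loads(config_file.read_text()))

    # Environment variables take the form KAGGLE_<FIELD>.
    for field in ("username", "key"):
        value = environ.get("KAGGLE_" + field.upper())
        if value:
            found[field] = value

    return [field for field in ("username", "key") if not found.get(field)]
```

Calling `missing_kaggle_credentials()` with no arguments checks the current environment. One possible cause here, assuming the lines were run on a fresh hosted notebook runtime (the `/content/` path suggests one), is that a previously uploaded `kaggle.json` no longer exists in the new session.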
transform = transforms.Compose([transforms.ToPILImage(), transforms.ToTensor()])
Before applying the transform
After applying the transform
Q.1 Why do the pixel values change?
Q.2 How can I correct this?
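On Q.1, `transforms.ToTensor()` is documented to convert a PIL image or `uint8` array with values in [0, 255] into a float tensor in [0.0, 1.0] with channels first, so the values are rescaled rather than corrupted. The sketch below mimics that arithmetic with NumPy only, assuming an 8-bit input image; it does not call torchvision, and `to_tensor_like` and `to_uint8_image` are hypothetical names.

```python
import numpy as np


def to_tensor_like(image_hwc):
    """Mimic ToTensor's arithmetic: HWC uint8 in [0, 255] -> CHW float32 in [0, 1]."""
    return np.transpose(image_hwc, (2, 0, 1)).astype(np.float32) / 255.0


def to_uint8_image(tensor_chw):
    """Undo the scaling: CHW float in [0, 1] -> HWC uint8 in [0, 255]."""
    scaled = np.rint(tensor_chw * 255.0).astype(np.uint8)
    return np.transpose(scaled, (1, 2, 0))


# A made-up 2x2 RGB image, for illustration only.
image = np.array([[[0, 128, 255], [10, 20, 30]],
                  [[200, 100, 50], [255, 255, 255]]], dtype=np.uint8)

scaled = to_tensor_like(image)
print(scaled.shape, scaled.min(), scaled.max())

restored = to_uint8_image(scaled)
print(np.array_equal(restored, image))
```

For Q.2, this suggests that multiplying by 255 and casting back to 8-bit recovers the original pixel values when they are needed, for example for display.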
python machine-learning python-imaging-library pytorch kaggle
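Please find below the code for classifying images into 2 classes, which I am trying to run on a Kaggle TPU. Could you help with the problem here? I followed the guide on the Kaggle website for using the GPU, but still no luck.
Below is the error stack produced by the code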
import tensorflow as tf
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
print(tpu_strategy)
# save the final model to file
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Dense
from keras.layers import Flatten
from keras.preprocessing.image import ImageDataGenerator
# define cnn model
def define_model():
    with tpu_strategy.scope():
        # load model
        model = VGG16(include_top=False, input_shape=(224, 224, 3))
        # mark loaded layers as not trainable
        for layer in model.layers:
            layer.trainable = …