I have installed Python 2.7.8 alongside the 2.7.5 that ships with OS X 10.9.4. How can I now point rPython at Python 2.7.8?

I have modified my .bash_profile on OS X as follows, to point to the newer Python installation:
export PATH=/usr/local/Cellar/python/2.7.8/bin:$PATH:/usr/local/bin
Now when I run python from the terminal, it correctly runs the newer version:
mba:~ tommy$ which python
/usr/local/Cellar/python/2.7.8/bin//python
However, rPython still sees 2.7.5:
> library(rPython)
Loading required package: RJSONIO
> python.exec("import sys; print(sys.version)")
2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]
It looks like .bash_profile is not used by R at all... so I tried modifying the PATH inside R, but still no luck:
> Sys.getenv("PATH")
[1] "/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin"
> Sys.setenv(PATH = "/usr/local/Cellar/python/2.7.8/bin")
> library(rPython)
Loading required package: RJSONIO
> python.exec("import sys; print(sys.version)")
2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]

I want to add multiple columns to a pandas DataFrame and set them all equal to an existing column. Is there a simple way to do this? In R I would do:
df <- data.frame(a=1:5)
df[c('b','c')] <- df$a
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
In pandas this results in KeyError: "['b' 'c'] not in index":
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1,6)})
df[['b','c']] = df.a
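One way that avoids the KeyError (a sketch, not necessarily the only approach) is `DataFrame.assign`, which adds both columns from the existing one in a single call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1, 6)})
# assign returns a new DataFrame with both new columns set from column 'a'
df = df.assign(b=df['a'], c=df['a'])
```

Plain chained assignment (`df['b'] = df['c'] = df['a']`) also works if mutating in place is acceptable.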

How can I sum multiple columns in Spark? For example, in SparkR the following code works to get the sum of one column, but if I try to get the sum of two columns of df, I get an error.
# Create SparkDataFrame
df <- createDataFrame(faithful)
# Use agg to sum total waiting times
head(agg(df, totalWaiting = sum(df$waiting)))
##This works
# Use agg to sum total of waiting and eruptions
head(agg(df, total = sum(df$waiting, df$eruptions)))
##This doesn't work
Either SparkR or PySpark code would work.
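One common pattern (my assumption about the intended route, not stated in the post) is to aggregate the elementwise sum of the two columns, e.g. `agg(df, total = sum(df$waiting + df$eruptions))` in SparkR or `df.agg(F.sum(F.col('waiting') + F.col('eruptions')))` in PySpark. Since no live Spark session is available here, the same logic in pandas as a runnable sketch, with a few made-up rows shaped like the faithful data:

```python
import pandas as pd

# A few rows in the shape of the faithful dataset (values are illustrative)
df = pd.DataFrame({'eruptions': [3.600, 1.800, 3.333],
                   'waiting': [79.0, 54.0, 74.0]})
# One grand total across both columns: sum the elementwise sum,
# mirroring agg(sum(col('waiting') + col('eruptions'))) in Spark
total = (df['waiting'] + df['eruptions']).sum()
```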
I have a data.frame in R whose value column contains data of class character. I want to identify the row numbers where value changes. In the example below, I want to get back 4, 7, and 9. Is there a way to do this without a loop?
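One loop-free approach (sketched here in Python/numpy rather than R; the analogous R idea is `which(df$value[-1] != df$value[-nrow(df)]) + 1`) compares each element with its predecessor, using the same values as the example data below:

```python
import numpy as np

# Same values as the example data.frame below
value = np.array(['100', '100', '100', '200', '200', '200',
                  '300', '300', '400', '400'])
# True where an element differs from the one before it; +2 converts the
# 0-based position in the shifted comparison to a 1-based row number
change_rows = (np.where(value[1:] != value[:-1])[0] + 2).tolist()
```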
df <- data.frame(ind=1:10,
                 value=as.character(c(100,100,100,200,200,200,300,300,400,400)),
                 stringsAsFactors=F)
df
ind value
1 1 100
2 2 100
3 3 100
4 4 200
5 5 200
6 6 200
7 7 300
8 8 300
9 9 400
10 10 400
Run Code Online (Sandbox Code Playgroud) 我有一个SparkR DataFrame,如下所示:
#Create R data.frame
custId <- c(rep(1001, 5), rep(1002, 3), 1003)
date <- c('2013-08-01','2014-01-01','2014-02-01','2014-03-01','2014-04-01','2014-02-01','2014-03-01','2014-04-01','2014-04-01')
desc <- c('New','New','Good','New', 'Bad','New','Good','Good','New')
newcust <- c(1,1,0,1,0,1,0,0,1)
df <- data.frame(custId, date, desc, newcust)
#Create SparkR DataFrame
df <- createDataFrame(df)
display(df)
custId| date | desc | newcust
--------------------------------------
1001 | 2013-08-01| New | 1
1001 | 2014-01-01| New | 1
1001 | 2014-02-01| Good | 0
1001 | 2014-03-01| New | 1
1001 | 2014-04-01| Bad | 0
1002 | 2014-02-01| New | 1
1002   | 2014-03-01| Good | 0
1002   | 2014-04-01| Good | 0
1003   | 2014-04-01| New  | 1

I have a list l in R, shown below. I want to remove the elements whose only alphanumeric character is 0. How can I do this?
# Create list
l <- list(c('108', '50', '0]'), c('109','58','0','0]'), c('18','0'))
l
[[1]]
[1] "108" "50" "0]"
[[2]]
[1] "109" "58" "0" "0]"
[[3]]
[1] "18" "0"
# What I want:
l
[[1]]
[1] "108" "50"
[[2]]
[1] "109" "58"
[[3]]
[1] "18"
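The same filtering logic sketched in Python rather than R (in R one could apply the analogous gsub/`!=` test inside lapply): strip the non-alphanumeric characters from each element and drop the ones that reduce to a bare "0".

```python
import re

l = [['108', '50', '0]'], ['109', '58', '0', '0]'], ['18', '0']]
# Keep elements whose alphanumeric content is something other than '0'
cleaned = [[s for s in sub if re.sub(r'[^0-9A-Za-z]', '', s) != '0']
           for sub in l]
```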

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC, and I want to create a new column with the local time based on the time_zone column. How can I do this in PySpark?
df
+-------------------------+------------+
| hour | time_zone |
+-------------------------+------------+
|2019-10-16T20:00:00+0000 | US/Eastern |
|2019-10-15T23:00:00+0000 | US/Central |
+-------------------------+------------+
#What I want:
+-------------------------+------------+---------------------+
| hour | time_zone | local_time |
+-------------------------+------------+---------------------+
|2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T15:00:00 |
|2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T17:00:00 |
+-------------------------+------------+---------------------+
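In PySpark, one likely route (my assumption, not stated in the post) is `F.from_utc_timestamp(df.hour, df.time_zone)`, which accepts the time-zone column directly in Spark 2.4+. Since no Spark session is available here, the same per-row conversion in plain Python with zoneinfo as a sketch:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

rows = [('2019-10-16T20:00:00+0000', 'US/Eastern'),
        ('2019-10-15T23:00:00+0000', 'US/Central')]
local_times = []
for hour, tz in rows:
    # Parse the UTC timestamp (the +0000 suffix makes it timezone-aware) …
    utc = datetime.strptime(hour, '%Y-%m-%dT%H:%M:%S%z')
    # … then convert it to the row's own time zone
    local_times.append(utc.astimezone(ZoneInfo(tz)).strftime('%Y-%m-%dT%H:%M:%S'))
```

Note that US daylight saving time is still in effect in mid-October, so Eastern is UTC-4 and Central is UTC-5 on these dates, giving 16:00 and 18:00 respectively.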

I am trying to open a pickled XGBoost model created in AWS SageMaker, to look at the feature importances of the model. I am trying to follow the answer in this post, but I am getting the error shown below. When I try to call Booster.save_model, I get the error 'Estimator' object has no attribute 'save_model'. How can I fix this?
# Build initial model
sess = sagemaker.Session()
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv')
xgb_cont = get_image_uri(region, 'xgboost', repo_version='0.90-1')
xgb = sagemaker.estimator.Estimator(xgb_cont, role, train_instance_count=1, train_instance_type='ml.m4.4xlarge',
output_path='s3://{}/{}'.format(bucket, prefix), sagemaker_session=sess)
xgb.set_hyperparameters(eval_metric='rmse', objective='reg:squarederror', num_round=100)
ts = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
xgb_name = 'xgb-initial-' + ts
xgb.set_hyperparameters(eta=0.1, alpha=0.5, max_depth=10)
xgb.fit({'train': s3_input_train}, job_name=xgb_name)
# Load model to get feature importances
model_path = 's3://{}/{}/{}/output/model.tar.gz'.format(bucket, prefix, xgb_name)
fs = s3fs.S3FileSystem()
with fs.open(model_path, 'rb') as …

I have a SparkR DataFrame and I want to get the mode (most common) value of value for each unique name. How can I do this? There doesn't seem to be a built-in mode function. Either a SparkR or PySpark solution would work.
#Create DF
df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
value = c(5, 5, 4, 3, 3, 7))
DF <- createDataFrame(df)
name | value
-----------------
Thomas | 5
Thomas | 5
Thomas | 4
Bill | 3
Bill | 3
Bill | 7
#What I want to get
name | mode(value)
-----------------
Thomas | 5
Bill | 3
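Spark indeed has no mode aggregate; a common workaround (an assumption on my part, not from the post) is to count (name, value) pairs and keep the most frequent value per name, e.g. with a `row_number()` window over the counts in PySpark. The same count-then-rank logic in pandas as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Thomas'] * 3 + ['Bill'] * 3,
                   'value': [5, 5, 4, 3, 3, 7]})
# Count each (name, value) pair …
counts = df.groupby(['name', 'value']).size().reset_index(name='n')
# … then keep the most frequent value within each name
modes = (counts.sort_values(['name', 'n'], ascending=[True, False])
               .drop_duplicates('name')[['name', 'value']])
```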

I am trying to create a pandas DataFrame where one of the columns uses the numpy repeat function multiple times. Here is how I would do it in R, using c and rep, and it works:
df <- data.frame(
  date = seq.Date(as.Date('2018-12-01'), as.Date('2019-12-01'), by='month'),
  value = c(rep(0.08, 7), rep(0.06, 6))
)
Here is what I tried in pandas, but it raises the error arrays must all be same length:
import numpy as np
import pandas as pd
df= pd.DataFrame({
'date': pd.date_range('2018-12-01', '2019-12-01', freq='MS'),
'value': [np.repeat(0.08, 7), np.repeat(0.06, 6)]
})
How can I do this in pandas?
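The error comes from handing pandas a list of two arrays (length 2) instead of one flat array of 13 values. Concatenating the np.repeat results first (a sketch of one way to do it) gives the R-style column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2018-12-01', '2019-12-01', freq='MS'),
    # np.repeat returns an array per call, so join the two pieces
    # into one flat array, like c(rep(...), rep(...)) in R
    'value': np.concatenate([np.repeat(0.08, 7), np.repeat(0.06, 6)])
})
```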