I'm trying to analyze the Gisette dataset from the feature selection challenge. When I try to concat the train data frame with the label series, following the pandas examples, it throws:

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'

Code:
import pandas as pd
trainData = pd.read_table(filepath_or_buffer='GISETTE/gisette_train.data'
,delim_whitespace=True
,header=None
,names=['AA','AB','AC','AD','AE','AF','AG','AH','AI','AJ','AK','AL','AM','AN','AO','AP','AQ','AR','AS','AT','AU','AV','AW','AX','AY','AZ','BA','BB','BC','BD','BE','BF','BG','BH','BI','BJ','BK','BL','BM','BN','BO','BP','BQ','BR','BS','BT','BU','BV','BW','BX','BY','BZ','CA','CB','CC','CD','CE','CF','CG','CH','CI','CJ','CK','CL','CM','CN','CO','CP','CQ','CR','CS','CT','CU','CV','CW','CX','CY','CZ','DA','DB','DC','DD','DE','DF','DG','DH','DI','DJ','DK','DL','DM','DN','DO','DP','DQ','DR','DS','DT','DU','DV','DW','DX','DY','DZ','EA','EB','EC','ED','EE','EF','EG','EH','EI','EJ','EK','EL','EM','EN','EO','EP','EQ','ER','ES','ET','EU','EV','EW','EX','EY','EZ','FA','FB','FC','FD','FE','FF','FG','FH','FI','FJ','FK','FL','FM','FN','FO','FP','FQ','FR','FS','FT','FU','FV','FW','FX','FY','FZ','GA','GB','GC','GD','GE','GF','GG','GH','GI','GJ','GK','GL','GM','GN','GO','GP','GQ','GR','GS','GT','GU','GV','GW','GX','GY','GZ','HA','HB','HC','HD','HE','HF','HG','HH','HI','HJ','HK','HL','HM','HN','HO','HP','HQ','HR','HS','HT','HU','HV','HW','HX','HY','HZ','IA','IB','IC','ID','IE','IF','IG','IH','II','IJ','IK','IL','IM','IN','IO','IP','IQ','IR','IS','IT','IU','IV','IW','IX','IY','IZ','JA','JB','JC','JD','JE','JF','JG','JH','JI','JJ','JK','JL','JM','JN','JO','JP','JQ','JR','JS','JT','JU','JV','JW','JX','JY','JZ','KA','KB','KC','KD','KE','KF','KG','KH','KI','KJ','KK','KL','KM','KN','KO','KP','KQ','KR','KS','KT','KU','KV','KW','KX','KY','KZ','LA','LB','LC','LD','LE','LF','LG','LH','LI','LJ','LK','LL','LM','LN','LO','LP','LQ','LR','LS','LT','LU','LV','LW','LX','LY','LZ','MA','MB','MC','MD','ME','MF','MG','MH','MI','MJ','MK','ML','MM','MN','MO','MP','MQ','MR','MS','MT','MU','MV','MW','MX','MY','MZ','NA','NB','NC','ND','NE','NF','NG','NH','NI','NJ','NK','NL','NM','NN','NO','NP','NQ','NR','NS','NT','NU','NV','NW','NX','NY','NZ','OA','OB','OC','OD','OE','OF','OG','OH','OI','OJ','OK','OL','OM','ON','OO','OP','OQ','OR','OS','OT','OU','OV','OW','OX','OY','OZ','PA','PB','PC','PD','PE','PF','PG','PH','PI','PJ','PK','PL','PM','PN','PO','PP','PQ','PR','PS','PT','PU','PV','PW','PX','PY','PZ','QA','QB','QC','QD','QE','QF','QG','QH','QI','QJ','QK','QL','QM','QN','QO','QP','QQ','QR','QS','QT','QU','QV','QW','QX','QY','QZ','RA','RB','RC','RD','RE','RF','RG','RH','RI','RJ','RK','RL','RM','RN','RO','RP','RQ','RR','RS','RT','RU','RV','RW','RX','RY','RZ','SA','SB','SC','SD','SE','SF','SG','SH','SI','SJ','SK','SL','SM','SN','SO','SP','SQ','SR','SS','ST','SU','SV','SW','SX','SY','SZ','TA','TB','TC','TD','TE','TF'])
# print 'finished with train data'
trainLabel = pd.read_table(filepath_or_buffer='GISETTE/gisette_train.labels'
,squeeze=True
,names=['label']
,delim_whitespace=True
,header=None)
trainData.info()
Output:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6000 entries
Columns: 500 entries, AA to TF
dtypes: int64(500)
trainLabel.describe()
Output:
count 6000.000000
mean 0.000000
std 1.000083
min -1.000000
25% -1.000000
50% 0.000000
75% 1.000000
max 1.000000
dtype: float64
readyToTrain = pd.concat([trainData, trainLabel], axis=1)
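The concat call above is the one that throws. A likely culprit, given that trainData.info() reports a MultiIndex: the data file has more columns than the 500 supplied names, so the extras get absorbed into the index, and concatenating a MultiIndexed frame with a RangeIndexed Series along axis=1 can raise exactly this buffer-dtype ValueError. A sketch of the workaround on made-up toy frames:

```python
import pandas as pd

# Toy stand-ins for the real data: trainData ends up with a MultiIndex,
# while trainLabel has a plain RangeIndex.
trainData = pd.DataFrame(
    {"AA": [10, 20, 30], "AB": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples([(0, 0), (1, 1), (2, 2)]),
)
trainLabel = pd.Series([-1, 1, -1], name="label")

# Dropping both indexes lets concat align the two sides on a fresh RangeIndex:
readyToTrain = pd.concat(
    [trainData.reset_index(drop=True), trainLabel.reset_index(drop=True)],
    axis=1,
)
```

(Alternatively, fix the read itself so no columns spill into the index, e.g. by supplying the right number of names.)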
Full stack trace:
File "C:\env\Python27\lib\site-packages\pandas\tools\merge.py", line 717, in concat
verify_integrity=verify_integrity) …

I modified the OneHotEncoder example to actually train a LogisticRegression. My question is: how do I map the resulting weights back to the categorical variables?
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SQLContext

def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
val df = sqlContext.createDataFrame(Seq(
(0, "a", 1.0),
(1, "b", 1.0),
(2, "c", 0.0),
(3, "d", 1.0),
(4, "e", 1.0),
(5, "f", 0.0)
)).toDF("id", "category", "label")
df.show()
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
indexed.select("id", "categoryIndex").show()
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("features")
val encoded = encoder.transform(indexed)
encoded.select("id", "features").show()
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(Array(indexer, encoder, lr))
// Fit the pipeline to …

Looking at http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html, the examples seem to only include Java and Scala.
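Back to the OneHotEncoder question above: StringIndexer orders categories by descending frequency, and the fitted StringIndexerModel exposes that ordering via its labels; OneHotEncoder drops the last category by default, so coefficient i of the logistic regression lines up with label i. Once the labels and coefficients are pulled out of the fitted pipeline, the mapping is just a zip. A sketch with made-up stand-in values (the commented extraction lines are hypothetical accessor paths, not verbatim API):

```python
# Stand-ins for values you would extract from the fitted pipeline, e.g.
#   labels  <- StringIndexerModel.labels
#   weights <- LogisticRegressionModel.coefficients
labels = ["a", "b", "c", "d", "e", "f"]      # category order from the indexer
weights = [0.42, -0.17, 0.93, 0.08, -0.55]   # one weight per encoded column

# OneHotEncoder(dropLast=True) omits the final category, which becomes the
# reference level, so only labels[:-1] carry a coefficient.
weight_by_category = dict(zip(labels[:-1], weights))
```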
Does Spark MLlib support PCA analysis from Python? If so, please point me to an example. If not, how can I combine Spark with scikit-learn?
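For what it's worth, pyspark.ml.feature.PCA has been available since Spark 1.5, so the DataFrame-based API does expose PCA to Python. Failing that, the scikit-learn route is to collect a (sample of the) feature matrix to the driver and reduce it there; the reduction itself is a few lines of NumPy SVD, sketched here on toy data:

```python
import numpy as np

# X would come from something like np.array(rdd.collect()) on the driver.
X = np.array([[2.0, 0.0], [0.0, 1.0], [4.0, 1.0], [2.0, 2.0]])

Xc = X - X.mean(axis=0)                            # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = components
k = 1                                              # components to keep
X_reduced = Xc @ Vt[:k].T                          # project onto top-k components
```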
I'm using spark-1.6.0-bin-hadoop2.6. According to http://spark.apache.org/docs/latest/configuration.html, I can set the heap size with spark.executor.memory, which is --executor-memory from spark-submit.

When running my job the executor memory never exceeds the allocated amount, but I still get the error:

java.lang.OutOfMemoryError: Java heap space

The job I'm submitting is:
./bin/spark-submit \
--class edu.gatech.cse8803.main.Main \
--master spark://ec2-52-23-155-99.compute-1.amazonaws.com:6066 \
--deploy-mode cluster \
--executor-memory 27G \
--total-executor-cores 100 \
/root/final_project/phenotyping_w_anchors_161-assembly-1.0.jar \
1000
I'm using 2 m4.2xlarge instances (32.0 GB, 8 cores each).
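One common cause when --executor-memory looks untouched: in cluster deploy mode the driver also runs on a worker, and anything pulled back with collect() or broadcast counts against the *driver* heap, which --executor-memory does not raise. A hypothetical variant of the submit command with an explicit driver heap (the 8G value is illustrative, not a recommendation):

```shell
./bin/spark-submit \
  --class edu.gatech.cse8803.main.Main \
  --master spark://ec2-52-23-155-99.compute-1.amazonaws.com:6066 \
  --deploy-mode cluster \
  --executor-memory 27G \
  --driver-memory 8G \
  --total-executor-cores 100 \
  /root/final_project/phenotyping_w_anchors_161-assembly-1.0.jar \
  1000
```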
Given an "empty" indicator data frame:
Index Ind_A Ind_B
1 0 0
2 0 0
3 0 0
4 0 0
and a data frame of values:
Index Indicators
1 Ind_A
3 Ind_A
3 Ind_B
4 Ind_A
I want to end up with:
Index Ind_A Ind_B
1 1 0
2 0 0
3 1 1
4 1 0
Is there a way to do this without a for loop?
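The question is R (where a contingency table via table()/xtabs() is the usual loop-free route), but the idea is language-neutral; sketched here in pandas: a crosstab builds the 0/1 matrix in one shot, and a reindex restores rows that fired no indicators.

```python
import pandas as pd

# the "values" frame from the question
values = pd.DataFrame({
    "Index": [1, 3, 3, 4],
    "Indicators": ["Ind_A", "Ind_A", "Ind_B", "Ind_A"],
})

# crosstab tabulates Index x Indicators without an explicit loop...
dummies = pd.crosstab(values["Index"], values["Indicators"])
# ...and reindex fills in rows (like Index 2) that had no indicators at all
full = dummies.reindex(range(1, 5), fill_value=0)
```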
I want to convert a list of strings into a data frame. Given the structure:
lst <- list(NULL, "PSYC", c("PSYC", "PHIL"), "PHIL")
I'd like to produce a data frame:
Index major_cd
1 NULL
2 PSYC
3 PSYC
3 PHIL
4 PHIL
Note how the 3rd item in the list becomes 2 rows of the data frame.
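The underlying move is "repeat the index once per element, then flatten". Sketched in Python (NULL becomes None, and the scalar entries of the R list are normalized to one-element lists so the same comprehension handles every case):

```python
# lst mirrors the R list from the question
lst = [None, ["PSYC"], ["PSYC", "PHIL"], ["PHIL"]]

# repeat each (1-based) index once per element, keeping a NULL row for None
rows = [
    (i, major)
    for i, majors in enumerate(lst, start=1)
    for major in (majors if majors is not None else [None])
]
```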
Suppose I have an RDD of (String, Date, Int) like this:
[("sam", 02-25-2016, 2), ("sam",02-14-2016, 4), ("pam",03-16-2016, 1), ("pam",02-16-2016, 5)]
I want to convert it into a list like this:
[("sam", 02-14-2016, 4), ("pam",02-16-2016, 5)]
where the values are the records whose date is the minimum for each key. What's the best way to do this?
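The usual Spark answer is to key by name and reduce, e.g. rdd.map(lambda r: (r[0], r)).reduceByKey(min_by_date).values() — reduceByKey only needs an associative two-record function. That function, exercised here without a cluster on plain Python data (dates as datetime.date so comparison is well-defined):

```python
from datetime import date

records = [("sam", date(2016, 2, 25), 2), ("sam", date(2016, 2, 14), 4),
           ("pam", date(2016, 3, 16), 1), ("pam", date(2016, 2, 16), 5)]

def min_by_date(a, b):
    # the associative reducer you would hand to reduceByKey
    return a if a[1] <= b[1] else b

# driver-side stand-in for map(...).reduceByKey(min_by_date).values()
earliest = {}
for rec in records:
    key = rec[0]
    earliest[key] = rec if key not in earliest else min_by_date(earliest[key], rec)
result = sorted(earliest.values())
```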
Given pseudo-code like this:
dateRange <- reactive({
input$select_dates #action button
save_selected_date_range()
isolate(input$dateRange)
})
customerId <- reactive({
#check if customer has saved date range if so trigger
saved_info <- saved_preferences(input$customerId)
if(nrow(saved_info) > 0) {
flog.info(saved_info)
updateDateRangeInput(session, "dateRange", start = saved_info$start, end = saved_info$end)
}
input$customerId
})
Scenario:
Inputs: a selected date range and a customer selector. The date range is registered when the action button is pressed.
Desired behavior: when a customer is selected, load that customer's saved date range (if available).
Question: how can I trigger input$select_dates as though the action button had been pressed? Something like invalidateLater without the timer would be nice. Alternatively, is there a manual way to mark or flag input$select_dates as invalidated?
apache-spark ×4
dataframe ×2
r ×2
scala ×2
amazon-ec2 ×1
indicator ×1
list ×1
pandas ×1
pca ×1
python ×1
python-2.7 ×1
shiny ×1