I'm trying to read the CSV file "read_ex.csv" into an array. I've searched the web/Stack Overflow endlessly for a way to read the file into an array. The best I've managed is to read it in a streaming fashion, but because the file can vary in size I can't store it in a fixed array. I believe an ArrayList is the way to handle a variable-size array, but I don't know how to use one. Basically, I want to be able to access the String array "values" after the while loop finishes.
import java.util.Scanner;
import java.io.FileNotFoundException;
import java.io.File;

public class sortarray {
    public static void main(String[] args) {
        String fileName = "read_ex.csv";
        File file = new File(fileName);
        try {
            Scanner inputStream = new Scanner(file);
            while (inputStream.hasNext()) {
                // Read the next token and split it on commas
                String data = inputStream.next();
                String[] values = data.split(",");
                System.out.println(values[1]);
            }
            inputStream.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        // This prints out the working directory
        System.out.println("Present Project Directory : " + System.getProperty("user.dir"));
    }
}
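A minimal sketch of the ArrayList approach (the class name ReadCsvIntoList is made up for illustration): each split line is appended to an ArrayList&lt;String[]&gt;, which grows with the file, so every row is still accessible after the loop ends.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ReadCsvIntoList {
    public static void main(String[] args) {
        // Growable list of rows; no need to know the file size up front
        List<String[]> rows = new ArrayList<>();
        try {
            Scanner inputStream = new Scanner(new File("read_ex.csv"));
            while (inputStream.hasNextLine()) {
                rows.add(inputStream.nextLine().split(","));
            }
            inputStream.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        // The parsed rows remain available here, after the loop
        for (String[] values : rows) {
            System.out.println(values[1]);
        }
    }
}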
I have some monthly data, and I'd like to add a column to my data frame that pairs the smallest value in the first column with the largest value in that column, the second-smallest with the second-largest, and so on...
Here is some sample data:
x1 <- c(100, 151, 109, 59, 161, 104, 170, 101)
dat <- data.frame(x1)
rownames(dat) <- c('Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov')
x1
Apr 100
May 151
Jun 109
Jul 59
Aug 161
Sep 104
Oct 170
Nov 101
I'm trying to get my data to look like this:
x1 x2
Apr 100 161
May 151 101
Jun 109 104
Jul 59 170
Aug 161 100
Sep 104 109
Oct 170 59
Nov 101 151
I've been going around in circles with rank, sort, and order. Any help would be appreciated.
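For what it's worth, a minimal sketch of one way to do this with rank() and sort(), assuming x1 has no ties: sort the values in decreasing order, then index that sorted vector by the ascending rank of x1, so the smallest value is paired with the largest, the second-smallest with the second-largest, and so on.

# Pair the i-th smallest value of x1 with the i-th largest
dat$x2 <- sort(dat$x1, decreasing = TRUE)[rank(dat$x1)]
dat
#      x1  x2
# Apr 100 161
# May 151 101
# Jun 109 104
# Jul  59 170
# Aug 161 100
# Sep 104 109
# Oct 170  59
# Nov 101 151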
I'm trying to run cross-validation on a random forest in Spark.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.regression import LabeledPoint

data = nds.sc.parallelize([
    LabeledPoint(0.0, [0, 402, 6, 0]),
    LabeledPoint(0.0, [3, 500, 3, 0]),
    LabeledPoint(1.0, [1, 590, 1, 1]),
    LabeledPoint(1.0, [3, 328, 5, 0]),
    LabeledPoint(1.0, [4, 351, 4, 0]),
    LabeledPoint(0.0, [2, 372, 2, 0]),
    LabeledPoint(0.0, [4, 302, 5, 0]),
    LabeledPoint(1.0, [1, 387, 2, 0]),
    LabeledPoint(1.0, [1, 419, 3, 0]),
    LabeledPoint(0.0, [1, 370, 5, 0]),
    LabeledPoint(0.0, [1, 410, 4, 0]),
    LabeledPoint(0.0, [2, 509, 7, 1]),
    LabeledPoint(0.0, [1, 307, 5, 0]),
    LabeledPoint(0.0, [0, 424, 4, 1]),
    LabeledPoint(0.0, [1, 509, 2, 1]),
    LabeledPoint(1.0, [3, 361, 4, 0]),
])
train = data.toDF(['label', 'features'])

numfolds = 2
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
evaluator = MulticlassClassificationEvaluator()
# featureSubsetStrategy is a string parameter, so the counts go in as strings
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [4, 8, 10])
             .addGrid(rf.impurity, ['entropy', 'gini'])
             .addGrid(rf.featureSubsetStrategy, ['6', '8', '10'])
             .build())
pipeline = Pipeline(stages=[rf])
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid, …
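For reference, a minimal self-contained sketch of what a complete CrossValidator run can look like. It assumes a SparkSession named spark, builds the features with pyspark.ml.linalg.Vectors rather than the mllib LabeledPoint (in Spark 2.x the ml Pipeline API expects ml vector types), and shows only four of the sample rows:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Four of the sample rows, as (label, ml Vector) pairs
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0, 402, 6, 0])),
     (0.0, Vectors.dense([3, 500, 3, 0])),
     (1.0, Vectors.dense([1, 590, 1, 1])),
     (1.0, Vectors.dense([3, 328, 5, 0]))],
    ['label', 'features'])

rf = RandomForestClassifier(labelCol='label', featuresCol='features')
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [4, 8, 10])
             .addGrid(rf.impurity, ['entropy', 'gini'])
             .build())
crossval = CrossValidator(estimator=Pipeline(stages=[rf]),
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)
cvModel = crossval.fit(train)   # best model is in cvModel.bestModel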
I have a pandas DataFrame consisting of one column of integers and another column of numpy arrays
import numpy as np
import pandas as pd

pd.DataFrame({'col_1': [1434, 3046, 3249, 3258],
              'col_2': [np.array([1434, 1451, 1467]),
                        np.array([3046, 3304]),
                        np.array([3249, 3246, 3298, 3299, 3220]),
                        np.array([3258, 3263, 3307])]})
col_1 col_2
0 1434 [1434, 1451, 1467]
1 3046 [3046, 3304]
2 3249 [3249, 3246, 3298, 3299, 3220]
3 3258 [3258, 3263, 3307]
I want to convert it to a Spark DataFrame in the following format:
from pyspark.sql.functions import col, explode

df = sc.parallelize([[1434, [1434, 1451, 1467]],
                     [3046, [3046, 3304]],
                     [3249, [3046, 3304]],
                     [3258, [3258, 3263, 3307]]]).toDF(['col_1', 'col_2'])
df.select('col_1', explode(col('col_2')).alias('col_2')).show(14)
+-----+-----+
|col_1|col_2|
+-----+-----+
| 1434| 1434|
| 1434| 1451|
| 1434| 1467|
| 3046| 3046|
| 3046| 3304|
| 3249| 3046|
| 3249| 3304|
| …
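A minimal sketch of one way to make the conversion (assuming a SparkSession named spark; the name pdf for the pandas frame is introduced here for illustration): Spark's schema inference doesn't accept raw numpy arrays, so the arrays are first turned into plain Python lists of ints, after which createDataFrame can infer an ArrayType column directly.

import numpy as np
import pandas as pd
from pyspark.sql.functions import col, explode

pdf = pd.DataFrame({'col_1': [1434, 3046, 3249, 3258],
                    'col_2': [np.array([1434, 1451, 1467]),
                              np.array([3046, 3304]),
                              np.array([3249, 3246, 3298, 3299, 3220]),
                              np.array([3258, 3263, 3307])]})

# numpy arrays -> plain Python lists of ints, so Spark can infer ArrayType
pdf['col_2'] = pdf['col_2'].apply(lambda a: [int(v) for v in a])

df = spark.createDataFrame(pdf)
df.select('col_1', explode(col('col_2')).alias('col_2')).show()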