Hao*_*ang 8 scala apache-spark
I need a lot of random numbers, one per line. The result should look like this:
24324 24324
4234234 4234234
1310313 1310313
...
So I wrote this Spark code (sorry, I am new to Spark and Scala):
import util.Random
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object RandomIntegerWriter {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: RandomIntegerWriter <num Integers> <outDir>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("Spark RandomIntegerWriter")
    val spark = new SparkContext(conf)
    // The whole sequence is built on the driver before it is distributed.
    val distData = spark.parallelize(Seq.fill(args(0).toInt)(Random.nextInt))
    distData.saveAsTextFile(args(1))
    spark.stop()
  }
}
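Note: for now I only want to generate one number per line.
But it seems that once the number of integers gets large, the program reports an error. Any thoughts on this code?
Thanks.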
vmh*_*ker 13
In Spark 1.4, you can do this with the DataFrame API:
In [1]: from pyspark.sql.functions import rand, randn
In [2]: # Create a DataFrame with one int column and 10 rows.
In [3]: df = sqlContext.range(0, 10)
In [4]: df.show()
+--+
|id|
+--+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+--+
In [4]: # Generate two other columns using uniform distribution and normal distribution.
In [5]: df.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal")).show()
+--+-------------------+--------------------+
|id| uniform| normal|
+--+-------------------+--------------------+
| 0| 0.7224977951905031| -0.1875348803463305|
| 1| 0.2953174992603351|-0.26525647952450265|
| 2| 0.4536856090041318| -0.7195024130068081|
| 3| 0.9970412477032209| 0.5181478766595276|
| 4|0.19657711634539565| 0.7316273979766378|
| 5|0.48533720635534006| 0.07724879367590629|
| 6| 0.7369825278894753| -0.5462256961278941|
| 7| 0.5241113627472694| -0.2542275002421211|
| 8| 0.2977697066654349| -0.5752237580095868|
| 9| 0.5060159582230856| 1.0900096472044518|
+--+-------------------+--------------------+
Try:
val distData = spark.parallelize(Seq[Int](), numPartitions)
  .mapPartitions { _ =>
    (1 to recordsPerPartition).map { _ => Random.nextInt }.iterator
  }
This creates an empty collection on the driver but generates the random integers on the workers. The total number of records is numPartitions * recordsPerPartition.
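If the requested total is not an exact multiple of the partition count, the per-partition sizes have to absorb the remainder. The following is a minimal sketch of that arithmetic in plain Scala, without Spark. The names `PartitionedRandom`, `partitionSizes` and `randomInts` are hypothetical helpers made up for illustration, and seeding each partition separately is an assumption of this sketch, not something the answer above does.

```scala
import scala.util.Random

// Hypothetical helpers, not part of Spark or of the answer above.
object PartitionedRandom {
  // Split `total` records across `numPartitions` as evenly as possible.
  // The first (total % numPartitions) partitions receive one extra record.
  def partitionSizes(total: Int, numPartitions: Int): Vector[Int] = {
    require(total >= 0, "total must be non-negative")
    require(numPartitions > 0, "numPartitions must be positive")
    val base = total / numPartitions
    val extra = total % numPartitions
    Vector.tabulate(numPartitions)(i => if (i < extra) base + 1 else base)
  }

  // Lazily produce `count` random Ints from a generator with its own seed.
  // Returning an Iterator avoids holding all values in memory at once.
  def randomInts(count: Int, seed: Long): Iterator[Int] = {
    val rng = new Random(seed)
    Iterator.fill(count)(rng.nextInt())
  }
}
```

One possible way to plug this into the approach above, assuming each of the numPartitions elements lands in its own partition, would be something like `spark.parallelize(0 until numPartitions, numPartitions).flatMap(i => PartitionedRandom.randomInts(sizes(i), baseSeed + i))`, where `sizes` and `baseSeed` are again hypothetical values computed on the driver. Only the small `sizes` vector is shipped to the workers; the numbers themselves are generated there.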