I have an RDD of the following type:
dataset :org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionRDD[26]
Its contents look like (Pedro, 0.0833), (Hello, 0.001828), ...
I want to sum all the values, 0.0833 + 0.001828 + ..., but I can't find a suitable way to do it.
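One possible approach, offered as a sketch rather than a definitive answer: drop the keys and add up what is left. The Python below only models that logic on a plain list; the helper name sum_values and the sample pairs are made up for illustration. On the actual Spark RDD the same idea would most likely be written as dataset.values.sum() or dataset.map(_._2).sum().

```python
def sum_values(pairs):
    """Add up the second element of every (key, value) pair."""
    return sum(value for _key, value in pairs)


# Hypothetical sample standing in for the contents of the RDD.
sample = [("Pedro", 0.0833), ("Hello", 0.001828)]
print(sum_values(sample))
```

With PySpark the equivalent call on an RDD of pairs would be rdd.values().sum(), assuming rdd holds the same (String, Double) tuples.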
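
I'm trying to run some code that I found in an IPython notebook (I also added a few lines, such as interactive(True)). My problem is that when I use "Run Module" in IDLE, execution reaches the data.plot calls, something loads, and then nothing happens. data.plot doesn't seem to work.
Thanks if you have any ideas.
Note: without interactive(True), a "Runtime Error" dialog box appears.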
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import interactive
interactive(True)
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
print(data.head())
# print the shape of the DataFrame
print(data.shape)
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
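A likely cause, stated as an assumption rather than a confirmed diagnosis: outside a notebook nothing keeps the figure window open, so the script ends before the plot is visible. A minimal sketch of one way around that is to leave interactive mode off and finish with a blocking plt.show(). The function name plot_advertising and its show parameter are hypothetical, introduced only for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_advertising(csv_source, show=True):
    """Scatter Sales against each advertising channel, one panel per channel."""
    data = pd.read_csv(csv_source, index_col=0)
    fig, axs = plt.subplots(1, 3, sharey=True, figsize=(16, 8))
    for ax, column in zip(axs, ["TV", "Radio", "Newspaper"]):
        data.plot(kind="scatter", x=column, y="Sales", ax=ax)
    if show:
        # Blocks until the window is closed, so the figure stays on
        # screen when the script is run from IDLE instead of a notebook.
        plt.show()
    return fig, axs
```

Calling plot_advertising with the CSV URL from the script above should then open a single window containing the three scatter plots.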

I'm trying to read a file that looks like this:
you 0.0432052044116
i 0.0391075831328
the 0.0328010698268
to 0.0237549924919
a 0.0209682886489
it 0.0198104294359
I want to store it in an RDD of (key, value) pairs (for example, (you, 0.0432)). So far I have only written this code:
import java.io.{FileNotFoundException, IOException}
import scala.io.Source

val filename = "freq2.txt"
try {
  // Each line holds a word and its frequency, separated by a single space.
  for (line <- Source.fromFile(filename).getLines()) {
    val tuple = line.split(" ")
    val key = tuple(0)
    val words = tuple(1)
    println(s"${key}")
    println(s"${words}")
  }
} catch {
  case ex: FileNotFoundException => println("Couldn't find that file.")
  case ex: IOException => println("Had an IOException trying to read that file")
}
But I don't know how to store the data...
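One way to think about it, as a sketch under assumptions: each line becomes one (key, value) pair, so the whole job is a single map over the lines of the file. The Python below shows only the per-line parsing on an in-memory sample; parse_line is a hypothetical helper name. In Spark the lines would come from sc.textFile("freq2.txt") instead of a list, and the Scala version of the same map would split each line and convert the second field with toDouble, giving an RDD[(String, Double)].

```python
def parse_line(line):
    """Turn a 'word frequency' line into a (str, float) pair."""
    key, value = line.split()
    return key, float(value)


# Two lines copied from the sample file, standing in for the real input.
sample = ["you 0.0432052044116", "i 0.0391075831328"]
pairs = [parse_line(line) for line in sample]
print(pairs)
```

With PySpark, and assuming an existing SparkContext named sc, the same function could be passed directly: sc.textFile("freq2.txt").map(parse_line).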