I am trying to read Retrosheet event files into Spark. The event files are structured like this:
id,TEX201403310
version,2
info,visteam,PHI
info,hometeam,TEX
info,site,ARL02
info,date,2014/03/31
info,number,0
info,starttime,1:07PM
info,daynight,day
info,usedh,true
info,umphome,joycj901
info,attendance,49031
start,reveb001,"Ben Revere",0,1,8
start,rollj001,"Jimmy Rollins",0,2,6
start,utlec001,"Chase Utley",0,3,4
start,howar001,"Ryan Howard",0,4,3
start,byrdm001,"Marlon Byrd",0,5,9
id,TEX201404010
version,2
info,visteam,PHI
info,hometeam,TEX
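As you can see, the same sequence of record types starts over for each game.
I read the file into an RDD, then used a second pass with a for loop to attach a key to every line, incrementing the key each time a new game begins. This seems to work, but I would like some feedback on whether there is a cleaner way to do this using Spark methods.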
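logFile = '2014TEX.EVA'
# Read the file and pull every line back to the driver as a list
event_data = (sc
              .textFile(logFile)
              .collect())
idKey = 0
newevent_list = []
for line in event_data:
    # An 'id' record marks the start of a new game, so bump the key
    if line.startswith('id'):
        idKey += 1
    # Every line, including the 'id' line itself, gets the current game key
    newevent_list.append((idKey, line))
# Redistribute the keyed lines as an RDD of (idKey, line) pairs
event_data = sc.parallelize(newevent_list)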
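
To make clearer what the loop computes: the key is just a running count of the id records seen so far. Below is a minimal plain-Python sketch of that same logic, with no Spark involved. The names `sample_lines` and `key_by_game` are made up for illustration, and the sample is a shortened copy of the excerpt above.

```python
from itertools import accumulate

# Hypothetical sample: a shortened copy of the event-file excerpt above
sample_lines = [
    "id,TEX201403310",
    "version,2",
    "info,visteam,PHI",
    "id,TEX201404010",
    "version,2",
]

def key_by_game(lines):
    """Pair each line with a running count of the 'id' records seen so far."""
    # 1 for a line that starts a new game, 0 for every other record
    flags = [1 if line.startswith("id,") else 0 for line in lines]
    # The cumulative sum of the flags is the game key for each line
    return list(zip(accumulate(flags), lines))

keyed = key_by_game(sample_lines)
for key, line in keyed:
    print(key, line)
```

Seen this way, the key is a cumulative sum over the lines in file order, so any Spark-based replacement would need to preserve that order.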