I've read through several StackOverflow posts on this error as well as most of the top ten Google results. Where my problem diverges is that I'm using a Python script to create my JSON files, and a second script, run ten minutes later, can't read one of those files.
Short version: I generate leads for my online business, and I'm trying to learn Python so I can analyze those leads better. I'm combing through two years' worth of leads with the intent of keeping the useful data and dropping anything personal (email addresses, names, and so on), while also saving the 30,000+ leads into a few dozen files for easy access.
So my first script opens each individual lead file (30,000+ of them), determines the capture date from a timestamp in the file, and then saves the lead under the corresponding key in a dict. Once all the data has been aggregated into this dict, the text files are written out using json.dumps.
The structure of the dict is:
addData['lead']['July_2013'] = { ... }
where the 'lead' key can be lead, partial, and several others, and the 'July_2013' key is obviously a date-based key that can be any combination of a full month name and 2013 or 2014, going back to 'February_2013'.
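For illustration, a minimal sketch of how such an entry gets built (the timestamp and the 'score' field here are made up for the example; the real leads come from the individual files):

from datetime import datetime

addData = {'lead': {}}

ts = 1374247800                                           # hypothetical Unix timestamp from a lead file
arcDate = datetime.fromtimestamp(ts).strftime("%B_%Y")    # -> 'July_2013'
addData['lead'].setdefault(arcDate, {})[ts] = {'score': 7}   # 'score' is a stand-in field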
The full error is this:
ValueError: Unterminated string starting at: line 1 column 9997847 (char 9997846)
But I've looked at the file manually, and my IDE says there are only 76,655 characters in it. So how did it get to char 9,997,846?
The file that fails is the 8th one to be read; the 7 before it, and all the other files, read in just fine via json.loads.
Python says there's an unterminated string, so I looked at the end of the JSON in the failing file, and it looks fine. I've seen some mentions of newlines inside JSON strings, but this string is all on one line. I've seen mentions of \ vs. \\, but in a quick look over the whole file I didn't see any \. Other files do have \\ and they read in fine. And all of these files were created by json.dumps in the first place.
I can't post the file since it still contains personal information, and manually trying to validate the JSON in a 76,000-character file isn't really feasible.
Any thoughts on how to debug this would be appreciated. In the meantime I'm going to try rebuilding the file to see whether this was just a one-off error, but that will take a while.
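One sanity check that may help anyone with the same symptom (a sketch; the filename is a stand-in for the failing file): read the raw text yourself and compare its length against the offset in the error, since json.loads reports positions within the string it was handed, not within the file on disk:

import json

raw = open('July_2014.cd.lead.agg').read()     # stand-in for the failing file
print "characters read: {}".format(len(raw))   # compare against the (char ...) offset in the error
try:
    json.loads(raw)
except ValueError as e:
    print e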
--- Edit --- As requested, I'm posting the writing code here:
from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s

import os
import json
import copy
from datetime import datetime
import time

global leadDir
global archiveDir
global aggLeads


def aggregate_individual_lead_files():
    """
    """
    # Get the aggLead global and
    global aggLeads

    # Get all the files with a 'lead' extension & aggregate them
    exts = [
        'lead',
        'partial',
        'inp',
        'err',
        'nobuyer',
        'prospect',
        'sent'
    ]
    for srchExt in exts:
        agg = {}
        leads = f.recursiveGlob(leadDir, '*.cd.' + srchExt)
        print "There are {} {} files to process".format(len(leads), srchExt)

        for lead in leads:
            # Get the base filename
            fname = f.basename(lead)
            #uniqID = st.fetchBefore('.', fname)
            #print "File: ", lead

            # Get the lead data
            leadData = json.loads(f.file_get_contents(lead))
            agg = agg_data(leadData, agg, fname)

        aggLeads[srchExt] = copy.deepcopy(agg)

    print "Aggregate Top Lvl Keys: ", aggLeads.keys()
    print "Aggregate Next Lvl Keys: "
    for key in aggLeads:
        print "{}: ".format(key)
        for arcDate in aggLeads[key].keys():
            print "{}: {}".format(arcDate, len(aggLeads[key][arcDate]))
    # raw_input("Press Enter to continue...")


def agg_data(leadData, agg, fname=None):
    """
    """
    #print "Lead: ", leadData

    # Get the timestamp of the lead
    try:
        ts = leadData['timeStamp']
        leadData.pop('timeStamp')
    except KeyError:
        return agg

    leadDate = datetime.fromtimestamp(ts)
    arcDate = leadDate.strftime("%B_%Y")
    #print "Archive Date: ", arcDate

    try:
        agg[arcDate][ts] = leadData
    except KeyError:
        agg[arcDate] = {}
        agg[arcDate][ts] = leadData
    except TypeError:
        print "Timestamp: ", ts
        print "Lead: ", leadData
        print "Archive Date: ", arcDate
        return agg

    """
    if fname is not None:
        archive_lead(fname, arcDate)
    """
    #print "File: {} added to {}".format(fname, arcDate)
    return agg


def archive_lead(fname, arcDate):
    # Archive path
    newArcPath = archiveDir + arcDate + '//'
    if not os.path.exists(newArcPath):
        os.makedirs(newArcPath)

    # Move the file to the archive
    os.rename(leadDir + fname, newArcPath + fname)


def reformat_old_agg_data():
    """
    """
    # Get the aggLead global and
    global aggLeads
    aggComplete = {}
    aggPartial = {}

    oldAggFiles = f.recursiveGlob(leadDir, '*.cd.agg')
    print "There are {} old aggregate files to process".format(len(oldAggFiles))

    for agg in oldAggFiles:
        tmp = json.loads(f.file_get_contents(agg))
        for uniqId in tmp:
            leadData = tmp[uniqId]
            if leadData['isPartial'] == True:
                aggPartial = agg_data(leadData, aggPartial)
            else:
                aggComplete = agg_data(leadData, aggComplete)

    arcData = dict(aggLeads['lead'].items() + aggComplete.items())
    aggLeads['lead'] = arcData

    arcData = dict(aggLeads['partial'].items() + aggPartial.items())
    aggLeads['partial'] = arcData


def output_agg_files():
    for ext in aggLeads:
        for arcDate in aggLeads[ext]:
            arcFile = leadDir + arcDate + '.cd.' + ext + '.agg'
            if f.file_exists(arcFile):
                tmp = json.loads(f.file_get_contents(arcFile))
            else:
                tmp = {}

            arcData = dict(tmp.items() + aggLeads[ext][arcDate].items())
            f.file_put_contents(arcFile, json.dumps(arcData))


def main():
    global leadDir
    global archiveDir
    global aggLeads

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    archiveDir = leadDir + 'archive//'
    aggLeads = {}

    # Aggregate all the old individual files
    aggregate_individual_lead_files()

    # Reformat the old aggregate files
    reformat_old_agg_data()

    # Write it all out to an aggregate file
    output_agg_files()


if __name__ == "__main__":
    main()
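One note on the merge idiom in reformat_old_agg_data and output_agg_files: dict(a.items() + b.items()) builds a fresh dict in which b's entries win on duplicate keys, and it is Python 2 only (in Python 3, items() returns views that can't be concatenated with +). A quick illustration:

a = {'x': 1, 'y': 2}
b = {'y': 20, 'z': 30}
merged = dict(a.items() + b.items())   # Python 2 only; b wins on the duplicate key 'y'
print merged['y']                      # 20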
And here is the reading code:
from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s

import os
import json
import copy
from datetime import datetime
import time

global leadDir
global fields
global fieldTimes
global versions


def parse_agg_file(aggFile):
    global leadDir
    global fields
    global fieldTimes

    try:
        tmp = json.loads(f.file_get_contents(aggFile))
    except ValueError:
        print "{} failed the JSON load".format(aggFile)
        return False

    print "Opening: ", aggFile
    for ts in tmp:
        try:
            tmpTs = float(ts)
        except:
            print "Timestamp: ", ts
            continue

        leadData = tmp[ts]
        for field in leadData:
            if field not in fields:
                fields[field] = []
            fields[field].append(float(ts))


def determine_form_versions():
    global fieldTimes
    global versions

    # Determine all the fields and their start and stop times
    times = []
    for field in fields:
        minTs = min(fields[field])
        fieldTimes[field] = [minTs, max(fields[field])]
        times.append(minTs)
        print 'Min ts: {}'.format(minTs)

    times = set(sorted(times))
    print "Times: ", times
    print "Fields: ", fieldTimes

    versions = {}
    for ts in times:
        d = datetime.fromtimestamp(ts)
        ver = d.strftime("%d_%B_%Y")
        print "Version: ", ver
        versions[ver] = []
        for field in fields:
            if ts in fields[field]:
                versions[ver].append(field)


def main():
    global leadDir
    global fields
    global fieldTimes

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    fields = {}
    fieldTimes = {}

    aggFiles = f.glob(leadDir + '*.lead.agg')
    for aggFile in aggFiles:
        parse_agg_file(aggFile)

    determine_form_versions()
    print "Versions: ", versions


if __name__ == "__main__":
    main()
--- Answer (Gab*_*lin, 15 upvotes) ---
So I figured it out. I'm posting this answer in case anyone else makes the same mistake.
First, I found a workaround, though at first I wasn't sure why it worked. From my original code, this is my file_get_contents function:
def file_get_contents(fname):
    if s.stripos(fname, 'http://'):
        import urllib2
        return urllib2.urlopen(fname).read(maxUrlRead)
    else:
        return open(fname).read(maxFileRead)
And I called it like this:
tmp = json.loads(f.file_get_contents(aggFile))
This failed, over and over. However, while trying to get Python to at least hand me the JSON string so I could run it through a JSON validator, I came across mentions of json.load vs. json.loads. So I tried this:
a = open('D://Server Data//eagle805//emmetrics//forms//leads\July_2014.cd.lead.agg')
b = json.load(a)
Although I hadn't tested this output in my full code yet, this block actually read in the file, decoded the JSON, and would even display the data without crashing Spyder. The variable explorer in Spyder showed that b was a dict of size 1465, which is exactly how many records it should have. The text shown from the tail end of the dict all looked fine. So overall, I had fairly high confidence that the data was being parsed correctly.
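For reference, the only difference between the two calls is where the bytes come from; for a file that fits in memory they are interchangeable (a sketch, with 'data.json' as a stand-in path):

import json

with open('data.json') as fh:     # stand-in path
    a = json.load(fh)             # parses straight from the file object

b = json.loads(open('data.json').read())   # parses from a string you read yourself
assert a == b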
When I wrote the file_get_contents function, I had seen several recommendations to always supply a maximum number of bytes to read, to keep Python from hanging on a bad return. The value of maxFileRead was 1E7. When I manually forced maxFileRead to 1E9, everything worked fine. It turns out the file was just under 1.2E7 bytes. So the string that resulted from reading the file was not the complete string in the file, and therefore it was invalid JSON. (That also explains why json.load succeeded: it simply calls read() on the file object with no byte cap, so it always gets the whole file.)
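That failure mode is easy to reproduce (a minimal sketch, nothing to do with the real lead data): truncate any JSON text in the middle of a string value and json.loads raises exactly this error:

import json

full = json.dumps({'note': 'x' * 100})   # valid JSON with one long string value
truncated = full[:50]                    # simulate a capped read()
try:
    json.loads(truncated)
except ValueError as e:
    print e    # "Unterminated string starting at: line 1 column ..."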
Normally I'd consider this a bug, but obviously when you open and read a file you need to be able to read just a chunk at a time for memory management. So I got bitten by my own short-sightedness regarding the maxFileRead value. The error message was correct; it just sent me on a wild goose chase.
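If the concern is memory rather than truncation, one option (a sketch, not the code I actually shipped) is to keep the bounded read() calls but loop until EOF, so no fixed cap can silently cut the string short:

def read_whole_file(fname, chunkSize=1 << 20):
    # Read the complete file one bounded chunk at a time.
    chunks = []
    with open(fname) as fh:
        while True:
            chunk = fh.read(chunkSize)   # never more than chunkSize bytes at once
            if not chunk:                # empty string means EOF
                break
            chunks.append(chunk)
    return ''.join(chunks)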
Hopefully this saves someone else some time.