Python and JSON: ValueError: Unterminated string starting at:

Gab*_*lin 10 json python-2.7

I've read multiple StackOverflow articles and most of the top ten Google search results. Where my problem differs is that I'm using one script in Python to create my JSON file, and a second script that runs ten minutes later cannot read that file.

Short version: I generate leads for my online business, and I'm trying to learn Python so I can do better analysis of those leads. I'm scrubbing two years' worth of leads, with the goal of keeping the useful data and discarding anything personal (email addresses, names, and so on), while also saving the 30,000+ leads into a few dozen files for easy access.

So my first script opens each individual lead file (30,000+), determines the capture date from a timestamp inside the file, and then saves the lead under the corresponding key in a dict. Once all the data has been aggregated into this dict, the text files are written out with json.dumps.

The dict's structure is:

addData['lead']['July_2013'] = { ... }

where the 'lead' key can be lead, partial, and several others, and the 'July_2013' key is obviously date-based and can be any combination of a full month name and 2013 or 2014, going back to 'February_2013'.
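For illustration, a minimal sketch of the full nesting (the timestamps and field values below are hypothetical; real leads carry more fields) looks like this:

import time

addData = {
    'lead': {
        # month key -> {timestamp -> lead data}
        'July_2013': {
            1373241600.0: {'zip': '90210', 'source': 'web_form'},
        },
    },
    'partial': {
        'February_2013': {},
    },
}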

The full error is this:

ValueError: Unterminated string starting at: line 1 column 9997847 (char 9997846)

But I manually looked through the file, and my IDE says there are only 76,655 characters in it. So how did it get to 9,997,846?
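For reference, a quick sanity check along these lines (with the failing file's path swapped in; the name below is hypothetical) would compare the size on disk against what a plain read() actually returns:

import os

fname = 'July_2014.cd.lead.agg'  # hypothetical: the failing file
print "Bytes on disk: ", os.path.getsize(fname)
print "Chars read:    ", len(open(fname).read())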

The file that fails is the 8th one to be read; the 7 before it, and all the files after, read in just fine via json.loads.

Python says there's an unterminated string, so I looked at the end of the JSON in the failing file, and it appears fine. I've seen some mention of newlines inside the JSON, but this string is all one line. I've seen mention of \ vs \\, but a quick look over the entire file shows no \ at all. Other files do have \\ and they read in fine. And again, all of these files were created by json.dumps.

I can't post the file because it still contains personal information, and manually trying to validate the JSON of a 76,000-character file isn't really feasible.

Any thoughts on how to debug this would be appreciated. In the meantime I'm going to try rebuilding the file and see whether this was just a one-off error, but that will take a while.

  • Python 2.7 via Spyder with Anaconda
  • Windows 7 Professional

--- EDIT --- Per request, here is the write code:

from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s

import os
import json
import copy
from datetime import datetime
import time


global leadDir
global archiveDir
global aggLeads


def aggregate_individual_lead_files():
    """

    """

    # Get the aggLeads global
    global aggLeads

    # Get all the Files with a 'lead' extension & aggregate them
    exts = [
        'lead',
        'partial',
        'inp',
        'err',
        'nobuyer',
        'prospect',
        'sent'
    ]

    for srchExt in exts:
        agg = {}
        leads = f.recursiveGlob(leadDir, '*.cd.' + srchExt)
        print "There are {} {} files to process".format(len(leads), srchExt)

        for lead in leads:
            # Get the Base Filename
            fname = f.basename(lead)
            #uniqID = st.fetchBefore('.', fname)

            #print "File: ", lead

            # Get Lead Data
            leadData = json.loads(f.file_get_contents(lead))

            agg = agg_data(leadData, agg, fname)

        aggLeads[srchExt] = copy.deepcopy(agg)

        print "Aggregate Top Lvl Keys: ", aggLeads.keys()
        print "Aggregate Next Lvl Keys: "

        for key in aggLeads:
            print "{}: ".format(key)

            for arcDate in aggLeads[key].keys():
                print "{}: {}".format(arcDate, len(aggLeads[key][arcDate]))

        # raw_input("Press Enter to continue...")


def agg_data(leadData, agg, fname=None):
    """

    """
    #print "Lead: ", leadData

    # Get the timestamp of the lead
    try:
        ts = leadData['timeStamp']
        leadData.pop('timeStamp')
    except KeyError:
        return agg

    leadDate = datetime.fromtimestamp(ts)
    arcDate = leadDate.strftime("%B_%Y")

    #print "Archive Date: ", arcDate

    try:
        agg[arcDate][ts] = leadData
    except KeyError:
        agg[arcDate] = {}
        agg[arcDate][ts] = leadData
    except TypeError:
        print "Timestamp: ", ts
        print "Lead: ", leadData
        print "Archive Date: ", arcDate
        return agg

    """
    if fname is not None:
        archive_lead(fname, arcDate)
    """

    #print "File: {} added to {}".format(fname, arcDate)

    return agg


def archive_lead(fname, arcDate):
    # Archive Path
    newArcPath = archiveDir + arcDate + '//'

    if not os.path.exists(newArcPath):
        os.makedirs(newArcPath)

    # Move the file to the archive
    os.rename(leadDir + fname, newArcPath + fname)


def reformat_old_agg_data():
    """

    """

    # Get the aggLeads global
    global aggLeads
    aggComplete = {}
    aggPartial = {}

    oldAggFiles = f.recursiveGlob(leadDir, '*.cd.agg')
    print "There are {} old aggregate files to process".format(len(oldAggFiles))

    for agg in oldAggFiles:
        tmp = json.loads(f.file_get_contents(agg))

        for uniqId in tmp:
            leadData = tmp[uniqId]

            if leadData['isPartial'] == True:
                aggPartial = agg_data(leadData, aggPartial)
            else:
                aggComplete = agg_data(leadData, aggComplete)

    arcData = dict(aggLeads['lead'].items() + aggComplete.items())
    aggLeads['lead'] = arcData

    arcData = dict(aggLeads['partial'].items() + aggPartial.items())
    aggLeads['partial'] = arcData    


def output_agg_files():
    for ext in aggLeads:
        for arcDate in aggLeads[ext]:
            arcFile = leadDir + arcDate + '.cd.' + ext + '.agg'

            if f.file_exists(arcFile):
                tmp = json.loads(f.file_get_contents(arcFile))
            else:
                tmp = {}

            arcData = dict(tmp.items() + aggLeads[ext][arcDate].items())

            f.file_put_contents(arcFile, json.dumps(arcData))


def main():
    global leadDir
    global archiveDir
    global aggLeads

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    archiveDir = leadDir + 'archive//'
    aggLeads = {}


    # Aggregate all the old individual file
    aggregate_individual_lead_files()

    # Reformat the old aggregate files
    reformat_old_agg_data()

    # Write it all out to an aggregate file
    output_agg_files()


if __name__ == "__main__":
    main()

Here is the read code:

from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s

import os
import json
import copy
from datetime import datetime
import time


global leadDir
global fields
global fieldTimes
global versions


def parse_agg_file(aggFile):
    global leadDir
    global fields
    global fieldTimes

    try:
        tmp = json.loads(f.file_get_contents(aggFile))
    except ValueError:
        print "{} failed the JSON load".format(aggFile)
        return False

    print "Opening: ", aggFile

    for ts in tmp:
        try:
            tmpTs = float(ts)
        except:
            print "Timestamp: ", ts
            continue

        leadData = tmp[ts]

        for field in leadData:
            if field not in fields:
                fields[field] = []

            fields[field].append(float(ts))


def determine_form_versions():
    global fieldTimes
    global versions

    # Determine all the fields and their start and stop times
    times = []
    for field in fields:
        minTs = min(fields[field])
        fieldTimes[field] = [minTs, max(fields[field])]
        times.append(minTs)
        print 'Min ts: {}'.format(minTs)

    times = set(sorted(times))
    print "Times: ", times
    print "Fields: ", fieldTimes

    versions = {}
    for ts in times:
        d = datetime.fromtimestamp(ts)
        ver = d.strftime("%d_%B_%Y")

        print "Version: ", ver

        versions[ver] = []
        for field in fields:
            if ts in fields[field]:
                versions[ver].append(field)


def main():
    global leadDir
    global fields
    global fieldTimes

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    fields = {}
    fieldTimes = {}

    aggFiles = f.glob(leadDir + '*.lead.agg')

    for aggFile in aggFiles:
        parse_agg_file(aggFile)

    determine_form_versions()

    print "Versions: ", versions




if __name__ == "__main__":
    main()

Gab*_*lin 15

So I figured it out... I'm posting this answer in case anyone else makes the same mistake.

First, I found a solution, although I wasn't sure at first why it works. From my original code, here is my file_get_contents function:

def file_get_contents(fname):
    if s.stripos(fname, 'http://'):
        import urllib2
        return urllib2.urlopen(fname).read(maxUrlRead)  # capped at maxUrlRead bytes
    else:
        return open(fname).read(maxFileRead)  # capped at maxFileRead bytes

And I used it like this:

tmp = json.loads(f.file_get_contents(aggFile))

This failed, over and over. However, while trying to get Python to at least hand me the JSON string so I could run it through a JSON validator, I came across mentions of json.load vs json.loads. So I tried this:

a = open('D://Server Data//eagle805//emmetrics//forms//leads\July_2014.cd.lead.agg')
b = json.load(a)

While I haven't tested this output through the rest of my code, this block actually reads in the file, decodes the JSON, and will even display the data without crashing Spyder. The variable explorer in Spyder shows that b is a dict of size 1465, which is exactly how many records it should have. The portion of text displayed from the end of the dict looks fine too. So overall I have reasonably high confidence that the data was parsed correctly.
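For anyone unclear on the difference, here is a minimal sketch (the path is hypothetical): json.load takes a file object and handles the reading itself, while json.loads takes a string you have already read, so any cap you put on read() is your problem, not the parser's:

import json

path = 'July_2014.cd.lead.agg'  # hypothetical path

with open(path) as fh:
    data = json.load(fh)   # json.load reads the file object itself

raw = open(path).read()    # a capped read() here can hand back a truncated string
data = json.loads(raw)     # json.loads parses whatever string it is given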

When I wrote the file_get_contents function, I had seen several recommendations to always supply a maximum number of bytes to read, to prevent Python from hanging on a bad return. The value of maxFileRead was 1E7. When I manually forced maxFileRead to 1E9, everything worked fine. It turns out the file was just a bit under 1.2E7 bytes, so the string produced by reading the file was not the complete string in the file, and therefore not valid JSON.
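In case it helps anyone, here is a sketch of one way to keep a chunked read for memory management without silently truncating large files (file_get_contents_fixed is a hypothetical name, not the code I actually run):

def file_get_contents_fixed(fname, chunkSize=int(1e7)):
    # Read in chunks until EOF instead of capping at a fixed byte count,
    # so large files come back complete.
    chunks = []
    with open(fname) as fh:
        while True:
            chunk = fh.read(chunkSize)
            if not chunk:
                break
            chunks.append(chunk)
    return ''.join(chunks)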

Normally I would consider this a bug, but apparently when you open and read files you need to be able to read just a chunk at a time for memory management. So I got bitten by my own short-sightedness with regard to the maxFileRead value. The error message was correct; it just sent me on a wild goose chase.

Hopefully this saves someone else some time.


Sam*_*tha 10

I ran into the same problem. It turned out that the last line of the file was incomplete, most likely because the download was stopped abruptly: I had decided I had enough data and just killed the process in the terminal.
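If your file is line-delimited JSON, a sketch along these lines (the file name is hypothetical) can flag exactly which line got cut off:

import json

with open('data.jsonl') as fh:  # hypothetical line-delimited JSON file
    for lineNo, line in enumerate(fh, 1):
        try:
            json.loads(line)
        except ValueError:
            print "Line {} is incomplete or invalid".format(lineNo)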