Python: HTML parsing error

Plu*_*ug4 1 html python parsing

I am using the following:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

I use this to strip HTML tags from text. However, for one of my files, when I run the following:

fdir = open('0001005214-12-000007.txt')
text = fdir.read()
strip_tags(text)

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "G:/Dropbox/Textual/codes/Python/Parsing/Word_Count.py", line 26, in strip_tags
    s.feed(html)
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 169, in goahead
    k = self.parse_html_declaration(i)
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 245, in parse_html_declaration
    return self.parse_marked_section(i)
  File "C:\Users\Martineau\Anaconda\lib\markupbase.py", line 160, in parse_marked_section
    self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 124, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: unknown status keyword 't\n' in marked section, at line 35210, column 58

What does this error mean, and how can I get around it?

The actual file I want to parse is this one.

Jon*_*ice 5

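The problem is simple, but confusing. You are not parsing HTML. You are parsing HTML embedded in what appears to be a home-grown SEC vocabulary of SGML. Confused? I'm not surprised. If you follow the data link, save the file, and open it, you see something like this: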

    <SEC-DOCUMENT>0001005214-12-000007.txt : 20120430
    <SEC-HEADER>0001005214-12-000007.hdr.sgml : 20120430
    <ACCEPTANCE-DATETIME>20120430163103
    ACCESSION NUMBER:       0001005214-12-000007
    CONFORMED SUBMISSION TYPE:  10-K
    PUBLIC DOCUMENT COUNT:      12
    CONFORMED PERIOD OF REPORT: 20120131
    FILED AS OF DATE:       20120430
    DATE AS OF CHANGE:      20120430

    FILER:

        COMPANY DATA:   
            COMPANY CONFORMED NAME:         AMERICAN WAGERING INC
            CENTRAL INDEX KEY:          0001005214
            STANDARD INDUSTRIAL CLASSIFICATION: SERVICES-MISCELLANEOUS AMUSEMENT & RECREATION [7990]
            IRS NUMBER:             880344658
            STATE OF INCORPORATION:         NV
            FISCAL YEAR END:            0105

        FILING VALUES:
            FORM TYPE:      10-K
            SEC ACT:        1934 Act
            SEC FILE NUMBER:    000-20685
            FILM NUMBER:        12795496

        BUSINESS ADDRESS:   
            STREET 1:       675 GRIER DR
            CITY:           LAS VEGAS
            STATE:          NV
            ZIP:            89119
            BUSINESS PHONE:     7027350101

        MAIL ADDRESS:   
            STREET 1:       675 GRIER DR
            CITY:           LAS VEGAS
            STATE:          NV
            ZIP:            89119
    </SEC-HEADER>
    <DOCUMENT>
    <TYPE>10-K
    <SEQUENCE>1
    <FILENAME>formtenk-01312012.htm
    <DESCRIPTION>FORM 10 K 1.31.2012
    <TEXT>
    <html>
    <head>
        <title>formtenk-01312012.htm</title>
        <!--Licensed to: American Wagering, Inc.-->
        <!--Document Created using EDGARizer 2020 5.4.1.0-->
        <!--Copyright 1995 - 2009 Thomson Reuters. All rights reserved.-->
    </head>
    <body bgcolor="#ffffff" style="DISPLAY: inline; FONT-FAMILY: Palatino Linotype; FONT-SIZE: 9pt">
    <div>

Then, skipping a great many HTML lines, we pick it back up at:

    </div>
  </body>
</html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>ZIP
<SEQUENCE>33
<FILENAME>0001005214-12-000007-xbrl.zip
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
begin 644 0001005214-12-000007-xbrl.zip
M4$L#!!0````(`/"#GD":H45DWI(``/X8"``1`!P`8F5T;2TR,#$R,#$S,2YX
M;6Q55`D``Z/VGD^C]IY/=7@+``$$)0X```0Y`0``[#UI;QLYEM\7V/_`T223
M!)!DE20?<HZ!XZ1[W)T+<;I[@<5B0%51$MMU+<FRK/WU^]XCZY!<\I&V$RDN
MH`]9Q>/=%TM\+_YY$87L7"@MD_AER^OV6DS$?A+(>/JRE>D.U[Z4K7^^^L__
M>/&W3N=G$0O%C0C8>,&^S))()S'[+#(#"[`CWQ<A3.G@X(NQ"AFL'>M#_"A?

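Now we have gone from HTML to a text-encoded archive of XBRL files (the "begin 644 ... .zip" line suggests a uuencoded ZIP). Then, skipping a great many of those lines, the file ends with: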

    MN?<,9P8'``"4-```$0`8```````!````I($][P``8F5T;2TR,#$R,#$S,2YX
    M<V155`4``Z/VGD]U>`L``00E#@``!#D!``!02P4&``````8`!@`:`@``CO8`
    #````
    `
    end

    </TEXT>
    </DOCUMENT>
    <DOCUMENT>
    <TYPE>XML
    <SEQUENCE>34
    <FILENAME>FilingSummary.xml
    <DESCRIPTION>IDEA: XBRL DOCUMENT
    <TEXT>
    <XBRL>
    <?xml version="1.0" encoding="utf-8"?>
    <FilingSummary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <Version>2.4.0.6</Version>
      <ProcessingTime />
      <ReportFormat>Html</ReportFormat>
      <ContextCount>27</ContextCount>
      <ElementCount>111</ElementCount>
      <EntityCount>1</EntityCount>
      <FootnotesReported>false</FootnotesReported>
      <SegmentCount>5</SegmentCount>
      <ScenarioCount>0</ScenarioCount>
      <TuplesReported>false</TuplesReported>
      <UnitCount>4</UnitCount>
      <MyReports>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R1.htm</HtmlFileName>
          <LongName>000100 - Document - Document and Entity Information</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/DocumentAndEntityInformation</Role>
          <ShortName>Document and Entity Information</ShortName>
        </Report>
        <Report>
          <IsDefault>true</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R2.htm</HtmlFileName>
          <LongName>010000 - Statement - CONSOLIDATED BALANCE SHEETS</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedBalanceSheets</Role>
          <ShortName>CONSOLIDATED BALANCE SHEETS</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R3.htm</HtmlFileName>
          <LongName>010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedBalanceSheetsParenthetical</Role>
          <ShortName>CONSOLIDATED BALANCE SHEETS (Parenthetical)</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R4.htm</HtmlFileName>
          <LongName>020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedStatementsOfOperations</Role>
          <ShortName>CONSOLIDATED STATEMENTS OF OPERATIONS</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R5.htm</HtmlFileName>
          <LongName>030000 - Statement - CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedStatementsOfStockholdersEquityDeficiency</Role>
          <ShortName>CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R6.htm</HtmlFileName>
          <LongName>040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/ConsolidatedStatementsOfCashFlows</Role>
          <ShortName>CONSOLIDATED STATEMENTS OF CASH FLOWS</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R7.htm</HtmlFileName>
          <LongName>060100 - Disclosure - Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/OrganizationRisksAndUncertaintiesAndSummaryOfSignificantAccountingPolicies</Role>
          <ShortName>Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R8.htm</HtmlFileName>
          <LongName>060200 - Disclosure - Property and Equipment</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/PropertyAndEquipment</Role>
          <ShortName>Property and Equipment</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R9.htm</HtmlFileName>
          <LongName>060300 - Disclosure - Debt</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/Debt</Role>
          <ShortName>Debt</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R10.htm</HtmlFileName>
          <LongName>060400 - Disclosure - Series A Preferred Stock</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/SeriesPreferredStock</Role>
          <ShortName>Series A Preferred Stock</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R11.htm</HtmlFileName>
          <LongName>060500 - Disclosure - Stock Options and Other Equity and Related Party Transactions</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/StockOptionsAndOtherEquityAndRelatedPartyTransactions</Role>
          <ShortName>Stock Options and Other Equity and Related Party Transactions</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R12.htm</HtmlFileName>
          <LongName>060600 - Disclosure - Commitments and Contingencies</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/CommitmentsAndContingencies</Role>
          <ShortName>Commitments and Contingencies</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R13.htm</HtmlFileName>
          <LongName>060700 - Disclosure - Related Party Transactions</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/RelatedPartyTransactions</Role>
          <ShortName>Related Party Transactions</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R14.htm</HtmlFileName>
          <LongName>060800 - Disclosure - Income Taxes</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/IncomeTaxes</Role>
          <ShortName>Income Taxes</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R15.htm</HtmlFileName>
          <LongName>060900 - Disclosure - Business Segments</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/BusinessSegments</Role>
          <ShortName>Business Segments</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R16.htm</HtmlFileName>
          <LongName>061000 - Disclosure - Additional Supplementary Cash Flow Information</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/AdditionalSupplementaryCashFlowInformation</Role>
          <ShortName>Additional Supplementary Cash Flow Information</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <HtmlFileName>R17.htm</HtmlFileName>
          <LongName>061100 - Disclosure - Financial Instruments</LongName>
          <ReportType>Sheet</ReportType>
          <Role>http://americanwagering.com/role/FinancialInstruments</Role>
          <ShortName>Financial Instruments</ShortName>
        </Report>
        <Report>
          <IsDefault>false</IsDefault>
          <HasEmbeddedReports>false</HasEmbeddedReports>
          <LongName>All Reports</LongName>
          <ReportType>Book</ReportType>
          <ShortName>All Reports</ShortName>
        </Report>
      </MyReports>
      <Logs>
        <Log type="Info">Process Flow-Through: 010000 - Statement - CONSOLIDATED BALANCE SHEETS</Log>
        <Log type="Info">   Process Flow-Through: Removing column 'Jan. 31, 2010'</Log>
        <Log type="Info">Process Flow-Through: 010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</Log>
        <Log type="Info">Process Flow-Through: 020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</Log>
        <Log type="Info">Process Flow-Through: 040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</Log>
      </Logs>
      <InputFiles>
        <File>betm-20120131.xml</File>
        <File>betm-20120131.xsd</File>
        <File>betm-20120131_cal.xml</File>
        <File>betm-20120131_def.xml</File>
        <File>betm-20120131_lab.xml</File>
        <File>betm-20120131_pre.xml</File>
      </InputFiles>
      <SupplementalFiles />
      <BaseTaxonomies />
      <HasPresentationLinkbase>true</HasPresentationLinkbase>
      <HasCalculationLinkbase>true</HasCalculationLinkbase>
    </FilingSummary>
    </XBRL>
    </TEXT>
    </DOCUMENT>
    </SEC-DOCUMENT>

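In summary, you have a multi-part document encoded as text, containing a header, text sections, HTML sections, XBRL files, and reports. If you want to read it with a plain HTMLParser, you must first extract the HTML part. The parser fails because it is being fed the non-HTML parts as well, and it misreads some of that content as an SGML marked section.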
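To see the multi-part structure described above before deciding what to extract, a minimal sketch like the following can list the parts. It assumes each part is wrapped in <DOCUMENT>...</DOCUMENT> and carries unclosed <TYPE> and <FILENAME> tags before <TEXT>, as in the excerpts shown; the function name list_parts and the sample string are made up for illustration, and real filings may deviate from this layout.

```python
import re

# Assumption: every part is wrapped in <DOCUMENT>...</DOCUMENT>.
DOCUMENT_RE = re.compile(r'<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL)
# Assumption: <TYPE> and <FILENAME> are unclosed tags whose value runs
# to the end of the line.
FIELD_RE = re.compile(r'^\s*<(TYPE|FILENAME)>(.*)$', re.MULTILINE)

def list_parts(text):
    """Return a (type, filename) tuple for each <DOCUMENT> block in text."""
    parts = []
    for block in DOCUMENT_RE.findall(text):
        # Only look at the lines before <TEXT>, so that the payload
        # itself can never be mistaken for a header field.
        header = block.split('<TEXT>', 1)[0]
        fields = dict(FIELD_RE.findall(header))
        parts.append((fields.get('TYPE', '').strip(),
                      fields.get('FILENAME', '').strip()))
    return parts

# Tiny hypothetical sample with the same shape as the real file.
sample = """<SEC-DOCUMENT>demo.txt
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>formtenk-01312012.htm
<TEXT>
<html><body><p>Hello</p></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>XML
<SEQUENCE>34
<FILENAME>FilingSummary.xml
<TEXT>
<XBRL>...</XBRL>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
"""

print(list_parts(sample))
# [('10-K', 'formtenk-01312012.htm'), ('XML', 'FilingSummary.xml')]
```

On the real file you would pass the result of reading 0001005214-12-000007.txt; only the part whose filename ends in .htm is something HTMLParser should see.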

So, how do you extract the HTML part? Try a preprocessing step like this:

import os

def html_part(filepath):
    """
    Generator returning only the HTML lines from an
    SEC Edgar SGML multi-part file.
    """
    start, stop = '<html>\n', '</html>\n'
    filepath = os.path.expanduser(filepath)
    with open(filepath) as f:
        # find start indicator, yield it
        for line in f:
            if line == start:
                yield line
                break
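        # yield lines until stop indicator found, yield it and stop
        for line in f:
            yield line
            if line == stop:
                # end the generator with return; raising StopIteration
                # inside a generator is an error on Python 3.7+ (PEP 479)
                return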


origpath = '0001005214-12-000007.txt'
htmlpath = origpath.replace('.txt', '.html')

with open(htmlpath, "w") as out:
    out.write(''.join(html_part(origpath)))

Once you have extracted only the HTML lines, you can use your original code to parse the file at htmlpath, which contains just the genuine HTML part.
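The code in this thread targets Python 2. For readers on Python 3, where the module lives at html.parser, the two steps might be combined roughly as follows. This is a sketch under assumptions, not the original author's code: html_lines is a hypothetical helper that matches the <html> and </html> marker lines after stripping whitespace and lowercasing, which is looser than the exact comparison used above, and the sample string is invented.

```python
from html.parser import HTMLParser  # Python 3 location of the module

class MLStripper(HTMLParser):
    """Collect only the text content that appears between tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def html_lines(lines):
    """Yield lines from the first <html> line through the first </html> line."""
    inside = False
    for line in lines:
        marker = line.strip().lower()
        if not inside and marker == '<html>':
            inside = True
        if inside:
            yield line
            if marker == '</html>':
                return  # stop before any non-HTML parts are reached

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()

# Hypothetical miniature filing: one HTML part, one encoded part.
sample = (
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>\n"
    "<html>\n<body><p>Net revenue rose.</p></body>\n</html>\n"
    "</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>ZIP\n<TEXT>\n"
    "begin 644 demo.zip\nM4$L#!!0<![junk\nend\n"
    "</TEXT>\n</DOCUMENT>\n"
)

text = strip_tags(''.join(html_lines(sample.splitlines(keepends=True))))
print(text.strip())  # Net revenue rose.
```

Because the generator stops at the closing </html> line, the encoded section never reaches the parser. For a file on disk, an open file object can be passed to html_lines in place of the list of lines.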