Plu*_*ug4 1 html python parsing
I am using the following:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
to strip the HTML tags from a text. However, for one of my files, when I do this:
fdir = open('0001005214-12-000007.txt')
text = fdir.read()
strip_tags(text)
I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "G:/Dropbox/Textual/codes/Python/Parsing/Word_Count.py", line 26, in strip_tags
    s.feed(html)
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 169, in goahead
    k = self.parse_html_declaration(i)
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 245, in parse_html_declaration
    return self.parse_marked_section(i)
  File "C:\Users\Martineau\Anaconda\lib\markupbase.py", line 160, in parse_marked_section
    self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
  File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 124, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: unknown status keyword 't\n' in marked section, at line 35210, column 58
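What does this error mean? How can I work around it?
The actual file I am trying to parse is this one.
The problem is simple, but confusing. You aren't parsing HTML. You are parsing HTML embedded in what appears to be a home-grown SGML vocabulary from the SEC. Confused? I'm not surprised. If you follow the data link, save the file, and open it, it starts like this: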
<SEC-DOCUMENT>0001005214-12-000007.txt : 20120430
<SEC-HEADER>0001005214-12-000007.hdr.sgml : 20120430
<ACCEPTANCE-DATETIME>20120430163103
ACCESSION NUMBER: 0001005214-12-000007
CONFORMED SUBMISSION TYPE: 10-K
PUBLIC DOCUMENT COUNT: 12
CONFORMED PERIOD OF REPORT: 20120131
FILED AS OF DATE: 20120430
DATE AS OF CHANGE: 20120430
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: AMERICAN WAGERING INC
CENTRAL INDEX KEY: 0001005214
STANDARD INDUSTRIAL CLASSIFICATION: SERVICES-MISCELLANEOUS AMUSEMENT & RECREATION [7990]
IRS NUMBER: 880344658
STATE OF INCORPORATION: NV
FISCAL YEAR END: 0105
FILING VALUES:
FORM TYPE: 10-K
SEC ACT: 1934 Act
SEC FILE NUMBER: 000-20685
FILM NUMBER: 12795496
BUSINESS ADDRESS:
STREET 1: 675 GRIER DR
CITY: LAS VEGAS
STATE: NV
ZIP: 89119
BUSINESS PHONE: 7027350101
MAIL ADDRESS:
STREET 1: 675 GRIER DR
CITY: LAS VEGAS
STATE: NV
ZIP: 89119
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>formtenk-01312012.htm
<DESCRIPTION>FORM 10 K 1.31.2012
<TEXT>
<html>
<head>
<title>formtenk-01312012.htm</title>
<!--Licensed to: American Wagering, Inc.-->
<!--Document Created using EDGARizer 2020 5.4.1.0-->
<!--Copyright 1995 - 2009 Thomson Reuters. All rights reserved.-->
</head>
<body bgcolor="#ffffff" style="DISPLAY: inline; FONT-FAMILY: Palatino Linotype; FONT-SIZE: 9pt">
<div>
Then, skipping a great many lines of HTML, we pick it up again at:
</div>
</body>
</html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>ZIP
<SEQUENCE>33
<FILENAME>0001005214-12-000007-xbrl.zip
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
begin 644 0001005214-12-000007-xbrl.zip
M4$L#!!0````(`/"#GD":H45DWI(``/X8"``1`!P`8F5T;2TR,#$R,#$S,2YX
M;6Q55`D``Z/VGD^C]IY/=7@+``$$)0X```0Y`0``[#UI;QLYEM\7V/_`T223
M!)!DE20?<HZ!XZ1[W)T+<;I[@<5B0%51$MMU+<FRK/WU^]XCZY!<\I&V$RDN
MH`]9Q>/=%TM\+_YY$87L7"@MD_AER^OV6DS$?A+(>/JRE>D.U[Z4K7^^^L__
M>/&W3N=G$0O%C0C8>,&^S))()S'[+#(#"[`CWQ<A3.G@X(NQ"AFL'>M#_"A?
Now we have moved from HTML to a text-encoded XBRL ZIP file (the "begin 644" line marks it as uuencoded). Then, skipping a great many of these lines, the file ends with:
MN?<,9P8'``"4-```$0`8```````!````I($][P``8F5T;2TR,#$R,#$S,2YX
M<V155`4``Z/VGD]U>`L``00E#@``!#D!``!02P4&``````8`!@`:`@``CO8`
#````
`
end
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>XML
<SEQUENCE>34
<FILENAME>FilingSummary.xml
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<XBRL>
<?xml version="1.0" encoding="utf-8"?>
<FilingSummary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Version>2.4.0.6</Version>
<ProcessingTime />
<ReportFormat>Html</ReportFormat>
<ContextCount>27</ContextCount>
<ElementCount>111</ElementCount>
<EntityCount>1</EntityCount>
<FootnotesReported>false</FootnotesReported>
<SegmentCount>5</SegmentCount>
<ScenarioCount>0</ScenarioCount>
<TuplesReported>false</TuplesReported>
<UnitCount>4</UnitCount>
<MyReports>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R1.htm</HtmlFileName>
<LongName>000100 - Document - Document and Entity Information</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/DocumentAndEntityInformation</Role>
<ShortName>Document and Entity Information</ShortName>
</Report>
<Report>
<IsDefault>true</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R2.htm</HtmlFileName>
<LongName>010000 - Statement - CONSOLIDATED BALANCE SHEETS</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedBalanceSheets</Role>
<ShortName>CONSOLIDATED BALANCE SHEETS</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R3.htm</HtmlFileName>
<LongName>010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedBalanceSheetsParenthetical</Role>
<ShortName>CONSOLIDATED BALANCE SHEETS (Parenthetical)</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R4.htm</HtmlFileName>
<LongName>020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedStatementsOfOperations</Role>
<ShortName>CONSOLIDATED STATEMENTS OF OPERATIONS</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R5.htm</HtmlFileName>
<LongName>030000 - Statement - CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedStatementsOfStockholdersEquityDeficiency</Role>
<ShortName>CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R6.htm</HtmlFileName>
<LongName>040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/ConsolidatedStatementsOfCashFlows</Role>
<ShortName>CONSOLIDATED STATEMENTS OF CASH FLOWS</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R7.htm</HtmlFileName>
<LongName>060100 - Disclosure - Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/OrganizationRisksAndUncertaintiesAndSummaryOfSignificantAccountingPolicies</Role>
<ShortName>Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R8.htm</HtmlFileName>
<LongName>060200 - Disclosure - Property and Equipment</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/PropertyAndEquipment</Role>
<ShortName>Property and Equipment</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R9.htm</HtmlFileName>
<LongName>060300 - Disclosure - Debt</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/Debt</Role>
<ShortName>Debt</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R10.htm</HtmlFileName>
<LongName>060400 - Disclosure - Series A Preferred Stock</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/SeriesPreferredStock</Role>
<ShortName>Series A Preferred Stock</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R11.htm</HtmlFileName>
<LongName>060500 - Disclosure - Stock Options and Other Equity and Related Party Transactions</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/StockOptionsAndOtherEquityAndRelatedPartyTransactions</Role>
<ShortName>Stock Options and Other Equity and Related Party Transactions</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R12.htm</HtmlFileName>
<LongName>060600 - Disclosure - Commitments and Contingencies</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/CommitmentsAndContingencies</Role>
<ShortName>Commitments and Contingencies</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R13.htm</HtmlFileName>
<LongName>060700 - Disclosure - Related Party Transactions</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/RelatedPartyTransactions</Role>
<ShortName>Related Party Transactions</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R14.htm</HtmlFileName>
<LongName>060800 - Disclosure - Income Taxes</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/IncomeTaxes</Role>
<ShortName>Income Taxes</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R15.htm</HtmlFileName>
<LongName>060900 - Disclosure - Business Segments</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/BusinessSegments</Role>
<ShortName>Business Segments</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R16.htm</HtmlFileName>
<LongName>061000 - Disclosure - Additional Supplementary Cash Flow Information</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/AdditionalSupplementaryCashFlowInformation</Role>
<ShortName>Additional Supplementary Cash Flow Information</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<HtmlFileName>R17.htm</HtmlFileName>
<LongName>061100 - Disclosure - Financial Instruments</LongName>
<ReportType>Sheet</ReportType>
<Role>http://americanwagering.com/role/FinancialInstruments</Role>
<ShortName>Financial Instruments</ShortName>
</Report>
<Report>
<IsDefault>false</IsDefault>
<HasEmbeddedReports>false</HasEmbeddedReports>
<LongName>All Reports</LongName>
<ReportType>Book</ReportType>
<ShortName>All Reports</ShortName>
</Report>
</MyReports>
<Logs>
<Log type="Info">Process Flow-Through: 010000 - Statement - CONSOLIDATED BALANCE SHEETS</Log>
<Log type="Info"> Process Flow-Through: Removing column 'Jan. 31, 2010'</Log>
<Log type="Info">Process Flow-Through: 010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</Log>
<Log type="Info">Process Flow-Through: 020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</Log>
<Log type="Info">Process Flow-Through: 040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</Log>
</Logs>
<InputFiles>
<File>betm-20120131.xml</File>
<File>betm-20120131.xsd</File>
<File>betm-20120131_cal.xml</File>
<File>betm-20120131_def.xml</File>
<File>betm-20120131_lab.xml</File>
<File>betm-20120131_pre.xml</File>
</InputFiles>
<SupplementalFiles />
<BaseTaxonomies />
<HasPresentationLinkbase>true</HasPresentationLinkbase>
<HasCalculationLinkbase>true</HasCalculationLinkbase>
</FilingSummary>
</XBRL>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
In sum, you have a multi-part document encoded in a text format, containing a header, text parts, an HTML part, XBRL files, and reports. If you want to read it with a simple HTMLParser, you first have to extract the HTML part.
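To see which parts a given submission actually contains, the <DOCUMENT> blocks can be listed before deciding what to extract. The following is a minimal sketch, not an official EDGAR parser: it assumes every part is wrapped in <DOCUMENT>...</DOCUMENT> and that the <TYPE> and <FILENAME> values run to the end of their line, as in the excerpt above. The name list_parts is made up for illustration.

```python
import re

# Assumption: each part is wrapped in <DOCUMENT>...</DOCUMENT>, as in the excerpt.
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
# Assumption: <TYPE> and <FILENAME> values extend to the end of their line.
FIELD_RE = re.compile(r"^<(TYPE|FILENAME)>(.*)$", re.MULTILINE)


def list_parts(submission_text):
    """Return a (type, filename) tuple for every <DOCUMENT> block (hypothetical helper)."""
    parts = []
    for match in DOCUMENT_RE.finditer(submission_text):
        # Only look at the lines before <TEXT>, so the payload itself is never scanned.
        header = match.group(1).split("<TEXT>", 1)[0]
        fields = {name: value.strip() for name, value in FIELD_RE.findall(header)}
        parts.append((fields.get("TYPE"), fields.get("FILENAME")))
    return parts
```

Called on the full text of the file, this should return one tuple per part, for example the 10-K HTML document, the XBRL ZIP archive, and FilingSummary.xml.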
So, how do you extract the HTML part? Try a preprocessing step like this:
import os

def html_part(filepath):
    """
    Generator returning only the HTML lines from an
    SEC Edgar SGML multi-part file.
    """
    start, stop = '<html>\n', '</html>\n'
    filepath = os.path.expanduser(filepath)
    with open(filepath) as f:
        # find start indicator, yield it
        for line in f:
            if line == start:
                yield line
                break
        # yield lines until stop indicator found, yield and stop
        for line in f:
            yield line
            if line == stop:
                # a plain return ends the generator; raising StopIteration
                # here becomes a RuntimeError on Python 3.7+ (PEP 479)
                return

origpath = '0001005214-12-000007.txt'
htmlpath = origpath.replace('.txt', '.html')
with open(htmlpath, "w") as out:
    out.write(''.join(html_part(origpath)))
Once you have extracted only the HTML lines, you can use your original code to parse the file at htmlpath, which is the true HTML portion.
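On Python 3, the same two steps (cut out the HTML part, then strip the tags) can be sketched in one short script. This is only a sketch under stated assumptions: it uses html.parser from the Python 3 standard library in place of the Python 2 HTMLParser module, the names TagStripper, extract_html, and strip_tags are illustrative, and it takes only the first <html>...</html> span found in the text.

```python
import re
from html.parser import HTMLParser

# Assumption: the HTML part is the first <html ...> ... </html> span in the file.
HTML_SPAN_RE = re.compile(r"<html\b.*?</html>", re.DOTALL | re.IGNORECASE)


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def extract_html(submission_text):
    """Return the first <html>...</html> span, or '' if there is none."""
    match = HTML_SPAN_RE.search(submission_text)
    return match.group(0) if match else ""


def strip_tags(html):
    """Return the text content of html with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()
```

For the file in the question, calling strip_tags(extract_html(text)) on the contents read from 0001005214-12-000007.txt should leave only the text of the 10-K, without the SEC header or the uuencoded archive. Depending on the platform, open() may need an explicit encoding or errors argument to read the file.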