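I am pulling data out of a Google Doc, processing it, and writing it to a file (which I will eventually paste into a WordPress page).
It contains some non-ASCII symbols. How can I safely convert these into symbols that can be used in HTML source?
Currently I convert everything to Unicode, join it all together in a Python string, and then do: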
import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))
The last line raises an encoding error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 12286: ordinal not in range(128)
Partial solution:
This Python runs without an error:
row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))
But if I open the actual text file, I see lots of symbols like:
Qur’an
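A sequence starting with "â" where a curly apostrophe should be usually means that correctly written UTF-8 bytes are being displayed with a legacy single-byte encoding. The sketch below reproduces the effect, assuming the viewer decodes the file as Windows-1252 (`cp1252`). That is a guess about the viewer, and the sample string is illustrative rather than taken from the actual data.

```python
# Illustrative sample: "Qur'an" written with a right single quotation mark (U+2019).
original = u"Qur\u2019an"

# What ends up on disk when the text is written as UTF-8: three bytes for U+2019.
utf8_bytes = original.encode("utf-8")

# What a viewer shows if it wrongly assumes the bytes are Windows-1252.
misread = utf8_bytes.decode("cp1252")

# What a viewer shows if it decodes the same bytes as UTF-8.
correct = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(repr(misread))
print(correct == original)
```

If this is what is happening, the file itself is fine and only the decoding step on the reading side needs to change.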
Maybe I need to write to something other than a text file?
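One way to get text that is safe to paste into HTML source is to keep the file pure ASCII and turn every non-ASCII character into a numeric character reference. A minimal sketch using the built-in `xmlcharrefreplace` error handler of `encode` follows. The helper name `to_html_safe_ascii` and the sample strings are hypothetical, not part of the original code.

```python
def to_html_safe_ascii(text):
    """Encode a Unicode string to ASCII bytes.

    Characters outside ASCII are replaced with numeric character
    references such as &#8217;, which browsers render as the original
    character regardless of the page encoding.
    """
    return text.encode("ascii", "xmlcharrefreplace")


# Illustrative input containing a curly apostrophe and an accented letter.
all_html = u"Qur\u2019an" + u"<br/>" + u"caf\u00e9"
safe = to_html_safe_ascii(all_html)
print(safe)
```

The resulting bytes can be written to a file opened in binary mode (`'wb'`). This only handles non-ASCII characters; it does not escape `<`, `>` or `&`, which is intended here because the string already contains markup such as `<br/>`.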
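I am using scrapy to try to get some data I need from Google Scholar. Take the following link as an example: http://scholar.google.com/scholar?q=intitle%3Apython+xpath
Now, I would like to scrape all the titles from this page. The process I am following is: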
scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
This gives me the scrapy shell, in which I run:
>>> sel.xpath('//h3[@class="gs_rt"]/a').extract()
[
u'<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.4438&rep=rep1&type=pdf"><b>Python </b>Paradigms for XML</a>',
u'<a href="https://svn.eecs.jacobs-university.de/svn/eecs/archive/bsc-2009/sbhushan.pdf">NCClient: A <b>Python </b>Library for NETCONF Clients</a>',
u'<a href="http://hal.archives-ouvertes.fr/hal-00759589/">PALSE: <b>Python </b>Analysis of Large Scale (Computer) Experiments</a>',
u'<a href="http://i.iinfo.cz/r2/kd/xmlprague2007.pdf#page=53"><b>Python </b>and XML</a>',
u'<a href="http://www.loadaveragezero.com/app/drx/Programming/Languages/Python/">drx: <b>Python </b>Programming Language [Computers: Programming: Languages: <b>Python</b>]-loadaverageZero</a>',
u'<a href="http://www.worldcolleges.info/sites/default/files/py10.pdf">XML and <b>Python </b>Tutorial</a>',
u'<a href="http://dl.acm.org/citation.cfm?id=2555791">Zato\u2014agile ESB, SOA, REST and cloud integrations in <b>Python</b></a>',
u'<a href="ftp://ftp.sybex.com/4021/4021index.pdf">XML Processing with Perl, <b>Python</b>, and PHP</a>',
u'<a href="http://books.google.com/books?hl=en&lr=&id=El4TAgAAQBAJ&oi=fnd&pg=PT8&dq=python+xpath&ots=RrFv0f_Y6V&sig=tSXzPJXbDi6KYnuuXEDnZCI7rDA"><b>Python </b>& XML</a>',
u'<a href="https://code.grnet.gr/projects/ncclient/repository/revisions/efed7d4cd5ac60cbb7c1c38646a6d6dfb711acc9/raw/docs/proposal.pdf">A <b>Python </b>Module for NETCONF …
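The extracted items are whole `<a>` elements, so each title is still wrapped in markup and split by `<b>` tags. Assuming the goal is the plain title text, the sketch below strips the tags using only the Python 3 standard library (`html.parser`) instead of scrapy. The names `TextExtractor` and `title_text` are hypothetical, and the two inputs are copied from the output above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def title_text(anchor_html):
    """Return the text inside an <a> fragment with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(anchor_html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


links = [
    '<a href="http://i.iinfo.cz/r2/kd/xmlprague2007.pdf#page=53"><b>Python </b>and XML</a>',
    '<a href="http://dl.acm.org/citation.cfm?id=2555791">Zato\u2014agile ESB, SOA, REST and cloud integrations in <b>Python</b></a>',
]
titles = [title_text(link) for link in links]
print(titles)
```

Inside the scrapy shell, an XPath such as `sel.xpath('//h3[@class="gs_rt"]/a').xpath('string(.)').extract()` may give the same text directly, since `string(.)` concatenates the text of an element and its descendants.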