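I have a long line of code that I want to break up across multiple lines. What do I use, and what is the syntax?
For example, adding a bunch of strings: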
e = 'a' + 'b' + 'c' + 'd'
and have it on two lines like this:
e = 'a' + 'b' +
'c' + 'd'
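As written, the two-line version is a syntax error, because a line cannot end on a dangling `+`. A minimal sketch of the two usual ways to continue a statement in Python follows: implicit continuation inside parentheses, and an explicit trailing backslash. The names `e_paren` and `e_backslash` are illustrative only.

```python
# Implicit continuation: an expression inside (), [] or {} may span lines.
e_paren = ('a' + 'b' +
           'c' + 'd')

# Explicit continuation: a backslash as the very last character of the line.
e_backslash = 'a' + 'b' + \
              'c' + 'd'

print(e_paren)
print(e_backslash)
```

The parenthesised form is usually the easier one to maintain, since a stray space after a backslash silently breaks the continuation.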
I am using
data=urllib2.urlopen(url).read()
I want to know:
How can I tell whether the data at a URL is gzip-compressed?
If the data is compressed, will urllib2 decompress it automatically? Will the data always be a string?
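As far as I know, urllib2.urlopen does not decompress anything on its own: read() returns the raw body (a str in Python 2, bytes in Python 3). A rough sketch of one way to decide whether a body is gzip-compressed is to look at the Content-Encoding response header and at the gzip magic bytes. It is written for Python 3, and the helper name `maybe_gunzip` and the sample payload are hypothetical.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def maybe_gunzip(body, content_encoding=None):
    """Return the decompressed body if it looks gzip-compressed, else body unchanged.

    content_encoding is the value of the Content-Encoding response header,
    e.g. response.headers.get('Content-Encoding'); it may be None.
    """
    declared = (content_encoding or '').strip().lower() == 'gzip'
    if declared or body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


# Stand-in for response.read(); no network access is needed for the demo.
sample = gzip.compress(b'hello world')
print(maybe_gunzip(sample, 'gzip'))
print(maybe_gunzip(b'plain text'))
```

The magic-byte check is only a heuristic: a file that is itself a .gz archive also starts with these bytes, so whether to decompress it depends on what the caller wants to store.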
I want to download a file using urllib and decompress it in memory before saving.
This is what I have right now:
response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
outfile = open(outFilePath, 'w')
outfile.write(decompressedFile.read())
This ends up writing empty files. How can I achieve what I'm after?
Updated answer:
#! /usr/bin/env python2
import urllib2
import StringIO
import gzip
baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
# check filename: it may change over time, due to new updates
filename = "man-pages-5.00.tar.gz"
outFilePath = filename[:-3]
response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile)
with open(outFilePath, 'wb') as outfile:  # binary mode: the payload is a tar archive
    outfile.write(decompressedFile.read())
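The likely reason the question's version wrote an empty file is that write() leaves the buffer position at the end, so GzipFile has nothing left to read. Passing the data to the buffer's constructor, or calling seek(0) before reading, avoids that. A small sketch reproducing the effect in Python 3 with io.BytesIO follows. It uses an in-memory payload instead of a real download, and the names `payload` and `buf` are illustrative.

```python
import gzip
import io

payload = gzip.compress(b'some file contents')  # stand-in for response.read()

# Variant 1: write() leaves the position at the end of the buffer.
buf = io.BytesIO()
buf.write(payload)
at_end = buf.read()          # nothing left to read from here
buf.seek(0)                  # rewind before handing the buffer to GzipFile
rewound = gzip.GzipFile(fileobj=buf, mode='rb').read()

# Variant 2: a buffer initialised with the data starts at position 0.
direct = gzip.GzipFile(fileobj=io.BytesIO(payload)).read()

print(at_end, rewound, direct)
```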
I am looking for a way in Python (2.7) to make HTTP requests with 3 requirements:
I have checked all the Python HTTP libraries, but none of them meets my requirements. For example:
urllib2: good, but no pooling
import urllib2
import json
r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100+1)
if len(content) > 100:
    print 'too large'
    r.close()
else:
    print json.loads(content)
r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100000+1)
if len(content) > 100000:
    print 'too large'
    r.close()
else:
    print json.loads(content)
requests: no max size
import requests
r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
r.headers['content-length'] # does not exist for this request, and not safe
content = r.raw.read(100000+1)
print content # ARF this is gzipped, so not the real …
I have a tar.gz file named abc.aXML.gz on my local machine that contains many XML files. I want to extract some data from these files, but I don't know how to parse them using ElementTree and gzip.
import xml.etree.ElementTree as ET
import gzip
document = ET.parse(gzip("abc.aXML.gz"))
root = document.getroot()
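Since the archive is described as a tar.gz holding many XML files, one possible approach is to open it with tarfile in gzip mode and hand each member to ET.parse as a file-like object. This is only a sketch, assuming Python 3. The helper `iter_xml_roots` and the member names are hypothetical, and the demo builds a tiny archive in a temporary directory rather than reading abc.aXML.gz.

```python
import io
import os
import tarfile
import tempfile
import xml.etree.ElementTree as ET


def iter_xml_roots(archive_path):
    """Yield (member name, root element) for every regular file in a tar.gz."""
    with tarfile.open(archive_path, mode='r:gz') as archive:
        for member in archive:
            if not member.isfile():
                continue
            fileobj = archive.extractfile(member)  # file-like object, nothing is written to disk
            yield member.name, ET.parse(fileobj).getroot()


# Demo: build a small archive with two XML members (hypothetical names).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.tar.gz')
with tarfile.open(demo_path, mode='w:gz') as archive:
    for name, text in [('a.xml', b'<root><item id="1"/></root>'),
                       ('b.xml', b'<root><item id="2"/></root>')]:
        info = tarfile.TarInfo(name)
        info.size = len(text)
        archive.addfile(info, io.BytesIO(text))

results = {name: root.find('item').get('id') for name, root in iter_xml_roots(demo_path)}
print(results)
```

If the file turns out to be a single gzip-compressed XML document rather than a tar archive, ET.parse(gzip.open("abc.aXML.gz")) would be the analogous call, since gzip.open returns a file object and gzip itself is a module, not a callable.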
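I am trying to parse huge XML files ranging from 20 MB to 3 GB. The files are samples coming from different instruments. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).
Here is a small part of my sample file. The namespace is present in all files. An interesting feature of the files is that they carry more node attributes than text.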
<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
      <analyzer order="3">
        <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
      </analyzer>
      <detector order="4">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
A small but complete file is here.
So what I have done so far is use findall for every element of interest.
import xml.etree.ElementTree as ET
tree = ET.parse('plgs_example.mzML')
root = tree.getroot()
NS = "{http://psi.hupo.org/ms/mzml}"
s = tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')
for ins in range(len(s)):
    insattrib = s[ins].attrib
    # It will print out all the id attribute …
Hey, I am trying to port this small snippet from Python 2 to Python 3.
Python 2:
def _download_database(self, url):
    try:
        with closing(urllib.urlopen(url)) as u:
            return StringIO(u.read())
    except IOError:
        self.__show_exception(sys.exc_info())
        return None
Python 3:
def _download_database(self, url):
    try:
        with closing(urllib.request.urlopen(url)) as u:
            response = u.read().decode('utf-8')
            return StringIO(response)
    except IOError:
        self.__show_exception(sys.exc_info())
        return None
But I still get
utf-8 codec can't decode byte 0x8f in position 12: invalid start byte
I need to use StringIO because it is a zipfile and I want to parse it with this function:
def _parse_zip(self, raw_zip):
    try:
        zip = zipfile.ZipFile(raw_zip)
        filelist = map(lambda x: x.filename, zip.filelist)
        db_file = 'IpToCountry.csv' if 'IpToCountry.csv' in filelist else filelist[0]
        with closing(StringIO(zip.read(db_file))) as raw_database:
            return_val …
I am going to use the Wiktionary dump for POS tagging. Somehow it gets stuck while downloading. Here is my code:
import nltk
from urllib import urlopen
from collections import Counter
import gzip
url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
fStream = gzip.open(urlopen(url).read(), 'rb')
dictFile = fStream.read()
fStream.close()
text = nltk.Text(word.lower() for word in dictFile())
tokens = nltk.word_tokenize(text)
Here is the error I get:
Traceback (most recent call last):
  File "~/dir1/dir1/wikt.py", line 15, in <module>
    fStream = gzip.open(urlopen(url).read(), 'rb')
  File "/usr/lib/python2.7/gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "/usr/lib/python2.7/gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: file() argument 1 must be …