Related questions (0)

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
outfile = open(outFilePath, 'w')
outfile.write(decompressedFile.read())


This ends up writing empty files. How can I achieve what I'm after?

Updated answer:

#! /usr/bin/env python2
import urllib2
import StringIO
import gzip

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"        
# check filename: it may change over time, due to new updates
filename = "man-pages-5.00.tar.gz" 
outFilePath = filename[:-3]

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile)

# Open in binary mode: the decompressed payload is raw bytes (here, a tar archive)
with open(outFilePath, 'wb') as outfile:
    outfile.write(decompressedFile.read())

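The empty output from the first snippet is consistent with the buffer position being left at the end after write(), so GzipFile finds nothing to read. Passing the bytes to the constructor, as the updated answer does, or calling seek(0) avoids that. Here is a minimal sketch of the difference, written for Python 3 with io.BytesIO standing in for StringIO. The helper names and the sample payload are made up for illustration.

```python
import gzip
import io


def gunzip_with_rewind(payload):
    """Write the compressed bytes into a buffer, rewind, then decompress."""
    buffer = io.BytesIO()
    buffer.write(payload)
    buffer.seek(0)  # move back to the start so GzipFile sees the data
    with gzip.GzipFile(fileobj=buffer, mode='rb') as gz:
        return gz.read()


def gunzip_without_rewind(payload):
    """Same as above but without seek(0); mirrors the first snippet."""
    buffer = io.BytesIO()
    buffer.write(payload)  # position is now at the end of the buffer
    with gzip.GzipFile(fileobj=buffer, mode='rb') as gz:
        return gz.read()


# Hypothetical sample data, compressed in memory instead of downloaded
sample = gzip.compress(b"hello man-pages")
rewound = gunzip_with_rewind(sample)
not_rewound = gunzip_without_rewind(sample)
```

With the rewind the original bytes come back; without it the read yields an empty byte string, which matches the empty files described in the question.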

python gzip file urllib2 stringio

Ore*_*ail

2019-07-18

Score: 38 | Answers: 3 | Views: 40k

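HTTP request with timeout, maximum size, and connection pooling

I'm looking for a way in Python (2.7) to make HTTP requests that meets three requirements:

timeout (for reliability)
maximum content size (for security)
connection pooling (for performance)

I have checked pretty much every Python HTTP library, but none of them meets all three requirements. For instance:

urllib2: good, but no pooling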

import urllib2
import json

r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100+1)
if len(content) > 100: 
    print 'too large'
    r.close()
else:
    print json.loads(content)

r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100000+1)
if len(content) > 100000: 
    print 'too large'
    r.close()
else:
    print json.loads(content)


requests: no max size

import requests
r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
r.headers['content-length'] # does not exist for this request, and not safe
content = r.raw.read(100000+1)
print content # ARF this is gzipped, so not the real …

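The last snippet notes that the raw body is gzip-compressed, so capping the raw bytes does not bound the decompressed size. One possible way to cap the decompressed size is to decompress incrementally with zlib and stop once a budget is exceeded. The following is a sketch under assumptions, not a complete answer to the question: it is written for Python 3, it assumes the body really is gzip-encoded, and the names read_gzip_capped and BodyTooLarge are hypothetical. It does not address connection pooling.

```python
import gzip
import io
import zlib


class BodyTooLarge(Exception):
    """Raised when the decompressed body exceeds the allowed size."""


def read_gzip_capped(stream, max_bytes, chunk_size=8192):
    """Decompress a gzip stream, refusing to produce more than max_bytes."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pieces = []
    total = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        while data:
            # Allow one byte past the budget so an overflow is detectable.
            piece = decompressor.decompress(data, max_bytes + 1 - total)
            total += len(piece)
            if total > max_bytes:
                raise BodyTooLarge("body exceeds %d bytes" % max_bytes)
            pieces.append(piece)
            # Input held back because the output limit was reached.
            data = decompressor.unconsumed_tail
    tail = decompressor.flush()
    if total + len(tail) > max_bytes:
        raise BodyTooLarge("body exceeds %d bytes" % max_bytes)
    pieces.append(tail)
    return b"".join(pieces)


# Hypothetical in-memory payloads standing in for an HTTP response body
small_body = read_gzip_capped(io.BytesIO(gzip.compress(b"x" * 50)), 100)
```

The stream argument can be any object with a read(n) method, so in principle the raw response object from the snippet above could be passed in place of the io.BytesIO used here.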

python timeout connection-pooling http max-size

Aur*_*ert

2014-05-07

Score: 8 | Answers: 1 | Views: 10k

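Parsing an xml.gz file in Python

I have a tar.gz file on my local machine called abc.aXML.gz, which contains many XML files. I want to find some data in these files, but I don't know how to parse them using ElementTree and gzip.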

import xml.etree.ElementTree as ET
import gzip
document = ET.parse(gzip("abc.aXML.gz"))
root = document.getroot()

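The attempt above calls the gzip module itself, which is not callable. If the file really is a tar.gz archive holding several XML files, one option is to open it with tarfile and hand each member to ElementTree. This is a minimal Python 3 sketch under that assumption. The helper names, the member names, and the sample XML are invented for illustration, and the archive is built in memory so the example is self-contained.

```python
import io
import tarfile
import xml.etree.ElementTree as ET


def iter_xml_roots(archive):
    """Yield (member name, root element) for each regular file in a tar.gz."""
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other special entries
            yield member.name, ET.parse(tar.extractfile(member)).getroot()


def build_sample_archive():
    """Create a small tar.gz in memory with two hypothetical XML files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, xml in [("a.xml", b"<root><item>1</item></root>"),
                          ("b.xml", b"<root><item>2</item></root>")]:
            info = tarfile.TarInfo(name)
            info.size = len(xml)
            tar.addfile(info, io.BytesIO(xml))
    buf.seek(0)
    return buf


results = [(name, root.find("item").text)
           for name, root in iter_xml_roots(build_sample_archive())]
```

For a file on disk, the same function could be called with open("abc.aXML.gz", "rb"). If the file turns out to be a single gzip-compressed XML document rather than a tar archive, ET.parse(gzip.open(path)) should be enough.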

python xml gzip tar

sha*_*han

2015-10-29

Score: 8 | Answers: 1 | Views: 10k

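Efficient way to parse XML with ElementTree (1.3.0) in Python

I am trying to parse huge XML files ranging from 20 MB to 3 GB. The files are samples coming from different instruments. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).

Here is a small part of one of my sample files. The namespace is present in all files. An interesting feature of the files is that they contain more node attributes than text.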

<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">

    <instrumentConfiguration id="QTOF">
                    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
                    <componentList count="4">
                            <source order="1">
                                    <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
                            </source>
                            <analyzer order="2">
                                    <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
                            </analyzer>
                            <analyzer order="3">
                                    <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
                            </analyzer>
                            <detector order="4">
                                    <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
                            </detector>
                    </componentList>
     </instrumentConfiguration>


A small but complete file is here.

So what I have done so far is use findall for all the elements of interest.

import xml.etree.ElementTree as ET
tree=ET.parse('plgs_example.mzML')
root=tree.getroot()
NS="{http://psi.hupo.org/ms/mzml}"
s=tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')
for ins in range(len(s)):
    insattrib=s[ins].attrib
    # It will print out all the id attribute …

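ET.parse followed by findall builds the whole tree in memory, which becomes costly for multi-gigabyte files. A commonly used alternative is ET.iterparse, which hands over elements as they are completed so each one can be processed and then cleared. This is a minimal Python 3 sketch of that technique, not the approach from the question's answers. The function name is hypothetical and the sample document is a shortened version of the excerpt above.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"

SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="1">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
    </componentList>
  </instrumentConfiguration>
</mzML>"""


def iter_instrument_configs(source):
    """Yield (id, [cvParam names]) for each instrumentConfiguration element."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "instrumentConfiguration":
            names = [cv.get("name") for cv in elem.iter(NS + "cvParam")]
            yield elem.get("id"), names
            # Drop the element's children and attributes to free memory;
            # the emptied element itself stays attached to its parent.
            elem.clear()


configs = list(iter_instrument_configs(io.BytesIO(SAMPLE)))
```

Each yielded tuple could then be written to the database before the next element is read, so memory use depends on the size of one instrumentConfiguration subtree rather than on the whole file.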

python xml performance parsing lxml

thc*_*and

2014-04-02

Score: 5 | Answers: 1 | Views: 1921

Porting from Python 2 to Python 3: 'utf-8 codec can't decode byte'

Hey, I tried to port this little snippet from Python 2 to Python 3.

Python 2:

def _download_database(self, url):
  try:
    with closing(urllib.urlopen(url)) as u:
      return StringIO(u.read())
  except IOError:
    self.__show_exception(sys.exc_info())
  return None


Python 3:

def _download_database(self, url):
  try:
    with closing(urllib.request.urlopen(url)) as u:
      response = u.read().decode('utf-8')
      return StringIO(response)
  except IOError:
    self.__show_exception(sys.exc_info())
  return None


But I still get:

utf-8 codec can't decode byte 0x8f in position 12: invalid start byte


I need to use StringIO because the download is a zipfile, and I want to parse it with this function:

def _parse_zip(self, raw_zip):
  try:
     zip = zipfile.ZipFile(raw_zip)

     filelist = map(lambda x: x.filename, zip.filelist)
     db_file  = 'IpToCountry.csv' if 'IpToCountry.csv' in filelist else filelist[0]

     with closing(StringIO(zip.read(db_file))) as raw_database:
        return_val …

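A zip archive is binary data, so decoding the downloaded bytes as UTF-8 can be expected to fail. In Python 3 the usual stand-in for the Python 2 StringIO in this situation is io.BytesIO, with decoding applied only to the extracted text member. This is a minimal sketch of that idea. The function names are hypothetical, the archive is built in memory instead of downloaded, and the UTF-8 encoding of the CSV member is an assumption.

```python
import io
import zipfile


def open_zip_member(raw_bytes, preferred="IpToCountry.csv", encoding="utf-8"):
    """Return a text stream for the preferred member, or the first member."""
    # Keep the archive itself as bytes; only the member's text is decoded.
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
        names = archive.namelist()
        name = preferred if preferred in names else names[0]
        return io.StringIO(archive.read(name).decode(encoding))


def build_sample_zip():
    """Create a tiny zip in memory that stands in for the downloaded file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("IpToCountry.csv", "1.0.0.0,1.0.0.255,AU\n")
    return buf.getvalue()


text = open_zip_member(build_sample_zip()).read()
```

In the download function this would correspond to returning io.BytesIO(u.read()) without calling decode on the response.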

urllib stringio python-3.x

Fra*_*ler

2015-12-18

Score: 2 | Answers: 1 | Views: 1445

Trouble when trying to download a gzip file

I am going to use the Wiktionary dump for POS tagging. Somehow it gets stuck while downloading. Here is my code:

import nltk
from urllib import urlopen
from collections import Counter
import gzip

url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
fStream = gzip.open(urlopen(url).read(), 'rb')
dictFile = fStream.read()
fStream.close()

text = nltk.Text(word.lower() for word in dictFile())
tokens = nltk.word_tokenize(text)


Here is the error I get:

Traceback (most recent call last):
File "~/dir1/dir1/wikt.py", line 15, in <module>
fStream = gzip.open(urlopen(url).read(), 'rb')
File "/usr/lib/python2.7/gzip.py", line 34, in open
return GzipFile(filename, mode, compresslevel)
File "/usr/lib/python2.7/gzip.py", line 89, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: file() argument 1 must be …

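The traceback shows gzip.open treating its first argument as a file name, while the code passes the downloaded data. gzip.GzipFile accepts a file-like object through its fileobj parameter, so the response can be decompressed as a stream and read line by line. This is a minimal Python 3 sketch. The function name is hypothetical, io.BytesIO stands in for the HTTP response, and UTF-8 is assumed for the titles.

```python
import gzip
import io


def iter_gzip_lines(fileobj, encoding="utf-8"):
    """Yield decoded lines from a gzip-compressed file-like object."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        for raw_line in gz:
            yield raw_line.decode(encoding).rstrip("\n")


# Hypothetical sample standing in for the downloaded dump
compressed = gzip.compress(b"apple\nBanana\n")
titles = list(iter_gzip_lines(io.BytesIO(compressed)))
```

With a real download, the object returned by urlopen could be passed as fileobj, which avoids holding the whole compressed dump in memory before decompressing it.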

python gzip urllib urlopen

Author

2013-08-09

Score: 1 | Answers: 1 | Views: 1015