Showing progress while downloading a large CSV file from the Internet with Python

wat*_*wer 5 python csv python-3.x python-requests

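I am reading McKinney's data analysis book, and he has shared a 150 MB file. Although this topic has been discussed extensively in "Progress Bar while download file over http with Requests", I found that the code in the accepted answer raises an error. I am a beginner, so I have not been able to fix it.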

I want to download the following file:

https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/fec/P00000001-ALL.csv

Here is the code without a progress bar:

DATA_PATH='./Data'
filename = "P00000001-ALL.csv"
url_without_filename = "https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/fec"

url_with_filename = url_without_filename + "/" + filename
local_filename = DATA_PATH + '/' + filename

#Write the file on local disk
r = requests.get(url_with_filename)  #without streaming
with open(local_filename, 'w', encoding=r.encoding) as f:
    f.write(r.text)

This works fine, but because there is no progress bar, I cannot tell what is happening while it runs.

Here is the code adapted from "Progress Bar while download file over http with Requests" and "How to download large file in python with requests.py?":

#Option 2:
#Write the file on local disk
r = requests.get(url_with_filename, stream=True)  # added stream parameter
total_size = int(r.headers.get('content-length', 0))

with open(local_filename, 'w', encoding=r.encoding) as f:
    #f.write(r.text)
    for chunk in tqdm(r.iter_content(1024), total=total_size, unit='B', unit_scale=True):
        if chunk:
            f.write(chunk)

There are two problems with the second option (i.e., using streaming and the tqdm package):

a) The file size is not computed correctly. The actual size is 157 MB, but total_size comes out as 25 MB.

b) A bigger problem than a) is that I get the following error:

  0%|          | 0.00/24.6M [00:00<?, ?B/s]
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3265, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-31-abbe9270092b>", line 6, in <module>
    f.write(data)
TypeError: write() argument must be str, not bytes

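As a beginner, I am not sure how to fix these two problems. I spent a lot of time going through the tqdm Git page, but I could not follow it. I would appreciate any help.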


I assume readers know that we need to import requests and tqdm, so I have not included the code that imports these basic packages.


Here is the code for those who are curious:

import sys  # needed for the text progress bar below

r = requests.get(url_with_filename, stream=True)  # added stream parameter
# total_size = int(r.headers.get('content-length', 0))  # reported only ~25 MB
# Note: len(r.content) reads the whole body into memory before the loop runs,
# so the bar below tracks writing to disk rather than the network transfer.
total_size = len(r.content)
downloaded = 0
chunk_size = 1024
with open(local_filename, 'wb') as f:  # binary mode, because chunks are bytes
    for chunk in r.iter_content(chunk_size=chunk_size):
        downloaded += len(chunk)
        f.write(chunk)
        done = int(50 * downloaded / total_size)
        sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50 - done)))
        sys.stdout.flush()
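One drawback of taking total_size from len(r.content) is that requests has to read the entire body into memory before the loop starts. Below is a minimal sketch of a streaming variant, not a definitive implementation. The names render_bar, write_with_progress and download are hypothetical helpers introduced only for illustration. It also rests on an unconfirmed assumption: that the 24.6 MB Content-Length describes a compressed body that requests decompresses on the fly, which would explain the mismatch with the 157 MB on disk. Under that assumption, progress is measured with r.raw.tell(), which counts bytes read off the wire, and the bar is clamped so it cannot overflow if the header is still too small.

```python
import sys


def render_bar(done_bytes, total_bytes, width=50):
    """Return a text bar such as '[=====     ]' for the given progress."""
    if total_bytes <= 0:
        return "[%s]" % ("?" * width)  # size unknown
    filled = min(width, int(width * done_bytes / total_bytes))
    return "[%s%s]" % ("=" * filled, " " * (width - filled))


def write_with_progress(chunks, f, total_bytes, bytes_received=None, out=sys.stdout):
    """Write byte chunks to the binary file f, redrawing a bar after each one.

    bytes_received is an optional callable returning how many bytes have
    arrived so far; by default the lengths of the written chunks are summed.
    """
    written = 0
    for chunk in chunks:
        written += f.write(chunk)
        done = bytes_received() if bytes_received else written
        out.write("\r" + render_bar(done, total_bytes))
        out.flush()
    out.write("\n")
    return written


def download(url, dest, chunk_size=64 * 1024):
    """Stream url to dest without holding the whole body in memory."""
    import requests  # third-party; imported here so the helpers above stay stdlib-only

    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get("content-length", 0))
        with open(dest, "wb") as f:
            # r.raw.tell() counts bytes read off the wire, which is what
            # Content-Length describes when the body is sent compressed.
            return write_with_progress(
                r.iter_content(chunk_size=chunk_size), f, total, r.raw.tell
            )
```

With the variables from the question, the call would be download(url_with_filename, local_filename). For tqdm, the equivalent idea is to create the bar with an explicit total and call its update() method with a byte count after each write, rather than wrapping the chunk iterator, which counts iterations instead of bytes.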

HaR*_*HaR 0

with open(filename, 'wb') as f:  # binary mode does not accept an encoding argument
    f.write(r.content)

This should solve your write problem. Write r.content, not r.text, since type(r.content) is <class 'bytes'>, which is what you need to write to a file opened in binary mode.
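The difference between the two modes can be reproduced without any network access. The following is only a small sketch: the payload is a made-up two-line CSV standing in for r.content (or a chunk from iter_content), and the temporary file path is hypothetical.

```python
import os
import tempfile

# Stand-in for r.content or a chunk from r.iter_content(): a bytes object.
payload = "a,b\n1,2\n".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Text mode ('w') expects str, so writing bytes fails as in the question.
error_message = None
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write(payload)
except TypeError as exc:
    error_message = str(exc)

# Binary mode ('wb') accepts bytes and takes no encoding argument.
with open(path, "wb") as f:
    f.write(payload)

# Reading the file back in binary mode returns the same bytes.
with open(path, "rb") as f:
    round_trip = f.read()

print(error_message)
print(round_trip == payload)
```

Writing bytes in binary mode also avoids decoding and re-encoding the text, so the file on disk matches what the server sent after decompression.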