P.E*_*ido 51 python random io import-from-csv pandas
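The CSV file I want to read does not fit into main memory. How can I read a few (~10K) random rows from it and compute some simple statistics on the resulting DataFrame?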
dlm*_*dlm 57
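Assuming the CSV file has no header:
import pandas
import random

n = 1000000  # number of records in the file
s = 10000  # desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n), n - s))  # row numbers to skip
# header=None: without it, the first sampled row would be consumed as column names
df = pandas.read_csv(filename, skiprows=skip, header=None)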
It would be better if read_csv had a keeprows option, or if skiprows accepted a callback function instead of a list.
With a header and unknown file length:
import pandas
import random

filename = "data.txt"
with open(filename) as f:
    n = sum(1 for line in f) - 1  # number of records in the file (excludes header)
s = 10000  # desired sample size
# sample from rows 1..n so the 0-indexed header is never in the skip list
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)
exp*_*rer 34
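@dlm's answer is great, but since v0.20.0 skiprows does accept a callable. The callable receives the row number as its argument.
If you can specify the fraction of rows you want rather than an exact count, you don't even need to know the file size, and you only need to read the file once. Assuming the first row is a header: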
import pandas as pd
import random

filename = "data.txt"  # assumed path, same as in the answer above
p = 0.01  # 1% of the lines
# keep the header (row 0), then take roughly 1% of the data rows:
# a row is skipped when a uniform draw from [0, 1) is greater than p
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)
Or, if you want to take every n-th row:
n = 100 # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
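The callable form can also give an exact sample size, which is the "keeprows" behaviour wished for in the first answer. The sketch below is one possible way to do it, not something taken from either answer: `make_skiprows` is a hypothetical helper name, and it assumes the number of data rows is already known (for example from a line count). It stores the s rows to keep in a set rather than building a list of the n - s rows to skip.

```python
import random


def make_skiprows(n_rows, sample_size, seed=None):
    """Build a predicate for read_csv(skiprows=...).

    Keeps the header (row 0) and exactly `sample_size` of the data
    rows numbered 1..n_rows; every other row is skipped.
    """
    rng = random.Random(seed)
    # Remember the rows to KEEP: s integers instead of n - s.
    keep = set(rng.sample(range(1, n_rows + 1), sample_size))
    return lambda i: i > 0 and i not in keep


# Check the predicate on its own, without touching a file.
skip = make_skiprows(n_rows=1000, sample_size=10, seed=0)
kept = [i for i in range(1001) if not skip(i)]
print(len(kept))  # 11: the header plus 10 sampled data rows
```

With pandas 0.20.0 or later the predicate would be passed as `pd.read_csv(filename, header=0, skiprows=make_skiprows(n, s))`.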
Bar*_*Bar 23
This is not in pandas, but it achieves the same result much faster via bash:
shuf -n 100000 data/original.tsv > data/sample.tsv
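The shuf command shuffles the input, and the -n argument specifies how many lines we want in the output.
Related question: https://unix.stackexchange.com/q/108581
Benchmark on a 7M-line CSV (2008):
The top answer: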
import random
import pandas

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1  # number of records in the file (excludes header)
    s = 100000  # desired sample size
    skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")
Timing of the pandas version:
%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s
So shuf is about 12 times faster and, importantly, does not read the whole file into memory.
des*_*ble 11
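Here is an algorithm that does not need to count the lines in the file beforehand, so you only need to read the file once.
Suppose you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), then with probability m/i it uses that sample to replace a randomly chosen one of the samples already selected.
By doing so, for any i > m, we always have a subset of m samples chosen uniformly at random from the first i samples.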
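A short induction sketch shows why every item ends up in the sample with the same probability m/i. The notation is introduced here for illustration: $x_j$ is the j-th item and $S_i$ is the sample held after i items have been seen. The base case is $i = m$, where every item is in the sample with probability $m/m = 1$. For $i > m$:

```latex
\begin{aligned}
\Pr[x_i \in S_i] &= \frac{m}{i}, \\
\Pr[x_j \in S_i] &= \Pr[x_j \in S_{i-1}] \cdot \left(1 - \frac{m}{i} \cdot \frac{1}{m}\right)
                  = \frac{m}{i-1} \cdot \frac{i-1}{i}
                  = \frac{m}{i} \qquad (j < i).
\end{aligned}
```

The factor $\frac{m}{i} \cdot \frac{1}{m}$ is the chance that the new item is accepted and that it evicts $x_j$ in particular.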
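See the code below:
import random

n_samples = 10
samples = []
filename = "data.txt"  # assumed path; the original snippet left f undefined
with open(filename) as f:
    for i, line in enumerate(f):
        if i < n_samples:
            # fill the reservoir with the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            # accept line i with probability n_samples / (i + 1), evicting a random entry
            samples[random.randint(0, n_samples - 1)] = line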
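To use this on a CSV with a header, the header has to be kept out of the reservoir. The sketch below is one way to package that; it is not the answer's own code. It uses the common single-draw variant (often called Algorithm R), where one random index decides both whether to accept a line and which slot it replaces. The names `reservoir_sample` and `sample_csv_lines` are hypothetical, and the in-memory file stands in for a real one.

```python
import io
import random


def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly at random from an iterable, in one pass."""
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            # Draw j uniformly from 0..i; j < k happens with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample


def sample_csv_lines(f, k, rng=random):
    """Read the header line, then reservoir-sample k of the remaining lines."""
    header = next(f)
    return header, reservoir_sample(f, k, rng)


# Demo on a small in-memory CSV with 1000 data rows.
data = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(1000)))
header, rows = sample_csv_lines(data, 10, random.Random(0))
print(header + "".join(rows))
```

The sampled text can then be parsed without touching the disk again, for example with `pandas.read_csv(io.StringIO(header + "".join(rows)))`. Only k lines are held in memory at any time.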