python python-3.x scikit-learn
I have around 10k .bytes files in my directory, and I want to use CountVectorizer to get n-gram counts (i.e. fit on the train set and transform the test set).
Of those 10k files, 8k are train and 2k are test.
files =
['bfiles/GhHS0zL9cgNXFK6j1dIJ.bytes',
'bfiles/8qCPkhNr1KJaGtZ35pBc.bytes',
'bfiles/bLGq2tnA8CuxsF4Py9RO.bytes',
'bfiles/C0uidNjwV8lrPgzt1JSG.bytes',
'bfiles/IHiArX1xcBZgv69o4s0a.bytes',
...............................
...............................]
print(open(files[0]).read())
'A4 AC 4A 00 AC 4F 00 00 51 EC 48 00 57 7F 45 00 2D 4B 42 45 E9 77 51 4D 89 1D 19 40 30 01 89 45 E7 D9 F6 47 E7 59 75 49 1F ....'
I can't do something like the following and pass everything to CountVectorizer:
file_content = []
for file in files:
    file_content.append(open(file).read())
I can't append each file's text to one big list and then pass that to CountVectorizer, because the combined text is over 150 GB. I don't have the resources for that, since CountVectorizer uses a huge amount of memory.
I need a more efficient way of solving this. Is there some other way I can achieve what I want without loading everything into memory at once? Any help is much appreciated.
All I could achieve was to read one file and run CountVectorizer on it, but I don't know how to get from there to what I'm looking for.
cv = CountVectorizer(ngram_range=(1, 4))
temp = cv.fit_transform([open(files[0]).read()])
temp
<1x451500 sparse matrix of type '<class 'numpy.int64'>'
with 335961 stored elements in Compressed Sparse Row format>
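For what it's worth, here is a rough sketch of the kind of streaming I have been thinking about (the stream_files helper is just for illustration): a generator keeps the raw text out of memory, but CountVectorizer still has to hold the full n-gram vocabulary (and the resulting sparse matrix) in RAM, which is where it blows up with ngram_range=(1, 4).
from sklearn.feature_extraction.text import CountVectorizer

def stream_files(paths):
    # Yield one file's text at a time instead of keeping everything in a list
    for path in paths:
        with open(path) as f:
            yield f.read()

cv = CountVectorizer(ngram_range=(1, 4))
# fit_transform accepts any iterable of strings, so the raw text is streamed,
# but the n-gram vocabulary built internally is still huge
X = cv.fit_transform(stream_files(files))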
You can build a solution using the following flow:
1) Loop through your files and create a set of all the tokens in your files. In the example below this is done with a Counter, but you can use a Python set to achieve the same result. The bonus here is that the Counter will also give you the total number of occurrences of each term.
2) Fit the CountVectorizer with the set/list of tokens. You can instantiate the CountVectorizer with ngram_range=(1, 4). It is avoided below in order to limit the number of features of df_new_data.
3) Transform the new data as usual.
The example below works on small data. I hope you can adapt the code to suit your needs.
import glob
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
# Create a list of file names
pattern = 'C:\\Bytes\\*.csv'
csv_files = glob.glob(pattern)
# Instantiate Counter and loop through the files chunk by chunk
# to create a dictionary of all tokens and their number of occurrences
counter = Counter()
c_size = 1000
for file in csv_files:
    for chunk in pd.read_csv(file, chunksize=c_size, index_col=0, header=None):
        counter.update(chunk[1])
# Fit the CountVectorizer to the counter keys
vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(list(counter.keys()))
# Loop through your files chunk by chunk and accumulate the counts
counts = np.zeros((1, len(vectorizer.get_feature_names())))
for file in csv_files:
    for chunk in pd.read_csv(file, chunksize=c_size,
                             index_col=0, header=None):
        new_counts = vectorizer.transform(chunk[1])
        counts += new_counts.A.sum(axis=0)
# Generate a data frame with the total counts
df_new_data = pd.DataFrame(counts, columns=vectorizer.get_feature_names())
df_new_data
Out[266]:
00 01 0A 0B 10 11 1A 1B A0 A1 \
0 258.0 228.0 286.0 251.0 235.0 273.0 259.0 249.0 232.0 233.0
AA AB B0 B1 BA BB
0 248.0 227.0 251.0 254.0 255.0 261.0
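As a rough sketch only, the same two-pass idea applied directly to the .bytes files from the question would look something like the code below. It assumes the files contain whitespace-separated hex tokens as shown in the question, keeps the default ngram_range=(1, 1) for the reason mentioned in step 2, and the bfiles/*.bytes pattern and byte_files name are just guesses at your layout.
import glob
from collections import Counter
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer

byte_files = glob.glob('bfiles/*.bytes')

# Pass 1: collect the token vocabulary one file at a time
counter = Counter()
for path in byte_files:
    with open(path) as f:
        counter.update(f.read().split())

# Fit the vectorizer on the known tokens only
vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(list(counter.keys()))

# Pass 2: transform one file at a time and stack the sparse rows
rows = []
for path in byte_files:
    with open(path) as f:
        rows.append(vectorizer.transform([f.read()]))
X = vstack(rows)  # one sparse row of token counts per file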
Code for generating the data:
import numpy as np
import pandas as pd
def gen_data(n):
    numbers = list('01')
    letters = list('AB')
    numlet = numbers + letters
    x = np.random.choice(numlet, size=n)
    y = np.random.choice(numlet, size=n)
    df = pd.DataFrame({'X': x, 'Y': y})
    return df.sum(axis=1)
n = 2000
df_1 = gen_data(n)
df_2 = gen_data(n)
df_1.to_csv('C:\\Bytes\\df_1.csv')
df_2.to_csv('C:\\Bytes\\df_2.csv')
df_1.head()
Out[218]:
0 10
1 01
2 A1
3 AB
4 1A
dtype: object
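Finally, for the fit-on-train / transform-on-test part of the question: with this approach you would build the Counter and fit the vectorizer on the 8k train files only, then transform both the train and the test files with that same fitted vectorizer, so both sets share the same feature columns. A sketch, reusing the imports from the .bytes example above and assuming the first 8k entries of the files list from the question are the train set:
# Hypothetical split of the file list from the question
train_files, test_files = files[:8000], files[8000:]

counter = Counter()
for path in train_files:          # vocabulary comes from the train files only
    with open(path) as f:
        counter.update(f.read().split())

vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(list(counter.keys()))

# Transform both sets with the same fitted vectorizer
X_train = vstack([vectorizer.transform([open(p).read()]) for p in train_files])
X_test = vstack([vectorizer.transform([open(p).read()]) for p in test_files])
Tokens that appear only in the test files are simply ignored by transform, which is the usual behaviour when the vocabulary is fitted on the train data alone.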