How can I apply a word count on a Polars DataFrame? I have a string column, and I want to count the words across all of its text. Thanks.
Sample dataframe:
0 Would never order again.
1 I'm not sure it gives me any type of glow and ...
2 Goes on smoothly a bit sticky and color is clo...
3 Preferisco altri prodotti della stessa marca.
4 The moisturizing advertised is non-existent.
If I were using pandas, it would look like this:
df.Description.str.split(expand=True).stack().value_counts().reset_index()
Result:
index 0
0 the 2
1 and 2
2 brown 2
3 is 2
4 any 1
5 The 1
6 moisturizing 1
7 like 1
8 …

I have (900k, 300) records in a mongo collection. When I try to read the data into pandas, memory consumption grows dramatically until the process is killed. I have to mention that the data fits in memory (~1.5GB) if I read it from a csv file.
My machine is a CentOS 7 box with 32GB RAM and 16 CPUs.
My simple code:
from pymongo import MongoClient
import pandas as pd

client = MongoClient(host, port)
collection = client[db_name][collection_name]
cursor = collection.find()
df = pd.DataFrame(list(cursor))
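Building one huge `list(cursor)` materialises every document as a Python dict at the same time, which costs far more memory than the ~1.5GB the raw data needs. One lower-overhead approach is to build DataFrames in fixed-size batches and concatenate them; this is a hedged sketch (`frames_from_cursor`, the batch size, and the stand-in cursor are illustrative, not from the original post):

```python
import pandas as pd

def frames_from_cursor(cursor, batch_size=10_000):
    """Yield DataFrames built from fixed-size batches of documents, so the
    intermediate list of dicts never holds the whole collection at once."""
    batch = []
    for doc in cursor:
        batch.append(doc)
        if len(batch) == batch_size:
            yield pd.DataFrame(batch)
            batch = []
    if batch:
        yield pd.DataFrame(batch)

# Stand-in for collection.find(); with pymongo the cursor itself is iterable.
fake_cursor = ({"i": i, "val": i * 2} for i in range(25))
df = pd.concat(frames_from_cursor(fake_cursor, batch_size=10), ignore_index=True)
```

With a real connection, `fake_cursor` would simply be `collection.find()`.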
My multiprocessing code:
import concurrent.futures
import multiprocessing

def read_mongo_parallel(skipses):
    print("Starting process")
    client = MongoClient(skipses[4], skipses[5])
    db = client[skipses[2]]
    collection = db[skipses[3]]
    print("range of {} to {}".format(skipses[0], skipses[0] + skipses[1]))
    cursor = collection.find().skip(skipses[0]).limit(skipses[1])
    return list(cursor)

all_lists = []

with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
    for rows in executor.map(read_mongo_parallel, skipesess):
        all_lists.extend(rows)

df = pd.DataFrame(all_lists)
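The `skipesess` list passed to `executor.map` is never defined in the post; judging from the indexing inside `read_mongo_parallel`, it presumably holds `(skip, limit, db_name, collection_name, host, port)` tuples. A hypothetical construction (all names and values are placeholders):

```python
# Placeholder connection details (assumed; not given in the question).
host, port = "localhost", 27017
db_name, collection_name = "mydb", "mycollection"

total_docs = 900_000                  # collection size from the question
n_workers = 16                        # one chunk per CPU
chunk = -(-total_docs // n_workers)   # ceiling division

skipesess = [
    (skip, chunk, db_name, collection_name, host, port)
    for skip in range(0, total_docs, chunk)
]
```

Note that `skip()` tends to get slower as the offset grows, so chunking with range queries on an indexed field is usually preferred.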
With both approaches memory keeps growing until the kernel is killed.
What am I doing wrong?