How can I apply a word count on a Polars DataFrame? I have a string column, and I want to count the words across all of its text. Thanks.
Sample dataframe:
0 Would never order again.
1 I'm not sure it gives me any type of glow and ...
2 Goes on smoothly a bit sticky and color is clo...
3 Preferisco altri prodotti della stessa marca.
4 The moisturizing advertised is non-existent.
If I were using pandas, it would look like this:
df.Description.str.split(expand=True).stack().value_counts().reset_index()
Result:
index 0
0 the 2
1 and 2
2 brown 2
3 is 2
4 any 1
5 The 1
6 moisturizing 1
7 like 1
8 …

I have (900k, 300) records in a mongo collection. When I try to read the data into pandas, memory consumption grows dramatically until the process is killed. I have to mention that the data fits in memory (~1.5GB) if I read it from a csv file.
My machine is a CentOS 7 box with 32GB RAM and 16 CPUs.
My simple code:
from pymongo import MongoClient
import pandas as pd

client = MongoClient(host, port)
collection = client[db_name][collection_name]
cursor = collection.find()
df = pd.DataFrame(list(cursor))
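Building one huge `list(cursor)` materialises every document as a Python dict at the same time, which costs far more memory than the ~1.5GB the raw data needs. One lower-overhead approach is to build DataFrames in fixed-size batches and concatenate them; this is a hedged sketch (`frames_from_cursor`, the batch size, and the stand-in cursor are illustrative, not from the original post):

```python
import pandas as pd

def frames_from_cursor(cursor, batch_size=10_000):
    """Yield DataFrames built from fixed-size batches of documents, so the
    intermediate list of dicts never holds the whole collection at once."""
    batch = []
    for doc in cursor:
        batch.append(doc)
        if len(batch) == batch_size:
            yield pd.DataFrame(batch)
            batch = []
    if batch:
        yield pd.DataFrame(batch)

# Stand-in for collection.find(); with pymongo the cursor itself is iterable.
fake_cursor = ({"i": i, "val": i * 2} for i in range(25))
df = pd.concat(frames_from_cursor(fake_cursor, batch_size=10), ignore_index=True)
```

With a real connection, `fake_cursor` would simply be `collection.find()`.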
My multiprocessing code:
import concurrent.futures
import multiprocessing

def read_mongo_parallel(skipses):
    print("Starting process")
    client = MongoClient(skipses[4], skipses[5])
    db = client[skipses[2]]
    collection = db[skipses[3]]
    print("range of {} to {}".format(skipses[0], skipses[0] + skipses[1]))
    cursor = collection.find().skip(skipses[0]).limit(skipses[1])
    return list(cursor)

all_lists = []

with concurrent.futures.ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
    for rows in executor.map(read_mongo_parallel, skipesess):
        all_lists.extend(rows)

df = pd.DataFrame(all_lists)
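The `skipesess` list passed to `executor.map` is never defined in the post; judging from the indexing inside `read_mongo_parallel`, it presumably holds `(skip, limit, db_name, collection_name, host, port)` tuples. A hypothetical construction (all names and values are placeholders):

```python
# Placeholder connection details (assumed; not given in the question).
host, port = "localhost", 27017
db_name, collection_name = "mydb", "mycollection"

total_docs = 900_000                  # collection size from the question
n_workers = 16                        # one chunk per CPU
chunk = -(-total_docs // n_workers)   # ceiling division

skipesess = [
    (skip, chunk, db_name, collection_name, host, port)
    for skip in range(0, total_docs, chunk)
]
```

Note that `skip()` tends to get slower as the offset grows, so chunking with range queries on an indexed field is usually preferred.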
With both approaches memory keeps growing until the kernel is killed.
What am I doing wrong?