Posts by The*_*der

How can I speed up text processing in a Pandas df column for large text data?

I have a large text file of chat data (chat.txt), over 1 GB in size, in the following format:

john|12-02-1999|hello#,there#,how#,are#,you#,tom$ 
tom|12-02-1999|hey#,john$,hows#, it#, goin#
mary|12-03-1999|hello#,boys#,fancy#,meetin#,ya'll#,here#
...
...
john|12-02-2000|well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$
mary|12-03-2000|catch#,you#,on#,the#,flipside#,tom$,and#,john$

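I want to process this text and, for each user separately, total the counts of certain keywords (say 500 words: hello, nice, like, ... dinner, no). The process also involves removing all trailing special characters from each word.

The output looks like: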

user   hello   nice   like    .....    dinner  No  
Tom    10000   500     300    .....    6000    0
John   6000    1200    200    .....    3000    5
Mary   23      9000    10000  .....    100     9000

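The cleanup step described above (dropping trailing special characters from each word) might be sketched as follows. This is only an illustration: the pattern `TRAILING_SPECIAL` and the helper `clean_token` are hypothetical names, and the sketch assumes "special characters" means any run of non-alphanumeric characters at the end of a token.

```python
import re

# Assumption: a "special character" is anything non-alphanumeric at the
# end of a token, such as the "#" and "$" markers in the sample data.
TRAILING_SPECIAL = re.compile(r"[^0-9A-Za-z]+$")


def clean_token(token):
    """Strip surrounding whitespace, then any trailing special characters."""
    return TRAILING_SPECIAL.sub("", token.strip())


# First line of the sample data, words field only.
sample = "hello#,there#,how#,are#,you#,tom$ "
cleaned = [clean_token(t) for t in sample.split(",")]
print(cleaned)
```

Characters inside a word, such as the apostrophe in `ya'll#`, are left alone because the pattern is anchored to the end of the token.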

Here is my current Pythonic solution:

import pandas as pd

# Load the pipe-delimited chat log into three columns.
chat_data = pd.read_csv("chat.txt", sep="|", names=["user", "date", "words"])
user_lst = chat_data.user.unique()
user_grouped_data = pd.DataFrame(columns=["user", "words"])
user_grouped_data["user"] = user_lst

# For each user, join all of that user's word strings into one comma-separated string.
for i, row in user_grouped_data.iterrows():
    user_id = row["user"]
    temp = chat_data[chat_data["user"] == user_id]
    user_grouped_data.loc[i, "words"] = ",".join(temp["words"].tolist())

result = pd.DataFrame(columns=[ "user", "hello", …

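One possible direction, sketched here under stated assumptions rather than offered as the accepted answer, is to skip the per-user DataFrame filtering and count keywords in a single streaming pass with `collections.Counter`. The helpers `count_keywords` and `to_rows` are hypothetical names; matching is assumed to be case-sensitive, and "special characters" are assumed to be trailing non-alphanumeric characters.

```python
import re
from collections import Counter, defaultdict

# Assumption: "special characters" are trailing non-alphanumeric characters.
TRAILING_SPECIAL = re.compile(r"[^0-9A-Za-z]+$")


def count_keywords(lines, keywords):
    """Return {user: Counter} of keyword occurrences, one line at a time."""
    wanted = set(keywords)
    counts = defaultdict(Counter)
    for line in lines:
        parts = line.rstrip("\n").split("|", 2)
        if len(parts) != 3:
            continue  # skip lines that are not user|date|words
        user, _date, words = parts
        user_counts = counts[user]  # also registers users with no keyword hits
        for token in words.split(","):
            word = TRAILING_SPECIAL.sub("", token.strip())
            if word in wanted:
                user_counts[word] += 1
    return counts


def to_rows(counts, keywords):
    """Flatten the counts into one dict per user, with 0 for absent keywords."""
    rows = []
    for user, user_counts in counts.items():
        row = {"user": user}
        for keyword in keywords:
            row[keyword] = user_counts[keyword]  # Counter gives 0 if missing
        rows.append(row)
    return rows


# A few lines taken from the sample data in the question.
sample_lines = [
    "john|12-02-1999|hello#,there#,how#,are#,you#,tom$ \n",
    "tom|12-02-1999|hey#,john$,hows#, it#, goin#\n",
    "mary|12-03-1999|hello#,boys#,fancy#,meetin#,ya'll#,here#\n",
    "john|12-02-2000|well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$\n",
    "mary|12-03-2000|catch#,you#,on#,the#,flipside#,tom$,and#,john$\n",
]
keywords = ["hello", "nice", "you"]  # stand-in for the real 500-word list
rows = to_rows(count_keywords(sample_lines, keywords), keywords)
for row in rows:
    print(row)
```

For the real file, an open file object can be passed as `lines` so that the 1 GB of text is never held in memory at once, and `pd.DataFrame(rows)` turns the result into a table shaped like the desired output.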

python regex dataframe python-3.x pandas

The*_*der

2020 10-11

6
votes

1
answer

272
views

Check whether two strings contain the same set of words in Python

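I am trying to compare two sentences to see whether they contain the same set of words.
For example, comparing "today is a good day" and "is today a good day" should return true.
Right now I am using the Counter class from the collections module: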

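from collections import Counter

# file_ob is assumed to be an already-open file object (not shown in the post).
vocab = {}  # first-seen sentence -> number of lines with the same words
for line in file_ob:
    flag = 0
    for sentence in vocab:
        # Equal Counters mean both lines contain the same words with the same counts.
        if Counter(sentence.split(" ")) == Counter(line.split(" ")):
            vocab[sentence] += 1
            flag = 1
            break
    # No existing sentence matched, so record this line as a new entry.
    if flag == 0:
        vocab[line] = 1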


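It seems to work fine for a few lines, but my text file has more than 1,000 lines and the script never finishes executing. Is there another, more efficient way to compute the result for the whole file?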
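
One way to avoid comparing every line against every stored sentence, sketched here as an illustration only, is to reduce each line to a hashable canonical key and look it up in a dictionary. The function `count_equivalent_lines` is a hypothetical name. The sketch uses a sorted tuple of words, which keeps the duplicate-sensitive behavior of `Counter`; `frozenset(line.split())` would be the equivalent key if duplicates should be ignored. It also splits on any whitespace with `split()`, an assumption that differs slightly from the `split(" ")` used in the question.

```python
def count_equivalent_lines(lines):
    """Count lines that contain the same words, keyed by the first line seen."""
    first_seen = {}  # canonical key -> first line with those words
    vocab = {}       # first line -> number of equivalent lines
    for line in lines:
        # Sorting the words makes the key independent of word order.
        key = tuple(sorted(line.split()))
        sentence = first_seen.setdefault(key, line)
        vocab[sentence] = vocab.get(sentence, 0) + 1
    return vocab


# Hypothetical stand-in for the lines of the real file.
lines = [
    "today is a good day\n",
    "is today a good day\n",
    "today is a bad day\n",
]
print(count_equivalent_lines(lines))
```

Each line is processed once, so the work grows roughly linearly with the number of lines instead of quadratically.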

Edit:

I only need a replacement for the Counter approach, whatever can stand in for it, with no other change to the implementation.
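
If only the comparison itself is to be swapped out, one candidate (an assumption, not something confirmed in the thread) is to compare sorted word lists. For lists produced by `split`, this gives the same result as comparing two `Counter` objects. The helper name `same_words` is hypothetical.

```python
def same_words(a, b):
    """True if a and b contain the same words with the same multiplicities."""
    words_a = a.split(" ")
    words_b = b.split(" ")
    # Lists of different lengths can never match, so skip the sort.
    if len(words_a) != len(words_b):
        return False
    return sorted(words_a) == sorted(words_b)


print(same_words("today is a good day", "is today a good day"))  # True
print(same_words("today is a good day", "today is a bad day"))   # False
```

In the loop from the question, the condition would then read `if same_words(sentence, line):`, leaving the rest of the code unchanged. This only makes each comparison cheaper; the nested loop still compares every line against every stored sentence.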

python text text-extraction python-2.7

The*_*der

2019 06-26

5
votes

1
answer

1338
views