Converting a column in a dask dataframe to a TaggedDocument for Doc2Vec

Question

ZdW*_*ite 2 python gensim dask doc2vec

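Introduction

Currently, I'm trying to use dask together with gensim to do NLP document computation, and I'm running into a problem converting my corpus into a "TaggedDocument".

Because I've tried so many different ways to wrangle this problem, I'll list my attempts.

Each attempt at dealing with this problem is met with slightly different woes.

First, some preliminary givens.

The data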

df.info()
<class 'dask.dataframe.core.DataFrame'>
Columns: 5 entries, claim_no to litigation
dtypes: object(2), int64(3)

  claim_no    claim_txt                                          CL   ICC  lit
0 8697278-17  battery comprising interior battery active ele...  106  2    0

Desired output

>>tagged_document[0]
>>TaggedDocument(words=['battery', 'comprising', 'interior', 'battery', 'active', 'elements', 'battery', 'cell', 'casing', 'said', 'cell', 'casing', 'comprising', 'first', 'casing', 'element', 'first', 'contact', 'surface', 'second', 'casing', 'element', 'second', 'contact', 'surface', 'wherein', 'assembled', 'position', 'first', 'second', 'contact', 'surfaces', 'contact', 'first', 'second', 'casing', 'elements', 'encase', 'active', 'materials', 'battery', 'cell', 'interior', 'space', 'wherein', 'least', 'one', 'gas', 'tight', 'seal', 'layer', 'arranged', 'first', 'second', 'contact', 'surfaces', 'seal', 'interior', 'space', 'characterized', 'one', 'first', 'second', 'contact', 'surfaces', 'comprises', 'electrically', 'insulating', 'void', 'volume', 'layer', 'first', 'second', 'contact', 'surfaces', 'comprises', 'formable', 'material', 'layer', 'fills', 'voids', 'surface', 'void', 'volume', 'layer', 'hermetically', 'assembled', 'position', 'form', 'seal', 'layer'], tags=['8697278-17'])
>>len(tagged_document) == len(df['claim_txt'])

Error 1: Generators not allowed

def read_corpus_tag_sub(df,corp='claim_txt',tags=['claim_no']):
    for i, line in enumerate(df[corp]):
        yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), (list(df.loc[i,tags].values)))

tagged_document = df.map_partitions(read_corpus_tag_sub,meta=TaggedDocument)
tagged_document = tagged_document.compute()

TypeError: Could not serialize object of type generator.

I found no way around this while still using a generator. A fix for this would be great, because this works perfectly fine with plain pandas.
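
The TypeError above is consistent with how generator objects behave under pickling in general: a generator cannot be serialized, while a list built from the same generator can. The following is a minimal standard-library sketch of that behavior. The helper `tag_rows` and the plain `(words, tags)` tuples are hypothetical stand-ins for the attempt above, and whether Dask takes exactly this pickling path internally is an assumption.

```python
import pickle

def tag_rows(rows):
    """Yield one (words, tags) tuple per (tag, text) pair, like the generator attempt."""
    for tag, text in rows:
        # str.split() stands in for gensim's simple_preprocess here.
        yield (text.split(), [tag])

rows = [("8697278-17", "battery comprising interior battery active elements")]

# A generator object cannot be pickled, so it cannot be shipped between processes.
try:
    pickle.dumps(tag_rows(rows))
    generator_pickles = True
except TypeError:
    generator_pickles = False

# Materializing the generator into a list gives a picklable result.
materialized = list(tag_rows(rows))
round_tripped = pickle.loads(pickle.dumps(materialized))
```

Materializing per partition avoids the serialization error, but it holds every document of that partition in memory at once.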

Error 2: Only the first element of each partition

def read_corpus_tag_sub(df,corp='claim_txt',tags=['claim_no']):
    for i, line in enumerate(df[corp]):
        return gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), (list(df.loc[i,tags].values)))

tagged_document = df.map_partitions(read_corpus_tag_sub,meta=TaggedDocument)
tagged_document = tagged_document.compute()

This one is a bit dumb, since the function won't iterate (I know), but it gives the desired format, returning only the first row of each partition.
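
The behavior follows from plain Python semantics: a `return` inside a `for` loop exits the function on the first iteration. A small sketch with hypothetical helper names, using `str.split()` in place of gensim's `simple_preprocess`:

```python
def first_only(lines):
    for line in lines:
        # return leaves the function immediately, so only the first line is seen
        return line.split()

def all_rows(lines):
    # collecting inside the loop and returning afterwards keeps every line
    return [line.split() for line in lines]

lines = ["battery comprising interior", "cell casing element", "gas tight seal"]
first_result = first_only(lines)
all_results = all_rows(lines)
```

Here `first_result` holds the tokens of the first line only, while `all_results` has one entry per input line.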

Error 3: Function call hangs at 100% CPU

def read_corpus_tag_sub(df,corp='claim_txt',tags=['claim_no']):
    tagged_list = []
    for i, line in enumerate(df[corp]):
        tagged = gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), (list(df.loc[i,tags].values)))
        tagged_list.append(tagged)
    return tagged_list

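From what I can tell, when I refactor the return to sit outside the loop, the function hangs, builds up memory in the dask client, and my CPU utilization goes to 100%, but no tasks are being computed. Keep in mind I'm calling the function the same way.

Pandas solution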

def tag_corp(corp,tag):
    return gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(corp), ([tag]))

tagged_document = [tag_corp(x,y) for x,y in list(zip(df_smple['claim_txt'],df_smple['claim_no']))]

List comprehension. I haven't tested this solution yet.

Other pandas solution

tagged_document = list(read_corpus_tag_sub(df))

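This solution runs for several hours. However, when it finishes, I don't have enough memory to handle the result.

Conclusion(?)

I feel super lost right now. Here's a list of threads I've looked at. I admit to being really new to dask; I've just spent so much time on this and I feel like I'm on a fool's errand.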

Answer 1

goj*_*omo 5

I'm not familiar with the Dask APIs/limitations, but generally:

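If you can iterate over your data as (words, tags) tuples, even ignoring the Doc2Vec/TaggedDocument steps, then the Dask side will have been handled, and converting those tuples to TaggedDocument instances should be trivial.
In general, for large datasets, you don't want to (and may not have enough RAM to) instantiate the full dataset as a list in memory, so your attempts that involve list() or .append() may be working up to a point, but then exhausting local memory (causing severe swapping) and/or just not reaching the end of your data.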
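
As a rough illustration of the first point, the conversion from (words, tag) pairs can be a thin lazy wrapper. In this sketch `TaggedDoc` is a hypothetical stand-in namedtuple with the same two fields as gensim's `TaggedDocument`, so it runs without gensim installed, and `iter_word_tag_pairs` is a made-up source whose second row is invented sample data.

```python
from collections import namedtuple

# Hypothetical stand-in with the same fields as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def iter_word_tag_pairs():
    """Made-up source of (words, tag) pairs; in practice the dataframe would back this."""
    yield (["battery", "comprising", "interior"], "8697278-17")
    yield (["cell", "casing", "element"], "example-2")  # invented sample row

def to_tagged(pairs):
    """Lazily wrap each (words, tag) pair; nothing is materialized here."""
    for words, tag in pairs:
        yield TaggedDoc(words=list(words), tags=[tag])

# list() is used only to inspect this tiny sample; a large corpus would be streamed.
docs = list(to_tagged(iter_word_tag_pairs()))
```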

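The preferable approach for large datasets is to create an iterable object that, every time it is asked to iterate over the data (because Doc2Vec training requires multiple passes), can offer up each item in turn, but never reads the entire dataset into an in-memory object.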
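
The difference between a one-shot generator and a restartable iterable can be seen in a few lines of plain Python. The names below are hypothetical, and the small in-memory `texts` list is only for illustration; a real `__iter__` would re-read from the underlying source on each pass.

```python
def one_shot(texts):
    """A generator function: the object it returns can be consumed only once."""
    for text in texts:
        yield text.split()

class RestartableCorpus:
    """An iterable: each call to __iter__ starts a fresh pass over the source."""
    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for text in self.texts:
            yield text.split()

texts = ["battery comprising interior", "cell casing element"]

gen = one_shot(texts)
first_pass = sum(1 for _ in gen)   # consumes the generator
second_pass = sum(1 for _ in gen)  # nothing left to yield

corpus = RestartableCorpus(texts)
passes = [sum(1 for _ in corpus) for _ in range(3)]  # every pass sees all items
```

This is why a bare generator is a poor fit for multi-pass training, while an object with `__iter__` can be iterated as many times as needed.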

A good blog post on this pattern is: Data streaming in Python: generators, iterators, iterables

Given the code you've shown, I suspect the right approach for you may be something like:

from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

class MyDataframeCorpus(object):
    def __init__(self, source_df, text_col, tag_col):
        self.source_df = source_df
        self.text_col = text_col
        self.tag_col = tag_col

    def __iter__(self):
        # Each call starts a fresh pass, so Doc2Vec can iterate multiple times
        for i, row in self.source_df.iterrows():
            yield TaggedDocument(words=simple_preprocess(row[self.text_col]),
                                 tags=[row[self.tag_col]])

corpus_for_doc2vec = MyDataframeCorpus(df, 'claim_txt', 'claim_no')
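
One possible variation on the same idea streams the data one partition at a time, so that only a single partition is held in memory during a pass. The sketch below is library-free and rests on several assumptions: `PartitionedCorpus`, `partition_loaders`, and `TaggedDoc` are hypothetical names, rows are modeled as plain dicts, and `str.split()` stands in for `simple_preprocess`. With Dask, such loaders could plausibly be built from the per-partition objects returned by `df.to_delayed()`, calling `.compute()` on each, but that wiring is not shown or tested here.

```python
from collections import namedtuple

# Hypothetical stand-in with the same fields as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

class PartitionedCorpus:
    """Restartable iterable that loads one partition at a time."""
    def __init__(self, partition_loaders, text_key, tag_key):
        # partition_loaders: zero-argument callables, each returning a list of row dicts
        self.partition_loaders = partition_loaders
        self.text_key = text_key
        self.tag_key = tag_key

    def __iter__(self):
        for load in self.partition_loaders:
            rows = load()  # only this partition's rows are in memory now
            for row in rows:
                yield TaggedDoc(words=row[self.text_key].split(),
                                tags=[row[self.tag_key]])

# Two tiny made-up partitions for illustration.
loaders = [
    lambda: [{"claim_no": "8697278-17", "claim_txt": "battery comprising interior"}],
    lambda: [{"claim_no": "example-2", "claim_txt": "cell casing element"}],
]

corpus = PartitionedCorpus(loaders, "claim_txt", "claim_no")
tags_pass_one = [doc.tags[0] for doc in corpus]
tags_pass_two = [doc.tags[0] for doc in corpus]  # a second pass reloads each partition
```

The trade-off is that every pass recomputes each partition, which costs time but keeps peak memory bounded by the largest partition.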

Archived: 6 years, 6 months ago
Views: 1335
Last recorded: 6 years, 6 months ago