Is there a way to improve the performance of the nltk.sentiment.vader sentiment analyzer?

Question

Cur*_*ma_ 4 python performance data-manipulation sentiment-analysis vader

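My text comes from a social network, so you can imagine its nature. I think the text is as clean and minimal as I can make it, after performing the following sanitization:

no URLs, no usernames
no punctuation, no accents
no numbers
no stop words (I think VADER does this anyway)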

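I think the run time is linear, and I don't intend to do any parallelization because changing the existing code would take a lot of effort. For example, about 1000 texts, ranging from ~50 KB to ~150 KB each, take about 10 minutes to run on my machine.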

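Is there a better way to feed the algorithm to speed up the processing time? The code is as simple as SentimentIntensityAnalyzer is meant to be used; this is the main part: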

import gc
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# c (cursor), conn (connection) and s (search term) come from code not shown here
c.execute(
    "select body, creation_date, group_id from posts "
    "where (substring(lower(body) from (%s))=(%s)) and language='en' "
    "order by creation_date DESC",
    (s, s),
)
conn.commit()
if c.rowcount > 0:
    dump_fetched = c.fetchall()

textsSql = pd.DataFrame(dump_fetched, columns=['body', 'created_at', 'group_id'])
del dump_fetched
gc.collect()
texts = textsSql['body'].values
# here, some data manipulation: steps listed above
polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]

Answer 1

Dhr*_*hak 6

1. You don't need to remove stop words; nltk+vader already does that.

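2. You don't need to remove punctuation: apart from the processing overhead, removing it also affects VADER's polarity calculation. So keep the punctuation.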

    >>> txt = "this is superb!"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.313, 'pos': 0.687, 'compound': 0.6588}
    >>> txt = "this is superb"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}
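
Following points 1 and 2, a lighter cleaning pass might strip only URLs and usernames and leave punctuation, case, and stop words alone. This is a minimal sketch: `light_clean` and both regular expressions are illustrative assumptions, not part of NLTK or VADER, and links or mentions in your data may need different patterns.

```python
import re

# Hypothetical patterns: adjust them to how links and mentions look in your data.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")


def light_clean(text: str) -> str:
    """Remove URLs and @usernames; keep punctuation, case, and stop words."""
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

Doing less work per text also removes some of the preprocessing overhead itself.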

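3. You should introduce sentence tokenization too, as it would improve accuracy, and then calculate the average polarity of a paragraph from its sentences. Example: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517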
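
Point 3 could look roughly like the sketch below. `average_compound` and `make_vader_helpers` are hypothetical names introduced here; `sent_tokenize` and `SentimentIntensityAnalyzer` are the NLTK pieces, and they assume the Punkt tokenizer data and the `vader_lexicon` resource have already been fetched with `nltk.download`. Sentence splitting relies on punctuation, which is another reason to keep it.

```python
from statistics import mean
from typing import Callable, List, Tuple


def average_compound(
    text: str,
    split_sentences: Callable[[str], List[str]],
    score_compound: Callable[[str], float],
) -> float:
    """Mean of the per-sentence compound scores; 0.0 if there are no sentences."""
    sentences = [s for s in split_sentences(text) if s.strip()]
    if not sentences:
        return 0.0
    return mean(score_compound(s) for s in sentences)


def make_vader_helpers() -> Tuple[Callable[[str], List[str]], Callable[[str], float]]:
    """Build an NLTK-backed sentence splitter and compound scorer.

    NLTK is imported here so that average_compound itself has no dependency on it.
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    from nltk.tokenize import sent_tokenize

    sid = SentimentIntensityAnalyzer()
    return sent_tokenize, lambda s: sid.polarity_scores(s)["compound"]
```

With NLTK set up, usage would be along the lines of `split, score = make_vader_helpers()` followed by `average_compound(text, split, score)` for each text. Build the helpers once and reuse them, since constructing the analyzer loads the lexicon.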

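4. The polarity calculations are completely independent of each other, so a small multiprocessing pool, say of size 10, can give a good speed boost.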

polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
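
One way to parallelize that list comprehension, as point 4 suggests, is sketched below. `score_texts`, `vader_compound_scorer`, and the helper names are hypothetical; `multiprocessing.Pool` with an `initializer` is standard library. The default of 10 processes mirrors the size mentioned above and `chunksize=16` is an arbitrary assumption; both are worth tuning against your core count and text sizes.

```python
from multiprocessing import Pool
from typing import Callable, List, Sequence

# Per-process scorer, set once by _init_worker so each worker builds
# its analyzer a single time instead of once per text.
_score: Callable[[str], float]


def vader_compound_scorer() -> Callable[[str], float]:
    """Build a text -> compound score function (needs the vader_lexicon resource)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sid = SentimentIntensityAnalyzer()
    return lambda text: sid.polarity_scores(text)["compound"]


def _init_worker(make_scorer: Callable[[], Callable[[str], float]]) -> None:
    """Runs once in every worker process."""
    global _score
    _score = make_scorer()


def _score_one(text: str) -> float:
    return _score(text)


def score_texts(
    texts: Sequence[str],
    processes: int = 10,
    chunksize: int = 16,
    make_scorer: Callable[[], Callable[[str], float]] = vader_compound_scorer,
) -> List[float]:
    """Score texts in input order; processes <= 1 runs serially in this process."""
    if processes <= 1:
        score = make_scorer()
        return [score(t) for t in texts]
    with Pool(processes, initializer=_init_worker, initargs=(make_scorer,)) as pool:
        return pool.map(_score_one, texts, chunksize=chunksize)
```

In the question's code this would replace the comprehension with something like `polarity_ = score_texts(list(texts))`. On platforms that spawn worker processes rather than fork them, the call needs to sit under an `if __name__ == "__main__":` guard.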

Archived:	9 years ago
Views:	3769
Last activity:	8 years, 11 months ago