Sentiment analysis of non-English tweets in Python

mnm*_*mnm 5 python twitter nlp python-2.7 sentiment-analysis

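Goal: classify each tweet as positive or negative and write it to an output file containing the username, the original tweet, and the tweet's sentiment.

Code: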

import re,math
input_file="raw_data.csv"
fileout=open("Output.txt","w")
wordFile=open("words.txt","w")
expression=r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

fileAFINN = 'AFINN-111.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [ws.strip().split('\t') for ws in open(fileAFINN)]))

pattern=re.compile(r'\w+')
pattern_split = re.compile(r"\W+")
words = pattern_split.split(input_file.lower())
print "File processing started"
with open(input_file,'r') as myfile:
    for line in myfile:
        line = line.lower()

        line=re.sub(expression," ",line)
        words = pattern_split.split(line.lower())
        sentiments = map(lambda word: afinn.get(word, 0), words)
        #print sentiments
        # How should you weight the individual word sentiments?
        # You could do N, sqrt(N) or 1 for example. Here I use sqrt(N)
        """
        Returns a float for sentiment strength based on the input text.
        Positive values are positive valence, negative value are negative valence.
        """
        if sentiments:
            sentiment = float(sum(sentiments))/math.sqrt(len(sentiments))
            #wordFile.write(sentiments)
        else:
            sentiment = 0
        wordFile.write(line+','+str(sentiment)+'\n')
        fileout.write(line+'\n')
print "File processing completed"

fileout.close()
myfile.close()
wordFile.close()

Problem: the output.txt file currently looks like this:

abc some tweet text 0
bcd some more tweets 1
efg some more tweet 0

Question 1: How do I add commas between the user ID, the tweet text, and the sentiment? The output should look like this:

 abc,some tweet text,0
 bcd,some other tweet,1
 efg,more tweets,0

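Question 2: The tweets are in Bahasa Malaysia (BM), but the AFINN dictionary I am using contains English words, so the classification is wrong. Do you know of any BM dictionary I could use?

Question 3: How can I package this code into a JAR file?

Thanks.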

Kri*_*hes 1

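Question 1:

output.txt currently consists only of the lines you are reading, because of fileout.write(line+'\n'). Since each line is space-separated, you can split it quite easily: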

line_data = line.split() # Split the line on whitespace (this also drops the trailing newline)
user_id = line_data[0] # The first element of the list
tweets = line_data[1:-1] # The middle elements of the list
sentiment = line_data[-1] # The last element of the list
fileout.write(user_id + "," + " ".join(tweets) + "," + sentiment + '\n')
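The split above assumes the sentiment is already the last token of the line. If the score still has to be computed, the following is a minimal sketch of one way to do both steps together. It is written for Python 3 (the question uses Python 2.7), the helper names score_text and to_csv_row and the toy lexicon are hypothetical, and mapping a score above zero to 1 and anything else to 0 is an assumption based on the 0/1 labels in the desired output. The csv module is used so that a tweet containing a comma is quoted rather than breaking the columns.

```python
import csv
import io
import math
import re

TOKEN_SPLIT = re.compile(r"\W+")

def score_text(text, lexicon):
    """Sum of word valences divided by sqrt(number of tokens); 0.0 for empty text."""
    words = [w for w in TOKEN_SPLIT.split(text.lower()) if w]
    if not words:
        return 0.0
    total = sum(lexicon.get(w, 0) for w in words)
    return total / math.sqrt(len(words))

def to_csv_row(line, lexicon):
    """Split 'user_id tweet text' on the first space and return one CSV line."""
    user_id, _, tweet = line.strip().partition(" ")
    # Assumption: score > 0 means positive (1), everything else negative (0)
    label = 1 if score_text(tweet, lexicon) > 0 else 0
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow([user_id, tweet, label])
    return buffer.getvalue()

# Hypothetical three-word lexicon standing in for AFINN-111.txt
toy_lexicon = {"good": 3, "bad": -3, "love": 3}
print(to_csv_row("abc I love this good phone\n", toy_lexicon), end="")
# prints: abc,I love this good phone,1
```

In the original loop, the returned string could be passed to fileout.write() in place of line+'\n'.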

Question 2: A quick Google search gave me this. Not sure whether it has everything you need: https://archive.org/stream/grammardictionar02craw/grammardictionar02craw_djvu.txt

Question 3: Try Jython: http://www.jython.org/archive/21/docs/jythonc.html
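Whichever word list you end up with, the scoring code only needs a dict of word to integer score, so a BM lexicon stored in the same tab-separated layout as AFINN-111.txt could replace the English one without other changes. Below is a minimal sketch, assuming that layout. The function name load_lexicon is hypothetical, and the three sample entries and their scores are made up for illustration and do not come from any published lexicon.

```python
import io

def load_lexicon(lines):
    """Parse AFINN-style lines ('word<TAB>score') into a dict of word -> int."""
    lexicon = {}
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        word, _, score = entry.rpartition("\t")
        if not word:
            continue  # skip malformed lines that contain no tab
        lexicon[word.lower()] = int(score)
    return lexicon

# Hypothetical entries; the scores are illustrative only
sample = io.StringIO("bagus\t3\nteruk\t-3\nsuka\t2\n")
bm_lexicon = load_lexicon(sample)
print(bm_lexicon["teruk"])
# prints: -3
```

With a real file, an open file object can be passed in place of the in-memory sample, since the function only iterates over lines.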