Sentiment analysis of Cyrillic text in R

Question

I cannot comment on the page where I found this function for sentiment analysis of Russian/Cyrillic text, so I am asking here:

get_sentiment_rus <- function(char_v, method="custom", lexicon=NULL, path_to_tagger = NULL, cl = NULL, language = "english") {
  language <- tolower(language)
  russ.char.yes <- "[\u0401\u0410-\u044F\u0451]"
  russ.char.no <- "[^\u0401\u0410-\u044F\u0451]"

    if (is.na(pmatch(method, c("syuzhet", "afinn", "bing", "nrc", 
                             "stanford", "custom")))) 
    stop("Invalid Method")
  if (!is.character(char_v)) 
    stop("Data must be a character vector.")
  if (!is.null(cl) && !inherits(cl, "cluster")) 
    stop("Invalid Cluster")
  if (method == "syuzhet") {
    char_v <- gsub("-", "", char_v)
  }
  if (method == "afinn" || method == "bing" || method == "syuzhet") {
    word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
    if (is.null(cl)) {
      result <- unlist(lapply(word_l, get_sent_values, 
                              method))
    }
    else {
      result <- unlist(parallel::parLapply(cl = cl, word_l, 
                                           get_sent_values, method))
    }
  }
  else if (method == "nrc") {
#    word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
    word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
    lexicon <- dplyr::filter_(syuzhet:::nrc, ~lang == tolower(language), 
                              ~sentiment %in% c("positive", "negative"))
    lexicon[which(lexicon$sentiment == "negative"), "value"] <- -1
    result <- unlist(lapply(word_l, get_sent_values, method, 
                            lexicon))
  }
  else if (method == "custom") {
#    word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
    word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
    result <- unlist(lapply(word_l, get_sent_values, method, 
                            lexicon))
  }
  else if (method == "stanford") {
    if (is.null(path_to_tagger)) 
      stop("You must include a path to your installation of the coreNLP package.  See http://nlp.stanford.edu/software/corenlp.shtml")
    result <- get_stanford_sentiment(char_v, path_to_tagger)
  }
  return(result)
}

It gives an error:

> mysentiment <- get_sentiment_rus(as.character(corpus))
 Error in UseMethod("filter_") : 
  no applicable method for 'filter_' applied to an object of class "NULL"

In addition, all the sentiment scores come out as 0:

> SentimentScores <- data.frame(colSums(mysentiment[,]))
> SentimentScores
             colSums.mysentiment.....
anger                               0
anticipation                        0
disgust                             0
fear                                0
joy                                 0
sadness                             0
surprise                            0
trust                               0
negative                            0
positive                            0

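Can you point out where the problem might be, or suggest another working method for sentiment analysis? I would also like to know which packages support Russian.

I am looking for any working way to run sentiment analysis on Russian text.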

Answer 1

JBG*_*ber 6

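It looks to me as though your function does not actually find any sentiment words in your text. This may be related to the sentiment dictionary you are using. Rather than trying to fix this function, consider a tidy approach, as outlined in the book "Text Mining with R: A Tidy Approach". The advantages are that it has no trouble with Cyrillic letters and is easy to understand and adapt.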

First, we need a dictionary with sentiment values. I found one on GitHub that we can read directly into R:

library(rvest)
library(stringr)
library(tidytext)
library(dplyr)

dict <- readr::read_csv("https://raw.githubusercontent.com/text-machine-lab/sentimental/master/sentimental/word_list/russian.csv")

Next, let's get some test data to work with. For no particular reason, I used the Russian Wikipedia entry on Brexit and scraped the text:

brexit <- "https://ru.wikipedia.org/wiki/%D0%92%D1%8B%D1%85%D0%BE%D0%B4_%D0%92%D0%B5%D0%BB%D0%B8%D0%BA%D0%BE%D0%B1%D1%80%D0%B8%D1%82%D0%B0%D0%BD%D0%B8%D0%B8_%D0%B8%D0%B7_%D0%95%D0%B2%D1%80%D0%BE%D0%BF%D0%B5%D0%B9%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D1%81%D0%BE%D1%8E%D0%B7%D0%B0" %>% 
  read_html() %>% 
  html_nodes("body") %>% 
  html_text() %>%
  tibble(text = .)

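Now this data can be converted into a tidy format. I first split the text into paragraphs, so that the sentiment score of each paragraph can be examined separately.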

brexit_tidy <- brexit %>%
  unnest_tokens(output = "paragraph", input = "text", token = "paragraphs") %>% 
  mutate(id = seq_along(paragraph)) %>% 
  unnest_tokens(output = "word", input = "paragraph", token = "words")
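
To see what the two `unnest_tokens()` calls do, here is a minimal sketch on a made-up two-paragraph string. The text and the names `toy` and `toy_tidy` are illustrative assumptions, not part of the Wikipedia data. The Cyrillic words are written as `\u` escapes so the snippet does not depend on the encoding of the script file.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two made-up paragraphs separated by a blank line:
#   "Хороший день."  /  "Плохая погода сегодня."
toy <- tibble(text = paste0(
  "\u0425\u043e\u0440\u043e\u0448\u0438\u0439 \u0434\u0435\u043d\u044c.",
  "\n\n",
  "\u041f\u043b\u043e\u0445\u0430\u044f \u043f\u043e\u0433\u043e\u0434\u0430 ",
  "\u0441\u0435\u0433\u043e\u0434\u043d\u044f."
))

toy_tidy <- toy %>%
  # one row per paragraph (paragraphs are split on blank lines)
  unnest_tokens(output = paragraph, input = text, token = "paragraphs") %>%
  # number the paragraphs before they are broken up further
  mutate(id = seq_along(paragraph)) %>%
  # one row per lower-cased word, keeping the paragraph id
  unnest_tokens(output = word, input = paragraph, token = "words")

print(toy_tidy)
```

The result has one row per word, and the `id` column records which paragraph each word came from.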

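From this point, using a dictionary with tidy data is straightforward. You simply join the data frame of sentiment values (i.e. the dictionary) with the data frame of words from the text. Where a word in the text matches the dictionary, the sentiment value is added. All other words are dropped.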

# apply dictionary
brexit_sentiment <- brexit_tidy %>% 
  inner_join(dict, by = "word")

head(brexit_sentiment)
#> # A tibble: 6 x 3
#>      id word         score
#>   <int> <chr>        <dbl>
#> 1     7 ????????      -1.7
#> 2    13 ??????        -5  
#> 3    22 ????????????   5  
#> 4    22 ??????        -5  
#> 5    23 ?????          1.7
#> 6    39 ??????        -5
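
The join can be tried in isolation on a few words. This is only a sketch: `toy_dict` and `toy_words` are hypothetical stand-ins with invented scores, shaped like the dictionary (`word`, `score`) and the tokenised text (`id`, `word`) used here. The `anti_join()` step is an addition that lists the words that found no match, which may help when every score comes out as zero.

```r
library(dplyr)
library(tibble)

# Hypothetical mini-dictionary; the scores are made up for illustration.
toy_dict <- tibble(
  word  = c("\u0445\u043e\u0440\u043e\u0448\u0438\u0439",  # "хороший" (good)
            "\u043f\u043b\u043e\u0445\u043e\u0439"),        # "плохой" (bad)
  score = c(5, -5)
)

# Hypothetical tokenised text: paragraph id plus one word per row.
toy_words <- tibble(
  id   = c(1L, 1L, 2L, 2L),
  word = c("\u0445\u043e\u0440\u043e\u0448\u0438\u0439",  # "хороший"
           "\u0434\u0435\u043d\u044c",                    # "день" (day)
           "\u043f\u043b\u043e\u0445\u043e\u0439",        # "плохой"
           "\u0434\u0435\u043d\u044c")                    # "день"
)

# Keep only the words that appear in the dictionary, with their scores.
matched <- inner_join(toy_words, toy_dict, by = "word")

# Words that were dropped because the dictionary does not contain them.
unmatched <- anti_join(toy_words, toy_dict, by = "word")

print(matched)
print(unmatched)
```

If `matched` is empty for real data, the text and the dictionary share no word forms at all, which points at the dictionary or the tokenisation rather than at the scoring.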

You might prefer a value per paragraph rather than per word. This is easily done by taking the mean for each paragraph:

# group sentiment by paragraph
brexit_sentiment %>% 
  group_by(id) %>% 
  summarise(sentiment = mean(score))
#> # A tibble: 25 x 2
#>       id sentiment
#>    <int>     <dbl>
#>  1     7     -1.7 
#>  2    13     -5   
#>  3    22      0   
#>  4    23      1.7 
#>  5    39     -5   
#>  6    42      5   
#>  7    43     -1.88
#>  8    44     -3.32
#>  9    45     -3.35
#> 10    47      1.7 
#> # … with 15 more rows
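
A mean of 0 can hide two opposing words, as in paragraph 22 above (scores 5 and -5), and paragraphs without any dictionary match disappear from the result altogether. A possible variation, sketched here on made-up data, also counts the matched words and keeps every paragraph. The names `toy_sentiment`, `all_paragraphs` and `n_matched` are assumptions introduced for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical word-level result, shaped like brexit_sentiment.
toy_sentiment <- tibble(
  id    = c(1L, 1L, 3L),
  word  = c("w1", "w2", "w3"),  # placeholder tokens
  score = c(5, -5, 1.7)
)

# Hypothetical list of all paragraph ids, including unmatched ones.
all_paragraphs <- tibble(id = 1:4)

toy_by_paragraph <- toy_sentiment %>%
  group_by(id) %>%
  summarise(sentiment = mean(score), n_matched = n()) %>%
  # bring back paragraphs that had no dictionary match
  right_join(all_paragraphs, by = "id") %>%
  arrange(id) %>%
  # unmatched paragraphs keep sentiment = NA but get a count of 0
  mutate(n_matched = coalesce(n_matched, 0L))

print(toy_by_paragraph)
```

Leaving `sentiment` as `NA` for unmatched paragraphs keeps "no evidence" apart from "balanced evidence", which a plain 0 would blur.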

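If necessary, this approach can be improved in several ways:

To deal with different word forms, you could lemmatize the words, which makes matches more likely
If your text contains spelling mistakes, you could match similar words with, for example, a fuzzy join
You could find or create a better dictionary than the one I found on the first page of Google results for "Russian sentiment dictionary"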
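
As a rough illustration of the first point, the sketch below joins on word stems instead of exact word forms. Note that this swaps in stemming (via `SnowballC::wordStem()` with `language = "russian"`) for true lemmatization, which would need a morphological tool. The data, scores and object names are made up for illustration.

```r
library(dplyr)
library(tibble)
library(SnowballC)

# Hypothetical dictionary entry: "хороший" (good, masculine form).
toy_dict <- tibble(
  word  = "\u0445\u043e\u0440\u043e\u0448\u0438\u0439",
  score = 5
)

# Hypothetical text words: "хорошая" (good, feminine form), "погода" (weather).
toy_words <- tibble(
  id   = 1:2,
  word = c("\u0445\u043e\u0440\u043e\u0448\u0430\u044f",
           "\u043f\u043e\u0433\u043e\u0434\u0430")
)

# An exact join misses the inflected form.
exact <- inner_join(toy_words, toy_dict, by = "word")

# Reduce both sides to stems, then join on the stem instead.
dict_stem <- toy_dict %>%
  mutate(stem = wordStem(word, language = "russian")) %>%
  distinct(stem, .keep_all = TRUE) %>%  # one score per stem
  select(stem, score)

stemmed <- toy_words %>%
  mutate(stem = wordStem(word, language = "russian")) %>%
  inner_join(dict_stem, by = "stem")

print(exact)
print(stemmed)
```

Stemming is crude: several dictionary words can collapse to the same stem, and `distinct()` here simply keeps the first score, so a real pipeline would need a deliberate rule for such collisions.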
