Finding the most frequent words in a text in R

Mad*_*een 2 r n-gram

Can someone help me find the most frequently used two- and three-word phrases in a text using R?

My text is...

text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs) but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")

ali*_*ire 9

The tidytext package makes this sort of thing pretty simple:

library(tidytext)
library(dplyr)

tibble(text = text) %>%    # data_frame() is deprecated; tibble() replaces it
    unnest_tokens(word, text) %>%    # split words
    anti_join(stop_words, by = "word") %>%    # take out "a", "an", "the", etc.
    count(word, sort = TRUE)    # count occurrences

# Source: local data frame [73 x 2]
# 
#           word     n
#          (chr) (int)
# 1       phrase     8
# 2     sentence     6
# 3        words     4
# 4       called     3
# 5       common     3
# 6  grammatical     3
# 7      meaning     3
# 8         alex     2
# 9         bird     2
# 10    complete     2
# ..         ...   ...

If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:

library(tokenizers)

tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>%    # tokenize bigrams and trigrams
    as_tibble() %>%    # structure (as_data_frame() is deprecated)
    count(value, sort = TRUE)    # count

# Source: local data frame [531 x 2]
# 
#           value     n
#          (fctr) (int)
# 1        of the     5
# 2      a phrase     4
# 3  the sentence     4
# 4          as a     3
# 5        in the     3
# 6        may be     3
# 7    a complete     2
# 8   a phrase is     2
# 9    a sentence     2
# 10      a white     2
# ..          ...   ...


Man*_*mar 7

Your text is the `text` vector defined in the question above.

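In natural language processing, a two-word phrase is called a "bi-gram", a three-word phrase is called a "tri-gram", and so on. In general, a given combination of n words is called an "n-gram".

First, we install the ngram package (available on CRAN)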
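To make the definition concrete, here is a minimal base R sketch that slides a window of n words over a string and counts each window. `count_ngrams` is a hypothetical helper written for illustration, not a function from any package, and it assumes that lowercasing and stripping punctuation is acceptable preprocessing.

```r
# Hypothetical helper, not part of any package: count the n-grams in one string.
count_ngrams <- function(x, n) {
  # Lowercase, drop punctuation, then split on runs of whitespace
  words <- strsplit(gsub("[[:punct:]]", "", tolower(x)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(table(character(0)))
  # Every n-gram starts at one of these positions
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  # Most frequent n-grams first
  sort(table(grams), decreasing = TRUE)
}

demo <- "the cat sat on the mat and the cat slept"
bigrams  <- count_ngrams(demo, 2)   # "the cat" is the only bigram seen twice
trigrams <- count_ngrams(demo, 3)
head(bigrams, 3)
```

Calling `count_ngrams(text, 2)` and `count_ngrams(text, 3)` on the question's text should give bigram and trigram counts in the same spirit, though dedicated packages handle tokenization edge cases more carefully.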

# Install package "ngram"
install.packages("ngram")

Then we find the most common two- and three-word phrases

library(ngram)

# To find all two-word phrases in the test "text":
ng2 <- ngram(text, n = 2)

# To find all three-word phrases in the test "text":
ng3 <- ngram(text, n = 3)

Finally, we print the objects (ngrams) using the following methods:

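# Print a truncated listing of the bigram object
print(ng2, output = "truncated")

# Print the full listing of the trigram object
print(ng3, output = "full")

# Data frame of n-grams with their frequencies and proportions
get.phrasetable(ng2)

# Character vector of all 2- and 3-word n-grams (Weka-style tokenizer)
ngram::ngram_asweka(text, min = 2, max = 3)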

We can also use Markov chains to babble new sequences:

# if we are using ng2 (bi-gram)
lnth = 2 
babble(ng = ng2, genlen = lnth)

# if we are using ng3 (tri-gram)
lnth = 3  
babble(ng = ng3, genlen = lnth)
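For intuition about what "babbling" with a Markov chain means, here is a rough base R analogue for the bigram case. It is only a sketch of the general idea and not the ngram package's actual implementation; `babble_sketch` and its restart-on-dead-end rule are assumptions made for illustration.

```r
# Hypothetical sketch of a first-order (bigram) Markov chain text generator.
# This is NOT how ngram::babble() is implemented; it only illustrates the idea.
babble_sketch <- function(words, genlen) {
  stopifnot(length(words) >= 2, genlen >= 1)
  # Map each word to the words observed immediately after it
  followers <- split(words[-1], words[-length(words)])
  out <- character(genlen)
  out[1] <- sample(words, 1)
  for (i in seq_len(genlen)[-1]) {
    nxt <- followers[[out[i - 1]]]
    # Dead end (word only seen at the very end): fall back to any word
    if (is.null(nxt)) nxt <- words
    # Sampling from the raw follower list weights words by observed frequency
    out[i] <- nxt[sample.int(length(nxt), 1)]
  }
  paste(out, collapse = " ")
}

set.seed(1)
corpus <- c("a", "phrase", "is", "a", "group", "of", "words", "or", "a", "phrase")
generated <- babble_sketch(corpus, genlen = 6)
generated
```

Because successors are drawn from the observed followers of the previous word, frequent bigrams in the input are proportionally more likely to reappear in the output.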


tal*_*lat 3

Here's a simple base R approach for the 5 most frequent words:

head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)

#     a    the     of     in phrase 
#    21     18     12     10      8 

What it returns is an integer vector of frequency counts, whose names are the words that were counted.

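  • gsub("[[:punct:]]", "", text) removes punctuation, since you don't want to count that, I guess
  • strsplit(gsub("[[:punct:]]", "", text), " ") splits the string on spaces
  • table() counts the frequency of the unique elements
  • sort(..., decreasing = TRUE) sorts them in decreasing order
  • head(..., 5) selects only the 5 most frequent words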

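  • Although I like this answer as a solution to "find the most common words", I believe more transformations than just removing punctuation would help. In particular, I think converting all entries to lower case would be a good idea. I was considering offering an alternative using the "tm" package, but the question seems to have already been answered to the OP's satisfaction. (2 upvotes)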
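Following up on the lowercasing suggestion in the comment above, a minimal base R sketch might look like this. `top_words` is a hypothetical wrapper introduced here for illustration; it also splits on any run of whitespace rather than on a single space, which is an extra assumption beyond the original one-liner.

```r
# Hypothetical wrapper around the one-liner above, with case folding added.
top_words <- function(x, k = 5) {
  # Lowercase first so "The" and "the" are counted as the same word
  cleaned <- gsub("[[:punct:]]", "", tolower(x))
  # Split on runs of whitespace and drop any empty strings
  words <- unlist(strsplit(cleaned, "[[:space:]]+"))
  words <- words[nzchar(words)]
  head(sort(table(words), decreasing = TRUE), k)
}

demo <- "The cat. the CAT, a dog; The end"
top2 <- top_words(demo, 2)
top2
```

Without `tolower()`, "The" and "the" (and "cat" and "CAT") in the demo string would be tallied as separate words.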