Tag: text-analysis

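How can I automatically identify tags (keywords) in a given text?

It should behave like the Delicious toolbar for Firefox: it lists tags that can be clicked. The result looks like this:

[image]

The code should be able to find the keywords of a text. Can anyone recommend a good algorithm or open-source project?

I found this article, but it is a bit too general for my specific needs.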
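One simple baseline for this kind of tag suggestion is to count word frequencies after dropping stop words. The sketch below is only an illustration of that idea, not a recommendation of a specific project: the `extract_tags` helper, the tiny `STOP_WORDS` set, and the sample text are all assumptions made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "be", "by", "each", "for", "from",
    "in", "is", "it", "of", "on", "or", "that", "the", "this", "to", "with",
}

def extract_tags(text, max_tags=5):
    """Return up to max_tags of the most frequent non-stop-word terms in text."""
    # Lowercase the text and keep alphabetic tokens of at least two characters.
    words = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(max_tags)]

sample = (
    "Topic models find topics in text. A topic model treats each "
    "document as a mixture of topics and each topic as a mixture of words."
)
print(extract_tags(sample, max_tags=3))
```

Raw frequency favors words that are common everywhere, so a weighting scheme such as tf-idf against a background corpus is a natural next step if the suggested tags turn out too generic.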

algorithm full-text-search text-analysis

lka*_*htz

2017-05-23

5
votes

1
answers

4285
views

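Remove "empty" character items from a document corpus in R?

I am using the tm and lda packages in R to topic-model a corpus of news articles. However, I am getting a "non-character" problem, represented as "", that is messing up my topics. Here is my workflow: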

library(tm)
library(lda)

text <- Corpus(VectorSource(d$text))
newtext <- lapply(text, tolower)
sw <- c(stopwords("english"), "ahram", "online", "egypt", "egypts", "egyptian")
newtext <- lapply(newtext, function(x) removePunctuation(x))
newtext <- lapply(newtext, function(x) removeWords(x, sw))
newtext <- lapply(newtext, function(x) removeNumbers(x))
newtext <- lapply(newtext, function(x) stripWhitespace(x))
d$processed <- unlist(newtext)
corpus <- lexicalize(d$processed)
k <- 40
result <- lda.collapsed.gibbs.sampler(corpus$documents, k, corpus$vocab, 500, .02, .05,
                                      compute.log.likelihood = TRUE, trace = 2L)

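Unfortunately, when I train the lda model, everything looks great except that the most frequently occurring word is "". I tried to fix this by removing it from the vocabulary as shown below and re-estimating the model as described above: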

newtext <- lapply(newtext, function(x) removeWords(x, ""))

However, it is still there, as shown by:

str_split(newtext[[1]], " ")

[[1]]
 [1] ""              "body" …

r text-analysis text-mining lda topic-modeling

Author

2012-05-08

5
votes

1
answers

6303
views

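Benchmarks for Python context-free grammar (CFG) and PCFG generation?

I know there are several libraries in Python for general CFGs and PCFGs; however, they all seem to run at different speeds.

For example: NLTK, PyParsing.

Are there any recent benchmarks that compare them on speed and memory usage?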

python nlp text-analysis nltk context-free-grammar

Foo*_*ack

2013-05-27

5
votes

1
answers

644
views

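Find all locations / cities / places in a text

If I have a text containing, for example, a newspaper article in Catalan, how can I find all the cities in that text?

I have been looking at the nltk package for Python, and I have downloaded the Catalan corpus (nltk.corpus.cess_cat).

What I have so far: I have installed everything required via nltk.download(). An example of what I have at the moment: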

import nltk

te = nltk.word_tokenize('Tots els gats son de Sant Cugat del Valles.')

nltk.pos_tag(te)

The city is 'Sant Cugat del Valles'. What I get from the output is:

[('Tots', 'NNS'),
 ('els', 'NNS'),
 ('gats', 'NNS'),
 ('son', 'VBP'),
 ('de', 'IN'),
 ('Sant', 'NNP'),
 ('Cugat', 'NNP'),
 ('del', 'NN'),
 ('Valles', 'NNP')]

NNP seems to indicate nouns whose first letter is capitalized. Is there a way to get only places or cities, rather than all names? Thanks

python text-analysis corpus nltk tagged-corpus

sar*_*nes

2015-05-10

5
votes

3
answers

20k
views

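Client-side JavaScript code analyzer

Is there a JavaScript code analyzer that can be used on the client side to analyze code patterns? I found the following, but it seems to handle only plain text and just gives you = signs and so on. I need something that analyzes code (JS code) and runs on the client side. Is there one I can use?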

function parseData() {
  var rawData = document.getElementById('data').value.trim(),  
      result,
      output = $('#output'),
      table = $('table').remove(),
      header,
      row,
      cell,
      ul,
      slice,
      wpm = [],        
      wpmAvg = [];

  output.empty();
  table.find('thead, tbody').empty();

  if ($('[name="format"]:checked').val() === 'text') {
    // Simple text        
    result = analyzeText(rawData);
    output.append('Word count: ' + result.count + '<br><br>Frequent words:<br>');
    ul = $('<ul>');
    _.forEach(result.frequentWords, function(value, key) {
      ul.append('<li>' + value.word + ': ' + value.count + '</li>');
    });
    output.append(ul);       
  }
  else {
    // JSON
    try {
      data = JSON.parse(rawData);
    } …

javascript jquery text-analysis

Author

2016-05-29

5
votes

1
answers

211
views

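How do I classify a new document using tf-idf?

If I use TfidfVectorizer from sklearn to generate feature vectors like this:

features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments)

then how would I generate the feature vector needed to classify a new document? After all, you cannot compute tf-idf for a single document.

Would it be correct to extract the feature names with:

feature_names = TfidfVectorizer.get_feature_names()

and then compute the term frequencies of the new document according to feature_names?

But then I would not get weights that carry the word-importance information.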
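The usual way around this is to learn the vocabulary and IDF weights once, on the training documents, and then reuse them to weight any new document. In scikit-learn that means keeping the fitted vectorizer object and calling its `transform` method on the new text instead of fitting again. The pure-Python sketch below only illustrates the principle; `fit_idf`, `transform`, and the toy documents are hypothetical names and data invented for the example, not scikit-learn's implementation.

```python
import math
from collections import Counter

def fit_idf(documents):
    """Learn smoothed IDF weights from a list of tokenized training documents."""
    n_docs = len(documents)
    # Document frequency: in how many training documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }

def transform(doc, idf):
    """Weight one tokenized document with previously learned IDF values.

    Terms that were not seen during fitting are dropped.
    """
    tf = Counter(term for term in doc if term in idf)
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    # L2-normalize so documents of different lengths are comparable.
    return {term: w / norm for term, w in weights.items()} if norm else {}

train = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
]
idf = fit_idf(train)                            # learned once, from training data
new_vec = transform(["mat", "sat", "sofa"], idf)  # reused for a single new document
print(new_vec)
```

The new document's vector keeps the importance information because the IDF part comes from the training corpus; only the term frequencies come from the new document.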

python text-analysis text-mining tf-idf scikit-learn

Isb*_*ter

2018-02-01

5
votes

1
answers

2024
views

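Compute word n-grams on the raw text or after lemmatization/stemming?

I am considering using the word n-grams technique on raw text. But I have a doubt:

Does it make sense to use word n-grams after applying lemmatization/stemming to the text? If not, why should I use word n-grams only on the raw text? What are the pros and cons?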
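A small experiment can make the trade-off concrete: normalizing tokens first makes n-grams from differently inflected sentences coincide, at the cost of losing the inflection itself. The sketch below assumes a deliberately crude `toy_stem` function written only for this illustration (it is not Porter or any real stemmer), and the two example sentences are made up.

```python
def word_ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) for a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_stem(token):
    """Crude suffix stripper for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

a = "dogs walked home".split()
b = "dog walks home".split()

# Bigrams shared by the two sentences on the raw tokens.
raw_shared = set(word_ngrams(a, 2)) & set(word_ngrams(b, 2))

# Bigrams shared after normalizing every token first.
stem_shared = (
    set(word_ngrams([toy_stem(t) for t in a], 2))
    & set(word_ngrams([toy_stem(t) for t in b], 2))
)

print(len(raw_shared), len(stem_shared))
```

Whether the extra overlap helps depends on the task: it reduces sparsity for retrieval or topic-style features, but it also erases distinctions such as tense and number that some classifiers rely on.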

information-retrieval text-analysis stemming lemmatization n-gram

Ale*_*dro

2017-11-13

5
votes

1
answers

1495
views

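How can I check whether a string contains Roman numerals in R?

My dataset 'ad' has a column of residential addresses. I want to check for addresses that contain no numbers (including Roman numerals). I am using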

ad$check <- grepl("[[:digit:]]",ad$address)

to flag addresses that have no digits. How can I do the same for addresses that contain Roman numerals?

For example: "X Floor, DLF Building-III, ABC City"
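One possible approach is a pattern that matches a well-formed uppercase Roman numeral only when it stands alone as a token. The sketch below is written in Python rather than R purely for illustration; the names `ROMAN_TOKEN` and `has_number` are made up for the example. Because the pattern relies on lookarounds, using it with `grepl` in R would presumably need `perl = TRUE`.

```python
import re

ROMAN_TOKEN = re.compile(
    r"(?<![A-Za-z0-9])"   # not preceded by a letter or digit
    r"(?=[MDCLXVI])"      # must start with a Roman numeral letter
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
    r"(?![A-Za-z0-9])"    # not followed by a letter or digit
)

def has_number(address):
    """True if the address contains an Arabic digit or a standalone Roman numeral."""
    return bool(re.search(r"\d", address) or ROMAN_TOKEN.search(address))

example = "X Floor, DLF Building-III, ABC City"
print(ROMAN_TOKEN.findall(example))
print(has_number(example), has_number("Green Park, ABC City"))
```

This is a heuristic: standalone uppercase words that happen to be valid numerals (the pronoun "I", or "MIX" in capitals) would also match, and lowercase numerals are ignored, so the results are worth spot-checking on real addresses.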

regex r text-analysis roman-numerals

Pri*_*a T

2018-03-07

5
votes

1
answers

406
views

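Remove German stop words in R

I have survey data with a comments column. I want to run sentiment analysis on the responses. The problem is that the data contains many languages, and I don't know how to remove stop words for multiple languages from the set.

'nps' is my data source, and nps$customer.feedback is the comments column.

First I tokenize the data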

#TOKENISE
comments <- nps %>% 
  filter(!is.na(customer.feedback)) %>%
  select(cat, Comment) %>% 
  group_by(row_number(), cat) 

  comments <- comments %>% ungroup()

Get rid of the stop words

nps_words <-  nps_words %>% anti_join(stop_words, by = c('word'))

Then I use stemming and get_sentiments("bing") to show word counts by sentiment.

 #stemgraph
  nps_words %>% 
  mutate(word = wordStem(word)) %>% 
  inner_join(get_sentiments("bing") %>% mutate(word = wordStem(word)), by = 
  c('word')) %>%
  count(cat, word, sentiment) %>%
  group_by(cat, sentiment) %>%
  top_n(7) %>%
  ungroup() %>%
  ggplot(aes(x=reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap( ~cat, scales = …

text r text-analysis text-mining

Sea*_*n M

2018-08-21

5
votes

1
answers

4479
views

Error installing OpenGrm Thrax

I have installed OpenFst on Ubuntu and it works fine. Now I am trying to install OpenGrm Thrax. I have tried installing two different versions of Thrax.

Thrax version 1.1.0:

thraxOpenGrm/thrax-1.1.0$ ./configure

Here is the error I get.

checking how to hardcode library paths into programs... immediate
checking for bison... no
checking for byacc... no
checking for std::tr1::hash<long long unsigned>... yes
checking for __gnu_cxx::slist<int>... yes
checking fst/fst.h usability... yes
checking fst/fst.h presence... no
configure: WARNING: fst/fst.h: accepted by the compiler, rejected by the preprocessor!
configure: WARNING: fst/fst.h: proceeding with the compiler's result
checking for fst/fst.h... yes
checking fst/extensions/far/far.h usability... yes
checking fst/extensions/far/far.h presence... no
configure: WARNING: fst/extensions/far/far.h: accepted by …

c++ text-analysis text-mining ubuntu-14.04 openfst

Tej*_*sad

2015-03-26

4
votes

1
answers

1803
views