Tag: tokenize

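A simple recursive descent parser?

I am writing a parser for a templating language that compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided to write a parser instead. I began with a simple parser that remembered state by pushing to and popping from a stack, but things kept escalating until I had a recursive descent parser on my hands.

Soon after, I compared the performance of all my previous parsing methods. The recursive descent parser was by far the slowest. I'm stumped: is it worth using a recursive descent parser for something simple, or am I justified in taking shortcuts? I would love to go the pure regex route, which is very fast (almost 3 times faster than the RD parser), but it is hacky and to some degree unmaintainable. I suppose performance isn't terribly important because compiled templates are cached, but is a recursive descent parser the right tool for every task? I guess my question could be viewed as a more philosophical one: to what degree is it worth sacrificing maintainability/flexibility for performance?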
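For readers unfamiliar with the technique, here is a minimal sketch of a recursive descent parser for a toy template syntax. The grammar (`{{ name }}` and `{% if name %}...{% endif %}`), the function names and the direct rendering step are all assumptions made for illustration; the question does not show the actual language, and it compiles to JS rather than rendering in Python.

```python
import re

# Hypothetical mini-grammar (not the asker's language):
#   template := (TEXT | "{{ name }}" | "{% if name %}" template "{% endif %}")*
TOKEN_RE = re.compile(
    r"\{\{\s*(\w+)\s*\}\}|\{%\s*if\s+(\w+)\s*%\}|\{%\s*(endif)\s*%\}"
)


def tokenize(source):
    """Split the source into (kind, value) tokens: text, var, if, endif."""
    tokens, pos = [], 0
    for m in TOKEN_RE.finditer(source):
        if m.start() > pos:
            tokens.append(("text", source[pos:m.start()]))
        if m.group(1):
            tokens.append(("var", m.group(1)))
        elif m.group(2):
            tokens.append(("if", m.group(2)))
        else:
            tokens.append(("endif", None))
        pos = m.end()
    if pos < len(source):
        tokens.append(("text", source[pos:]))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def parse_template(self, inside_if=False):
        """template := (text | var | if_block)*"""
        nodes = []
        while self.pos < len(self.tokens):
            kind, value = self.tokens[self.pos]
            if kind == "endif":
                if not inside_if:
                    raise SyntaxError("unexpected endif")
                return nodes  # the caller consumes the endif token
            self.pos += 1
            if kind == "if":
                nodes.append(self.parse_if(value))
            else:
                nodes.append((kind, value))
        if inside_if:
            raise SyntaxError("missing endif")
        return nodes

    def parse_if(self, name):
        body = self.parse_template(inside_if=True)  # recursion handles nesting
        self.pos += 1  # consume the endif token
        return ("if", name, body)


def render(nodes, context):
    """Evaluate the tree directly (a stand-in for code generation)."""
    out = []
    for node in nodes:
        if node[0] == "text":
            out.append(node[1])
        elif node[0] == "var":
            out.append(str(context.get(node[1], "")))
        elif context.get(node[1]):  # "if" node: render body when truthy
            out.append(render(node[2], context))
    return "".join(out)


ast = Parser(tokenize("Hi {{ name }}{% if admin %} (admin){% endif %}!")).parse_template()
print(render(ast, {"name": "Ann", "admin": True}))  # Hi Ann (admin)!
```

Note that the regex still does the lexing here; the recursive part only handles nesting, which is the piece a flat regex approach tends to struggle with.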

javascript parsing templates tokenize lexer

lti*_*mer

lucky-day

6
Votes

2
Answers

2141
Views

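How to stop results in Solr when the phrase contains a stop word?

I have a problem when searching Solr for a phrase that contains a stop word. Solr returns results for the phrase despite the stop word, which is not the output I expected.

I added the word "test" to the stopwords.txt file. In schema.xml, I have a field like this: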

<field name="searchword" type="text" indexed="true" stored="true"   />


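I indexed some data and then searched in the Solr browser window with searchword:"test", and I got no results. Then I searched with a phrase, searchword:"test data", and I got results. How can I avoid this? If the phrase contains a stop word, Solr should not return any results.

Here is the fieldType I am using: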

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
    <analyzer type="query">         
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" type="phrase"/>
    </analyzer>
</fieldType>


I need a solution in which Solr returns no results when I search for a phrase that contains the stop word (test).
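One plausible reading of the behaviour is that the query-time stop filter removes "test" from the phrase, so "test data" is effectively searched as "data". The toy Python simulation below illustrates that idea and one possible application-side guard. It is not Solr code: `STOP_WORDS`, `strip_stop_words` and `phrase_has_stop_word` are hypothetical names, and the whitespace split is an assumption standing in for the real analyzer chain.

```python
# Assumed to mirror the contents of stopwords.txt.
STOP_WORDS = {"test"}


def strip_stop_words(phrase, stop_words=STOP_WORDS):
    """Roughly what a stop filter does: drop stop words from the token list."""
    return [w for w in phrase.split() if w.lower() not in stop_words]


def phrase_has_stop_word(phrase, stop_words=STOP_WORDS):
    """Application-side check that could be run before sending the query."""
    return any(w.lower() in stop_words for w in phrase.split())


print(strip_stop_words("test"))           # [] -> nothing left to match
print(strip_stop_words("test data"))      # ['data'] -> still matches documents
print(phrase_has_stop_word("test data"))  # True -> caller could skip the query
```

With a guard like this, the application would return an empty result set itself whenever the check is true, instead of relying on the analyzer to reject the phrase.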

search solr tokenize stop-words

Sri*_*m M

2011-11-30

6
Votes

1
Answers

1292
Views

ElasticSearch Stemming

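I am using ElasticSearch and I want to set up basic stemming for English. So basically, fighter returns fight or any word containing the root fight.

I'm a bit confused about how to implement this. I have been reading up on analyzers, tokenizers and filters, and there are several stemming algorithms available in ElasticSearch. I'm just not sure which combination to use: snowball, stemmer, porter stem, or a synonym filter.

Also, an example of the mapping would be very helpful.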
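As a rough illustration of what such a mapping could look like, the sketch below builds an index definition as a Python dict and prints it as JSON. The analyzer name `english_stem` and the field name `body` are placeholders. The typeless `mappings` layout and the `text` field type assume a recent Elasticsearch version; older releases nest the properties under a type name. The `standard` tokenizer and the `lowercase` and `porter_stem` token filters are built in, but whether Porter stemming maps a given word such as "fighter" to the root you expect should be checked with the analyze API rather than assumed.

```python
import json

# Placeholder names: "english_stem" (analyzer) and "body" (field).
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "english_stem": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first, then apply the Porter stemmer.
                    "filter": ["lowercase", "porter_stem"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "english_stem"}
        }
    },
}

# This JSON is what would be sent as the body of the index creation request.
print(json.dumps(index_body, indent=2))
```

Because the same analyzer runs at index and query time by default, a query term and an indexed term match whenever they reduce to the same stem.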

lucene stemming tokenize analyzer elasticsearch

Gab*_*bar

2013-01-30

6
Votes

1
Answers

5866
Views

How to tokenize a Chinese document

I will be receiving documents written in Chinese, which I have to tokenize and save in a database table. I am trying Lucene's CJKBigramFilter, but all it does is join two characters together, and the resulting meaning differs from what is in the document. Suppose this is a line in the file, "Hello My name is Pradeep", which in Traditional Chinese is "你好我的名字是普拉迪普". When I tokenize it, it is converted into the two-character words below.

你好 - Hello
名字 - Name
好我 - Well, I
字是 - Word is
我的 - My
拉迪 - Radi
是普 - Is S & P
普拉 - Pula
的名 - In the name of
迪普 - Dipp

All I want is for it to be converted into the same English translation.
I am using Lucene for this... if you know of any other suitable open source tool, please point me to it.
Thanks in advance
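To make the difference concrete, the sketch below contrasts overlapping bigrams (the kind of output described in the question) with a simple dictionary-based segmentation, using the sentence from the question written out as Chinese characters. It is plain Python rather than Lucene, the forward maximum matching strategy is just one common approach, and the three-word `VOCAB` is a toy vocabulary assumed for illustration. Neither function translates anything; translation is a separate step from tokenization.

```python
def bigrams(text):
    """Overlapping two-character tokens, similar in spirit to a CJK bigram filter."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def forward_max_match(text, vocabulary, max_len=4):
    """Greedy segmentation: take the longest known word at each position,
    falling back to a single character when nothing in the vocabulary fits."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary, assumed for illustration only.
VOCAB = {"你好", "名字", "普拉迪普"}
line = "你好我的名字是普拉迪普"

print(bigrams(line))
# ['你好', '好我', '我的', '的名', '名字', '字是', '是普', '普拉', '拉迪', '迪普']
print(forward_max_match(line, VOCAB))
# ['你好', '我', '的', '名字', '是', '普拉迪普']
```

The quality of the dictionary approach depends entirely on the vocabulary, which is why real segmenters ship with large word lists or statistical models.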

java tokenize

Pra*_*eep

lucky-day

6
Votes

1
Answers

5429
Views

C++ tokenize std string

Possible duplicate:
How do I tokenize a string in C++?

Hello, I would like to know how to tokenize a std string with strtok.

string line = "hello, world, bye";    
char * pch = strtok(line.c_str(),",");


I get the following error:

error: invalid conversion from ‘const char*’ to ‘char*’
error: initializing argument 1 of ‘char* strtok(char*, const char*)’


I'm looking for a quick and easy way to do this, as I don't think it should take much time.

c++ tokenize strtok

Dan*_*ore

2017-05-23

6
Votes

2
Answers

30k
Views

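Text tokenization with Stanford NLP: filtering unwanted words and characters

I use Stanford NLP for string tokenization in my classification tool. I want only meaningful words, but I get non-word tokens (like ---, >, . etc.) and unimportant words like am, is, to (stop words). Does anybody know a way to solve this problem?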

java machine-learning tokenize stanford-nlp

dmi*_*ony

2018-10-03

6
Votes

2
Answers

6240
Views

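How do you extract only the date from a Python datetime?

I have a dataframe in Python. One of its columns is labelled time, which is a timestamp. Using the following code, I have converted the timestamp to datetime: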

milestone['datetime'] = milestone.apply(lambda x: datetime.datetime.fromtimestamp(x['time']), axis = 1)


Now I want to separate (tokenize) the date and time and have two different columns, like milestone['only_date'] and milestone['only_time']. How do I do this?
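A minimal sketch of one way to get the two columns, assuming the time column holds Unix timestamps in seconds. The sample values are made up for illustration. Note that `pd.to_datetime(..., unit="s")` interprets the values as UTC, whereas `datetime.fromtimestamp` in the question uses the local timezone, so the two conversions can differ by the UTC offset.

```python
import pandas as pd

# Sample data assumed for illustration; "time" holds Unix timestamps in seconds.
milestone = pd.DataFrame({"time": [1448067661, 1448154061]})

# Vectorized conversion of the whole column (values are treated as UTC).
milestone["datetime"] = pd.to_datetime(milestone["time"], unit="s")

# The .dt accessor exposes the date and time parts of a datetime column.
milestone["only_date"] = milestone["datetime"].dt.date
milestone["only_time"] = milestone["datetime"].dt.time

print(milestone[["only_date", "only_time"]])
```

Both new columns hold plain Python `datetime.date` and `datetime.time` objects, so they have object dtype rather than a pandas datetime dtype.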

python datetime tokenize pandas

SZA*_*SZA

2015-11-21

6
Votes

1
Answers

40k
Views

NLTK French tokenizer in Python not working

Why is the French tokenizer that comes with Python not working for me? Am I doing something wrong?

I am doing:

import nltk
content_french = ["Les astronomes amateurs jouent également un rôle important en recherche; les plus sérieux participant couramment au suivi d'étoiles variables, à la découverte de nouveaux astéroïdes et de nouvelles comètes, etc.", 'Séquence vidéo.', "John Richard Bond explique le rôle de l'astronomie."]
tokenizer = nltk.data.load('tokenizers/punkt/PY3/french.pickle')
for i in content_french:
        print(i)
        print(tokenizer.tokenize(i))


But I get non-tokenized output:

John Richard Bond explique le rôle de l'astronomie.
["John Richard Bond explique le rôle de l'astronomie."]
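The Punkt models under tokenizers/punkt are sentence tokenizers, so getting each one-sentence string back as a single item is consistent with sentence splitting rather than with a failure. For comparison, here is a rough standard-library sketch of word-level splitting. The regex is an assumption made for illustration and is not NLTK's own word tokenizer; it keeps elided forms such as l' and d' together with their apostrophe.

```python
import re

# Elided word with its apostrophe (l', d'), then plain words, then punctuation.
WORD_RE = re.compile(r"\w+'|\w+|[^\w\s]")

sentence = "John Richard Bond explique le rôle de l'astronomie."
print(WORD_RE.findall(sentence))
# ['John', 'Richard', 'Bond', 'explique', 'le', 'rôle', 'de', "l'", 'astronomie', '.']
```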


python tokenize nltk

Ati*_*rag

lucky-day

6
Votes

2
Answers

5048
Views

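Tokenizing with regex (parentheses)

I have the following text:

I don't like to eat Cici's food (it is true)

I need it to be tokenized as

['i', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

I found that the following regex, (['()\w]+|\.), splits it as follows:

['i', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(it', 'is', 'true)']

How can I take the parentheses out of the tokens and make them tokens of their own?

Thanks for your ideas.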
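One possible approach is to move the parentheses out of the word character class and into their own alternative, so each one matches as a separate token. The sketch below uses Python, which is an assumption since the question names no language. It keeps the original casing of the text, so the first token is 'I' rather than 'i'.

```python
import re

text = "I don't like to eat Cici's food (it is true)"

# A run of word characters and apostrophes, or a single parenthesis or period.
TOKEN_RE = re.compile(r"[\w']+|[().]")

tokens = TOKEN_RE.findall(text)
print(tokens)
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
```

Because `[().]` matches exactly one character, consecutive parentheses such as "((" also come out as separate tokens.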

regex string split tokenize

Jür*_* K.

2019-10-20

6
Votes

1
Answers

7063
Views

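Token indices sequence length error when using the encode_plus method

While trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library, I ran into a strange error.

I am using data from this Kaggle competition. Given a question title, a question body and an answer, the model must predict 30 values (a regression problem). My goal is to get the following encoding as input to BERT:

[CLS] question_title question_body [SEP] answer [SEP]

However, when I try to use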

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")


and encode only the second input from train.csv as follows:

inputs = tokenizer.encode_plus(
            df_train["question_title"].values[1] + " " + df_train["question_body"].values[1], # first sequence to be encoded
            df_train["answer"].values[1], # second sequence to be encoded
            add_special_tokens=True, # [CLS] and 2x [SEP] 
            max_len = 512,
            pad_to_max_length=True
            )


I get the following error:

Token indices sequence length is longer than the specified maximum sequence length for this model (46 > 512). …


nlp tokenize bert-language-model huggingface-transformers

Nie*_*els

2020-06-20

6
Votes

1
Answers

3801
Views

Tag statistics

tokenize ×10

java ×2

python ×2

analyzer ×1

bert-language-model ×1

c++ ×1

datetime ×1

elasticsearch ×1

huggingface-transformers ×1

javascript ×1

lexer ×1

lucene ×1

machine-learning ×1

nlp ×1

nltk ×1

pandas ×1

parsing ×1

regex ×1

search ×1

solr ×1

split ×1

stanford-nlp ×1

stemming ×1

stop-words ×1

string ×1

strtok ×1

templates ×1
