I want to tokenize about 100 lines of text that look like this:
<word> <unknown number of spaces and tabs> <number>
I'm having trouble finding a tokenize function in VBA. What is the simplest way to tokenize strings like this in VBA?
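The splitting logic itself is small once a run of spaces and tabs is treated as a single separator. A minimal sketch of that logic, written in Python rather than VBA purely for illustration; `parse_line` and the sample lines are hypothetical, and the number is assumed to be an integer:

```python
def parse_line(line):
    """Split '<word> <spaces/tabs> <number>' into a (word, number) pair."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces and tabs alike) and ignores leading/trailing whitespace.
    parts = line.split()
    if len(parts) != 2:
        raise ValueError("expected exactly two fields: %r" % line)
    word, number = parts
    return word, int(number)  # assumption: the number is an integer


lines = ["apple \t  12", "pear\t\t7", "plum     103"]  # made-up sample data
pairs = [parse_line(line) for line in lines]
print(pairs)  # [('apple', 12), ('pear', 7), ('plum', 103)]
```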
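Here is the code I currently have, but the regular expression is wrong and does not split the sentences correctly. Please help me fix my regex. Thanks.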
import nltk
import os, sys, re, glob
from nltk.tokenize import RegexpTokenizer

# A sentence is a run of non-terminator characters followed by one terminator.
jp_sent_tokenizer = RegexpTokenizer(u'[^??????]*[???]')
para = []
para.append(jp_sent_tokenizer.tokenize(u' ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????otak otak ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? '))
for sentence in para[0]:
    print(sentence)
    print('this is eos')
    #print line
print('this is eop')
I get this output:
??????????????????????????????????????????
this is eos
????????????????????????????????????????????????????????????????????????????
this is eos
???????????????????????
this is eos
???????
this is eos
??????????????????????????????????
this is eos
??????????????????????????????????????????????????????????
this is eos
this is eop
The correct output should look like this:
??????????????????????????????????????????????
this is eos
????????????????????????????????????????????????????????????????????????????
this is eos
???????????????????????
this is …
I'm using the CString::Tokenize method to tokenize a string with a delimiter, but I noticed something odd. I call the method on the string inside a loop because I want to retrieve all the tokens in the string. Here is my code:
CString strToken;
for(int nTokenPos = 0; nTokenPos < dialog->myValue.GetLength(); nTokenPos++)
{
    //TRACE( "The Size of the string is %d\n", dialog->myValue.GetLength());
    TRACE( "Iteration No %d\n",nTokenPos);
    strToken = dialog->myValue.Tokenize(_T("X"), nTokenPos);
    strToken+="\n";
    OutputDebugString(strToken);
}
Note: dialog->myValue is the string I want to tokenize. When I test this code on "99X1596" (for example), the output is:
Iteration No 0
99
Iteration No 4
596
Another example: '4568X6547' outputs:
Iteration No 0
4568
Iteration No 6
547
I don't know why it ignores the first character after the delimiter "X", and why it also skips iterations!
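A likely explanation, assuming Tokenize takes the start position by reference and leaves it just past the delimiter that ended the token: the loop's own nTokenPos++ then advances the position a second time, which drops one character and makes the iteration counter jump. The sketch below is a simplified Python model of that behaviour, not the MFC implementation; `tokenize`, `buggy_loop` and `correct_loop` are hypothetical names.

```python
def tokenize(s, delims, start):
    """Simplified model of CString::Tokenize(delims, start).

    Returns (token, new_start). new_start points just past the delimiter
    that ended the token, or is -1 when no token is left.
    """
    n = len(s)
    if start < 0 or start >= n:
        return "", -1
    # Skip any delimiters sitting at the current position.
    while start < n and s[start] in delims:
        start += 1
    if start >= n:
        return "", -1
    end = start
    while end < n and s[end] not in delims:
        end += 1
    return s[start:end], end + 1


def buggy_loop(s, delims):
    """Mirrors the for loop in the question: the position advances twice."""
    seen = []
    pos = 0
    while pos < len(s):
        iteration = pos
        token, pos = tokenize(s, delims, pos)
        seen.append((iteration, token))
        if pos < 0:
            break  # guard for the model only; no tokens are left
        pos += 1  # the for loop's nTokenPos++ skips one more character
    return seen


def correct_loop(s, delims):
    """Let the tokenizer alone move the position; stop on an empty token."""
    tokens = []
    token, pos = tokenize(s, delims, 0)
    while token != "":
        tokens.append(token)
        token, pos = tokenize(s, delims, pos)
    return tokens


print(buggy_loop("99X1596", "X"))    # [(0, '99'), (4, '596')]
print(correct_loop("99X1596", "X"))  # ['99', '1596']
```

Under the same assumption, the C++ fix would be to drop the for loop's increment and loop while the returned token is non-empty.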
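I'm trying to index strings that contain hyphens but no spaces, periods, or any other punctuation. I don't want the words to be split on the hyphens; I want the hyphens to be part of the indexed text.
For example, my 6 text strings are:
I want to be able to search these strings for text that contains "play" or text that starts with "magazine".
I have been able to get the "contains play" search working using ngram. However, the hyphens cause the text to be split, so the results include strings where "magazine" appears in a word after a hyphen. I only want strings that start with "magazine" to appear.
Based on the examples above, only these 3 should appear when searching for strings starting with "magazine":
Please help with my ElasticSearch index example: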
DELETE /sample
PUT /sample
{
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_replicas": 0,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "word_delimiter_filter": {
          "type": "word_delimiter",
          "preserve_original": true,
          "catenate_all": true
        }
      },
      "analyzer": {
        "ngram_index_analyzer": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter" …
Suppose I have a string like this:
"IgotthistextfromapdfIscraped.HowdoIsplitthis?"
I want to produce:
"I got this text from a pdf I scraped. How do I split this?"
How do I do that?
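One common approach is dictionary-based word segmentation with dynamic programming. The sketch below assumes a tiny hand-written vocabulary (`VOCAB`) that happens to cover the example sentence; `segment` and `respace` are hypothetical helper names. A real solution would need a large word list and, for ambiguous splits, word-frequency scoring.

```python
import re

# Assumption: a toy vocabulary that only covers the example sentence.
VOCAB = {"i", "got", "this", "text", "from", "a", "pdf",
         "scraped", "how", "do", "split"}


def segment(text, vocabulary):
    """Split a run of letters into known words, or return None.

    best[i] holds a list of words covering text[:i], or None if that
    prefix cannot be segmented with the given vocabulary.
    """
    max_len = max(len(word) for word in vocabulary)
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Smaller start values are tried first, so longer words win.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece.lower() in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[len(text)]


def respace(text, vocabulary):
    """Re-insert spaces, keeping punctuation attached to the previous word."""
    out = []
    for part in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        if part[0].isalpha():
            words = segment(part, vocabulary)
            # Fall back to the unsplit run if no segmentation exists.
            out.append(" ".join(words) if words else part)
        else:
            out.append(part + " ")
    return "".join(out).strip()


print(respace("IgotthistextfromapdfIscraped.HowdoIsplitthis?", VOCAB))
# I got this text from a pdf I scraped. How do I split this?
```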
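This question is related to: InvalidArgumentError (see above for traceback): indices[1] = 10 is not in [0, 10). I need this for R, so I need a different solution from the one given in the linked question.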
library(keras)  # provides text_tokenizer(), fit_text_tokenizer() and %>%

maxlen <- 40
chars <- c("'", "-", " ", "!", "\"", "(", ")", ",", ".", ":", ";", "?", "[", "]", "_", "=", "0", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z")
# Character-level tokenizer with no characters filtered out
tokenizer <- text_tokenizer(char_level = TRUE, filters = NULL)
tokenizer %>% fit_text_tokenizer(chars)
unlist(tokenizer$word_index)
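The output is:
' - ! " ( ) , …
It's not entirely clear from the documentation, but I can see that BertTokenizer is initialized with pad_token='[PAD]', so I assumed that encoding with add_special_tokens=True would pad automatically. Given that pad_token_id=0, I don't see any 0s in token_ids in the following: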
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text, add_special_tokens=True, max_length=2048)
# Print the original sentence.
print('Original: ', text)
# Print the sentence split into tokens.
print('\nTokenized: ', tokens)
# Print the sentence mapped to token ids.
print('\nToken IDs: ', token_ids)
Output:
Original: Toronto's key stock index ended higher in brisk trading on Thursday, extending Wednesday's rally despite being weighed down by losses …
Suppose I have a string containing Python code.
input = """import nltk
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
words=["connect","connected","connection","connections","connects"]
stemmed_words=[porter_stemmer.stem(word) for word in words]
stemmed_words"""
How can I tokenize this code? I found the tokenize module (https://docs.python.org/3/library/tokenize.html), but it's not clear to me how to use it. It has tokenize.tokenize(readline), but the argument takes a generator, not a string.
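The readline argument is a callable that returns one line per call, and a file-like wrapper around the string provides one. A minimal sketch using tokenize.generate_tokens, which accepts str lines; tokenize.tokenize expects bytes instead, so it would need io.BytesIO(source.encode("utf-8")).readline. The variable name `source` is just an illustrative choice.

```python
import io
import tokenize

source = '''import nltk
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
words=["connect","connected","connection","connections","connects"]
stemmed_words=[porter_stemmer.stem(word) for word in words]
stemmed_words'''

# io.StringIO wraps the string in a file-like object whose readline
# method hands the tokenizer one line of source per call.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Show the token type name and text of the first few tokens.
for tok in tokens[:8]:
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

The code in the string is only tokenized, never executed, so nltk does not have to be installed for this to run.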
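I'm just using the Huggingface transformers library and get the following message when running run_lm_finetuning.py: AttributeError: 'GPT2TokenizerFast' object has no attribute 'max_len'. Has anyone else had this problem or know how to fix it? Thanks!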
My full experiment run:
mkdir experiments
for epoch in 5
do
    python run_lm_finetuning.py \
        --model_name_or_path distilgpt2 \
        --model_type gpt2 \
        --train_data_file small_dataset_train_preprocessed.txt \
        --output_dir experiments/epochs_$epoch \
        --do_train \
        --overwrite_output_dir \
        --per_device_train_batch_size 4 \
        --num_train_epochs $epoch
done
transformer-model tokenize huggingface-transformers huggingface-tokenizers gpt-2
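One possible cause, offered as an assumption rather than a confirmed diagnosis: newer transformers releases expose the limit as model_max_length, while older example scripts such as run_lm_finetuning.py still read tokenizer.max_len. A minimal sketch of a compatibility helper follows; `tokenizer_max_length` and the `FakeTokenizer` stand-in are hypothetical names, used so the example runs without downloading a model.

```python
def tokenizer_max_length(tokenizer):
    """Return the maximum sequence length under either attribute name."""
    # Try the newer attribute name first, then the older one.
    for name in ("model_max_length", "max_len"):
        value = getattr(tokenizer, name, None)
        if value is not None:
            return value
    raise AttributeError(
        "tokenizer exposes neither model_max_length nor max_len")


class FakeTokenizer:
    """Stand-in for a tokenizer object; not part of transformers."""
    model_max_length = 1024


print(tokenizer_max_length(FakeTokenizer()))  # 1024
```

If that assumption holds, the equivalent change in the script would be to read tokenizer.model_max_length wherever it reads tokenizer.max_len, or to run the version of the example script that matches the installed library.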
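I'm using Huggingface Transformers to train an XLM-RoBERTa model for token classification. The maximum token length of my already fine-tuned model is 166. I truncated longer sequences and padded shorter ones in the training data. Now, during inference/prediction, I want to predict all tokens, even for sequences longer than 166. However, if I read the documentation correctly, overflowing tokens are simply discarded. Is that right? I'm not entirely sure what the "return_overflowing_tokens" and stride parameters do. Can they be used to split overly long sequences into two or more shorter sequences?
I have already tried splitting the text data into sentences to get smaller chunks, but some of them still exceed the maximum token length. Ideally, the overflowing tokens would automatically be added to an additional sequence.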
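The two parameters appear to exist for this purpose: with a fast tokenizer, truncation combined with return_overflowing_tokens=True should return the overflow as additional sequences, and stride sets how many tokens consecutive sequences share. The windowing idea can be sketched in plain Python. `split_with_stride` is a hypothetical helper, not the library's implementation, and it ignores the special tokens the tokenizer adds to each sequence.

```python
def split_with_stride(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows overlap by `stride` tokens, so tokens near a
    window boundary are also seen with context in the next window.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window reached the end of the sequence
        start += step
    return windows


ids = list(range(10))  # made-up token ids
print(split_with_stride(ids, max_length=4, stride=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With a limit of 166, the usable window would be slightly smaller once the start and end special tokens are counted, and predictions for the overlapping tokens have to be merged afterwards, for example by keeping the prediction from the window where the token has more context.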