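I have been working through the character-level text generation example: https://keras.rstudio.com/articles/examples/lstm_text_generation.html
I am having trouble extending this example to a word-level model. See the reprex below.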
library(keras)
library(readr)
library(stringr)
library(purrr)
library(tokenizers)
# Parameters
maxlen <- 40
# Data Preparation
# Retrieve text
path <- get_file(
  'nietzsche.txt',
  origin = 'https://s3.amazonaws.com/text-datasets/nietzsche.txt'
)
# Load, collapse, and tokenize text
text <- read_lines(path) %>%
  str_to_lower() %>%
  str_c(collapse = "\n") %>%
  tokenize_words(simplify = TRUE)
print(sprintf("corpus length: %d", length(text)))
words <- text %>%
  unique() %>%
  sort()
print(sprintf("total words: %d", length(words)))
This gives:
[1] "corpus length: 101345"
[1] "total words: 10283"
When I move on to the next step, I run into problems:
# Cut the text in semi-redundant sequences of maxlen …
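For reference, here is a minimal sketch of what the cutting step could look like at the word level. It is not the code from the linked example. It assumes `text` is the character vector of words produced by `tokenize_words()` and `words` is the sorted vocabulary, as above. A toy corpus stands in for the real one so the snippet runs on its own, and `step` is a hypothetical stride chosen for illustration. Windows are stored as integer word indices rather than a one-hot array, since a one-hot array over roughly 10,000 words would be very large.

```r
# Toy stand-in for the tokenized corpus (assumption: the real `text` is the
# character vector of words returned by tokenize_words(simplify = TRUE)).
text <- c("the", "cat", "sat", "on", "the", "mat", "and",
          "the", "dog", "sat", "on", "the", "rug")
words <- sort(unique(text))

maxlen <- 4  # window length in words (the post uses 40)
step <- 3    # hypothetical stride between consecutive windows

# Start position of each window; every window needs one following word
# to serve as the prediction target.
starts <- seq(1, length(text) - maxlen, by = step)

# One row per window, one column per position, holding 1-based indices
# into `words`.
x <- t(vapply(
  starts,
  function(i) match(text[i:(i + maxlen - 1)], words),
  integer(maxlen)
))

# Index of the word that immediately follows each window.
y <- match(text[starts + maxlen], words)
```

Here `x` has one row per window and `maxlen` columns, and `words[y]` recovers the target word for each row. How these indices are fed to the model (for example through an embedding layer) is a separate design choice that this sketch does not cover.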