Problem training on text with AdaGram.jl

Question


suh*_*rav 2 machine-learning julia word2vec

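I am new to the Julia programming language. I am trying to set up the Adaptive Skip-gram (AdaGram) model on my machine and have run into the following problem. Before training the model, we need a tokenized file and a dictionary file. My question is: what input should be given to tokenize.sh and dictionary.sh? Please let me know the actual way to generate the output files, and which file extension they should have.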

The project I am referring to is https://github.com/sbos/AdaGram.jl. It is very similar to https://code.google.com/p/word2vec/.

Answer 1

Vin*_*ynd 5

The package provides some shell scripts to preprocess the data and fit the model: you have to call them from the shell (i.e., outside Julia).

# Install the package
julia -e 'Pkg.clone("https://github.com/sbos/AdaGram.jl.git")'
julia -e 'Pkg.build("AdaGram")'

# Download some text
wget http://www.gutenberg.org/ebooks/100.txt.utf-8

# Tokenize the text, and count the words
~/.julia/v0.3/AdaGram/utils/tokenize.sh 100.txt.utf-8 text.txt
~/.julia/v0.3/AdaGram/utils/dictionary.sh text.txt dictionary.txt

# Train the model
~/.julia/v0.3/AdaGram/train.sh text.txt dictionary.txt model
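
In the commands above, tokenize.sh takes the raw text file and writes a tokenized text file, and dictionary.sh takes that tokenized file and writes a word-count file; both outputs are plain text, so the .txt extension is only a convention. To give a feel for what such preprocessing typically produces, here is a minimal sketch using only standard Unix tools. The function names tokenize_sketch and dictionary_sketch, the sample file names, and the "word count" line layout are assumptions made for illustration; they are not taken from the package, whose own scripts may behave differently.

```shell
# Hypothetical stand-ins for tokenize.sh and dictionary.sh, built only from
# standard Unix tools. These are NOT the package's scripts; they only
# illustrate the kind of files such preprocessing steps usually produce.

# Tokenize: lowercase the input and collapse every run of
# non-alphanumeric characters (punctuation, newlines) into one space.
tokenize_sketch() {
    tr '[:upper:]' '[:lower:]' < "$1" | tr -cs '[:alnum:]' ' ' > "$2"
}

# Dictionary: count each word and write one "word count" pair per line
# (assumed layout), most frequent words first, ties in alphabetical order.
dictionary_sketch() {
    tr -s ' ' '\n' < "$1" | grep -v '^$' | sort | uniq -c \
        | sort -k1,1nr -k2,2 | awk '{print $2, $1}' > "$2"
}

# Tiny demonstration on a two-line sample file
printf 'To be, or not to be:\nthat is the question.\n' > sample_raw.txt
tokenize_sketch sample_raw.txt sample_text.txt
dictionary_sketch sample_text.txt sample_dictionary.txt
cat sample_dictionary.txt
```

For real training data, prefer the scripts shipped with the package, since train.sh presumably expects exactly the format they emit.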


You can then use the model from Julia:

using AdaGram
vm, dict = load_model("model");
expected_pi(vm, dict.word2id["hamlet"])
nearest_neighbors(vm, dict, "hamlet", 1, 10)

