Shuffling two parallel text files

Question

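I have two sentence-aligned parallel corpora (text files) with about 50 million words. (They come from the Europarl corpus -> parallel translations of legal documents.) I now want to shuffle the lines of both files, but both in the same way. I wanted to solve this with gshuf (I'm on a Mac), giving both calls the same random source.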

gshuf --random-source /path/to/some/random/data file1
gshuf --random-source /path/to/some/random/data file2

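But I get the error message "end of file", apparently because the random seed needs to contain all the words that the file to be shuffled contains. Is that true? If so, how should I create a random seed that fits my needs? If not, in what other way can I randomize the files in parallel? I thought about pasting them together, shuffling, and then splitting them again. However, that seems ugly, because I would first need to find a delimiter that does not occur in the files.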
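For reference, the paste-shuffle-split idea described above could look roughly like the following. This is only a sketch under assumptions: the file names are hypothetical, a tab character is assumed to be a safe delimiter (the script checks for that first), and on macOS `shuf` would be `gshuf`.

```shell
# Hypothetical sample files standing in for the two aligned corpora.
printf '%s\n' en1 en2 en3 en4 > a.txt
printf '%s\n' fr1 fr2 fr3 fr4 > b.txt

tab=$(printf '\t')

# Refuse to continue if either file already contains a tab,
# because the split step below relies on tab as the delimiter.
if grep -q "$tab" a.txt b.txt; then
  echo "a tab occurs in the input; choose another delimiter" >&2
else
  # Join line i of a.txt with line i of b.txt, shuffle the joined lines,
  # then split them back into two files that are still aligned.
  paste a.txt b.txt | shuf > both.shuf
  cut -f1 both.shuf > a.shuf
  cut -f2 both.shuf > b.shuf
fi
```

Because each pair travels through `shuf` as a single line, the two output files stay aligned without any shared random source.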

Answer 1

fro*_*utz 13

I don't know whether there is a more elegant way, but this works for me:

mkfifo onerandom tworandom threerandom
tee onerandom tworandom threerandom < /dev/urandom > /dev/null &
shuf --random-source=onerandom onefile > onefile.shuf &
shuf --random-source=tworandom twofile > twofile.shuf &
shuf --random-source=threerandom threefile > threefile.shuf &
wait

Result:

$ head -n 3 *.shuf
==> onefile.shuf <==
24532 one
47259 one
58678 one

==> threefile.shuf <==
24532 three
47259 three
58678 three

==> twofile.shuf <==
24532 two
47259 two
58678 two

However, the files must have exactly the same number of lines.
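Since everything depends on equal line counts, it may be worth checking them before shuffling. A minimal sketch, with hypothetical file names:

```shell
# Hypothetical sample files; substitute the real corpora.
printf 'a\nb\nc\n' > left.txt
printf 'x\ny\nz\n' > right.txt

# $(( ... )) strips the leading padding that BSD/macOS wc prints.
lines_left=$(( $(wc -l < left.txt) ))
lines_right=$(( $(wc -l < right.txt) ))

if [ "$lines_left" -eq "$lines_right" ]; then
  echo "ok: both files have $lines_left lines"
else
  echo "line counts differ: $lines_left vs $lines_right" >&2
fi
```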

The GNU Coreutils documentation also offers a nice solution for repeatable randomness, using openssl as a seeded random generator:

https://www.gnu.org/software/coreutils/manual/html_node/Random-sources.html#Random-sources
get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

shuf -i1-100 --random-source=<(get_seeded_random 42)

However, consider using a better seed than "42", unless you want other people to be able to reproduce "your" random result as well.
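Applied to the two-file case from the question, the same seeded stream can be handed to each shuf call. The sketch below is a variation on the documented function, not the documentation's own code: it writes a finite number of seeded bytes to a file and reuses that file for both calls. The helper name make_seeded_bytes, the file names, the seed, and the 1 MiB size are all assumptions for illustration; on macOS `shuf` would be `gshuf`.

```shell
# Write a finite, reproducible byte stream to a file.
# In CTR mode the output is as long as the input, so nbytes zero bytes
# from /dev/zero yield nbytes pseudo-random bytes.
make_seeded_bytes() {
  seed="$1"; nbytes="$2"; out="$3"
  head -c "$nbytes" /dev/zero |
    openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt 2>/dev/null > "$out"
}

# Hypothetical sample corpora with the same number of lines.
printf '%s\n' 'en 1' 'en 2' 'en 3' 'en 4' 'en 5' > corpus.en
printf '%s\n' 'fr 1' 'fr 2' 'fr 3' 'fr 4' 'fr 5' > corpus.fr

# 1 MiB is an arbitrary size for this demo; larger inputs need more bytes.
make_seeded_bytes 42 1048576 random.bytes

# Each call reads the same bytes from the start of the file,
# so both files receive the same permutation.
shuf --random-source=random.bytes corpus.en > corpus.en.shuf
shuf --random-source=random.bytes corpus.fr > corpus.fr.shuf

paste corpus.en.shuf corpus.fr.shuf
```

If the byte file is too short for the input, shuf runs out of random data and stops with an "end of file" error, which is probably the error described in the question. The source only has to be long enough; it does not need any relation to the words in the files.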

You cannot read the same data from a single pipe three times. You have to multiplex it somehow, and that is what "tee" does... (2 upvotes)
