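I have hundreds of text files, each holding a list of elements (numbering in the thousands). Three simplified, representative files are shown below (here the line elements are colours).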
group1.txt
red
blue
red
green
pink
red
group2.txt
yellow
brown
cyan
yellow
brown
red
violet
orange
group3.txt
orange
violet
pink
cyan
grey
I can create a sorted count table with the following script:
awk -F '\t' '{print $1}' * | sort | uniq -c | sort -nr
Output:
4 red
2 yellow
2 violet
2 pink
2 orange
2 cyan
2 brown
1 grey
1 green
1 blue
I would like to create a contingency table like this:
Colour  group1  group2  group3
red     3       1       0
green   1       0       0
blue    1       0       0
yellow  0       2       0
orange  0       1       1
grey    0       0       1
violet  0       1       1
pink    1       0       1
brown   0       2       0
cyan    0       1       1
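How can I create this contingency table using awk, Python, Perl, or R?
Here is a solution in R.
Set up the files (this is only so that we have an example to work with; it is not part of the actual machinery for constructing the contingency table):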
writeLines(c("red","blue","red","green","pink","red"),
con="group1.txt")
writeLines(c("yellow","brown","cyan","yellow","brown","red",
"violet","orange"),
con="group2.txt")
writeLines(c("orange","violet","pink","cyan","grey"),
con="group3.txt")
Most of the work is in reading and organizing the data. Assuming we know the files are named groupNN.txt, where NN is a number:
flist <- list.files(pattern="^group[0-9]+\\.txt$")
grpnames <- gsub("\\.txt$","",flist)
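A note on the file-name pattern: in a regular expression an unescaped `.` matches any character, and a pattern without `^` and `$` can match in the middle of a name. The rules for `.`, `^`, and `$` are the same in the regular expressions that R's `list.files` accepts, but the short sketch below uses Python's `re` module to show the difference. The file names are invented purely for illustration.

```python
import re

# Hypothetical file names, chosen only to show what each pattern accepts.
names = ["group1.txt", "group22.txt", "group1xtxt", "group3.txt.bak", "mygroup4.txt"]

loose = re.compile(r"group[0-9]+.txt")       # "." matches any character, no anchors
strict = re.compile(r"^group[0-9]+\.txt$")   # literal dot, anchored at both ends

loose_hits = [n for n in names if loose.search(n)]
strict_hits = [n for n in names if strict.search(n)]

print("loose: ", loose_hits)
print("strict:", strict_hits)
```

With hundreds of files in one directory, the stricter form is the safer choice if backups or similarly named files might be present.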
Read the colour files:
col_list <- lapply(flist,scan,what="character")
A matching vector of group IDs:
grpvec <- rep(grpnames,sapply(col_list,length))
Now just use table (naming the two arguments gives the dimension labels shown in the output):
table(col=unlist(col_list),grp=grpvec)
##          grp
## col      group1 group2 group3
##   blue        1      0      0
##   brown       0      2      0
##   cyan        0      1      1
##   green       1      0      0
##   grey        0      0      1
##   orange      0      1      1
##   pink        1      0      1
##   red         3      1      0
##   violet      0      1      1
##   yellow      0      2      0
(This is in alphabetical order; I'm not sure how much that matters to you ...)
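Since the question also mentions Python, here is a minimal sketch of the same idea using only the standard library. It is an illustration under a few assumptions rather than a definitive implementation: the files match `group*.txt` in a single directory, elements are separated by whitespace (as `scan` assumes in the R code), and the helper names `contingency_table` and `format_table` are made up for this example. It also shows one way to order rows by total count instead of alphabetically.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Sample data from the question; written to a temporary directory below.
samples = {
    "group1.txt": ["red", "blue", "red", "green", "pink", "red"],
    "group2.txt": ["yellow", "brown", "cyan", "yellow", "brown", "red",
                   "violet", "orange"],
    "group3.txt": ["orange", "violet", "pink", "cyan", "grey"],
}

def contingency_table(directory, pattern="group*.txt"):
    """Return (group names, {colour: [count in each group]})."""
    # Note: plain lexicographic sort, so group10 would come before group2.
    files = sorted(Path(directory).glob(pattern))
    groups = [f.stem for f in files]                 # "group1.txt" -> "group1"
    counts = [Counter(f.read_text().split()) for f in files]
    colours = sorted(set().union(*counts))           # every colour seen anywhere
    # A Counter returns 0 for a missing key, which fills the empty cells.
    return groups, {c: [cnt[c] for cnt in counts] for c in colours}

def format_table(groups, table):
    """Render the table as aligned plain text."""
    width = max(len(label) for label in list(table) + ["Colour"])
    lines = ["  ".join(["Colour".ljust(width)] + groups)]
    for colour, row in table.items():
        cells = [str(n).rjust(len(g)) for n, g in zip(row, groups)]
        lines.append("  ".join([colour.ljust(width)] + cells))
    return "\n".join(lines)

with tempfile.TemporaryDirectory() as tmp:
    for name, colours in samples.items():
        Path(tmp, name).write_text("\n".join(colours) + "\n")
    groups, table = contingency_table(tmp)

# Optional: order rows by total count (descending), ties broken alphabetically.
by_total = dict(sorted(table.items(), key=lambda kv: (-sum(kv[1]), kv[0])))
print(format_table(groups, by_total))
```

For real data, replace the temporary directory with the directory that holds the files. If element names can contain spaces, `read_text().splitlines()` would be a safer way to split than `split()`.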