我有一个超过40.000行(file1)的文件,我想提取与file2中的模式匹配的行(约6000行).我像这样使用grep,但它很慢:
grep -f file2 file1 > out
有没有更快的方法来使用awk或sed?
这是我文件的一些摘录:
File1:
scitn003869.2| scign003869 CGCATGTGTGCATGTATTATCGTATCCCTTG
scitn007747.1| scign007747 CACGCAGACGCAGTGGAGCATTCCAGGTCACAA
scitn003155.1| scign003155 TAAAAATCGTTAGCACTCGCTTGGTACACTAAC
scitn018252.1| scign018252 CGTGTGTGTGCATATGTGTGCATGCGTG
scitn004671.2| scign004671 TCCTCAGGTTTTGAAAGGCAGGGTAAGTGCT
Run Code Online (Sandbox Code Playgroud)
File2:
scign000003
scign000004
scign000005
scign004671
scign000013
Run Code Online (Sandbox Code Playgroud)
`
我有几个基因名称列表如下:
列表1:
XLOC_012482
XLOC_019357
XLOC_014642
XLOC_010021
XLOC_013282
Run Code Online (Sandbox Code Playgroud)
列表2:
XLOC_012482
XLOC_019357
XLOC_004860
XLOC_004022
XLOC_002278
Run Code Online (Sandbox Code Playgroud)
项目list3:
XLOC_004860
XLOC_004022
XLOC_006292
XLOC_006616
XLOC_013802
Run Code Online (Sandbox Code Playgroud)
我想提取所有列表对之间的共同元素.我试过使用intersect但是我不能在角色上使用它,而且我也不知道如何在所有成对组合上执行此操作.
我打开了一个带有爆破结果的文件,并以fasta格式打印到屏幕上.
代码如下所示:
result_handle = open("/Users/jonbra/Desktop/my_blast.xml")
from Bio.Blast import NCBIXML
blast_records = NCBIXML.parse(result_handle)
blast_record = blast_records.next()
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
print '>', alignment.title
print hsp.sbjct
Run Code Online (Sandbox Code Playgroud)
这会将fasta文件列表输出到屏幕.但是如何创建文件并将fasta输出保存到此文件中?
更新:我想我必须用something.write()替换循环中的print语句,但是我们如何编写'>',alignment.title?
编辑:此问题解决了如何仅为x轴标签的子集着色.这不是一个重复的问题.
我已经制作了x轴标签来代表核苷酸序列,我想在核苷酸的不同部分添加颜色.怎么样?谢谢.
ggplot(data = miRNA3) +
geom_line(mapping = aes(x = Position, y = Count_combined)) +
scale_y_continuous(breaks = seq(0, 120, 10)) +
ylab("Count") +
scale_x_continuous(breaks=1:150, labels=c("T", "G", "A", "T", "G", "T", "C", "C", "G", "T", "G", "T", "C", "C", "A", "C", "T", "C", "G", "T", "T", "G", "T", "T", "T", "T", "C", "A", "A", "C", "T", "T", "C", "T", "T", "C", "C", "C", "G", "C", "A", "A", "T", "T", "T", "A", "C", "C", "T", "T", "C", "A", "T", "G", …Run Code Online (Sandbox Code Playgroud) 我有一个制表符分隔文件,我想删除仅在前三列中相同的行(保留一个副本).我更喜欢使用unix,例如awk或uniq.
输入文件:
Supercontig_1.1 241783 286397 5677 52
Supercontig_1.1 241783 286397 5678 53
Supercontig_1.1 241783 286397 5679 53
Supercontig_1.2 10500 25700 3000 57
Supercontig_1.2 10500 25700 3001 59
Supercontig_1.2 10500 25700 3002 59
Supercontig_1.3 2000 7000 5686 60
Supercontig_1.3 2000 7000 5687 60
Run Code Online (Sandbox Code Playgroud)
输出:
Supercontig_1.1 241783 286397 5677 52
Supercontig_1.2 10500 25700 3000 57
Supercontig_1.3 2000 7000 5686 60
Run Code Online (Sandbox Code Playgroud) 我正在绘制具有不同轴范围的多个不同图,因此此问题不仅适用于我在此处显示的代码。我尝试修改刻度线的间隔和间隔,但是对于某些绘图,y轴总是在最后一个间隔之后继续。但对于一些刚刚工作了罚款喜欢这样。
这是我当前的情节和代码。我想以60的刻度线结束y轴:
# Set nucleotide sequence for x-axis labels
my_labs = c("C", "T", "A", "C", "A", "T", "A", "A", "A", "T", "A", "C", "A", "C", "A", "T", "G", "T", "C", "T", "C", "T", "G", "C", "T", "C", "G", "T", "T", "C", "G", "G", "G", "G", "G", "C", "C", "G", "G", "T", "A", "T", "G", "C", "T", "A", "C", "A", "C", "G", "G", "A", "A", "C", "G", "T", "G", "A", "G", "A", "G", "A", "C", "C", "C", "C", "T", …Run Code Online (Sandbox Code Playgroud) 我想要
我怎么能在BioPython中做到这一点?
我在tab分隔的列中有这样的数据:
1 name1 attribute1
1 name1 attribute2
1 name1 attribute3
31 name2 attribute1
31 name2 attribute2
31 name2 attribute3
444 name3 attribute1
444 name3 attribute2
444 name3 attribute3
Run Code Online (Sandbox Code Playgroud)
我希望这样:
001 name1 attribute1
001 name1 attribute2
001 name1 attribute3
031 name2 attribute1
031 name2 attribute2
031 name2 attribute3
444 name3 attribute1
444 name3 attribute2
444 name3 attribute3
Run Code Online (Sandbox Code Playgroud)
我可以在unix或perl中执行此操作吗?