标签: bioinformatics

R 中的长矢量图/覆盖图

我真的需要你的 R 技能。已经研究这个情节好几天了。我是 R 新手，所以这可以解释它。

我有染色体的序列覆盖数据（基本上是每个染色体长度上每个位置的值，使向量的长度达到数百万）。我想为我的阅读内容制作一个很好的覆盖图。这是我到目前为止得到的：在此输入图像描述

看起来不错，但我缺少 y 标签，这样我就可以知道它是哪条染色体，而且我在修改 x 轴时也遇到了麻烦，所以它在覆盖范围结束的地方结束。此外，我自己的数据要大得多，使得这个图特别需要很长时间。这就是我尝试这个 HilbertVisplotLongVector 的原因。它有效，但我不知道如何修改它、x 轴、标签、如何记录 y 轴，以及矢量在绘图上都获得相同的长度，即使它们的长度不相等。

source("http://bioconductor.org/biocLite.R")
biocLite("HilbertVis")
library(HilbertVis)
chr1 <- abs(makeRandomTestData(len=1.3e+07)) 
chr2 <- abs(makeRandomTestData(len=1e+07)) 

par(mfcol=c(8, 1), mar=c(1, 1, 1, 1), ylog=T)

# 1st way of trying with some code I found on stackoverflow
# Chr1
plotCoverage <- function(chr1, start, end) { # Defines coverage plotting function.
  plot.new()
  plot.window(c(start, length(chr1)), c(0, 10))
  axis(1, labels=F) 
  axis(4)
  lines(start:end, log(chr1[start:end]), type="l")
}
plotCoverage(chr1, start=1, end=length(chr1)) # Plots coverage result.

# Chr2
plotCoverage <- …

Run Code Online (Sandbox Code Playgroud)

plot r vector bioinformatics ggplot2

use*_*000

2013 02-10

2
推荐指数

1
解决办法

5414
查看次数

寻找基因间区域

我想提取染色体的基因间坐标。我编写了一大块代码，但由于我对这些包不熟悉，我不确定我是否遵循了正确的逻辑：

library(IRanges)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene",use.names=TRUE)

#For example, only I am interested in intergenic coordinates in chromosome 1
seqlevels(txdb, force=TRUE) = c("chr1")

#Creates IRanges 

ir = IRanges(start=unlist(start(txdb)), end=unlist(end(txdb)), names=names(txdb)) 

# Reduce overlapping gene positions to continuous range
ir.red = reduce(ir) # Collapses ranges of overlapping genes to continuous range.

#Identify overlaps among the initial and reduced range data sets
overlap = findOverlaps(ir, ir.red)

Run Code Online (Sandbox Code Playgroud)

我需要处理“+”和“-”链吗？

r bioinformatics bioconductor

c02*_*023

2015 04-16

2
推荐指数

1
解决办法

1522
查看次数

BioPython 中系统发育树的子树

我有一个纽维克格式的系统发育树。我想根据终端节点的标签（因此基于物种列表）拉出一棵子树。我正在使用的树的副本可以在这里找到： http: //hgdownload.soe.ucsc.edu/goldenPath/dm6/multiz27way/dm6.27way.nh

目前我已经使用 BioPython 阅读了树，如下所示：

from Bio import Phylo
#read in Phylogenetic Tree
tree = Phylo.read('dm6.27way.nh', 'newick')
#list of species of interest
species_list = ['dm6', 'droSim1', 'droSec1', 'droYak3', 'droEre2', 'droBia2', 'droSuz1', 'droAna3', 'droBip2', 'droEug2', 'droEle2', 'droKik2', 'droTak2', 'droRho2', 'droFic2']

Run Code Online (Sandbox Code Playgroud)

我如何仅提取species_list中物种的子树？

python tree bioinformatics biopython

use*_*816

lucky-day

2
推荐指数

1
解决办法

1616
查看次数

将文本添加到文本文件中的每 n:th 行

此 sed 命令行脚本在文件中的每一行前面添加文本：

sed -i 's/^/to be prepended/g' text.txt

Run Code Online (Sandbox Code Playgroud)

我怎样才能让它只在每 n 行执行此^操作？

我正在处理测序数据，在“norma”多重 fasta 格式中，首先有一个以 > 开头的标识符行，然后有其他文本。

下一行以随机 DNA 序列开头，如“AATTGCC”等，当该字符串完成新行和新标识符时，我如何在序列行的开头添加文本（附加碱基）？

sed bioinformatics

Gsu*_*ell

2016 11-28

2
推荐指数

1
解决办法

352
查看次数

使用 Bio.phylo 从几棵树中制作一个共识树

我对肠杆菌基因组中的 4 个管家基因感兴趣。

所以我有我的管家基因，我在 NR 上大放异彩并下载了比对的序列。

我使用 MEGA7 软件和最大似然法制作了系统发育树。Boostrap 方法进行了 200 次迭代。

我将树导出为 newick 文件。

所以现在，我的 4 个管家基因有 4 棵树。我想用我的 4 棵树创建一个共识树。

就我个人而言，我尝试使用 Bio.Phylo 的共识树（http://biopython.org/DIST/docs/api/Bio.Phylo.Consensus-module.html#strict_consensus）（http://biopython.org/wiki/Phylo）。我选择了 Majority_consensus 函数，它运行得很好。但我有一个问题。

我的“脚本”是这样的：

import os

import sys

from Bio import Phylo

from Bio.Phylo.Consensus import *

fichier=sys.argv[1]

fichier2=sys.argv[2]

fichier3=sys.argv[3]

fichier4=sys.argv[4]

tree1=Phylo.read(fichier, 'newick')

tree2=Phylo.read(fichier2, 'newick')

tree3=Phylo.read(fichier3, 'newick')

tree4=Phylo.read(fichier4, 'newick')

trees=tree1,tree2,tree3,tree4

majority_tree = majority_consensus(trees, 0.5)

Phylo.draw(majority_tree)

Run Code Online (Sandbox Code Playgroud)

问题在于共识树依赖于顺序。例如，当我try trees = tree1,tree2,tree3,tree4和trees = tree2,tree4,tree1,tree3

有人知道另一种软件可以从 newick 文件中制作共识树吗？

我需要帮助Bio.Phylo。如果有人了解更多有关此包的信息，那就太好了。

python bioinformatics biopython phylogeny

Non*_*sad

2017 04-04

2
推荐指数

1
解决办法

1282
查看次数

如何随机选择50个文件并将它们复制到不同的文件夹/目录（MacOS）？

我试图从目录中随机选择 50 个文件（.nexus 文件）并将它们复制到单独的目标目录中。

我在一个关于类似主题的问题页面上发现了以下脚本，其中随机选择 8 个项目：

舒夫-zn8-e *.jpg | xargs -0 cp -vt 目标/

shuf 随机排列当前目录中的 *.jpg 文件列表。-z 是以零结束每一行，以便正确处理带有特殊字符的文件。-n8 在 8 个文件后退出 shuf。xargs -0 读取由空字符分隔的输入（来自 shuf -z）并运行 cp。-v 详细打印每个副本。-t 是指定目标目录。

当在终端中输入这个命令时，我发现 -t 对于 MacOS 来说不是一个有效的命令。然而，删除它后，我似乎仍然无法让脚本工作。输入以下脚本的输出是：

用法： cp [-R [-H | -L | -P]] [-fi | -P]] -n] [-apvXc] 源文件目标文件 cp [-R [-H | -L | -P]] [-fi | -P]] -n] [-apvXc] 源文件 ... 目标目录

我一直在使用的代码如下。我使用 gshuf 而不是 shuf，长文件路径是要复制到的文件的目标目录。

我对编码等相当陌生，因此任何建议将不胜感激！谢谢！

gshuf -zn50 -e *.nexus | gshuf -zn50 -e *.nexus …

macos shell terminal bioinformatics

Bra*_*n W

lucky-day

2
推荐指数

1
解决办法

3365
查看次数

如何将包含 S4 对象的大型列表编写为 CSV 文件？

我有运行并输出一个大列表的代码。我一直坚持将输出写入文件，因为我不断收到不同的错误，因此我无法以通常用于数据帧的任何方式写入文件。

我使用的代码和数据是这样的：

library(GeneOverlap)\nlibrary(dplyr)\nlibrary(stringr)\n\ndataset1 <- structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.", \n"Gene3"), Gene_count = c(5L, 5L, 3L, 16L, 16L, 16L), Phenotype = c("Phenotype1", \n"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"\n)), row.names = c(NA, -6L), class = c("data.table", "data.frame"\n))\n\n\ndataset2 <- structure(list(Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6", \n"Gene7"), Gene_count = c(10L, 10L, 4L, 17L, 3L, 2L), Phenotype = c("Phenotype1", \n"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"\n)), row.names = c(NA, -6L), class = c("data.table", "data.frame"\n))\n\nd1_split <- split(dataset1, dataset1$Phenotype)\nd2_split <- split(dataset2, dataset2$Phenotype)\n\n# this should be …

Run Code Online (Sandbox Code Playgroud)

r bioinformatics r-s4

LN3*_*LN3

2022 03-07

2
推荐指数

1
解决办法

679
查看次数

无法在 Mac 上安装 Bioconda 软件包

第一次使用命令行，我尝试在带有 M1 芯片的 Mac 环境中安装 Bioconda 软件包（fastp 和 Bowtie1），但我不断收到相同的错误：

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
  - fastp

Current channels:
  - https://conda.anaconda.org/bioconda/osx-arm64
  - https://conda.anaconda.org/bioconda/noarch
  - https://conda.anaconda.org/conda-forge/osx-arm64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/osx-arm64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/osx-arm64
  - https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the …

Run Code Online (Sandbox Code Playgroud)

bioinformatics conda apple-m1 bioconda

ewj*_*wjc

2023 06-03

2
推荐指数

1
解决办法

2298
查看次数

谁能发现我的 Perl for 循环中的问题吗？

我有一个文本文件，其中包含FASTA 格式的蛋白质序列。FASTA 文件的第一行是标题，其余的是感兴趣的序列。每个字母都是一个氨基酸。我想编写一个程序来查找基序 VSEX（X 是任何氨基酸，其他是特定氨基酸）并打印出基序本身及其找到的位置。这是我的代码：

#!usr/bin/perl
open (IN,'P51170.fasta.txt');
while(<IN>) {
$seq.=$_;
$seq=~s/ //g;
chomp $seq;
} 
#print $seq;

$j=0;
$l= length $seq;
#print $l;
for ($i=0, $i<=$l-4,$i++){
    $j=$i+1;
    $motif= substr ($seq,$i,4);
    if ($motif=~m/VSE(.)/) {
        print "motif $motif found in position $j \n" ;
    }
}

Run Code Online (Sandbox Code Playgroud)

我很确定我搞乱了循环，但我不知道出了什么问题。我在 cygwin 上得到的输出如下：

motif  found in position 2
motif  found in position 2
motif  found in position 2

Run Code Online (Sandbox Code Playgroud)

regex perl bioinformatics

Nic*_*foe

2023 01-04

2
推荐指数

1
解决办法

103
查看次数

运行 perl 脚本并收到“选项规范错误：”

我对运行 perl 脚本非常陌生，并且遇到了我怀疑是由于变量分配引起的问题。

这是原始脚本的开头：

#!/usr/bin/perl -w

#####################################
# gbstrim.pl 
# John Garbe
# June 2016
# 
#
#####################################

=head1 NAME

gbstrim.pl - 
 - Trim padding and/or cut site residue from beginning of reads
 - Trim sequencing adapter from ends of reads
 - Trim all reads to length

NOTE: this doesn't handle paired-end data

=head1 SYNOPSIS

gbstrim.pl --enzyme1 bamhi --enzyme2 psti --read R1 [--threads 1] [--minlength 20] --fastqfile file.fastq --outputfile out.fastq 

=head1 DESCRIPTION

Options:
    --fastqfile sample.fastq : fastq file of reads …

Run Code Online (Sandbox Code Playgroud)

perl bioinformatics getopt

Fro*_*y01

lucky-day

2
推荐指数

1
解决办法

84
查看次数