I have a DataFrame generated with Python's pandas package. How can I generate a heatmap from a pandas DataFrame?
import numpy as np
from pandas import *
Index= ['aaa','bbb','ccc','ddd','eee']
Cols = ['A', 'B', 'C','D']
df = DataFrame(abs(np.random.randn(5, 4)), index= Index, columns=Cols)
>>> df
A B C D
aaa 2.431645 1.248688 0.267648 0.613826
bbb 0.809296 1.671020 1.564420 0.347662
ccc 1.501939 1.126518 0.702019 1.596048
ddd 0.137160 0.147368 1.504663 0.202822
eee 0.134540 3.708104 0.309097 1.641090
>>>
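One possible approach, as a minimal sketch rather than a definitive answer: draw the DataFrame's values with matplotlib's `pcolormesh` and label the ticks from the index and columns. The helper name `heatmap` and the `cmap` default are assumptions made for illustration; they are not part of the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def heatmap(df, cmap="viridis"):
    """Draw df as a colored grid; `heatmap` is a hypothetical helper name."""
    fig, ax = plt.subplots()
    mesh = ax.pcolormesh(df.values, cmap=cmap)
    # Cell (i, j) spans [j, j+1] x [i, i+1], so tick at the cell centers.
    ax.set_xticks(np.arange(df.shape[1]) + 0.5)
    ax.set_xticklabels(df.columns)
    ax.set_yticks(np.arange(df.shape[0]) + 0.5)
    ax.set_yticklabels(df.index)
    fig.colorbar(mesh, ax=ax)
    return ax


index = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
cols = ['A', 'B', 'C', 'D']
df = pd.DataFrame(np.abs(np.random.randn(5, 4)), index=index, columns=cols)
ax = heatmap(df)
```

Call `plt.show()` afterwards to display the figure.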
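I have data (100 values), each corresponding to one bin (0 to 99). I need to plot these data as a histogram. However, a histogram counts the occurrences of the values and does not plot them correctly, because my data are already binned.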
import random
import matplotlib.pyplot as plt
x = random.sample(range(1000), 100)
xbins = [0, len(x)]
#plt.hist(x, bins=xbins, color = 'blue')
#Does not make the histogram correct. It counts the occurrences of the individual counts.
plt.plot(x)
#plot works but I need this in histogram format
plt.show()
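Since the values are already per-bin counts, one option is to skip `hist` and draw them as a bar chart, one bar per bin. This is a sketch under that assumption; the helper name `plot_binned` is hypothetical.

```python
import random
import matplotlib.pyplot as plt


def plot_binned(counts):
    """Draw pre-binned counts as bars; bin i covers the interval [i, i+1]."""
    fig, ax = plt.subplots()
    bins = list(range(len(counts)))
    # width=1.0 with align="edge" makes adjacent bars touch, like a histogram.
    ax.bar(bins, counts, width=1.0, align="edge", color="blue")
    ax.set_xlim(0, len(counts))
    return ax


x = random.sample(range(1000), 100)
ax = plot_binned(x)
```

Call `plt.show()` afterwards to display the figure.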
How can I sort the columns by the values in the last row? In the example below, my final df would have its columns in the following order: 'ddd' 'aaa' 'ppp' 'fff'.
>>> df = DataFrame(np.random.randn(10, 4), columns=['ddd', 'fff', 'aaa', 'ppp'])
>>> df
ddd fff aaa ppp
0 -0.177438 0.102561 -1.318710 1.321252
1 0.980348 0.786721 0.374506 -1.411019
2 0.405112 0.514216 1.761983 -0.529482
3 1.659710 -1.017048 -0.737615 -0.388145
4 -0.472223 1.407655 -0.129119 -0.912974
5 1.221324 -0.656599 0.563152 -0.900710
6 -1.816420 -2.898094 -0.232047 -0.648904
7 2.793261 0.568760 -0.850100 0.654704
8 -2.180891 2.054178 -1.050897 -1.461458
9 -1.123756 1.245987 -0.239863 0.359759
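A minimal sketch of one way to do this: sort the last row as a Series and use its sorted index to reorder the columns. The function name `sort_columns_by_last_row` is an assumption for illustration, and `Series.sort_values` requires a reasonably recent pandas.

```python
import numpy as np
import pandas as pd


def sort_columns_by_last_row(df):
    """Return df with columns ordered by ascending value in the last row."""
    last_row = df.iloc[-1]
    # The index of the sorted Series holds the column names in the new order.
    return df[last_row.sort_values().index]


df = pd.DataFrame(np.random.randn(10, 4), columns=['ddd', 'fff', 'aaa', 'ppp'])
sorted_df = sort_columns_by_last_row(df)
```

The data rows are unchanged; only the column order differs.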
Is there a way to combine the 'head -1' and 'grep' commands for all files in a directory and redirect the output to an output file? I can do this with 'sed', but it does not seem to be as fast as grep.
sed -n '1p;/6330162/p' infile*.txt > outfile.txt
Using grep, I can do the following for one file at a time:
head -1 infile1.txt; grep -i '6330162' infile1.txt > outfile.txt
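However, I need to do this for all files in the directory. Inserting a wildcard does not help, because it prints all the headers first and then the grep output.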
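If leaving the shell is acceptable, the per-file "header, then matching lines" behaviour can be sketched in Python. This is an illustration rather than a shell one-liner, and it makes no claim about speed relative to grep; the function name `header_and_matches` is hypothetical. The match is a case-insensitive substring test, mirroring `grep -i` for a fixed string.

```python
def header_and_matches(paths, pattern, out_path):
    """For each file, write its first line and every line containing pattern."""
    needle = pattern.lower()
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                for lineno, line in enumerate(src):
                    # Line 0 is the header; it is written once even if it matches.
                    if lineno == 0 or needle in line.lower():
                        out.write(line)
```

A call such as `header_and_matches(sorted(glob.glob("infile*.txt")), "6330162", "outfile.txt")`, after `import glob`, would cover every matching file in the current directory.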
I am trying to create a dendrogram from a DataFrame built with the pandas package in Python. Sample data are shown below.
import numpy as np
from pandas import *
import matplotlib.pyplot as plt
from hcluster import pdist, linkage, dendrogram
from numpy.random import rand
Index= ['aaa','bbb','ccc','ddd','eee']
Cols = ['A', 'B', 'C','D']
df = DataFrame(abs(np.random.randn(5, 4)), index= Index, columns=Cols)
>>> df
A B C D
aaa 0.987415 0.192240 0.709559 0.317106
bbb 0.856932 0.252441 1.183127 0.712855
ccc 1.687198 0.462673 1.046469 0.159287
ddd 0.977152 2.657582 0.491975 0.027280
eee 0.120464 0.945034 0.142658 0.537024
>>>
X = df.T.values #Transpose values
Y = …
I am looking for a plot rotated 90 degrees clockwise. A similar example of such a plot is "hist(x, orientation='horizontal')". Is there any way to achieve a similar orientation?
#Make horizontal plots.
import random
import matplotlib.pyplot as plt
x = random.sample(range(1000), 100)
x
plt.plot(x) #orientation='horizontal'
plt.show()
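As far as I know, `plot` has no `orientation` argument, so a simple workaround is to swap the axes yourself: put the values on the x-axis and the sample index on the y-axis, then invert the y-axis so the index runs downward, which corresponds to a 90 degree clockwise rotation. This is a sketch; the helper name `plot_rotated` is hypothetical.

```python
import random
import matplotlib.pyplot as plt


def plot_rotated(values):
    """Plot values along x and their index along y, with index 0 at the top."""
    fig, ax = plt.subplots()
    ax.plot(values, list(range(len(values))))
    # Rotating clockwise sends the original left-to-right axis top-to-bottom.
    ax.invert_yaxis()
    return ax


x = random.sample(range(1000), 100)
ax = plot_rotated(x)
```

Call `plt.show()` afterwards to display the figure.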
I have a list of tuples as shown below (the list is sorted in descending order of the second value):
from string import ascii_letters
myTup = list(zip(ascii_letters, range(10)[::-1]))  # list() so the result prints as a list in Python 3
threshold = 5.5
>>> myTup
[('a', 9), ('b', 8), ('c', 7), ('d', 6), ('e', 5), ('f', 4), ('g', 3), ('h', 2), \
('i', 1), ('j', 0)]
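Given a threshold, what is the best way to discard all tuples whose second value is less than this threshold?
I have more than 5 million tuples, so I would rather not compare them tuple by tuple and then delete them or append them to another list of tuples.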
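Because the list is already sorted in descending order of the second value, one option is a binary search for the cut point followed by a single slice, which avoids comparing every tuple. This is a sketch under that sortedness assumption; the function name `cut_below` is hypothetical.

```python
from string import ascii_letters


def cut_below(pairs, threshold):
    """Keep the leading tuples whose second value is >= threshold.

    Assumes pairs is sorted in descending order of the second value.
    """
    lo, hi = 0, len(pairs)
    while lo < hi:
        mid = (lo + hi) // 2
        if pairs[mid][1] >= threshold:
            lo = mid + 1  # the cut point lies to the right of mid
        else:
            hi = mid      # mid is already below the threshold
    return pairs[:lo]


myTup = list(zip(ascii_letters, range(10)[::-1]))
kept = cut_below(myTup, 5.5)
```

The search takes about log2(n) comparisons, so roughly 23 for 5 million tuples, and the slice copies only the tuples that are kept.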
How can I get all the values between 'uniprotkb:' and '(gene name)' in 'str' below:
str = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
The result should be:
HIST1H3D
HIST1H3A
HIST1H3B
HIST1H3C
HIST1H3E
HIST1H3F
HIST1H3G
HIST1H3H
HIST1H3I
HIST1H3J
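One way to do this is `re.findall` with a capture group between the two markers. A minimal sketch, using a shortened version of the string and the variable name `s` (an assumption, chosen to avoid shadowing the built-in `str`):

```python
import re

s = ('uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|'
     'uniprotkb:HIST1H3B(gene name)')

# Capture everything after 'uniprotkb:' up to the literal '(gene name)'.
# [^(|]+ stops at the first '(' or '|', so matches cannot span two entries.
names = re.findall(r'uniprotkb:([^(|]+)\(gene name\)', s)

for name in names:
    print(name)
```

The parentheses around `gene name` are escaped because they are literal characters in the input, not a regex group.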
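I have five points and I need to create a dendrogram from them. The 'dendrogram' function can be used to find the order of these points, as shown below. However, I do not want to use dendrogram, because it is slow and fails for a large number of points (I asked about a Python alternative for finding the dendrogram in a separate question). Can someone point me to how to convert the 'linkage' output (Z) into the 'dendrogram(Z)['ivl']' values?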
>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>> x = rand(5,3)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> Z
array([[ 1. , 3. , 0.11443378, 2. ],
[ 0. , 4. , 0.47941843, 2. ],
[ 5. , 6. , 0.67596472, 4. ],
[ 2. , 7. , 0.79993986, 5. ]])
>>>
>>> dendrogram(Z)['ivl']
['2', '1', '3', '0', '4']
>>>
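The leaf order can be read directly from Z by walking the tree from the root, visiting the left child before the right child. The sketch below uses an explicit stack instead of recursion so that deep trees do not hit the recursion limit. The function name `leaf_order` is hypothetical, and it assumes the default dendrogram settings (no count or distance sorting). SciPy's `scipy.cluster.hierarchy.leaves_list(Z)` appears to provide the same ordering, if SciPy is available.

```python
def leaf_order(Z):
    """Return leaf ids left to right for a linkage matrix Z.

    Row i of Z merges clusters Z[i][0] and Z[i][1] into cluster n + i,
    where n = len(Z) + 1 is the number of original points.
    """
    n = len(Z) + 1
    stack = [2 * n - 2]  # id of the root cluster (the last merge)
    order = []
    while stack:
        node = stack.pop()
        if node < n:
            order.append(node)  # ids below n are original points
        else:
            left, right = int(Z[node - n][0]), int(Z[node - n][1])
            stack.append(right)  # pushed first so that left is popped first
            stack.append(left)
    return order


Z = [[1.0, 3.0, 0.11443378, 2.0],
     [0.0, 4.0, 0.47941843, 2.0],
     [5.0, 6.0, 0.67596472, 4.0],
     [2.0, 7.0, 0.79993986, 5.0]]
labels = [str(i) for i in leaf_order(Z)]
```

For the Z shown in the question, this gives the same labels as `dendrogram(Z)['ivl']`.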
I have 5 million sequences (probes, specifically) like the ones below. I need to extract the name from each string.
The names here are 1007_s_at:123:381, 1007_s_at:128:385, and so on.
I am using the lapply function, but it takes too much time. I also have several other similar files. Could you suggest a faster way to do this?
nm = c(
"probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
"probe:HG-Focus:1007_s_at:128:385; Interrogation_Position=3615; Antisense;",
"probe:HG-Focus:1007_s_at:133:441; Interrogation_Position=3786; Antisense;",
"probe:HG-Focus:1007_s_at:142:13; Interrogation_Position=3878; Antisense;" ,
"probe:HG-Focus:1007_s_at:156:191; Interrogation_Position=3443; Antisense;",
"probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;")
extractProbe <- function(x) sub("probe:", "", strsplit(x, ";", fixed=TRUE)[[1]][1], ignore.case=TRUE)
pr = lapply(nm, extractProbe)
Output
1007_s_at:123:381
1007_s_at:128:385
1007_s_at:133:441
1007_s_at:142:13
1007_s_at:156:191
1007_s_at:244:391
python ×8
pandas ×3
dendrogram ×2
matplotlib ×2
data-mining ×1
dataframe ×1
grep ×1
heatmap ×1
histogram ×1
lapply ×1
orientation ×1
plot ×1
r ×1
regex ×1
shell ×1
sorting ×1
string ×1
tuples ×1