Tag: data-analysis

Scipy's cut_tree() doesn't return the requested number of clusters, and the linkage matrices obtained with scipy and fastcluster do not match

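I am running agglomerative hierarchical clustering (AHC) experiments in Python 3 using the fastcluster package together with functions from the scipy.cluster.hierarchy module, and I have found puzzling behavior in the cut_tree() function.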

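I cluster the data without problems and obtain the linkage matrix Z using linkage_vector() with method=ward. Then I want to cut the dendrogram tree to obtain a fixed number of clusters (e.g. 33), and I do this properly using cut_tree(Z, n_clusters=33). (Remember that AHC is a deterministic method that produces a binary tree connecting all the data points, which sit at the leaves of the tree; you can look at this tree at any level to "see" the number of clusters you ultimately want. All cut_tree() does is return a set of integer labels from 0 to n_clusters - 1, one attributed to each point of the dataset.)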

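I have done this many times in other experiments and always got the number of clusters I requested. The problem is that, for this dataset, when I ask cut_tree() for 33 clusters it gives me only 32. I don't understand why this happens. Could it be a bug? Are you aware of any bugs in cut_tree()? I tried to debug this behavior and ran the same clustering experiment using scipy's linkage() function. Using the resulting linkage matrix as input to cut_tree(), I did not get an unexpected number of clusters as output. I also verified that the linkage matrices output by the two methods are not equal.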

The [dataset] I am using consists of 10680 vectors, each with 20 dimensions. Check the following experiment:

import numpy as np
import fastcluster as fc
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

### *Load dataset (10680 vectors, each with 20 …

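A minimal sketch of how the cluster count from cut_tree() could be checked against fcluster(), assuming random stand-in data (the real dataset is not included in the excerpt) and scipy's own linkage() rather than fastcluster. The variable names and the choice of k are illustrative only.

```python
import numpy as np
import scipy.cluster.hierarchy as hac

# Hypothetical stand-in data: 200 random vectors with 20 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

Z = hac.linkage(X, method="ward")
k = 5  # requested number of clusters (33 in the question)

# Sanity checks on the linkage matrix before cutting it
valid = hac.is_valid_linkage(Z)
monotonic = hac.is_monotonic(Z)

# Two ways of asking for k flat clusters
labels_cut = hac.cut_tree(Z, n_clusters=k).ravel()
labels_fc = hac.fcluster(Z, t=k, criterion="maxclust")

n_cut = len(np.unique(labels_cut))
n_fc = len(np.unique(labels_fc))
print(valid, monotonic, n_cut, n_fc)
```

Running the same checks on the matrix returned by linkage_vector() would show whether that matrix is valid and monotonic, and whether the two cutting functions agree on it.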

python debugging hierarchical-clustering data-analysis scipy

PDR*_*DRX

2017 10-10

Score: 3 · Answers: 1 · Views: 1215

Pandas error parsing csv - Expected 1 fields, saw 9

I am trying to parse a .csv file:

planets = pd.read_csv("planets.csv", sep=',')


But I always get this error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 9


Here are the first few lines of my csv file:

# This file was produced by the test
# Tue Apr  3 06:03:27 2018
#
# COLUMN pl_hostname:    Host Name
# COLUMN pl_discmethod:  Discovery Method
# COLUMN pl_pnum:        Number of Planets in System
# COLUMN pl_orbper:      Orbital Period [days]
# COLUMN pl_orbsmax:     Orbit Semi-Major Axis [AU])
# COLUMN st_dist:        Distance [pc]
# COLUMN st_teff:        Effective …

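Since the file begins with lines starting with `#`, one option is the `comment` parameter of `pd.read_csv`. A minimal sketch, assuming the data rows that follow the header block are comma-separated; the column header row and the two data rows below are made up for illustration, because the excerpt is cut off before the real ones.

```python
import io
import pandas as pd

# The '#' lines mimic the file header from the question;
# the header row and data rows are hypothetical.
sample = """# This file was produced by the test
# COLUMN pl_hostname:    Host Name
# COLUMN pl_pnum:        Number of Planets in System
#
pl_hostname,pl_pnum
HypotheticalStar b,2
HypotheticalStar c,2
"""

# comment='#' makes the parser skip everything from '#' to the end of the line,
# so fully commented lines are ignored like blank lines
planets = pd.read_csv(io.StringIO(sample), sep=",", comment="#")
print(planets)
```

With the real file, the call would be `pd.read_csv("planets.csv", sep=",", comment="#")`.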

python csv data-analysis python-3.x pandas

Baa*_*aru

2018 04-03

Score: 3 · Answers: 1 · Views: 10k

How do I add extra stop words in addition to the default stopwords in wordcloud?

I want to add certain words to the default stopwords list used in wordcloud. Current code:

all_text = " ".join(rev for rev in twitter_clean.text)
stop_words = ["https", "co", "RT"]
wordcloud = WordCloud(stopwords = stop_words, background_color="white").generate(all_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


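When I use my custom stop_words variable, words such as "is", "was", and "the" are all interpreted and displayed as high-frequency words. However, when I use the default stopwords list (no stopwords argument), many other words show up as very frequent. How can I pass both my custom stop_words and the default stopwords list to my wordcloud?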

python matplotlib data-analysis stop-words word-cloud

Com*_*and

2020 01-14

Score: 3 · Answers: 1 · Views: 10k

Pandas: handling a column with multiple data types

I have a column in my dataframe df that contains values of both float and str type:

df['ESTACD'].unique()

Output:
array([11.0, 32.0, 31.0, 35.0, 37.0, 84.0, 83.0, 81.0, 97.0, 39.0,
   38.0, 40.0, 34.0, 7.0, 17.0, 16.0, 14.0, 82.0, 8.0, '11', '40',
   '31', '39', '68', '97', '32', '33', '37', '38', '83', '84', '93',
   '35', '81', '67', '07', '80', '71', 'A3', '14', '17', '22', '34',
   '36', '82', '08'], dtype=object)


I want to convert all values of this column to string type. Using astype(str) here is not enough, because we end up with values such as "11.0", "32.0", etc.

The only other way I can think of is a for loop:

for i in range(len(df)):
    if (type(df['ESTACD'][i]) == float) or (df['ESTACD'][i].startswith('0')):
        df['ESTACD'][i] = str(int(df['ESTACD'][i]))


However, this is very time-consuming for large datasets. Is there a way to achieve this without a loop?
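A minimal sketch of a loop-free variant, assuming the goal matches the loop above: numeric-looking values become integer strings (so 11.0 and '07' become '11' and '7') and everything else, such as 'A3', is kept as-is. The small frame below is a made-up subset of the values shown in the question.

```python
import pandas as pd

# Hypothetical subset of the values shown in the question
df = pd.DataFrame({"ESTACD": [11.0, 7.0, "11", "07", "A3"]})

s = df["ESTACD"]

# Parse what can be parsed as a number; everything else becomes NaN
num = pd.to_numeric(s, errors="coerce")

# Nullable integer dtype drops the trailing ".0" before converting to str
as_int_str = num.astype("Int64").astype(str)

# Keep the integer string where parsing worked, the original text otherwise
df["ESTACD"] = as_int_str.where(num.notna(), s.astype(str))
print(df["ESTACD"].tolist())
```

Note that genuinely missing values (NaN) in the column would end up as the string 'nan' with this approach, so they may need separate handling.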

python data-analysis dataframe pandas

Atu*_*aji

Score: 3 · Answers: 1 · Views: 4008

Why am I getting TypeError: unsupported operand type(s) for /: 'str' and 'int'?

For the statistics problem below, I am trying to run a "two-sample independent t-test" in Python.

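An analyst at a department store wants to evaluate a recent credit card promotion. To do this, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal ad. Is the promotion effective at increasing sales? Below is my code. I have made some mistake in writing it, please help.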

from scipy import stats

std_promo = cust[(cust['insert'] == 'Standard')]
new_promo = cust[(cust['insert'] == 'New Promotion')]

print(std_promo.head(3))
print(new_promo.head(3))

     id    insert      dollars
0   148  Standard  2232.771979
2   973  Standard  2327.092181
3  1096  Standard  1280.030541

     id         insert      dollars
1   572  New Promotion  1403.807542
4  1541  New Promotion  1513.563200
5  1947  New Promotion  1729.627996

print (std_promo.mean())
print (new_promo.mean())

    id         69003.000000
    dollars     1566.389031
    dtype: float64
    id         64998.244000
    dollars     1637.499983
    dtype: float64
print (std_promo.std())
print (new_promo.std())
    id         37753.106923 …

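The excerpt is cut off before the failing line, so the cause cannot be confirmed. One common way to hit this TypeError is to pass whole DataFrames, including the string column `insert`, to a numeric routine. A minimal sketch, assuming the test is meant to compare the `dollars` column of the two groups; the six rows are copied from the head() output above, while the real data has 500 rows.

```python
import pandas as pd
from scipy import stats

# Rows copied from the head() output in the question
cust = pd.DataFrame({
    "id": [148, 572, 973, 1096, 1541, 1947],
    "insert": ["Standard", "New Promotion", "Standard",
               "Standard", "New Promotion", "New Promotion"],
    "dollars": [2232.771979, 1403.807542, 2327.092181,
                1280.030541, 1513.563200, 1729.627996],
})

std_promo = cust[cust["insert"] == "Standard"]
new_promo = cust[cust["insert"] == "New Promotion"]

# Pass only the numeric column, not the whole frame with the string column
t_stat, p_value = stats.ttest_ind(std_promo["dollars"], new_promo["dollars"])
print(t_stat, p_value)
```

Selecting the numeric column also helps with `mean()` and `std()`, for example `std_promo["dollars"].mean()`.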

python statistics machine-learning data-analysis pandas

bad*_*ddu

2019 12-23

Score: 3 · Answers: 1 · Views: 999

How to convert rows into columns using R

I am new to R and have a dataset similar to this:

df <- data.frame(x = c(30, 1017, 1527, 1827,10496, 10794, 11270, 12261),
                 y = c(4.1, 2.6, 1.7, 1.1, 0.9, 1.1, 1.4, 3.1),
                 cod = c(3011, 3011, 3011, 3011, 3011, 3011, 3011, 2043),
                 label = c('start', 'start1', 'start2', 'start3', 'start4', 'start5', 'start6', 'start7'))

df

      x   y  cod  label
1    30 4.1 3011  start
2  1017 2.6 3011 start1
3  1527 1.7 3011 start2
4  1827 1.1 3011 start3
5 10496 0.9 3011 start4
6 10794 1.1 3011 start5
7 …


r data-analysis

M.S*_*uza

Score: 3 · Answers: 1 · Views: 44

Finding the time at which a specific value is reached in time-series data when peaks are present

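I want to find the moment at which a certain value is reached in noisy time-series data. If there are no peaks in the data, I can do the following in MATLAB.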

Code from here

% create example data 
d=1:100;
t=d/100;
ts = timeseries(d,t);
% define threshold
thr = 55;
data = ts.data(:);
time = ts.time(:);
ind = find(data>thr,1,'first');
time(ind) %time where data>threshold


But when there is noise, I am not sure what has to be done.

In the time-series data plotted in the above image I want to find the time instant at which the y-axis value 5 is reached. The data actually stabilizes to 5 at t>=100 s. But due to the presence of noise in the data, we see a …
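One possible reading of the problem is a settling-time criterion: find the first instant after which the signal stays inside a tolerance band around the target value. A minimal sketch in Python (not MATLAB), assuming a synthetic signal, since the plotted data is not available; the function name `settling_time` and the tolerance value are illustrative choices, not taken from the question.

```python
import numpy as np

def settling_time(time, data, target, tol):
    """Return the first time after which data stays within target +/- tol, or None."""
    outside = np.abs(data - target) > tol
    if not outside.any():
        return time[0]  # inside the band for the whole record
    last_outside = np.flatnonzero(outside)[-1]
    if last_outside == len(data) - 1:
        return None  # never settles within the record
    return time[last_outside + 1]

# Synthetic example: ramp that levels off at 5 from t = 100, plus a bounded ripple
t = np.arange(0.0, 200.0)
y = np.minimum(t / 20.0, 5.0) + 0.1 * np.sin(t)

settle = settling_time(t, y, target=5.0, tol=0.2)
print(settle)
```

The tolerance has to be wider than the noise amplitude, otherwise the signal never counts as settled. Smoothing the data first (for example with a moving average) would allow a tighter band.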

algorithm signal-processing time-series data-analysis convergence

Nat*_*sha

2021 04-06

Score: 3 · Answers: 1 · Views: 180

Plotly box plot p-value significance annotation

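I have started using, and have come to like, box plots to represent my data. However, I am having a hard time finding a way to contrast the two groups. Is there a way to introduce a statistical significance comparison between the data when using Plotly? I would like to create a chart like this: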

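where * corresponds to a p-value < 0.05 and ns (not significant) corresponds to a p-value > 0.05. I have found that with scipy.stats.ttest_ind() and stats.ttest_ind_from_stats() one can easily find the p-value for two distributions.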

I have not found any related posts online, and I think this would be a rather useful implementation, so any help would be appreciated!

data-analysis boxplot p-value plotly plotly-python

Man*_*gue

2021 05-12

Score: 3 · Answers: 1 · Views: 3941

Expanding strings from a column into separate columns in Pandas

I have a df of the following form:

id   sid      steps
A     1       step1
A     1    step1-step2
A     1  step1-step2-step3


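It contains data on how user A browsed a particular series of pages (steps) within a given session (sid). I would like to take these dash-separated steps and create a separate column for each step.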

Result:

id   sid        steps        page_step1  page_step2  page_step3
A     1         step1          step1        NA          NA
A     1      step1-step2       step1       step2        NA
A     1   step1-step2-step3    step1       step2       step3


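I don't know exactly how many steps there are, so I would like the columns to be created dynamically. I have been stuck on this all week, thanks!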
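A minimal sketch of one way to do this with `str.split(..., expand=True)`, which creates as many columns as the longest row needs. The frame below reproduces the three rows from the question; the `page_step{n}` naming follows the desired result.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "A", "A"],
    "sid": [1, 1, 1],
    "steps": ["step1", "step1-step2", "step1-step2-step3"],
})

# One column per dash-separated step; shorter rows are padded with missing values
pages = df["steps"].str.split("-", expand=True)

# Name the new columns page_step1, page_step2, ... however many there are
pages.columns = [f"page_step{i + 1}" for i in range(pages.shape[1])]

result = df.join(pages)
print(result)
```

Because the column names are built from `pages.shape[1]`, the number of steps does not need to be known in advance.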

python data-analysis dataframe pandas

S44*_*S44

Score: 3 · Answers: 1 · Views: 457

How to generate a table of transition types in R?

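I have some data with a number of different ids and a list of their states at different times (t1, t2, t3, etc.), and I would like to generate a table that provides information on the different types of state changes that have occurred, so that for the sample data (copied below) it would look like this.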


This would show, for example, that x changed to y twice and y changed to x once. Does anyone know how I could do this in R?

Sample data:

id <- c('a','b','c')
t1 <- c('x','y','z')
t2 <- c('y','y','z')
t3 <- c('z','y','x')
t4 <- c('z','x','y')
df <- cbind(id, t1, t2, t3, t4)


r time-series dynamic data-analysis

unk*_*own

2017 09-19

Score: 2 · Answers: 1 · Views: 55

Tag statistics

data-analysis ×10

python ×6

pandas ×4

dataframe ×2

r ×2

time-series ×2

algorithm ×1

boxplot ×1

convergence ×1

csv ×1

debugging ×1

dynamic ×1

hierarchical-clustering ×1

machine-learning ×1

matplotlib ×1

p-value ×1

plotly ×1

plotly-python ×1

python-3.x ×1

scipy ×1

signal-processing ×1

statistics ×1

stop-words ×1

word-cloud ×1
