How do I set the minimum and maximum length of dataframes in Hypothesis?

Question

I have the following strategy for creating dataframes with genomics data:

from hypothesis.extra.pandas import columns, data_frames, column
import hypothesis.strategies as st


def mysort(tp):

    key = [-1, tp[1], tp[2], int(1e10)]

    return [x for _, x in sorted(zip(key, tp))]

positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
chromosomes = st.sampled_from(elements=["chr{}".format(str(e)) for e in list(range(1, 23)) + "X Y M".split()])

genomics_data = data_frames(columns=columns(["Chromosome", "Start", "End", "Strand"], dtype=int),
                            rows=st.tuples(chromosomes, positions, positions, strands).map(mysort))

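I am not really interested in empty dataframes, since they are invalid, and I would also like to generate some really long dfs. How do I change the size of the dataframes created for test cases? That is, a minimum size of 1 and a large average size?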

Answer 1

The*_*Cat 5

You can pass the data_frames constructor an index argument. An index strategy such as range_indexes accepts min_size and max_size options:

from hypothesis.extra.pandas import data_frames, columns, range_indexes
import hypothesis.strategies as st

def mysort(tp):

    key = [-1, tp[1], tp[2], int(1e10)]

    return [x for _, x in sorted(zip(key, tp))]

chromosomes = st.sampled_from(["chr{}".format(str(e)) for e in list(range(1, 23)) + "X Y M".split()])

positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
dfs = data_frames(
    index=range_indexes(min_size=5),
    columns=columns("Chromosome Start End Strand".split(), dtype=int),
    rows=st.tuples(chromosomes, positions, positions, strands).map(mysort),
)

This produces dfs like the following:

  Chromosome    Start      End Strand
0      chr11  1411202  8025685      +
1      chr18   902289  5026205      -
2      chr12  5343877  9282475      +
3      chr16  2279196  8294893      -
4      chr14  1365623  6192931      -
5      chr12  4602782  9424442      +
6      chr10   136262  1739408      +
7      chr15   521644  4861939      +
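
The answer's code sets only min_size. As a minimal sketch of bounding the length on both sides, the variant below passes min_size and max_size together. Several things here are illustrative assumptions rather than part of the original answer: the cap of 50 rows (MAX_ROWS), the helper order_coordinates, the check function check_row_bounds, and the choice to declare each column separately with column(...) so the string columns get dtype=object instead of int.

```python
import hypothesis.strategies as st
from hypothesis import given, settings
from hypothesis.extra.pandas import column, data_frames, range_indexes

chromosomes = st.sampled_from(
    ["chr{}".format(e) for e in list(range(1, 23)) + ["X", "Y", "M"]]
)
positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from(["+", "-"])

MAX_ROWS = 50  # illustrative upper bound, not taken from the original answer


def order_coordinates(df):
    """Hypothetical helper: swap values where needed so Start <= End."""
    df = df.copy()
    low = df[["Start", "End"]].min(axis=1)
    high = df[["Start", "End"]].max(axis=1)
    df["Start"], df["End"] = low, high
    return df


# min_size=1 rules out empty frames; max_size caps the number of rows.
bounded_dfs = data_frames(
    index=range_indexes(min_size=1, max_size=MAX_ROWS),
    columns=[
        column("Chromosome", elements=chromosomes, dtype=object),
        column("Start", elements=positions, dtype=int),
        column("End", elements=positions, dtype=int),
        column("Strand", elements=strands, dtype=object),
    ],
).map(order_coordinates)


@settings(max_examples=25, deadline=None)
@given(bounded_dfs)
def check_row_bounds(df):
    # Every generated frame should respect both bounds.
    assert 1 <= len(df) <= MAX_ROWS
    assert (df["Start"] <= df["End"]).all()
```

Calling check_row_bounds() directly, or collecting it with a test runner, exercises the strategy. Raising min_size is the direct way to force longer frames; both bounds are hard limits on the row count, not a target average.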

