SQL statistical sampling

tba*_*cos 5 sql sql-server statistics

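I'm looking for some genius SQL help with a tricky statistics problem I've run into.

What I'm trying to do is pull a statistically balanced sample from an unbalanced set of user profiles. Doing this for a single profile attribute at a time (e.g. gender) would be fairly simple. Doing it across multiple dimensions at the same time, however, takes some sophistication.

For the sake of argument, let's say I have this table.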

Profile.userID  
Profile.Gender  
Profile.Age  
Profile.Income

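Suppose I want to pull a set of profiles from the mix so that the new sample of users roughly matches all of the following characteristics: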

50% male, 50% female
30% young, 40% middle age, 40% old
40% low income, 40% middle income, 20% high income

Does anyone have any ideas on how to approach this?

Gor*_*off 3

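What you have is a sampling problem. The key to solving it is to split the data into separate groups, one for each combination of the three variables. Then compute the product of the marginal probabilities for each group (the values you gave are the marginal probabilities). Then normalize across all 18 groups.

For example, the Male-Young-Low group has the value 0.5*0.3*0.4 = 0.06. You repeat this for all 18 groups and then normalize to percentages (i.e. divide each value by the sum of all values; the sum here is 1.1, because the age shares in the question add up to 110%). The results are: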

Gender  Age     Income  Marg    Normalized
Male    Young   Low     0.06    5.5%
Male    Young   Middle  0.06    5.5%
Male    Young   High    0.03    2.7%
Male    Middle  Low     0.08    7.3%
Male    Middle  Middle  0.08    7.3%
Male    Middle  High    0.04    3.6%
Male    Old     Low     0.08    7.3%
Male    Old     Middle  0.08    7.3%
Male    Old     High    0.04    3.6%
Female  Young   Low     0.06    5.5%
Female  Young   Middle  0.06    5.5%
Female  Young   High    0.03    2.7%
Female  Middle  Low     0.08    7.3%
Female  Middle  Middle  0.08    7.3%
Female  Middle  High    0.04    3.6%
Female  Old     Low     0.08    7.3%
Female  Old     Middle  0.08    7.3%
Female  Old     High    0.04    3.6%
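
The table above can be reproduced with a few lines of Python. This is only a sketch for checking the arithmetic: the dictionary names and the `group_rates` helper are made up for illustration, and the shares are copied from the question as written (including the age shares that add up to 110%).

```python
from itertools import product

# Target marginal shares copied from the question (age shares add up to 1.1).
gender = {"Male": 0.5, "Female": 0.5}
age = {"Young": 0.3, "Middle": 0.4, "Old": 0.4}
income = {"Low": 0.4, "Middle": 0.4, "High": 0.2}


def group_rates(*marginals):
    """Return {combination of labels: normalized product of marginal shares}."""
    raw = {}
    for combo in product(*(m.items() for m in marginals)):
        labels = tuple(label for label, _ in combo)
        weight = 1.0
        for _, share in combo:
            weight *= share          # product of the marginal shares
        raw[labels] = weight
    total = sum(raw.values())        # 1.1 for the shares above
    return {labels: weight / total for labels, weight in raw.items()}


rates = group_rates(gender, age, income)
for labels, rate in rates.items():
    print(*labels, f"{rate:.1%}")
```

Dividing by the total is what makes the 18 values sum to 100% even though the age shares in the question do not.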

This becomes the sampling rate for each group. Here is pseudo-SQL code that actually does the sampling:

-- Pseudo-SQL: SamplingRates needs one row per group (18 in total).
-- Assumes gender, age and income already hold the category labels.
with SamplingRates as (
    select 'Male' as gender, 'Young' as age, 'Low' as income, 0.055 as SamplingRate
    union all . . .
)
select t.*
from (select t.*,
             -- newid() gives a random order within each group in SQL Server
             row_number() over (partition by gender, age, income order by newid()) as seqnum,
             count(*) over (partition by gender, age, income) as NumRecs
      from Profile t
     ) t join
     SamplingRates sr
     on t.gender = sr.gender and t.age = sr.age and t.income = sr.income and
        t.seqnum <= sr.SamplingRate * t.NumRecs
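
For readers who want to try the logic outside the database, here is a rough Python sketch of the same per-group rule (shuffle each group, then keep the rows whose position is at most rate times the group size). The `stratified_sample` function, the row layout, and the two rates in the usage example are hypothetical and chosen only to make the behaviour easy to see; they are not taken from the answer.

```python
import random
from collections import defaultdict


def stratified_sample(rows, rates, key, rng=random):
    """Keep the first floor(rate * group_size) rows of each shuffled group.

    rows  : list of dicts (hypothetical profile records)
    rates : {group key: sampling rate}
    key   : function mapping a row to its group key
    """
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sample = []
    for group_key, members in groups.items():
        rate = rates.get(group_key, 0.0)   # groups without a rate contribute nothing
        rng.shuffle(members)               # plays the role of "order by <random>"
        keep = int(rate * len(members))    # same cut-off as seqnum <= rate * NumRecs
        sample.extend(members[:keep])
    return sample


# Made-up data: 100 male and 100 female profiles, all Young / Low.
rows = [{"userID": i, "gender": "Male" if i % 2 else "Female",
         "age": "Young", "income": "Low"} for i in range(200)]
rates = {("Male", "Young", "Low"): 0.5, ("Female", "Young", "Low"): 0.1}
picked = stratified_sample(
    rows, rates,
    key=lambda r: (r["gender"], r["age"], r["income"]),
    rng=random.Random(0),
)
print(len(picked))
```

Under this rule each group contributes its rate times its own size, so the make-up of the result also depends on how large each group is in the source data. A possible variation, which is not what the code above does, is to cap each group at rate times the desired total sample size instead.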