Handling a dataset with repeated multi-valued features

Question

moh*_*moh 12 python scipy multivalue-database feature-selection

We have a dataset in a sparse representation with 25 features and 1 binary label. For example, one row of the dataset is:

Label: 0
exid: 24924687
Features:
11:0 12:1 13:0 14:6 15:0 17:2 17:2 17:2 17:2 17:2 17:2
21:11 21:42 21:42 21:42 21:42 21:42 
22:35 22:76 22:27 22:28 22:25 22:15 24:1888
25:9 33:322 33:452 33:452 33:452 33:452 33:452 35:14

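So a feature sometimes has multiple values, which may be the same or different, and the website says:

Some categorical features are multi-valued (order does not matter)

We do not know the semantics of the features or of the values assigned to them (they are hidden from the public for privacy reasons).

We only know:

Label indicates whether the user clicked the recommended ad.
Features describe the product that was recommended to the user.
Task is to predict the probability that the user clicks, given a product ad.

Any comments on the following questions are appreciated:

What is the best way to import such a dataset into a Python data structure?
How should multi-valued features be handled, especially when the same value is repeated k times?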

Answer 1

Dre*_*rey 8

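This is a very general question, but as far as I can tell, if you want to use ML methods it is sensible to first transform the data into a tidy data format.

As far as I can tell from the documentation that @RootTwo nicely referenced in his comment, you are actually dealing with two datasets: one flat table of examples and one flat table of products. (You can join the two later if you need to.)

Let's first create some parsers that decode the different kinds of lines into reasonably informative data structures.

For lines with examples we can use: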

def process_example(example_line):
    # example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1}
    #    0        1         2           3               4          5            6               7 ...
    feature_names = ['ex_id', 'hash', 'clicked', 'propensity', 'slots', 'candidates'] + \
                    ['display_feature_' + str(i) for i in range(1, 11)]
    are_numbers = [1, 3, 4, 5, 6]
    parts = example_line.split(' ')
    parts[1] = parts[1].replace(':', '')
    for i in are_numbers:
        parts[i] = float(parts[i])
        if parts[i].is_integer():
            parts[i] = int(parts[i])
    features = [int(ft.split(':')[1]) for ft in parts[7:]]
    return dict(zip(feature_names, parts[1:7] + features))

This approach is hacky but gets the job done: it parses the features and casts them to numbers where possible. The output looks like:

{'ex_id': 20184824,
 'hash': '57548fae76b0aa2f2e0d96c40ac6ae3057548faee00912d106fc65fc1fa92d68',
 'clicked': 0,
 'propensity': 1.416489e-07,
 'slots': 6,
 'candidates': 30,
 'display_feature_1': 728,
 'display_feature_2': 90,
 'display_feature_3': 1,
 'display_feature_4': 10,
 'display_feature_5': 16,
 'display_feature_6': 1,
 'display_feature_7': 26,
 'display_feature_8': 11,
 'display_feature_9': 597,
 'display_feature_10': 7}

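Next come the product lines. As you mentioned, the problem is that values occur multiple times. I think it is sensible to aggregate each unique feature-value pair by its frequency. No information is lost, and it helps us encode tidy samples. That should address your second question.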
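
As a rough illustration of that aggregation, here is a minimal sketch that uses only the standard library and a few tokens taken from the row in the question. The helper name `aggregate_feature_values` is made up for this example; it is not part of the answer's code.

```python
from collections import Counter

def aggregate_feature_values(tokens):
    """Turn 'feature:value' tokens into one record per unique pair, with its frequency."""
    # Each token becomes an (feature_id, value) tuple of ints.
    pairs = [tuple(int(part) for part in token.split(':')) for token in tokens]
    counts = Counter(pairs)
    return [{'feature': 'product_feature_%d' % feature,
             'value': value,
             'frequency': frequency}
            for (feature, value), frequency in counts.items()]

# Tokens copied from the sample row in the question (features 17 and 21).
tokens = "17:2 17:2 17:2 17:2 17:2 17:2 21:11 21:42 21:42 21:42 21:42 21:42".split()
records = aggregate_feature_values(tokens)
for record in records:
    print(record)
```

Repeated values collapse into a single record whose `frequency` keeps the repetition count, while distinct values of the same feature stay as separate records.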

The aggregation basically extracts the label and the features for each example (here, the result for line 40):

[{'feature': 'product_feature_11',
  'value': 0,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_12',
  'value': 1,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_13',
  'value': 0,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_14',
  'value': 2,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_15',
  'value': 0,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_17',
  'value': 2,
  'frequency': 2,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_21',
  'value': 55,
  'frequency': 2,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_22',
  'value': 14,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_22',
  'value': 54,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_24',
  'value': 3039,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_25',
  'value': 721,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_33',
  'value': 386,
  'frequency': 2,
  'label': 0,
  'ex_id': 19168103},
 {'feature': 'product_feature_35',
  'value': 963,
  'frequency': 1,
  'label': 0,
  'ex_id': 19168103}]

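So, as you process the stream line by line, you can decide whether to map each line as an example or as a product. Product lines can be parsed with: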

import toolz  # pip install toolz

def process_product(product_line):
    # ${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...
    parts = product_line.split()  # split on any whitespace so trailing blanks are ignored
    meta = {'label': int(parts[0]),
            'ex_id': int(parts[1].split(':')[1])}
    # extract features of the form ${productFeat1_1}:${v1_1} into (name, value) tuples
    features = [('product_feature_' + str(i), int(v))
                for i, v in map(lambda x: x.split(':'), parts[2:])]
    # count each unique value and transform them into
    # feature_name X feature_value X feature_frequency
    products = [dict(zip(['feature', 'value', 'frequency'], (*k, v)))
                for k, v in toolz.countby(toolz.identity, features).items()]
    # now merge the meta information into each product
    return [dict(p, **meta) for p in products]

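I decided to use a generator for the dispatch step (the `process_stream` called below), because it helps to process the data in a functional way if you decide not to use pandas. Otherwise a list comprehension will serve you well.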
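
The definition of `process_stream` is not shown here, so the following is only a guess at what such a generator might look like. It assumes that example lines start with the word `example` (as in the format comment of `process_example`) and that every other non-empty line is a product line. The two parser parameters are an addition made so the sketch runs on its own; in the answer's code they would be `process_example` and `process_product`.

```python
def process_stream(lines, parse_example, parse_product):
    """Lazily yield one parsed record per non-empty line.

    Assumption: example lines start with 'example'; all other lines are products.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('example'):
            yield parse_example(line)
        else:
            yield parse_product(line)

# Stand-in parsers and made-up lines, only for demonstration.
demo_lines = ["example 1: abc 0 0.5 2 3 1:10", "0 exid:1 11:0 12:1", ""]
parsed = list(process_stream(
    demo_lines,
    parse_example=lambda line: {'hash': line.split()[2]},
    parse_product=lambda line: [{'raw': line}],
))
print(parsed)
```

Binding the two parsers once, for example with `functools.partial`, gives a single-argument callable that takes just the stream of lines.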

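Now for the most interesting part: we read the lines one by one from the given (sample) URL and assign each to its corresponding dataset (examples or products). I will use reduce here because it is fun :-). I won't go into detail about what map/reduce actually does (that is up to you). You can always use a simple for loop instead.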

import urllib.request
import toolz  # pip install toolz

lines_stream = (line.decode("utf-8").strip() 
                for line in urllib.request.urlopen('http://www.cs.cornell.edu/~adith/Criteo/sample.txt'))

# if you prefer a concise but hacky approach you could do:
# blubb = list(toolz.partitionby(lambda x: 'hash' in x, process_stream(lines_stream)))
# examples_only = blubb[slice(0, len(blubb), 2)]
# products_only = blubb[slice(1, len(blubb), 2)]

# but to introduce a functional approach, let's implement a reducer
def dataset_reducer(datasets, content):
    # example records are dicts with a 'hash' key; product records are lists
    which_one = 0 if 'hash' in content else 1
    datasets[which_one].append(content)
    return datasets

# and process the stream using the reducer, which results in two datasets:
examples_dataset, product_dataset = toolz.reduce(dataset_reducer, process_stream(lines_stream), [[], []])
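
For comparison, the same split can be written as a plain for loop. This is a minimal sketch on made-up records; the helper name `split_datasets` is hypothetical. It relies on the same convention as the reducer: example records are dicts containing a `'hash'` key, and product records are lists of dicts.

```python
def split_datasets(records):
    """Separate example dicts from product lists with an explicit loop."""
    examples, products = [], []
    for record in records:
        if 'hash' in record:   # dict with a 'hash' key -> example
            examples.append(record)
        else:                  # list of per-feature dicts -> product
            products.append(record)
    return examples, products

# Made-up records that mimic the two shapes produced by the parsers.
records = [
    {'hash': 'abc', 'ex_id': 1},
    [{'ex_id': 1, 'feature': 'product_feature_11'}],
    {'hash': 'def', 'ex_id': 2},
    [{'ex_id': 2, 'feature': 'product_feature_12'}],
]
examples, products = split_datasets(records)
print(len(examples), len(products))
```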

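From here you can cast the datasets into tidy data frames that you can use for machine learning. Beware of NaN/missing values, distributions, and so on. You can join the two datasets with merge to get one big flat table of samples X features. Then you will be more or less able to use the various methods from scikit-learn.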

import pandas

examples_dataset = pandas.DataFrame(examples_dataset)
product_dataset = pandas.concat(pandas.DataFrame(p) for p in product_dataset)

Examples dataset

   candidates  clicked  ...    propensity  slots
0          30        0  ...  1.416489e-07      6
1          23        0  ...  5.344958e-01      3
2          23        1  ...  1.774762e-04      3
3          28        0  ...  1.158855e-04      6

Products dataset (product_dataset.sample(10))

       ex_id             feature  frequency  label  value
6   10244535  product_feature_21          1      0     10
9   37375474  product_feature_25          1      0      4
6   44432959  product_feature_25          1      0    263
15  62131356  product_feature_35          1      0     14
8   50383824  product_feature_24          1      0    228
8   63624159  product_feature_20          1      0     30
3   99375433  product_feature_14          1      0      0
9    3389658  product_feature_25          1      0     43
20  59461725  product_feature_31          8      0      4
11  17247719  product_feature_21          3      0      5

Be careful with product_dataset. You can "pivot" the features stored in its rows into columns (see the pandas reshaping documentation).
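
One possible way to do the pivot and the merge is sketched below on small made-up frames. The `column` naming scheme (`feature=value`) and the choice of `aggfunc='sum'` are assumptions made for this illustration; they are not taken from the answer.

```python
import pandas as pd

# Toy stand-ins for examples_dataset and product_dataset.
examples = pd.DataFrame([
    {'ex_id': 1, 'clicked': 0, 'slots': 6},
    {'ex_id': 2, 'clicked': 1, 'slots': 3},
])
products = pd.DataFrame([
    {'ex_id': 1, 'feature': 'product_feature_17', 'value': 2, 'frequency': 6},
    {'ex_id': 1, 'feature': 'product_feature_21', 'value': 11, 'frequency': 1},
    {'ex_id': 1, 'feature': 'product_feature_21', 'value': 42, 'frequency': 5},
    {'ex_id': 2, 'feature': 'product_feature_17', 'value': 2, 'frequency': 1},
])

# One column per (feature, value) pair, so a multi-valued feature keeps all its values.
products['column'] = products['feature'] + '=' + products['value'].astype(str)
wide = products.pivot_table(index='ex_id', columns='column', values='frequency',
                            aggfunc='sum', fill_value=0).reset_index()

# Join on the shared key to get one samples x features table.
flat = examples.merge(wide, on='ex_id', how='left')
print(flat)
```

Note that rows sharing an `ex_id` are summed together here. If several product lines belong to one example and should stay separate, a per-line key would have to be added to the index before pivoting.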

Archived:	6 years, 6 months ago
Views:	310
Last recorded:	6 years, 5 months ago