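After preprocessing, my data is a single table with several feature columns and one label column. I want to use featuretools.dfs to help me predict the label. Can I run it directly on this table, or do I need to split the single table into multiple tables?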
Max*_*ter 13
You can run DFS on a single table. For example, if you have a pandas DataFrame df with an index column 'index', you would write:
import featuretools as ft
es = ft.EntitySet('Transactions')
es.entity_from_dataframe(dataframe=df,
                         entity_id='log',
                         index='index')
fm, features = ft.dfs(entityset=es,
                      target_entity='log',
                      trans_primitives=['day', 'weekday', 'month'])
The resulting feature matrix looks like this:
In [1]: fm
Out[1]:
       location        pies sold  WEEKDAY(date)  MONTH(date)  DAY(date)
index
1      main street             3              4           12         29
2      main street             4              5           12         30
3      main street             5              6           12         31
4      arlington ave.         18              0            1          1
5      arlington ave.          1              1            1          2
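This applies "transform" primitives to your data. You will usually want to give ft.dfs more entities to work with, so that it can also use aggregation primitives. You can read about the difference between the two in our documentation.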
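To make the "transform" idea concrete, here is a minimal plain-Python sketch of what DAY, WEEKDAY and MONTH compute for each row. It is not how featuretools implements these primitives; the helper name transform_date is made up for illustration, and the dates are the ones from the example table in this answer.

```python
from datetime import date

# Dates taken from the example table in this answer.
dates = [date(2017, 12, 29), date(2017, 12, 30), date(2017, 12, 31),
         date(2018, 1, 1), date(2018, 1, 2)]


def transform_date(d):
    """Hypothetical helper: row-wise features that depend only on this row's date."""
    return {
        "DAY(date)": d.day,
        "WEEKDAY(date)": d.weekday(),  # Monday is 0, Sunday is 6
        "MONTH(date)": d.month,
    }


# One output row per input row: no information is shared between rows.
rows = [transform_date(d) for d in dates]
for r in rows:
    print(r)
```

The key property is that each output value depends only on its own row, which is why transform primitives work even with a single entity.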
The standard workflow is to normalize your single entity by an interesting categorical column. If your df is the single table
| index | location       | pies sold | date       |
|-------+----------------+-----------+------------|
| 1     | main street    | 3         | 2017-12-29 |
| 2     | main street    | 4         | 2017-12-30 |
| 3     | main street    | 5         | 2017-12-31 |
| 4     | arlington ave. | 18        | 2018-01-01 |
| 5     | arlington ave. | 1         | 2018-01-02 |
you would probably be interested in normalizing by location:
es.normalize_entity(base_entity_id='log',
                    new_entity_id='locations',
                    index='location')
Your new entity locations will have the table
| location       | first_log_time |
|----------------+----------------|
| main street    | 2017-12-29     |
| arlington ave. | 2018-01-01     |
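As a rough picture of what normalizing by location produces, the sketch below builds the same kind of table by hand: one row per distinct location, with the earliest date seen for it. This is only an illustration in plain Python, not the featuretools implementation; normalize_by is a hypothetical helper, and it assumes date is the time index of log, as in the full example in this answer.

```python
from datetime import date

# The single "log" table from this answer, as a list of dicts.
log = [
    {"index": 1, "location": "main street", "pies sold": 3, "date": date(2017, 12, 29)},
    {"index": 2, "location": "main street", "pies sold": 4, "date": date(2017, 12, 30)},
    {"index": 3, "location": "main street", "pies sold": 5, "date": date(2017, 12, 31)},
    {"index": 4, "location": "arlington ave.", "pies sold": 18, "date": date(2018, 1, 1)},
    {"index": 5, "location": "arlington ave.", "pies sold": 1, "date": date(2018, 1, 2)},
]


def normalize_by(rows, key, time_key):
    """Hypothetical helper: one row per distinct `key`, with its earliest `time_key`."""
    first_seen = {}
    for row in rows:
        k = row[key]
        if k not in first_seen or row[time_key] < first_seen[k]:
            first_seen[k] = row[time_key]
    return [{key: k, "first_log_time": t} for k, t in first_seen.items()]


locations = normalize_by(log, "location", "date")
for loc in locations:
    print(loc)
```

Each log row still carries its location value, which acts as the link from log to locations.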
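Normalizing makes features such as locations.SUM(log.pies sold) or locations.MEAN(log.pies sold) possible, which sum or average all values by location. You can see these features created in the example below: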
In [1]: import pandas as pd
   ...: import featuretools as ft
   ...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
   ...:                    'location': ['main street',
   ...:                                 'main street',
   ...:                                 'main street',
   ...:                                 'arlington ave.',
   ...:                                 'arlington ave.'],
   ...:                    'pies sold': [3, 4, 5, 18, 1]})
   ...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
   ...: df
   ...:
Out[1]:
index location pies sold date
0 1 main street 3 2017-12-29
1 2 main street 4 2017-12-30
2 3 main street 5 2017-12-31
3 4 arlington ave. 18 2018-01-01
4 5 arlington ave. 1 2018-01-02
In [2]: es = ft.EntitySet('Transactions')
   ...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index', time_index='date')
   ...: es.normalize_entity(base_entity_id='log', new_entity_id='locations', index='location')
   ...:
Out[2]:
Entityset: Transactions
Entities:
log [Rows: 5, Columns: 4]
locations [Rows: 2, Columns: 2]
Relationships:
log.location -> locations.location
In [3]: fm, features = ft.dfs(entityset=es,
   ...:                       target_entity='log',
   ...:                       agg_primitives=['sum', 'mean'],
   ...:                       trans_primitives=['day'])
   ...: fm
   ...:
Out[3]:
location pies sold DAY(date) locations.DAY(first_log_time) locations.MEAN(log.pies sold) locations.SUM(log.pies sold)
index
1 main street 3 29 29 4.0 12
2 main street 4 30 29 4.0 12
3 main street 5 31 29 4.0 12
4 arlington ave. 18 1 1 9.5 19
5 arlington ave. 1 2 1 9.5 19
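If you want to sanity-check the two aggregation columns in that output, they can be reproduced by hand. The sketch below is plain Python with the same five rows, not featuretools itself: it groups pies sold by location, computes the sum and mean per group, and then copies those values back onto every log row.

```python
from collections import defaultdict

# (index, location, pies sold) for the five rows used in this answer.
log = [
    (1, "main street", 3),
    (2, "main street", 4),
    (3, "main street", 5),
    (4, "arlington ave.", 18),
    (5, "arlington ave.", 1),
]

# Collect all "pies sold" values per location.
groups = defaultdict(list)
for _, location, pies in log:
    groups[location].append(pies)

# Aggregation step: many child rows collapse into one value per location.
agg = {
    loc: {"SUM": sum(values), "MEAN": sum(values) / len(values)}
    for loc, values in groups.items()
}

# Every log row then receives the aggregate of the location it belongs to.
feature_rows = [
    (idx, loc, pies, agg[loc]["MEAN"], agg[loc]["SUM"])
    for idx, loc, pies in log
]
for row in feature_rows:
    print(row)
```

Unlike the date features, these values are shared by all rows with the same location, which is why they only become available once a locations entity exists.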