Split a CSV file by the value of a column - Apache NiFi

fit*_*ida 1 csv apache apache-nifi

I have a CSV file with the following structure.

ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
MARKETING,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,

As you can see, there is no header; for your information, the first column represents the sector the data comes from.

What I have to do is, depending on the first column's value (for example MARKETING or ERP), send all those rows to a different output directory.

For example, all rows with ERP go to /output/ERP/ and all rows with MARKETING go to /output/marketing/.

I have an idea of how to do it, but my problem is with the RouteOnAttribute processor I am using: I don't know how to refer to the first column and specify its value (ERP or MARKETING) so that I can later send the rows to the correct output directory.

Here is my schema.

[image: screenshot of the flow]

Thanks.

Shu*_*Shu 5

Use the PartitionRecord processor for this case.

Configure the processor with Record Reader/Writer controller services. Even though there is no header, you can use col1, col2 ... etc. as field names in the Avro schema.
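As a rough illustration of what such a schema could look like, here is a small Python sketch that generates an Avro record schema with generic field names for a header-less CSV. The function name `headerless_csv_schema`, the record name `csv_row`, and the choice of nullable string fields are my own assumptions, not something NiFi requires; adjust the types to your data.

```python
import json

def headerless_csv_schema(num_columns, record_name="csv_row"):
    """Build an Avro record schema with generic field names col1..colN.

    Both the function name and record_name are hypothetical helpers
    used only for this illustration.
    """
    fields = [
        # Nullable strings (an assumption), so empty CSV cells stay valid.
        {"name": f"col{i}", "type": ["null", "string"]}
        for i in range(1, num_columns + 1)
    ]
    return {"type": "record", "name": record_name, "fields": fields}

# First sample row from the question; the trailing comma yields an
# empty last column, so it is counted as a column too.
sample_row = (
    "ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,"
    "ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,"
)

schema = headerless_csv_schema(len(sample_row.split(",")))
print(json.dumps(schema, indent=2))
```

The printed JSON could then be pasted as the schema text of the record reader, with col1 being the sector column.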

  • Add a new property that defines the field the processor should use to partition the flowfile.

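The PartitionRecord processor now adds the partition field as an attribute holding its value; by making use of this attribute value we can dynamically store the files in their respective directories.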

Flow:

1.GetFile
2.PartitionRecord
3.PutFile //configure directory as /output/${<keep_partition_field_name_here>}

Refer to this link for how to configure and use the PartitionRecord processor.
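To make the idea behind the flow above concrete, here is a minimal Python sketch of the same logic outside NiFi: group the rows by the first column, then write each group under a directory named after that value. This only mimics the behaviour and is not how NiFi implements it; the function names and the `part.csv` file name are hypothetical.

```python
import csv
import io
import tempfile
from collections import defaultdict
from pathlib import Path

def partition_rows(csv_text, column_index=0):
    """Group CSV rows by the value found in one column."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip blank lines
            groups[row[column_index]].append(row)
    return dict(groups)

def write_partitions(groups, output_root):
    """Write each group to <output_root>/<value>/part.csv (hypothetical name)."""
    written = []
    for value, rows in groups.items():
        target_dir = Path(output_root) / value
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / "part.csv"
        with target.open("w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        written.append(target)
    return written

# The two sample rows from the question.
SAMPLE = (
    "ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,"
    "ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,\n"
    "MARKETING,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,"
    "ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,\n"
)

groups = partition_rows(SAMPLE)
# A temporary directory stands in for /output in this sketch.
paths = write_partitions(groups, tempfile.mkdtemp())
for path in paths:
    print(path)
```

In the NiFi flow, the grouping corresponds to PartitionRecord and the directory naming corresponds to the expression in the PutFile directory property.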

(or)

Old approach:

Use the RouteText processor instead of the SplitText + RouteOnAttribute processors.

Configure the RouteText processor as

[image: RouteText processor configuration]

Connect the ERP/MARKETING relationships to the PutFile processor, and use the RouteText.Route attribute value to dynamically store the files in the directories.

Flow:

1.GetFile
2.RouteText
3.PutFile //configure directory as /output/${RouteText.Route}/

You can also use the Grouping Regular Expression property value to create the partitions.
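For comparison, the line-by-line routing of the old approach can be sketched in plain Python as below. The regular expressions are my assumption of what the routing rules might look like (the original configuration was only shown in a screenshot), and the sample lines are shortened; the SALES line is made up to show the unmatched case.

```python
import re
from collections import defaultdict

# Hypothetical routing rules: route name -> regex tested against each line.
ROUTES = {
    "ERP": re.compile(r"^ERP,.*"),
    "MARKETING": re.compile(r"^MARKETING,.*"),
}

def route_text(text, routes=ROUTES):
    """Send every line to each route whose regex matches it.

    Lines that match no route are collected under "unmatched".
    """
    routed = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        matched = False
        for name, pattern in routes.items():
            if pattern.match(line):
                routed[name].append(line)
                matched = True
        if not matched:
            routed["unmatched"].append(line)
    return dict(routed)

# Shortened sample lines; the SALES line is invented for illustration.
lines = "ERP,J,JACKSON\nMARKETING,J,JACKSON\nSALES,J,JACKSON\n"
routed = route_text(lines)
for route, members in routed.items():
    print(route, members)
```

Each key of the result plays the role of the RouteText.Route attribute value, which is what the PutFile directory expression picks up.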

Note

Using the PartitionRecord processor will be more efficient than using the RouteText processor.