When I navigate to the AWS Data Pipeline console, it shows this banner:
Please note that the Data Pipeline service is in maintenance mode and we are not planning to expand the service to new regions. We plan to remove console access by February 28, 2023.
Is the AWS Data Pipeline service going away in the near future?
amazon-web-services amazon-data-pipeline deprecation-warning
I'm trying to transfer CSV data from an S3 bucket to DynamoDB using AWS Data Pipeline. Below is my pipeline script, which is not working properly.
CSV file structure:
Name, Designation,Company
A,TL,C1
B,Prog, C2
DynamoDB: N_Table, with Name as the hash key
{
"objects": [
{
"id": "Default",
"scheduleType": "cron",
"name": "Default",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DynamoDBDataNodeId635",
"schedule": {
"ref": "ScheduleId639"
},
"tableName": "N_Table",
"name": "MyDynamoDBData",
"type": "DynamoDBDataNode"
},
{
"emrLogUri": "s3://onlycsv/error",
"id": "EmrClusterId636",
"schedule": {
"ref": "ScheduleId639"
},
"masterInstanceType": "m1.small",
"coreInstanceType": "m1.xlarge",
"enableDebugging": "true",
"installHive": "latest",
"name": "ImportCluster",
"coreInstanceCount": "1",
"logUri": "s3://onlycsv/error1",
"type": "EmrCluster"
},
{
"id": "S3DataNodeId643",
"schedule": {
"ref": "ScheduleId639"
},
"directoryPath": "s3://onlycsv/data.csv",
"name": "MyS3Data",
"dataFormat": {
"ref": …
I'm trying to copy a bunch of csv files from S3 to Redshift using RedShiftCopyActivity and Data Pipeline.
This works fine as long as the csv structure matches the table structure. In my case the csv has fewer columns than the table, and RedShiftCopyActivity then fails with a "Delimiter not found" error in stl_load_errors.
I would like to use the "columns" option of the Redshift copy command. That way I could make this work, but the columns part of the copy command does not seem to be exposed by RedShiftCopyActivity.
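For clarity, what is being referred to is a COPY with an explicit column list. As a rough sketch only, such a statement could also be issued from a SqlActivity rather than RedShiftCopyActivity; the id, table name, columns and credentials below are placeholders, and the workaround itself is an assumption, not something from the original setup:
{
  "id": "CopyWithColumnsActivity",
  "name": "CopyWithColumnsActivity",
  "type": "SqlActivity",
  "runsOn": { "ref": "Ec2Instance" },
  "database": { "ref": "RedshiftCluster" },
  "script": "copy my_table (name, designation, company) from 's3://my-bucket/input/' credentials 'aws_access_key_id=...;aws_secret_access_key=...' delimiter ',';"
}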
Any suggestions?
All tips warmly welcomed.
Thanks a lot.
Peter
amazon-s3 amazon-web-services amazon-redshift amazon-data-pipeline
Following the step-by-step instructions on this page to the letter, I'm trying to export the contents of one of my DynamoDB tables to an S3 bucket. I created a pipeline exactly as instructed, but it fails to run. It doesn't seem to be able to identify/run the EC2 resource needed to perform the export. When I access EMR through the AWS Console, I see entries like this:
Cluster: df-0..._@EmrClusterForBackup_2015-03-06T00:33:04Terminated with errorsEMR service role arn:aws:iam::...:role/DataPipelineDefaultRole is invalid
Why am I getting this message? Do I need to set up/configure something else for the pipeline?
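Note that the error names the EMR service role, and a role's trust relationship (which services are allowed to assume it) is checked separately from the permission statements shown in the update below. As an illustrative sketch only, not taken from this account, a trust policy that lets both Data Pipeline and EMR assume DataPipelineDefaultRole looks roughly like this:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": ["datapipeline.amazonaws.com", "elasticmapreduce.amazonaws.com"]
    },
    "Action": "sts:AssumeRole"
  }]
}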
Update: Under IAM -> Roles in the AWS Console I see this for DataPipelineDefaultResourceRole:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"s3:List*",
"s3:Put*",
"s3:Get*",
"s3:DeleteObject",
"dynamodb:DescribeTable",
"dynamodb:Scan",
"dynamodb:Query",
"dynamodb:GetItem",
"dynamodb:BatchGetItem",
"dynamodb:UpdateTable",
"rds:DescribeDBInstances",
"rds:DescribeDBSecurityGroups",
"redshift:DescribeClusters",
"redshift:DescribeClusterSecurityGroups",
"cloudwatch:PutMetricData",
"datapipeline:PollForTask",
"datapipeline:ReportTaskProgress",
"datapipeline:SetTaskStatus",
"datapipeline:PollForTask",
"datapipeline:ReportTaskRunnerHeartbeat"
],
"Resource": ["*"]
}]
}
And this for DataPipelineDefaultRole:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"s3:List*",
"s3:Put*",
"s3:Get*",
"s3:DeleteObject",
"dynamodb:DescribeTable",
"dynamodb:Scan",
"dynamodb:Query",
"dynamodb:GetItem",
"dynamodb:BatchGetItem",
"dynamodb:UpdateTable",
"ec2:DescribeInstances",
"ec2:DescribeSecurityGroups",
"ec2:RunInstances",
"ec2:CreateTags",
"ec2:StartInstances", …
export amazon-emr amazon-dynamodb amazon-iam amazon-data-pipeline
I don't need Hive or Pig, and by default Amazon Data Pipeline installs them on any EMR cluster it spins up. This makes testing take longer than it should. Any ideas on how to disable the installation?
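As a hedged sketch only: on a release-label based EmrCluster the installed applications are listed explicitly, so a minimal cluster definition might look like the following, assuming the applications field controls what gets installed on such clusters; the id, instance types and application list are placeholders:
{
  "id": "MinimalEmrCluster",
  "name": "MinimalEmrCluster",
  "type": "EmrCluster",
  "releaseLabel": "emr-5.13.0",
  "applications": ["spark"],
  "masterInstanceType": "m3.xlarge",
  "coreInstanceType": "m3.xlarge",
  "coreInstanceCount": "1"
}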
I'm using the AWS Data Pipeline service to pipe data from an RDS MySql database to s3 and then on into Redshift, which works nicely.
However, I also have data living in an RDS Postgres instance which I would like to pipe the same way, but I'm having a hard time setting up the jdbc connection. If this is not supported, is there a workaround?
"connectionString": "jdbc:postgresql://THE_RDS_INSTANCE:5432/THE_DB"
postgresql amazon-web-services amazon-redshift amazon-data-pipeline
When trying to use script arguments in a sqlActivity:
{
"id" : "ActivityId_3zboU",
"schedule" : { "ref" : "DefaultSchedule" },
"scriptUri" : "s3://location_of_script/unload.sql",
"name" : "unload",
"runsOn" : { "ref" : "Ec2Instance" },
"scriptArgument" : [ "'s3://location_of_unload/#format(minusDays(@scheduledStartTime,1),'YYYY/MM/dd/hhmm/')}'", "'aws_access_key_id=????;aws_secret_access_key=*******'" ],
"type" : "SqlActivity",
"dependsOn" : { "ref" : "ActivityId_YY69k" },
"database" : { "ref" : "RedshiftCluster" }
}
where the unload.sql script contains:
unload ('
select *
from tbl1
')
to ?
credentials ?
delimiter ',' GZIP;
Or:
unload ('
select *
from tbl1
')
to ?::VARCHAR(255)
credentials ?::VARCHAR(255)
delimiter ',' GZIP; …
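For comparison, a hedged sketch of an alternative shape: inlining the statement via the SqlActivity script field so that the pipeline expression is evaluated before the SQL ever reaches Redshift. The id, paths and credentials below are placeholders, and it is assumed here that script accepts the same #{...} expressions as other string fields:
{
  "id" : "ActivityId_unload_inline",
  "name" : "unload_inline",
  "type" : "SqlActivity",
  "schedule" : { "ref" : "DefaultSchedule" },
  "runsOn" : { "ref" : "Ec2Instance" },
  "database" : { "ref" : "RedshiftCluster" },
  "script" : "unload ('select * from tbl1') to 's3://location_of_unload/#{format(minusDays(@scheduledStartTime,1),'YYYY/MM/dd/hhmm/')}' credentials 'aws_access_key_id=...;aws_secret_access_key=...' delimiter ',' GZIP;"
}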
amazon-s3 amazon-web-services amazon-redshift amazon-data-pipeline
I'm using an activity in AWS Data Pipeline to try to move a file from one s3 location to another.
The command I'm using is:
(aws s3 mv s3://foobar/Tagger/out//*/lastImage.txt s3://foobar/Tagger/testInput/lastImage.txt)
But I'm getting the following error:
A client error (404) occurred when calling the HeadObject operation: Key "Tagger/out//*/lastImage.txt" does not exist
However, if I replace the "*" with a specific directory name, it works. The problem is that I won't always know the name of the directory, so I was hoping to use "*" as a wildcard.
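Since a non-recursive aws s3 mv is resolved against the literal key (hence the HeadObject 404 above), the usual CLI pattern for wildcard matching is the recursive form with exclude/include filters. A rough sketch wrapped in a ShellCommandActivity, with a placeholder id and resource, and with the caveat that matched files keep their relative paths under the destination prefix:
{
  "id": "MoveLastImageActivity",
  "name": "MoveLastImageActivity",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "Ec2Instance" },
  "command": "aws s3 mv s3://foobar/Tagger/out/ s3://foobar/Tagger/testInput/ --recursive --exclude '*' --include '*/lastImage.txt'"
}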
I want to use AWS Data Pipeline to transfer data from Postgres RDS to AWS S3. Does anybody know how this is done?
More precisely, I want to export a Postgres table to AWS S3 using Data Pipeline. The reason I'm using Data Pipeline is that I want to automate this process, and the export is going to run once every week.
Any other suggestions would also work.
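As a hedged sketch of one possible shape, not a confirmed recipe: a JdbcDatabase pointing at the Postgres instance, a SqlDataNode reading from it, and a CopyActivity writing to an S3DataNode on a weekly Schedule. All ids, endpoints, table names, paths and credentials are placeholders, and the referenced EC2 resource and data format objects are assumed to be defined elsewhere in the pipeline:
{
  "objects": [
    { "id": "PostgresDatabase", "type": "JdbcDatabase", "connectionString": "jdbc:postgresql://my-rds-endpoint:5432/my_db", "jdbcDriverClass": "org.postgresql.Driver", "username": "my_user", "*password": "my_password" },
    { "id": "WeeklySchedule", "type": "Schedule", "period": "1 week", "startDateTime": "2017-01-01T00:00:00" },
    { "id": "SourceTable", "type": "SqlDataNode", "database": { "ref": "PostgresDatabase" }, "table": "my_table", "selectQuery": "select * from my_table" },
    { "id": "S3Output", "type": "S3DataNode", "directoryPath": "s3://my-bucket/exports/", "dataFormat": { "ref": "MyCSVFormat" } },
    { "id": "ExportActivity", "type": "CopyActivity", "input": { "ref": "SourceTable" }, "output": { "ref": "S3Output" }, "runsOn": { "ref": "Ec2Instance" }, "schedule": { "ref": "WeeklySchedule" } }
  ]
}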
postgresql amazon-s3 amazon-web-services amazon-rds amazon-data-pipeline
I used to use the Data Pipeline template called Export DynamoDB table to S3 to export a DynamoDB table to a file. I recently updated all of my DynamoDB tables to on-demand provisioning, and the template no longer works. I'm pretty sure this is because the old template specifies a percentage of DynamoDB throughput to consume, which is irrelevant for on-demand tables.
I tried exporting the old template to JSON, removing the reference to throughput percentage consumption, and creating a new pipeline. However, this was unsuccessful.
Can anyone suggest how to convert the old-style, provisioned-throughput pipeline script into one that works with the new on-demand tables?
Here is my original script that used to run:
{
"objects": [
{
"name": "DDBSourceTable",
"id": "DDBSourceTable",
"type": "DynamoDBDataNode",
"tableName": "#{myDDBTableName}"
},
{
"name": "EmrClusterForBackup",
"coreInstanceCount": "1",
"coreInstanceType": "m3.xlarge",
"releaseLabel": "emr-5.13.0",
"masterInstanceType": "m3.xlarge",
"id": "EmrClusterForBackup",
"region": "#{myDDBRegion}",
"type": "EmrCluster"
},
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"output": {
"ref": "S3BackupLocation"
},
"input": {
"ref": "DDBSourceTable"
},
"maximumRetries": "2",
"name": "TableBackupActivity",
"step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
"id": "TableBackupActivity",
"runsOn": {
"ref": "EmrClusterForBackup"
},
"type": …
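For illustration only, not a confirmed fix: a variant of the cluster and backup activity objects with the readThroughputPercent argument dropped from the step and a newer EMR release label, on the assumption that a sufficiently recent emr-dynamodb-connector understands on-demand billing. Everything else mirrors the original objects:
{
  "name": "EmrClusterForBackup",
  "id": "EmrClusterForBackup",
  "type": "EmrCluster",
  "releaseLabel": "emr-5.24.0",
  "masterInstanceType": "m3.xlarge",
  "coreInstanceType": "m3.xlarge",
  "coreInstanceCount": "1",
  "region": "#{myDDBRegion}"
},
{
  "name": "TableBackupActivity",
  "id": "TableBackupActivity",
  "type": "EmrActivity",
  "maximumRetries": "2",
  "runsOn": { "ref": "EmrClusterForBackup" },
  "input": { "ref": "DDBSourceTable" },
  "output": { "ref": "S3BackupLocation" },
  "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName}"
}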