Accessing a parameter's value directly in an AWS SageMaker Pipeline

Question

Inside the function that returns the Pipeline, a Parameter is defined, for example (taken from here):

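from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep


def get_pipeline(...):

    foo = ParameterString(
        name="Foo", default_value="foo"
    )

    # pipeline's steps definition here
    step = ProcessingStep(
        name=...,
        job_arguments=["--foo", foo],
    )

    # build the pipeline first, then return it
    pipeline = Pipeline(
        name=pipeline_name,
        parameters=[...],
        steps=[...],
        sagemaker_session=sagemaker_session,
    )
    return pipeline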


I know I can access the parameter's default value simply by calling foo.default_value, but how do I access its value when the default has been overridden at run time, for example by calling the following?

pipeline.start(parameters=dict(Foo='bar'))

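My assumption is that in this case I don't want to read the default value, because it has been overridden, but the Parameter API is very limited and exposes nothing beyond the expected name and default_value.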

Answer 1

Giu*_*ano 3

As written in the documentation:

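Pipeline parameters can only be evaluated at run time. If a pipeline parameter needs to be evaluated at compile time, then it will throw an exception.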
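To make the compile-time versus run-time distinction concrete, here is a minimal sketch that uses only the standard library. `ToyParameter`, `compile_definition`, and `resolve` are hypothetical names, not part of the SageMaker SDK, and the `{"Get": "Parameters.Foo"}` shape is an assumption modelled on how a pipeline definition refers to its parameters.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyParameter:
    """Hypothetical stand-in for a pipeline parameter (not the SageMaker class)."""

    name: str
    default_value: str

    @property
    def expr(self) -> dict:
        # At definition ("compile") time only a reference is emitted, never a value.
        return {"Get": f"Parameters.{self.name}"}


def compile_definition(arguments: list) -> str:
    """Serialize step arguments, replacing parameters with references."""
    rendered = [a.expr if isinstance(a, ToyParameter) else a for a in arguments]
    return json.dumps({"ContainerArguments": rendered})


def resolve(definition: str, parameters: list, overrides: dict) -> list:
    """Simulate run time: substitute overrides, falling back to defaults."""
    values = {p.name: overrides.get(p.name, p.default_value) for p in parameters}
    resolved = []
    for item in json.loads(definition)["ContainerArguments"]:
        if isinstance(item, dict) and "Get" in item:
            # "Parameters.Foo" -> "Foo"
            resolved.append(values[item["Get"].split(".", 1)[1]])
        else:
            resolved.append(item)
    return resolved


foo = ToyParameter(name="Foo", default_value="foo")
definition = compile_definition(["--foo", foo])
print(definition)
print(resolve(definition, [foo], overrides={"Foo": "bar"}))
```

In this toy model the override `bar` exists only inside `resolve`, which stands in for the service executing the pipeline; the Python code that builds the definition never sees it.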

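A way to use parameters as ProcessingStep arguments

If your requirement is to use them in a pipeline step, specifically a ProcessingStep, you must pass the arguments through the processor's run method (rather than through job_arguments).

See this official example.

When a pipeline_session is passed as the sagemaker_session, calling .run() does not launch the processing job; it returns the arguments needed to run the job as a step in the pipeline.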
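The idea of a session that captures a request instead of executing it can be sketched with the standard library alone. Everything below (`ToyPipelineSession`, `ToyLiveSession`, `ToyProcessor`, the script name) is hypothetical and only illustrates the pattern; it is not how the SageMaker SDK is implemented.

```python
class ToyLiveSession:
    """Hypothetical session that 'launches' a job immediately."""

    def __init__(self):
        self.started_jobs = []

    def submit(self, request: dict):
        self.started_jobs.append(request["code"])
        return None  # the job is already running, nothing to hand back


class ToyPipelineSession:
    """Hypothetical session that only records what would be launched."""

    def submit(self, request: dict) -> dict:
        return request  # handed back so a step can store it for later


class ToyProcessor:
    """Hypothetical processor whose run() behaviour depends on its session."""

    def __init__(self, session):
        self.session = session

    def run(self, code: str, arguments: list):
        return self.session.submit({"code": code, "arguments": list(arguments)})


live = ToyLiveSession()
live_result = ToyProcessor(live).run("preprocess.py", ["--foo", "foo"])

step_args = ToyProcessor(ToyPipelineSession()).run("preprocess.py", ["--foo", "foo"])

print(live.started_jobs, live_result)
print(step_args)
```

With the capturing session, `run()` has no side effect and its return value is what a step definition would keep, which mirrors the behaviour described above.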

step_process = ProcessingStep(
    name=...,  # step name, elided here as in the question
    step_args=your_processor.run(
        # ...
        arguments=["--foo", foo],
    ),
)


There are also some limitations: not every built-in Python operation can be applied to a parameter.

Taken from the example linked above:

# An example of what not to do
my_string = "s3://{}/training".format(ParameterString(name="MyBucket", default_value=""))

# Another example of what not to do
int_param = str(ParameterInteger(name="MyBucket", default_value=1))

# Instead, if you want to convert the parameter to string type, do
int_param.to_string()

# A workaround is to use Join
my_string = Join(on="", values=[
    "s3://",
    ParameterString(name="MyBucket", default_value=""),
    "/training"]
)

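Why eager formatting fails while a Join-style object works can be shown with a small standard-library sketch. `ToyParameter` and `ToyJoin` are hypothetical classes; raising `TypeError` from `__str__` is an assumption made here to surface the problem, and the `Std:Join` dictionary shape is likewise an assumption about the rendered definition.

```python
class ToyParameter:
    """Hypothetical placeholder; its real value is unknown until execution."""

    def __init__(self, name: str, default_value: str = ""):
        self.name = name
        self.default_value = default_value

    @property
    def expr(self) -> dict:
        return {"Get": f"Parameters.{self.name}"}

    def __str__(self) -> str:
        # Assumption for this sketch: refuse eager string conversion outright.
        raise TypeError(f"parameter {self.name!r} has no value at definition time")


class ToyJoin:
    """Hypothetical deferred concatenation, in the spirit of Join."""

    def __init__(self, on: str, values: list):
        self.on = on
        self.values = values

    @property
    def expr(self) -> dict:
        # Parameters stay references; literals are kept as they are.
        rendered = [v.expr if isinstance(v, ToyParameter) else v for v in self.values]
        return {"Std:Join": {"On": self.on, "Values": rendered}}

    def resolve(self, runtime_values: dict) -> str:
        # Simulates execution time, when the actual values are finally known.
        parts = [
            runtime_values.get(v.name, v.default_value) if isinstance(v, ToyParameter) else v
            for v in self.values
        ]
        return self.on.join(parts)


bucket = ToyParameter(name="MyBucket")

try:
    eager = "s3://{}/training".format(bucket)  # eager formatting needs a value now
except TypeError as err:
    eager = None
    print("eager formatting failed:", err)

deferred = ToyJoin(on="", values=["s3://", bucket, "/training"])
print(deferred.expr)
print(deferred.resolve({"MyBucket": "my-bucket"}))
```

The deferred object carries the reference through to execution, whereas `str.format` has to produce a finished string while the pipeline is still being defined.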

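A way to use a parameter to manipulate the pipeline internally

Personally, I prefer to pass the value directly when building the pipeline definition, before starting it: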

def get_pipeline(my_param_hardcoded, ...):

    # here you can use my_param_hardcoded as a plain Python value

    my_param = ParameterString(
        name="Foo", default_value="foo"
    )

    # pipeline's steps definition here

    pipeline = Pipeline(
        name=pipeline_name,
        parameters=[my_param, ...],
        steps=[...],
        sagemaker_session=sagemaker_session,
    )
    return pipeline


pipeline = get_pipeline(my_param_hardcoded, ...)
pipeline.start(parameters=dict(Foo=my_param_hardcoded))


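Obviously this is not a particularly elegant approach, but I don't think it is conceptually wrong: after all, it is a value that is used to manipulate the pipeline itself and that cannot be prepared in advance (for example in a configuration file).

One example is building a name based on pipeline_name, which is passed explicitly both to get_pipeline() and as a pipeline parameter. If we want a custom name for a step, it can be produced by concatenating two strings; that cannot happen at run time, so it has to be done with this trick.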
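The trick can be sketched without the SDK by treating the pipeline definition as a plain dictionary. `get_pipeline_definition`, `start_overrides`, the `-preprocess` suffix, and the dictionary keys are all hypothetical and chosen only for illustration.

```python
def get_pipeline_definition(pipeline_name: str) -> dict:
    """Build a toy definition; the step name is fixed at definition time."""
    # Ordinary string concatenation works because pipeline_name is a real str here.
    step_name = f"{pipeline_name}-preprocess"
    return {
        "PipelineName": pipeline_name,
        "Parameters": [{"Name": "PipelineName", "DefaultValue": pipeline_name}],
        "Steps": [
            {
                "Name": step_name,
                # Inside the step the value is still only a run-time reference.
                "Arguments": ["--pipeline-name", {"Get": "Parameters.PipelineName"}],
            }
        ],
    }


def start_overrides(pipeline_name: str) -> dict:
    """The same value is passed again when the pipeline is started."""
    return {"PipelineName": pipeline_name}


name = "churn-training"
definition = get_pipeline_definition(name)
overrides = start_overrides(name)

print(definition["Steps"][0]["Name"])
print(overrides)
```

The value travels along two paths: as a normal function argument it shapes the definition (the step name), and as a start-time override it reaches the running job.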

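Why is passing a parameter that I want to use, e.g. as an argument to a ProcessingStep (see the edited post), considered something needed at compile time rather than at run time? (2 upvotes)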
