Posts by Rah_mar

What do these red bars mean in a git file diff?

There is a red bar after the + sign. What is it?

14
Score

2
Answers

6450
Views

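GitHub Actions: create a matrix of multiple sequential jobs

I am trying to create a workflow like the one shown below, in which the matrix contains not just one job but several jobs for each environment we want to build, test, and deploy.

If a step fails in an environment, the subsequent steps for that environment should not run.

The matrix looks like ["Env A", "Env B", ... , "Env n"]

I want to avoid repeating all the jobs for each environment.
I cannot use a single matrix job for build, then a single matrix job for test, and so on, because the dependencies between steps within an environment would not be preserved.

Is there another way to do this without duplicating code?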

yaml github github-actions

8
Score

1
Answers

4308
Views

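Failed to set up sockets during startup. dbexit: rc: 48 error in MongoDB

I updated mongo to version 3.2 and now I get this error. I did not get it before; it only appeared after the update. I even tried stopping and starting the mongod service again, but it still shows the same error.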

rahul ~ $ mongod
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten] MongoDB starting : pid=6630 port=27017 dbpath=/data/db 64-bit host=rahulcomp24-HP-ENVY-15-Notebook-PC
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten] db version v3.2.0
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten] git version: 45d947729a0315accb6d4f15a6b06be6d9c19fe7
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten] allocator: tcmalloc
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten] modules: none
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten] build environment:
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten]     distmod: ubuntu1404
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten]     distarch: x86_64
2016-01-10T23:39:51.696+0530 I CONTROL  [initandlisten]     target_arch: x86_64
2016-01-10T23:39:51.696+0530 I …

6
Score

2
Answers

10k
Views

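Setting parameters in a Kedro notebook

Is it possible to override properties taken from the parameters.yaml file in a Kedro notebook?

I am trying to change parameter values dynamically in a notebook. I want users to be able to run the standard pipeline, but with customizable parameters. I don't want to change the YAML file; I only want to change the parameters for the lifetime of the notebook.

I tried editing the parameters in the context, but this had no effect.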

context.params.update({"test_param": 2})

Am I missing something, or is this not an intended use case?

5
Score

1
Answers

1020
Views

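How to handle huge datasets in Kedro

I have a fairly large (~200 GB, ~20M rows) raw JSONL dataset. I need to extract the important attributes from it and store the intermediate dataset as CSV for further conversion to HDF5, Parquet, and so on. Obviously, I can't use JSONDataSet to load the raw dataset, because it uses pandas.read_json under the hood, and using pandas for a dataset of this size sounds like a bad idea. So I am thinking of reading the raw dataset line by line, processing each line, and appending the processed data to the intermediate dataset.

What I can't figure out is how to make this compatible with AbstractDataSet and its _load and _save methods.

P.S. I know I could move this outside of Kedro's context and bring in the preprocessed dataset as the raw dataset, but that somewhat defeats the whole idea of a complete pipeline.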
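
The line-by-line approach described in the question can be sketched with the standard library alone. This is only an illustration of the streaming idea, not a Kedro dataset: the function names iter_jsonl and extract_to_csv and the field list are hypothetical, and how to wrap this in AbstractDataSet is left open, as in the question.

```python
import csv
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def iter_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line without loading the whole file."""
    with path.open("r", encoding="utf-8") as src:
        for line in src:
            line = line.strip()
            if line:
                yield json.loads(line)


def extract_to_csv(records: Iterable[Dict], out_path: Path, fields: List[str]) -> int:
    """Write only the selected fields of each record to CSV, one row at a time."""
    written = 0
    with out_path.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for record in records:
            # Missing attributes become empty cells; extra attributes are dropped.
            writer.writerow({name: record.get(name, "") for name in fields})
            written += 1
    return written
```

Because iter_jsonl is a generator, memory use stays proportional to a single record rather than to the 200 GB file.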

5
Score

1
Answers

443
Views

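Unit testing with get_current_context

I want to build a unit test for a function that uses get_current_context in Apache Airflow. The function is used in several tasks to create a file name that those tasks read from and write to.

Here is an example of the function: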

def get_filename():
    from airflow.operators.python import get_current_context

    context = get_current_context()
    dag_id = context['dag'].__dict__['_dag_id']
    log_time = context['data_interval_start'].strftime("%Y-%m-%d_%H-%M-%S")
    log_file = f'/path/logs/{dag_id}/{log_time}.txt'
    return log_file

How do I set up the context in a unit test so that this function can execute? I don't even know where to start.
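
One possible starting point is to replace get_current_context with a stub that returns a hand-built dictionary. The sketch below assumes Airflow may not even be installed in the test environment, so it temporarily registers stand-in modules in sys.modules; the helper call_with_fake_context and the values example_dag and the 2024 timestamp are made up for illustration. The function under test is copied from the question unchanged.

```python
import sys
import types
from datetime import datetime
from unittest import mock


def get_filename():
    # Function under test, copied from the question (imports Airflow lazily).
    from airflow.operators.python import get_current_context

    context = get_current_context()
    dag_id = context['dag'].__dict__['_dag_id']
    log_time = context['data_interval_start'].strftime("%Y-%m-%d_%H-%M-%S")
    log_file = f'/path/logs/{dag_id}/{log_time}.txt'
    return log_file


def call_with_fake_context(func, context):
    """Run func while get_current_context() returns the given dict (hypothetical helper)."""
    fake_python = types.ModuleType("airflow.operators.python")
    fake_python.get_current_context = lambda: context
    fake_operators = types.ModuleType("airflow.operators")
    fake_operators.python = fake_python
    fake_airflow = types.ModuleType("airflow")
    fake_airflow.operators = fake_operators
    stubs = {
        "airflow": fake_airflow,
        "airflow.operators": fake_operators,
        "airflow.operators.python": fake_python,
    }
    # patch.dict restores the original sys.modules entries on exit.
    with mock.patch.dict(sys.modules, stubs):
        return func()


def test_get_filename():
    context = {
        # Only the two keys the function reads are provided.
        "dag": types.SimpleNamespace(_dag_id="example_dag"),
        "data_interval_start": datetime(2024, 1, 2, 3, 4, 5),
    }
    result = call_with_fake_context(get_filename, context)
    assert result == "/path/logs/example_dag/2024-01-02_03-04-05.txt"
```

If Airflow is installed where the tests run, patching airflow.operators.python.get_current_context directly with unittest.mock.patch and a return_value should achieve the same effect, since the function imports the name at call time.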

python unit-testing airflow

5
Score

1
Answers

2038
Views

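Pipeline can't find nodes in Kedro

I am following the pipeline tutorial. I created all the required files and started Kedro with kedro run --node=preprocessing_data, but got stuck on this error message: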

ValueError: Pipeline does not contain nodes named ['preprocessing_data'].

If I run kedro without the node parameter, I get

kedro.context.context.KedroContextError: Pipeline contains no nodes

File contents:

src/project/pipelines/data_engineering/nodes.py
def preprocess_data(data: SparkDataSet) -> None:
    print(data)
    return

src/project/pipelines/data_engineering/pipeline.py
def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=preprocess_data,
                inputs="data",
                outputs="preprocessed_data",
                name="preprocessing_data",
            ),
        ]
    )

src/project/pipeline.py
def create_pipelines(**kwargs) -> Dict[str, Pipeline]:
    de_pipeline = de.create_pipeline()
    return {
        "de": de_pipeline,
        "__default__": Pipeline([])
    }

3
Score

2
Answers

1152
Views

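Step Functions cannot trigger an ECS task on a Fargate cluster: permissions issue

I am creating and running a task on my ECS Fargate cluster.

The task definition (with roles) and the Fargate cluster have been created.

When I use the run task step in a step function, I get the following error: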

{
  "Error": "ECS.AccessDeniedException",
  "Cause": "User: arn:aws:sts::xxxxxxxxxx:assumed-role/StepFunctions-my-state-machine-role-xxxxxxxxxx/xxxxxxxxxx is not authorized to perform: iam:PassRole on resource: arn:aws:iam::xxxxxxxxxx:role/my-app-dev-exec because no identity-based policy allows the iam:PassRole action (Service: AmazonECS; Status Code: 400; Error Code: AccessDeniedException; Request ID: xxxxxxxxxx-xxxxxxxxxx-xxxxxxxxxx; Proxy: null)"
}

The role attached to the step function has the following policy (per the AWS documentation at https://docs.aws.amazon.com/step-functions/latest/dg/ecs-iam.html):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecs:RunTask"
            ],
            "Resource": [
                "arn:aws:ecs:eu-west-1:xxxxxxxxxx:task-definition/*:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecs:StopTask",
                "ecs:DescribeTasks"
            ],
            "Resource": [
                "arn:aws:ecs:eu-west-1:xxxxxxxxxx:task/*"
            ]
        }, …
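
The error above names the iam:PassRole action, which does not appear in the two statements visible in this excerpt (the rest of the policy is truncated, so it may or may not be granted further down). As a rough sketch only, an additional statement of the following shape could be built and merged into the policy; the helper pass_role_statement is hypothetical, and the role ARN is the masked one from the error message.

```python
import json


def pass_role_statement(role_arns):
    """Build an IAM statement that allows iam:PassRole on the given role ARNs."""
    return {
        "Effect": "Allow",
        "Action": ["iam:PassRole"],
        "Resource": list(role_arns),
    }


# ARN copied from the error message in the question (account id is masked there).
statement = pass_role_statement(["arn:aws:iam::xxxxxxxxxx:role/my-app-dev-exec"])
print(json.dumps(statement, indent=4))
```

If the task definition also uses a separate task role, its ARN would presumably need to be listed in Resource as well.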

amazon-web-services amazon-ecs aws-step-functions

3
Score

1
Answers

5992
Views

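Regex for 3 digits that does not accept 000

I want to write a regular expression in Google Forms.

The first character is a digit between 1 and 9, the second and third are uppercase letters, and the last three characters should be digits such as 541 or 001, but not 000.

This expression also accepts 000: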

[1-9][A-Z]{2}[0-9]{3}
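
Two ways of excluding 000 can be checked quickly in Python. The first adds a negative lookahead to the expression above; the second avoids lookaheads by requiring at least one non-zero digit. Whether the Google Forms validator accepts lookaheads is an assumption not verified here, which is why the second variant is included. The names WITH_LOOKAHEAD, WITHOUT_LOOKAHEAD, and is_valid are illustrative.

```python
import re

# Variant 1: the negative lookahead rejects "000" before the three digits are consumed.
WITH_LOOKAHEAD = re.compile(r"[1-9][A-Z]{2}(?!000)[0-9]{3}")

# Variant 2: no lookahead; at least one of the three digits must be non-zero.
WITHOUT_LOOKAHEAD = re.compile(
    r"[1-9][A-Z]{2}(?:[1-9][0-9]{2}|[0-9][1-9][0-9]|[0-9]{2}[1-9])"
)


def is_valid(code: str, pattern=WITH_LOOKAHEAD) -> bool:
    """Return True only if the whole string matches the pattern."""
    return pattern.fullmatch(code) is not None
```

Both variants accept 1AB541 and 9ZZ001 and reject 1AB000.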

regex google-forms

2
Score

1
Answers

2123
Views

Airflow SLA is not enforced

I am new to Airflow and am trying to implement the SLA miss feature in my DAG.

from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # Written as 7, not 07: a leading-zero integer literal is a SyntaxError in Python 3.
    'start_date': datetime(2017, 7, 24),
    'email': ['jspsai@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5),
    'sla': timedelta(seconds=30),
    'pool': 'demo',
    'queue': 'slaq',
    'run_as_user': 'ec2-user'
}

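But this SLA is not being applied, and I can't figure out what the problem is.

I also set the SLA at the task level, with no luck.

Any help is greatly appreciated.

Thanks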

2
Score

1
Answers

2420
Views

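How do I add many CSV files to Kedro's catalog?

I have hundreds of CSV files to process in the same way. For simplicity, assume they are all in ./data/01_raw/ (e.g. ./data/01_raw/1.csv, ./data/01_raw/2.csv, and so on). I'd rather not give each file a different name and track them individually when building my pipeline. Is there a way to read all the files in bulk by specifying something in the catalog.yml file?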

2
Score

1
Answers

344
Views

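pytest mocker fixture: mock a function where it is defined rather than where it is used

I have some utility functions in src/utils/helper.py.

Suppose I have a function func_a in utils/helper.py that is used in several places in my project.

Every time I use it, I import it like this: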

from src.utils.helper import func_a

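Now I want to mock func_a in my tests.

I want to create a fixture in conftest.py so that I don't have to write the mock again and again in every test file.

The problem is that in my mock I cannot write this: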

https://pypi.org/project/pytest-mock/

mocker.patch('src.utils.helper.func_a', return_value="some_value", autospec=True)

I have to write this for every test file:

mocker.patch('src.pipeline.node_1.func_a', return_value="some_value", autospec=True)

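According to the documentation at https://docs.python.org/3/library/unittest.mock.html#where-to-patch

because I import func_a with from src.utils.helper import func_a, I have to mock it where it is used rather than where it is defined.

But the problem with this approach is that I cannot define it in a fixture in conftest.py.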
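
The "where to patch" behaviour, and one way to centralise the patching, can be illustrated with in-memory stand-ins for the modules. Everything below is a simplified sketch: helper, node_1, and node_2 are fake modules built with types.ModuleType rather than the real src package, and patch_func_a_everywhere is a hypothetical helper that patches the name in every consumer module in one place.

```python
import types
from contextlib import ExitStack
from unittest import mock

# Stand-in for src/utils/helper.py, built in memory so the sketch runs on its own.
helper = types.ModuleType("helper")
helper.func_a = lambda: "real_value"


def make_node(name):
    """Create a fake consumer module that did `from helper import func_a`."""
    node = types.ModuleType(name)
    node.func_a = helper.func_a        # the from-import copies the binding
    node.run = lambda: node.func_a()   # looks the name up in the node's own namespace
    return node


node_1, node_2 = make_node("node_1"), make_node("node_2")
CONSUMERS = [node_1, node_2]


def patch_func_a_everywhere(stack, return_value="some_value"):
    """Patch func_a in every consumer module; undone when the ExitStack closes."""
    for module in CONSUMERS:
        stack.enter_context(
            mock.patch.object(module, "func_a", return_value=return_value)
        )


# Patching the definition site does not affect modules that already imported the name.
with mock.patch.object(helper, "func_a", return_value="some_value"):
    unaffected = node_1.run()

# Patching each consumer in one loop does.
with ExitStack() as stack:
    patch_func_a_everywhere(stack)
    patched = [node.run() for node in CONSUMERS]
```

In a real conftest.py the same loop could presumably live in a single fixture that calls mocker.patch once per target string such as src.pipeline.node_1.func_a, so the list of consumers is maintained in one place instead of in every test file.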

Directory structure:

├── src
│   ├── pipeline
│   │   ├── __init__.py
│   │   ├── node_1.py
│   │   ├── node_2.py
│   │   └── node_3.py
│   └── utils
│       ├── __init__.py
│       └── helper.py
└── tests
    ├── __init__.py
    ├── conftest.py
    └── pipeline
        ├── …

python unit-testing pytest pytest-mock

1
Score

1
Answers

2697
Views

Tag statistics

unit-testing ×2

amazon-web-services ×1

aws-step-functions ×1

git ×1

github-actions ×1

google-forms ×1

pytest-mock ×1

yaml ×1