I am trying to correctly set up a Vertex AI pipeline that does the following:
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import dsl
from kfp.v2.dsl import component
from kfp.v2.dsl import (
    Output,
    Artifact,
    Model,
)
PROJECT_ID = 'my-gcp-project'
BUCKET_NAME = "mybucket"
PIPELINE_ROOT = "gs://{}/pipeline_root".format(BUCKET_NAME)  # the pipeline root must be a Cloud Storage URI
@component
def get_input_data() -> str:
    # getting data from API, save to Cloud Storage
    # return GS URI
    gcs_batch_input_path = 'gs://somebucket/file'
    return gcs_batch_input_path
@component(
    base_image="python:3.9",
    packages_to_install=['google-cloud-aiplatform==1.8.0']
)
def load_ml_model(project_id: str, …

I have a Scrapy (version 1.0.3) spider in which I extract some data from a web page and also download files, like this (simplified):
def extract_data(self, response):
    title = response.xpath('//html/head/title/text()').extract()[0].strip()
    my_item = MyItem()
    my_item['title'] = title
    file_url = response.xpath('...get url of file...')
    file_urls = [file_url]  # here there can be more urls, so I'm storing like a list
    fi = FileItem()
    fi['file_urls'] = file_urls
    yield my_item
    yield fi
In pipelines.py I just override FilesPipeline to change the names of the files:
from scrapy.pipelines.files import FilesPipeline

class CustomFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        filename = format_filename(request.url)
        return filename
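The question does not show format_filename, so here is a minimal sketch of what such a helper might look like, using only the standard library. The body, the `default` parameter, and the choice of allowed characters are all assumptions made for illustration, not the asker's actual implementation.

```python
import posixpath
from urllib.parse import unquote, urlparse


def format_filename(url: str, default: str = "download") -> str:
    """Hypothetical helper: derive a storage file name from a download URL."""
    # Keep only the path component, dropping the query string and fragment.
    path = urlparse(url).path
    # Take the last path segment and decode percent-escapes such as %20.
    name = unquote(posixpath.basename(path))
    # Replace anything that is not alphanumeric, '.', '_' or '-' with '_'.
    cleaned = "".join(c if c.isalnum() or c in "._-" else "_" for c in name)
    # Drop leading dots so the result is never '.', '..' or a hidden file.
    cleaned = cleaned.lstrip(".")
    # Fall back to a fixed name when the URL has no usable last segment.
    return cleaned or default
```

Whatever file_path returns is treated as a path relative to the configured files store, so a helper like this should always return a plain relative name.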
In items.py I have:
import scrapy

class MyItem(scrapy.Item):
    title = scrapy.Field()

class FileItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
In settings.py I have:
ITEM_PIPELINES = { …