我正在尝试部署 SageMaker 端点,但它无限期地陷入“创建”阶段。下面是我的 Dockerfile 和训练/服务脚本。该模型训练没有任何问题。只有端点部署会陷入“创建”阶段。
下面是文件夹结构
文件夹结构
|_code
|_train_serve.py
|_Dockerfile
Run Code Online (Sandbox Code Playgroud)
下面是 Dockerfile
Dockerfile
# ##########################################################
# Adapt your container (to work with SageMaker)
# # https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
# # https://hub.docker.com/r/huanjason/scikit-learn/dockerfile
ARG REGION=us-east-1
FROM python:3.7
RUN apt-get update && apt-get -y install gcc
RUN pip3 install \
# numpy==1.16.2 \
numpy \
# scikit-learn==0.20.2 \
scikit-learn \
pandas \
# scipy==1.2.1 \
scipy \
mlflow
RUN rm -rf /root/.cache
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
# Install sagemaker-training toolkit to enable SageMaker Python SDK …Run Code Online (Sandbox Code Playgroud) 我无法弄清楚我的 CTAS 查询出了什么问题,即使我没有提到任何分桶列,它也会在存储在分区内时将数据分解成更小的文件。有没有办法避免这些小文件并将每个分区存储为一个文件,因为小于 128 MB 的文件会导致额外的开销?
CREATE TABLE sampledb.yellow_trip_data_parquet
WITH(
format = 'PARQUET'
parquet_compression = 'GZIP',
external_location='s3://mybucket/Athena/tables/parquet/'
partitioned_by=ARRAY['year','month']
)
AS SELECT
VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%Y') AS year,
date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;
Run Code Online (Sandbox Code Playgroud)

如何在 Postgres 的文本列上添加唯一约束(忽略特殊字符)?
CREATE TABLE my_table(
SomeTextColumn citext
CONSTRAINT person_u_1 UNIQUE (SomeTextColumn)
);
Run Code Online (Sandbox Code Playgroud)
在上表中,我尝试添加一个唯一约束,该约束将通过忽略传入数据中的特殊字符来寻找唯一性
For example:
1. HelloWorld --> Gets inserted successfully
2. Hello World --> Should fail with duplicate constraint
2. Hello%$^&*W^%orld --> Should fail with duplicate constraint
Run Code Online (Sandbox Code Playgroud) 为什么Python中的并行处理比串行处理慢?
#!/usr/bin/env python3
import os
import time
from functools import partial
import multiprocessing as mp
def print_hello(i, typ):
print(typ + " : PID-" + str(os.getpid()) + "\n")
# print("Hello World " + str(i))
if __name__ == '__main__':
N= mp.cpu_count()
parallel_start_time = time.time()
with mp.Pool(processes = N) as p:
p.map(partial(print_hello,typ="Parallel"), (x for x in range(100)))
parallel_end_time = time.time()
serial_start_time = time.time()
for x in range(100):
print_hello(x, "Serial")
serial_end_time = time.time()
print("Parallel processing took " + str(parallel_end_time - parallel_start_time) + " seconds") …Run Code Online (Sandbox Code Playgroud)