Amazon SageMaker ResourceLimitExceeded error for XGBoost (free tier)

Question


Fat*_*ici 9 python amazon-web-services boto3 amazon-sagemaker

I'm trying to create an XGBoost model in AWS SageMaker on the free tier, and I get the following error:

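"ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m5.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances."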
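
The numbers in this message already explain the failure: a request is accepted only if current utilization plus the requested delta stays within the limit, and with a limit of 0 even a single instance is rejected (0 + 1 > 0). The sketch below is an illustration, not an AWS tool: it parses those numbers out of the message. The regular expression and the `LimitInfo` / `parse_limit_message` names are assumptions, built only from the message wording quoted in this question.

```python
import re
from typing import NamedTuple, Optional


class LimitInfo(NamedTuple):
    quota_name: str
    limit: int
    utilization: int
    delta: int

    def would_exceed(self) -> bool:
        # The request fails when current usage plus the new instances exceeds the limit.
        return self.utilization + self.delta > self.limit


# Pattern assumed from the English error text quoted in this question.
_LIMIT_RE = re.compile(
    r"service limit '(?P<name>[^']+)' is (?P<limit>\d+) Instances, "
    r"with current utilization of (?P<util>\d+) Instances "
    r"and a request delta of (?P<delta>\d+) Instances"
)


def parse_limit_message(message: str) -> Optional[LimitInfo]:
    """Extract the quota name and counts from a ResourceLimitExceeded message."""
    match = _LIMIT_RE.search(message)
    if match is None:
        return None
    return LimitInfo(
        quota_name=match.group("name"),
        limit=int(match.group("limit")),
        utilization=int(match.group("util")),
        delta=int(match.group("delta")),
    )


msg = (
    "An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint "
    "operation: The account-level service limit 'ml.m5.xlarge for endpoint usage' "
    "is 0 Instances, with current utilization of 0 Instances and a request delta "
    "of 1 Instances."
)
info = parse_limit_message(msg)
print(info.would_exceed())  # True: 0 + 1 > 0
```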

What is the correct train_instance_type to use?

Here is my code:

```python
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.display import display
from time import gmtime, strftime
from sagemaker.predictor import csv_serializer

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}  # each region has its XGBoost container
my_region = boto3.session.Session().region_name  # set the region of the instance

# Create an instance of the XGBoost model (an estimator), and define the model's hyperparameters.
# Note: train_instance_type='ml.m5.large' has 0 free credits! Use one of https://aws.amazon.com/sagemaker/pricing/
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(
    containers[my_region],
    role,
    train_instance_count=1,
    train_instance_type='ml.m5.xlarge',
    output_path='s3://{}/{}/output'.format('my_s3_bucket', prefix),
    sagemaker_session=sess,
)
xgb.set_hyperparameters(max_depth=1, eta=0.2, gamma=4, min_child_weight=6, subsample=0.8,
                        silent=0, objective='binary:logistic', num_round=100)
# Train the model using gradient optimization on the ml.m5.xlarge instance configured above.
# After a few minutes, you should start to see the training logs being generated.
# s3_input_train is defined in an earlier notebook cell (not shown here).
xgb.fit({'train': s3_input_train})
```

At this step, I see:

```
2019-10-22 06:32:51 Starting - Starting the training job...
2019-10-22 06:33:00 Starting - Launching requested ML instances......
2019-10-22 06:33:54 Starting - Preparing the instances for training...
2019-10-22 06:34:41 Downloading - Downloading input data...
2019-10-22 06:35:22 Training - Training image download completed. Training in progress..Arguments: train
[2019-10-22:06:35:22:INFO] Running standalone xgboost training.
[2019-10-22:06:35:22:INFO] Path /opt/ml/input/data/validation does not exist!
[2019-10-22:06:35:22:INFO] File size need to be processed in the node: 3.38mb. Available memory size in the node: 8089.9mb
[2019-10-22:06:35:22:INFO] Determined delimiter of CSV input is ','
[06:35:22] S3DistributionType set as FullyReplicated
[06:35:22] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[0]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[1]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[2]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[3]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[4]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[5]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[6]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[7]#011train-error:0.10839
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[8]#011train-error:0.102737
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[9]#011train-error:0.107697
```
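
This log shows that the training job itself started and ran on ml.m5.xlarge: the data was loaded and boosting rounds were being logged, so training is not the step that fails. If you want to pull the per-round metric out of such a log, here is a small sketch. It assumes the `[round]#011train-error:value` line shape shown above, where `#011` appears to be an escaped tab character (octal 011); the `parse_train_error` name is hypothetical.

```python
import re
from typing import Dict, Iterable

# Matches lines such as "[7]#011train-error:0.10839". "#011" is assumed to be an
# escaped tab, so a literal tab between the round and the metric is accepted too.
_ROUND_RE = re.compile(r"^\[(\d+)\](?:#011|\t)train-error:([0-9.]+)$")


def parse_train_error(lines: Iterable[str]) -> Dict[int, float]:
    """Return {round: train-error} for every matching log line; other lines are skipped."""
    errors: Dict[int, float] = {}
    for line in lines:
        match = _ROUND_RE.match(line.strip())
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    return errors


log = """[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[0]#011train-error:0.102182
[7]#011train-error:0.10839
[9]#011train-error:0.107697"""

errors = parse_train_error(log.splitlines())
print(errors)  # {0: 0.102182, 7: 0.10839, 9: 0.107697}
```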

Then, when I deploy it:

```python
# Deploy the model on a server and create an endpoint that you can access
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
```

```
---------------------------------------------------------------------------
ResourceLimitExceeded                     Traceback (most recent call last)
<ipython-input-38-6d149f3edc98> in <module>()
      1 # Deploy the model on a server and create an endpoint that you can access
----> 2 xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, model_name, kms_key, **kwargs)
    559             tags=self.tags,
    560             wait=wait,
--> 561             kms_key=kms_key,
    562         )
    563

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait)
    464         else:
    465             self.sagemaker_session.endpoint_from_production_variants(
--> 466                 self.endpoint_name, [production_variant], tags, kms_key, wait
    467             )
    468

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait)
   1361
   1362             self.sagemaker_client.create_endpoint_config(**config_options)
-> 1363         return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   1364
   1365     def expand_role(self, role):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
    975
    976         self.sagemaker_client.create_endpoint(
--> 977             EndpointName=endpoint_name, EndpointConfigName=config_name, Tags=tags
    978         )
    979         if wait:

~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    355                     "%s() only accepts keyword arguments." % py_operation_name)
    356             # The "self" in this scope is referring to the BaseClient.
--> 357             return self._make_api_call(operation_name, kwargs)
    358
    359         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    659             error_code = parsed_response.get("Error", {}).get("Code")
    660             error_class = self.exceptions.from_code(error_code)
--> 661             raise error_class(parsed_response, operation_name)
    662         else:
    663             return parsed_response

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m5.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
```
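
The traceback shows where the failure happens: in `CreateEndpoint`, called from `xgb.deploy()`, after training has already succeeded. So the setting that is checked against the endpoint quota is the `instance_type` passed to `deploy()`, not `train_instance_type`. As the last frame shows, botocore raises an exception built from the parsed response; its modeled exceptions subclass `botocore.exceptions.ClientError`, which keeps that response in `.response`. A minimal sketch that recognises this specific error; `is_resource_limit_error` and `deploy_or_explain` are hypothetical helper names, not part of the SageMaker SDK.

```python
from typing import Any


def is_resource_limit_error(exc: BaseException) -> bool:
    """True if exc carries a botocore-style response whose error code is ResourceLimitExceeded."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == "ResourceLimitExceeded"


def deploy_or_explain(estimator: Any, instance_type: str) -> Any:
    """Hypothetical wrapper around estimator.deploy() that turns the quota error into a clearer message."""
    from botocore.exceptions import ClientError  # imported lazily; requires botocore at call time

    try:
        return estimator.deploy(initial_instance_count=1, instance_type=instance_type)
    except ClientError as exc:
        if is_resource_limit_error(exc):
            raise RuntimeError(
                f"The endpoint quota for {instance_type} is too low for this account; "
                "request an increase for this limit."
            ) from exc
        raise
```

Only `is_resource_limit_error` is pure; `deploy_or_explain` needs a configured estimator and AWS credentials.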

Edit: tried an ml.m4.xlarge instance:

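When I use ml.m4.xlarge, I get the same message: "ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m4.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit."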

Answer 1

Sai*_*ibō 11

Steps to request an increase of the ml.m5.xlarge limit:

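Go to the AWS console at https://console.aws.amazon.com/
Click Support in the top-right corner.
Click Create case (the orange button).
Select the Service limit increase radio button.
For Limit type, search for and select SageMaker Notebook Instances.
Select the same Region as the one shown in the top-right corner of the AWS console.
Write a short description of your use case.
For Limit, choose ml.[x].[x] (in your case, ml.m5.xlarge).
New limit value: 1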
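
On current AWS accounts, the same request can usually also be made through the Service Quotas API instead of a console support case. This is an alternative to the steps above, not what this answer's author did, and whether a given SageMaker quota is adjustable there may depend on the account and Region. A minimal sketch with boto3's `service-quotas` client, assuming the quota name follows the wording of the error message ('ml.m5.xlarge for endpoint usage'); the quota code is looked up by name rather than hard-coded, and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional


def find_quota_code(quotas: Iterable[Dict[str, Any]], quota_name: str) -> Optional[str]:
    """Return the QuotaCode of the quota whose QuotaName matches exactly, or None."""
    for quota in quotas:
        if quota.get("QuotaName") == quota_name:
            return quota.get("QuotaCode")
    return None


def list_sagemaker_quotas(client: Any) -> List[Dict[str, Any]]:
    """Collect all applied SageMaker quotas, following NextToken pagination."""
    quotas: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {"ServiceCode": "sagemaker"}
    while True:
        page = client.list_service_quotas(**kwargs)
        quotas.extend(page.get("Quotas", []))
        token = page.get("NextToken")
        if not token:
            return quotas
        kwargs["NextToken"] = token


def request_endpoint_quota(instance_type: str, desired: float = 1.0,
                           region: Optional[str] = None) -> Dict[str, Any]:
    """Ask Service Quotas to raise '<instance_type> for endpoint usage' (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("service-quotas", region_name=region)
    name = f"{instance_type} for endpoint usage"
    code = find_quota_code(list_sagemaker_quotas(client), name)
    if code is None:
        raise LookupError(f"No quota named {name!r} found")
    return client.request_service_quota_increase(
        ServiceCode="sagemaker", QuotaCode=code, DesiredValue=desired
    )
```

For example, `request_endpoint_quota("ml.m5.xlarge", region="us-east-1")` would submit a request for one instance; as with the support case, approval is not immediate.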

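This manual support ticket can take up to 48 hours to be processed. (In my case, the support team replied after one day and the instance limit was changed to 1.)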

Answer 2

Tom*_*mmy 3

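According to this AWS page, for the first two months you get 50 hours per month of m4.xlarge for training and 125 hours per month of m4.xlarge for hosting. So if you are within your first two months, ml.m4.xlarge should do the trick.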
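
To put the hosting allowance in perspective: a real-time endpoint is billed for every hour its instances are running, whether or not it receives requests, so 125 hours covers only about five days of one continuously running endpoint. A small worked sketch of that arithmetic; the 125-hour figure is the one quoted in this answer and free-tier terms may have changed since, and `days_of_continuous_use` is a hypothetical name.

```python
def days_of_continuous_use(free_hours: float, instance_count: int = 1) -> float:
    """Days a free-tier hour allowance lasts if the instances run 24 hours a day."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return free_hours / (24 * instance_count)


HOSTING_HOURS = 125  # m4.xlarge hosting hours per month, as quoted in this answer

print(round(days_of_continuous_use(HOSTING_HOURS), 2))  # 5.21
```

In other words, deleting the endpoint when you are done testing matters as much as the quota itself.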

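As for the service limit itself, according to this post, newly created accounts have the limit for every SageMaker instance type (except ml.t2.medium) set to 0 instead of the default limit.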
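
To check which endpoint instance types are at 0 on your own account, a minimal sketch using the Service Quotas API. It assumes that API exposes your SageMaker quotas (it may not have when this answer was written) and that quota names follow the 'ml.m5.xlarge for endpoint usage' pattern from the error message; `zero_endpoint_quotas` and `fetch_sagemaker_quotas` are hypothetical names.

```python
from typing import Any, Dict, Iterable, List, Optional


def zero_endpoint_quotas(quotas: Iterable[Dict[str, Any]]) -> List[str]:
    """Sorted names of '... for endpoint usage' quotas whose applied value is 0."""
    return sorted(
        q["QuotaName"]
        for q in quotas
        if q.get("QuotaName", "").endswith(" for endpoint usage") and q.get("Value") == 0
    )


def fetch_sagemaker_quotas(region: Optional[str] = None) -> List[Dict[str, Any]]:
    """Applied SageMaker quotas for the current account (needs boto3 and credentials)."""
    import boto3  # lazy import so zero_endpoint_quotas() is usable offline

    client = boto3.client("service-quotas", region_name=region)
    paginator = client.get_paginator("list_service_quotas")
    quotas: List[Dict[str, Any]] = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas
```

With credentials configured, `print(zero_endpoint_quotas(fetch_sagemaker_quotas()))` lists the endpoint instance types you cannot deploy to until the limit is raised.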

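So you will need to contact AWS Support after all and ask for your limit to be increased. Also, if you are not an administrator yourself, this may be restricted by your account administrator, in which case they should be your first port of call.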

Archived: 6 years, 4 months ago
Views: 19122
Last recorded: 2 years, 3 months ago