Chr*_*s B 5 python amazon-web-services tensorflow tensorboard amazon-sagemaker
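I am training a model with TensorFlow on Amazon SageMaker, and I want to monitor training progress while the job is running. However, no TensorBoard files are written to S3 during training; the files are uploaded only once the training job has finished. After training completes, I can download the files and see that TensorBoard recorded values correctly throughout training, even though S3 was updated only once, at the end.
Why doesn't SageMaker upload the TensorBoard data to S3 throughout training?
Here is the code I use to launch the training job from a notebook on SageMaker: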
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig
import time
bucket = 'my-bucket'
output_prefix = 'training-jobs'
model_name = 'my-model'
dataset_name = 'my-dataset'
dataset_path = f's3://{bucket}/datasets/{dataset_name}'
output_path = f's3://{bucket}/{output_prefix}'
job_name = f'{model_name}-{dataset_name}-training-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
s3_checkpoint_path = f"{output_path}/{job_name}/checkpoints" # Checkpoints are updated live as expected
s3_tensorboard_path = f"{output_path}/{job_name}/tensorboard" # Tensorboard data isn't appearing here until the training job has completed
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=s3_tensorboard_path,
    container_local_output_path='/opt/ml/output/tensorboard'  # I have confirmed this is the unaltered path being provided to tf.summary.create_file_writer()
)
role = sagemaker.get_execution_role()
estimator = TensorFlow(entry_point='main.py', source_dir='./', role=role, max_run=60*60*24*5,
                       output_path=output_path,
                       checkpoint_s3_uri=s3_checkpoint_path,
                       tensorboard_output_config=tensorboard_output_config,
                       instance_count=1, instance_type='ml.g4dn.xlarge',
                       framework_version='2.3.1', py_version='py37', script_mode=True)
estimator.fit({'train': dataset_path}, wait=True, job_name=job_name)
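One way to check whether anything has reached the TensorBoard prefix while the job is still running is to list the objects under `s3_tensorboard_path`. The following is only a minimal sketch, assuming `boto3` is installed and the caller's credentials can read the bucket; `split_s3_uri` and `list_s3_objects` are hypothetical helper names introduced for illustration, not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_s3_objects(s3_uri):
    """Return (key, last_modified, size) for every object under the given S3 prefix."""
    import boto3  # imported lazily so the URI helper works without boto3 installed

    bucket, prefix = split_s3_uri(s3_uri)
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            found.append((obj["Key"], obj["LastModified"], obj["Size"]))
    return found


# Example (requires AWS credentials), using the path defined in the launch code:
# for key, modified, size in list_s3_objects(s3_tensorboard_path):
#     print(modified, size, key)
```

Because `fit(..., wait=True)` blocks the notebook cell until the job ends, a check like this has to run from a separate notebook or terminal, or the job has to be started with `wait=False`. An empty result during training, followed by event files appearing only after the job completes, would match the behaviour described above.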