Posts by ree*_*106

Permission denied on gcloud components update

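All of a sudden, I started getting a "Permission denied" problem when I try to run any gcloud command, such as gcloud components update. The problem is avoided if I run sudo gcloud components update, but it is not clear to me why sudo is suddenly required. I had actually been trying to run a GCMLE experiment and got the same error/warning, so I tried updating the components and still ran into this problem. I have been traveling for a few days and have not changed anything since these same commands worked a few days ago. I also have not changed my OS (Mac High Sierra 10.13.3). Has anything changed on Google's side that would explain this change in behavior? What is the best practice for permanently resolving this warning?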

(conda-env) MacBook-Pro:user$ gcloud components update
WARNING: Could not setup log file in /Users/$USERNAME/.config/gcloud/logs, (IOError: [Errno 13] Permission denied: u'/Users/$USERNAME/.config/gcloud/logs/2018.03.10/XX.XX.XX.XXXXXX.log')

After sudo gcloud components update I was able to launch the GCMLE experiment, but I still get the same warning (although my job is now submitted successfully).

WARNING: Could not setup log file in /Users/$USERNAME/.config/gcloud/logs, (IOError: [Errno 13] Permission denied: u'/Users/$USERNAME/.config/gcloud/logs/2018.03.10/XX.XX.XX.XXXXXX.log')
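
The warning says that a file under the gcloud config directory cannot be written by the current user. A minimal sketch for listing such paths follows. The helper name `find_unwritable` is hypothetical, the directory is taken from the warning text, and the assumption that the cause is file ownership or permissions (for example after a command was run with sudo) is not confirmed by the post.

```python
import os


def find_unwritable(root):
    """Return paths under root that the current user cannot write to."""
    blocked = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then every file inside it.
        if not os.access(dirpath, os.W_OK):
            blocked.append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.W_OK):
                blocked.append(path)
    return blocked


if __name__ == "__main__":
    # Directory taken from the warning message above.
    config_dir = os.path.expanduser("~/.config/gcloud")
    for path in find_unwritable(config_dir):
        print(path)
```

If this prints any paths, checking their owner would be a reasonable next step.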

google-cloud-platform gcloud google-cloud-ml

ree*_*106

lucky-day

11
Votes

2
Answers

4467
Views

Upgrading to tf.dataset not working properly when parsing csv

I have a GCMLE experiment and I am trying to upgrade my input_fn to use the new tf.data functionality. I created the following input_fn based on this example:

def input_fn(...):
    dataset = tf.data.Dataset.list_files(filenames).shuffle(num_shards) # shuffle up the list of input files
    dataset = dataset.interleave(lambda filename: # mix together records from cycle_length number of shards
                tf.data.TextLineDataset(filename).skip(1).map(lambda row: parse_csv(row, hparams)), cycle_length=5) 
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    features = iterator.get_next()

    labels = features.pop(LABEL_COLUMN)

    return features, labels

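My parse_csv is the same as what I was using before, but it does not currently work. I can fix some of the issues, but I do not fully understand why I am running into them. Here is the start of my parse_csv() function: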

def parse_csv(..):
    columns = tf.decode_csv(rows, record_defaults=CSV_COLUMN_DEFAULTS)
    raw_features = dict(zip(FIELDNAMES, columns))

    words = …

tensorflow google-cloud-ml tensorflow-datasets

ree*_*106

2018 05-09

8
Votes

1
Answers

2689
Views

tensorflow Found more than one graph event per run

I am loading TensorBoard for an ML Engine experiment that is running in local mode and I get the following warning:

"Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
W0825 19:26:12.435613 Reloader event_accumulator.py:311] Found more than one metagraph event per run. Overwriting the metagraph with the newest event."

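Initially, I suspected this was because I had not cleared out my --logdir=$OUTPUT_PATH (as other posts suggest). However, even after running rm -rf $OUTPUT_PATH/* I still get this error on a local train. Does this error indicate a bigger problem in my graph?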
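
TensorBoard treats each directory that contains event files as one run. One possible check, assuming the warning comes from several event files (each carrying its own graph) landing in the same run directory, is to count the event files per directory. The post does not confirm that cause, and the helper name `event_files_by_run` is hypothetical.

```python
import os
from collections import defaultdict


def event_files_by_run(logdir):
    """Map each directory under logdir to the event files it contains."""
    runs = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(logdir):
        for name in sorted(filenames):
            # Event files carry "tfevents" in their file name.
            if "tfevents" in name:
                runs[dirpath].append(name)
    return dict(runs)


if __name__ == "__main__":
    # Falls back to the current directory if OUTPUT_PATH is not set.
    for run_dir, files in event_files_by_run(os.environ.get("OUTPUT_PATH", ".")).items():
        print(run_dir, len(files))
```

A directory that reports more than one file after a fresh local run would be a candidate source of the warning.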

google-cloud-platform tensorflow tensorboard google-cloud-ml-engine

ree*_*106

lucky-day

7
Votes

1
Answers

8989
Views

Preemption OS Error when trying to run a distributed GCMLE job

I am trying to run a distributed GCMLE training job, but I keep getting the following error:

An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error

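The trainer package is a custom estimator modeled the same way as the cloudml-samples census custom estimator: https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/customestimator/trainer. It is safe to say that the task.py file is identical and that the input_fn() and parse_csv() functions in model.py are the same; the only differences are in the details of my model_fn().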

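If I configure the model to run with a single standard_p100 GPU, it trains at about 15 steps/sec. However, if I update the configuration to a distributed setting with 4 workers and 3 parameter servers (see the config below), the preemption error pops up and 10 steps take about 600 seconds...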

config-distributed.yaml:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_p100
  workerType: standard_p100
  parameterServerType: large_model
  workerCount: 3
  parameterServerCount: 3

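If I use the same configuration with the census custom estimator sample, the model trains faster, as expected, and does not run into the preemption error. I tried modifying the census example to mimic my exact code more closely, but I still could not reproduce the error.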

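Has anyone run into similar preemption issues when trying to train a distributed ML Engine job? Any suggestions on how to debug the problem better? The only advice I found online was that the number of parameter servers should be at least half the number of workers (which is why I raised the parameter servers to 3), but I still have had no luck.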

To add more context from the logs, here is the typical (repeated) pattern that occurs when I try to train in the distributed setting:

master-replica-0 loss = 16.5019, step = 53 …

google-cloud-platform tensorflow google-cloud-ml

ree*_*106

2018 10-14

7
Votes

1
Answers

829
Views

apache_beam.transforms.util.Reshuffle() not available for GCP Dataflow

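I have upgraded to the latest apache_beam[gcp] package via pip install --upgrade apache_beam[gcp]. However, I noticed that Reshuffle() does not appear in the [gcp] distribution. Does this mean that I will not be able to use Reshuffle() in any Dataflow pipeline? Is there a way around this? Or is it possible that the pip package is just not up to date, and that if Reshuffle() is in master on GitHub, it will be available on Dataflow?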

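Based on the answer to this question, I am trying to read data from BigQuery and then randomize it before writing it out as CSVs to a GCP storage bucket. I have noticed that the .csv shards I use to train my GCMLE model are not truly random. In TensorFlow I can randomize the batches, but that only shuffles the rows within each file that is built up in the queue, and my problem is that the files currently being generated are biased in some way. Any suggestions for other ways to shuffle right before writing out to CSV in Dataflow would be much appreciated.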
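
One general way to randomize rows without a dedicated reshuffle step is to attach a random key to each row, reorder by that key, and then drop the key. The sketch below shows only that idea in plain Python; it is not Beam code, the helper name `shuffle_by_random_key` is hypothetical, and sorting here stands in for whatever regrouping the pipeline would perform on the keys.

```python
import random


def shuffle_by_random_key(rows, seed=None):
    """Return the rows in random order by sorting on a random key."""
    rng = random.Random(seed)
    # Pair every row with an independent random key.
    keyed = [(rng.random(), row) for row in rows]
    # Reordering by key mixes rows regardless of their original position.
    keyed.sort(key=lambda pair: pair[0])
    # Drop the key again so only the original rows remain.
    return [row for _key, row in keyed]


if __name__ == "__main__":
    print(shuffle_by_random_key(["row-%d" % i for i in range(5)], seed=42))
```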

python google-cloud-platform google-cloud-dataflow apache-beam

ree*_*106

2018 02-03

6
Votes

1
Answers

522
Views

How to interpret the tf.layers.dropout training arg

The documentation for the training arg of tf.layers.dropout() is not clear to me.

The documentation states:

training: Either a Python boolean, or a TensorFlow boolean scalar tensor
      (e.g. a placeholder). Whether to return the output in training mode
      (apply dropout) or in inference mode (return the input untouched).

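My interpretation is that dropout is applied depending on whether training = True or training = False. However, it is not clear to me whether True or False applies dropout (i.e. which one is training mode). Given that this is an optional argument, I expected tf.layers.dropout() to apply by default, but the default is False, and intuitively training=False suggests that the default is not to train.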

It appears that for tf.layers.dropout() to actually apply, something like the following is needed:

tf.layers.dropout(input, 0.5, training = mode == Modes.TRAIN)

This was not obvious to me from the documentation, since training is an optional argument.

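Does this appear to be the correct implementation of tf.layers.dropout? Why isn't the training flag automatically tied to Modes.TRAIN as the default and then adjusted for other cases? The default training=False seems very misleading.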
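
The semantics described in the quoted documentation can be illustrated in plain Python. This is only a sketch of the documented behavior (input returned untouched when training is false, units dropped and the kept ones scaled by 1 / (1 - rate) when it is true), not TensorFlow's implementation. The function name and the `seed` parameter are hypothetical, and it assumes 0 <= rate < 1.

```python
import random


def dropout(values, rate=0.5, training=False, seed=None):
    """Illustrative dropout over a list of floats."""
    if not training:
        # Inference mode: return the input untouched.
        return list(values)
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - rate)
    # Training mode: zero each unit with probability `rate`, scale the rest.
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]


if __name__ == "__main__":
    activations = [1.0, 2.0, 3.0, 4.0]
    print(dropout(activations, training=False))
    print(dropout(activations, training=True, seed=0))
```

With the default training=False the call is a pass-through, which matches the observation above that dropout only takes effect when the flag is explicitly set for training.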

python tensorflow

ree*_*106

lucky-day

6
Votes

1
Answers

1027
Views

Google Cloud ML Engine GPU utilization

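If I use --scale-tier BASIC_GPU in a Google Cloud ML Engine job, how can I view GPU utilization? I can view CPU utilization and memory utilization on the "Job Details" tab, but I would like to know how much the GPU is being utilized. Is this just included in the CPU utilization, or is there another tab where I can view GPU utilization?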

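Additionally, is there any way to see which ops take up most of the CPU utilization? My CPU utilization is very high, my memory usage is very low, and my input producer is always full (100%), so I am trying to better understand where the time is spent in order to optimize my model's performance.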

google-cloud-platform tensorflow google-cloud-ml-engine

ree*_*106

lucky-day

5
Votes

1
Answers

1004
Views

Plot tf.metrics.precision_at_thresholds in Tensorboard through eval_metric_ops

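tf.metrics.precision_at_thresholds() takes three arguments: labels, predictions, thresholds, where thresholds is a Python list or tuple of thresholds in [0,1]. The function then returns "a float tensor of shape [len(thresholds)]", which is problematic for automatically plotting eval_metric_ops to TensorBoard (as I believe they are expected to be scalars). The values print to the console just fine, but I would also like to plot them in TensorBoard. Is there any adjustment that can be made so that these values can be plotted in TensorBoard?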
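
One possible direction is to report one scalar per threshold under its own name instead of a single vector. The plain-Python sketch below shows only the arithmetic and the naming idea. It is not the TensorFlow API, both helper names and the `precision_at_` prefix are hypothetical, and it assumes a prediction counts as positive when it is strictly greater than the threshold.

```python
def precision_at_thresholds(labels, predictions, thresholds):
    """Return one precision value per threshold (0.0 if nothing is predicted positive)."""
    precisions = []
    for threshold in thresholds:
        predicted_pos = [p > threshold for p in predictions]
        tp = sum(1 for y, pos in zip(labels, predicted_pos) if pos and y)
        fp = sum(1 for y, pos in zip(labels, predicted_pos) if pos and not y)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
    return precisions


def as_scalar_metrics(values, thresholds, prefix="precision_at_"):
    """Split a vector metric into a dict of individually named scalars."""
    return {"%s%s" % (prefix, t): v for t, v in zip(thresholds, values)}


if __name__ == "__main__":
    thresholds = [0.5, 0.3]
    values = precision_at_thresholds([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1], thresholds)
    print(as_scalar_metrics(values, thresholds))
```

In an Estimator, each entry of eval_metric_ops is a (metric value, update op) pair, so the same split would have to be applied to the value tensor while keeping the update op; that step is not shown here.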

python tensorflow tensorboard

ree*_*106

lucky-day

5
Votes

1
Answers

1021
Views

Google Cloud Dataflow: write to CSV from a dictionary

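I have a dictionary of values that I would like to write to GCS as a valid .CSV file using the Python SDK. I can write the dictionary out as a newline-separated text file, but I cannot seem to find an example of converting the dictionary to a valid .CSV. Can anybody suggest the best way to generate CSVs in a Dataflow pipeline? The answers to this question address reading from CSV files, but do not really address writing to CSV files. I recognize that CSV files are just text files with rules, but I am still struggling to convert the dictionary of data into a CSV that can be written using WriteToText.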

Here is a simple example dictionary that I would like to turn into a CSV:

test_input = [{'label': 1, 'text': 'Here is a sentence'},
              {'label': 2, 'text': 'Another sentence goes here'}]


test_input  | beam.io.WriteToText(path_to_gcs)

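The code above produces a text file in which each dictionary is on its own line. Is there any functionality in Apache Beam that I can take advantage of (similar to csv.DictWriter)?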
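
Since a CSV file is a text file with quoting rules, one approach is to turn each dictionary into a single CSV-formatted line with the standard csv module and hand those lines to a text sink. The sketch below assumes Python 3; the helper name `dict_to_csv_line` is hypothetical and the column order is passed in explicitly.

```python
import csv
import io


def dict_to_csv_line(row, fieldnames):
    """Format one dictionary as a single CSV line without a trailing newline."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writerow(row)
    # DictWriter appends a line terminator; the text sink adds its own newline.
    return buffer.getvalue().rstrip("\r\n")


if __name__ == "__main__":
    test_input = [{'label': 1, 'text': 'Here is a sentence'},
                  {'label': 2, 'text': 'Another sentence goes here'}]
    for row in test_input:
        print(dict_to_csv_line(row, ['label', 'text']))
```

In a pipeline, this could be applied with something like beam.Map(dict_to_csv_line, ['label', 'text']) ahead of beam.io.WriteToText, whose header argument can carry the column names.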

python google-cloud-dataflow apache-beam

ree*_*106

lucky-day

5
Votes

1
Answers

6725
Views

How should the Tensorboard fraction of zero values be interpreted?

I am running a cloud ML engine job and my tensorboard plots are showing the fraction of zero values for my hidden layers steadily increasing towards 1 as the number of steps increases. How should this plot be interpreted? I believe it is a good thing as more zero values would suggest that the model is getting more "certain" about the predictions that it is making.
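
To make concrete what such a plot measures, here is a small plain-Python sketch. It assumes the plotted quantity is the share of a layer's activations that are exactly zero after a ReLU, which the post does not state; the helper names are hypothetical.

```python
def relu(values):
    """Clamp negative pre-activations to zero."""
    return [max(0.0, v) for v in values]


def zero_fraction(values):
    """Return the share of entries that are exactly zero (0.0 for empty input)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if v == 0) / len(values)


if __name__ == "__main__":
    print(zero_fraction(relu([-1.0, 2.0, -3.0, 4.0])))
    print(zero_fraction(relu([-1.0, -2.0])))
```

Under that assumption, a value of 1 means every unit in the layer outputs zero for the inputs seen at that step.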