I tried to install Ray, but got this error:
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
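For reference, workaround 2 can be applied without reinstalling anything; a minimal sketch, assuming the variable is set before anything imports the generated _pb2 modules:

import os
# Force the pure-Python protobuf implementation (slower, but avoids the error).
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

import ray  # must be imported after the environment variable is set
ray.init()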
I tried to fix the issue by downgrading protobuf:
Name: protobuf
Version: 3.20.0
Summary: Protocol Buffers
Home-page: https://developers.google.com/protocol-buffers/ …
There are many simple tutorials, as well as SO questions and answers, claiming that Ray somehow shares data with the workers, but none of them go into detail about exactly what is shared, and how, on which OS.
For example, in this SO answer, https://stackoverflow.com/a/56287012/1382437, an np array is serialized into the shared object store and then used by several workers that all access the same data (code copied from that answer):
import numpy as np
import ray
ray.init()
@ray.remote
def worker_func(data, i):
# Do work. This function will have read-only access to
# the data array.
return 0
data = np.zeros(10**7)
# Store the large array in shared memory once so that it can be accessed
# by the worker tasks without creating copies.
data_id = ray.put(data)
# Run worker_func 10 times in parallel. This will not create any …
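The snippet above is cut off; launching the tasks would look roughly like the following hedged sketch (not verbatim from the linked answer):

# Pass the ObjectRef, not the array itself, so every task reads the same
# shared-memory copy instead of receiving its own serialized copy.
result_ids = [worker_func.remote(data_id, i) for i in range(10)]
results = ray.get(result_ids)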
I am starting to use ray for distributed machine learning, and I have already run into some problems. Memory usage keeps growing until the program crashes. Memory is leaking somehow, even though I keep clearing the list. Any idea why?
My specs: OS platform and distribution: Ubuntu 16.04; Ray installed from: binary; Ray version: 0.6.5; Python version: 3.6.8.
I have already tried using the experimental queue instead of the DataServer class, but the problem remains the same.
import numpy as np
import ray
import time

ray.init(redis_max_memory=100000000)

@ray.remote
class Runner():
    def __init__(self, dataList):
        self.run(dataList)

    def run(self, dataList):
        # Push data into the server as fast as possible.
        while True:
            dataList.put.remote(np.ones(10))

@ray.remote
class Optimizer():
    def __init__(self, dataList):
        self.optimize(dataList)

    def optimize(self, dataList):
        # Pop data from the server as fast as possible.
        while True:
            dataList.pop.remote()

@ray.remote
class DataServer():
    def __init__(self):
        self.dataList = []

    def put(self, data):
        self.dataList.append(data)

    def pop(self):
        if len(self.dataList) != 0:
            return self.dataList.pop()

    def get_size(self):
        return len(self.dataList)

dataServer = DataServer.remote()
runner = Runner.remote(dataServer)
optimizer1 = Optimizer.remote(dataServer)
optimizer2 = Optimizer.remote(dataServer) …
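A hedged mitigation sketch (my assumption, not part of the original question): every .remote() call returns an object ref whose result is retained until it is consumed, so batching the calls in Runner.run and blocking on them periodically gives Ray a chance to reclaim them:

@ray.remote
class Runner():
    def __init__(self, dataList):
        self.run(dataList)

    def run(self, dataList):
        pending = []
        while True:
            pending.append(dataList.put.remote(np.ones(10)))
            if len(pending) >= 1000:   # hypothetical batch size
                ray.get(pending)       # wait for the puts, then drop the refs
                pending = []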
While training this code with ray tune (1 GPU per trial), after several hours of training (about 20 trials), GPUs 0 and 1 started raising a CUDA out of memory error. Even after killing the training process, the GPUs still raise the error.

As mentioned above, all of my GPU devices are currently empty, and no other Python processes are running apart from these two.
import torch
torch.rand(1, 2).to('cuda:0') # cuda out of memory error
torch.rand(1, 2).to('cuda:1') # cuda out of memory error
torch.rand(1, 2).to('cuda:2') # working
torch.rand(1, 2).to('cuda:3') # working
torch.cuda.device_count() # 4
torch.cuda.memory_reserved() # 0
torch.cuda.is_available() # True
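As an additional check, a small diagnostic sketch (my addition, using standard torch.cuda calls) that prints what each device reports as allocated and reserved by the current process:

import torch
for i in range(torch.cuda.device_count()):
    # Both counters should be 0 if this process holds no CUDA memory.
    print(i, torch.cuda.memory_allocated(i), torch.cuda.memory_reserved(i))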
# error message of GPU 0, 1
RuntimeError: CUDA error: out of memory
However, GPUs 0 and 1 still raise the out of memory error. If I restart the computer (ubuntu …
I am running ray on EC2. My workers run on a c5.large instance, which has about 4 GB of RAM.
When I run many jobs, I see the following error message:
File "python/ray/_raylet.pyx", line 631, in ray._raylet.execute_task
File "/home/ubuntu/project/env/lib/python3.6/site-packages/ray/memory_monitor.py", line 126, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-43-111 is used (3.47 / 3.65 GB). The top 10 memory consumers are:
PID MEM COMMAND
21183 0.21GiB ray::IDLE
21185 0.21GiB ray::IDLE
21222 0.21GiB ray::IDLE
21260 0.21GiB ray::IDLE
21149 0.21GiB ray::IDLE
21298 0.21GiB ray::IDLE
21130 0.21GiB ray::IDLE
21148 0.21GiB ray::IDLE
21225 0.21GiB ray::IDLE
21257 0.21GiB ray::IDLE
In addition, …
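One hedged mitigation sketch for a node this small (my assumption, not from the question): cap the object store explicitly at startup instead of letting Ray size it from the node's RAM:

import ray
# 10**9 bytes (~1 GB) is a hypothetical cap chosen for a ~4 GB node.
ray.init(object_store_memory=10**9)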
I am trying to use Ray's parallelization model to process a file record by record. The code runs fine, but the object store grows quickly and eventually crashes. I avoid ray.get(function.remote()) because it hurts performance: the job consists of a few million subtasks, and the overhead of waiting for each task to finish dominates. Is there a way to set a global limit on the object store?
# Code which constantly backpressures the object store, freeing space,
# but causes performance to be worse than serial execution.
for record in infile:
    ray.get(createNucleotideCount.remote(record, copy.copy(dinucDict), copy.copy(tetranucDict), dinucList, tetranucList, filename))

# Code that maximizes throughput but makes the object store grow constantly.
for record in infile:
    createNucleotideCount.remote(record, copy.copy(dinucDict), copy.copy(tetranucDict), dinucList, tetranucList, filename)

# The called function returns either 0 or 1.
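A middle-ground sketch (my assumption, not from the question): bound the number of in-flight tasks with ray.wait() so the object store stays capped without blocking on every single record; MAX_IN_FLIGHT and the batch size are hypothetical:

MAX_IN_FLIGHT = 1000
in_flight = []
for record in infile:
    if len(in_flight) >= MAX_IN_FLIGHT:
        # Block until some tasks finish, then drop their refs.
        done, in_flight = ray.wait(in_flight, num_returns=100)
    in_flight.append(createNucleotideCount.remote(
        record, copy.copy(dinucDict), copy.copy(tetranucDict),
        dinucList, tetranucList, filename))
ray.get(in_flight)  # drain the remainder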
Assume that a ray actor is defined as below:
import numpy as np
import ray

@ray.remote
class Buffer:
    def __init__(self):
        self.memory = np.zeros(10)

    def modify_data(self, indices, values):
        self.memory[indices] = values

    def sample(self, size):
        indices = np.random.randint(0, 10, size)
        return self.memory[indices]
Is it thread-safe to have other actors call methods of Buffer without any lock?
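For context, a minimal usage sketch of the actor above (my addition; assumes ray.init() has been called and that all callers go through the same actor handle):

buffer = Buffer.remote()
ref = buffer.modify_data.remote([0, 1], [1.0, 2.0])
samples = ray.get(buffer.sample.remote(4))
print(samples)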
我正在使用 Ray & RLlib 在 Ubuntu 系统上训练 RL 代理。Tensorboard 用于通过将其指向~/ray_results
所有运行的所有日志文件的存储位置来监控训练进度。未使用 Ray Tune。
For example, when starting a new Ray/RLlib training run, a new directory is created at
~/ray_results/DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1
To visualize training progress, we need to start Tensorboard with
tensorboard --logdir=~/ray_results
Question: Is it possible to configure Ray/RLlib to change the output directory for log files from ~/ray_results to some other location?
Also, instead of logging to a directory with an auto-generated name like DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1, can we set the directory name ourselves?
Failed attempt: tried setting

os.environ['TUNE_RESULT_DIR'] = '~/another_dir'
before running ray.init(), but the result log files are still written to ~/ray_results.
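For reference, a hedged sketch of one possible approach (an assumption based on the Trainable API in Ray ~1.x, not a confirmed answer; DQNTrainer and the paths are placeholders): RLlib trainers accept a logger_creator callable that controls both the parent directory and its name:

import os
from ray.tune.logger import UnifiedLogger
from ray.rllib.agents.dqn import DQNTrainer  # hypothetical choice of trainer

def custom_logger_creator(config):
    # Hypothetical target directory; replace with the desired location/name.
    logdir = os.path.expanduser("~/another_dir/my_run_name")
    os.makedirs(logdir, exist_ok=True)
    return UnifiedLogger(config, logdir)

trainer = DQNTrainer(config={"env": "CartPole-v0"},
                     logger_creator=custom_logger_creator)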
So I just ran a tune experiment and got the following output:
+--------------------+------------+-------+-------------+----------------+--------+------------+
| Trial name | status | loc | lr | weight_decay | loss | accuracy |
|--------------------+------------+-------+-------------+----------------+--------+------------|
| trainable_13720f86 | TERMINATED | | 0.00116961 | 0.00371219 | 0.673 | 0.7977 |
| trainable_13792744 | TERMINATED | | 0.109529 | 0.0862344 | 0.373 | 0.8427 |
| trainable_137ecd98 | TERMINATED | | 4.35062e-06 | 0.0261442 | 0.6993 | 0.7837 |
| trainable_1383f9d0 | TERMINATED | | 1.37858e-05 | 0.0974182 | 0.4538 | 0.8428 |
| trainable_13892f72 | …
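As context, a hedged sketch for pulling the best trial out of such a table (my assumption: analysis is the ExperimentAnalysis object returned by tune.run):

best_trial = analysis.get_best_trial(metric="accuracy", mode="max")
print(best_trial.config)  # e.g. the winning lr and weight_decay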
I have installed the ray module and keep getting this warning:

FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via
pip install 'ray[default]'. Please update your install command. (pid=None) "update your install command.", FutureWarning)

I updated my pip install as it says ("update your install command."), but I still keep getting the warning.