I tried to install Ray, but got this error:
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
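For reference, workaround 2 can be applied without reinstalling anything; a minimal sketch, assuming the variable is set before anything imports the generated _pb2 modules:

import os
# Force the pure-Python protobuf implementation (slower, but avoids the error).
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

import ray  # must be imported after the environment variable is set
ray.init()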
I tried to fix the issue by downgrading protobuf:
Name: protobuf
Version: 3.20.0
Summary: Protocol Buffers
Home-page: https://developers.google.com/protocol-buffers/ …
There are many simple tutorials, as well as SO questions and answers, claiming that Ray somehow shares data with the workers, but none of them go into detail about exactly what is shared, and how, on which OS.
For example, in this SO answer, https://stackoverflow.com/a/56287012/1382437, an np array is serialized into the shared object store and then used by several workers that all access the same data (code copied from that answer):
import numpy as np
import ray
ray.init()
@ray.remote
def worker_func(data, i):
# Do work. This function will have read-only access to
# the data array.
return 0
data = np.zeros(10**7)
# Store the large array in shared memory once so that it can be accessed
# by the worker tasks without creating copies.
data_id = ray.put(data)
# Run worker_func 10 times in parallel. This will not create any …
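The snippet above is cut off; launching the tasks would look roughly like the following hedged sketch (not verbatim from the linked answer):

# Pass the ObjectRef, not the array itself, so every task reads the same
# shared-memory copy instead of receiving its own serialized copy.
result_ids = [worker_func.remote(data_id, i) for i in range(10)]
results = ray.get(result_ids)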
I am starting to use ray for distributed machine learning, and I have already run into some problems. Memory usage keeps growing until the program crashes. Memory is leaking somehow, even though I keep clearing the list. Any idea why?
My specs: OS platform and distribution: Ubuntu 16.04; Ray installed from: binary; Ray version: 0.6.5; Python version: 3.6.8.
I have already tried using the experimental queue instead of the DataServer class, but the problem remains the same.
import numpy as np
import ray
import time

ray.init(redis_max_memory=100000000)

@ray.remote
class Runner():
    def __init__(self, dataList):
        self.run(dataList)

    def run(self, dataList):
        # Push data into the server as fast as possible.
        while True:
            dataList.put.remote(np.ones(10))

@ray.remote
class Optimizer():
    def __init__(self, dataList):
        self.optimize(dataList)

    def optimize(self, dataList):
        # Pop data from the server as fast as possible.
        while True:
            dataList.pop.remote()

@ray.remote
class DataServer():
    def __init__(self):
        self.dataList = []

    def put(self, data):
        self.dataList.append(data)

    def pop(self):
        if len(self.dataList) != 0:
            return self.dataList.pop()

    def get_size(self):
        return len(self.dataList)

dataServer = DataServer.remote()
runner = Runner.remote(dataServer)
optimizer1 = Optimizer.remote(dataServer)
optimizer2 = Optimizer.remote(dataServer) …
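A hedged mitigation sketch (my assumption, not part of the original question): every .remote() call returns an object ref whose result is retained until it is consumed, so batching the calls in Runner.run and blocking on them periodically gives Ray a chance to reclaim them:

@ray.remote
class Runner():
    def __init__(self, dataList):
        self.run(dataList)

    def run(self, dataList):
        pending = []
        while True:
            pending.append(dataList.put.remote(np.ones(10)))
            if len(pending) >= 1000:   # hypothetical batch size
                ray.get(pending)       # wait for the puts, then drop the refs
                pending = []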
While training this code with ray tune (1 GPU per trial), after several hours of training (about 20 trials), GPUs 0 and 1 started raising a CUDA out of memory error. Even after killing the training process, the GPUs still raise the error.

As mentioned above, all of my GPU devices are currently empty, and no other Python processes are running apart from these two.
import torch
torch.rand(1, 2).to('cuda:0') # cuda out of memory error
torch.rand(1, 2).to('cuda:1') # cuda out of memory error
torch.rand(1, 2).to('cuda:2') # working
torch.rand(1, 2).to('cuda:3') # working
torch.cuda.device_count() # 4
torch.cuda.memory_reserved() # 0
torch.cuda.is_available() # True
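As an additional check, a small diagnostic sketch (my addition, using standard torch.cuda calls) that prints what each device reports as allocated and reserved by the current process:

import torch
for i in range(torch.cuda.device_count()):
    # Both counters should be 0 if this process holds no CUDA memory.
    print(i, torch.cuda.memory_allocated(i), torch.cuda.memory_reserved(i))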
# error message of GPU 0, 1
RuntimeError: CUDA error: out of memory
However, GPUs 0 and 1 still raise the out of memory error. If I restart the computer (ubuntu …
I am running ray on EC2. My workers run on a c5.large instance, which has about 4 GB of RAM.
When I run many jobs, I see the following error message:
File "python/ray/_raylet.pyx", line 631, in ray._raylet.execute_task
File "/home/ubuntu/project/env/lib/python3.6/site-packages/ray/memory_monitor.py", line 126, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-43-111 is used (3.47 / 3.65 GB). The top 10 memory consumers are:
PID MEM COMMAND
21183 0.21GiB ray::IDLE
21185 0.21GiB ray::IDLE
21222 0.21GiB ray::IDLE
21260 0.21GiB ray::IDLE
21149 0.21GiB ray::IDLE
21298 0.21GiB ray::IDLE
21130 0.21GiB ray::IDLE
21148 0.21GiB ray::IDLE
21225 0.21GiB ray::IDLE
21257 0.21GiB ray::IDLE
In addition, …
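One hedged mitigation sketch for a node this small (my assumption, not from the question): cap the object store explicitly at startup instead of letting Ray size it from the node's RAM:

import ray
# 10**9 bytes (~1 GB) is a hypothetical cap chosen for a ~4 GB node.
ray.init(object_store_memory=10**9)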
I am trying to use Ray's parallelization model to process a file record by record. The code runs fine, but the object store grows quickly and eventually crashes. I avoid ray.get(function.remote()) because it hurts performance: the job consists of a few million subtasks, and the overhead of waiting for each task to finish dominates. Is there a way to set a global limit on the object store?
# Code which constantly backpressures the object store, freeing space,
# but causes performance to be worse than serial execution.
for record in infile:
    ray.get(createNucleotideCount.remote(record, copy.copy(dinucDict), copy.copy(tetranucDict), dinucList, tetranucList, filename))

# Code that maximizes throughput but makes the object store grow constantly.
for record in infile:
    createNucleotideCount.remote(record, copy.copy(dinucDict), copy.copy(tetranucDict), dinucList, tetranucList, filename)

# The called function returns either 0 or 1.
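A middle-ground sketch (my assumption, not from the question): bound the number of in-flight tasks with ray.wait() so the object store stays capped without blocking on every single record; MAX_IN_FLIGHT and the batch size are hypothetical:

MAX_IN_FLIGHT = 1000
in_flight = []
for record in infile:
    if len(in_flight) >= MAX_IN_FLIGHT:
        # Block until some tasks finish, then drop their refs.
        done, in_flight = ray.wait(in_flight, num_returns=100)
    in_flight.append(createNucleotideCount.remote(
        record, copy.copy(dinucDict), copy.copy(tetranucDict),
        dinucList, tetranucList, filename))
ray.get(in_flight)  # drain the remainder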
Assume that a ray actor is defined as below:
import numpy as np
import ray

@ray.remote
class Buffer:
    def __init__(self):
        self.memory = np.zeros(10)

    def modify_data(self, indices, values):
        self.memory[indices] = values

    def sample(self, size):
        indices = np.random.randint(0, 10, size)
        return self.memory[indices]
Is it thread-safe to have other actors call methods of Buffer without any lock?
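For context, a minimal usage sketch of the actor above (my addition; assumes ray.init() has been called and that all callers go through the same actor handle):

buffer = Buffer.remote()
ref = buffer.modify_data.remote([0, 1], [1.0, 2.0])
samples = ray.get(buffer.sample.remote(4))
print(samples)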
我正在使用 Ray & RLlib 在 Ubuntu 系统上训练 RL 代理。Tensorboard 用于通过将其指向~/ray_results
所有运行的所有日志文件的存储位置来监控训练进度。未使用 Ray Tune。
For example, when starting a new Ray/RLlib training run, a new directory is created at
~/ray_results/DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1
To visualize training progress, we need to start Tensorboard with
tensorboard --logdir=~/ray_results
Question: Is it possible to configure Ray/RLlib to change the output directory for log files from ~/ray_results to some other location?
Also, instead of logging to a directory with an auto-generated name like DQN_ray_custom_env_2020-06-07_05-26-32djwxfdu1, can we set the directory name ourselves?
Failed attempt: tried setting

os.environ['TUNE_RESULT_DIR'] = '~/another_dir'
before running ray.init(), but the result log files are still written to ~/ray_results.
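For reference, a hedged sketch of one possible approach (an assumption based on the Trainable API in Ray ~1.x, not a confirmed answer; DQNTrainer and the paths are placeholders): RLlib trainers accept a logger_creator callable that controls both the parent directory and its name:

import os
from ray.tune.logger import UnifiedLogger
from ray.rllib.agents.dqn import DQNTrainer  # hypothetical choice of trainer

def custom_logger_creator(config):
    # Hypothetical target directory; replace with the desired location/name.
    logdir = os.path.expanduser("~/another_dir/my_run_name")
    os.makedirs(logdir, exist_ok=True)
    return UnifiedLogger(config, logdir)

trainer = DQNTrainer(config={"env": "CartPole-v0"},
                     logger_creator=custom_logger_creator)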
So I just ran a tune experiment and got the following output:
+--------------------+------------+-------+-------------+----------------+--------+------------+
| Trial name | status | loc | lr | weight_decay | loss | accuracy |
|--------------------+------------+-------+-------------+----------------+--------+------------|
| trainable_13720f86 | TERMINATED | | 0.00116961 | 0.00371219 | 0.673 | 0.7977 |
| trainable_13792744 | TERMINATED | | 0.109529 | 0.0862344 | 0.373 | 0.8427 |
| trainable_137ecd98 | TERMINATED | | 4.35062e-06 | 0.0261442 | 0.6993 | 0.7837 |
| trainable_1383f9d0 | TERMINATED | | 1.37858e-05 | 0.0974182 | 0.4538 | 0.8428 |
| trainable_13892f72 | …
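As context, a hedged sketch for pulling the best trial out of such a table (my assumption: analysis is the ExperimentAnalysis object returned by tune.run):

best_trial = analysis.get_best_trial(metric="accuracy", mode="max")
print(best_trial.config)  # e.g. the winning lr and weight_decay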
I have installed the ray module and keep getting this warning:

FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via
pip install 'ray[default]'. Please update your install command. (pid=None) "update your install command.", FutureWarning)

I updated my pip install as it says ("update your install command."), but I still keep getting the warning.