celery .delay hangs (recent, not an auth problem)

Bac*_*con 5 python django rabbitmq celery

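I am running Celery 2.2.4/djCelery 2.2.4, with RabbitMQ 2.1.1 as the backend. I recently brought two new Celery servers online. I had been running 2 workers across two machines with a total of 18 threads, and on my new souped-up boxes (36 GB RAM + dual hyper-threaded quad-core) I am running 10 workers with 8 threads each, for a total of 180 threads. My tasks are all pretty small, so this should be fine.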

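The nodes have been running fine for the last few days, but today I noticed that .delay() is hanging. When I interrupt it, I see a traceback that points here: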

File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 324, in delay
    return self.apply_async(args, kwargs)
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 449, in apply_async
    publish.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/kombu/compat.py", line 108, in close
    self.backend.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/channel.py", line 194, in close
    (20, 41),    # Channel.close_ok
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
    self.channel_id, allowed_methods)
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
    self.method_reader.read_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 212, in read_method
    self._next_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 127, in _next_method
    frame_type, channel, payload = self.source.read_frame()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 109, in read_frame
    frame_type, channel, size = unpack('>BHI', self._read(7))
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 200, in _read
    s = self.sock.recv(65536)
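
The bottom frame of the traceback is a blocking sock.recv() call, reached while the client waits for the broker's Channel.close_ok reply. A socket with no timeout waits indefinitely when the peer never answers, which matches the hang seen here. The sketch below shows that behaviour with a plain local socket pair and no broker. The helper name recv_with_timeout is hypothetical and is not part of amqplib or Celery, and whether this amqplib version exposes a comparable timeout is not something the traceback shows.

```python
import socket


def recv_with_timeout(sock, nbytes, timeout_seconds):
    """Read up to nbytes from sock, or return None if nothing arrives in time.

    Hypothetical helper for illustration only; not an amqplib or Celery API.
    """
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Without the timeout above, recv() would block here indefinitely.
        return None


if __name__ == "__main__":
    # A connected pair of local sockets stands in for client and broker.
    client, silent_peer = socket.socketpair()
    try:
        # The peer sends nothing, so the read gives up after the timeout.
        print(recv_with_timeout(client, 7, 0.2))
        # Once the peer replies, the same call returns the bytes.
        silent_peer.sendall(b"1234567")
        print(recv_with_timeout(client, 7, 0.2))
    finally:
        client.close()
        silent_peer.close()
```

A timeout of this kind only turns a silent hang into a visible error. It does not explain why the broker stopped replying.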

I checked the Rabbit logs, and I see the process that is trying to connect logged as:

=INFO REPORT==== 12-Jun-2011::22:58:12 ===
accepted TCP connection on 0.0.0.0:5672 from x.x.x.x:48569

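I have my Celery log level set to INFO, but I don't see anything particularly interesting in the Celery logs, except that 2 of the workers can't connect to the broker: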

[2011-06-12 22:41:08,033: ERROR/MainProcess] Consumer: Connection to broker lost. Trying to re-establish connection...

All of the other nodes are able to connect without issue.

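I know there was a post of a similar nature last year (RabbitMQ / Celery with Django hangs on delay/ready/etc - No useful log info), but I'm pretty certain this is different. Could it be that the sheer number of workers is creating some sort of race condition in amqplib? I found a thread that seems to indicate that amqplib is not thread-safe, though I'm not sure whether that matters for Celery.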

EDIT: I tried celeryctl purge on both nodes. On one it succeeded, but on the other it failed with the following AMQP error:

AMQPConnectionException(reply_code, reply_text, (class_id, method_id))
    amqplib.client_0_8.exceptions.AMQPConnectionException: 
    (530, u"NOT_ALLOWED - cannot redeclare exchange 'XXXXX' in vhost 'XXXXX' 
     with different type, durable or autodelete   value", (40, 10), 'Channel.exchange_declare')

On both nodes, inspect stats hangs with the "can't close connection" traceback above. I'm at a loss here.

EDIT2: I was able to delete the offending exchange using exchange.delete from camqadm, and now the second node hangs too :(.

EDIT3: One other thing that changed recently is that I added an additional vhost to rabbitmq, which my staging node connects to.

Bac*_*con 6

Hopefully this will save somebody a lot of time... though it certainly doesn't save me any embarrassment:

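/var was full on the server that was running Rabbit. With all of the nodes I had added, Rabbit was doing a lot more logging and it filled up /var. It could no longer write to /var/lib/rabbitmq, and so no messages were going through.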
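
A quick way to catch this earlier is to check free space on the broker host's data partition. The following is a minimal sketch that assumes a modern Python 3 on that host (shutil.disk_usage does not exist in the Python 2.5 from the traceback). The function names and the 10% threshold are hypothetical choices for illustration, not anything RabbitMQ or Celery provides.

```python
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as having no free space.
        return 0.0
    return usage.free / usage.total


def low_on_space(path, min_free_fraction=0.10):
    """True if the filesystem holding path has less free space than the threshold.

    The 10% default is an arbitrary illustrative threshold.
    """
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # /var/lib/rabbitmq is the directory named in the answer; adjust as needed.
    path = "/var/lib/rabbitmq"
    try:
        print(path, "low on space:", low_on_space(path))
    except FileNotFoundError:
        print(path, "does not exist on this machine")
```

Running something like this from cron or a monitoring agent on the broker host would flag a filling /var before publishing stalls.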