Django Haystack: updating the index faster

pro*_*ype 12 django solr django-haystack

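I've been using Django Haystack for a while now and it's great! I have a fairly heavy site whose data needs to be refreshed periodically (every 15 to 30 minutes).

Running python manage.py update_index takes a long time to update the data. Is there a way to speed this up? Or, if possible, to update only the data that has changed?

I'm currently using Django Haystack 1.2.7 with Solr as the backend, and Django 1.4.

Thanks!!!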


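Edit:

Yes, I have already read that part of the documentation, but what I really need is a way to speed up indexing, perhaps by updating only recent data instead of everything. I found get_updated_field but I don't know how to use it. The documentation only explains why you would use it and doesn't show a real example.


Edit 2: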

start = DateTimeField(model_attr='start', null=True, faceted=True, --HERE?--)

Edit 3:

OK, I've implemented the solution, but when I ran rebuild_index (45,000 records) it nearly crashed my computer. After waiting 10 minutes I got this error:

 File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/__init__.py", line 443, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/__init__.py", line 382, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 196, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 232, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/rebuild_index.py", line 16, in handle
    call_command('update_index', **options)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/__init__.py", line 150, in call_command
    return klass.execute(*args, **defaults)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 232, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 193, in handle
    return super(Command, self).handle(*apps, **options)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 304, in handle
    app_output = self.handle_app(app, **options)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 229, in handle_app
    do_update(index, qs, start, end, total, self.verbosity)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 109, in do_update
    index.backend.update(index, current_qs)
  File "/usr/local/lib/python2.7/dist-packages/haystack/backends/solr_backend.py", line 73, in update
    self.conn.add(docs, commit=commit, boost=index.get_field_weights())
  File "/usr/local/lib/python2.7/dist-packages/pysolr.py", line 686, in add
    m = ET.tostring(message, encoding='utf-8')
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 821, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 915, in _serialize_xml
    write("<" + tag)
MemoryError

Ste*_*ger 21

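get_updated_field should return a string containing the name of the attribute on the model that holds the date the model was last updated (Haystack docs). A DateField or DateTimeField with auto_now=True would be ideal (Django docs).

For example, my UserProfile model has a field named updated.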

models.py

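from django.contrib.auth.models import User
from django.db import models

class UserProfile(models.Model):
    user = models.ForeignKey(User)
    # lots of other fields snipped
    # auto_now=True sets this field to the current time on every save()
    updated = models.DateTimeField(auto_now=True)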

search_indexes.py

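from haystack.indexes import CharField, SearchIndex

# Assumes search_indexes.py lives in the same app as models.py
from .models import UserProfile


class UserProfileIndex(SearchIndex):
    text = CharField(document=True, use_template=True)
    user = CharField(model_attr='user')
    user_fullname = CharField(model_attr='user__get_full_name')

    def get_model(self):
        return UserProfile

    def get_updated_field(self):
        # Name of the model attribute that records the last modification time
        return "updated"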

Then, when I run ./manage.py update_index --age=10, it only indexes user profiles that were updated in the last 10 hours.
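Conceptually, combining --age with get_updated_field amounts to a cutoff filter on the timestamp column. The following is only a rough sketch of that idea in plain Python, not Haystack's actual implementation; the Profile tuple and the changed_since helper are hypothetical names made up for illustration.

```python
import datetime
from collections import namedtuple

# Hypothetical stand-in for a model row; only the `updated` attribute matters here.
Profile = namedtuple("Profile", ["name", "updated"])


def changed_since(rows, age_hours, updated_field="updated", now=None):
    """Return the rows whose timestamp falls within the last `age_hours` hours."""
    now = now or datetime.datetime.now()
    cutoff = now - datetime.timedelta(hours=age_hours)
    return [row for row in rows if getattr(row, updated_field) >= cutoff]


# A fixed "now" keeps the example deterministic.
now = datetime.datetime(2012, 6, 1, 12, 0)
rows = [
    Profile("alice", now - datetime.timedelta(hours=2)),   # changed recently
    Profile("bob", now - datetime.timedelta(hours=30)),    # unchanged for over a day
]

recent = changed_since(rows, age_hours=10, now=now)
```

Only rows newer than the cutoff are selected, which is why each incremental run touches far fewer documents than a full reindex.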

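  • Be careful if your search index references other models that are part of the index. The last-updated timestamp won't change when they change, so those changes won't be reindexed (imagine a category model hanging off the main object). (6 upvotes)
  • FYI, if you do bulk updates with Django's QuerySet `.update()` method, `auto_now` is not triggered, because `save()` is never called and the post_save signal is not sent. That means the `--age` option above will miss those recently changed rows. To work around this, either loop over the queryset and call `.save()`, or keep using `.update()` and set the timestamp yourself, e.g. `updated=datetime.datetime.now()`. (3 upvotes)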
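For the bulk-update caveat, one option is a small helper that adds the timestamp to whatever is passed to `.update()`. This is a minimal sketch under assumptions: with_timestamp and the is_active field are hypothetical, and the queryset call appears only as a comment because it needs a configured Django project.

```python
import datetime


def with_timestamp(changes, updated_field="updated", now=None):
    """Return a copy of `changes` that also sets the timestamp field.

    An explicit value already present for `updated_field` is left untouched.
    """
    values = dict(changes)
    values.setdefault(updated_field, now or datetime.datetime.now())
    return values


# A fixed timestamp keeps the example deterministic.
fixed_now = datetime.datetime(2012, 6, 1, 12, 0)
kwargs = with_timestamp({"is_active": False}, now=fixed_now)

# Intended use with a Django queryset (not executed here):
#   UserProfile.objects.filter(pk__in=ids).update(**kwargs)
```

Because the timestamp is written in the same UPDATE statement, a later update_index --age run can pick the rows up without a per-object `.save()` loop.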