Django ORM: how to sort by date and then select the best object per foreign-key group?

Question

Geo*_*off 5 python django postgresql full-text-search django-orm

I realize my title is a bit convoluted, so allow me to demonstrate. I'm using Django 2.2.5 with Python 3. Here are the models I'm currently working with:

from django.db import models
from django.db.models import F
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField, SearchVector, SearchQuery, SearchRank

class Thread(models.Model):
    title = models.CharField(max_length=100)
    last_update = models.DateTimeField(auto_now=True)

class PostQuerySet(models.QuerySet):
    _search_vector = SearchVector('thread__type') + \
                     SearchVector('thread__title') + \
                     SearchVector('from_name') + \
                     SearchVector('from_email') + \
                     SearchVector('message')

    ###
    # There's code here that updates the `Post.search_vector` field for each `Post` object
    # using `PostQuerySet._search_vector`.
    ###

    def search(self, text):
        """
            Search posts using the indexed `search_vector` field. I can, for example, call
            `Post.objects.search('influenza h1n1')`.
        """
        search_query = SearchQuery(text)
        search_rank = SearchRank(F('search_vector'), search_query)
        return self.annotate(rank=search_rank).filter(search_vector=search_query).order_by('-rank')

class Post(models.Model):
    thread = models.ForeignKey(Thread, on_delete=models.CASCADE)
    timestamp = models.DateTimeField()
    from_name = models.CharField(max_length=100)
    from_email = models.EmailField()
    message = models.TextField()
    in_response_to = models.ManyToManyField('self', symmetrical=False, blank=True)
    search_vector = SearchVectorField(null=True)

    objects = PostQuerySet.as_manager()

    class Meta:
        ordering = ['timestamp']
        indexes = [
            GinIndex(fields=['search_vector'])
        ]

(For brevity, I've removed some things from these models that I don't think are relevant; if they turn out to matter later, I'll add them back in.)

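In plain English, I'm working on an app that represents data from an email listserv. Basically, there is a Thread that contains multiple Post objects; people reply-all to the initial post and create a discussion. I've just implemented a search feature using Django's built-in support for full-text search. It's super fast, and I love it. Here's an example of me searching in views.py: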

###
# Pull `query` from a form defined in `forms.py`.
###

search_results = Post.objects.search(query).order_by('-timestamp')

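This all works great and returns search results that absolutely make sense. But I've just run into a situation I'm not sure how to handle: the results aren't as clean as I'd like. What this query gives me is every Post object that matches the user-supplied query. That's fine, but many Post objects within the same Thread can clutter the results. It might look like this: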

post5 from thread2 - timestamp 2018-04-01, rank 0.5
post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post3 from thread1 - timestamp 2018-01-01, rank 0.6
post2 from thread1 - timestamp 2017-12-01, rank 0.7
post2 from thread2 - timestamp 2017-11-01, rank 0.7

(Here, rank is the relevance returned by Django's SearchRank.)

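What I really want is this: I'd like to show the most representative matching Post for each Thread, sorted by descending timestamp. In other words, for each Thread that has a Post in the search results, only the Post with the highest rank should be shown, and those highest-rank Post objects should be sorted by timestamp in descending order. So, in the example above, these are the results I'd like to see: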

post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post2 from thread1 - timestamp 2017-12-01, rank 0.7

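It's fairly straightforward to do what I want with a couple of for loops, but I'm really hoping there's a way to do this purely in the ORM for efficiency. Do you have any suggestions? Let me know if you need me to clarify anything about the problem setup or what I'm after.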
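For reference, the loop-based version mentioned above might look something like the sketch below. It runs on the six sample rows from the question, represented as plain tuples rather than real Post objects, and the function name `best_per_thread` is made up for illustration.

```python
from datetime import date

# Sample rows from the question: (post, thread, timestamp, rank).
results = [
    ("post5", "thread2", date(2018, 4, 1), 0.5),
    ("post1", "thread3", date(2018, 3, 1), 0.25),
    ("post3", "thread2", date(2018, 2, 1), 0.75),
    ("post3", "thread1", date(2018, 1, 1), 0.6),
    ("post2", "thread1", date(2017, 12, 1), 0.7),
    ("post2", "thread2", date(2017, 11, 1), 0.7),
]


def best_per_thread(rows):
    """Keep the highest-ranked row per thread, then sort newest first."""
    best = {}
    for row in rows:
        _, thread, _, rank = row
        # Replace the stored row only when this one ranks strictly higher.
        if thread not in best or rank > best[thread][3]:
            best[thread] = row
    return sorted(best.values(), key=lambda r: r[2], reverse=True)


for post, thread, timestamp, rank in best_per_thread(results):
    print(f"{post} from {thread} - timestamp {timestamp}, rank {rank}")
```

This works, but it pulls every matching row into Python first, which is the inefficiency an ORM-only solution would avoid.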

Answer 1

Pao*_*rre 3

I think you have to query the Post model, order it by thread, rank, and timestamp, and then use distinct on thread.
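To make the idea concrete, here is a rough plain-Python imitation of what "order by thread, rank, timestamp, then distinct on thread" does, using the sample rows from the question. It is only a sketch of the semantics, not how PostgreSQL implements it, and the function name `distinct_on_thread` is hypothetical.

```python
from itertools import groupby

# Sample rows from the question (ISO date strings sort chronologically).
rows = [
    {"post": "post5", "thread": "thread2", "timestamp": "2018-04-01", "rank": 0.5},
    {"post": "post1", "thread": "thread3", "timestamp": "2018-03-01", "rank": 0.25},
    {"post": "post3", "thread": "thread2", "timestamp": "2018-02-01", "rank": 0.75},
    {"post": "post3", "thread": "thread1", "timestamp": "2018-01-01", "rank": 0.6},
    {"post": "post2", "thread": "thread1", "timestamp": "2017-12-01", "rank": 0.7},
    {"post": "post2", "thread": "thread2", "timestamp": "2017-11-01", "rank": 0.7},
]


def distinct_on_thread(rows):
    # Imitate ORDER BY thread ASC, rank DESC, timestamp DESC.
    # Python's sort is stable, so apply the least significant key first.
    ordered = sorted(rows, key=lambda r: r["timestamp"], reverse=True)
    ordered = sorted(ordered, key=lambda r: r["rank"], reverse=True)
    ordered = sorted(ordered, key=lambda r: r["thread"])
    # Imitate DISTINCT ON (thread): keep only the first row of each group.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r["thread"])]


for row in distinct_on_thread(rows):
    print(row["post"], "from", row["thread"], row["timestamp"], row["rank"])
```

Note that the surviving rows come out ordered by thread, not by timestamp, because the thread has to be the leading sort key for the de-duplication step.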

Search

This is the search ordered by timestamp:

Post.objects.search("text").order_by("-timestamp")

This is the SQL executed on my local PostgreSQL:

SELECT "post"."from_name",
       "thread"."title",
       "post"."timestamp",
       ts_rank("post"."search_vector", plainto_tsquery('dolor')) AS "rank"
FROM "post"
INNER JOIN "thread" ON ("post"."thread_id" = "thread"."id")
WHERE "post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY "post"."timestamp" DESC

These are the search results on my local data:

post1 from thread1 - timestamp 2019-07-01, rank 0.0607927
post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread2 - timestamp 2019-05-01, rank 0.0607927
post3 from thread1 - timestamp 2019-05-01, rank 0.0607927
post1 from thread3 - timestamp 2019-05-01, rank 0.0607927
post3 from thread2 - timestamp 2019-04-01, rank 0.0759909
post4 from thread1 - timestamp 2019-04-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909
post5 from thread1 - timestamp 2019-03-01, rank 0.0607927
post3 from thread3 - timestamp 2019-03-01, rank 0.0607927
post4 from thread2 - timestamp 2019-03-01, rank 0.0607927
post5 from thread2 - timestamp 2019-02-01, rank 0.0759909
post4 from thread3 - timestamp 2019-02-01, rank 0.0759909
post5 from thread3 - timestamp 2019-01-01, rank 0.0759909

Solution

This is the query that shows only the most representative matching post (based on search rank) for each thread. Rows come back ordered by thread; within a thread, ties in rank go to the most recent timestamp:

Post.objects.search("text").order_by("thread", "-rank", "-timestamp").distinct("thread")

This is the SQL executed on my local PostgreSQL:

SELECT DISTINCT ON ("forum_post"."thread_id")
       "forum_post"."from_name",
       "forum_thread"."title",
       "forum_post"."timestamp",
       ts_rank("forum_post"."search_vector", plainto_tsquery('dolor')) AS "rank"
FROM "forum_post"
INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id")
WHERE "forum_post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY "forum_post"."thread_id" ASC, "rank" DESC, "forum_post"."timestamp" DESC

These are the search results on my local data:

post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909

Notes

You can read more about distinct in the official Django documentation.

Update

If you absolutely need the results in reverse timestamp order and don't need to show the rank, you can use a subquery to re-order the posts after the previous query.

These are the search results on my local data:

post2 from thread1 - timestamp 2019-06-01
post1 from thread2 - timestamp 2019-06-01
post2 from thread3 - timestamp 2019-04-01
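
One way the subquery idea from the update might be written is sketched below. This is an assumption about how to express it, not necessarily the code the answer used, and it has not been run against a real database here. It assumes `posts` is `Post.objects` from the question (a manager exposing the custom `search()` method); the function name `best_posts_newest_first` is made up for illustration. `distinct()` with field names is PostgreSQL-only.

```python
def best_posts_newest_first(posts, text):
    """Best-ranked post per thread, re-ordered by timestamp (newest first).

    `posts` is assumed to be `Post.objects` from the question.
    """
    # Inner query: one row per thread, chosen by rank, then by timestamp.
    best_per_thread = (
        posts.search(text)
        .order_by("thread", "-rank", "-timestamp")
        .distinct("thread")  # DISTINCT ON, available on PostgreSQL only
    )
    # Outer query: passing a queryset to `__in` makes Django emit a subquery,
    # which leaves the outer ORDER BY free to sort by timestamp alone.
    return posts.filter(pk__in=best_per_thread.values("pk")).order_by("-timestamp")
```

In a view this would be called as `best_posts_newest_first(Post.objects, query)`. The outer queryset no longer carries the `rank` annotation, which matches the update's caveat that the rank is not shown.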

Archived:	6 years, 5 months ago
Views:	4319
Last activity:	6 years, 4 months ago