How to implement a related-articles algorithm using this form of collaborative filtering

Val*_*sso 6 mysql sql

As the title says, I'm having trouble implementing a related-articles algorithm. Let me start by listing the tables in the database:

[articles]
id_article
id_category
name
content
publish_date
is_deleted

[categories]
id_category
id_parent
name

[tags_to_articles]
id_tag
id_article

[tags]
id_tag
name

[articles_to_authors]
id_article
id_author

[authors]
id_author
name
is_deleted

[related_articles]
id_article_left
id_article_right
related_score

The algorithm

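All tables except related_articles already contain data. Now I want to fill related_articles with the scores between articles (very important: the table acts as a directed graph, so the score of article A with article B may differ from the score of B with A; see the list). The score is calculated as follows: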

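  • If the two articles in question have the same category, a number (x) is added to the score
  • For every author they have in common, a number (y) is added to the score
  • For every tag they have in common, a number (z) is added to the score
  • If we are calculating the score of article A with article B, the difference between now() and article B's publish_date generates a number (t), which is subtracted from the score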
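
The four rules above can be sketched for a single ordered pair as follows. This is a minimal illustration in Python, not code from the post: the `Article` class, the `related_score` function, the default weights, and the linear `penalty_per_day` mapping from article age to t are all assumptions, since the post does not say how t is derived from the date difference.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Article:
    # Hypothetical in-memory view of one row of [articles] plus its links.
    id_article: int
    id_category: int
    publish_date: date
    authors: set = field(default_factory=set)  # id_author values
    tags: set = field(default_factory=set)     # id_tag values


def related_score(a, b, today, x=1.0, y=2.0, z=4.0, penalty_per_day=0.01):
    """Directed score of article b as a related article for article a.

    x, y, z are placeholder weights; penalty_per_day is an assumed
    linear rule for turning the age of b into the number t.
    """
    score = 0.0
    if a.id_category == b.id_category:
        score += x                                # same category
    score += y * len(a.authors & b.authors)       # each shared author
    score += z * len(a.tags & b.tags)             # each shared tag
    t = penalty_per_day * (today - b.publish_date).days
    return score - t                              # older b scores lower
```

Because t depends only on the publish date of the right-hand article, `related_score(a, b, today)` and `related_score(b, a, today)` differ whenever the two publish dates differ, which is what makes the table a directed graph.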

My first (inefficient) approach

I tried to write a query like this:

SELECT a.id_article, b.id_article, a.id_category, a.publish_date,
    b.id_category, b.publish_date,
    c.id_tag,
    e.id_author
FROM `articles` a, articles b, 
        tags_to_articles c, tags_to_articles d,
        articles_to_authors e, articles_to_authors f
WHERE a.id_article <> b.id_article AND 
(
    (a.id_article=c.id_article and c.id_tag=d.id_tag and d.id_article=b.id_article)
    OR
    (a.id_article=e.id_article and e.id_author=f.id_author and f.id_article=b.id_article)
    OR
    (a.id_category=b.id_category)
)

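In theory, this lists every element that is worth computing a score for. However, it takes far too much time and resources.

Is there another way to do this? I'm also open to adjusting the algorithm or the tables if that leads to a workable solution. It's also worth noting that the score calculation runs in a cron job; of course I don't expect this to run on every page request.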

Jas*_*ter 4

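I seriously doubt you'll be able to do something like this in a single statement and get any kind of performance. Break it into pieces. Use temp tables. Use set operations.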

-- First, let's list all articles that share a category.
SELECT   a1.id_article as 'left_article',
         a2.id_article as 'right_article',
         1 as 'score'
INTO     #tempscore
FROM     #articles a1
   INNER JOIN #articles a2 ON
         a1.id_category = a2.id_category
     AND a1.id_article <> a2.id_article

-- Now, let's add up everything that shares an author
INSERT INTO #tempscore (left_article, right_article, score)
SELECT   ata1.id_article,
         ata2.id_article,
         2
FROM     #articles_to_authors ata1
   INNER JOIN #articles_to_authors ata2 ON
         ata1.id_author = ata2.id_author

-- Now, let's add up everything that shares a tag
INSERT INTO #tempscore (left_article, right_article, score)
SELECT   ata1.id_article,
         ata2.id_article,
         4
FROM     #tags_to_articles ata1
   INNER JOIN #tags_to_articles ata2 ON
         ata1.id_tag = ata2.id_tag

-- We haven't looked at dates, yet, but let's go ahead and consolidate what we know.
SELECT   left_article as 'left_article',
         right_article as 'right_article',
         SUM (score) as 'total_score'
INTO     #cscore
FROM     #tempscore
GROUP BY left_article,
         right_article

-- Clean up some extraneous stuff
DELETE FROM #cscore WHERE left_article = right_article

-- Now we need to deal with dates
SELECT   DateDiff (Day, art1.publish_date, art2.publish_date) as 'datescore',
         art1.id_article as 'left_article',
         art2.id_article as 'right_article'
INTO     #datescore
FROM     #cscore
   INNER JOIN #articles art1 ON
         #cscore.left_article = art1.id_article
   INNER JOIN #articles art2 ON
         #cscore.right_article = art2.id_article
WHERE    art1.publish_date > art2.publish_date

-- And finally, put it all together
INSERT INTO #related_articles (id_article_left, id_article_right, related_score)
SELECT   s1.left_article,
         s1.right_article,
         s1.total_score + IsNull (s2.datescore, 0)
FROM     #cscore s1
   LEFT  JOIN #datescore s2 ON
         s1.left_article = s2.left_article
     AND s1.right_article = s2.right_article

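The scores looked right in my testing, but I don't have any real sample data to go on, so I can't be sure. If nothing else, this should give you a base to start from.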
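
The code above uses SQL Server style syntax (`#temp` tables, `SELECT ... INTO`, `DateDiff`, `IsNull`), while the question is tagged mysql. As a rough, runnable sketch of the same staged idea, here is a version using Python's built-in `sqlite3` module. Several things are assumptions rather than part of the answer: the `SCHEMA` string and `build_related_articles` function are hypothetical names, the weights default to the 1/2/4 values used above, the date penalty follows the question's rule (age of the right-hand article relative to `today`) with an assumed linear `penalty_per_day`, and `julianday` is SQLite-specific (MySQL would use `DATEDIFF`). The `is_deleted` flags are ignored for brevity.

```python
import sqlite3

# Trimmed-down copy of the question's tables (only the columns used here).
SCHEMA = """
CREATE TABLE articles (id_article INTEGER PRIMARY KEY, id_category INTEGER, publish_date TEXT);
CREATE TABLE articles_to_authors (id_article INTEGER, id_author INTEGER);
CREATE TABLE tags_to_articles (id_tag INTEGER, id_article INTEGER);
CREATE TABLE related_articles (id_article_left INTEGER, id_article_right INTEGER, related_score REAL);
"""


def build_related_articles(conn, today, x=1.0, y=2.0, z=4.0, penalty_per_day=0.01):
    """Rebuild related_articles in stages; today is an ISO date string."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS temp.tempscore")
    cur.execute("CREATE TEMP TABLE tempscore "
                "(left_article INTEGER, right_article INTEGER, score REAL)")
    # Stage 1: one row per ordered pair of articles sharing a category.
    cur.execute("""INSERT INTO tempscore
        SELECT a1.id_article, a2.id_article, ?
        FROM articles a1 JOIN articles a2
          ON a1.id_category = a2.id_category
         AND a1.id_article <> a2.id_article""", (x,))
    # Stage 2: one row per shared author.
    cur.execute("""INSERT INTO tempscore
        SELECT l.id_article, r.id_article, ?
        FROM articles_to_authors l JOIN articles_to_authors r
          ON l.id_author = r.id_author
         AND l.id_article <> r.id_article""", (y,))
    # Stage 3: one row per shared tag.
    cur.execute("""INSERT INTO tempscore
        SELECT l.id_article, r.id_article, ?
        FROM tags_to_articles l JOIN tags_to_articles r
          ON l.id_tag = r.id_tag
         AND l.id_article <> r.id_article""", (z,))
    # Stage 4: sum per pair, then subtract the age penalty of the right article.
    cur.execute("DELETE FROM related_articles")
    cur.execute("""INSERT INTO related_articles
               (id_article_left, id_article_right, related_score)
        SELECT s.left_article, s.right_article,
               s.total - ? * (julianday(?) - julianday(b.publish_date))
        FROM (SELECT left_article, right_article, SUM(score) AS total
              FROM tempscore
              GROUP BY left_article, right_article) s
        JOIN articles b ON b.id_article = s.right_article""",
                (penalty_per_day, today))
    conn.commit()
```

To try it, create a connection with `sqlite3.connect(":memory:")`, run `conn.executescript(SCHEMA)`, insert a few rows, and call `build_related_articles(conn, "2024-01-21")`. Each stage is a plain two-table self-join, so it can use an index on the join column instead of the six-way cross join with OR conditions from the question.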