MySQL/PHP: Finding similar/related items by tag/taxonomy

Tom*_*Tom 14 php mysql tagging relationship

I have a cities table that looks like this.

|id| Name    |
|1 | Paris   |
|2 | London  |
|3 | New York|

I have a tags table that looks like this.

|id| tag            |
|1 | Europe         |
|2 | North America  |   
|3 | River          |

And a cities_tags table:

|id| city_id | tag_id |
|1 | 1       | 1      | 
|2 | 1       | 3      | 
|3 | 2       | 1      |
|4 | 2       | 3      | 
|5 | 3       | 2      |     
|6 | 3       | 3      |
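For anyone who wants to try the queries in this thread, the three tables above can be recreated roughly as follows. This is only a sketch: the column types and the unique key on (city_id, tag_id) are assumptions, since the question shows only the rows.

```sql
-- Assumed column types; the question lists rows, not a schema.
CREATE TABLE cities (
  id   INT PRIMARY KEY,
  name VARCHAR(50) NOT NULL
);

CREATE TABLE tags (
  id  INT PRIMARY KEY,
  tag VARCHAR(50) NOT NULL
);

CREATE TABLE cities_tags (
  id      INT PRIMARY KEY,
  city_id INT NOT NULL,
  tag_id  INT NOT NULL,
  -- Assumption: a city carries each tag at most once.
  UNIQUE KEY uq_city_tag (city_id, tag_id)
);

-- Rows exactly as shown in the question.
INSERT INTO cities (id, name) VALUES
  (1, 'Paris'), (2, 'London'), (3, 'New York');

INSERT INTO tags (id, tag) VALUES
  (1, 'Europe'), (2, 'North America'), (3, 'River');

INSERT INTO cities_tags (id, city_id, tag_id) VALUES
  (1, 1, 1), (2, 1, 3),
  (3, 2, 1), (4, 2, 3),
  (5, 3, 2), (6, 3, 3);
```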

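How do I calculate which cities are most closely related? For example, if I am looking at city 1 (Paris), the results should be: London (2), New York (3).

I found the Jaccard index, but I'm not sure how best to implement it.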

M K*_*aid 17

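You asked how to calculate which cities are most closely related: for example, when looking at city 1 (Paris), the results should be London (2), then New York (3). In the data set you provided, the only thing that relates cities to one another is the tags they have in common, so the cities that share tags with the given city are its closest matches. The query below uses subqueries to find the cities (other than the one supplied) that share at least one tag with it: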

SELECT * FROM `cities`  WHERE id IN (
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )

How it works

I assume you will supply a city id or name to find its closest matches; in my case, "Paris" has id 1.

 SELECT tag_id FROM `cities_tags` WHERE city_id=1

It finds all the tag ids that Paris has. Then:

SELECT city_id FROM `cities_tags` WHERE tag_id IN (
    SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1

It fetches every city other than Paris that has at least one of the same tags as Paris.

Here is your fiddle.

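While reading about the Jaccard similarity/index, I found material that helped me understand what the terms actually mean. Suppose we have two sets, A and B:

Set A = {A,B,C,D,E}

Set B = {I,H,G,F,E,D}

The formula for the Jaccard similarity is JS = |A intersect B| / |A union B|

A intersect B = {D,E}, which has 2 elements

A union B = {A,B,C,D,E,I,H,G,F}, which has 9 elements

JS = 2/9 = 0.2222222222222222

Now, turning to your scenario:

Paris has tag_ids 1,3, so we make a set from them and call it Set P = {Europe, River}

London has tag_ids 1,3, so we make a set from them and call it Set L = {Europe, River}

New York has tag_ids 2,3, so we make a set from them and call it Set NW = {North America, River}

The Jaccard similarity of Paris with London is JSPL = |P intersect L| / |P union L| = 2/2 = 1

The Jaccard similarity of Paris with New York is JSPNW = |P intersect NW| / |P union NW| = 1/3 = 0.3333333333

Here is the query I have so far to calculate the Jaccard index; you can see the fiddle example below: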

SELECT a.*, 
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
 FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` , 
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset

FROM `cities_tags` 
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC 

In the query above, I derive the result set through two nested subselects so that I can reuse my custom calculated aliases.


You can add a filter to the query above so that it does not compute the city's similarity with itself:

SELECT a.*, 
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
 FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` , 
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset

FROM `cities_tags` 
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`) WHERE  cities.`id` !=1
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC

So the result shows that Paris is most closely related to London, and then to New York.

Jaccard similarity fiddle

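  • I'm pretty sure this solution doesn't work. I made a fiddle with different data: http://sqlfiddle.com/#!2/add2a9/1. I changed the `tag_id`s, turning `1` into `2`, `3` into `8`, and `2` into `5`. The results are different, even though they should be the same (the city-tag relationships are unchanged). This is because in `q.sets - q.parisset` and `q.sets + q.parisset`, `sets` and `parisset` are cast to integers, so only the part before the first comma is kept (`2,2` and `5,8`). The fact that the original fiddle worked is a coincidence. This is not a valid answer. (2 upvotes)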
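One way to avoid the string-to-integer cast described in the comment above is to count matching rows instead of doing arithmetic on GROUP_CONCAT strings. The following is a minimal sketch, not the answerer's query. It assumes each (city_id, tag_id) pair appears at most once in cities_tags, and the aliases ct, p, and pc are illustrative names only.

```sql
-- Sketch: Jaccard index of every other city against city 1.
-- Assumes (city_id, tag_id) is unique in cities_tags.
SELECT c.id,
       c.name,
       -- tags of this city that city 1 also has
       COUNT(p.tag_id) AS intersect_cnt,
       -- |A union B| = |A| + |B| - |A intersect B|
       COUNT(*) + pc.cnt - COUNT(p.tag_id) AS union_cnt,
       COUNT(p.tag_id) / (COUNT(*) + pc.cnt - COUNT(p.tag_id)) AS jaccard_index
FROM cities c
JOIN cities_tags ct
  ON ct.city_id = c.id
-- p is non-NULL only when city 1 carries the same tag
LEFT JOIN cities_tags p
  ON p.tag_id = ct.tag_id
 AND p.city_id = 1
-- pc.cnt is the number of tags city 1 has
CROSS JOIN (SELECT COUNT(*) AS cnt
            FROM cities_tags
            WHERE city_id = 1) pc
WHERE c.id <> 1
GROUP BY c.id, c.name, pc.cnt
ORDER BY jaccard_index DESC;
```

On the sample data this should yield 1.0000 for London and 0.3333 for New York, and because it only compares tag ids for equality, renumbering the tags should not change the result. Cities with no tags at all are dropped by the inner join.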

Tra*_*ner 7

select c.name, cnt.val/(select count(*) from cities) as jaccard_index
from cities c 
inner join 
  (
  select city_id, count(*) as val 
  from cities_tags 
  where tag_id in (select tag_id from cities_tags where city_id=1) 
  and not city_id in (1)
  group by city_id
  ) as cnt 
on c.id=cnt.city_id
order by jaccard_index desc

This query refers to city_id=1 statically, so you will have to substitute a variable for it in both the `where tag_id in` clause and the `not city_id in` clause.
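A minimal sketch of that substitution, using a MySQL user variable. The name @city_id is an assumption made for illustration; from PHP one would more likely bind the value through a prepared-statement placeholder instead.

```sql
-- Hypothetical user variable holding the chosen city's id.
SET @city_id = 1;

SELECT c.name,
       cnt.val / (SELECT COUNT(*) FROM cities) AS jaccard_index
FROM cities c
INNER JOIN (
  -- number of tags each other city shares with the chosen city
  SELECT city_id, COUNT(*) AS val
  FROM cities_tags
  WHERE tag_id IN (SELECT tag_id FROM cities_tags WHERE city_id = @city_id)
    AND city_id <> @city_id
  GROUP BY city_id
) AS cnt ON c.id = cnt.city_id
ORDER BY jaccard_index DESC;
```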

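If I understand the Jaccard index correctly, the query also returns that value, ordered by "most closely related". The results for our example look like this: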

|name      |jaccard_index  |
|London    |0.6667         |
|New York  |0.3333         |

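Edit

With a better understanding of how to implement the Jaccard index:

After reading more about the Jaccard index on Wikipedia, I came up with a better way to implement the query for our sample data set. Essentially, we compare the chosen city with each other city in the list independently, and divide the number of tags they have in common by the total number of distinct tags across the two cities.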

select c.name, 
  case -- when this city's tags are a subset of the chosen city's tags
    when not_in.cnt is null 
  then -- then the union count is the chosen city's tag count
    intersection.cnt/(select count(tag_id) from cities_tags where city_id=1) 
  else -- otherwise the union count is the chosen city's tag count plus everything not in the chosen city's tag list
    intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1)) 
  end as jaccard_index
  -- Jaccard index is defined as the size of the intersection of a dataset, divided by the size of the union of a dataset
from cities c 
inner join 
  (
    --  select the count of tags for each city that match our chosen city
    select city_id, count(*) as cnt 
    from cities_tags 
    where tag_id in (select tag_id from cities_tags where city_id=1) 
    and city_id!=1
    group by city_id
  ) as intersection
on c.id=intersection.city_id
left join
  (
    -- select the count of tags for each city that are not in our chosen city's tag list
    select city_id, count(tag_id) as cnt
    from cities_tags
    where city_id!=1
    and not tag_id in (select tag_id from cities_tags where city_id=1)
    group by city_id
  ) as not_in
on c.id=not_in.city_id
order by jaccard_index desc

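The query is a bit verbose, and I don't know how well it scales, but it does implement a true Jaccard index, as requested in the question. Here are the results of the new query: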

+----------+---------------+
| name     | jaccard_index |
+----------+---------------+
| London   |        1.0000 |
| New York |        0.3333 |
+----------+---------------+

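Edited again to add comments to the query and to account for the case where the current city's tags are a subset of the chosen city's tags.

  • See the new query that implements a true Jaccard index. (6 upvotes)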