给定含n元语法的一列VARCHAR与utf8mb4_unicode_ci归类:
+---------------------------+
| ngram |
+---------------------------+
| stack overflow |
| stack |
| overflow |
| stack overflow protection |
| overflow protection |
| protection |
+---------------------------+
Run Code Online (Sandbox Code Playgroud)
一个查询:
SELECT * FROM ngrams WHERE ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
Run Code Online (Sandbox Code Playgroud)
给定此查询返回的行,如何仅保留返回行中具有最长ngram 的行?
在这个例子中,我得到3行:stack,stack overflow,和protection.
然后,我需要像这样过滤行:
stack,因为stack overflow存在于返回的行中stack overflow,因为没有其他返回的行是包含的ngram stack overflow(stack overflow protection在表中有,但它不在返回的行中)protection太overflow,因为stack overflow存在于返回的行中由于排序规则,它必须在MySQL中完成(MySQL之外的比较不会产生与MySQL相同的结果).(除非我不知道某些MySQL函数允许公开字符串的整理版本.)
我可以想到以下解决方案:(sql fiddle)
SELECT ngram
FROM ngrams n1
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection')
AND NOT EXISTS (
SELECT 1
FROM ngrams n2
WHERE n2.ngram IN ('stack', 'stack overflow', 'protection')
AND LENGTH(n2.ngram) > LENGTH(n1.ngram)
AND CONCAT(' ', n2.ngram, ' ') LIKE CONCAT('% ', n1.ngram, ' %')
)
Run Code Online (Sandbox Code Playgroud)
但是效率很低,因为子查询将针对每个匹配的ngram执行.
所以我正在寻找
fth*_*lla 15
如果我理解你的逻辑,这个查询应该给你正确的结果:
SELECT n1.ngram
FROM
ngrams n1 LEFT JOIN ngrams n2
ON
n2.ngram IN ('stack', 'stack overflow', 'protection')
AND n2.ngram LIKE CONCAT('%', n1.ngram, '%')
AND CHAR_LENGTH(n1.ngram) < CHAR_LENGTH(n2.ngram)
WHERE
n1.ngram IN ('stack', 'stack overflow', 'protection')
AND n2.ngram IS NULL;
Run Code Online (Sandbox Code Playgroud)
请看这里的小提琴.但是因为我希望你的表可以有很多记录,而你的单词列表非常有限,为什么不在执行实际查询之前从这个列表中删除最短的ngrams?我的想法是减少清单
('stack', 'stack overflow', 'protection')
Run Code Online (Sandbox Code Playgroud)
至
('stack overflow', 'protection')
Run Code Online (Sandbox Code Playgroud)
这个查询应该做的伎俩:
SELECT *
FROM
ngrams
WHERE
ngram IN (
SELECT s1.ngram
FROM (
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack','stack overflow','protection')
) s1 LEFT JOIN (
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack','stack overflow','protection')
) s2
ON s2.ngram LIKE CONCAT('%', s1.ngram, '%')
AND CHAR_LENGTH(s1.ngram) < CHAR_LENGTH(s2.ngram)
WHERE
s2.ngram IS NULL
);
Run Code Online (Sandbox Code Playgroud)
是的我在再次ngrams将结果返回之前查询表两次ngrams,因为我们必须确保表中实际存在最长的值,但是如果ngram列上有正确的索引,那么使用DISTINCT的两个派生查询应该非常有效:
ALTER TABLE ngrams ADD INDEX idx_ngram (ngram);
Run Code Online (Sandbox Code Playgroud)
小提琴就在这里.
编辑:
正如samuil正确指出的那样,如果您只需要找到最短的ngram而不是与之关联的整行,那么您不需要外部查询,只需执行内部查询即可.使用正确的索引,两个SELECT DISTINCT查询将非常高效,即使JOIN无法优化(n2.ngram LIKE CONCAT('%', n1.ngram, '%')无法利用索引),它也只能在少数已经过滤的记录上执行,并且应该非常快.
您正在尝试过滤查询本身中的ngrams.分两步完成它可能更有效.从包含所有可能的ngrams的表开始:
CREATE TABLE original (ngram varchar(100) NOT NULL)
GO
CREATE TABLE refined (ngram varchar(100) NOT NULL PRIMARY KEY)
GO
INSERT INTO original (ngram)
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack', 'stack overflow', 'protection')
GO
INSERT INTO refined (ngram)
SELECT ngram
FROM original
Run Code Online (Sandbox Code Playgroud)
然后删除你不想要的那些.对于每个ngram,生成所有可能的子串.对于每个子字符串,从列表中删除该条目(如果有).它需要几个嵌套循环,但除非你的ngram包含非常多的单词,否则它不会花费太多时间.
CREATE PROCEDURE refine()
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE words varchar(100);
DECLARE posFrom, posTo int;
DECLARE cur CURSOR FOR SELECT ngram FROM original;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN cur;
read_loop: LOOP
FETCH cur INTO words;
IF done THEN
LEAVE read_loop;
END IF;
SET posFrom = 1;
REPEAT
SET posTo = LOCATE(' ', words, posFrom);
WHILE posTo > 0 DO
DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom, posTo - posFrom);
SET posTo = LOCATE(' ', words, posTo + 1);
END WHILE;
IF posFrom > 1 THEN
DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom);
END IF;
SET posFrom = LOCATE(' ', words, posFrom) + 1;
UNTIL posFrom = 1 END REPEAT;
END LOOP;
CLOSE cur;
END
Run Code Online (Sandbox Code Playgroud)
剩下的是一张只有最长的ngrams的表格:
CALL refine;
SELECT ngram FROM refined;
Run Code Online (Sandbox Code Playgroud)
SQL小提琴:http://sqlfiddle.com/#!2/029dc/1/1
编辑:我在桌子上添加了一个索引refined; 现在它应该在O(n)时间内运行.
在没有先查看其他解决方案的情况下执行此操作后,我发现它与您现有的最佳解决方案类似,但读取稍微简单,可能更高效;
SELECT n1.ngram
FROM ngrams n1
LEFT JOIN ngrams n2
ON n2.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
AND n1.ngram <> n2.ngram
AND INSTR(n2.ngram, n1.ngram) > 0
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
AND n2.ngram IS NULL;
Run Code Online (Sandbox Code Playgroud)
由于该AND n1.ngram <> n2.ngram行没有计算,查询应该能够更有效地使用索引.
| 归档时间: |
|
| 查看次数: |
1577 次 |
| 最近记录: |