How to get the longest "overlapping" words

Question

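Premise: suppose you have a table of words, where some words are distinct and some "overlap", meaning that a longer word starts with a shorter one, for example: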

---------------
|    word     |
---------------
| dog         | *
| games       | *
| stat        |
| state       | 
| statement   | *
| fulfill     |
| fulfilled   | *
| fulfillment | *
---------------

Question: given this, how do I write a query that returns the non-overlapping words plus the longest of the overlapping words?

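In the example above, the wanted words are marked with a *, according to the following explanation:

dog and games do not overlap with anything, so they are the longest in their "solo/unique" category
statement overlaps with state and stat and is the longest
fulfilled overlaps with fulfill and is longer (it does not overlap with fulfillment)
fulfillment overlaps with fulfill and is longer (it does not overlap with fulfilled)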

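Note: the data sample is reduced for simplicity. In reality there are a few million records to query, and no search terms are known in advance, so a direct filter such as WHERE word LIKE 'stat%' is not possible. Not sure whether it is relevant, but the words have a relatively short maximum length, e.g. 20.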

Answer 1

小智 6

Something like:

select word
from   your_table t1
where  not exists (
                    select word
                    from   your_table t2
                    where  t2.word like t1.word || '_%'
                  )
;

The query will benefit from an index on word. Even so, it may take a long time. In any case, you can try it and let us know what happens.
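To make the semantics of the NOT EXISTS filter concrete, here is a minimal in-memory sketch in Python rather than Oracle SQL. The function name longest_overlapping_naive and the hard-coded sample list are illustrative assumptions, and unlike LIKE it treats % and _ inside a word as literal characters. The quadratic loop is only meant for checking small samples, not for millions of rows.

```python
def longest_overlapping_naive(words):
    """Keep every word that is not a proper prefix of another word."""
    distinct = set(words)
    kept = []
    for w in distinct:
        # Roughly mirrors: NOT EXISTS (... WHERE t2.word LIKE t1.word || '_%'),
        # i.e. no other word starts with w and is strictly longer than w.
        is_prefix_of_other = any(
            len(other) > len(w) and other.startswith(w) for other in distinct
        )
        if not is_prefix_of_other:
            kept.append(w)
    return sorted(kept)


sample = ["dog", "games", "stat", "state", "statement",
          "fulfill", "fulfilled", "fulfillment"]
print(longest_overlapping_naive(sample))
# prints ['dog', 'fulfilled', 'fulfillment', 'games', 'statement']
```

Comparing the output of such a reference check against the SQL result on a small extract can help confirm that the query does what is intended before running it on the full table.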

Answer 2

ast*_*ntx 2

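As long as you are comparing prefixes, that is, one word is entirely contained in another as its prefix, you can try match_recognize to check the prefix matches sequentially in a single pass.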

That said, exists is clearer, although you should check its performance on the real dataset with an index on word.

with a as (
  select
    column_value as word
  from sys.odcivarchar2list(
    'dog'
    , 'games'
    , 'stat'
    , 'state'
    , 'statement'
    , 'fulfill'
    , 'fulfilled'
    , 'fulfillment'
  )
)
select *
from a 
match_recognize(
  order by word desc /*Longest first*/
  measures
    a.word as word
  one row per match
  /*Exclude any number of words
    each of which is a prefix of the previous one
  */
  pattern (a {- b* -})
  define
    /*Current word is a prefix of the previous*/
    b as prev(word) like word || '%'
) t

| WORD |
| :---------- |
| statement |
| games |
| fulfillment |
| fulfilled |
| dog |

db<>fiddle here
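The single-pass idea can also be sketched outside the database. The Python below is an illustration under stated assumptions, not a translation of the match_recognize query: it assumes the words fit in memory and are compared in plain code-point order, and longest_overlapping_sorted is a made-up name. After an ascending sort, every word that starts with a given word w sits in a contiguous run directly after w, so comparing each word only with its immediate successor is enough.

```python
def longest_overlapping_sorted(words):
    """Drop every word that is a proper prefix of another word, in one pass."""
    # Deduplicate first so that the successor is never equal to the word itself.
    ordered = sorted(set(words))
    kept = []
    for i, w in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        # If any longer word starts with w, the immediate successor does too.
        if nxt is None or not nxt.startswith(w):
            kept.append(w)
    return kept


sample = ["dog", "games", "stat", "state", "statement",
          "fulfill", "fulfilled", "fulfillment"]
print(longest_overlapping_sorted(sample))
# prints ['dog', 'fulfilled', 'fulfillment', 'games', 'statement']
```

The cost is dominated by the sort, so this scales roughly as O(n log n) instead of comparing every pair of words. Whether the database can match that depends on the execution plan, so it is still worth measuring on the real data.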
