突出显示(不删除)单词文档中的重复句子或短语

rpa*_*s21 9 vba ms-word word-vba

我得到的印象是,这是不可能的,但我想如果你正在寻找任何3-4个单词,在一个很长的论文中的任何地方出现相同的序列,我可以找到相同短语的重复.

我从过去的论文中复制并粘贴了大量文档,并希望找到一种简单的方法来查找这个40多页文档中的重复信息有很多不同的格式,但我愿意暂时摆脱格式化顺序找到重复的信息.

Sid*_*out 16

要突出显示所有重复的句子,您也可以使用ActiveDocument.Sentences(i).这是一个例子

逻辑

1)从数组中获取word文档中的所有句子

2)对数组进行排序

3)提取重复项

4)突出显示重复项

Option Explicit

Sub Sample()
    Dim MyArray() As String
    Dim n As Long, i As Long
    Dim Col As New Collection
    Dim itm

    n = 0
    '~~> Get all the sentences from the word document in an array
    For i = 1 To ActiveDocument.Sentences.Count
        n = n + 1
        ReDim Preserve MyArray(n)
        MyArray(n) = Trim(ActiveDocument.Sentences(i).Text)
    Next

    '~~> Sort the array
    SortArray MyArray, 0, UBound(MyArray)

    '~~> Extract Duplicates
    For i = 1 To UBound(MyArray)
        If i = UBound(MyArray) Then Exit For
        If InStr(1, MyArray(i + 1), MyArray(i), vbTextCompare) Then
            On Error Resume Next
            Col.Add MyArray(i), """" & MyArray(i) & """"
            On Error GoTo 0
        End If
    Next i

    '~~> Highlight duplicates
    For Each itm In Col
        Selection.Find.ClearFormatting
        Selection.HomeKey wdStory, wdMove
        Selection.Find.Execute itm
        Do Until Selection.Find.Found = False
            Selection.Range.HighlightColorIndex = wdPink
            Selection.Find.Execute
        Loop
    Next
End Sub

'~~> Sort the array
Public Sub SortArray(vArray As Variant, i As Long, j As Long)
  Dim tmp As Variant, tmpSwap As Variant
  Dim ii As Long, jj As Long

  ii = i: jj = j: tmp = vArray((i + j) \ 2)

  While (ii <= jj)
     While (vArray(ii) < tmp And ii < j)
        ii = ii + 1
     Wend
     While (tmp < vArray(jj) And jj > i)
        jj = jj - 1
     Wend
     If (ii <= jj) Then
        tmpSwap = vArray(ii)
        vArray(ii) = vArray(jj): vArray(jj) = tmpSwap
        ii = ii + 1: jj = jj - 1
     End If
  Wend
  If (i < jj) Then SortArray vArray, i, jj
  If (ii < j) Then SortArray vArray, ii, j
End Sub
Run Code Online (Sandbox Code Playgroud)

快照

之前

在此输入图像描述

在此输入图像描述


Gaf*_*ffi 4

我没有使用我自己的 DAWG 建议,我仍然有兴趣看看其他人是否有办法做到这一点,但我能够想出这个:

Option Explicit

Sub test()
Dim ABC As Scripting.Dictionary
Dim v As Range
Dim n As Integer
    n = 5
    Set ABC = FindRepeatingWordChains(n, ActiveDocument)
    ' This is a dictionary of word ranges (not the same as an Excel range) that contains the listing of each word chain/phrase of length n (5 from the above example).
    ' Loop through this collection to make your selections/highlights/whatever you want to do.
    If Not ABC Is Nothing Then
        For Each v In ABC
            v.Font.Color = wdColorRed
        Next v
    End If
End Sub

' This is where the real code begins.
Function FindRepeatingWordChains(ChainLenth As Integer, DocToCheck As Document) As Scripting.Dictionary
Dim DictWords As New Scripting.Dictionary, DictMatches As New Scripting.Dictionary
Dim sChain As String
Dim CurWord As Range
Dim MatchCount As Integer
Dim i As Integer

    MatchCount = 0

    For Each CurWord In DocToCheck.Words
        ' Make sure there are enough remaining words in our document to handle a chain of the length specified.
        If Not CurWord.Next(wdWord, ChainLenth - 1) Is Nothing Then
            ' Check for non-printing characters in the first/last word of the chain.
            ' This code will read a vbCr, etc. as a word, which is probably not desired.
            ' However, this check does not exclude these 'words' inside the chain, but it can be modified.
            If CurWord <> vbCr And CurWord <> vbNewLine And CurWord <> vbCrLf And CurWord <> vbLf And CurWord <> vbTab And _
                CurWord.Next(wdWord, ChainLenth - 1) <> vbCr And CurWord.Next(wdWord, ChainLenth - 1) <> vbNewLine And _
                CurWord.Next(wdWord, ChainLenth - 1) <> vbCrLf And CurWord.Next(wdWord, ChainLenth - 1) <> vbLf And _
                CurWord.Next(wdWord, ChainLenth - 1) <> vbTab Then
                sChain = CurWord
                For i = 1 To ChainLenth - 1
                    ' Add each word from the current word through the next ChainLength # of words to a temporary string.
                    sChain = sChain & " " & CurWord.Next(wdWord, i)
                Next i

                ' If we already have our temporary string stored in the dictionary, then we have a match, assign the word range to the returned dictionary.
                ' If not, then add it to the dictionary and increment our index.
                If DictWords.Exists(sChain) Then
                    MatchCount = MatchCount + 1
                    DictMatches.Add DocToCheck.Range(CurWord.Start, CurWord.Next(wdWord, ChainLenth - 1).End), MatchCount
                Else
                    DictWords.Add sChain, sChain
                End If
            End If
        End If
    Next CurWord

    ' If we found any matching results, then return that list, otherwise return nothing (to be caught by the calling function).
    If DictMatches.Count > 0 Then
        Set FindRepeatingWordChains = DictMatches
    Else
        Set FindRepeatingWordChains = Nothing
    End If

End Function
Run Code Online (Sandbox Code Playgroud)

我已经在来自该源的258 页文档 ( TheStory.txt)上对此进行了测试,并且只用了几分钟就运行了。

请参阅test()子部分以了解用法。

您需要引用 Microsoft Scripting Runtime 才能使用这些Scripting.Dictionary对象。如果这是不可取的,可以进行一些小的修改Collections来代替使用,但我更喜欢它,Dictionary因为它有有用的.Exists()方法。