将数组中的每个成员变为字母数字的最快方法是什么?

Jon*_*han 13 regex arrays excel vba excel-vba

最终的最终结果:

我想知道如果字符串更长,下面的结果是否会改变.我在同一台计算机上运行完全相同的测试,除了每个单元格有一个34个字符而不是4个字符的随机字符串.这些是结果:

Comintern (Regexp):       136.1  ms  
brettdj (Regexp):         139.9  ms  
Slai (Regexp):            158.4  ms  
*Original Regex:          161.0  ms*    
Comintern (AN):           170.1  ms  
Comintern (Hash):         183.6  ms  
ThunderFrame:             232.9  ms    
*Original replace:        372.9  ms*  
*Original InStr:          478.1  ms*  
CallumDA33:              1218.1 ms
Run Code Online (Sandbox Code Playgroud)

这真的显示了Regex的速度 - 所有使用Regex.replace的解决方案都明显更快,最好的是Comintern的实现.

总之,如果字符串很长,请使用数组,如果它们很短,请使用剪贴板.如果不确定,最佳结果是使用数组,但这可能会牺牲短字符串的一点性能.

最终结果:

非常感谢您提出的所有建议,显然我还有很多需要学习的地方.昨天我一直在想这个,所以我决定在家里重新运行一切.以下是最终结果,基于将这些中的每一个应用于30,000个四个字符串.

我家里的电脑是英特尔i7 @ 3.6 GHz,8GB内存,64位Windows 10和Excel 2016.与之前相似的条件我在后台运行进程,但我并没有在整个测试中积极做任何事情.

Original replace:  97.67  ms
Original InStr:    106.54 ms
Original Regex:    113.46 ms
ThunderFrame:      82.21  ms
Comintern (AN):    96.98  ms
Comintern (OR):    81.87  ms
Comintern (Hash):  101.18 ms
brettdj:           81.66  ms
CallumDA33:        201.64 ms
Slai:              68.38  ms
Run Code Online (Sandbox Code Playgroud)

因此,我接受了Slai的答案,因为它显然是一般实施的最快,但我会将它们全部重新运行,以反对实际数据,以检查这仍然有效.


原帖:

我在Excel中有一个数组列表.例如,我需要将数组的每个成员都变成字母数字

ABC123-001 -> ABC123001
ABC123/001 -> ABC123001
ABC123001  -> ABC123001
Run Code Online (Sandbox Code Playgroud)

这样做的最快方法是什么?

对于上下文,我们的部件号可以有不同的形式,所以我正在编写一个函数,在给定范围内找到最佳匹配.目前,使所有字母数字运行的函数部分运行大约需要50ms,而函数的其余部分总共需要大约30ms.我也无法避免使用Excel.

我自己做了一些工作(见下面的答案),但主要问题是我必须逐个遍历数组的每个元素 - 有没有更好的方法?我以前也从未进行过测试,所以任何有关改进它们的反馈都会非常感激.

这是我到目前为止所尝试的内容.

我正在使用MicroTimer,我的计算机配备了Intel i5 @ 2.5GHz,4GB内存,64位Windows 7.我已经在后台运行进程,但是在运行这些进程时我没有主动做其他任何事情.

我使用以下代码创建了30,000行随机符号:

=CHAR(RANDBETWEEN(1,60))&CHAR(RANDBETWEEN(48,57))&CHAR(RANDBETWEEN(37,140))&CHAR(RANDBETWEEN(37,140))
Run Code Online (Sandbox Code Playgroud)

(注意我们如何在60处停止第一个字符,因为'=' char(61)并且我们希望避免Excel将其解释为公式.此外,我们强制第二个字符为数字,因此我们可以保证其中至少有一个字母数字字符.)

1.使用基于案例的循环.平均时间:175毫秒

在使用功能这篇文章中,我们的范围内装入一个数组,适用所述函数应用于所述阵列的每个元素,并将其粘贴回.码:

Function AlphaNumericOnly(strSource As Variant) As String
    Dim i As Integer
    Dim strResult As String

    For i = 1 To Len(strSource)
        Select Case Asc(Mid(strSource, i, 1))
            Case 48 To 57, 65 To 90, 97 To 122: 'include 32 if you want to include space
                strResult = strResult & Mid(strSource, i, 1)
        End Select
    Next
    AlphaNumericOnly = strResult
End Function

Sub Replace()

    Dim inputSh As Worksheet
    Dim inputRng As Range
        Set inputSh = Sheets("Data")
        Set inputRng = inputSh.Range("A1:A30000")

    Dim outputSh As Worksheet
    Dim outputRng As Range
        Set outputSh = Sheets("Replace")
        Set outputRng = outputSh.Range("A1:A30000")

    Dim time1 As Double, time2 As Double
        time1 = MicroTimer

    Dim arr As Variant
        arr = inputRng

    Dim i As Integer
        For i = LBound(arr) To UBound(arr)
            arr(i, 1) = AlphaNumericOnly(arr(i, 1))
        Next i

    outputRng = arr

    time2 = MicroTimer

    Debug.Print (time2 - time1) * 1000

End Sub
Run Code Online (Sandbox Code Playgroud)

2.使用InStr()检查每个字符.平均时间:201毫秒

定义一串有效值.如果有效值出现在数组元素中,请逐个检查:

Sub InStr()

    Dim inputSh As Worksheet
    Dim inputRng As Range
        Set inputSh = Sheets("Data")
        Set inputRng = inputSh.Range("A1:A30000")

    Dim outputSh As Worksheet
    Dim outputRng As Range
        Set outputSh = Sheets("InStr")
        Set outputRng = outputSh.Range("A1:A30000")

    Dim time1 As Double, time2 As Double
        time1 = MicroTimer

    Dim arr As Variant
        arr = inputRng

    Dim validValues As String
        validValues = "01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" 'put numbers and capitals at the start as they are more likely'

    Dim i As Integer, j As Integer
    Dim result As String

        For i = LBound(arr) To UBound(arr)
        result = vbNullString
            For j = 1 To Len(arr(i, 1))
                If InStr(validValues, Mid(arr(i, 1), j, 1)) <> 0 Then
                    result = result & Mid(arr(i, 1), j, 1)
                End If
            Next j
        arr(i, 1) = result
        Next i

    outputRng = arr

    time2 = MicroTimer

    Debug.Print (time2 - time1) * 1000

End Sub
Run Code Online (Sandbox Code Playgroud)

3.在数组上使用regex.Replace.时间:171ms

定义正则表达式并使用它来替换数组的每个元素.

Sub Regex()

    Dim inputSh As Worksheet
    Dim inputRng As Range
        Set inputSh = Sheets("Data")
        Set inputRng = inputSh.Range("A1:A30000")

    Dim outputSh As Worksheet
    Dim outputRng As Range
        Set outputSh = Sheets("Regex")
        Set outputRng = outputSh.Range("A1:A30000")

    Dim time1 As Double, time2 As Double
        time1 = MicroTimer

    Dim arr As Variant
        arr = inputRng

    Dim objRegex As Object
        Set objRegex = CreateObject("vbscript.regexp")
        With objRegex
            .Global = True
            .ignorecase = True
            .Pattern = "[^\w]"
        End With

    Dim i As Integer
        For i = LBound(arr) To UBound(arr)
            arr(i, 1) = objRegex.Replace(arr(i, 1), vbNullString)
        Next i

    outputRng = arr

    time2 = MicroTimer

    Debug.Print (time2 - time1) * 1000

End Sub
Run Code Online (Sandbox Code Playgroud)

编辑:

@ThunderFrame - 我们的部件号通常采用以下格式:

  • 所有数字(例如32523452)
  • 字母和数字的混合(例如AB324K234或123H45645)
  • 字母和数字的混合,每个字母和数字由非字母数字字符链接(例如ABC001-001,ABC001/001,123/4557-121)

我已经考虑过在启动到替换之前在每个字符串上使用regex.test,但是我不确定这是否只是复制字符串然后测试它,在这种情况下我也可以让替换开始.

@Slai - 感谢您的链接 - 我会更详细地研究它

Sla*_*lai 7

不确定这是否会更快,因为它取决于太多因素,但可能值得测试.而不是Regex.分别放置每个值,您可以从剪贴板中获取复制的Range文本,并立即替换所有值.请注意,\w匹配下划线和Unicode字母,因此在正则表达式中更具体,可以使它更快.

'[a1:b30000] = [{"ABC123-009",""}]: Dim t As Double: t = Timer ' used for testing

Dim r As Range, s As String
Set r = ThisWorkbook.Worksheets("Data").UsedRange.Resize(, 1) ' Data!A1:A30000
With New MSForms.DataObject ' needs reference to "Microsoft Forms 2.0 Object Library" or use a bit slower late binding - With CreateObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
   r.Copy
   .GetFromClipboard
    Application.CutCopyMode = False
    s = .GetText
    .Clear ' optional - clear the clipboard if using Range.PasteSpecial instead of Worksheet.PasteSpecial "Text"

    With New RegExp ' needs reference to "Microsoft VBScript Regular Expressions 5.5" or use a bit slower late binding - With CreateObject("VBScript.RegExp")
        .Global = True
        '.IgnoreCase = False ' .IgnoreCase is False by default
        .Pattern = "[^0-9A-Za-z\r\n]+" ' because "[^\w\r\n]+" also matches _ and Unicode letters
        s = .Replace(s, vbNullString)
    End With

    .SetText s
    .PutInClipboard
End With

' about 70% of the time is spent here in pasting the data 
r(, 2).PasteSpecial 'xlPasteValues ' paste the text from clipboard in B1

'Debug.Print Timer - t
Run Code Online (Sandbox Code Playgroud)

由于剪贴板开销,我希望这对于较少的值来说会更慢,并且由于需要内存,可能会因为更多的值而变慢.

在我的测试中,禁用事件似乎没有什么不同,但可能值得尝试.

请注意,当宏使用剪贴板时,另一个应用程序使用剪贴板的可能性很小.

如果早期绑定导致在不同计算机上运行相同编译宏的问题,则可以搜索宏反编译器或删除引用并切换到后期绑定.

  • @brettdj谢谢.我正在期待它的另一种方式,这个过程一下子变得更快,使用的字符越多.我的猜测是它需要的内存越来越慢,所以我想测试是否有两个部分处理范围会有所帮助.我对Evaluate的另一个想法可能更快,但它可能需要Excel 2016 CONCAT/TEXTJOIN功能,我仅限于Excel 2007. (2认同)

Com*_*ern 7

tl; dr - 正则表达式破坏VBA实现.如果这是代码挑战,@ brettj或@Slai应该赢得它.

有一堆技巧可以让你AlphaNumericOnly更快.

首先,您可以通过将其视为字节数组而不是字符串来消除绝大多数函数调用.这将删除对Mid$和的所有调用Asc.虽然这些功能非常快,但它们仍然增加了调用堆栈的开销和弹出.这增加了几十万次迭代.

第二个优化是不使用Case x To y语法,如果你可以避免它.原因与它如何编译有关 - 它没有编译成测试Case = Condition >= x And Condition <= y,它实际上创建了一个具有早期退出条件的循环,如下所示:

Case = False
For i = x To y
    If Condition = i Then
        Case = True
    End If
Next
Run Code Online (Sandbox Code Playgroud)

同样,不是一个巨大的性能打击,但它加起来.第三个优化是以一种方式对您的测试进行排序,使它们对数据集中最可能的命中进行排序.我在下面为主要字母量身定制了我的例子,其中大多数都是大写字母.您可以通过不同的订购做得更好.把它们放在一起,你得到的东西看起来像这样:

Public Function ByteAlphaNumeric(source As Variant) As String
    Dim chars() As Byte
    Dim outVal() As Byte
    chars = CStr(source)        'Load the array up.

    Dim bound As Long
    bound = UBound(chars)       'Size the outbound array.
    ReDim outVal(bound)

    Dim i As Long, pos As Long
    For i = 0 To bound Step 2   'Wide characters, only care about the ASCII range.
        Dim temp As Byte
        temp = chars(i)         'Pointer math isn't free. Cache it.
        Select Case True        'Order is important here.
            Case temp > 64 And temp < 91
                outVal(pos) = temp
                pos = pos + 2   'Advance the output pointer.
            Case temp < 48
            Case temp > 122
            Case temp > 96
                outVal(pos) = temp
                pos = pos + 2
            Case temp < 58
                outVal(pos) = temp
                pos = pos + 2
        End Select
    Next
    'This is likely the most expensive operation.
    ReDim Preserve outVal(pos)  'Trim the output array.
    ByteAlphaNumeric = outVal
End Function
Run Code Online (Sandbox Code Playgroud)

它是怎么做的?挺好的:

Public Sub Benchmark()
    Dim starting As Single, i As Long, dummy As String, sample As Variant

    sample = GetRandomString

    starting = Timer
    For i = 1 To 1000000
        dummy = AlphaNumericOnlyOP(sample)
    Next i
    Debug.Print "OP's AlphaNumericOnly: ", Timer - starting

    starting = Timer
    For i = 1 To 1000000
        dummy = AlphaNumericOnlyThunderframe(sample)
    Next i
    Debug.Print "ThunderFrame's AlphaNumericOnly: ", Timer - starting

    starting = Timer
    For i = 1 To 1000000
        dummy = AlphaNumeric(sample)
    Next i
    Debug.Print "CallumDA33's AlphaNumeric: ", Timer - starting

    starting = Timer
    For i = 1 To 1000000
        dummy = ByteAlphaNumeric(sample)
    Next i
    Debug.Print "ByteAlphaNumeric: ", Timer - starting

    Dim cast As String
    cast = CStr(sample)
    starting = Timer
    For i = 1 To 1000000
        dummy = ByteAlphaNumericString(cast)
    Next i
    Debug.Print "ByteAlphaNumericString: ", Timer - starting

    Set stripper = Nothing
    starting = Timer
    For i = 1 To 1000000
        dummy = OptimizedRegex(sample)
    Next i
    Debug.Print "OptimizedRegex: ", Timer - starting

End Sub

Private Function GetRandomString() As Variant
    Dim chars(30) As Byte, i As Long
    Randomize
    For i = 0 To 30 Step 2
        chars(i) = Int(96 * Rnd + 32)
    Next i
    Dim temp As String
    temp = chars
    GetRandomString = CVar(temp)
End Function
Run Code Online (Sandbox Code Playgroud)

随机15个字符的结果String:

OP`s AlphaNumericOnly:                     6.565918 
ThunderFrame`s AlphaNumericOnly:           3.617188 
CallumDA33`s AlphaNumeric:                23.518070 
ByteAlphaNumeric:                          2.354980
Run Code Online (Sandbox Code Playgroud)

注意,我省略了转换为函数并不容易的提交.你可能会注意到另外两个测试 - 它ByteAlphaNumericStringByteAlphaNumeric函数完全相同,但是它需要一个String输入而不是一个Variant并且摆脱了强制转换.这不是微不足道的:

ByteAlphaNumericString:                    2.226074
Run Code Online (Sandbox Code Playgroud)

最后,难以捉摸的OptimizedRegex函数(基本上是@ brettj的函数形式的代码用于比较计时):

Private stripper As RegExp  'Module level

Function OptimizedRegex(strSource As Variant) As String
    If stripper Is Nothing Then
        Set stripper = New RegExp
        With stripper
            .Global = True
            .Pattern = "[^0-9A-Za-z]"
        End With
    End If
    OptimizedRegex = stripper.Replace(strSource, vbNullString)
End Function
Run Code Online (Sandbox Code Playgroud)
OptimizedRegex:                            1.094727 
Run Code Online (Sandbox Code Playgroud)

编辑:奖金实施!

在我看来,哈希表查找可能比Select Case结构更快,所以我使用以下内容构建了一个Scripting.Dictionary:

Private hash As Scripting.Dictionary  'Module level

Function HashLookups(source As Variant) As String
    Dim chars() As Byte
    Dim outVal() As Byte

    chars = CStr(source)
    Dim bound As Long
    bound = UBound(chars)
    ReDim outVal(bound)

    Dim i As Long, pos As Long
    With hash
        For i = 0 To bound Step 2
            Dim temp As Byte
            temp = chars(i)
            If .Exists(temp) Then
                outVal(pos) = temp
                pos = pos + 2
            End If
        Next
    End With
    ReDim Preserve outVal(pos)
    HashLookups = outVal
End Function

Private Sub LoadHashTable()
    Set hash = New Scripting.Dictionary
    Dim i As Long
    For i = 48 To 57
        hash.Add i, vbNull
    Next
    For i = 65 To 90
        hash.Add i, vbNull
    Next
    For i = 97 To 122
        hash.Add i, vbNull
    Next
End Sub

'Test code:
    starting = Timer
    LoadHashTable
    For i = 1 To 1000000
        dummy = HashLookups(sample)
    Next i
    Debug.Print "HashLookups: ", Timer - starting
Run Code Online (Sandbox Code Playgroud)

事实证明并不是太破旧:

HashLookups:                               1.655273
Run Code Online (Sandbox Code Playgroud)

最终版本

醒来后认为我会尝试使用向量查找而不是哈希查找(只需填充值的字节数组以保留并将其用于测试).这似乎是合理的,因为它只是一个256元素的数组 - 基本上是一个真值表:

Private lookup(255) As Boolean 'Module level

Function VectorLookup(source As Variant) As String
    Dim chars() As Byte
    Dim outVal() As Byte

    chars = CStr(source)
    Dim bound As Long
    bound = UBound(chars)
    ReDim outVal(bound)

    Dim i As Long, pos As Long
    For i = 0 To bound Step 2
        Dim temp As Byte
        temp = chars(i)
        If lookup(temp) Then
            outVal(pos) = temp
            pos = pos + 2
        End If
    Next
    ReDim Preserve outVal(pos)
    VectorLookup = outVal
End Function

Private Sub GenerateTable()
    Dim i As Long
    For i = 48 To 57
        lookup(i) = True
    Next
    For i = 65 To 90
        lookup(i) = True
    Next
    For i = 97 To 122
        lookup(i) = True
    Next
End Sub
Run Code Online (Sandbox Code Playgroud)

假设查找表只生成一次,它的时钟速度比上面任何其他纯VBA方法快10-15%左右.


bre*_*tdj 5

感谢ThunderFrame(我是LHS的傻瓜Mid$)但是我从早期的角度获得了更好的表现,并RegExp进行了额外的小调整:

  • 使用Value2而不是Value
  • long而不是整数声明你的循环
  • .ignorecase = True 是多余的

    Sub Replace2()

    Dim inputSh As Worksheet
    Dim inputRng As Range
    Set inputSh = Sheets("Data")
    Set inputRng = inputSh.Range("A1:A30000")

    Dim outputSh As Worksheet
    Dim outputRng As Range
    Set outputSh = Sheets("Replace")
    Set outputRng = outputSh.Range("A1:A30000")

    Dim time1 As Double, time2 As Double
    time1 = MicroTimer

    Dim arr As Variant
    Dim objRegex As VBScript_RegExp_55.RegExp
    Dim i As Long

    Set objRegex = CreateObject("vbscript.regexp")
    With objRegex
            .Global = True
            .Pattern = "[^\w]"
    End With

    arr = inputRng.Value2
    For i = LBound(arr) To UBound(arr)
            arr(i, 1) = objRegex.Replace(arr(i, 1), vbNullString)
    Next i
    outputRng.Value2 = arr

    time2 = MicroTimer
    Debug.Print (time2 - time1) * 1000
    End Sub
Run Code Online (Sandbox Code Playgroud)