Jon*_*han 13 regex arrays excel vba excel-vba
最终的最终结果:
我想知道如果字符串更长,下面的结果是否会改变.我在同一台计算机上运行完全相同的测试,除了每个单元格有一个34个字符而不是4个字符的随机字符串.这些是结果:
Comintern (Regexp): 136.1 ms
brettdj (Regexp): 139.9 ms
Slai (Regexp): 158.4 ms
*Original Regex: 161.0 ms*
Comintern (AN): 170.1 ms
Comintern (Hash): 183.6 ms
ThunderFrame: 232.9 ms
*Original replace: 372.9 ms*
*Original InStr: 478.1 ms*
CallumDA33: 1218.1 ms
Run Code Online (Sandbox Code Playgroud)
这真的显示了Regex的速度 - 所有使用Regex.replace的解决方案都明显更快,最好的是Comintern的实现.
总之,如果字符串很长,请使用数组,如果它们很短,请使用剪贴板.如果不确定,最佳结果是使用数组,但这可能会牺牲短字符串的一点性能.
最终结果:
非常感谢您提出的所有建议,显然我还有很多需要学习的地方.昨天我一直在想这个,所以我决定在家里重新运行一切.以下是最终结果,基于将这些中的每一个应用于30,000个四个字符串.
我家里的电脑是英特尔i7 @ 3.6 GHz,8GB内存,64位Windows 10和Excel 2016.与之前相似的条件我在后台运行进程,但我并没有在整个测试中积极做任何事情.
Original replace: 97.67 ms
Original InStr: 106.54 ms
Original Regex: 113.46 ms
ThunderFrame: 82.21 ms
Comintern (AN): 96.98 ms
Comintern (OR): 81.87 ms
Comintern (Hash): 101.18 ms
brettdj: 81.66 ms
CallumDA33: 201.64 ms
Slai: 68.38 ms
Run Code Online (Sandbox Code Playgroud)
因此,我接受了Slai的答案,因为它显然是一般实施的最快,但我会将它们全部重新运行,以反对实际数据,以检查这仍然有效.
原帖:
我在Excel中有一个数组列表.例如,我需要将数组的每个成员都变成字母数字
ABC123-001 -> ABC123001
ABC123/001 -> ABC123001
ABC123001 -> ABC123001
Run Code Online (Sandbox Code Playgroud)
这样做的最快方法是什么?
对于上下文,我们的部件号可以有不同的形式,所以我正在编写一个函数,在给定范围内找到最佳匹配.目前,使所有字母数字运行的函数部分运行大约需要50ms,而函数的其余部分总共需要大约30ms.我也无法避免使用Excel.
我自己做了一些工作(见下面的答案),但主要问题是我必须逐个遍历数组的每个元素 - 有没有更好的方法?我以前也从未进行过测试,所以任何有关改进它们的反馈都会非常感激.
这是我到目前为止所尝试的内容.
我正在使用MicroTimer,我的计算机配备了Intel i5 @ 2.5GHz,4GB内存,64位Windows 7.我已经在后台运行进程,但是在运行这些进程时我没有主动做其他任何事情.
我使用以下代码创建了30,000行随机符号:
=CHAR(RANDBETWEEN(1,60))&CHAR(RANDBETWEEN(48,57))&CHAR(RANDBETWEEN(37,140))&CHAR(RANDBETWEEN(37,140))
Run Code Online (Sandbox Code Playgroud)
(注意我们如何在60处停止第一个字符,因为'=' char(61)并且我们希望避免Excel将其解释为公式.此外,我们强制第二个字符为数字,因此我们可以保证其中至少有一个字母数字字符.)
1.使用基于案例的循环.平均时间:175毫秒
在使用功能这篇文章中,我们的范围内装入一个数组,适用所述函数应用于所述阵列的每个元素,并将其粘贴回.码:
Function AlphaNumericOnly(strSource As Variant) As String
Dim i As Integer
Dim strResult As String
For i = 1 To Len(strSource)
Select Case Asc(Mid(strSource, i, 1))
Case 48 To 57, 65 To 90, 97 To 122: 'include 32 if you want to include space
strResult = strResult & Mid(strSource, i, 1)
End Select
Next
AlphaNumericOnly = strResult
End Function
Sub Replace()
Dim inputSh As Worksheet
Dim inputRng As Range
Set inputSh = Sheets("Data")
Set inputRng = inputSh.Range("A1:A30000")
Dim outputSh As Worksheet
Dim outputRng As Range
Set outputSh = Sheets("Replace")
Set outputRng = outputSh.Range("A1:A30000")
Dim time1 As Double, time2 As Double
time1 = MicroTimer
Dim arr As Variant
arr = inputRng
Dim i As Integer
For i = LBound(arr) To UBound(arr)
arr(i, 1) = AlphaNumericOnly(arr(i, 1))
Next i
outputRng = arr
time2 = MicroTimer
Debug.Print (time2 - time1) * 1000
End Sub
Run Code Online (Sandbox Code Playgroud)
2.使用InStr()检查每个字符.平均时间:201毫秒
定义一串有效值.如果有效值出现在数组元素中,请逐个检查:
Sub InStr()
Dim inputSh As Worksheet
Dim inputRng As Range
Set inputSh = Sheets("Data")
Set inputRng = inputSh.Range("A1:A30000")
Dim outputSh As Worksheet
Dim outputRng As Range
Set outputSh = Sheets("InStr")
Set outputRng = outputSh.Range("A1:A30000")
Dim time1 As Double, time2 As Double
time1 = MicroTimer
Dim arr As Variant
arr = inputRng
Dim validValues As String
validValues = "01234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" 'put numbers and capitals at the start as they are more likely'
Dim i As Integer, j As Integer
Dim result As String
For i = LBound(arr) To UBound(arr)
result = vbNullString
For j = 1 To Len(arr(i, 1))
If InStr(validValues, Mid(arr(i, 1), j, 1)) <> 0 Then
result = result & Mid(arr(i, 1), j, 1)
End If
Next j
arr(i, 1) = result
Next i
outputRng = arr
time2 = MicroTimer
Debug.Print (time2 - time1) * 1000
End Sub
Run Code Online (Sandbox Code Playgroud)
3.在数组上使用regex.Replace.时间:171ms
定义正则表达式并使用它来替换数组的每个元素.
Sub Regex()
Dim inputSh As Worksheet
Dim inputRng As Range
Set inputSh = Sheets("Data")
Set inputRng = inputSh.Range("A1:A30000")
Dim outputSh As Worksheet
Dim outputRng As Range
Set outputSh = Sheets("Regex")
Set outputRng = outputSh.Range("A1:A30000")
Dim time1 As Double, time2 As Double
time1 = MicroTimer
Dim arr As Variant
arr = inputRng
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Global = True
.ignorecase = True
.Pattern = "[^\w]"
End With
Dim i As Integer
For i = LBound(arr) To UBound(arr)
arr(i, 1) = objRegex.Replace(arr(i, 1), vbNullString)
Next i
outputRng = arr
time2 = MicroTimer
Debug.Print (time2 - time1) * 1000
End Sub
Run Code Online (Sandbox Code Playgroud)
编辑:
@ThunderFrame - 我们的部件号通常采用以下格式:
我已经考虑过在启动到替换之前在每个字符串上使用regex.test,但是我不确定这是否只是复制字符串然后测试它,在这种情况下我也可以让替换开始.
@Slai - 感谢您的链接 - 我会更详细地研究它
不确定这是否会更快,因为它取决于太多因素,但可能值得测试.而不是Regex.分别放置每个值,您可以从剪贴板中获取复制的Range文本,并立即替换所有值.请注意,\w匹配下划线和Unicode字母,因此在正则表达式中更具体,可以使它更快.
'[a1:b30000] = [{"ABC123-009",""}]: Dim t As Double: t = Timer ' used for testing
Dim r As Range, s As String
Set r = ThisWorkbook.Worksheets("Data").UsedRange.Resize(, 1) ' Data!A1:A30000
With New MSForms.DataObject ' needs reference to "Microsoft Forms 2.0 Object Library" or use a bit slower late binding - With CreateObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
r.Copy
.GetFromClipboard
Application.CutCopyMode = False
s = .GetText
.Clear ' optional - clear the clipboard if using Range.PasteSpecial instead of Worksheet.PasteSpecial "Text"
With New RegExp ' needs reference to "Microsoft VBScript Regular Expressions 5.5" or use a bit slower late binding - With CreateObject("VBScript.RegExp")
.Global = True
'.IgnoreCase = False ' .IgnoreCase is False by default
.Pattern = "[^0-9A-Za-z\r\n]+" ' because "[^\w\r\n]+" also matches _ and Unicode letters
s = .Replace(s, vbNullString)
End With
.SetText s
.PutInClipboard
End With
' about 70% of the time is spent here in pasting the data
r(, 2).PasteSpecial 'xlPasteValues ' paste the text from clipboard in B1
'Debug.Print Timer - t
Run Code Online (Sandbox Code Playgroud)
由于剪贴板开销,我希望这对于较少的值来说会更慢,并且由于需要内存,可能会因为更多的值而变慢.
在我的测试中,禁用事件似乎没有什么不同,但可能值得尝试.
请注意,当宏使用剪贴板时,另一个应用程序使用剪贴板的可能性很小.
如果早期绑定导致在不同计算机上运行相同编译宏的问题,则可以搜索宏反编译器或删除引用并切换到后期绑定.
tl; dr - 正则表达式破坏VBA实现.如果这是代码挑战,@ brettj或@Slai应该赢得它.
有一堆技巧可以让你AlphaNumericOnly更快.
首先,您可以通过将其视为字节数组而不是字符串来消除绝大多数函数调用.这将删除对Mid$和的所有调用Asc.虽然这些功能非常快,但它们仍然增加了调用堆栈的开销和弹出.这增加了几十万次迭代.
第二个优化是不使用Case x To y语法,如果你可以避免它.原因与它如何编译有关 - 它没有编译成测试Case = Condition >= x And Condition <= y,它实际上创建了一个具有早期退出条件的循环,如下所示:
Case = False
For i = x To y
If Condition = i Then
Case = True
End If
Next
Run Code Online (Sandbox Code Playgroud)
同样,不是一个巨大的性能打击,但它加起来.第三个优化是以一种方式对您的测试进行排序,使它们对数据集中最可能的命中进行排序.我在下面为主要字母量身定制了我的例子,其中大多数都是大写字母.您可以通过不同的订购做得更好.把它们放在一起,你得到的东西看起来像这样:
Public Function ByteAlphaNumeric(source As Variant) As String
Dim chars() As Byte
Dim outVal() As Byte
chars = CStr(source) 'Load the array up.
Dim bound As Long
bound = UBound(chars) 'Size the outbound array.
ReDim outVal(bound)
Dim i As Long, pos As Long
For i = 0 To bound Step 2 'Wide characters, only care about the ASCII range.
Dim temp As Byte
temp = chars(i) 'Pointer math isn't free. Cache it.
Select Case True 'Order is important here.
Case temp > 64 And temp < 91
outVal(pos) = temp
pos = pos + 2 'Advance the output pointer.
Case temp < 48
Case temp > 122
Case temp > 96
outVal(pos) = temp
pos = pos + 2
Case temp < 58
outVal(pos) = temp
pos = pos + 2
End Select
Next
'This is likely the most expensive operation.
ReDim Preserve outVal(pos) 'Trim the output array.
ByteAlphaNumeric = outVal
End Function
Run Code Online (Sandbox Code Playgroud)
它是怎么做的?挺好的:
Public Sub Benchmark()
Dim starting As Single, i As Long, dummy As String, sample As Variant
sample = GetRandomString
starting = Timer
For i = 1 To 1000000
dummy = AlphaNumericOnlyOP(sample)
Next i
Debug.Print "OP's AlphaNumericOnly: ", Timer - starting
starting = Timer
For i = 1 To 1000000
dummy = AlphaNumericOnlyThunderframe(sample)
Next i
Debug.Print "ThunderFrame's AlphaNumericOnly: ", Timer - starting
starting = Timer
For i = 1 To 1000000
dummy = AlphaNumeric(sample)
Next i
Debug.Print "CallumDA33's AlphaNumeric: ", Timer - starting
starting = Timer
For i = 1 To 1000000
dummy = ByteAlphaNumeric(sample)
Next i
Debug.Print "ByteAlphaNumeric: ", Timer - starting
Dim cast As String
cast = CStr(sample)
starting = Timer
For i = 1 To 1000000
dummy = ByteAlphaNumericString(cast)
Next i
Debug.Print "ByteAlphaNumericString: ", Timer - starting
Set stripper = Nothing
starting = Timer
For i = 1 To 1000000
dummy = OptimizedRegex(sample)
Next i
Debug.Print "OptimizedRegex: ", Timer - starting
End Sub
Private Function GetRandomString() As Variant
Dim chars(30) As Byte, i As Long
Randomize
For i = 0 To 30 Step 2
chars(i) = Int(96 * Rnd + 32)
Next i
Dim temp As String
temp = chars
GetRandomString = CVar(temp)
End Function
Run Code Online (Sandbox Code Playgroud)
随机15个字符的结果String:
Run Code Online (Sandbox Code Playgroud)OP`s AlphaNumericOnly: 6.565918 ThunderFrame`s AlphaNumericOnly: 3.617188 CallumDA33`s AlphaNumeric: 23.518070 ByteAlphaNumeric: 2.354980
注意,我省略了转换为函数并不容易的提交.你可能会注意到另外两个测试 - 它ByteAlphaNumericString与ByteAlphaNumeric函数完全相同,但是它需要一个String输入而不是一个Variant并且摆脱了强制转换.这不是微不足道的:
Run Code Online (Sandbox Code Playgroud)ByteAlphaNumericString: 2.226074
最后,难以捉摸的OptimizedRegex函数(基本上是@ brettj的函数形式的代码用于比较计时):
Private stripper As RegExp 'Module level
Function OptimizedRegex(strSource As Variant) As String
If stripper Is Nothing Then
Set stripper = New RegExp
With stripper
.Global = True
.Pattern = "[^0-9A-Za-z]"
End With
End If
OptimizedRegex = stripper.Replace(strSource, vbNullString)
End Function
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)OptimizedRegex: 1.094727
在我看来,哈希表查找可能比Select Case结构更快,所以我使用以下内容构建了一个Scripting.Dictionary:
Private hash As Scripting.Dictionary 'Module level
Function HashLookups(source As Variant) As String
Dim chars() As Byte
Dim outVal() As Byte
chars = CStr(source)
Dim bound As Long
bound = UBound(chars)
ReDim outVal(bound)
Dim i As Long, pos As Long
With hash
For i = 0 To bound Step 2
Dim temp As Byte
temp = chars(i)
If .Exists(temp) Then
outVal(pos) = temp
pos = pos + 2
End If
Next
End With
ReDim Preserve outVal(pos)
HashLookups = outVal
End Function
Private Sub LoadHashTable()
Set hash = New Scripting.Dictionary
Dim i As Long
For i = 48 To 57
hash.Add i, vbNull
Next
For i = 65 To 90
hash.Add i, vbNull
Next
For i = 97 To 122
hash.Add i, vbNull
Next
End Sub
'Test code:
starting = Timer
LoadHashTable
For i = 1 To 1000000
dummy = HashLookups(sample)
Next i
Debug.Print "HashLookups: ", Timer - starting
Run Code Online (Sandbox Code Playgroud)
事实证明并不是太破旧:
Run Code Online (Sandbox Code Playgroud)HashLookups: 1.655273
醒来后认为我会尝试使用向量查找而不是哈希查找(只需填充值的字节数组以保留并将其用于测试).这似乎是合理的,因为它只是一个256元素的数组 - 基本上是一个真值表:
Private lookup(255) As Boolean 'Module level
Function VectorLookup(source As Variant) As String
Dim chars() As Byte
Dim outVal() As Byte
chars = CStr(source)
Dim bound As Long
bound = UBound(chars)
ReDim outVal(bound)
Dim i As Long, pos As Long
For i = 0 To bound Step 2
Dim temp As Byte
temp = chars(i)
If lookup(temp) Then
outVal(pos) = temp
pos = pos + 2
End If
Next
ReDim Preserve outVal(pos)
VectorLookup = outVal
End Function
Private Sub GenerateTable()
Dim i As Long
For i = 48 To 57
lookup(i) = True
Next
For i = 65 To 90
lookup(i) = True
Next
For i = 97 To 122
lookup(i) = True
Next
End Sub
Run Code Online (Sandbox Code Playgroud)
假设查找表只生成一次,它的时钟速度比上面任何其他纯VBA方法快10-15%左右.
感谢ThunderFrame(我是LHS的傻瓜Mid$)但是我从早期的角度获得了更好的表现,并RegExp进行了额外的小调整:
Value2而不是Value.ignorecase = True 是多余的码
Sub Replace2()
Dim inputSh As Worksheet
Dim inputRng As Range
Set inputSh = Sheets("Data")
Set inputRng = inputSh.Range("A1:A30000")
Dim outputSh As Worksheet
Dim outputRng As Range
Set outputSh = Sheets("Replace")
Set outputRng = outputSh.Range("A1:A30000")
Dim time1 As Double, time2 As Double
time1 = MicroTimer
Dim arr As Variant
Dim objRegex As VBScript_RegExp_55.RegExp
Dim i As Long
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Global = True
.Pattern = "[^\w]"
End With
arr = inputRng.Value2
For i = LBound(arr) To UBound(arr)
arr(i, 1) = objRegex.Replace(arr(i, 1), vbNullString)
Next i
outputRng.Value2 = arr
time2 = MicroTimer
Debug.Print (time2 - time1) * 1000
End Sub
Run Code Online (Sandbox Code Playgroud)