Right now I check whether a sentence contains a specific word by splitting the sentence into an array and then calling include? on it. Something like:
"This is my awesome sentence.".split(" ").include?('awesome')
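But I'm wondering what the fastest way to do this with a phrase would be. For example, I want to see whether the sentence "This is my awesome sentence." contains the phrase "my awesome sentence". I'm scraping sentences and comparing them against a large number of phrases, so speed is somewhat important.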
the*_*Man 12
Here are some variations:
require 'benchmark'
lorem = ('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut' # !> unused literal ignored
'enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in' # !> unused literal ignored
'reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,' # !> unused literal ignored
'sunt in culpa qui officia deserunt mollit anim id est laborum.' * 10) << ' foo'
lorem.split.include?('foo') # => true
lorem['foo'] # => "foo"
lorem.include?('foo') # => true
lorem[/foo/] # => "foo"
lorem[/fo{2}/] # => "foo"
lorem[/foo$/] # => "foo"
lorem[/fo{2}$/] # => "foo"
lorem[/fo{2}\Z/] # => "foo"
/foo/.match(lorem)[-1] # => "foo"
/foo$/.match(lorem)[-1] # => "foo"
/foo/ =~ lorem # => 621
n = 500_000
puts RUBY_VERSION
puts "n=#{ n }"
Benchmark.bm(25) do |x|
x.report("array search:") { n.times { lorem.split.include?('foo') } }
x.report("literal search:") { n.times { lorem['foo'] } }
x.report("string include?:") { n.times { lorem.include?('foo') } }
x.report("regex:") { n.times { lorem[/foo/] } }
x.report("wildcard regex:") { n.times { lorem[/fo{2}/] } }
x.report("anchored regex:") { n.times { lorem[/foo$/] } }
x.report("anchored wildcard regex:") { n.times { lorem[/fo{2}$/] } }
x.report("anchored wildcard regex2:") { n.times { lorem[/fo{2}\Z/] } }
x.report("/regex/.match") { n.times { /foo/.match(lorem)[-1] } }
x.report("/regex$/.match") { n.times { /foo$/.match(lorem)[-1] } }
x.report("/regex/ =~") { n.times { /foo/ =~ lorem } }
x.report("/regex$/ =~") { n.times { /foo$/ =~ lorem } }
x.report("/regex\Z/ =~") { n.times { /foo\Z/ =~ lorem } }
end
And the results for Ruby 1.9.3:
1.9.3
n=500000
                               user     system      total        real
array search:             12.960000   0.010000  12.970000 ( 12.978311)
literal search:            0.800000   0.000000   0.800000 (  0.807110)
string include?:           0.760000   0.000000   0.760000 (  0.758918)
regex:                     0.660000   0.000000   0.660000 (  0.657608)
wildcard regex:            0.660000   0.000000   0.660000 (  0.660296)
anchored regex:            0.660000   0.000000   0.660000 (  0.664025)
anchored wildcard regex:   0.660000   0.000000   0.660000 (  0.664897)
anchored wildcard regex2:  0.320000   0.000000   0.320000 (  0.328876)
/regex/.match              1.430000   0.000000   1.430000 (  1.424602)
/regex$/.match             1.430000   0.000000   1.430000 (  1.434538)
/regex/ =~                 0.530000   0.000000   0.530000 (  0.538128)
/regex$/ =~                0.540000   0.000000   0.540000 (  0.536318)
/regexZ/ =~                0.210000   0.000000   0.210000 (  0.214547)
And for 1.8.7:
1.8.7
n=500000
                               user     system      total        real
array search:             21.250000   0.000000  21.250000 ( 21.296039)
literal search:            0.660000   0.000000   0.660000 (  0.660102)
string include?:           0.610000   0.000000   0.610000 (  0.612433)
regex:                     0.950000   0.000000   0.950000 (  0.946308)
wildcard regex:            2.840000   0.000000   2.840000 (  2.850198)
anchored regex:            0.950000   0.000000   0.950000 (  0.951270)
anchored wildcard regex:   2.870000   0.010000   2.880000 (  2.874209)
anchored wildcard regex2:  2.870000   0.000000   2.870000 (  2.868291)
/regex/.match              1.470000   0.000000   1.470000 (  1.479383)
/regex$/.match             1.480000   0.000000   1.480000 (  1.498106)
/regex/ =~                 0.680000   0.000000   0.680000 (  0.677444)
/regex$/ =~                0.700000   0.000000   0.700000 (  0.704486)
/regexZ/ =~                0.700000   0.000000   0.700000 (  0.701943)
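So, from the 1.9.3 results, a fixed-string search such as 'foobar'['foo'] is slower than the regex 'foobar'[/foo/], which in turn is slower than the equivalent 'foobar' =~ /foo/. On 1.8.7 the order differs: the fixed-string search is faster than the regex.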
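The OP's original solution suffers badly because it traverses the string twice: once to split it into individual words, and a second time to iterate over the resulting array looking for the target word. Its performance degrades as the string gets longer.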
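For the phrase case the question asks about, splitting on spaces cannot work at all, because a phrase spans several array elements. A minimal sketch of the direct alternatives follows. The helper names contains_phrase? and phrase_matcher and the sample phrases are hypothetical, not from the original answer, and Regexp.union is an addition that was not benchmarked above. Whether one combined regex beats looping over include? depends on the data, so measure before committing to it. Note too that these are substring matches, so "awesome sent" would also match.

```ruby
# Hypothetical example data and helper names, for illustration only.
SENTENCE = 'This is my awesome sentence.'
PHRASES  = ['my awesome sentence', 'your boring sentence']

# Plain substring test: one pass over the string, no intermediate array.
def contains_phrase?(sentence, phrase)
  sentence.include?(phrase)
end

# Build a single alternation regex from many phrases so it can be reused.
# Regexp.union escapes its string arguments, so punctuation is matched literally.
def phrase_matcher(phrases)
  Regexp.union(phrases)
end

MATCHER = phrase_matcher(PHRASES)

puts contains_phrase?(SENTENCE, 'my awesome sentence')   # true
puts SENTENCE[MATCHER].inspect                           # "my awesome sentence"

# Collect every phrase that occurs in the sentence.
puts PHRASES.select { |phrase| contains_phrase?(SENTENCE, phrase) }.inspect
```

Building the matcher once and reusing it across all scraped sentences avoids recompiling the regex for every comparison.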
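EDIT: One thing I find interesting about Ruby's performance is that a regex anchored with $ is slightly slower than an unanchored one. In Perl, the opposite was true when I first ran this sort of benchmark several years ago.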
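Here's an updated version using Fruity. The various expressions return different results. Any of them can be used if you only want to know whether the target string exists. If you want to know whether the value is at the end of the string, as these tests do, or to get the position of the target, then some are definitely faster than others, so choose accordingly.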
require 'fruity'
TARGET_STR = (' ' * 100) + ' foo'
TARGET_STR['foo'] # => "foo"
TARGET_STR[/foo/] # => "foo"
TARGET_STR[/fo{2}/] # => "foo"
TARGET_STR[/foo$/] # => "foo"
TARGET_STR[/fo{2}$/] # => "foo"
TARGET_STR[/fo{2}\Z/] # => "foo"
TARGET_STR[/fo{2}\z/] # => "foo"
TARGET_STR[/foo\Z/] # => "foo"
TARGET_STR[/foo\z/] # => "foo"
/foo/.match(TARGET_STR)[-1] # => "foo"
/foo$/.match(TARGET_STR)[-1] # => "foo"
/foo/ =~ TARGET_STR # => 101
/foo$/ =~ TARGET_STR # => 101
/foo\Z/ =~ TARGET_STR # => 101
TARGET_STR.include?('foo') # => true
TARGET_STR.index('foo') # => 101
TARGET_STR.rindex('foo') # => 101
puts RUBY_VERSION
puts "TARGET_STR.length = #{ TARGET_STR.length }"
puts
puts 'compare fixed string vs. unanchored regex'
compare do
fixed_str { TARGET_STR['foo'] }
unanchored_regex { TARGET_STR[/foo/] }
end
puts
puts 'compare /foo/ to /fo{2}/'
compare do
unanchored_regex { TARGET_STR[/foo/] }
unanchored_regex2 { TARGET_STR[/fo{2}/] }
end
puts
puts 'compare unanchored vs. anchored regex'
compare do
unanchored_regex { TARGET_STR[/foo/] }
anchored_regex_dollar { TARGET_STR[/foo$/] }
anchored_regex_Z { TARGET_STR[/foo\Z/] }
anchored_regex_z { TARGET_STR[/foo\z/] }
end
puts
puts 'compare /foo/, match and =~'
compare do
unanchored_regex { TARGET_STR[/foo/] }
unanchored_match { /foo/.match(TARGET_STR)[-1] }
unanchored_eq_match { /foo/ =~ TARGET_STR }
end
puts
puts 'compare fixed, unanchored, Z, include?, index and rindex'
compare do
fixed_str { TARGET_STR['foo'] }
unanchored_regex { TARGET_STR[/foo/] }
anchored_regex_Z { TARGET_STR[/foo\Z/] }
include_eh { TARGET_STR.include?('foo') }
_index { TARGET_STR.index('foo') }
_rindex { TARGET_STR.rindex('foo') }
end
Which results in:
# >> 2.2.3
# >> TARGET_STR.length = 104
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 2x ± 0.1
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 19.999999999999996% ± 10.0%
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_eq_match is faster than unanchored_regex by 2x ± 0.1 (results differ: 101 vs foo)
# >> unanchored_regex is faster than unanchored_match by 3x ± 0.1
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 3 seconds.
# >> _rindex is similar to include_eh (results differ: 101 vs true)
# >> include_eh is faster than _index by 10.000000000000009% ± 10.0% (results differ: true vs 101)
# >> _index is faster than fixed_str by 19.999999999999996% ± 10.0% (results differ: 101 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 39.99999999999999% ± 10.0%
# >> anchored_regex_Z is similar to unanchored_regex
Changing the size of the string reveals some useful things.
Changing to 1,000 characters:
# >> 2.2.3
# >> TARGET_STR.length = 1004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 4096 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 50.0% ± 10.0%
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is faster than anchored_regex_Z by 10.000000000000009% ± 10.0%
# >> anchored_regex_Z is faster than unanchored_regex by 3x ± 0.1
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 4096 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 1001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 2x ± 0.1
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 4 seconds.
# >> _rindex is faster than anchored_regex_Z by 2x ± 1.0 (results differ: 1001 vs foo)
# >> anchored_regex_Z is faster than include_eh by 2x ± 0.1 (results differ: foo vs true)
# >> include_eh is faster than fixed_str by 10.000000000000009% ± 10.0% (results differ: true vs foo)
# >> fixed_str is similar to _index (results differ: foo vs 1001)
# >> _index is similar to unanchored_regex (results differ: 1001 vs foo)
Bumping it to 10,000:
# >> 2.2.3
# >> TARGET_STR.length = 10004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 512 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 256 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 3 seconds.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 21x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 256 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 10001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 10.000000000000009% ± 10.0%
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 18 seconds.
# >> _rindex is faster than anchored_regex_Z by 2x ± 0.1 (results differ: 10001 vs foo)
# >> anchored_regex_Z is faster than include_eh by 15x ± 1.0 (results differ: foo vs true)
# >> include_eh is similar to _index (results differ: true vs 10001)
# >> _index is similar to fixed_str (results differ: 10001 vs foo)
# >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%
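To repeat the size comparison without installing Fruity, here is a rough sketch using only the standard library's Benchmark module. The helpers build_target and time_searches and the ITERATIONS count are hypothetical choices, and String#end_with? is an extra candidate that was not part of the runs above. The timings it prints vary by machine and Ruby version, so treat them as relative indications only.

```ruby
require 'benchmark'

# Assumed iteration count, chosen only to keep the run short.
ITERATIONS = 10_000

# Hypothetical helper: `padding` spaces followed by ' foo',
# mirroring how TARGET_STR is built above.
def build_target(padding)
  (' ' * padding) + ' foo'
end

# Time a few ways of locating 'foo' at the end of the string.
# Returns a hash of label => elapsed seconds.
def time_searches(target)
  {
    'include?'  => Benchmark.realtime { ITERATIONS.times { target.include?('foo') } },
    'rindex'    => Benchmark.realtime { ITERATIONS.times { target.rindex('foo') } },
    '/foo\z/'   => Benchmark.realtime { ITERATIONS.times { target[/foo\z/] } },
    'end_with?' => Benchmark.realtime { ITERATIONS.times { target.end_with?('foo') } }
  }
end

[100, 1_000, 10_000].each do |padding|
  target = build_target(padding)
  puts "TARGET_STR.length = #{target.length}"
  time_searches(target).each do |label, seconds|
    puts format('  %-10s %.6f s', label, seconds)
  end
end
```

Because the target sits at the very end of the string, searches that start from the end should be less sensitive to the padding length than those that scan from the front, which is the pattern the Fruity output above suggests.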