Just the other week, in one of my university Comp. Sci. classes, I was asked to use a supplied linked list to create a concordance from standard input (in C, I might add). The problem wasn’t necessarily hard; in fact, it was simple enough that some friends and I realized it was a great Ruby one-liner candidate. Sure enough, this was the result after no more than a minute of jabbering:
hash = Hash.new(0); str.split.each { |m| hash[m] += 1}
Well, that’s all fine and dandy… A plain old Ruby one-liner. My friend Stef, however, suggested this close alternative:
hash = Hash.new(0); str.scan(/\w+/m) { |m| hash[m] += 1}
What’s different? Well, Stef’s code uses a regex scan with the “m”ultiline flag, then adds 1 to the hash entry for each match. His regex takes any series of one or more “\w”ord characters to be a match. (In Ruby the “m” flag only changes whether “.” matches newlines, so it has no effect on this particular pattern.) My code, on the other hand, uses Ruby’s built-in “split” method to split on whitespace, then iterates over the resulting array.
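To make the comparison concrete, here is a minimal sketch of both approaches counting words in the same input. The sample string is made up for illustration; on punctuation-free text like this the two approaches should produce the same counts.

```ruby
# Hypothetical two-line input; any whitespace-separated text would do.
str = "the cat saw the dog\nthe dog ran"

# Approach 1: split on whitespace, then count each token.
split_counts = Hash.new(0)
str.split.each { |word| split_counts[word] += 1 }

# Approach 2: scan for runs of word characters, counting each match.
scan_counts = Hash.new(0)
str.scan(/\w+/) { |word| scan_counts[word] += 1 }

puts split_counts["the"]          # 3
puts split_counts == scan_counts  # true
```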
This is how split works:
str = "My name is Ryan."
str.split #=> ["My", "name", "is", "Ryan."]
For simple strings like “My name is Ryan.”, Stef’s regex scan works almost identically; the only difference is that it drops the trailing period. For this example we will ignore the fact that “\w” won’t match things like ‘-’; it’s not really that important at the moment.
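A quick side-by-side shows where the two diverge. The hyphenated string below is a made-up example, chosen only to show the ‘-’ caveat:

```ruby
# Sample sentence from the text above.
str = "My name is Ryan."

split_words = str.split        # splits on runs of whitespace; punctuation stays attached
scan_words  = str.scan(/\w+/)  # collects runs of word characters; punctuation is dropped

# Hypothetical hyphenated input to show that "\w" does not match "-".
hyphenated   = "well-known fact"
split_hyphen = hyphenated.split
scan_hyphen  = hyphenated.scan(/\w+/)

p split_words   # ["My", "name", "is", "Ryan."]
p scan_words    # ["My", "name", "is", "Ryan"]
p split_hyphen  # ["well-known", "fact"]
p scan_hyphen   # ["well", "known", "fact"]
```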
Like any good computer scientists, our divergence of methods led to a great argument: which one was better? From my point of view, “split.each” is much more readable, clearly splits on whitespace (without a regex), and is nearly as terse as the regex equivalent. From Stef’s point of view, he A) didn’t have to use “each” and B) had more control over the split. We agreed to disagree; clearly each works best in different situations. Split is best for a simple split, but scan is far more versatile.
Having put semantics aside, we began wrestling over which one would be faster. We threw together this “Benchmark” script:
require 'benchmark'

str = "1 2 3 4-5 6 7 8-9"

Benchmark.bm do |bm|
  # Split on whitespace and count each token.
  bm.report("split: ") { 10000.times do hash = Hash.new(0); str.split.each { |m| hash[m] += 1 }; end }
  # Scan for runs of word characters and count each match.
  bm.report("scan: (\w+) ") { 10000.times do hash = Hash.new(0); str.scan(/\w+/m) { |m| hash[m] += 1 } end }
  # Scan again, this time allowing an optional hyphenated suffix.
  bm.report("scan: (\w+(-\w+)?) ") { 10000.times do hash = Hash.new(0); str.scan(/(\w+(-\w+)?)/m) { |m| hash[m] += 1 } end }
end
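One detail worth checking in the third report: because that pattern contains capture groups, scan yields an array of captures for each match rather than a plain string, so the hash keys end up being arrays. Here is a small sketch of that behaviour, alongside a non-capturing variant that was not part of the benchmark and is shown only for comparison:

```ruby
# Same input string as the benchmark script.
str = "1 2 3 4-5 6 7 8-9"

# With capture groups, each match is an array: [whole token, hyphenated suffix or nil].
with_groups = str.scan(/(\w+(-\w+)?)/)

# With a non-capturing group (?:...), each match is the matched string itself.
without_groups = str.scan(/\w+(?:-\w+)?/)

p with_groups[0]   # ["1", nil]
p with_groups[3]   # ["4-5", "-5"]
p without_groups   # ["1", "2", "3", "4-5", "6", "7", "8-9"]
```

For this particular input the non-capturing scan returns the same tokens as a plain split, since the only punctuation is the hyphen.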
Here’s the result of those benchmarks on various Ruby versions:
Native environment tests - 1.8.7
Creating one hash and clearing it (hash.clear instead of hash = Hash.new(0)):
                       user     system      total        real
split:             0.490000   0.150000   0.640000 (  0.656165)
scan: (\w+)        0.800000   0.180000   0.980000 (  1.003529)
scan: (w+(-w+)?)   1.390000   0.340000   1.730000 (  1.745792)
Creating a new hash every time:
                       user     system      total        real
split:             0.470000   0.140000   0.610000 (  0.643760)
scan: (\w+)        0.800000   0.180000   0.980000 (  0.989383)
scan: (w+(-w+)?)   1.170000   0.260000   1.430000 (  1.457280)
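The exact script behind the “one hash and clear it” numbers isn’t shown, so the following is only a guess at what that variant might have looked like: the hash is created once outside the timed loops and emptied with hash.clear at the start of each iteration. The label width passed to Benchmark.bm and the trimmed-down set of reports are assumptions made for this sketch.

```ruby
require 'benchmark'

str  = "1 2 3 4-5 6 7 8-9"
hash = Hash.new(0)  # created once and reused by every iteration

Benchmark.bm(8) do |bm|
  bm.report("split:") do
    10_000.times do
      hash.clear  # empty the existing hash instead of allocating a new one
      str.split.each { |m| hash[m] += 1 }
    end
  end

  bm.report("scan:") do
    10_000.times do
      hash.clear
      str.scan(/\w+/) { |m| hash[m] += 1 }
    end
  end
end
```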
Variety tests by Stef Penner
mbp:rubinius stefan$ ruby -v
-> ruby 1.8.7 (2008-06-20 patchlevel 22) [i686-darwin9.3.0]
mbp:rubinius stefan$ macruby -v
-> MacRuby version 0.3 (ruby 1.9.0 2008-06-03) [universal-darwin9.0]
mbp:rubinius stefan$ jruby -v
-> ruby 1.8.6 (2008-06-22 rev 6555) [i386-jruby1.1.1]
mbp:rubinius stefan$ rbx -v
-> rubinius 0.9.0 (ruby 1.8.6 compatible) (8038487c4) (10/19/2008) [i686-apple-darwin9.5.0]
$ rbx regx.rb
                       user     system      total        real
split:             1.422384   0.000000   1.422384 (  1.422366)
scan: (\w+)        1.458300   0.000000   1.458300 (  1.458299)
scan: (w+(-w+)?)   2.127930   0.000000   2.127930 (  2.127929)
$ ruby regx.rb
                       user     system      total        real
split:             0.410000   0.140000   0.550000 (  0.559599)
scan: (\w+)        0.670000   0.180000   0.850000 (  0.862585)
scan: (w+(-w+)?)   0.990000   0.270000   1.260000 (  1.268065)
$ ruby1.9 regx.rb
                       user     system      total        real
split:             0.090000   0.000000   0.090000 (  0.096752)
scan: (\w+)        0.170000   0.000000   0.170000 (  0.164321)
scan: (w+(-w+)?)   0.280000   0.000000   0.280000 (  0.291374)
$ macruby regx.rb
                       user     system      total        real
split:             0.440000   0.030000   0.470000 (  0.490660)
scan: (\w+)        4.310000   0.050000   4.360000 (  4.449849)
scan: (w+(-w+)?)   4.380000   0.040000   4.420000 (  4.503897)
$ jruby regx.rb
                       user     system      total        real
split:             0.456000   0.000000   0.456000 (  0.456000)
scan: (\w+)        0.261000   0.000000   0.261000 (  0.260000)
scan: (w+(-w+)?)   0.369000   0.000000   0.369000 (  0.369000)
$ jruby regx.rb  (JRuby 1.1.3)
                       user     system      total        real
split:             0.235000   0.000000   0.235000 (  0.234993)
scan: (\w+)        0.228000   0.000000   0.228000 (  0.228318)
scan: (w+(-w+)?)   0.329000   0.000000   0.329000 (  0.328884)
It’s rather interesting to see how each version of Ruby compares. Yes, Rubinius is slower, but WOW, on the split test Ruby 1.9.1 takes only about 16% of the time 1.8.7 takes!
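For the curious, the 16% figure can be reproduced from the tables above, assuming it compares the “total” column of the split test in Stef’s ruby (1.8.7) and ruby1.9 runs:

```ruby
# "total" column (seconds) for the split test, copied from the tables above.
ruby_187_total = 0.55  # $ ruby regx.rb
ruby_19_total  = 0.09  # $ ruby1.9 regx.rb

ratio = ruby_19_total / ruby_187_total
puts format("%.0f%%", ratio * 100)  # 16%
```

The two scan tests give somewhat different ratios, so the exact percentage depends on which row and column are compared.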