Stats
When we're developing software, we often find ourselves needing to summarize and reason about large sets of numeric data. Whether it's evaluating system performance, measuring message traffic, or characterizing the size distributions of documents in a database, we need to be able to take a set of data points and process the set in meaningful ways.
I have a set of a thousand numeric data points stored as a simple Ruby array in a file called "samples.rb". The source of the data isn't important for today's demonstration. Let's quickly go over how to derive some basic statistical information from this data set, using core Ruby methods.
We'll start with the very basics. Minimum and maximum are trivial, since they are built in.
data = eval(File.read("./samples.rb"))
min = data.min # => 0.00107847
max = data.max # => 0.112805329
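When both extremes are needed at once, Ruby's #minmax returns them as a pair in a single call. A minimal sketch, using a small made-up array as a stand-in for the episode's samples.rb data:

```ruby
# Hypothetical stand-in for the samples loaded from samples.rb.
samples = [0.042, 0.0011, 0.1128, 0.057]

# minmax returns a two-element array: [smallest, largest].
lo, hi = samples.minmax
lo # => 0.0011
hi # => 0.1128
```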
We can get the sum of the values using the #reduce method, as we learned in the last episode.
data = eval(File.read("./samples.rb"))
min = data.min # => 0.00107847
max = data.max # => 0.112805329
sum = data.reduce(:+) # => 57.123122340999984
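On Ruby 2.4 and later, Array#sum is an alternative to reduce(:+). The sketch below uses invented values, not the episode's data, to show two differences worth knowing about: what each approach returns for an empty array, and the fact that the two can disagree in the last digits of a floating-point total.

```ruby
# Hypothetical values, chosen because they expose floating-point rounding.
values = [0.1, 0.2, 0.3]

naive_total    = values.reduce(:+) # plain left-to-right addition
built_in_total = values.sum        # Array#sum; may differ from the above in the last digits

# An empty collection behaves differently under the two approaches.
empty_reduce        = [].reduce(:+)    # => nil
empty_reduce_seeded = [].reduce(0, :+) # => 0
empty_sum           = [].sum           # => 0
```

If the data set could ever be empty, seeding the reduction with 0 (or using #sum) avoids a nil creeping into later arithmetic.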
Now that we have the sum, we can easily get the mean by dividing it by the number of data points.
data = eval(File.read("./samples.rb"))
min = data.min # => 0.00107847
max = data.max # => 0.112805329
sum = data.reduce(:+) # => 57.123122340999984
avg = sum / data.size # => 0.057123122340999984
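The division above works because the samples are floats. With integer data, the same expression would silently truncate. Here is a small sketch of a helper that sidesteps this with #fdiv; the method name mean and the sample values are illustrative assumptions, not part of the episode's code.

```ruby
# Hypothetical helper; the name `mean` is not from the episode.
def mean(values)
  return nil if values.empty?         # no meaningful mean for an empty set
  values.reduce(:+).fdiv(values.size) # fdiv always performs floating-point division
end

integer_counts = [1, 2, 4] # made-up integer samples

truncated = integer_counts.reduce(:+) / integer_counts.size # => 2 (integer division)
exact     = mean(integer_counts)                            # roughly 2.33
```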
Now, how about the median? This one's a little trickier. First, we'll get a sorted version of the collection, using #sort. However, to properly discover the median, we can't just take the middle element from the sorted data. If the set has an even number of data points, we should average the middle two data points in order to find the true median.
We find the size of half of the data set. Then we extract an array of either two samples or one, depending on whether the data set size is even or odd. We are using the two-argument form of array slicing here, where the first argument is the index at which to start, and the second argument is the number of items to retrieve. Finally, we take the average of these middle elements.
data = eval(File.read("./samples.rb"))
min = data.min # => 0.00107847
max = data.max # => 0.112805329
sum = data.reduce(:+) # => 57.123122340999984
avg = sum / data.size # => 0.057123122340999984
sorted = data.sort
half_size = sorted.size / 2
middle = if sorted.size.even?
sorted[half_size - 1, 2]
else
sorted[half_size, 1]
end
med = middle.reduce(:+) / middle.size # => 0.0564275785
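To check the even and odd branches by hand, it can help to wrap the same steps in a method and feed it tiny inputs. This is only a sketch: the method name median and the test arrays are assumptions made for illustration, and the method assumes a non-empty array.

```ruby
# Hypothetical helper wrapping the steps above; `median` is an illustrative name.
# Assumes `values` is non-empty.
def median(values)
  sorted = values.sort
  half_size = sorted.size / 2
  middle = if sorted.size.even?
             sorted[half_size - 1, 2] # start one before the midpoint, take two elements
           else
             sorted[half_size, 1]     # start at the midpoint, take one element
           end
  middle.reduce(:+).fdiv(middle.size) # fdiv keeps integer inputs from truncating
end

median([5, 1, 3])    # => 3.0
median([4, 1, 3, 2]) # => 2.5
```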
If we don't mind sacrificing a little readability, we can tighten this code up. By using the numeric #divmod method, we can divide the sample size by two and find out both the quotient and the remainder at once. In this case, the remainder is going to either be 1 or 0. Then we can get our middle elements with a single array slice. The start index of the slice is the quotient minus 1 plus the remainder, and the size of the slice is 2 minus the remainder. The code for taking the average of the middle remains the same.
I have mixed feelings about this code; on the one hand, the old version was awfully verbose for a relatively small operation. On the other hand, this is pretty opaque code. If you have an idea for how to write this code more elegantly, feel free to send it in!
data = eval(File.read("./samples.rb"))
min = data.min # => 0.00107847
max = data.max # => 0.112805329
sum = data.reduce(:+) # => 57.123122340999984
avg = sum / data.size # => 0.057123122340999984
sorted = data.sort
q, r = sorted.size.divmod(2) # quotient and remainder of size / 2
middle = sorted[q - 1 + r, 2 - r]
med = middle.reduce(:+) / middle.size # => 0.0564275785
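One possible answer to the question of how to write this more elegantly, offered as a sketch rather than a definitive solution: pick the two middle indices directly with integer division. For an odd-sized set both indices land on the same element, so averaging them is harmless. The method name median_of and the sample arrays are assumptions for illustration, and the method assumes a non-empty array.

```ruby
# Hypothetical alternative; `median_of` is an illustrative name.
# Assumes `values` is non-empty.
def median_of(values)
  sorted = values.sort
  n = sorted.size
  # Odd n:  (n - 1) / 2 and n / 2 are the same middle index.
  # Even n: they are the two indices on either side of the midpoint.
  (sorted[(n - 1) / 2] + sorted[n / 2]) / 2.0
end

median_of([5, 1, 3])    # => 3.0
median_of([4, 1, 3, 2]) # => 2.5
```

The trade-off is one redundant addition in the odd case, in exchange for no branching and no slice arithmetic.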
Let's do one more: standard deviation. For each sample in the data, we take the difference between the sample and the mean, and then square it using the double-star exponent operator. We take the average of the resulting list of numbers. Finally, we take the square root of the average using the .sqrt function in the Math module.
data = eval(File.read("./samples.rb"))
min = data.min # => 0.00107847
max = data.max # => 0.112805329
sum = data.reduce(:+) # => 57.123122340999984
avg = sum / data.size # => 0.057123122340999984
sorted = data.sort
q, r = sorted.size.divmod(2) # quotient and remainder of size / 2
middle = sorted[q - 1 + r, 2 - r]
med = middle.reduce(:+) / middle.size # => 0.0564275785
std = Math.sqrt(data.map{|n| (n - avg) ** 2}.reduce(:+) / data.size)
std # => 0.03249567434627222
This gives us the standard deviation, an overall indication of how much the samples deviate from the mean. Because we divide by the full number of data points, this is the population standard deviation.
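When the data is a sample drawn from a larger population, the usual adjustment is to divide by one less than the number of data points. The sketch below supports both forms; the method name standard_deviation, the ddof keyword, and the scores array are all assumptions introduced for illustration, not part of the episode's code or of Ruby's core library.

```ruby
# Hypothetical helper; `ddof` is an illustrative keyword name, not a Ruby API.
# ddof: 0 divides by n (population), ddof: 1 divides by n - 1 (sample).
def standard_deviation(values, ddof: 0)
  mean = values.reduce(:+).fdiv(values.size)
  sum_of_squares = values.map { |n| (n - mean) ** 2 }.reduce(:+)
  Math.sqrt(sum_of_squares / (values.size - ddof))
end

scores = [2, 4, 4, 4, 5, 5, 7, 9] # made-up data whose mean is exactly 5

population_sd = standard_deviation(scores)          # => 2.0
sample_sd     = standard_deviation(scores, ddof: 1) # slightly larger than 2.0
```

For a thousand data points the two results differ only slightly, but the gap grows as the data set shrinks.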
These are only a few basic statistics, but they are widely useful and it's good to know how to derive them when we need them.
Happy hacking!