Unit 1, Lesson 21

Stats

When we're developing software, we often find ourselves needing to summarize and reason about large sets of numeric data. Whether it's evaluating system performance, measuring message traffic, or characterizing the size distributions of documents in a database, we need to be able to take a set of data points and process the set in meaningful ways.

I have a set of a thousand numeric data points stored as a simple Ruby array in a file called "samples.rb". The source of the data isn't important for today's demonstration. Let's quickly go over how to derive some basic statistical information from this data set, using core Ruby methods.

We'll start with the very basics. Minimum and maximum are trivial, since they are built in.

data = eval(File.read("./samples.rb"))
min = data.min                  # => 0.00107847
max = data.max                  # => 0.112805329
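
If both extremes are needed at once, the built-in #minmax method returns them as a pair in a single call. A small sketch, using a made-up array in place of the real samples:

```ruby
# Hypothetical stand-in values, not the real contents of samples.rb.
sample_values = [0.25, 0.5, 0.125, 1.0]

# minmax returns [smallest, largest] as a two-element array.
low, high = sample_values.minmax
low   # => 0.125
high  # => 1.0
```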

We can get the sum of the values using the #reduce method, as we learned in the last episode.

data = eval(File.read("./samples.rb"))
min = data.min                        # => 0.00107847
max = data.max                        # => 0.112805329
sum = data.reduce(:+)                 # => 57.123122340999984
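
On Ruby 2.4 and later, Array#sum is a built-in alternative to reduce(:+). For floating-point values it may use a compensated summation, so its result can differ from the reduce version in the last digit or two. A minimal sketch with made-up values:

```ruby
# Hypothetical stand-in values, not the real contents of samples.rb.
values = [0.1, 0.2, 0.3]

naive_sum    = values.reduce(:+)  # plain left-to-right float addition
built_in_sum = values.sum         # Array#sum, available since Ruby 2.4

# The two results agree to within floating-point rounding error.
(naive_sum - built_in_sum).abs < 1e-12  # => true
```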

Now that we have the sum, we can easily get the mean by dividing it by the number of data points.

data = eval(File.read("./samples.rb"))
min = data.min                        # => 0.00107847
max = data.max                        # => 0.112805329
sum = data.reduce(:+)                 # => 57.123122340999984
avg = sum / data.size                 # => 0.057123122340999984
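
The division above yields a Float because the samples are floats. If the data were integers, the / operator would perform integer division and silently truncate the mean. A sketch of the pitfall and one way around it, using a made-up integer array:

```ruby
# Hypothetical integer data, to show the truncation pitfall.
counts = [1, 2]
total  = counts.reduce(:+)                # => 3

truncated_mean = total / counts.size      # => 1 (integer division)
float_mean     = total.fdiv(counts.size)  # => 1.5 (fdiv always divides as floats)
```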

Now, how about the median? This one's a little trickier. First, we'll get a sorted version of the collection, using #sort. However, to properly discover the median, we can't just take the middle element from the sorted data. If the set has an even number of data points, we should average the middle two data points in order to find the true median.

We find the size of half of the data set. Then we extract an array of either two samples or one, depending on whether the data set size is even or odd. We are using the two-argument form of array slicing here, where the first argument is the index at which to start, and the second argument is the number of items to retrieve. Finally, we take the average of these middle elements.

data = eval(File.read("./samples.rb"))
min = data.min                        # => 0.00107847
max = data.max                        # => 0.112805329
sum = data.reduce(:+)                 # => 57.123122340999984
avg = sum / data.size                 # => 0.057123122340999984
sorted = data.sort
half_size = sorted.size / 2
middle = if sorted.size.even?
           sorted[half_size - 1, 2]
         else
           sorted[half_size, 1]
         end
med = middle.reduce(:+) / middle.size # => 0.0564275785
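
To see the slicing at work, here is the same logic wrapped in a method and run against two tiny arrays, one odd-sized and one even-sized. The method name median_of and the inputs are made up for illustration, and the sketch assumes a non-empty array. The to_f call is an addition so that integer input is not truncated.

```ruby
# Hypothetical helper wrapping the median steps from above.
# Assumes values is non-empty.
def median_of(values)
  sorted = values.sort
  half_size = sorted.size / 2
  middle = if sorted.size.even?
             sorted[half_size - 1, 2]  # the two middle elements
           else
             sorted[half_size, 1]      # the single middle element
           end
  middle.reduce(:+) / middle.size.to_f
end

median_of([5, 1, 3])     # => 3.0
median_of([4, 1, 3, 2])  # => 2.5
```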

If we don't mind sacrificing a little readability, we can tighten this code up. By using the numeric #divmod method, we can divide the sample size by two and find out both the quotient and the remainder at once. In this case, the remainder is going to either be 1 or 0. Then we can get our middle elements with a single array slice. The start index of the slice is the quotient minus 1 plus the remainder, and the size of the slice is 2 minus the remainder. The code for taking the average of the middle remains the same.

I have mixed feelings about this code; on the one hand, the old version was awfully verbose for a relatively small operation. On the other hand, this is pretty opaque code. If you have an idea for how to write this code more elegantly, feel free to send it in!

data = eval(File.read("./samples.rb"))
min = data.min                        # => 0.00107847
max = data.max                        # => 0.112805329
sum = data.reduce(:+)                 # => 57.123122340999984
avg = sum / data.size                 # => 0.057123122340999984
sorted = data.sort
q, r = sorted.size.divmod(2)
middle = sorted[q - 1 + r, 2 - r]
med = middle.reduce(:+) / middle.size # => 0.0564275785
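
One possible alternative, offered as a sketch rather than a definitive answer: average the elements at indices (n - 1) / 2 and n / 2. For an odd-sized set both indices point at the same middle element; for an even-sized set they point at the two middle elements. The method name is hypothetical, and the sketch assumes a non-empty array.

```ruby
# Hypothetical helper: median via two index lookups instead of a slice.
# Assumes values is non-empty.
def median_by_index(values)
  sorted = values.sort
  n = sorted.size
  # Odd n:  (n - 1) / 2 == n / 2, so the middle element is averaged with itself.
  # Even n: the two indices are the lower and upper middle elements.
  (sorted[(n - 1) / 2] + sorted[n / 2]) / 2.0
end

median_by_index([5, 1, 3])     # => 3.0
median_by_index([4, 1, 3, 2])  # => 2.5

# For comparison, divmod returns the quotient and remainder together.
4.divmod(2)  # => [2, 0]
3.divmod(2)  # => [1, 1]
```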

Let's do one more: standard deviation. For each sample in the data, we take the difference between the sample and the mean, and then square it using the double-star exponent operator. We take the average of the resulting list of numbers, which gives us the variance. Finally, we take the square root of the variance using the sqrt function in the Math module.

data = eval(File.read("./samples.rb"))
min = data.min                        # => 0.00107847
max = data.max                        # => 0.112805329
sum = data.reduce(:+)                 # => 57.123122340999984
avg = sum / data.size                 # => 0.057123122340999984
sorted = data.sort
q, r = sorted.size.divmod(2)
middle = sorted[q - 1 + r, 2 - r]
med = middle.reduce(:+) / middle.size # => 0.0564275785
std = Math.sqrt(data.map{|n| (n - avg) ** 2}.reduce(:+) / data.size)
std                             # => 0.03249567434627222

This gives us the standard deviation, an overall indication of how much the samples deviate from the mean.
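
Dividing by data.size, as above, gives what is usually called the population standard deviation. When the data is a sample drawn from a larger population, a common variant divides by n - 1 instead. A sketch of both, with hypothetical method names and made-up values:

```ruby
# Hypothetical helpers; neither is part of Ruby's core library.
def population_std(values)
  mean = values.reduce(:+) / values.size.to_f
  sum_of_squares = values.map { |n| (n - mean) ** 2 }.reduce(:+)
  Math.sqrt(sum_of_squares / values.size)
end

def sample_std(values)
  mean = values.reduce(:+) / values.size.to_f
  sum_of_squares = values.map { |n| (n - mean) ** 2 }.reduce(:+)
  Math.sqrt(sum_of_squares / (values.size - 1))  # n - 1 in the denominator
end

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with a mean of 5
population_std(values)  # => 2.0
sample_std(values)      # slightly larger, since the divisor is smaller
```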

These are only a few basic statistics, but they are widely useful and it's good to know how to derive them when we need them.

Happy hacking!
