Sequel, Postgres, JSON

JSON has become the de facto standard data interchange format for web services.

Have you ever found yourself wanting to throw raw JSON service responses into a data store and not worry about mapping them to a local schema and native data types until later on?

Have you balked at the extra complexity of adding a document database to your application?

Happily, if you’re using PostgreSQL, you already have everything you need to store, query, and transform JSON data. In this episode, you’ll learn how to use PostgreSQL JSON queries, in conjunction with the Ruby Sequel gem, to work with JSON data within your existing SQL database.

 

Video transcript & code

Lately I've been playing with some of the newer features in the Postgres database. In fact, today I was having so much fun with it I didn't want to stop to write an episode about something else. So if you'll indulge me, today I want to show you how Postgres handles JSON data. I think this still has a lot of relevance for Ruby programmers, since many Ruby applications process JSON data, and use Postgres as their database backend. And also because we'll be using some more advanced features of the Ruby Sequel gem, which we first met in episode 179.

Let's say we want to perform some analytics on RubyTapas subscriber data. There is an API endpoint which enables us to fetch all subscriber records in JSON format in one big dump. The details of fetching the data are unimportant right now.

require "dotenv"
require "open-uri"
require "json"

Dotenv.load

def get_subscribers
  login = ENV.fetch("DPD_API_LOGIN")
  pass  = ENV.fetch("DPD_API_PASSWORD")
  open("https://api.getdpd.com/v2/storefronts/13159/subscribers",
       http_basic_authentication: [login, pass]) do |body|
    JSON.load(body)
  end
end

Once the data is fetched and parsed, we end up with an array of subscriber records that look something like this. There's a username, an ID, some timestamps, and then a nested subscription object which contains lots of details like the current subscription status.

{"id"=>12,
 "created_at"=>"1348411097",
 "updated_at"=>"1390265677",
 "username"=>"avdi@avdi.org",
 "last_login_at"=>"1390265677",
 "subscription"=>
  {"id"=>670,
   "status"=>"ACTIVE",
   "price"=>"0.00",
   "period"=>0,
   "unit"=>"",
   "tax_amount"=>"0.00",
   "trial_price"=>"0.00",
   "trial_period"=>nil,
   "trial_unit"=>nil,
   "trial_tax_amount"=>"0.00",
   "created_at"=>"1350670867",
   "updated_at"=>"1350670867",
   "started_at"=>"1350670867",
   "ended_at"=>"259200",
   "trial_started_at"=>nil,
   "last_payment_at"=>"1350670867",
   "next_payment_at"=>nil}}

Like I said, we want to perform analytics on this data. And we could just write Ruby scripts that do the analytics in-memory. But there are a lot of records, and this approach isn't particularly performant. Not to mention that simply fetching the data by itself takes several seconds each time, so we'll need to cache it somewhere locally. Plus, what if, as part of our analytics, we want to join this information with data in database tables that we already have? We'd have to do it manually, and that's no fun.

What we'd really like to do is throw this data into a table in our existing Postgres database, so we can apply all the power and speed of that database, as well as relate the data to our other tables. But how do we go about that? We could recreate all of the fields in this data as columns in a database table, and write some code to painstakingly extract, convert, and insert the fields one by one.

But that sounds awfully tedious. And we're not even sure we care about all these fields. On the other hand, we don't want to have to tweak our import code and re-run the import every time we realize we DO want another field.

Fortunately, there's a better approach. As of version 9.3, Postgres has robust native support for JSON data. And the Sequel Ruby library has special extensions just for dealing with JSON data in Postgres.

We require sequel, and tell it to load up some special Postgres/JSON extensions. Then we define a new table. It has only one column, called "subscriber", which has a JSON data type.

We proceed to load data into this table from the subscriber list. We use the #import method, which lets us import data in bulk. We tell it which columns to import into, and give it an array of rows to import. In order to correctly import the data as JSON, we use the Sequel.pg_json method to convert the raw hashes into a special hash-like JSON class that Sequel provides.

require "./util"
require "sequel"
Sequel.extension(:pg_json)
Sequel.extension(:pg_json_ops)

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))
DB.create_table!(:dpd_subscriber_dump) do
  column :subscriber, "json"
end

data = get_subscribers

DB[:dpd_subscriber_dump].import([:subscriber], data.map{|s| [Sequel.pg_json(s)]})

OK, our data has been imported into the database. Now what can we do with it? Let's do a very simple operation: we'll count how many active accounts there are.

We start a query by selecting the table we want. We tack on a where clause, passing a string argument in order to supply some raw SQL for the query.

Instead of just specifying a column in the condition, we specify that the database should look at a specific field inside the JSON data contained in that column. Specifically, it should first look for a subscription key, and then inside that object it should look for a status key.

By the way, I'm not going to spend a lot of time explaining the Postgres JSON operators in this episode. Today I just want to show you what is possible with them. Check out the excellent Postgres documentation to get a full rundown on the various JSON operators and functions that are available.

Once it finds the subscription status by following the specified JSON path, it should check to see if it is "CLOSED", "CANCELED", or "NEW". If it is none of those things, then the account is presently open, and should be counted. We end the query by telling it to return just the count of matching rows.

We execute this, and see a count of just under two thousand subscribers. Which seems like an opportune time for me to pause, and say: thank you very much for supporting what I do here!

require "./util"
require "sequel"
Sequel.extension(:pg_json)

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))
DB[:dpd_subscriber_dump].where("subscriber#>>'{subscription,status}' NOT IN ('NEW', 'CLOSED', 'CANCELED')").count
# => 1950
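
For reference, the #>> operator in that query is one of a small family of Postgres JSON operators. Here's a quick sketch of the ones this episode leans on; DB.fetch simply runs raw SQL and returns rows as hashes, and the exact output will depend on your data.

# subscriber->'subscription'            extracts a nested object, returned as json
# subscriber->>'username'               extracts a single field, returned as text
# subscriber#>>'{subscription,status}'  follows a path, returned as text
DB.fetch("SELECT subscriber->>'username' AS username, " \
         "subscriber#>>'{subscription,status}' AS status " \
         "FROM dpd_subscriber_dump LIMIT 1").first
# e.g. => {:username=>"...", :status=>"ACTIVE"}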

What about accessing individual rows? Let's find the row for my test account and look at it.

require "./util"
require "sequel"
Sequel.extension(:pg_json)

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))
row = DB[:dpd_subscriber_dump].detect{|r| r[:subscriber]["username"] == "avdi@avdi.org"}
row
# => {:subscriber=>
#      {"id"=>12,
#       "created_at"=>"1348411097",
#       "updated_at"=>"1390265677",
#       "username"=>"avdi@avdi.org",
#       "last_login_at"=>"1390265677",
#       "subscription"=>
#        {"id"=>670,
#         "status"=>"ACTIVE",
#         "price"=>"0.00",
#         "period"=>0,
#         "unit"=>"",
#         "tax_amount"=>"0.00",
#         "trial_price"=>"0.00",
#         "trial_period"=>nil,
#         "trial_unit"=>nil,
#         "trial_tax_amount"=>"0.00",
#         "created_at"=>"1350670867",
#         "updated_at"=>"1350670867",
#         "started_at"=>"1350670867",
#         "ended_at"=>"259200",
#         "trial_started_at"=>nil,
#         "last_payment_at"=>"1350670867",
#         "next_payment_at"=>nil}}}

The data that comes out of the subscriber column isn't just a blob of JSON text. Because the Sequel gem understands Postgres JSON columns, it automatically parses the column contents into a Ruby representation of the JSON objects.
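
Because that representation behaves like an ordinary Ruby Hash, we can drill into it with plain hash access. A minimal sketch, reusing the row we just fetched:

subscriber = row[:subscriber]
subscriber["username"]               # => "avdi@avdi.org"
subscriber["subscription"]["status"] # => "ACTIVE"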

We can use JSON path operators like the one we queried with earlier anywhere we'd normally specify columns in SQL. For instance, if we know that we are likely to be looking up users by their usernames, we can alter our schema to add an index on that specific attribute within the JSON data.

As you can see, altering a schema using the Sequel gem is very similar to creating one. We use the Sequel.lit helper here in order to use a literal SQL expression without any of the escaping that the library usually performs.

require "./util"
require "sequel"
Sequel.extension(:pg_json)

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))
DB.alter_table :dpd_subscriber_dump do
  add_index Sequel.lit("(subscriber->>'username')"), name: "username_index"
end

Now Postgres will maintain an index that will accelerate any future queries by subscriber username.
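
To take advantage of it, a query just has to filter on the same expression we indexed. Here's a sketch; whether Postgres actually chooses the index for a given query is up to its planner, but matching the indexed expression is the important part.

DB[:dpd_subscriber_dump]
  .where(Sequel.lit("subscriber->>'username' = ?", "avdi@avdi.org"))
  .first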

Of course, we might not want to litter our program with complex SQL queries that dig into JSON data. We can use a database view to hide all that knowledge about the structure of JSON data. Once again, the Sequel gem makes it easy to communicate our needs to Postgres.

I've prepared a query which will map all the attributes nested in the JSON data structure to column names. I won't go over it in detail, but one of the things that's interesting about this code is that it doesn't just unpack the JSON. It also converts the timestamp attributes to SQL timestamp types, using Postgres casting operators and conversion functions.

We use create_or_replace_view to create a database view based on this select statement.

require "./util"
require "sequel"
Sequel.extension(:pg_json)

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))

SUBSCRIBERS_SELECT = <<-END_SQL
   SELECT
   subscriber->>'username' AS username,
   (subscriber->>'id')::int AS id,
   to_timestamp((subscriber->>'created_at')::int)                           AS created_at,
   to_timestamp((subscriber->>'updated_at')::int)                           AS updated_at,
   to_timestamp((subscriber->>'last_login')::int)                           AS last_login,
   to_timestamp((subscriber#>>'{subscription, created_at}')::int)           AS subscription_created_at,
   to_timestamp((subscriber#>>'{subscription, updated_at}')::int)           AS subscription_updated_at,
   to_timestamp((subscriber#>>'{subscription, started_at}')::int)           AS subscription_started_at,
   to_timestamp((subscriber#>>'{subscription, ended_at}')::int)             AS subscription_ended_at,
   to_timestamp((subscriber#>>'{subscription, trial_started_at}')::int)     AS subscription_trial_started_at,
   to_timestamp((subscriber#>>'{subscription, next_payment_at}')::int)      AS subscription_next_payment_at,
   to_timestamp((subscriber#>>'{subscription, last_payment_at}')::int)      AS subscription_last_payment_at,
   (subscriber#>>'{subscription, status}')                                  AS subscription_status,
   (subscriber#>>'{subscription, trial_price}')::money                      AS subscription_trial_price,
   (subscriber#>>'{subscription, trial_period}')::int                       AS subscription_trial_period,
   (subscriber#>>'{subscription, trial_unit}')::char                        AS subscription_trial_unit,
   (subscriber#>>'{subscription, price}')::money                            AS subscription_price,
   (subscriber#>>'{subscription, period}')::int                             AS subscription_period,
   (subscriber#>>'{subscription, unit}')::char                              AS subscription_unit
   FROM dpd_subscriber_dump
END_SQL

DB.create_or_replace_view(:subscribers, SUBSCRIBERS_SELECT)

Now we can treat the new subscribers view like any other table in the database. We can query it for individual records, using username as a key. We can check the account status, or look at when the account was created. In the background, Postgres is deriving these values from the JSON we imported. But as far as our code is concerned, it's an ordinary table.

require "./util"
require "sequel"
Sequel.extension(:pg_json)

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))
row = DB[:subscribers][username: "avdi@avdi.org"]

row[:subscription_status]       # => "ACTIVE"
row[:created_at]                # => 2012-09-23 10:38:17 -0400
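
And because the view exposes real timestamp and status columns, ordinary SQL aggregation works against it too. As a sketch, here's one way we might count open accounts by signup year; the results will of course depend on the data.

DB[:subscribers]
  .exclude(subscription_status: %w[NEW CLOSED CANCELED])
  .group_and_count(Sequel.function(:date_trunc, "year", :created_at).as(:year))
  .order(:year)
  .all
# => [{:year=>..., :count=>...}, ...]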

This is amazingly powerful stuff. It means we can dump JSON data derived from web services directly into our database, and then incrementally work out how we want to address that data after the fact. We can hide the "shape" of that data from the rest of our programs using views. And we can work very efficiently with that data by indexing selected pieces of it.

In the past, this would have meant adding a document data store like CouchDB or MongoDB to our system. And even then, we would not have been able to efficiently join the imported data with existing tables in our relational database. Using JSON columns in Postgres means we can get many of the advantages of a document store without leaving our relational store. And the Sequel Ruby library can help us make the most of Postgres' JSON capabilities.
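
To make the join point concrete: assuming, hypothetically, that we already had a downloads table with a username column (it's not part of this episode's schema), the view would join to it like any other table.

# Hypothetical: assumes an existing downloads table with a username column.
DB[:subscribers]
  .join(:downloads, username: :username)
  .where(subscription_status: "ACTIVE")
  .group_and_count(Sequel.qualify(:subscribers, :username))
  .all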

That's all for today. Happy hacking!
