
Screen Scraping Gateway

Have you ever needed to incorporate information from a third-party website, only to find that it doesn't offer any kind of API? Sometimes the only way to get at the data you need is good old-fashioned screen-scraping: pulling in webpages intended for human readers, and interpreting them programmatically.

In this episode, you’ll learn how to use the Ruby “Mechanize” gem to easily interact with and extract data from a website. But much more importantly, you’ll learn how to use the Gateway Pattern to isolate the rest of your code from the complexity and inherent fragility of screen-scraping. And you’ll see how to accomplish all this using test-driven development combined with interactive exploratory coding.

Video transcript & code

Today's example is drawn from some backend work I've been doing on the RubyTapas website. All of the subscriber features, such as sending out episode notifications and generating the RSS feed, are handled by a service called DPD. There is currently no API for DPD subscriptions, so if I want to automate anything I need to do it by screen-scraping.

Today I want to lay down some groundwork for pulling data out of DPD. I decide that in order to make a clean separation between my application code and the icky, fragile details of screen-scraping, I want to write a Gateway class. Gateway is a pattern from Patterns of Enterprise Application Architecture, by Martin Fowler. He summarizes the pattern like this:

An object that encapsulates access to an external system or resource.

The DPD website definitely qualifies as an external system or resource, so Gateway seems like a good match for this scenario.

I begin, as usual, with a test. Don't worry too much about the setup boilerplate at the beginning of the test. I'm using Jim Weirich's "Given" extension to RSpec. I'll probably cover this in detail in another episode; for now, suffice it to say that it lets me write my specs using "given/when/then" terminology that you might know from behavior-driven development and tools like Cucumber.

As we discussed in Episode 52, classes at the borders of a system are best tested with integration tests that incorporate the actual external service as much as possible. Accordingly, this test will eschew mocks and stubs, and instead test the Gateway against the actual DPD website using my account credentials.

The terminology that DPD uses for subscription content is "content posts". In my own tools, I usually use the term "episodes" instead. The Gateway I'm about to write is intended to hide the details of getting information out of DPD. However, it is not intended to translate from the DPD domain model into my own preferred application terminology. That's a separate responsibility. So I call the Gateway class ContentPostGateway. Within the class I'll stick to using DPD terminology.

My test calls a method called #content_post_list. When it is finished this method should return a list of hashes, representing the data I see when I log into DPD and look at the main list of episodes.

The test saves the result of the method call, and then makes a series of assertions about the returned data. Since I'm testing against the live service, I've filled this test in with a series of assertions based on real RubyTapas episodes. The returned hashes should contain information on post titles, publish times, URLs, and IDs.

require_relative '../../lib/dpd/content_post_gateway'
require 'rspec-given'
require 'fakeweb'
require 'vcr'
require 'rspec-spies'

VCR.configure do |c|
  c.cassette_library_dir = 'spec/cassettes'
  c.hook_into :fakeweb
  c.default_cassette_options = {
    record: :new_episodes
  }
  c.configure_rspec_metadata!
end


module DPD
  describe ContentPostGateway, vcr: true do
    Given(:login) { ENV.fetch('DPD_ADMIN_LOGIN') }
    Given(:password) { ENV.fetch('DPD_ADMIN_PASSWORD') }
    Given(:gateway) { ContentPostGateway.new(login, password) }

    describe '#content_post_list' do
      When(:content_posts) { gateway.content_post_list }
      Then {
        content_posts[0][:title].should eq('001 Binary Literals')
      }
      And {
        content_posts[0][:published_at].should eq(Time.new(2012, 9, 24, 9, 00))
      }
      And {
        content_posts[0][:show_url].should eq('https://getdpd.com/plan/showpost/10?post_id=18')
      }
      And {
        content_posts[0][:id].should eq(18)
      }
      And {
        content_posts[3][:title].should eq('004 Barewords')
      }
      And {
        content_posts[3][:published_at].should eq(Time.new(2012, 10, 1, 9, 00))
      }
      And {
        content_posts[3][:show_url].should eq('https://getdpd.com/plan/showpost/10?post_id=26')
      }
      And {
        content_posts[3][:id].should eq(26)
      }
    end
  end
end

With a test in place, it's time to start implementation. Since this is a screen-scraping gateway, I'll be using the Mechanize screen-scraping library. I write a basic initializer method, and then start in on the method under test.

Here, I start by instantiating a Mechanize agent object. This object will enable me to interact with web pages programmatically. I tell it to request the DPD login page. Then I create an instance of the gateway class and check whether what I've written so far works. It returns a Mechanize::Page object, so I'm on the right track.

require 'mechanize'

module DPD
  class ContentPostGateway
    def initialize(login, password)
      @login    = login
      @password = password
    end

    def content_post_list
      agent      = Mechanize.new
      login_page = agent.get('https://getdpd.com/login') # !> assigned but unused variable - login_page
    end
  end
end

gw = DPD::ContentPostGateway.new(ENV['DPD_ADMIN_LOGIN'], ENV['DPD_ADMIN_PASSWORD'])
gw.content_post_list            # => #<Mechanize::Page

One thing I quickly realize is that implementing this method is going to take a lot of trial and error while hitting the DPD website. This is going to make for frustratingly long turnaround time every time I want to try out what I've written. I've already configured my tests to use VCR to ameliorate this problem; I take a moment now to add VCR to my in-progress implementation as well. VCR will interpose itself in between my code and the DPD website, recording every request and response to a YAML file. For future requests it'll use this YAML file like a local browser cache, keeping me from having to wait for the DPD servers to respond to requests I've already made.

require 'mechanize'
require 'fakeweb'
require 'vcr'

VCR.configure do |c|
  c.cassette_library_dir = 'cassettes'
  c.hook_into :fakeweb
  c.default_cassette_options = {
    record: :new_episodes
  }
end

module DPD
  class ContentPostGateway
    def initialize(login, password)
      @login    = login
      @password = password
    end

    def content_post_list
      agent      = Mechanize.new
      login_page = agent.get('https://getdpd.com/login') # !> assigned but unused variable - login_page
    end
  end
end

gw = DPD::ContentPostGateway.new(ENV['DPD_ADMIN_LOGIN'], ENV['DPD_ADMIN_PASSWORD'])

VCR.use_cassette 'scratch' do
  gw.content_post_list            # => #<Mechanize::Page
end
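
To make the record-and-replay idea concrete, here is a toy sketch of it using only Ruby's standard library. This is not how VCR is implemented. ToyCassette and its #fetch method are hypothetical names invented for illustration, and the block passed to #fetch stands in for a real HTTP request.

```ruby
require 'yaml'
require 'tmpdir'

# Toy record/replay cache. An illustration of the concept only; VCR itself
# hooks into the HTTP library and records far more detail than this.
class ToyCassette
  attr_reader :live_calls

  def initialize(path)
    @path       = path
    @recordings = File.exist?(path) ? YAML.load_file(path) : {}
    @live_calls = 0
  end

  # Returns the recorded response for url if there is one. Otherwise yields
  # to perform the "real" request, records the result, and saves it to disk.
  def fetch(url)
    return @recordings[url] if @recordings.key?(url)

    @live_calls += 1
    @recordings[url] = yield
    File.write(@path, YAML.dump(@recordings))
    @recordings[url]
  end
end

path = File.join(Dir.mktmpdir, 'scratch.yml')

# First run: nothing recorded yet, so the block (the slow request) is called.
first = ToyCassette.new(path)
first.fetch('https://example.com/login') { '<html>login page</html>' }
puts first.live_calls    # prints 1

# Second run: the response is replayed from the YAML file; the block never runs.
second = ToyCassette.new(path)
puts second.fetch('https://example.com/login') { raise 'network should not be hit' }
puts second.live_calls   # prints 0
```

The second run is the case that matters for exploratory coding: repeated experiments against the same page cost a file read instead of a round trip to the server.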

Now it's time to log in to the DPD site. From the login_page I extract the login form by looking for a form whose action is /login. I know what to look for because I used my web browser to inspect the DPD login page source HTML. Then I use convenience methods provided by Mechanize to fill in the login and password, and submit the form. The return value of the submit should be a new Page object representing the DPD Dashboard. Just to verify it worked, I add a little assertion that the title of the page begins with the word "Dashboard", raising an error if it doesn't.

require 'mechanize'
require 'fakeweb'
require 'vcr'

VCR.configure do |c|
  c.cassette_library_dir = 'cassettes'
  c.hook_into :fakeweb
  c.default_cassette_options = {
    record: :new_episodes
  }
end

module DPD
  class ContentPostGateway
    def initialize(login, password)
      @login    = login
      @password = password
    end

    def content_post_list
      agent         = Mechanize.new
      login_page    = agent.get('https://getdpd.com/login')
      form          = login_page.form_with(action: '/login')
      form.username = @login
      form.password = @password
      home_page     = agent.submit(form)
      unless home_page.title =~ /^Dashboard/
        raise "DPD admin session login failed for user #{@login}"
      end
      home_page
    end
  end
end

gw = DPD::ContentPostGateway.new(ENV['DPD_ADMIN_LOGIN'], ENV['DPD_ADMIN_PASSWORD'])

VCR.use_cassette 'scratch' do
  gw.content_post_list            # => #<Mechanize::Page
end

Next up, I request the "plan" page, which among other things contains the master list of content posts. Now I need to locate the episode table on the page. This is a little tricky, because there is no special class or ID identifying this table.

However, I do know that the table starts with a header row that has column headers for "Name" and "Release Date". So I search the page for all table tags. Then I use #detect to search this list of tables. Inside the #detect block, I collect a list of table headers using #map and the #text method provided by each HTML node. Then I compare the list of headers to the headers I'm looking for: 'Name' and 'Release Date'. When I find a table with exactly these headers, I know I've found my target.

I test this out, and sure enough the resulting HTML fragment contains the list of episodes.

require 'mechanize'
require 'fakeweb'
require 'vcr'

VCR.configure do |c|
  c.cassette_library_dir = 'cassettes'
  c.hook_into :fakeweb
  c.default_cassette_options = {
    record: :new_episodes
  }
end

module DPD
  class ContentPostGateway
    def initialize(login, password)
      @login    = login
      @password = password
    end

    def content_post_list
      agent         = Mechanize.new
      login_page    = agent.get('https://getdpd.com/login')
      form          = login_page.form_with(action: '/login')
      form.username = @login
      form.password = @password
      home_page     = agent.submit(form)
      unless home_page.title =~ /^Dashboard/
        raise "DPD admin session login failed for user #{@login}"
      end
      list_page = agent.get('https://getdpd.com/plan')
      content_post_table = list_page.search('table').detect { |t|
        headings = t.search('th').map(&:text)
        headings == ['Name', 'Release Date']
      }
      content_post_table
    end
  end
end

gw = DPD::ContentPostGateway.new(ENV['DPD_ADMIN_LOGIN'], ENV['DPD_ADMIN_PASSWORD'])

VCR.use_cassette 'scratch' do
  puts gw.content_post_list
end
# >> <table class="methodtable">
# >> <thead><tr>
# >> <th width="50%">Name</th>
# >>     <th width="30%">Release Date</th>
# >>   </tr></thead>
# >> <tbody>
# >> <tr class="methodrow">
# >> <td>
# >> <a href="/plan/showpost/10?post_id=27">005 Array Literals</a>      </td>
# >>       <td>
# >>         Oct 3, 2012 9:00am      </td>
# >>     </tr>
# >> <tr class="methodrow">
# >> <td>
# >> <a href="/plan/showpost/10?post_id=26">004 Barewords</a>      </td>
# >>       <td>
# >>         Oct 1, 2012 9:00am      </td>
# >>     </tr>
# >> <tr class="methodrow">
# >> <td>
# >> <a href="/plan/showpost/10?post_id=21">003 Character Literals</a>      </td>
# >>       <td>
# >>         Sep 28, 2012 9:00am      </td>
# >>     </tr>
# >> <tr class="methodrow">
# >> <td>
# >> <a href="/plan/showpost/10?post_id=20">002 Large Integer Literals</a>      </td>
# >>       <td>
# >>         Sep 26, 2012 9:00am      </td>
# >>     </tr>
# >> <tr class="methodrow">
# >> <td>
# >> <a href="/plan/showpost/10?post_id=18">001 Binary Literals</a>      </td>
# >>       <td>
# >>         Sep 24, 2012 9:00am      </td>
# >>     </tr>
# >> <tr>
# >> <td colspan="3" style="text-align:right">
# >>       <a class="btn btn-primary" href="/plan/editpost/10"><i class="icon-plus icon-white"></i> Post New Content</a>    </td>
# >>   </tr>
# >> </tbody>
# >> </table>
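
The header-matching step can also be tried in isolation. The sketch below is a hedged illustration rather than part of the gateway: FakeTable is a hypothetical stand-in for a parsed table node, and find_table_by_headings and TableNotFound are names invented here. It also adds one thing the gateway code does not do, which is to raise a descriptive error when no table matches, since #detect returns nil in that case.

```ruby
# Hypothetical stand-in for a parsed <table> node: a label for display plus
# the raw text of each <th> cell. (The real code gets these from Mechanize.)
FakeTable = Struct.new(:label, :heading_texts)

# Hypothetical error class for "the page layout is not what we expected".
class TableNotFound < StandardError; end

# Returns the first table whose headings, ignoring surrounding whitespace,
# equal `wanted`. Raises TableNotFound instead of returning nil, so that a
# page layout change fails at the point of detection.
def find_table_by_headings(tables, wanted)
  table = tables.detect { |t| t.heading_texts.map(&:strip) == wanted }
  raise TableNotFound, "no table with headings #{wanted.inspect}" if table.nil?

  table
end

tables = [
  FakeTable.new('sales',   ['Product', 'Price']),
  FakeTable.new('posts',   ["\n  Name ", 'Release Date']),
  FakeTable.new('members', ['Name', 'Email'])
]

puts find_table_by_headings(tables, ['Name', 'Release Date']).label   # prints posts
```

Because the comparison is an exact match on the whole heading list, the 'members' table is skipped even though it shares a "Name" column.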

Next I select just the body row tags in this table. A little poking around reveals two interesting facts. First, the rows are in reverse-chronological order, with the most recent episode first. Second, the very last row in the table isn't a content post listing at all; instead, it contains a button to add a new episode.

require 'mechanize'
require 'fakeweb'
require 'vcr'

VCR.configure do |c|
  c.cassette_library_dir = 'cassettes'
  c.hook_into :fakeweb
  c.default_cassette_options = {
    record: :new_episodes
  }
end

module DPD
  class ContentPostGateway
    def initialize(login, password)
      @login    = login
      @password = password
    end

    def content_post_list
      agent         = Mechanize.new
      login_page    = agent.get('https://getdpd.com/login')
      form          = login_page.form_with(action: '/login')
      form.username = @login
      form.password = @password
      home_page     = agent.submit(form)
      unless home_page.title =~ /^Dashboard/
        raise "DPD admin session login failed for user #{@login}"
      end
      list_page = agent.get('https://getdpd.com/plan')
      content_post_table = list_page.search('table').detect { |t|
        headings = t.search('th').map(&:text)
        headings == ['Name', 'Release Date']
      }
      content_post_rows  = content_post_table.search('tbody tr')
      content_post_rows[-3..-1]
    end
  end
end

gw = DPD::ContentPostGateway.new(ENV['DPD_ADMIN_LOGIN'], ENV['DPD_ADMIN_PASSWORD'])

VCR.use_cassette 'scratch' do
  puts gw.content_post_list
end
# >> <tr class="methodrow">
# >> <td>
# >> <a href="/plan/showpost/10?post_id=20">002 Large Integer Literals</a>      </td>
# >>       <td>
# >>         Sep 26, 2012 9:00am      </td>
# >>     </tr>
# >> <tr class="methodrow">
# >> <td>
# >> <a href="/plan/showpost/10?post_id=18">001 Binary Literals</a>      </td>
# >>       <td>
# >>         Sep 24, 2012 9:00am      </td>
# >>     </tr>
# >> <tr>
# >> <td colspan="3" style="text-align:right">
# >>       <a class="btn btn-primary" href="/plan/editpost/10"><i class="icon-plus icon-white"></i> Post New Content</a>    </td>
# >>   </tr>

With this in mind, I proceed to finish the method. I select all but the last row. Then I reverse the list, so that the oldest post comes first, and map each row of the reversed list to a hash of values.
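
The slice-reverse-map chain is easy to check on plain data. In this minimal sketch the rows are ordinary strings standing in for table rows, and chronological_posts is a hypothetical helper name used only for the illustration.

```ruby
# Drops the trailing non-post row, flips newest-first into oldest-first,
# and maps each remaining row to a hash.
def chronological_posts(rows)
  rows[0..-2].reverse_each.map { |row| { title: row } }
end

# Stand-in rows: newest first, with a button row at the very end.
rows = [
  '005 Array Literals',
  '004 Barewords',
  '003 Character Literals',
  '[Post New Content button]'
]

chronological_posts(rows).each { |post| puts post[:title] }
# prints:
#   003 Character Literals
#   004 Barewords
#   005 Array Literals
```

Calling #reverse_each without a block returns an enumerator, so the reversal and the mapping happen in a single pass without building an intermediate reversed array.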

Inside the map, I grab a list of columns. From here it's mostly a matter of mapping from columns to named values.

  • I extract the title from the first column.
  • I grab the publish time from the second column.
  • I pull the post URL path from the A tag inside the first column, and then join it to a base URI to get a fully-qualified URL.
  • I extract the post ID out of the show URL with a regular expression.

Finally, I put all these values into a hash. I also convert them to suitable built-in types where appropriate.
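
The conversions in the steps above can be sketched on their own, using raw strings shaped like the cell contents shown earlier. The helper name build_post and the BASE_URL constant are assumptions made for this illustration; the calls themselves (URI.join, String#[] with a regular expression and capture index, Time.parse) are standard Ruby.

```ruby
require 'uri'
require 'time'   # provides Time.parse

BASE_URL = 'https://getdpd.com'

# Converts the raw strings from one table row into a hash of typed values.
# Assumes href carries a post_id query parameter, as the DPD links above do.
def build_post(title_text, date_text, href)
  show_url = URI.join(BASE_URL, href)
  {
    title:        title_text.strip,
    published_at: Time.parse(date_text.strip),           # parsed as local time
    show_url:     show_url.to_s,
    id:           show_url.query[/post_id=(\d+)/, 1].to_i # first capture group
  }
end

post = build_post("\n001 Binary Literals      ",
                  "\n        Sep 24, 2012 9:00am      ",
                  '/plan/showpost/10?post_id=18')

puts post[:title]      # prints 001 Binary Literals
puts post[:show_url]   # prints https://getdpd.com/plan/showpost/10?post_id=18
puts post[:id]         # prints 18
```

Keeping show_url as a URI object until the end is what makes the ID extraction simple: the regular expression only has to look at the query string, not the whole URL.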

Testing this out, I find I have an array of content post information. I run the automated test that I started out with, and find that it now passes.

(Let me just say for the record: during ordinary development I would start with a much smaller test and develop it incrementally in tandem with the implementation code. I've simplified my workflow for the purpose of this video.)

require 'mechanize'
require 'time' # Time.parse is defined by the 'time' standard library
require 'fakeweb'
require 'vcr'

VCR.configure do |c|
  c.cassette_library_dir = 'cassettes'
  c.hook_into :fakeweb
  c.default_cassette_options = {
    record: :new_episodes
  }
end

module DPD
  class ContentPostGateway
    def initialize(login, password)
      @login    = login
      @password = password
    end

    def content_post_list
      agent         = Mechanize.new
      login_page    = agent.get('https://getdpd.com/login')
      form          = login_page.form_with(action: '/login')
      form.username = @login
      form.password = @password
      home_page     = agent.submit(form)
      unless home_page.title =~ /^Dashboard/
        raise "DPD admin session login failed for user #{@login}"
      end
      list_page = agent.get('https://getdpd.com/plan')
      content_post_table = list_page.search('table').detect { |t|
        headings = t.search('th').map(&:text)
        headings == ['Name', 'Release Date']
      }
      content_post_rows  = content_post_table.search('tbody tr')
      content_post_rows[0..-2].reverse_each.map { |row|
        columns      = row.search('td')
        title        = columns[0].text.strip
        published_at = columns[1].text.strip
        show_path    = columns[0].at('a')['href']
        show_url     = URI.join('https://getdpd.com', show_path)
        id           = show_url.query[/post_id=(\d+)/, 1]
        {
            title:        title,
            published_at: Time.parse(published_at),
            show_url:     show_url.to_s,
            id:           id.to_i
        }
      }
    end
  end
end

gw = DPD::ContentPostGateway.new(ENV['DPD_ADMIN_LOGIN'], ENV['DPD_ADMIN_PASSWORD'])

VCR.use_cassette 'scratch' do
  gw.content_post_list
  # => [{:title=>"001 Binary Literals",
  #      :published_at=>2012-09-24 09:00:00 -0400,
  #      :show_url=>"https://getdpd.com/plan/showpost/10?post_id=18",
  #      :id=>18},
  #     {:title=>"002 Large Integer Literals",
  #      :published_at=>2012-09-26 09:00:00 -0400,
  #      :show_url=>"https://getdpd.com/plan/showpost/10?post_id=20",
  #      :id=>20},
  #     {:title=>"003 Character Literals",
  #      :published_at=>2012-09-28 09:00:00 -0400,
  #      :show_url=>"https://getdpd.com/plan/showpost/10?post_id=21",
  #      :id=>21}]
end

Everything I've extracted so far is still within the DPD data schema. I resist the temptation to perform extra work here like separating the post titles into episode number and episode name. This gateway is just about encapsulating the complexity of scraping the DPD site. Mappings to RubyTapas-specific concepts like "episode number" are beyond its responsibilities.

Client code that uses this gateway class will still have to understand the DPD schema. But it won't have to know anything about how to extract that data. And since this class returns nothing but "plain old data", it's very easy to stub out the Gateway when testing code which collaborates with it.
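
As a rough illustration of that last point, here is a hand-rolled stub standing in for the gateway. PostTitleReport and StubGateway are hypothetical classes, not part of the RubyTapas code; the only thing taken from the episode is the shape of the hashes that #content_post_list returns.

```ruby
# Hypothetical collaborator: works with anything that responds to
# #content_post_list and returns hashes in the gateway's format.
class PostTitleReport
  def initialize(gateway)
    @gateway = gateway
  end

  def lines
    @gateway.content_post_list.map { |post| "#{post[:id]}: #{post[:title]}" }
  end
end

# Hand-rolled stub: no network and no Mechanize, just canned plain data.
class StubGateway
  def content_post_list
    [
      { title:        '001 Binary Literals',
        published_at: Time.new(2012, 9, 24, 9, 0),
        show_url:     'https://getdpd.com/plan/showpost/10?post_id=18',
        id:           18 }
    ]
  end
end

report = PostTitleReport.new(StubGateway.new)
puts report.lines   # prints 18: 001 Binary Literals
```

Because the gateway's return value is nothing but arrays, hashes, and built-in types, the stub needs no knowledge of Mechanize page or node objects.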

In another episode I'll tackle the task of mapping this "plain old data" to domain objects. For now, so long, and happy hacking!
