
SAX

Video transcript & code

Sometimes I set out to write a new episode, and halfway through I realize that the background info warrants an episode of its own. That's the case with today's screencast. I started working on an episode about refactoring a SAX parser when I realized that not everyone knows what a SAX parser is. Since it's common to need to parse HTML and XML in a wide variety of software projects, I thought I'd go ahead and cover that here.

SAX, which stands for "Simple API for XML", is a language-agnostic model for efficiently processing XML and HTML documents. Rather than parsing the whole document into memory, a SAX parser is given an event handler object, and as it comes across XML elements it sends messages to callback methods on that handler. This can be especially handy for very large documents, which would otherwise take up a lot of space if they were parsed into a full Document Object Model (DOM) representation in memory. SAX lets us process just the data we care about.
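To make that concrete, here's a bare-bones sketch of a handler that simply prints each event as the parser reports it. (The EventLogger name and the sample fragment are arbitrary, just for illustration.)

require 'nokogiri'

# A minimal handler that logs every SAX event it receives.
class EventLogger < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts "start: #{name} #{attrs.inspect}"
  end

  def characters(text)
    puts "text:  #{text.inspect}"
  end

  def end_element(name)
    puts "end:   #{name}"
  end
end

Nokogiri::XML::SAX::Parser.new(EventLogger.new).parse("<greeting lang='en'>hi</greeting>")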

Just to demonstrate, let's write a tiny SAX event handler and feed it a little XML fragment. Our event handler will incrementally build up a tree of hashes from the XML data as it is parsed.

Our SAX event handler inherits from the Nokogiri::XML::SAX::Document class, which is a little confusing, since it really isn't a document.

We'll provide an attribute reader to get at the root hash.

In the initializer, we'll set a @current_element variable to an empty hash, and make that the current root of the hash tree.

The start_element method will be called whenever the parser encounters the beginning of an XML element, and it will receive the name of the element and an array of attribute key/value pairs. In it, we instantiate a new hash with keys for the name and the parent hash in the tree. Then we add it to the children key of the current hash. Finally, we set the current element hash to be this new hash.

When the parser reaches the end of an element, our handler's end_element method will be called. In it, we pop the stack, so to speak, and reset the current element back up to the parent of the current element.

When the parser sees plain text in the document, it will send the #characters message. We implement this method to add the text to a :content attribute of the current hash.

And that's it! To use our new SAX event handler, we instantiate it and pass it into a new Nokogiri SAX parser. Then we feed some XML into it… and when it's done, we get a nice tree of hashes out of it.

require 'nokogiri'
require 'pp'

class XmlToHash < Nokogiri::XML::SAX::Document
  attr_reader :root

  def initialize
    super
    @current_element = {}
    @root = @current_element
  end

  def start_element(name, attrs=[])
    element = {name: name, parent: @current_element}
    (@current_element[:children] ||= []) << element
    @current_element = element
  end

  def end_element(name)
    @current_element = @current_element[:parent]
  end

  def characters(chars)
    (@current_element[:content] ||= "") << chars
  end
end

xml_to_hash = XmlToHash.new
parser = Nokogiri::XML::SAX::Parser.new(xml_to_hash)
xml = <<EOF
<html>
  <body>
    <h1>Hello, world!</h1>
  </body>
</html>
EOF

parser.parse(xml)
pp xml_to_hash.root
# >> {:children=>
# >>   [{:name=>"html",
# >>     :parent=>{...},
# >>     :content=>"\n  \n",
# >>     :children=>
# >>      [{:name=>"body",
# >>        :parent=>{...},
# >>        :content=>"\n    \n  ",
# >>        :children=>
# >>         [{:name=>"h1", :parent=>{...}, :content=>"Hello, world!"}]}]}]}

Now let's apply this to a slightly more real-world example. Suppose we are building the next Google, and so we are scraping web pages to find all the external links on them.

Here's the simplest implementation of a link scraper. We read in the content of the page, then parse it into a document object model. We initialize a Set object for the URLs. We use a Set because we don't want to store duplicates.

Then we use Nokogiri's CSS selector support to find all the A tags and extract their HREF attributes, keeping only the ones that point to fully-qualified http or https URLs.

require 'nokogiri'
require 'set'
require 'open-uri'
require 'pp'

text  = open('http://www.rubyrogues.com').read
doc   = Nokogiri::HTML(text)
links = Set.new
doc.css('a').each do |elt|
  if elt['href'] =~ /\Ahttps?:\/\//
    links << elt['href']
  end
end
pp links

After putting this in production for a while, we realize that certain pages are driving memory use through the roof, such as sites where someone has put the entire text of a book on a single page. Very inconsiderate of them!

In order to deal gracefully with large documents, we switch to SAX parsing. We create a handler class, which we call LinkCollector. It contains a Set of links. When it receives the #start_element message, it checks whether the element is an A tag and, if so, whether it links to a fully-qualified URL. If it does, we add the URL to the set.

To use the LinkCollector, we instantiate one, feed it into a SAX parser object, and tell the parser to start parsing. When it is done, we can ask the link collector for the links it collected.

require 'nokogiri'
require 'set'
require 'open-uri'
require 'pp'

class LinkCollector < Nokogiri::XML::SAX::Document
  attr_reader :links

  def initialize
    super
    @links = Set.new
  end

  def start_element(name, attribute_list=[])
    if name == "a"
      if href = attribute_list.assoc("href")
        link = href.last
        if link =~ /\Ahttps?:\/\//
          @links << link
        end
      end
    end
  end
end 

text      = open('http://www.objectsonrails.com').read
collector = LinkCollector.new
parser    = Nokogiri::HTML::SAX::Parser.new(collector)
parser.parse(text)
pp collector.links

"But wait!" I hear you objecting. "We might not be holding the DOM in memory anymore, but we're still reading the entire raw page body into memory!"

This is true, and that's what we'll tackle next. To do this, we switch from using a regular SAX parser to using a push parser. A push parser can parse documents in bits and pieces, as they become available, rather than expecting the full input text at once.

Since the web is a messy place, we also configure the push parser to recover from any errors it finds in the document rather than raising an exception.

Then we use Net::HTTP to start a connection. We send the get message to the connection in order to retrieve the page, but instead of using the result value of the get, we pass it a block. The block will receive chunks of body text as they are downloaded. Each time we get a chunk, we push it into the push parser.

Finally, once the HTTP request is finished, we tell the push parser that it is done. All that's left is to ask our LinkCollector, which we didn't have to change at all, for the list of URLs it found.

require 'nokogiri'
require 'set'
require 'net/http'
require 'pp'

class LinkCollector < Nokogiri::XML::SAX::Document
  attr_reader :links

  def initialize
    super
    @links = Set.new
  end

  def start_element(name, attribute_list=[])
    if name == "a"
      if href = attribute_list.assoc("href")
        link = href.last
        if link =~ /\Ahttps?:\/\//
          @links << link
        end
      end
    end
  end
end 

collector = LinkCollector.new
uri       = URI('http://objectsonrails.com')
parser    = Nokogiri::HTML::SAX::PushParser.new(collector)
parser.options = Nokogiri::XML::ParseOptions::RECOVER
Net::HTTP.start(uri.host, uri.port) do |http|
  http.get('/', {'Accept-Encoding' => ''}) do |chunk|
    parser << chunk
  end
end
parser.finish
pp collector.links

This is the most verbose solution yet, but the advantage is that its memory use no longer scales with the size of the page. Since we only process one chunk of text at a time, we can handle arbitrarily large documents.

I want to end this episode with a pretty big caveat: it's easy to make assumptions about what code will perform better or worse. As I was writing the example code for this episode I did some memory and speed profiling, and found that even when downloading the entire text of my book Objects on Rails from the web and representing it as a DOM tree, the program was not significantly slower or larger than the fully streaming version. In fact, the DOM version was a little faster.

Nokogiri is built on libxml2, which is a very fast, very efficient XML library. There is probably a document size past which it becomes more efficient to process the document bit by bit rather than all at once, but it's a pretty darn big size. As always, make sure you profile your code when experimenting with optimization.
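If you want to run that kind of comparison yourself, a rough sketch might look like the one below. It assumes you've saved a large page to a local file (the page.html name is just a placeholder) and that the LinkCollector class from earlier is already loaded; the timings it prints are only indicative.

require 'benchmark'
require 'nokogiri'
require 'set'

# Hypothetical local copy of a large page; LinkCollector is the handler
# class defined earlier in this episode.
text = File.read('page.html')

Benchmark.bm(5) do |bm|
  bm.report('DOM') do
    Nokogiri::HTML(text).css('a').map { |a| a['href'] }
  end
  bm.report('SAX') do
    Nokogiri::HTML::SAX::Parser.new(LinkCollector.new).parse(text)
  end
end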

Happy hacking!
