
Nokogiri

Sometimes you just need to munge some XML. And Nokogiri makes processing massive quantities of XML easy and blazingly fast. Here’s a quick tutorial to get you started!

Video transcript & code

If you do a lot of Rails development, you might know Nokogiri simply as "that dependency that takes a long time to compile during bundle install".

Although, before anyone chucks a bucket of White Claw at me, I should note that these days Nokogiri has precompiled gems for several platforms. If you're still waiting a long time for it to install, you should probably update your Gemfile.

Anyway.

Nokogiri is more than just a Rails dependency. Sometimes you need to work with XML data, and Nokogiri is actually pretty great for that.

Let's say we've got a big dump of data in XML form. Some of the entries in the XML represent screencast episodes.

<item>
  <title>Episode #429: Oga</title>
  <link>https://www.rubytapas.com/2016/07/25/episode-429-oga/</link>
  <pubDate>Mon, 25 Jul 2016 13:00:27 +0000</pubDate>
  ...

And let's say we need to make various bulk updates to this data.

For instance, take this postmeta element here.

  <wp:postmeta>
    <wp:meta_key>episode_number</wp:meta_key>
    <wp:meta_value><![CDATA[429]]></wp:meta_value>
  </wp:postmeta>

We want to keep these particular postmeta entries in the updated XML,

but we want to change their keys to be tapas_episode_number.

We also want to take the metadata value and copy it into the wp:menu_order element associated with the episode.

Let's write a tiny script to do this using Nokogiri.

We create a Nokogiri XML document out of the entire input file contents.

Nokogiri supports a couple of different ways to query XML. We're going to use the XPath query language.

We use the xpath method to query for item elements. The // at the start of this XPath expression means the elements can be found at any depth in the XML tree. It's basically just a shortcut to avoid figuring out the names of the parent nodes.

We only care about elements that contain a wp:postmeta element which in turn contains a wp:meta_key element whose contents are equal to episode_number.

For each of these episode items...

We grab a reference to the postmeta section that we are interested in by again looking for the relevant meta_key. This time we use at_xpath, which returns a single XML element (the first match) rather than a list, or nil if nothing matches.

Then, we grab the meta_key child of that element, and update its content to the string "tapas_episode_number".

Next we get the episode number by finding the wp:meta_value child element and taking its contents.

And finally we look up the wp:menu_order child of the episode element, and update its contents with the episode number.

With our munging done, we open up an output file...

...and we write the updated XML out to it.

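require "nokogiri"

# Parse the whole archive into an in-memory XML document.
doc = Nokogiri::XML(File.read("episodes-archive.xml"))

# Visit every item that carries an episode_number postmeta entry.
doc.xpath("//item[wp:postmeta/wp:meta_key = 'episode_number']").each do |episode|
  postmeta = episode.at_xpath("wp:postmeta[wp:meta_key = 'episode_number']")
  # Rename the key, then copy the episode number into wp:menu_order.
  postmeta.at_xpath("wp:meta_key").content = "tapas_episode_number"
  number = postmeta.at_xpath("wp:meta_value").content.to_i
  episode.at_xpath("wp:menu_order").content = number.to_s
end

# Write the updated document out to a new file.
File.open("output.xml", "w") do |out|
  doc.write_xml_to(out)
end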

On the command line, we execute this script.

$ ruby update-metadata.rb

It takes all of maybe one second to process 24 megabytes of data and over six hundred episode items.

And that's one of the things that really shines about Nokogiri. Under the hood, it uses the blazingly fast platform-native libxml2 library, which is also why Nokogiri has to be compiled from C when no precompiled gem is available for your platform.

As long as we lean on this optimization by feeding XPaths or CSS selectors to Nokogiri that libxml can optimize, we can operate on massive quantities of XML data in the blink of an eye.

Now if you're deeply familiar with XML, you might point out that an even more XML-native way to handle these kinds of bulk transformations is with XSLT. And you'd be absolutely right. If we'd done this with XSLT, it probably would have been even faster.

But XSLT is a whole other programming language to master. I really like the midpoint that Nokogiri offers: we can rummage through the XML with highly-optimized XPath expressions or CSS selectors. And once we find what we want, we can operate on it in good ol' Ruby.

And that's all for today. Happy hacking!
