XML and RSS Processing in Ruby
This guide covers processing XML (Extensible Markup Language) and RSS (Really Simple Syndication) in Ruby, focusing on the REXML standard library and specialized gems such as XmlSimple and Simple RSS. XML is the most widely used text format for describing structured data and is used for configuration files, messaging, and storage, whereas HTML is mostly used for text-based web publications. RSS is a specialized XML format frequently used to publish lists of items from websites.
Processing XML Documents
There are multiple ways to work with XML in Ruby. The most common is the REXML library, which ships with Ruby (as a bundled gem since Ruby 3.0).
REXML and Document Parsing APIs
The two main XML manipulation APIs are SAX (Simple API for XML) and DOM (Document Object Model), both of which REXML supports.
DOM-style (Tree Parsing): The REXML::Document class parses the whole XML document into a nested tree of objects (such as Element and Text). You can then navigate the tree with XPath queries or Ruby accessors. This method works well for smaller documents, but it may use a lot of RAM for larger files.
SAX-style (Stream Parsing): With REXML::Parsers::StreamParser (or the REXML::Document.parse_stream shortcut), you can process a document as it is being parsed. The parser calls listener methods such as tag_start and tag_end for each event. Because it never builds an in-memory structure representing the whole document, this approach is faster and uses less memory than DOM-style parsing when you are only interested in part of the document.
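The stream-parsing approach can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the FoodListener class and the sample freezer document are made up for the example, and only the tag_start, text, and tag_end callbacks of REXML::StreamListener are used.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that collects the text of every <food> element.
class FoodListener
  include REXML::StreamListener

  attr_reader :foods

  def initialize
    @foods = []
    @in_food = false
  end

  # Called for each opening tag.
  def tag_start(name, _attrs)
    @in_food = true if name == 'food'
  end

  # Called for each run of character data.
  def text(value)
    @foods << value if @in_food
  end

  # Called for each closing tag.
  def tag_end(name)
    @in_food = false if name == 'food'
  end
end

xml = <<~XML
  <freezer temp="-12">
    <food>Phyllo dough</food>
    <food>Ice cream</food>
  </freezer>
XML

listener = FoodListener.new
REXML::Document.parse_stream(xml, listener)
puts listener.foods.inspect
```

The listener keeps only the state it needs (a flag and an array), so memory use stays small no matter how large the input document is.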
Checking Well-Formedness
The preferred way to find out whether an XML document is well-formed (structured correctly, e.g., every opening tag has a matching closing tag) is to try to parse it with REXML and rescue the exception raised on failure.
Code Example 1: Checking XML Well-Formedness
The valid_xml? method attempts to parse the XML string and, if successful, returns the parsed REXML::Document object; if a REXML::ParseException occurs, it returns nil.
require 'rexml/document'

def valid_xml?(xml)
  REXML::Document.new(xml)
rescue REXML::ParseException
  # Return nil if the document is not well-formed
  nil
end

good_xml = %{
<groceries>
  <bread>Wheat</bread>
</groceries>
}

doc = valid_xml?(good_xml) # returns REXML::Document object if valid, or nil if invalid
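To see the failure path as well, the following sketch repeats the valid_xml? method from Code Example 1 and feeds it a document whose closing tags do not match. The bad_xml string is invented for illustration.

```ruby
require 'rexml/document'

# Same method as in Code Example 1, repeated so this snippet runs on its own.
def valid_xml?(xml)
  REXML::Document.new(xml)
rescue REXML::ParseException
  nil
end

good_xml = '<groceries><bread>Wheat</bread></groceries>'

# Hypothetical malformed input: <bread> is never closed.
bad_xml = '<groceries><bread>Wheat</groceries>'

good_result = valid_xml?(good_xml)
bad_result = valid_xml?(bad_xml)

puts good_result ? 'good_xml is well-formed' : 'good_xml is NOT well-formed'
puts bad_result ? 'bad_xml is well-formed' : 'bad_xml is NOT well-formed'
```

Because the method returns either a document or nil, its result can be used directly in a conditional.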
Extracting Data by Tree Structure
Once the XML has been loaded into a REXML::Document object, you can navigate the resulting tree. The Ruby code frequently mirrors the structure of the document. The Element#each_element method is particularly helpful for iterating over child elements.
Code Example 2: Extracting Data from a Tree Structure
This code iterates through an XML document that represents orders, utilizing Element#each_element and Element#attributes to access nested elements and attributes.
require 'rexml/document'

# Sample orders document
orders_xml = <<~XML
  <orders>
    <order>
      <number>123</number>
      <date>2024-01-10</date>
      <items>
        <item desc="Laptop" />
        <item desc="Mouse" />
      </items>
    </order>
  </orders>
XML

orders = REXML::Document.new(orders_xml)

# Iterate through each <order> element
orders.root.each_element('order') do |order|
  order.each_element do |node|
    if node.has_elements?
      # Example: <items> contains <item> children
      node.each_element do |child|
        puts "#{child.name}: #{child.attributes['desc']}"
      end
    else
      # Nodes like <number>, <date>
      puts "#{node.name}: #{node.text}"
    end
  end
end
Navigating with XPath
XPath defines a standard, language-independent way to refer to individual elements or collections of elements in an XML document. REXML provides a comprehensive XPath implementation through the REXML::XPath module, which has class methods such as first, each, and match.
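A short sketch of the three class methods named above, run against a made-up orders document. The XPath expressions and variable names are chosen for illustration only.

```ruby
require 'rexml/document'

orders_xml = <<~XML
  <orders>
    <order>
      <number>123</number>
      <items>
        <item desc="Laptop" />
        <item desc="Mouse" />
      </items>
    </order>
  </orders>
XML

doc = REXML::Document.new(orders_xml)

# first: the first node matching the expression
first_item = REXML::XPath.first(doc, '//item')
puts first_item.attributes['desc']

# each: yield every matching node to a block
REXML::XPath.each(doc, '//order/number') { |number| puts number.text }

# match: return all matching nodes as an array
descs = REXML::XPath.match(doc, '//item').map { |item| item.attributes['desc'] }
puts descs.inspect

# A predicate narrows the match to elements with a given attribute value
mouse = REXML::XPath.first(doc, '//item[@desc="Mouse"]')
puts mouse.name
```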
Converting XML to Simple Ruby Structures
If you would rather not work with the intricate tree structure of REXML::Document, third-party libraries can convert XML into simpler Ruby data structures.
XmlSimple
An XML document is parsed using the XmlSimple library (available as the xml-simple gem) and transformed into a nested structure of Ruby arrays and hashes. Because it creates a Document object in the background, there is a slight performance cost, but the end product is easier to use.
Code Example 3: Converting XML to a Hash with XmlSimple
By default, XmlSimple arranges attributes and elements into a hash and eliminates the name of the root element (unless KeepRoot is specified).
require 'xmlsimple' # provided by the xml-simple gem

xml = %{
<freezer temp="-12" scale="celsius">
  <food>Phyllo dough</food>
  <food>Ice cream</food>
</freezer>
}

# Parse the XML into nested hashes and arrays
doc = XmlSimple.xml_in(xml)

# Pretty-print the result
pp doc
Creating and Modifying XML Documents
Starting with an empty REXML::Document object, you can generate or edit XML documents.
Code Example 4: Creating XML with add_element
Calling add_element on a parent (the document or an existing element) appends a child element, which is how the document structure is built up:
require 'rexml/document'
require 'rexml/formatters/pretty'

doc = REXML::Document.new

# Create root element <meeting>
meeting = doc.add_element('meeting')

# Add <time> element with attributes, using one timestamp for both values
now = Time.now
meeting.add_element(
  'time',
  {
    'from' => now.to_s,
    'to' => (now + 3600).to_s
  }
)

# Output XML with indentation of 1 space
formatter = REXML::Formatters::Pretty.new(1)
formatter.compact = true # Use as little whitespace as possible
formatter.write(doc, $stdout)
puts # newline after output
The builder gem's XmlMarkup class offers a more idiomatic way to create XML documents: the nesting of Ruby blocks mirrors the structure of the XML being built.
RSS and Feed Aggregation
RSS (Rich Site Summary or Really Simple Syndication) is a specific XML format for storing lists of articles from websites. A program known as an aggregator gathers weblog posts and articles from several RSS feeds.
Ruby Libraries for RSS/Atom
Several libraries are available in Ruby for processing syndication feeds:
Standard rss library: This library supports the three primary RSS format versions (0.9, 1.0, and 2.0). It shipped as part of the standard library and has been a bundled gem since Ruby 3.0.
Simple RSS library: Available as the simple-rss gem, this more recent library is frequently chosen by serious aggregators. It supports Atom (a more recent syndication format that functions similarly to RSS) and parses more leniently, which gives poorly formatted feeds a better chance of being read.
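As a rough illustration of the standard rss library from the list above, the sketch below parses an RSS 2.0 document held in a string, so no network access is needed. The feed content is made up, and the second argument to RSS::Parser.parse (false) turns off strict validation.

```ruby
require 'rss'

# Hypothetical feed used only for this example
rss_xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example Feed</title>
      <link>http://example.com/</link>
      <description>A made-up feed for illustration</description>
      <item>
        <title>First post</title>
        <link>http://example.com/first</link>
        <description>Hello from the first post</description>
      </item>
    </channel>
  </rss>
XML

# Parse without strict validation
feed = RSS::Parser.parse(rss_xml, false)

puts "=== Channel: #{feed.channel.title} ==="
feed.items.each do |item|
  puts "#{item.title} (#{item.link})"
end
```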
Simple Feed Aggregator Example
A basic aggregator reads the feed content with open-uri and then parses it with an RSS library.
Code Example 5: Reading a Single RSS Feed using Simple RSS
In this example, an RSS feed is fetched and parsed from a URL using the Simple RSS library:
require 'simple-rss'
require 'open-uri'

# open-uri lets you open a URL as though it were a file
url = 'http://www.oreillynet.com/pub/feed/1?format=rss2'

# Use URI.open: the bare open(url) form no longer fetches URLs as of Ruby 3.0
feed = SimpleRSS.parse(URI.open(url).read)

puts "=== Channel: #{feed.channel.title} ==="
feed.items.each do |item|
  puts item.title
  puts " (#{item.link})"
  puts
  puts item.description
end
Basic Aggregator Structure
The RSSAggregator class encapsulates this procedure: it reads each URL with URI.open(url).read, parses the content with SimpleRSS.parse, and stores the resulting feeds in an array.
require 'open-uri'
require 'simple-rss'

class RSSAggregator
  def initialize(feed_urls)
    @feed_urls = feed_urls
    @feeds = []
    read_feeds
  end

  # Public method: list all article titles
  def list_articles
    @feeds.each do |feed|
      puts "=== #{feed.channel.title} ==="
      feed.items.each do |item|
        puts "- #{item.title}"
      end
      puts
    end
  end

  private

  # Fetch and parse every feed URL, keeping the parsed feeds in @feeds
  def read_feeds
    @feed_urls.each do |url|
      content = URI.open(url).read
      @feeds << SimpleRSS.parse(content)
    end
  end
end
