Dealing with bad RSS feeds

Dave Morriss
2013-10-16

The Problem

I have written my own podcast management system which is based on a modified version of Linc Fessenden's venerable Bashpodder.

Bashpodder is a single Bash script which reads a file of RSS and Atom feed URLs and parses them using XSLT to extract the enclosure URLs. It filters them against a history file containing a list of previously downloaded enclosures, and downloads any that are new. This method of doing things is very tolerant of badly formed feeds.
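The filtering step can be sketched as follows - in Perl, for consistency with the rest of these notes, and with made-up URLs. This is not the actual Bashpodder code, which does the same job in Bash with a history file on disk:

```perl
#!/usr/bin/perl
# Sketch of the Bashpodder idea: compare the enclosure URLs found in a
# feed against a history of previously downloaded enclosures, and keep
# only the new ones. The URLs below are illustrative.
use strict;
use warnings;

# Enclosures already downloaded (Bashpodder keeps these in podcast.log)
my @history = ('http://example.org/ep1.mp3');

# Enclosure URLs extracted from a feed (Bashpodder gets these via XSLT)
my @found = (
    'http://example.org/ep1.mp3',
    'http://example.org/ep2.mp3',
);

# Build a lookup of what has been seen before
my %seen;
@seen{@history} = ();

# Anything not in the history is new and would be downloaded
my @new = grep { !exists $seen{$_} } @found;
print "$_\n" for @new;
```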

My management system uses a PostgreSQL database to keep a lot of information about the feeds I am subscribed to and what has been downloaded. I use this to keep track of what I have listened to, what is on what player, and so on. I use it to delete episodes as I listen to them, generate reports and all manner of weird stuff. For example, it tells me that I am currently subscribed to 84 feeds and have 98 episodes to listen to, which totals 3 days, 10 hours, 46 minutes, 35 seconds of listening time!

The database contains the results of parsing the feeds in a fair amount of detail. This stage is intolerant of badly formed feeds.

In the past year or so I have noticed a number of badly structured RSS feeds which have caused the Perl parser I use to fail.

You may wonder why I parse all my subscribed feeds twice. This is partly because this system has grown over a number of years and has acquired some idiosyncrasies along the way. More specifically it is because the Bashpodder clone runs overnight on my server, whereas the postprocessing part of the system which uses the database runs on my workstation.

One day I'm going to rewrite this stuff.

Weird feed #1: mintCast

I listen to the mintCast podcast. In the time that I have been a subscriber they have suffered from a number of problems with their feed.

In general these have been:

  1. Duplication of episodes, sometimes with different URLs
  2. Mixing of MP3 and OGG episodes in the same feed

I have emailed the mintCast guys and they tell me the problem is actually a WordPress bug which has not yet been resolved.

At the time of writing they have resolved the issue with mixed audio types but not the duplication issue.

A very simplified copy of the current feed is available. See the link at the end of the notes.

Weird feed #2: Pod Delusion Extra

I listen to the excellent Pod Delusion podcast and have recently subscribed to their Extra feed which contains various conference recordings and other goodies not in the main feed.

The authors have structured their episodes such that some have multiple enclosures. These are related enclosures, so this is a perfectly logical thing to do. Unfortunately, although it is not invalid RSS, many parsers do not know what to do with it.
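An item with multiple enclosures looks something like this (the title, dates and URLs below are made up for illustration, not taken from the actual feed):

```xml
<item>
  <title>Conference Session (example)</title>
  <pubDate>Wed, 16 Oct 2013 12:00:00 +0000</pubDate>
  <enclosure url="http://example.org/session-part1.mp3"
             length="12345678" type="audio/mpeg" />
  <enclosure url="http://example.org/session-part2.mp3"
             length="23456789" type="audio/mpeg" />
</item>
```

Many parsers simply take the first enclosure, or the last, and silently discard the rest.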

I wrote to the editor of the Pod Delusion podcast pointing out this problem. He recently replied saying that he would look into the issue.

Solution

The general principle I have used to fix these issues is to write small Perl scripts to rationalise the feeds. I have a script per feed at the moment since the problems are feed-specific.

Here's how they work:

  1. These scripts are run out of cron on my server just before the main podcatcher runs.
  2. Each script writes a corrected feed file to a place visible to the Apache instance running on the server.
  3. The podcatcher configuration file points to the corrected feed rather than the original.
  4. When the database backend is run it also uses this modified feed.

I am using Apache to serve these files because all of the tools I am using expect a URL for the podcast feed.
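As an illustration, the cron side of this might look something like the following (the times, paths and script names here are assumptions for the example, not my actual setup):

```crontab
# Run the feed-fixing scripts a few minutes before the podcatcher
50 2 * * *  /home/user/bin/mintCast_fix
52 2 * * *  /home/user/bin/PodDelusionExtra_fix
# Main Bashpodder-based podcatcher run
0  3 * * *  /home/user/bin/podcatcher
```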

Fix #1: mintCast_fix

This is a Perl script which manipulates the feed and saves the result in a file. The full script may be viewed in the Gitorious repository https://gitorious.org/hprmisc

Here is an explanation of the main elements of the script.

The Perl modules used are:

  LWP::Simple
  XML::RSS

The first is for collecting the contents of the feed from the mintCast site. The second is used for parsing the RSS data received.

First the URL of the mintCast OGG feed is defined followed by the name of the output file:

  my $url = 'http://mintcast.org/category/ogg/feed';
  
  my $feedfile = 'mintCast.rss';

An XML::RSS object is defined which will be used to parse the feed.

  my $rss = XML::RSS->new();

The URL is downloaded using the get method of the LWP::Simple module. If this fails the script aborts with an error message.

The feed, held as a string, is then parsed with the XML::RSS parse method. The script uses this module because it is one of the few that can handle multiple enclosures.

  my $feed = get($url) or die "Unable to get feed $url\n";
  $rss->parse( $feed, { allow_multiple => ['enclosure'] } );

An RSS feed contains multiple items. For a podcast feed each of these usually contains an enclosure showing where the audio is to be found. There are various other fields within an RSS feed, but we are not interested in these here.

The foreach loop iterates through all of the items in the feed. In this script the RSS is edited in place, so the loop contains statements to read and write the parsed data structure.

In this particular script, for cosmetic purposes, it ensures that the Dublin Core creator element is used to populate the author tag. This is not important for the correct functioning of the feed.

The main problem being solved here is the reduction of multiple enclosures down to a single one. The problem with the mintCast feed is that the enclosure is repeated many times per item. This is due to a bug in the WordPress system being used to deliver the podcast.
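As an illustration (with made-up URLs, not the actual mintCast markup), an affected item contains repeated enclosures, possibly of mixed audio types:

```xml
<item>
  <title>mintCast Episode (example)</title>
  <enclosure url="http://example.org/episode.mp3"
             length="10000000" type="audio/mpeg" />
  <enclosure url="http://example.org/episode.ogg"
             length="12000000" type="audio/ogg" />
  <enclosure url="http://example.org/episode.ogg"
             length="12000000" type="audio/ogg" />
</item>
```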

The embedded foreach loop iterates through the multiple enclosures. Once it finds one marked as type audio/ogg it writes this back as the sole element of the enclosure array and stops (the last statement terminates the inner loop). This reduces the multiple enclosures down to a single enclosure.

  foreach my $item ( @{ $rss->{'items'} } ) {
      # Copy the Dublin Core creator into the author tag if present
      if ( defined( $item->{'dc'}->{'creator'} ) ) {
          $item->{'author'} = $item->{'dc'}->{'creator'};
      }
  
      # Reduce the repeated enclosures to the single audio/ogg one
      foreach my $enc ( @{ $item->{'enclosure'} } ) {
          if ( $enc->{'type'} =~ m{audio/ogg} ) {
              @{ $item->{'enclosure'} } = $enc;
              last;
          }
      }
  }

Finally the adjusted RSS is written out to the required file and the script exits.

  $rss->save($feedfile);

Perl Notes

This script uses some of the features of Perl which may look a little bizarre. This is partly due to the use of the XML::RSS module which does not seem to adhere to the more common ways of writing object-oriented code.

  1. The $rss object is actually a reference to a data structure
  2. The expression @{ $rss->{'items'} } takes the array reference stored under the 'items' key of the hash and dereferences it - in other words, it lets the reference be treated as an array so that it can be used in a loop
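The second point can be demonstrated with a tiny self-contained example. The data structure here is made up for illustration, but has the same shape as the one XML::RSS builds:

```perl
#!/usr/bin/perl
# A hash reference whose 'items' key holds an array reference, like the
# structure XML::RSS produces; @{ ... } dereferences the array so it
# can be looped over.
use strict;
use warnings;

my $rss = {
    'items' => [
        { 'title' => 'Episode 1' },
        { 'title' => 'Episode 2' },
    ],
};

# Dereference the array ref and walk the items
my @titles;
foreach my $item ( @{ $rss->{'items'} } ) {
    push( @titles, $item->{'title'} );
}
print "$_\n" for @titles;
```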

Fix #2: PodDelusionExtra_fix

Like the previous example, this is a Perl script which manipulates the feed and saves the result in a file. The full script may be viewed in the Gitorious repository https://gitorious.org/hprmisc

Here is an explanation of the main elements of the script.

As with the other script, the Perl modules used are:

  LWP::Simple
  XML::RSS

This script begins by declaring variables:

  my ( $feed, $rss_in, $rss_out, $channel );

Next the URL of the Pod Delusion Extra feed is defined followed by the name of the output file:

  my $url = 'http://poddelusion.co.uk/blog/category/specials/feed/';
  
  my $feedfile = 'ThePodDelusionExtra.rss';

Two XML::RSS objects are defined. The first is used to parse the incoming feed and the second holds the new feed as it is being assembled:

  $rss_in = XML::RSS->new();
  $rss_out = XML::RSS->new( version => '2.0' );

The URL is downloaded using the get method of the LWP::Simple module as before. If this fails the script aborts with an error message.

The feed, held as a string, is then parsed with the XML::RSS parse method as before.

  $feed = get($url) or die "Unable to get feed $url\n";
  $rss_in->parse( $feed, { allow_multiple => ['enclosure'] } );

The $channel variable is then initialised with a reference to the channel element of the feed. This is the outermost layer of the feed structure which holds the attributes which define the feed as a whole.

  $channel = $rss_in->channel;

The new feed object is initialised with the attributes in the incoming feed:

  $rss_out->channel(
      title       => $channel->{'title'},
      link        => $channel->{'link'},
      pubDate     => $channel->{'pubDate'},
      description => $channel->{'description'},
  );

A loop is then used to process all of the items in the incoming feed. The difference here is that for every enclosure found in the input feed a new item is generated in the output feed. The new item is initialised with attributes taken from the input item and has the enclosure added to it. The result is that every enclosure ends up in its own item, which may be a duplicate of the previous item if the input item contains multiple enclosures.

  foreach my $item ( @{ $rss_in->{'items'} } ) {
      for my $enc ( @{ $item->{'enclosure'} } ) {
          $rss_out->add_item(
              title       => $item->{'title'},
              pubDate     => $item->{'pubDate'},
              description => $item->{'description'},
              enclosure   => {
                  url  => $enc->{'url'},
                  type => $enc->{'type'},
              },
          );
      }
  }

Finally the newly created RSS is written out to the required file and the script exits.

  $rss_out->save($feedfile);

Conclusion

Parsing an RSS feed into a database has proved to be a lot more difficult than I would have expected. My impression is that a lot of the problems I have encountered are because RSS is a rather nasty standard for podcast and news feeds.

Links