Friday, March 2, 2007

Geek Out: Mashing Yahoo! Pipes and the Congressional Record

I just added a post to my legislative tracking blog: Managing the volume of content from Congress. It's all about mashing up the feeds from GovTrack with Yahoo! Pipes to make some really useful feeds for tracking the information coming out of Congress.

This post elaborates, in a little more technical detail, on how I put it all together.

The Legislative Tracking Pipes

These were actually fairly simple. It was just a matter of knowing the legislative process as I follow it, and then creating the feeds to match. The only complex part was using regex to strip out the then-unnecessary feed item title prefixes. Rather than explain it, just have a look at the pipes and see how I did it.
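For anyone who would rather see the idea outside of Pipes, here is a minimal sketch of that prefix-stripping step in Python. The prefix labels and the sample title are made up for illustration; the actual GovTrack titles and the regex used in the pipes may well differ.

```python
import re

# Hypothetical prefix labels; the real feed item titles may use others.
PREFIX_RE = re.compile(r"^(?:Introduced|Vote|Debate|Speech)\s*:\s*", re.IGNORECASE)


def strip_prefix(title: str) -> str:
    """Remove one leading 'Label: ' prefix from a feed item title, if present."""
    return PREFIX_RE.sub("", title, count=1)


if __name__ == "__main__":
    # Made-up title, only to show the shape of the transformation.
    print(strip_prefix("Debate: Some Bill Title"))
```

Because the pattern is anchored at the start of the title, a colon later in the title is left alone.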

Congressional Record Tracking Pipes

This was more complex. The key here was to find or generate a list of feeds that I would need. As much time as this project took, going through all 535 members looking for member feeds would have been way too time-consuming.

There were also two problems:
  1. The feed URLs could change
  2. The way I process those feeds could change
That is, GovTrack could change the way it does anything, and it would affect my final product.

So, to circumvent the first problem, instead of using the GovTrack feeds directly, I created a redirect on The Mountain (my government-related site) that would generate the feed URL with just the member code. So, if the URL structure changes, I just have to change the redirect and I'm good to go.
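The redirect idea can be sketched roughly like this in Python. Everything specific here is an assumption: the URL template is a placeholder on example.org, not GovTrack's real feed URL, and the numeric member code format is a guess. The point is only that the URL pattern lives in one place.

```python
from urllib.parse import urlencode

# Placeholder template; stands in for whatever the real feed URL looked like.
FEED_TEMPLATE = "https://feeds.example.org/person?{query}"


def feed_url_for(member_code: str) -> str:
    """Build the feed URL for one member from just the member code."""
    if not member_code.isdigit():
        raise ValueError(f"unexpected member code: {member_code!r}")
    return FEED_TEMPLATE.format(query=urlencode({"id": member_code}))


def redirect_response(member_code: str):
    """Return (status, headers) for an HTTP redirect to the member's feed."""
    return 302, {"Location": feed_url_for(member_code)}


if __name__ == "__main__":
    print(redirect_response("400001"))
```

If the upstream URL structure changes, only FEED_TEMPLATE needs to change; every pipe that points at the redirect keeps working.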

(April 2 Update: Since I found a way to keep all the feed URLs in just 2 files and not 100, I'm dropping the redirect URL from the feeds and am going with direct URLs.)

As to the second problem, that's a Yahoo! Pipes thing. They currently do not support a pipe input module for creating a "processing pipe," though there is a suggestion open on this for which users can vote. (hint hint!)

As of now, if the item title prefixes of the speeches or "Debate" items were to change, I'd have to change 105 pipes manually every time. Once Yahoo! supports processing pipes, I should only have to go back through all 105 pipes once more to point them at that one processing pipe.

With that, the only pieces missing were the actual feeds for the content of members themselves.

Listing Member RSS Feeds

This was a little tricky (and is probably the major reason for this post). Nowhere does GovTrack list all the members along with its codes for them. Even the raw data source for the people is unwieldy because it includes every member of Congress ever.

I was hoping for a spreadsheet or CSV of all 535 members with codes, and states, maybe even districts, but that proved to be unnecessary. I also tried using wget to download the states page and one link deep, but that is blocked.

Then I realized it's not that much work to just download the source for the 55 pages, one for each state and territory, and those at least include the number code for each member in some form. From there it was simply a matter of grepping out the lines that didn't have the member URL in them, and then using search and replace to convert the anchor tags for each member into RSS feed items instead.
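I did this with grep and search and replace, but the same two steps could be sketched in Python as below. The anchor markup, the `id=` parameter, and the example.org feed URL are all assumptions for illustration; the real page source looked different in its details. Putting the state in a `<category>` element is also just one way to carry it along.

```python
import re
from xml.sax.saxutils import escape

# Assumed shape of a member link in the saved page source.
ANCHOR_RE = re.compile(r'<a href="[^"]*\bid=(\d+)"[^>]*>([^<]+)</a>')


def anchors_to_items(html: str, state: str) -> list:
    """Turn each member anchor tag into an RSS <item> string."""
    items = []
    for line in html.splitlines():
        match = ANCHOR_RE.search(line)
        if not match:
            continue  # the "grep" step: drop lines with no member URL
        code, name = match.groups()
        items.append(
            "<item>"
            f"<title>{escape(name.strip())}</title>"
            f"<link>https://feeds.example.org/person?id={code}</link>"
            f"<category>{escape(state)}</category>"
            "</item>"
        )
    return items


if __name__ == "__main__":
    sample = '<p>navigation</p>\n<li><a href="person.xpd?id=400001">Jane Doe</a></li>'
    for item in anchors_to_items(sample, "Ohio"):
        print(item)
```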

I now had an RSS feed of RSS feeds for all 440 members of the House (Representatives and Delegates) and all 100 members of the Senate.

Why RSS of RSS, though?

That's because when I made the suggestion about an input module for pipes, Yahoo! referred me to a TechBrew post about using Google Spreadsheets to input a list of feeds into Yahoo! Pipes. (Or, as they put it, "Pseudo-OPML.")

Yahoo! Pipes does not support OPML as a source, but it does support RSS, so convert your OPML or source list to an RSS "feed" instead. The source RSS feed isn't being used for syndication so much as it's being used simply as XML to structure the syndication RSS feed data.

With Google Spreadsheets, all they were really using was the RSS export capability. I don't need to tie up my data or a Google Spreadsheet for something I can do in a text file. So, I just made my own RSS feed of RSS feed redirects.

While I didn't use a spreadsheet, I did use the TechBrew technique for pulling a list of members from the RSS feed. This is where it got really cool, too. I was thinking I was going to need an XML file for each state. Then I realized I could actually do it all in one file (per chamber) and use a Pipes filtering module to narrow it down by state. That was fun.
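Outside of Pipes, the "one file per chamber, filter by state" idea looks something like the sketch below. It assumes each item in the list feed carries its state in a `<category>` element and its feed URL in `<link>`; those element choices and the example.org URLs are illustrative, not necessarily how my files are laid out.

```python
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for the per-chamber "RSS feed of RSS feeds".
SAMPLE_RSS = """<rss version="2.0"><channel><title>Members</title>
<item><title>Jane Doe</title><link>https://feeds.example.org/person?id=1</link><category>Ohio</category></item>
<item><title>John Roe</title><link>https://feeds.example.org/person?id=2</link><category>Utah</category></item>
<item><title>Ann Poe</title><link>https://feeds.example.org/person?id=3</link><category>Ohio</category></item>
</channel></rss>"""


def feeds_for_state(rss_text: str, state: str) -> list:
    """Return the feed URLs of every item whose category matches the state."""
    root = ET.fromstring(rss_text)
    wanted = state.strip().lower()
    return [
        item.findtext("link")
        for item in root.iter("item")
        if (item.findtext("category") or "").strip().lower() == wanted
    ]


if __name__ == "__main__":
    print(feeds_for_state(SAMPLE_RSS, "Ohio"))
```

The state name is the only thing that varies between copies, which is why duplicating a pipe per state is so little work.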

This was really nice because when it came time to duplicate my desired pipe for each state, it was simply a matter of copying the state name from a link on GovTrack (why type them all and risk typos?), saving a copy of the pipe in Pipes, changing the state name in two places, and saving again. Done. Then repeat for each state.

Final Output

Thus, from each of the ~50 pages each chamber generates in the Congressional Record each day it's in session, this pipe will feature one paragraph from any member's appearances on each of those pages. At the end of the day this means fewer than 50 items per day in the Senate, and fewer than 100 items per day in the House. I can handle reading about 20 items at a time, and if I stay on top of it, this setup can really help me do that.

The end result of these feeds can be seen in my Google Reader folders for the Congressional Record of the House and Senate (minus speeches about new bills). I also have a Reader folder for the Legislation feeds, but I tend to read each pipe in that one individually, not as a folder.

(Update: Actually, strike that last paragraph. I've gone to a single feed for all of the member speech feeds, and I no longer keep the legislation feeds in a folder, so I pay more attention to them.)

Another trick in all this was timing. Since Congress was gone last week, and items expire in GovTrack by date, I had to hold off on fine-tuning and debugging until there was actually content in the feeds. The last thing I wanted to do was duplicate a bunch of state feeds that had not been tested. And I did need to make some changes, too!

Yahoo! Pipes is great fun, and now will be even more useful.

Here are some other Yahoo! Pipes I have built, too.



Mark Woodman said...

Nice mashup work, Tim.

It would be interesting to put the pipe output into MySyndicaat, which would then let you do keyword searches on everything. CleverClogs did something like that with marketing blogs and a Grazr front end.

See and for more info.

Mark Woodman said...

The first link got cut off. Here's a shortcut:

Mark Woodman said...

You can now use real OPML in a Pipe with the Fetch Data module. Here's a writeup.