Basic Metaprogramming with Ruby and Rake

Problem

I needed a script to control a batch workflow process. The process had a lot of moving parts, and dividing it into multiple scripts or driving it with command-line arguments did not seem feasible. Rake looked like a perfect fit for this problem. One particular sub-problem was that I needed to process several files in a particular way. Originally, I thought I only needed to process them all at once, so I wrote a single rake task that handled every file.

First Iteration
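
The original listing is not reproduced here, so what follows is only a minimal sketch of what a "process everything at once" task could look like. The file names and the process_file helper are placeholders I made up for illustration, not the real workflow.

```ruby
require "rake" # only needed outside a Rakefile; rake loads this for you

# Hypothetical input files; the real list is not shown in the post.
FILES = %w[customers.csv orders.csv invoices.csv]

# Stand-in for the real per-file work.
def process_file(file)
  puts "Processing #{file}"
end

desc "Process every input file in one go"
task :process_all do
  FILES.each { |file| process_file(file) }
end
```

Running `rake process_all` would handle all three files, but there is no way to run just one of them.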

Second Iteration

However, in order to debug the process, I needed to process the files one by one. At first I started to create separate tasks for each file. But then I remembered that there is nothing special about the way tasks are defined: a Rakefile is plain Ruby, so I could generate the tasks with some basic metaprogramming.
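
Here is a rough sketch of that idea, again with made-up file names and a placeholder process_file helper. Because `task` is an ordinary method call, it can sit inside a loop and define one task per file.

```ruby
require "rake" # only needed outside a Rakefile; rake loads this for you

# Hypothetical input files; the real list is not shown in the post.
FILES = %w[customers.csv orders.csv invoices.csv]

# Stand-in for the real per-file work.
def process_file(file)
  puts "Processing #{file}"
end

# One task per file, named after the file without its extension,
# e.g. `rake process:orders`.
namespace :process do
  FILES.each do |file|
    name = File.basename(file, ".*")
    desc "Process only #{file}"
    task name do
      process_file(file)
    end
  end
end

# The all-at-once task now just depends on the per-file tasks.
desc "Process every input file"
task process_all: FILES.map { |file| "process:#{File.basename(file, '.*')}" }
```

With this layout, `rake process:orders` runs a single file for debugging, while `rake process_all` still runs everything.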

Conclusion

Really good DSLs like Rake and Sinatra make me forget I am writing Ruby. While I feel restricted (in a good way) to this subset of commands, at any time I can use the full power of Ruby if I need it. Being able to dynamically create tasks let me quickly create a script that is now up to ~40 tasks (and growing).

While this is starting to get a little unwieldy, the equivalent bash script would be much worse. What keeps the script manageable is that much of the process is specified as data. Most of the program processes a hash table and creates one task per entry in the hash table.
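
To give a feel for the "process as data" approach, here is a small sketch. The step names, file names, and the :after key are all invented for this example; the real hash table looks different.

```ruby
require "rake" # only needed outside a Rakefile; rake loads this for you

# Hypothetical step table: task name => settings for that step.
STEPS = {
  extract: { input: "raw.csv", desc: "Pull rows out of the raw export" },
  normalize: { input: "extract.csv", desc: "Clean up field formats", after: :extract },
  load: { input: "normalize.csv", desc: "Load the cleaned rows", after: :normalize }
}

STEPS.each do |name, settings|
  desc settings[:desc]
  # Array(nil) is [], so steps without :after get no prerequisites.
  task name => Array(settings[:after]) do
    puts "#{name}: reading #{settings[:input]}"
  end
end
```

Adding a step then means adding one entry to the hash rather than writing another task by hand.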

Using Redis to Control Recurring Sets of Tasks

Problem:

Create an application to check the availability of a large number of URLs. This list is constantly changing, and every URL must be checked at a specific interval (e.g., every minute). The actual checking of the URLs needs to be distributable over several servers, because if a URL is unavailable, we do not want to hold up the checks of other URLs while waiting for the timeout.

Key Advantages of Redis:

Redis had two key features that made writing this code very straightforward.

  • rpoplpush: Pop the element on the right side of a list, return it, and push it onto the left side of a list. When the source and destination are the same list, this simulates a circular list.
  • blpop: Block until an element is available on the list, then pop it from the left side.
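
To make the circular-list behaviour concrete, here is a pure-Ruby model of rpoplpush when the source and destination are the same list. It uses a plain Array instead of Redis, so it only illustrates the semantics; the method name is my own.

```ruby
# Model of RPOPLPUSH with the same list as source and destination:
# take the rightmost element, put it back on the left, and return it.
def rpoplpush_model(list)
  value = list.pop
  list.unshift(value) unless value.nil? # an empty list stays empty
  value
end

urls = %w[a b c]
rpoplpush_model(urls) # returns "c"; urls is now ["c", "a", "b"]
```

Calling it once per element walks the whole list and leaves it in its original order.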

Code Design:

Clock:

  • Wake up every N seconds.
  • rpoplpush the clock queue and remember this first value.
  • Push this value onto a task queue. 
  • Continue reading values on the clock queue until encountering a value the same as the first value.

Worker(s):

  • Read values off the task queue.
  • Perform the URL availability check.
  • Block until there is another value in the task queue.

Code snippets:

Clock:
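
The original snippet is not reproduced here, so this is a sketch of the clock steps described above rather than the exact code. It takes any client object that responds to rpoplpush and rpush (a Redis.new client from the redis gem does); the key names "clock" and "tasks" and the 60-second interval are assumptions. It also assumes each URL appears only once in the clock list and that the first URL is not removed mid-pass.

```ruby
# One pass over the circular clock list: copy every URL onto the task queue.
# Returns the number of URLs queued.
def tick(redis, clock_key: "clock", task_key: "tasks")
  first = redis.rpoplpush(clock_key, clock_key)
  return 0 if first.nil? # empty list, nothing to schedule

  redis.rpush(task_key, first) # the first URL has to be queued as well
  queued = 1
  loop do
    url = redis.rpoplpush(clock_key, clock_key)
    break if url.nil? || url == first # list emptied, or back where we started

    redis.rpush(task_key, url)
    queued += 1
  end
  queued
end

# Wake up every `interval` seconds and schedule one full pass.
def run_clock(redis, interval: 60)
  loop do
    tick(redis)
    sleep interval
  end
end
```

With the redis gem installed and a server running, `run_clock(Redis.new)` would start the clock. I push with rpush so that workers reading with blpop see the URLs in first-in, first-out order.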

Code snippet for Worker:
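
Again, this is a sketch of the worker steps rather than the original code. The "tasks" key name, the 5-second timeout, and the choice of a HEAD request with "status below 400 means available" are my assumptions; the redis argument is any client whose blpop returns a [key, value] pair, as the redis gem does.

```ruby
require "net/http"
require "uri"

# Returns true when the URL answers a HEAD request with a status below 400.
# Any error (bad URL, DNS failure, timeout, refused connection) counts as down.
def url_available?(url, timeout: 5)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.head(uri.request_uri).code.to_i < 400
  end
rescue StandardError
  false
end

# Block until a URL is available on the task queue, then check it.
# Returns nil if blpop times out without a value.
def work_once(redis, task_key: "tasks", timeout: 0, checker: method(:url_available?))
  _key, url = redis.blpop(task_key, timeout: timeout)
  return nil if url.nil?

  { url: url, up: checker.call(url) }
end
```

A worker process would simply call work_once in a loop, for example `loop { p work_once(Redis.new) }` in a quick experiment. Because each worker blocks in blpop, adding more workers on other servers spreads the checks without any extra coordination.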

Final Thoughts:

Getting started with Redis is very straightforward. Redis is easy to install and has a small number of commands to manipulate data. The Redis gem is a thin abstraction layer over the core Redis commands. In under an hour, I was able to write code that manipulated Redis data.

I'm looking forward to trying out Redis on some data processing problems at my current job. I need to process data in stages, and I either write the data to temporary files or dump the data into MySQL. Temp files are an awkward solution because I need the process to be interruptible/resumable, and dumping the data into MySQL is a performance bottleneck. For my needs, Redis looks like a nice sweet spot between writing temporary files and pulling the data into MySQL.

EDIT: Thanks to Ted Naleid for finding a bug in the original version of the clock code where the first_url was never put onto the worker queue.

Drawing Organization Chart Using VIVO

This is a presentation I gave at the First Annual VIVO Conference in NYC. I created a crawler that traverses the linked data available at http://vivo.ufl.edu to generate graphs of the academic organization structure. I start at http://vivo.ufl.edu/individual/UniversityOfFlorida and recursively crawl all core#subOrganization links.
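
The recursive crawl can be sketched roughly as below. This is not the code from the repository: the block passed to crawl is a hypothetical stand-in for "fetch this URI and return the URIs of its core#subOrganization links", and the small hash in the usage example replaces the real linked data.

```ruby
require "set"

# Depth-first walk that returns [parent, child] pairs.
# The block must return the sub-organization URIs for a given URI.
def crawl(root, seen = Set.new, edges = [], &children_of)
  return edges unless seen.add?(root) # add? is nil if already visited

  children_of.call(root).each do |child|
    edges << [root, child]
    crawl(child, seen, edges, &children_of)
  end
  edges
end

# Usage with made-up data in place of real HTTP fetches.
orgs = { "UF" => ["Medicine", "CLAS"], "Medicine" => ["Surgery"] }
edges = crawl("UF") { |uri| orgs.fetch(uri, []) }
```

Tracking visited URIs keeps the crawl from looping forever if two organizations ever link to each other.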

I output a variety of graph formats to use with GraphViz, Network Workbench, and Javascript Visualization Toolkit.
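
As an illustration of the simplest of these outputs, here is a sketch that turns a list of [parent, child] pairs into GraphViz DOT text. The method name, the graph name, and the sample edge are my own choices for this example, not taken from the crawler.

```ruby
# Build a DOT digraph from [parent, child] pairs.
def to_dot(edges, graph_name: "org_chart")
  # DOT node names with spaces must be quoted; escape embedded quotes.
  quote = ->(text) { '"' + text.gsub('"') { '\\"' } + '"' }
  body = edges.map do |parent, child|
    "  #{quote.call(parent)} -> #{quote.call(child)};"
  end
  ["digraph #{graph_name} {", *body, "}"].join("\n")
end

puts to_dot([["University of Florida", "College of Medicine"]])
```

Saving that output to a file and running it through a GraphViz layout command such as `dot -Tpng` produces an image of the hierarchy.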

The code for the crawler is available at: http://github.com/arockwell/vivo_org_chart.

Drawing Organization Charts Using VIVO

This is the best non-interactive graph I've created so far. This graph consists of all colleges, departments, centers, and institutes.

This is the same graph with labels drawn on.

The biggest problem with adding labels to the graph is that the labels are so long (20-30+ characters). I created an interactive version at: http://qa.vivo.ufl.edu/infovis/demo.html

This graph consists of only colleges and departments (about 150 nodes). However, seeing the labels is still difficult in spots, particularly around the bigger clusters: the College of Medicine and the College of Liberal Arts and Sciences.