Using Redis to Control Recurring Sets of Tasks

Problem:

Create an application to check the availability of a large number of URLs. The list is constantly changing, and every URL must be checked at a specific interval (e.g., every minute). The actual checking of the URLs needs to be distributable over several servers, because if one URL is unavailable, we do not want the checking of other URLs to stall while waiting for the timeout.

Key Advantages of Redis:

Redis has two key features that made writing this code very straightforward:

  • rpoplpush: Pop the element off the right side of a list, return it, and push it onto the left side of a list. When the source and destination are the same list, this simulates a circular list (see the sketch after this list).
  • blpop: Pop the element off the left side of a list, blocking until an element is available.
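
For example, rotating a list in place with rpoplpush looks like this (a minimal sketch using the Ruby redis gem; the key name urls is hypothetical):

    require "redis"

    redis = Redis.new
    redis.rpush("urls", ["a", "b", "c"])  # list is now a, b, c

    # Using the same key as source and destination rotates the list:
    redis.rpoplpush("urls", "urls")       # => "c"; list is now c, a, b
    redis.rpoplpush("urls", "urls")       # => "b"; list is now b, c, a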

Code Design:

Clock:

  • Wake up every N seconds.
  • rpoplpush the clock queue and remember this first value.
  • Push this value onto a task queue.
  • Continue rpoplpush-ing the clock queue, pushing each value onto the task queue, until encountering a value equal to the first value (i.e., until one full rotation is complete).

Worker(s):

  • blpop the task queue, blocking until a value is available.
  • Perform the URL availability check on that value.
  • Repeat.

Code snippets:

Clock:
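
The clock loop looks roughly like this (a minimal sketch using the redis gem; the key names clock_queue and task_queue and the 60-second interval are illustrative):

    require "redis"

    redis = Redis.new
    INTERVAL = 60 # seconds between full rotations of the clock queue

    loop do
      # Rotate the clock queue once and remember the first value, so we
      # know when we have come all the way around the circular list.
      first_url = redis.rpoplpush("clock_queue", "clock_queue")

      if first_url
        # The first value must also be queued for the workers
        # (see the EDIT at the end of this post).
        redis.lpush("task_queue", first_url)

        # Keep rotating, queueing each URL, until first_url comes around again.
        loop do
          url = redis.rpoplpush("clock_queue", "clock_queue")
          break if url.nil? || url == first_url
          redis.lpush("task_queue", url)
        end
      end

      sleep INTERVAL
    end

Note that this assumes the URLs in the clock queue are unique; a duplicate would end the rotation early.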

Worker:
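
Each worker is a simple blocking loop (again a sketch; the availability check here is a bare Net::HTTP GET, and a production version would want explicit timeouts):

    require "redis"
    require "net/http"
    require "uri"

    redis = Redis.new

    loop do
      # blpop blocks until a URL appears on the task queue;
      # it returns a [list_name, value] pair.
      _list, url = redis.blpop("task_queue")

      # A slow or unavailable URL only ties up this worker,
      # not the clock or the other workers.
      begin
        response = Net::HTTP.get_response(URI.parse(url))
        puts "#{url} -> #{response.code}"
      rescue StandardError => e
        puts "#{url} -> DOWN (#{e.class}: #{e.message})"
      end
    end

Because blpop is atomic, any number of workers can share the same task queue without coordinating with one another.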

Final Thoughts:

Getting started with Redis is very straightforward. Redis is easy to install and has a small number of commands to manipulate data. The Redis gem is a thin abstraction layer over the core Redis commands. In under an hour, I was able to write code that manipulated Redis data.

I'm looking forward to trying out Redis on some data-processing problems at my current job. I need to process data in stages, and today I either write the data to temporary files or dump it into MySQL. Temp files are an awkward solution because I need the process to be interruptible/resumable, and dumping the data into MySQL is a performance bottleneck. For my needs, Redis looks like a nice sweet spot between writing temporary files and pulling the data into MySQL.

EDIT: Thanks to Ted Naleid for finding a bug in the original version of the clock code where the first_url was never put onto the worker queue.