Using Redis to Control Recurring Sets of Tasks

Problem:

Create an application to check the availability of a large number of URLs. The list is constantly changing, and every URL must be checked at a specific interval (e.g., every minute). The actual checking of the URLs needs to be distributable over several servers, because if one URL is unavailable, we do not want the other checks to stall while waiting for the timeout.

Key Advantages of Redis:

Redis has two key features that made writing this code very straightforward.

  • rpoplpush: Pop the element on the right side of a list, return it, and push it onto the left side of the same list. This simulates a circular list.
  • blpop: Block until an element is available in the list, then pop and return it.
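The circular-list effect of rpoplpush is easy to see with a plain Ruby array standing in for the Redis list (right = end of array, left = front), so no server is needed for this sketch; with the Redis gem the equivalent call is redis.rpoplpush(key, key):

```ruby
# Model of RPOPLPUSH with source and destination being the same list:
# pop from the right, push onto the left. Each call rotates the list
# one step, so repeated calls cycle through every element forever.
def rpoplpush(list)
  v = list.pop                   # pop from the right
  list.unshift(v) unless v.nil?  # push onto the left
  v
end

urls = ["a", "b", "c"]
rpoplpush(urls)  # => "c"; urls is now ["c", "a", "b"]
rpoplpush(urls)  # => "b"; urls is now ["b", "c", "a"]
rpoplpush(urls)  # => "a"; urls is now ["a", "b", "c"] again
```

After one full cycle the list is back in its original order, which is what lets the clock detect that it has seen every URL once.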

Code Design:

Clock:

  • Wake up every N seconds.
  • rpoplpush the clock queue and remember this first value.
  • Push this value onto the task queue.
  • Continue rpoplpushing values off the clock queue, pushing each onto the task queue, until the first value comes around again.

Worker(s):

  • blpop the task queue, blocking until a URL is available.
  • Perform the URL availability check.
  • Repeat.

Code snippets:

Clock:
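A minimal sketch of the clock along the lines of the steps above, incorporating the fix noted in the edit (pushing first_url onto the work queue). Queue names, host/port, and the 60-second interval are placeholder assumptions:

```ruby
# One sweep of the circular source list: rotate it once, pushing every
# URL (including the first) onto the work queue. Stops when the first
# URL comes around again.
def clock_sweep(redis, src_queue, work_queue)
  first_url = redis.rpoplpush(src_queue, src_queue)
  return if first_url.nil?            # source list is empty
  redis.rpush(work_queue, first_url)  # don't lose the first value
  loop do
    next_url = redis.rpoplpush(src_queue, src_queue)
    break if next_url == first_url    # wrapped around the circle
    redis.rpush(work_queue, next_url)
  end
end

# Usage (requires the redis gem and a running server):
#   require "redis"
#   redis = Redis.new(host: "localhost", port: 6379)
#   loop do
#     clock_sweep(redis, "clock_queue", "work_queue")
#     sleep 60  # the checking interval
#   end
```

Because rpoplpush rotates the list in place, a sweep leaves the clock queue exactly as it found it, ready for the next interval.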

Worker:
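A minimal sketch of a worker. The availability check here is assumed to be a plain HTTP HEAD request with a timeout; the queue name, host/port, and timeout value are placeholders:

```ruby
require "net/http"
require "uri"

# Returns true if the URL answers an HTTP request within the timeout,
# false on any connection failure or timeout.
def available?(url, timeout = 5)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: timeout, read_timeout: timeout) do |http|
    path = uri.path.empty? ? "/" : uri.path
    http.head(path).is_a?(Net::HTTPResponse)
  end
rescue StandardError
  false
end

# Usage (requires the redis gem and a running server):
#   require "redis"
#   redis = Redis.new(host: "localhost", port: 6379)
#   loop do
#     # blpop blocks until a URL appears on the work queue
#     _queue, url = redis.blpop("work_queue", timeout: 0)
#     puts "#{url} #{available?(url) ? 'up' : 'down'}"
#   end
```

Since blpop does the blocking, idle workers cost nothing, and running this loop on several servers spreads the checks automatically.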

Final Thoughts:

Getting started with Redis is very straightforward. Redis is easy to install and has a small number of commands to manipulate data. The Redis gem is a thin abstraction layer over the core Redis commands. In under an hour, I was able to write code that manipulated Redis data.

I'm looking forward to trying out Redis on some data processing problems at my current job. I need to process data in stages, and currently I either write the data to temporary files or dump it into MySQL. Temp files are an awkward solution because I need the process to be interruptible/resumable, and dumping the data into MySQL is a performance bottleneck. For my needs, Redis looks like a nice sweet spot between writing temporary files and pulling the data into MySQL.

EDIT: Thanks to Ted Naleid for finding a bug in the original version of the clock code where the first_url was never put onto the worker queue.
3 responses
I think you have a bug where the first_url will never get processed, only the other urls in the queue. So unless you're planning on putting a marker url in there as the first url, that you don't want to process, I think you'd want the clock to be like this:


redis = Redis.new(:host => redis_host, :port => redis_port)
while(1) do
  first_url = redis.rpoplpush src_queue, src_queue
  while(1) do
    next_url = redis.rpoplpush src_queue, src_queue
    redis.rpush work_queue, next_url
    break if next_url == first_url
    sleep sleep_interval
  end
end
Sorry, also meant to move the sleep to the outer loop; otherwise you'll sleep on every url you push onto the work queue.


redis = Redis.new(:host => redis_host, :port => redis_port)
while(1) do
  first_url = redis.rpoplpush src_queue, src_queue
  while(1) do
    next_url = redis.rpoplpush src_queue, src_queue
    redis.rpush work_queue, next_url
    break if next_url == first_url
  end
  sleep sleep_interval
end
Ted,

Nice catch. I copied and pasted this code from another project and changed it a little bit. In the original version I immediately pushed first_url onto the work_queue. I'll update with a fix.

Thanks!