Third in a series. Read the first and second posts.
As I explained in the first two posts, one of the more useful ways to debug web application problems is to interrogate DNS, and the tool for the job is already on your computer: dig(1).
So, I was researching how Amazon’s ElastiCache works to debug a performance problem we were seeing. The caches were warming more slowly than expected, and we were seeing cache misses on data that we knew should be there.
Normally when configuring a Memcache client, you give it the addresses of every node in the cluster, and the client library figures out how to distribute the keys across those nodes. With an ElastiCache Memcache cluster, you configure the client library with one address, the configuration endpoint, and the library discovers the nodes from it, adding and removing them from the client's configuration on the fly.
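To make the key distribution concrete, here is a minimal sketch of how a client might map keys to nodes. It uses a plain hash-and-modulo scheme for illustration only; real client libraries often use consistent hashing instead, and the `pick_node` name and the node addresses are hypothetical.

```python
import hashlib


def pick_node(key, nodes):
    """Choose the node responsible for a key by hashing the key.

    Illustrative modulo scheme only; not what any particular
    Memcache client library is guaranteed to do.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first four bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:4], "big") % len(nodes)
    return nodes[index]


# Hypothetical node addresses, for illustration only.
nodes = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
chosen = pick_node("user:42", nodes)
```

Note that a client configured with a single address has nothing to distribute across: every key maps to that one address.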
We configured the app with the endpoint URL, and it all seemed to be working fine. The app was able to talk to the configuration endpoint just like a normal Memcache server. But eventually we noticed the subtle problems listed above.
It turns out that the stock PHP Memcached module doesn’t do auto-discovery; for that, you have to install the AWS-provided module. But then why weren’t connections to the configuration endpoint obviously failing?
As I was playing around with a small cluster, it seemed that using the config endpoint as a normal Memcache node worked just fine—it would store and retrieve entries as normal. So I looked at the DNS record for the endpoint:
$ dig -t any +noall +answer test-cluster.d00b1e.cfg.use1.cache.amazonaws.com
test-cluster.d00b1e.cfg.use1.cache.amazonaws.com. 15 IN A 172.31.x.y
The IP address was the same as one of the actual cache nodes, and the TTL was counting down from 15 seconds on successive lookups. Whenever the TTL reaches 0, the address changes to another one of the cache nodes.
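If you would rather watch the rotation from a script than by re-running dig, something like the following sketch could work. It assumes the local resolver honors the short TTL; `socket.gethostbyname` does not expose the TTL itself, so the script just polls and records each change. The `watch_endpoint` name and its parameters are my own.

```python
import socket
import time


def watch_endpoint(hostname, lookups=5, delay=15.0,
                   resolve=socket.gethostbyname, sleep=time.sleep):
    """Resolve hostname repeatedly and return each address seen
    whenever the answer differs from the previous lookup."""
    seen = []
    for attempt in range(lookups):
        address = resolve(hostname)
        if not seen or seen[-1] != address:
            seen.append(address)
        # No need to wait after the final lookup.
        if attempt < lookups - 1:
            sleep(delay)
    return seen


# Example call (requires network access to the cluster's VPC):
# watch_endpoint("test-cluster.d00b1e.cfg.use1.cache.amazonaws.com")
```

The `resolve` and `sleep` parameters exist only so the function can be exercised without real DNS lookups or real waiting.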
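So any given node can answer the special command that lists the cache nodes, but if you connect a stock Memcache client library to the endpoint, whichever node it currently resolves to simply acts like a normal Memcache server. That would explain the misses: each time the address rotated, a client that re-resolved the endpoint would be reading from and writing to a different node.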
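For the curious, the node-listing command can be tried by hand. The sketch below assumes the `config get cluster` command and the reply layout described in AWS's auto discovery documentation (a `CONFIG cluster` header, a version number, then space-separated `hostname|ip|port` entries); the function names are mine, and a stock Memcache server outside ElastiCache would be expected to reject the command.

```python
import socket


def parse_cluster_config(response):
    """Parse a 'config get cluster' reply into (version, nodes),
    where nodes is a list of (hostname, ip, port) tuples."""
    text = response.decode("ascii")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 3 or not lines[0].startswith("CONFIG cluster"):
        raise ValueError("unexpected reply: %r" % text[:40])
    version = int(lines[1])
    nodes = []
    for entry in lines[2].split():
        host, ip, port = entry.split("|")
        nodes.append((host, ip, int(port)))
    return version, nodes


def fetch_cluster_config(host, port=11211, timeout=3.0):
    """Send the command to one node (or the configuration endpoint)
    and return the parsed node list."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"config get cluster\r\n")
        data = b""
        # Stop at the normal terminator, or at ERROR if the server
        # does not understand the command.
        while not data.endswith((b"END\r\n", b"ERROR\r\n")):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_cluster_config(data)
```

Calling `fetch_cluster_config` with the configuration endpoint's hostname should return the same node list no matter which node the endpoint currently resolves to.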
This behavior is surprising and doesn’t fit with a fail-fast design philosophy, but using the publicly available information at the heart of the Internet, we were able to figure it out quickly. The key is to know and use the tools at your disposal.