Amazon EC2 Latency: The Pretty Graphs

Yesterday, I wrote about our experiences with network latency on Amazon's EC2 platform. There I noted how Amazon's internal network would suddenly experience huge latency, with pings reporting times as high as 7000 ms.

Following on from that, I noticed that Cloudkick, the cloud performance monitoring people, have published their own findings on the network latency. Digging into their graphs, we find a complete correlation with our own data.

As my mother would say: "misery enjoys company."

As you can see from this dramatic graph, the network times are more than just blips, with major latency drag lasting for more than a few hours at a time. What is interesting to note is that the latencies do eventually recover. We can't afford the luxury of waiting, so we shut down the affected instance and start a new one. Our architecture has been designed to allow this without too much fuss.
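To make the "shut down and try again" decision concrete, here is a minimal sketch of the kind of check that could drive it. It is not our production code: the function name, the 1000 ms threshold and the three-in-a-row rule are all assumptions chosen for illustration. The idea is to act only on sustained latency rather than on a single blip.

```python
def should_replace(samples_ms, threshold_ms=1000.0, min_bad=3):
    """Decide whether an instance should be shut down and replaced.

    samples_ms   -- ping times in milliseconds, oldest first;
                    None stands for a ping that got no reply.
    threshold_ms -- hypothetical cut-off above which a ping counts as bad.
    min_bad      -- hypothetical number of consecutive bad pings required.
    """
    if len(samples_ms) < min_bad:
        # Not enough data yet to call it sustained latency.
        return False
    recent = samples_ms[-min_bad:]
    # Every one of the most recent pings must be lost or over the threshold.
    return all(s is None or s > threshold_ms for s in recent)


if __name__ == "__main__":
    # A single spike is ignored; a sustained run triggers replacement.
    print(should_replace([20, 7000, 25, 30]))
    print(should_replace([20, 7000, 6500, None]))
```

Requiring several bad samples in a row trades a slower reaction for fewer needless restarts; the right values depend on how quickly a replacement instance can take over.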

Eran Tromer has published a very interesting paper on his findings about Amazon's EC2 architecture, including a detailed analysis of the effect of different instance types within their network.

Amazon has now contacted me about this issue. Hopefully the data we provide will help them diagnose any problems that may exist, so that we can get back to the job at hand: solving our business problem, not Amazon's.


