Thanks for visiting! Matched Pattern is available for hire! We do web apps and infrastructure.
Our client PearachuteKids.com got the call Friday at 5:30pm that their episode of Shark Tank would air the following Sunday. That was all the info we had. We didn’t even know which hour of Shark Tank we’d be in, nor what kind of traffic hit to expect. We had always assumed we’d have more time to prepare for something like this!
Pearachute had one server on AWS handling all their web traffic. The database was a small, sleepy RDS cluster. Googling around, I found very little on what to expect in terms of the actual requests per minute a Shark Tank airing generates.
Next, we began scaling up our resources by imaging a spare staging EC2 instance. A good amount of time was sunk into modifying this image for use in the production environment. If I had simply taken a production image, I’d have saved a chunk of work; but I didn’t want to take production offline to take a snapshot. Lesson learned.
Our biggest hurdle was overcoming the various subtleties of our deploy process and how we handled its config. We were setting Puma options during deploy that were being overridden by an existing puma.rb config file on disk. The deploy task doesn’t overwrite the on-disk puma.rb if one exists. One did exist in that image, and it was for staging.
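For context, a baked-in puma.rb like the sketch below will silently win over options passed at deploy time if the deploy task skips writing the file when one already exists. The values here are illustrative of a staging config, not our actual settings:

```ruby
# config/puma.rb — illustrative staging config of the kind left behind
# in the image; none of these values are the real ones.
environment "staging"         # wrong environment for a production box
workers 2                     # sized for a small staging instance
threads 1, 4
bind "tcp://0.0.0.0:3000"
preload_app!
```

Because Puma reads this file on boot, every production server launched from the image came up with staging settings until the file was fixed.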
So, we had to take down the now-running server to take another image and use it to scale up to our remaining limit of 8 servers. But this wasn’t enough by a long shot, so began a sudden mad dash to reach Amazon and get our instance limit raised. We were bumped to 40 instances within a few minutes of explaining the situation. I then scaled up to 21 production web servers and put them behind our main load balancer. Why that number specifically? Why not?!
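The steps above look roughly like this AWS CLI sketch. The instance IDs, AMI ID, and load balancer name are placeholders, and this uses the classic ELB API of the era:

```shell
# Image the (stopped) production web server.
aws ec2 create-image --instance-id i-0123456789abcdef0 --name prod-web

# Launch 20 more web servers from that image.
aws ec2 run-instances --image-id ami-0abc1234 --count 20 --instance-type t2.large

# Register the new instances with the main (classic) load balancer.
aws elb register-instances-with-load-balancer \
    --load-balancer-name main-lb \
    --instances i-0aaa1111 i-0bbb2222
```

Raising the account's instance limit, by contrast, went through AWS support rather than the CLI.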
As the new web servers came online, they started connecting to the RDS instance. By the time all of them were connected, we had already consumed 15% of the database CPU. That made me wary. The decision was made to scale the RDS cluster. I figured there were two options:
Take down the RDS cluster to snapshot the database, then restore that snapshot into a new, larger RDS cluster.
Grow the existing RDS cluster in place by simply modifying the instance settings.
We decided to roll with the manual snapshot-and-restore option. Why? If the in-place upgrade failed, we’d have nothing to roll back to. Done manually, we would still have the existing database to point back to. I estimated it would take 15–30 minutes of downtime. We clocked in at 22 minutes. The database was upsized to 5 times its original capacity, and I was able to decommission the old RDS cluster.
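The two options, sketched as AWS CLI calls. Identifiers and the instance class are placeholders (the post says "cluster," but a Multi-AZ PostgreSQL deployment is managed at the instance level):

```shell
# Option 1 (what we did): snapshot, then restore into a larger instance.
# The old prod-db stays around as the rollback target.
aws rds create-db-snapshot \
    --db-instance-identifier prod-db \
    --db-snapshot-identifier prod-db-pre-sharktank
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier prod-db-big \
    --db-snapshot-identifier prod-db-pre-sharktank \
    --db-instance-class db.m4.xlarge

# Option 2: modify in place; less work, but nothing to fall back to
# if the modification fails mid-flight.
aws rds modify-db-instance \
    --db-instance-identifier prod-db \
    --db-instance-class db.m4.xlarge \
    --apply-immediately
```

Once the restored instance is up, the app's database host is pointed at the new endpoint and the old instance can be decommissioned.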
With all the web servers online we saw 300 database connections; by the end of the spike, that had grown to 500.
During stress testing I noticed one URL was painfully slow and would trigger cascading 500s very shortly after the test started. This route served a huge chunk (~1 MB) of JSON that was not being cached in any way. I raced to optimize it. I wrote a quick script that ran through and generated the JSON for each object we were concerned with and moved the files to S3. I then updated the codebase to simply read the data directly from Amazon. This took the request from ~8s to ~2s. Further refactoring would be needed to clean up the cache hack and tighten up the original request, but there was no time for that now.
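A minimal sketch of that kind of pre-rendering script, in plain Ruby. The Listing struct, directory, and file names are made up for illustration; in production each file would then be pushed to S3 (e.g. via aws-sdk-s3's put_object) rather than kept on local disk:

```ruby
require "json"
require "fileutils"

# Illustrative stand-ins; not the actual model or paths.
Listing = Struct.new(:id, :name, :schedule)
OUT_DIR = "tmp/json_cache"

# Pre-render the heavy JSON payload once per record so the app can
# serve a static file instead of rebuilding ~1 MB of JSON per request.
def render_json_cache(records, dir = OUT_DIR)
  FileUtils.mkdir_p(dir)
  records.map do |rec|
    path = File.join(dir, "#{rec.id}.json")
    File.write(path, JSON.generate(rec.to_h))
    path # in production: upload this file to S3 and serve from there
  end
end
```

The app then only needs the object's S3 URL, turning an expensive per-request render into a static fetch.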
I wrapped up working on the infrastructure around 8:30pm EST. I ate for the second(?) time that day and finally took a shower.
Before The Storm
We aired at 10:15pm EST. Within five minutes, our traffic spiked 34x. Not 34%, mind you, 34 TIMES. We hit 45 requests a second at peak traffic. The spike lasted for about 20 minutes, which mirrored the actual air time of Pearachute’s Shark Tank segment. Our database load spiked to 35% when traffic peaked, but it held. All the servers held.
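For scale, some back-of-envelope numbers from the figures above (a rough sketch; the per-server figure assumes perfectly even load balancing):

```ruby
peak_rps   = 45.0  # requests per second at peak
multiplier = 34    # the 34x spike
servers    = 21    # production web servers

baseline_rps   = peak_rps / multiplier  # normal traffic: ~1.3 req/s
per_server_rps = peak_rps / servers     # peak load per box: ~2.1 req/s
```

Which suggests the real risk was heavyweight individual requests, like the uncached JSON route, rather than sheer request volume.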
After The Show
We ran 21 t2.large Ubuntu web servers hosting our Rails app via Nginx and Puma. Our database on RDS was PostgreSQL 9.5.2 on a Multi-AZ db.m4.xlarge.