{"id":877,"date":"2013-06-28T15:42:45","date_gmt":"2013-06-28T20:42:45","guid":{"rendered":"http:\/\/www.phishie.com\/wordpress\/?p=877"},"modified":"2013-06-28T15:42:45","modified_gmt":"2013-06-28T20:42:45","slug":"eventually-consistant-performance","status":"publish","type":"post","link":"http:\/\/www.phishie.com\/wordpress\/2013\/06\/eventually-consistant-performance\/","title":{"rendered":"Eventually Consistent Performance"},"content":{"rendered":"<p>Wow, it&#8217;s been a while since I wrote anything.<\/p>\n<p>Today at <a href=\"http:\/\/www.omniti.com\">work<\/a> I was reminded about a few different technologies and pieces of advice that I just take for granted because of the people I work around; today these came together to form a bit of awesomeness I felt like sharing. For a particular client we built a high-performance <a href=\"http:\/\/nodejs.org\/\">node.js<\/a> API that drives their mobile application. One of the technologies that powers it is a <a href=\"http:\/\/basho.com\/riak\/\">Riak<\/a> cluster that we put together to provide a highly available and highly scalable data store for it.<\/p>\n<p><strong>The first is SSDs.<\/strong><\/p>\n<p><a href=\"https:\/\/twitter.com\/crucially\">Artur Bergman<\/a> gives a <a href=\"http:\/\/www.youtube.com\/watch?v=H7PJ1oeEyGg\">great rant<\/a> on why you should just shut up and buy SSDs, but at the time we put the cluster together we just couldn&#8217;t get them anywhere, so we used some fast traditional drives. We recently had a chance to upgrade these nodes onto systems with SSDs as part of a move to a new location in our datacenter. To do this without downtime we added all the new nodes into the existing Riak cluster at line A in the graph. This took our cluster from 5 nodes to 11, a mixture of the old and new. At line B we started removing the old nodes until only the six new nodes with SSDs remained. 
The difference is pretty dramatic, especially once Riak had finished moving everything over. What you see here is a histogram of the response times for PUTs into Riak: basically, how long it takes to write data into the datastore.<\/p>\n<p><a href=\"http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/riak_week_puts.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-879\" alt=\"riak_week_puts\" src=\"http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/riak_week_puts.png\" width=\"100%\" srcset=\"http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/riak_week_puts.png 1007w, http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/riak_week_puts-300x101.png 300w\" sizes=\"(max-width: 1007px) 100vw, 1007px\" \/><\/a><\/p>\n<p><strong>The second is monitoring<\/strong><\/p>\n<p>The above graphic isn&#8217;t from Riak. It&#8217;s actually from our API. The API measures how long every backend call it makes takes, be it to Riak or to any of the myriad other backend services. That&#8217;s right, <em>every<\/em> request. We aggregate them and push them out to <a href=\"http:\/\/www.circonus.com\/\">Circonus<\/a>, which handles this huge volume of data, graphs it, and alerts us if something is wrong. My coworker <a href=\"http:\/\/brianbickerton.com\/\">Brian Bickerton<\/a> has even done a write-up on <a href=\"http:\/\/www.circonus.com\/blog\/web-application-stat-collection-with-nodejs-and-circonus\/\">how we do this<\/a>.<\/p>\n<p><strong>The third is better monitoring<\/strong><\/p>\n<p>Monitoring isn&#8217;t a new topic. Neither is testing. But people can&#8217;t seem to do either of them well. The graphic earlier is a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Histogram\">histogram<\/a> of the data points. 
A histogram separates your data into &#8220;buckets&#8221; so that you can see where it falls. The darker areas mean more items fall into that bucket. I can mouse over any time period to get a more detailed view.<\/p>\n<p><a href=\"http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/histo_cursor.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-881\" alt=\"histo_cursor\" src=\"http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/histo_cursor.png\" width=\"100%\" srcset=\"http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/histo_cursor.png 1015w, http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/histo_cursor-300x125.png 300w\" sizes=\"(max-width: 1015px) 100vw, 1015px\" \/><\/a><\/p>\n<p>What this shows me is that 98% of the data falls below my cursor, 1% is at my cursor, and 1% is above it. This is out of the 122,814 samples that were taken during that time period. So I can tell that we&#8217;ve made a huge impact in reducing the worst-case situations for our users. This is actually a really important segment of users that often gets overlooked. It&#8217;s important to optimize the common case, but if you&#8217;re one of the 1% of users outside of it, it really sucks. Here we&#8217;ve shown our worst-case writes going from a full second to less than 200ms. But it also shows why histograms are important. Let&#8217;s look at that same time period with the AVERAGE response time overlaid. 
And we&#8217;ll cap the graph at 200ms to show things more clearly.<\/p>\n<p><a href=\"http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/average-zoom.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-882\" alt=\"average-zoom\" src=\"http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/average-zoom.png\" width=\"100%\" srcset=\"http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/average-zoom.png 1015w, http:\/\/www.phishie.com\/wordpress\/wp-content\/uploads\/2013\/06\/average-zoom-300x135.png 300w\" sizes=\"(max-width: 1015px) 100vw, 1015px\" \/><\/a><\/p>\n<p>The light blue average line shows some really interesting information. Notice how the average response time goes up each night. This is when the backups are happening. We can see how with the SSDs we no longer take any performance hit during the backups. But it also shows us how much information we&#8217;re missing by only looking at an average. Our basic use-case performance hasn&#8217;t really changed that much. We were around 30-50ms before and now we seem to stay at around 25ms. A nice gain, but from a user&#8217;s perspective 20ms is not noticeable. Looking only at the average, we would have no idea about the 1-2% of our users that were getting really horrible 200+ms performance. These users, however, will notice. The histogram also shows that our normal use cases seem to separate into two different operation types: one that takes &lt; 10ms and another that takes 40-50ms.<\/p>\n<p>It&#8217;s really important to be able to tell the difference between theory and reality and to be able to make decisions based on real concrete data; and even more important to be able to measure those decisions to make sure they were right. 
Monitoring at all levels of your stack provides this and lets me say with confidence:<\/p>\n<p><strong>Just buy the SSDs already.<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Wow, it&#8217;s been a while since I wrote anything. Today at work I was reminded about a few different technologies and pieces of advice that I just take for granted because of the people I work around; today these came together to form a bit of awesomeness I felt like sharing. For a particular [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-877","post","type-post","status-publish","format-standard","hentry","category-work"],"_links":{"self":[{"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/posts\/877","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/comments?post=877"}],"version-history":[{"count":6,"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/posts\/877\/revisions"}],"predecessor-version":[{"id":888,"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/posts\/877\/revisions\/888"}],"wp:attachment":[{"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/media?parent=877"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/categories?post=877"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.phishie.com\/wordpress\/wp-json\/wp\/v2\/tags?post=877"}],"curies":[{"name":"wp
","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}