{"id":807,"date":"2011-09-29T09:09:32","date_gmt":"2011-09-28T21:09:32","guid":{"rendered":"https:\/\/www.deltics.co.nz\/blog\/?p=807"},"modified":"2011-09-29T10:12:45","modified_gmt":"2011-09-28T22:12:45","slug":"use-knowledge-of-your-own-threads-to-extract-optimal-performance","status":"publish","type":"post","link":"https:\/\/www.deltics.co.nz\/blog\/posts\/807\/","title":{"rendered":"Use Knowledge of Your Own Threads to Extract Optimal Performance&#8230;"},"content":{"rendered":"<span class=\"rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">[Estimated Reading Time: <\/span> <span class=\"rt-time\">5<\/span> <span class=\"rt-label rt-postfix\">minutes]<\/span><\/span><p><a href=\"http:\/\/www.thedelphigeek.com\/2011\/09\/neversleeponthreadcontentionnot.html\">&#8220;The Delphi Geek&#8221; recently blogged about a performance bottleneck he had identified in FastMM<\/a> when used with a particular conditional define.  Although not directly related, his post reminded me of an experience I had many years ago, working on a highly complex multi-threaded system (long before FastMM) and the strategy we found we had to employ in order to get optimal performance from our threading code when encountering different numbers of CPU&#8217;s (these days &#8220;cores&#8217;).<br \/>\n<!--more--><\/p>\n<h3>The System<\/h3>\n<p>The system in question was a highly complex powerstation simulation\/emulation which had to scale from single processor (development\/test machines) thru dual processor (commissioning and QA) and quad processor (deployment) hardware.  To our surprise we found that our threading which worked very well on single CPU&#8217;s started to behave apparently quite erratically on different numbers of CPUs.  Our synchronization was fine (surprisingly in some respects &#8211; given that we were developing on single CPU systems, effectively serialising our code in many respects) &#8211; no, the issue was performance.<\/p>\n<p>The system was a training simulation for a physical power station being constructed in Germany &#8211; when you turn one of these things on, you need operators who have experience of running it &#8211; our job was to create the entire power station in software for these training purposes!<\/p>\n<p>We had to facilitate highly efficient scheduling between different threads.  There were threads that were executing a mathematical model of the power station itself, then there were threads that were emulating control hardware in the real system.  There was then a multi-threaded terminal emulation layer (emulating Unix X-Windows terminals on Windows NT) that had to communicate with a Primary Processing Unit which itself comprised a number of threads which marshalled signals and data from the various hardware emulation threads to\/from the terminal emulation.<\/p>\n<h3>The Problem and Our First Attempt at Solving It<\/h3>\n<p>As complex as all this sounds, allowing Windows to schedule the threads as it saw fit worked fine on 1 CPU (duh) but on 2 CPU&#8217;s we found that the system bogged down.  It didn&#8217;t lock up as would be expected from a fundamental synchronization issue, things just slowed\uf701 to a crawl.  Of course, synchronization was initially suspected and we spent some time investigating this before satisfying ourselves that we had got things right in that area at least, and eventually concluded that the issue was in the overhead of the context switching of the threads between the 2 CPUs.<\/p>\n<p>We found that we could massively improve throughput by specifically affining threads to each CPU using <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ms686247\">the Windows SetThreadAffinityMask() mechanism<\/a>, simply sharing them equally across all (i.e. both) CPUs.  As each thread came into being we counted the number of active threads.  Each odd numbered thread went to one CPU, each even numbered thread to the other, and this worked like a charm.<\/p>\n<p>Of course, even as naive as this approach sounds (it was a &#8220;quick fix&#8221; to see if it helped &#8211; and it did) we didn&#8217;t just assume 2 CPU&#8217;s &#8211; we used a modulo algorithm to identify the CPU for each thread so that it would &#8220;scale&#8221; to as many CPU&#8217;s as we found in a system.<\/p>\n<p>Or so we thought.<\/p>\n<p>It turned out that &#8211; in our application &#8211; this approach failed miserably on 4 CPU&#8217;s for some reason.  So we went back to the drawing board.  It seemed that CPU affinity was the solution, we just needed to be smarter in how we distributed our threads across the hardware.<\/p>\n<p>Smarter than the OS and smarter than we ourselves had previously been.<\/p>\n<h3>The Solution<\/h3>\n<p>We eventually landed on the &#8220;right&#8221; approach.  By analysing\/thinking about what the different threads were doing and what resources they contended with each other over (something no OS thread scheduler or memory manager can actually divine for itself &#8211; not with any currently technology I am aware of at any rate) we identified different classes (or groups or pools &#8211; call it what you will) of threads with different contention characteristics with each other and the other groups.<\/p>\n<p>There was one group that we determined should be affined to just one of the CPU&#8217;s (comprising threads which were unlikely to be active simultaneously so they could efficiently share a single CPU).  Another group that would be highly likely to be active when the first groups was also active but less likely to contend with other threads in the same (second) group, so these could efficiently share two of the remaining CPU&#8217;s.<\/p>\n<p>There was then a third group containing very few threads but which were likely to be running all the time, so these were affined to 3 of the CPU&#8217;s.<\/p>\n<p>(Within each group we also identified a base priority level for the threads in that group with the ability to specify higher or lower priority for further identified sub-groups of threads if required.  We found this was less significant on performance &#8211; in our case &#8211; than affinity however)<\/p>\n<p>The 4th CPU was left free for the OS to use as it saw fit &#8211; we kept our threads off this CPU entirely, the theory being that this would result in the OS using that 4th CPU for other threads and processes we could not (easily or reliably) directly constrain.<\/p>\n<p>The net result of this exercise was that we externalised these various configurations by identifying the thread &#8220;pools&#8221; and allowing the executable to be &#8220;tuned&#8221; by making entries in a configuration file (an INI back in those days).   At startup each thread would look in the config file to determine what CPU it should affine itself to and set it&#8217;s affinity accordingly.<\/p>\n<p>In addition to the tuning mechanism we also built in performance monitoring metrics, and if we found that performance was not meeting the required throughput levels (effective realtime operation of the simulation) on a particular machine (hardware\/software configuration) we could tune the configuration of our threads with almost infinite variety, requiring nothing more than a restart of the simulation software, and see how effective &#8211; or otherwise &#8211; or re-tuning had been, through those same metrics.<\/p>\n<p>We arrived at a base configuration for our common configurations:  1, 2 and 4 CPU&#8217;s.  But if someone had won the lottery and given us an 8-CPU machine to play with, we could have devised a further configuration for that exotic hardware too.<\/p>\n<p>The result:  One build of the system with an almost infinite variety of performance profiles able to cope with any number of CPU&#8217;s (and background load) that our system encountered.<\/p>\n<h3>In Conclusion<\/h3>\n<p>When it comes to complex multi-threaded applications, the only thing that the compiler, the OS and the runtime knows is that you have a complex multi-threaded application.  The only place where precise, detailed knowledge of how the various threads in your system will interact during their lifetime is in the system designer\/developer&#8217;s heads.<\/p>\n<p>Synchronization between threads is one expression of those interactions that ensures that the system behaves <i>correctly<\/i>.<\/p>\n<p>Scheduling of those threads is another expression of those interactions that ensure that the system behaves <i>optimally<\/i>.<\/p>\n<p>Just as we already <i>know<\/i> that we cannot rely on the runtime to automatically divine the necessary synchronization between threads, it really should not come as much of a surprise that, <b>sometimes<\/b> at least, the runtime also fails to divine the optimal <i>scheduling<\/i>.<\/p>\n<p>Just as we have to put in some effort to get synchronization right, the same attention to scheduling can sometimes be just as important.<\/p>\n<h3>Getting Across the Platforms<\/h3>\n<p>It should also be noted, in these days of cross-platform support in Delphi, that not all operating systems provide the same degree of control over thread scheduling as Windows.  This is not necessarily indicative of less sophistication in those other OS&#8217;s &#8211; perhaps quite the opposite.  If low level controls aren&#8217;t provided this may simply be because they aren&#8217;t found to be needed.<\/p>\n<p>I have no idea whether the approach we settled on in this system would have been possible or even necessary on OSX or Linux for example, but I am sure the principle can be adapted and applied to whatever degree may be necessary and appropriate in those cases.<\/p>\n","protected":false},"excerpt":{"rendered":"<p><span class=\"rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">[Estimated Reading Time: <\/span> <span class=\"rt-time\">5<\/span> <span class=\"rt-label rt-postfix\">minutes]<\/span><\/span> &#8220;The Delphi Geek&#8221; recently blogged about a performance bottleneck he had identified in FastMM when used with a particular conditional define. Although not directly related, his post reminded me of an experience I had many years ago, working on a highly complex multi-threaded system (long before FastMM) and the strategy we found we had to [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":[]},"categories":[4,1],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1TKYv-d1","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":349,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/349\/","url_meta":{"origin":807,"position":0},"title":"Delphi 2009 &#8211; String Performance","date":"18 Sep 2008","format":false,"excerpt":"NOTE: Downloads are now fixed! Andreas Hausladen generously took the time to make some detailed comments on my previous post, one of which prompted me to throw together some further performance test cases for String types specifically.\u00a0 The results were something of a mixed bag and contained some surprises. The\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/delphi2009-stringperformance-resultscapture.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":1985,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1985\/","url_meta":{"origin":807,"position":1},"title":"Build Automation With Train","date":"29 Oct 2013","format":false,"excerpt":"A comment from Kevin P brought a build automation tool to my attention this evening, called Train. Train is a JavaScript based build automation tool from a little company called RemObjects. It is written in Oxygene but the API provides specific support for Delphi. It's an open source project and\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/Screen-Shot-2013-10-29-at-23.36.22-.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":375,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/375\/","url_meta":{"origin":807,"position":2},"title":"Delphi 2009 &#8211; StringPerformance Redux","date":"22 Sep 2008","format":false,"excerpt":"It looks like I may have jumped the gun with my conclusions from the previous exercise to benchmark string performance in Delphi 2009.\u00a0 Following a useful exchange in the comments with Kryvich I corrected a small discrepancy in the tests and made some changes to the performance testing subsystem within\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/delphi2009-stringperformance-chart.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":1297,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1297\/","url_meta":{"origin":807,"position":3},"title":"Of Threads, Stacks and RAM &#8211; Part 1","date":"28 Nov 2012","format":false,"excerpt":"Roberto Schneiders recently drew my attention to the first post on his new blog (which I can recommend as a good read :) ), presenting the results of some performance testing of DataSnap that he had been involved with which proved to be very interesting (if initially somewhat disappointing). But\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/Screen-Shot-2012-11-28-at-20.24.21-.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":1330,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1330\/","url_meta":{"origin":807,"position":4},"title":"Of Threads, Stacks and RAM &#8211; Part 2","date":"29 Nov 2012","format":false,"excerpt":"In the previous post in this series, we saw that the number of threads that a given process could support was determined by a number of factors, of which the stack size reserved for each thread was key. We also saw how we could change the stack size used by\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1925,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1925\/","url_meta":{"origin":807,"position":5},"title":"VCL Threading &#8211; Synchronization","date":"16 Oct 2013","format":false,"excerpt":"Although I am using Oxygene a lot these days, Delphi remains my tool of choice for Win32 (and x64) development, together with the VCL. Hence this post. A long time ago, in a galaxy far far away, Delphi was a Windows only development tool. 16 was the number of the\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/807"}],"collection":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/comments?post=807"}],"version-history":[{"count":7,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/807\/revisions"}],"predecessor-version":[{"id":814,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/807\/revisions\/814"}],"wp:attachment":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/media?parent=807"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/categories?post=807"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/tags?post=807"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}