{"id":1297,"date":"2012-11-28T20:09:03","date_gmt":"2012-11-28T08:09:03","guid":{"rendered":"https:\/\/www.deltics.co.nz\/blog\/?p=1297"},"modified":"2012-11-28T20:59:20","modified_gmt":"2012-11-28T08:59:20","slug":"of-threads-stacks-and-ram-part-1","status":"publish","type":"post","link":"https:\/\/www.deltics.co.nz\/blog\/posts\/1297\/","title":{"rendered":"Of Threads, Stacks and RAM &#8211; Part 1"},"content":{"rendered":"<p>Roberto Schneiders recently drew my attention to the first post on <a href=\"http:\/\/robertocschneiders.wordpress.com\/2012\/11\/22\/datasnap-analysis-based-on-speed-stability-tests\/\" target=\"_blank\">his new blog (which I can recommend as a good read \ud83d\ude42), presenting the results of some performance testing of DataSnap<\/a> that he had been involved with, which proved to be very interesting (if initially somewhat disappointing).<\/p>\n<p>But my post isn&#8217;t about that, at least not directly.<\/p>\n<p><!--more--><\/p>\n<p>One of the characteristics noted about the Indy-based system on which DataSnap is implemented was its memory utilisation, which was significantly higher than that of the other frameworks in the comparison.<\/p>\n<p>Further, commenter <strong>upow<\/strong> made the observation that Indy (by default) uses one thread per request and that Windows supports only 2,000 threads.<\/p>\n<p>These two observations are more directly connected than you might at first think, although the 2,000 figure is not correct (but its derivation can be explained).<\/p>\n<p>Before we get into the whys and wherefores, however, let us devise a simple test to determine what the thread limit on Windows actually is.  
This simple test application will tell us:<\/p>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\r\nprogram testlimit;\r\n\r\n{$APPTYPE CONSOLE}\r\n\r\nuses\r\n  Classes,\r\n  SysUtils,\r\n  Windows;\r\n\r\n\/\/ Thread function matching the Windows API signature (stdcall).\r\n\/\/ It parks the thread forever: the thread must stay alive but do nothing.\r\nfunction ThreadProc(aParam: Pointer): Integer; stdcall;\r\nbegin\r\n  Sleep(INFINITE);\r\n  result := 0;\r\nend;\r\n\r\n\r\nvar\r\n  i: Integer = 0;  \/\/ number of threads successfully created\r\n  id: Cardinal;    \/\/ receives each new thread ID (unused)\r\nbegin\r\n  try\r\n    while TRUE do\r\n    begin\r\n      \/\/ A stack size of 0 means: use the default size from the EXE header\r\n      if CreateThread(NIL, 0, @ThreadProc, NIL, 0, id) = 0 then\r\n        ABORT;  \/\/ creation failed, so we have reached the limit\r\n      Inc(i);\r\n    end;\r\n\r\n  except\r\n    on e: Exception do\r\n      WriteLn(i, ' threads is the limit');\r\n  end;\r\nend.\r\n<\/pre>\n<p><a href='https:\/\/www.deltics.co.nz\/blog\/wp-content\/uploads\/testlimit.zip'>Click here to download the code (zipped) if you can&#8217;t or don&#8217;t want to copy\/paste.<\/a><\/p>\n<p>This test code uses the Windows API directly to create as many very simple threads as it possibly can.  It is a very dirty application, not doing any clean-up, but that&#8217;s OK &#8211; the point here is to determine an absolute limit imposed by Windows, not to write a well-behaved application (don&#8217;t worry, the threads will be &#8220;cleaned up&#8221; by Windows when the process terminates).<\/p>\n<p>Each thread that is created runs the same code &#8211; a simple function that immediately puts the thread to sleep.  Forever.  Again, bear in mind that the point is not to find out how many threads can usefully execute simultaneously, but simply how many we can actually create.  The number that can do anything useful will be lower than this absolute limit.<\/p>\n<p>So whilst we don&#8217;t want the threads to be doing anything, we do need them to remain in the system &#8211; we cannot let them exit their thread functions.  
Putting them into an indefinite <strong>Sleep()<\/strong> state is the most efficient way of ensuring this.<\/p>\n<p>Compiling and running this application in Delphi XE3 for 32-bit Windows (I strongly recommend that you don&#8217;t compile and run it for 64-bit, at least not yet), I get the following output:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n  1569 threads is the limit\r\n<\/pre>\n<p>That is somewhat less than the 2,000 that <strong>upow<\/strong> suggested.  In fact, it is not really very close at all.  It also seems a very arbitrary number; it doesn&#8217;t even look very &#8220;limit-like&#8221;.  How come?<\/p>\n<p>Before we get into that, let&#8217;s see what the limit is on Win64 &#8211; maybe Microsoft increased it?<\/p>\n<p>So, assuming you have XE2 or XE3 and a 64-bit Windows environment available, add Win64 as a target platform, then compile and run on Win64 (you might want to get yourself a tea or a coffee while you wait for it to finish).  Actually, it shouldn&#8217;t take that long, but it will take a lot longer than the Win32 version, and your machine will quickly become unusable as the threads mount up and drain resources from your system (even sleeping threads are not free).<\/p>\n<p>Eventually it will complete, and you should see a far, far higher limit.  In my case:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n  153354 threads is the limit\r\n<\/pre>\n<p>A roughly one hundred-fold increase in the number of threads!  How do we explain the difference?<\/p>\n<p>The answer is actually very simple: <strong>Stack Size<\/strong><\/p>\n<h3>Commit and Reservation<\/h3>\n<p>Every thread in a process requires a stack.  On Windows (at least, and quite possibly as a universal rule &#8211; I don&#8217;t honestly know) the area of memory used for a stack must be contiguous.  
That is, a single block of memory.<\/p>\n<p>And the size of a stack (the size of that block of memory) is important.<\/p>\n<p>Too small and your code will run out of room in that stack and you will get an exception &#8211; an exception which provided the inspiration and indeed the name for the <a href=\"http:\/\/stackoverflow.com\/\" target=\"_blank\">stackoverflow<\/a> website!<\/p>\n<p>Too big and there will be parts of the memory allocated for the stack that are never used and will be wasted.<\/p>\n<p>Fortunately, you can tell Windows how big you need your stack to be.  But (again, on Windows) this is something that goes into the header of your executable, so you have to give this information to your compiler so that it can make the appropriate entries when writing your EXE to disk.<\/p>\n<p>These settings are in the <strong>Linker<\/strong> options of your project:<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/Screen-Shot-2012-11-28-at-20.24.21-.png?ssl=1\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/Screen-Shot-2012-11-28-at-20.24.21-.png?resize=618%2C476&#038;ssl=1\" alt=\"\" title=\"Stack Size Linker Settings\" width=\"618\" height=\"476\" class=\"aligncenter size-full wp-image-1310\" srcset=\"https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/Screen-Shot-2012-11-28-at-20.24.21-.png?w=618&amp;ssl=1 618w, https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/Screen-Shot-2012-11-28-at-20.24.21-.png?resize=300%2C231&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/Screen-Shot-2012-11-28-at-20.24.21-.png?resize=150%2C115&amp;ssl=1 150w\" sizes=\"(max-width: 618px) 100vw, 618px\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p>The settings in the screenshot above are the default for a Delphi project.  
They specify (in bytes) a <em>minimum<\/em> stack size of 16 KB and a <em>maximum<\/em> of 1 MB.<\/p>\n<p>Both values matter, but it is the <strong>maximum<\/strong> figure that is key to explaining the number of threads we can create.<\/p>\n<p>The minimum stack size is the amount of memory that will initially be committed for the stack (that is, backed by physical RAM or the page file).  This is the &#8220;commit charge&#8221;.  The maximum stack size is the amount of address space that will be reserved for the stack, so that it can grow beyond that initial commitment if it needs to.  This is the &#8220;reservation&#8221;.  Each thread&#8217;s reservation occupies address space whether or not it is ever used, which is why the maximum matters so much in a 32-bit process.<\/p>\n<p>We can easily test the effect of these minimum and maximum sizes on our simple test application.<\/p>\n<p>To make this easier, we can set these values using compiler directives instead of repeatedly going into <strong>Project Options<\/strong>.  Add these two directives after the <strong>$APPTYPE<\/strong> directive:<\/p>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\r\n{$APPTYPE CONSOLE}\r\n{$MINSTACKSIZE 16384}\r\n{$MAXSTACKSIZE 65536}\r\n<\/pre>\n<p>This leaves the initial commit charge for each thread&#8217;s call stack unchanged at 16 KB, but drastically reduces the maximum size &#8211; and thus the amount of address space reserved for each stack &#8211; to just 64 KB.<\/p>\n<p>Recompile and run again (I suggest you do this on Win32, to save time).  
You should get something similar to this output:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n  6076 threads is the limit\r\n<\/pre>\n<p><strong>NOTE: <\/strong> If the results you get are significantly different from those I am presenting here, bear in mind that every machine is different, since the limit depends on the particular hardware and software in each case.<\/p>\n<p>For context, I am conducting these tests inside a Win64 virtual machine (hosted on a Mac) with 4 GB of RAM allocated to that VM.<\/p>\n<p>In any event, there is clearly no absolute limit of 2,000 threads on Win32.<\/p>\n<p>There is, however, a practical limit, which is a function of the amount of <em>addressable<\/em> memory, the amount of <em>available<\/em> memory, the degree of <em>fragmentation<\/em> of that memory, and the amount of address space reserved for each thread&#8217;s stack.<\/p>\n<p>Even the implementation details of different memory managers will have an influence.<\/p>\n<p>It is the interaction of these variables, combined with the various hidden behaviours of complex modern software, that makes the increase in the number of threads somewhat smaller than we might have expected.<\/p>\n<p>64 KB is just one sixteenth of 1 MB, but the number of threads does not increase by a commensurate factor of 16; in fact we achieve slightly less than 4 times as many threads (6076 versus 1569).<\/p>\n<p>The influence of memory managers can be seen by simply switching to a different one.<\/p>\n<p>The 6076 result was achieved both with the default memory manager (FastMM) and with the 4.99.1 release of FastMM.  However, when repeating the test using ScaleMM (2.12), the number fell slightly to 6073.  Not a huge difference, but a difference nonetheless.<\/p>\n<p>Even simply removing the <strong>Classes<\/strong> unit from the uses list (included in anticipation of the next post in this series, but not actually required by this test program) will have an impact.  
The figure increases slightly to 6086: removing the <strong>Classes<\/strong> unit reduces the size of the <strong>exe<\/strong> image loaded into the process, so the process itself occupies less of its address space, leaving that space available for thread stacks.<\/p>\n<p>The limit of 2,000 threads that <strong>upow<\/strong> quoted was most likely derived from the fact that, with 1 MB reserved for each stack and 2 GB of user address space per process, there is a &#8220;perfect&#8221;, theoretical limit of roughly 2,000 threads per process on 32-bit Windows (2 GB \/ 1 MB = 2,048).  In practice, however, this perfect limit cannot be reached with that amount of stack per thread, since no process can start with 100% of its address space available to serve purely as stacks for its threads.<\/p>\n<p><a href=\"http:\/\/blogs.technet.com\/b\/markrussinovich\/archive\/2009\/07\/08\/3261309.aspx\" target=\"_blank\">All of this is very comprehensively explained by Mark Russinovich<\/a>, so in the next post I shall look at the tools available to us to work within these limits.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Roberto Schneiders recently drew my attention to the first post on his new blog (which I can recommend as a good read \ud83d\ude42), presenting the results of some performance testing of DataSnap that he had been involved with, which proved to be very interesting (if initially somewhat disappointing). 
But my post isn&#8217;t about that, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[4],"tags":[192,292,194,193,93,141],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1TKYv-kV","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":1330,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1330\/","url_meta":{"origin":1297,"position":0},"title":"Of Threads, Stacks and RAM &#8211; Part 2","date":"29 Nov 2012","format":false,"excerpt":"In the previous post in this series, we saw that the number of threads that a given process could support was determined by a number of factors, of which the stack size reserved for each thread was key. We also saw how we could change the stack size used by\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":735,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/735\/","url_meta":{"origin":1297,"position":1},"title":"RAD STUDIO XE2: Launch Event Report","date":"04 Aug 2011","format":false,"excerpt":"Today I was fortunate to be present in Auckland at the World Premier of the launch event for RAD Studio XE2. \u00a0There is so much good to report that I really don't know where to begin, so apologies if this post is a bit of a disorganised ramble. 
\u00a0But here\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1068,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1068\/","url_meta":{"origin":1297,"position":2},"title":"Info From the World Tour (Hamburg)","date":"22 Aug 2012","format":false,"excerpt":"In the Embarcadero forums, Roland Kossow posted his report on the first of the \"RAD Studio World Tour\" events in Hamburg yesterday, reporting on what's new in XE3 and adding some more detail to the \"XE3 And Beyond\" blog post. In a nutshell we have \"FM2\" (FireMonkey 2 - no\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1289,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1289\/","url_meta":{"origin":1297,"position":3},"title":"Suggestions for Marco &#8230; ?","date":"17 Nov 2012","format":false,"excerpt":"I have elevated this comment from David I in a previous post, to the status of a post in it's own right, in order that it might elicit the suggestions from readers\/commenters that David seeks. I have adjusted the opening wording to make more sense in the context of a\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":807,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/807\/","url_meta":{"origin":1297,"position":4},"title":"Use Knowledge of Your Own Threads to Extract Optimal Performance&#8230;","date":"29 Sep 2011","format":false,"excerpt":"\"The Delphi Geek\" recently blogged about a performance bottleneck he had identified in FastMM when used with a particular conditional define. 
Although not directly related, his post reminded me of an experience I had many years ago, working on a highly complex multi-threaded system (long before FastMM) and the strategy\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":576,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/576\/","url_meta":{"origin":1297,"position":5},"title":"Commitment Issues","date":"06 Oct 2009","format":false,"excerpt":"No, not a relationship blog and no, not a rant about the relationship between Embarcadero and the Delphi community. \u00a0This is a strictly and purely technical post about what \"Committed\" means in terms of Windows memory, and in particular a key aspect of how that applies to threaded applications. Last\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/1297"}],"collection":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/comments?post=1297"}],"version-history":[{"count":20,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/1297\/revisions"}],"predecessor-version":[{"id":1323,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/1297\/revisions\/1323"}],"wp:attachment":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/media?parent=1297"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/categories?post=1297"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/tags?post=1297"}],"curies":[{"name":"wp","hre
f":"https:\/\/api.w.org\/{rel}","templated":true}]}}