{"id":1330,"date":"2012-11-29T22:35:00","date_gmt":"2012-11-29T10:35:00","guid":{"rendered":"https:\/\/www.deltics.co.nz\/blog\/?p=1330"},"modified":"2012-11-29T22:35:00","modified_gmt":"2012-11-29T10:35:00","slug":"of-threads-stacks-and-ram-part-2","status":"publish","type":"post","link":"https:\/\/www.deltics.co.nz\/blog\/posts\/1330\/","title":{"rendered":"Of Threads, Stacks and RAM &#8211; Part 2"},"content":{"rendered":"<span class=\"rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">[Estimated Reading Time: <\/span> <span class=\"rt-time\">5<\/span> <span class=\"rt-label rt-postfix\">minutes]<\/span><\/span><p>In <a href=\"https:\/\/www.deltics.co.nz\/blog\/?p=1297\" target=\"_blank\">the previous post in this series<\/a>, we saw that the number of threads that a given process could support was determined by a number of factors, of which the stack size reserved for each thread was key.<\/p>\n<p>We also saw how we could change the stack size used by our application and how this could increase the number of threads that our process could support.  But if you thought it seemed a bit crude to have to set a single stack size for all the threads in a process (including the main thread), then you would be right, and we can do something about this.<\/p>\n<p><!--more--><\/p>\n<p>If we go back and look at our <strong>testlimit.dpr<\/strong> code, we can see a number of parameters supplied to the Windows <strong>CreateThread<\/strong> function:<\/p>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\n   if CreateThread(NIL, 0, @ThreadProc, NIL, 0, id) = 0 then\n<\/pre>\n<p>The two 0&#8217;s are what we are interested in: the <strong>2nd<\/strong> and <strong>5th<\/strong> parameters.<\/p>\n<p>The <em>2nd<\/em> parameter is actually <em>stack size<\/em> !!  By leaving this as zero in our test code, we were indicating that we wanted each thread to use the default stack size &#8211; i.e. 
that established for the process itself, as specified in the <strong>Linker<\/strong> options or by the <strong>MINSTACKSIZE<\/strong>\/<strong>MAXSTACKSIZE<\/strong> directives.<\/p>\n<p>But with this stack size parameter to the <strong>CreateThread<\/strong> function we can override the process default stack size with a size (in bytes) more appropriate to the needs of the particular thread we are creating !<\/p>\n<p>But wait! There are <strong><em>two<\/em><\/strong> stack sizes &#8211; a minimum <em>and<\/em> a maximum.  But there is only one stack size parameter !  How come ?<\/p>\n<p>The answer is that the stack size parameter is in effect &#8220;overloaded&#8221;.<\/p>\n<p>By default, the stack size specified in the 2nd parameter is used to establish the commit charge (minimum stack size) for the thread stack.  But as we saw in the previous instalment, it is the larger, reservation value (maximum size) that is most important in this exercise.<\/p>\n<p>This is where the <em>5th<\/em> parameter comes in.<\/p>\n<p>This parameter accepts flags that we can pass to control certain aspects of the thread creation process.  <strong>CREATE_SUSPENDED<\/strong> is one such flag.  Another (indeed the <em>only<\/em> other, currently as far as I know) is the long and rather cumbersomely named <strong>STACK_SIZE_PARAM_IS_A_RESERVATION<\/strong> flag.<\/p>\n<p>If we pass this flag then the stack size parameter value will now be interpreted as establishing the <em>reservation<\/em> size for the thread&#8217;s stack (the maximum).<\/p>\n<p>What you will quickly have noticed however, is that you cannot specify both the minimum <em>and<\/em> maximum stack size for a thread.  It&#8217;s one or the other.  
Whichever one you set, the other will still be taken from your application defaults (there are some funny rules in play that mean that you won&#8217;t always get what you ask for, but we shall look at those later).<\/p>\n<p>For now, it&#8217;s enough that in most cases the stack size that you will wish to limit for a given thread will be the maximum (reservation).<\/p>\n<p>So, to test the effect this has, let us restore the process default limits to the usual Delphi defaults by changing the stack size directives that we added to the code back to the initial values we used (16 KB and 1 MB respectively):<\/p>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\n  {$MINSTACKSIZE 16384}\n  {$MAXSTACKSIZE 1048576}\n<\/pre>\n<p>Just to make sure, compile and run (for Win32 remember) and ensure that your thread limit is reported as being back down around 1567 (or whatever result you originally saw with the 1 MB stack size in your case).<\/p>\n<p>Now let&#8217;s make the changes needed to exercise some control over the individual threads.<\/p>\n<p>First, we must declare <strong>STACK_SIZE_PARAM_IS_A_RESERVATION<\/strong> because this flag is not actually defined for us in the <strong>Windows<\/strong> unit, or any other unit for that matter (<em>this and the fact that we aren&#8217;t using <strong>TThread<\/strong> for these tests should have set an alarm bell ringing if it wasn&#8217;t already &#8211; I think you can guess where this is headed&#8230;<\/em>).<\/p>\n<p>We will also add a constant for the desired thread stack size, let&#8217;s start with 64 KB.  
Finally, we need to apply both our desired thread stack size and the flag to the <strong>CreateThread<\/strong> call, ending up with:<\/p>\n<pre class=\"brush: delphi; title: ; notranslate\" title=\"\">\nprogram testlimit2;\n\n{$APPTYPE CONSOLE}\n{$MINSTACKSIZE 16384}\n{$MAXSTACKSIZE 1048576}\n\nuses\n  SysUtils,\n  Windows;\n\n\/\/ CreateThread requires a stdcall thread procedure taking a single pointer parameter\nfunction ThreadProc(aParam: Pointer): Integer; stdcall;\nbegin\n  \/\/ Park the thread forever so that its stack stays reserved\n  Sleep(INFINITE);\n  result := 0;\nend;\n\nconst\n  STACK_SIZE_PARAM_IS_A_RESERVATION = $10000;  \/\/ not declared in the Windows unit\n  THREAD_STACK = 65536;  \/\/ 64 KB reserved per thread\n\nvar\n  i: Integer = 0;\n  id: Cardinal;\nbegin\n  try\n    while TRUE do\n    begin\n      \/\/ The flag makes THREAD_STACK the reservation (maximum) rather than the commit size\n      if CreateThread(NIL, THREAD_STACK, @ThreadProc, NIL, STACK_SIZE_PARAM_IS_A_RESERVATION, id) = 0 then\n        ABORT;\n      Inc(i);\n    end;\n\n  except\n    WriteLn(i, ' threads is the limit');\n  end;\nend.\n<\/pre>\n<p>Again, <a href='https:\/\/www.deltics.co.nz\/blog\/wp-content\/uploads\/testlimit2.zip'>click to get the (zipped) code<\/a>.<\/p>\n<p>If you compile and run this, you should now see your thread limit restored to the higher limit (6076 in my case &#8211; your mileage may vary), despite the fact that the default stack size, and the stack for the main thread in the process itself, is still 1 MB.<\/p>\n<p>In the next article in this series we shall look at how we sometimes will not get what we ask for, and consider why.<\/p>\n<p>But to close this article, I shall address a couple of points raised in the comments on the first installment.<\/p>\n<h3>Raising The 2GB Process Memory Limit on Win32<\/h3>\n<p>A couple of people pointed out that you can increase the address space for a process even on 32-bit Windows by applying the <strong>LARGE_ADDRESS_AWARE<\/strong> flag in the PE header for the executable.<\/p>\n<p>This is true, and as Mark Russinovich covers in his very detailed article on the subject, this will indeed increase the number of threads you can create in any given process.<\/p>\n<p>However, the object of the exercise in this series is to identify 
what the limiting factors are, not to figure out ways of increasing those limits.  Quite apart from anything else, in this case, the ability to create the maximum possible number of threads in a process is not the goal &#8211; the goal is to make the best possible use of the available resources, and for threads (generally speaking) the key resource is CPU, not RAM.<\/p>\n<p>Indeed, the more threads you add to a process, the more time your process will potentially spend context switching between those threads.<\/p>\n<h3>Is The Problem The Stack Memory, or Creating The Threads ?<\/h3>\n<p>Remember that the original impetus for this series of posts was an observation that the Indy framework on which DataSnap is based suffers severe performance problems as a result of the threading model that it uses (by default).<\/p>\n<p><strong>Gaetan Maerten<\/strong> commented on the previous post, saying that <a href=\"https:\/\/www.deltics.co.nz\/blog\/?p=1297&#038;cpage=1#comment-13605\" target=\"_blank\">he thought it was the creation\/destruction of the threads<\/a> that was the cause of the performance bottleneck in Indy, rather than the memory usage per se.<\/p>\n<p>To test this, we can time the code that creates the maximum number of threads, then replace the thread creation with a simple allocation of the same amount of memory as the stacks for those threads would require (<strong>AllocMem(THREAD_STACK)<\/strong>), time the same number of allocations, and compare the two results.<\/p>\n<p>What I found was that the code that simply allocated a given number of memory blocks took only 25%-40% of the time required to create the same number of threads.<\/p>\n<p>So Gaetan is correct &#8211; it is the creation and destruction of threads that is the source of the majority of the bottleneck, with the cost of the stack allocation for those threads accounting, in crude terms, for between 25% and 40% of the total thread creation cost.<\/p>\n<p>This points the way toward a 
more efficient, scalable parallel architecture which we shall look at in a later post.<\/p>\n<h3>Less Is [Sometimes] More<\/h3>\n<p>Both of these points &#8211; context switching and thread creation overhead &#8211; mean that in many cases the most efficient use of threads involves using fewer threads, not more, although even that is not always a hard and fast rule.<\/p>\n<p>Before I conclude this series I shall share an experience from many years ago that taught me how reducing the number of threads, even in a relatively simple, yet theoretically highly parallel, architecture, actually led to significant improvements in performance.<\/p>\n<p>But before anyone latches on to that as a touchstone for multi-threading success, we also found in <strong>the same system<\/strong> that sometimes we needed to <em>increase<\/em> the number of threads.<\/p>\n<p>But, as I say, that&#8217;s for a later post.<\/p>\n","protected":false},"excerpt":{"rendered":"<p><span class=\"rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">[Estimated Reading Time: <\/span> <span class=\"rt-time\">5<\/span> <span class=\"rt-label rt-postfix\">minutes]<\/span><\/span> In the previous post in this series, we saw that the number of threads that a given process could support was determined by a number of factors, of which the stack size reserved for each thread was key. 
We also saw how we could change the stack size used by our application and how this [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[4],"tags":[292,194,193,93,141],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1TKYv-ls","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":1297,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1297\/","url_meta":{"origin":1330,"position":0},"title":"Of Threads, Stacks and RAM &#8211; Part 1","date":"28 Nov 2012","format":false,"excerpt":"Roberto Schneiders recently drew my attention to the first post on his new blog (which I can recommend as a good read :) ), presenting the results of some performance testing of DataSnap that he had been involved with which proved to be very interesting (if initially somewhat disappointing). But\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.deltics.co.nz\/blog\/wp-content\/uploads\/Screen-Shot-2012-11-28-at-20.24.21-.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":576,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/576\/","url_meta":{"origin":1330,"position":1},"title":"Commitment Issues","date":"06 Oct 2009","format":false,"excerpt":"No, not a relationship blog and no, not a rant about the relationship between Embarcadero and the Delphi community. \u00a0This is a strictly and purely technical post about what \"Committed\" means in terms of Windows memory, and in particular a key aspect of how that applies to threaded applications. 
Last\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1925,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1925\/","url_meta":{"origin":1330,"position":2},"title":"VCL Threading &#8211; Synchronization","date":"16 Oct 2013","format":false,"excerpt":"Although I am using Oxygene a lot these days, Delphi remains my tool of choice for Win32 (and x64) development, together with the VCL. Hence this post. A long time ago, in a galaxy far far away, Delphi was a Windows only development tool. 16 was the number of the\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1930,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/1930\/","url_meta":{"origin":1330,"position":3},"title":"VCL Threading &#8211; Indeterminate Lifetimes","date":"18 Oct 2013","format":false,"excerpt":"Sometimes when you launch a thread you don't know when it will complete whatever processing it is tasked with. Sometimes you do. Sometimes it may never complete and will require that you expressly terminate it. Usually any given thread will have a lifecycle that is at least consistently one or\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":807,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/807\/","url_meta":{"origin":1330,"position":4},"title":"Use Knowledge of Your Own Threads to Extract Optimal Performance&#8230;","date":"29 Sep 2011","format":false,"excerpt":"\"The Delphi Geek\" recently blogged about a performance bottleneck he had identified in FastMM when used with a particular conditional define. 
Although not directly related, his post reminded me of an experience I had many years ago, working on a highly complex multi-threaded system (long before FastMM) and the strategy\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":900,"url":"https:\/\/www.deltics.co.nz\/blog\/posts\/900\/","url_meta":{"origin":1330,"position":5},"title":"Porting the Objective-C CFFTPSample to XE2: Part 1","date":"04 Jul 2012","format":false,"excerpt":"On the NZ DUG email list (yes, we still have those here) a question was recently posted asking for help with getting some FTP code working on OSX, using XE2. This coincided nicely with my reaching a point in my Objective-C learning where this sort of exercise was of interest\u2026","rel":"","context":"In &quot;Delphi&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/1330"}],"collection":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/comments?post=1330"}],"version-history":[{"count":10,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/1330\/revisions"}],"predecessor-version":[{"id":1341,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/posts\/1330\/revisions\/1341"}],"wp:attachment":[{"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/media?parent=1330"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/categories?post=1330"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.deltics.co.nz\/blog\/wp-json\/wp\/v2\/tags?post=1330"}],"curies":[{"name":"wp","hr
ef":"https:\/\/api.w.org\/{rel}","templated":true}]}}