Of Threads, Stacks and RAM – Part 1


Roberto Schneiders recently drew my attention to the first post on his new blog (which I can recommend as a good read 🙂 ). It presents the results of some DataSnap performance testing that he had been involved with, and they proved to be very interesting (if initially somewhat disappointing).

But my post isn’t about that, at least not directly.

One of the characteristics noted about the Indy-based system on which DataSnap is implemented was its memory utilisation, which was significantly higher than that of the other frameworks in the comparison.

Further, commenter upow made the observation that Indy (by default) uses one thread per request and that Windows supports only 2,000 threads.

These two observations are more directly connected than you might at first think, although the 2,000 number is not correct (but its derivation can be explained).

Before we get into the whys and wherefores, however, let us devise a simple test to determine what the thread limit actually is on Windows. This simple test application will tell us:

program testlimit;

{$APPTYPE CONSOLE}

uses
  Classes,
  SysUtils,
  Windows;

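// Thread entry point: CreateThread expects a stdcall routine taking one pointer
function ThreadProc(aParam: Pointer): Integer; stdcall;
begin
  Sleep(INFINITE);  // park the thread forever so that it (and its stack) stays alive
  result := 0;
end;


var
  i: Integer = 0;
  id: Cardinal;
begin
  try
    while TRUE do
    begin
      // A stack size of 0 means "use the default recorded in the EXE header"
      if CreateThread(NIL, 0, @ThreadProc, NIL, 0, id) = 0 then
        ABORT;  // raises EAbort, which is caught below
      Inc(i);
    end;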

  except
    on e: Exception do
      WriteLn(i, ' threads is the limit');
  end;
end.


This test code uses the Windows API directly to create as many very simple threads as it possibly can. It is a very dirty application, not doing any clean-up, but that’s OK – the point here is to determine an absolute limit of Windows, not to write a well behaved application (don’t worry, the threads will be “cleaned up” by Windows when the process terminates).

Each thread that is created runs the same code – a simple function that immediately puts the thread to sleep. Forever. Again, bear in mind that the point is not to find out how many threads can usefully execute simultaneously, but simply how many we can actually create. The number that can actually do anything useful will be less than this absolute limit.

So whilst we don’t want the threads to be doing anything, we do need them to remain around in the system – we cannot let them exit their thread functions. Putting them into an indefinite Sleep() state is the most efficient way of ensuring this.

Compiling and running this application for 32-bit Windows (I strongly recommend you don’t compile and run for 64-bit, at least not yet) in Delphi XE3, I get the following output:

  1569 threads is the limit

Somewhat less than the 2,000 that upow suggested was the case. In fact, not even really very close at all. And it seems to be a very arbitrary number. It doesn’t even look very “limit-like”. How come ?

Before we get into that, let’s see what the limit is on Win64 – maybe Microsoft increased it ?

So, assuming you have XE2 or XE3 and a Windows 64-bit environment available, add Win64 as a platform, compile and run on Win64 (you might want to get yourself a tea or a coffee while you wait for it to finish). Actually, it shouldn’t take that long, but it will take a lot longer than the Win32 version, and your machine will quickly become unusable as the threads mount up and drain resources from your system (even sleeping threads have demands).

Eventually it will complete and you should get a far, far higher number of threads established as the limit in this case. In my case:

  153354 threads is the limit

A one hundred-fold increase in the number of threads !! How do we explain the difference ?

The answer is actually very simple: Stack Size

Commit and Reservation

Every thread in a process requires a stack. On Windows (at least, and quite possibly as a universal rule – I don’t honestly know) the area of memory used for a stack must be contiguous. That is, a single block of memory.

And the size of a stack (the size of that block of memory) is important.

Too small and your code will run out of room in that stack and you will get an exception – an exception which provided the inspiration and indeed the name for the stackoverflow website!

Too big and there will be parts of the memory allocated for the stack that are never used and will be wasted.

Fortunately, you can tell Windows how big you need your stack to be. But (again, on Windows) this is something that goes into the header of your executable, so you have to give this information to your compiler so that it can make the appropriate entries when writing your EXE to disk.

These settings are in the Linker options of your project.

The defaults for a Delphi project specify (in bytes) a minimum stack size of 16 KB (16384) and a maximum of 1 MB (1048576).

The two values are important, but it is the maximum figure that is most important when it comes to explaining the number of threads we can create.

The minimum stack size is the amount of memory that will be physically allocated for our initial stack. This is called the “commit charge”. The maximum stack size is the amount of memory that will actually be reserved for the stack, in case our stack exceeds that minimum. This is the “reservation”.
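To make the distinction between reserving and committing a little more concrete, here is a minimal sketch that uses VirtualAlloc directly. It is only an analogy: Windows sets up thread stacks itself and grows the committed portion on demand, so this is not how a stack is actually created. The program name and the two constants are my own illustrative choices, picked to mirror the default stack settings.

```pascal
program reservedemo;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  Windows;

const
  RESERVE_SIZE = 1024 * 1024; // 1 MB, mirroring the default maximum stack size
  COMMIT_SIZE  = 16 * 1024;   // 16 KB, mirroring the default minimum stack size

var
  base: Pointer;
begin
  // Reserve a contiguous 1 MB range of address space; no memory backs it yet
  base := VirtualAlloc(NIL, RESERVE_SIZE, MEM_RESERVE, PAGE_NOACCESS);
  if base = NIL then
    RaiseLastOSError;

  // Commit only the first 16 KB of that range so that it can actually be used
  if VirtualAlloc(base, COMMIT_SIZE, MEM_COMMIT, PAGE_READWRITE) = NIL then
    RaiseLastOSError;

  WriteLn('Reserved ', RESERVE_SIZE div 1024, ' KB, committed ',
          COMMIT_SIZE div 1024, ' KB');

  // Releasing the reservation also decommits anything committed within it
  VirtualFree(base, 0, MEM_RELEASE);
end.
```

The reservation is what consumes address space, whether or not the memory is ever touched, which is why it is the maximum stack size that matters most for the thread count.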

We can easily test the effect of these minimum and maximum sizes on our simple test application.

Just to make it easier, we can use compiler directives to set these values, instead of having to keep going into our Project Options, so add these two directives after the $APPTYPE directive:

{$APPTYPE CONSOLE}
{$MINSTACKSIZE 16384}
{$MAXSTACKSIZE 65536}

This leaves the initial commit charge for each thread’s call stack unchanged at 16 KB, but drastically reduces the maximum size – and thus the amount of memory reserved for each stack – to just 64 KB.

Recompile and run again (I suggest you do this on Win32, to save time). You should get something similar to this output:

  6076 threads is the limit

NOTE: If the results you get are significantly different from those I am presenting here, bear in mind that each machine will be slightly different because the limit is driven by the particular hardware and software in each case.

For context, I am conducting these tests inside a Win64 virtual machine (hosted on a Mac) with 4GB of RAM allocated to that VM.

In any event, clearly there is no absolute limit of 2,000 threads on Win32.

There is a practical limit however, which is a function of the amount of addressable memory, the amount of available memory, the degree of fragmentation of that memory, and the reservation of memory required for the stack for each thread.

Even the implementation details of different memory managers will have an influence.

It is the interaction of these variables, combined with the impact of the various hidden behaviours of complex modern software that results in the increase in the number of threads being somewhat lower than we might have expected.
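Of these factors, the amount of address space is the easiest to inspect for yourself. The following is a minimal sketch, assuming the GlobalMemoryStatusEx API and the TMemoryStatusEx record as declared in Delphi's Windows unit; the program name is just an illustrative choice. It reports how much user address space the process has in total and how much of it is still unreserved.

```pascal
program addressspace;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  Windows;

const
  MB = 1024 * 1024;

var
  status: TMemoryStatusEx;
begin
  // The API requires the record size to be filled in before the call
  status.dwLength := SizeOf(status);
  if not GlobalMemoryStatusEx(status) then
    RaiseLastOSError;

  // Total user-mode address space available to this process
  WriteLn('Total user address space : ', status.ullTotalVirtual div MB, ' MB');

  // Portion of that address space neither reserved nor committed
  WriteLn('Unreserved address space : ', status.ullAvailVirtual div MB, ' MB');
end.
```

Note that the unreserved figure says nothing about fragmentation: a stack needs a single contiguous block, so the usable space may be less than the total suggests.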

64 KB is just one sixteenth of 1 MB, but we do not increase the number of threads by a commensurate factor of 16; in fact we achieve slightly less than 4 times as many threads (6076 against 1569).

The influence of memory managers can be seen by simply switching to a different one.

The 6076 result was achieved with both the default (FastMM) and also with the 4.99.1 release of FastMM. However when repeating the test using ScaleMM (2.12) the number fell slightly to 6073. Not a huge difference, but a difference nonetheless.

Even simply removing the Classes unit from the uses list (included in anticipation of the next post in this series, but not actually required by this test program) will have an impact. The figure increases slightly to 6086, as removing the Classes unit reduces the size of the exe loaded into the process memory, thus reducing the amount of memory used by the process, making that additional address space available for use as thread stack(s).

The limit of 2,000 threads that upow quoted most likely originally derived from the fact that with a reservation of 1 MB for each stack, and with 2 GB of user address space per process, there is a “perfect”, theoretical limit of roughly 2,000 (2,048 to be exact) threads per process on 32-bit Windows. In practice, however, this perfect limit cannot be reached with that amount of stack per thread, since no process can start with 100% of its address space available to be dedicated purely to serve as stacks for its threads.
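The arithmetic behind that theoretical ceiling can be sketched in a few lines. This assumes the standard 2 GB of user address space and ignores everything else a process needs, so the results are upper bounds only; the MaxStacks function and the program name are hypothetical names of my own.

```pascal
program theoreticallimit;

{$APPTYPE CONSOLE}

const
  // 2 GB of user address space, expressed in KB to stay within Integer range
  USER_SPACE_KB = 2 * 1024 * 1024;

// Number of stack reservations of the given size (in KB) that would fit
// if the entire user address space could be given over to stacks
function MaxStacks(aReserveKB: Integer): Integer;
begin
  result := USER_SPACE_KB div aReserveKB;
end;

begin
  WriteLn('1 MB reservation : ', MaxStacks(1024), ' stacks at most');
  WriteLn('64 KB reservation: ', MaxStacks(64), ' stacks at most');
end.
```

The two ceilings are 2,048 and 32,768 stacks, against the 1569 and 6076 threads actually measured, which shows how far the practical limit sits below the theoretical one, particularly with the smaller reservation.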

All of this is very comprehensively explained by Mark Russinovich, so in the next post I shall look at the tools available to us to work within these limits.

25 thoughts on “Of Threads, Stacks and RAM – Part 1”

boesystems says:

28 Nov 2012 at 21:23

Very useful insight, thank you.
David Heffernan says:

28 Nov 2012 at 21:24

Finer grained control over stack size is available in the CreateThread API. If stack reserve size ever becomes an issue, it is best to choose reserve sizes per thread rather than forcing all threads in a process to use smaller stacks.
1. Deltics says:
  
  28 Nov 2012 at 21:30
  
  Precisely – you’re pre-empting my posts (you saw this is “Part 1”, right?) 😉
  1. Arioch says:
    
    28 Nov 2012 at 22:07
    
    and u would probably touch that win32 process can have 3gb virtual space, not just 2 gb 🙂
    1. Deltics says:
      
      28 Nov 2012 at 22:12
      
      That is covered in Mark Russinovich’s article, but I shall come to that in a later post. 🙂
    2. Alexander Alexeev says:
      
      28 Nov 2012 at 22:55
      
      Actually, a Win32 process can have up to 4 GB of virtual address space. Think about a Win32 process with LARGE_ADDRESS_AWARE running on Win64.
      1. Deltics says:
        
        29 Nov 2012 at 07:05
        
        No, the maximum is 3GB even with Large Address Aware – 1GB is always reserved for the kernel. I think you can achieve more in effect using paging mechanisms within your own code, but I don’t think you can apply those techniques to the memory required for call stacks.
        
        But that really isn’t that important, other than to determine the absolute maximum theoretical limit. The practical limit will always be vastly less than this – the object of the exercise here is not to figure out how to create the most threads, but to identify what factors contribute to the limits.
        
        David Heffernan says:
        
        29 Nov 2012 at 09:01
        
        LARGE_ADDRESS_AWARE processes have a 4GB user address space when run in WOW64 on x64. On x86 you get 3GB if you boot with /3GB.
        
        Deltics says:
        
        29 Nov 2012 at 09:14
        
        Ah yes, I forgot that wrinkle.
gaetan maerten says:

28 Nov 2012 at 22:05

Very nice article, I didn’t realise that limit came from the stack.
But the whole DataSnap performance issue is, IMO, the cost of creating and destroying threads, which kills performance.
1. Arioch says:
  
  28 Nov 2012 at 22:09
  
  if DataSnap is as abstract, isolated and flexible as EMB says, then there could probably be another network layer for it: DataSnap over Synapse, or DataSnap over TWebsocket, or some other actor-based framework
2. Deltics says:
  
  28 Nov 2012 at 22:17
  
  Thanks. I don’t know how much of the “cost” of thread creation stems from the need to locate the contiguous memory block for the thread’s stack, but I suspect that it is at least partly responsible, though to what extent I don’t know.
  
  I just had an idea as to how we might find out though. Watch out for the next post… 🙂
Eurides Baptistella (@ebaptistella) says:

28 Nov 2012 at 22:33

Very enlightening, thank you for sharing your knowledge ….
Stuart says:

28 Nov 2012 at 23:48

Thanks, good post I learned something 🙂
IL says:

29 Nov 2012 at 03:33

>Testlimit.exe -t 1000 -n 100
Testlimit v5.2 – test Windows limits
Copyright (C) 2012 Mark Russinovich
Sysinternals – http://www.sysinternals.com
Process ID: 4912
Creating threads…
Created 1534 threads. Lasterror: 8
Not enough storage is available to process this command.
iPath says:

29 Nov 2012 at 06:49

To be honest my comment had not the goal to emphasize the “limitation” of 2000 threads. 2000 was a good number enough to show a typical limit that exists in a common scenario on a Win32 system. I do not even count the case with /3G boot switch – the topic about memory management and performance is huge enough. It was derived from Mark Russinovich’s “Pushing Windows to the limits” series.

The comment in Roberto Schneiders’ blog was a light copy of my original comment on stackOverflow: http://stackoverflow.com/a/13218180/1022219 that I made a few weeks ago.
I was so pleasantly surprised to see Roberto Schneiders (i.e. a developer) doing such advanced tests and analysis of performance and reliability (well, I’m mostly a Windows guy with development as a hobby only 🙂 )

The main point was that relying on “one task per thread” is a bad decision, resource consuming and leads to huge performance degradation no matter if you use thread pools or not. DataSnap “meets” all these **bad** requirements, mostly because it’s based on Indy.
I remember a good comment in a forum in the following form: “…if your program uses so many threads – then you probably missed something…It’s time and you should redesign it!…” – nice and true 🙂

**Please**, do not forget the other part: the nasty little system administrators (yeah, me too :P) – they will fight by all means not to admit a software that consumes (mostly) all servers’ resources. They also need resources for support, maintenance and running other software on these servers 🙂

EMBT should take **real** care of all these issues. I (as a customer) am not satisfied (as many others may be?), because now we pay Embarcadero at Enterprise level prices for suspiciously “Enterprise” solutions. This was an open secret, but now the truth was made public.

Anyway: +2000 🙂 for posting on “Threads, Stacks and RAM” – vital content, written in clear and understandable way! Waiting for the next parts!!!

Best Regards!
Upow, iPath or just Petar 😉
1. Deltics says:
  
  29 Nov 2012 at 07:12
  
  Thanks for the comment Petar. And please don’t think that I was “calling you out” on the 2,000 number. That certainly wasn’t my intention, it was merely the “hook” on which to hang the subject in the post. 🙂
  
  And yes, as you point out in those other comments, the number of threads in any architecture should not be constrained simply by the amount of available address space, so this series of posts is about increasing the understanding of some quite technical aspects of the runtime environment, to hopefully equip people to design those better architectures we desire, not about teaching how to squeeze more and more threads into a given amount of RAM. 🙂
  
  In fact, a real world experience many years ago that taught me the lesson of “Sometimes Fewer Threads, Not Always More” will form the conclusion in this series.
iPath says:

29 Nov 2012 at 06:50

Btw, “…1569 threads is the limit, Somewhat less than the 2,000…” is not very precise – you count only the threads that your program has created. What about the existing ones? What does Task Manager show in Performance/Threads when you run the program? On my system there are about 400 – 600 existing threads?
1. Deltics says:
  
  29 Nov 2012 at 07:27
  
  I’m not sure where you are seeing the 400-600 other threads, except perhaps in other processes. In a simple Delphi console application there is only one additional thread, that for the main console application process itself.
  
  The thread “limits” being addressed in this series are per process, since we are looking at the per process factors: address space, available memory and stack size. The system-wide “thread limit” is a lot more complicated and way beyond the scope of this series. 🙂
iPath says:

29 Nov 2012 at 07:45

Confusion of mine…I’ve incorrectly counted the threads from other processes. Thanks for pointing it out :)))
LDS says:

29 Nov 2012 at 20:16

Yes, and I’ve a QC entry (#77203) asking to surface that in the TThread class creator – still open…
LDS says:

29 Nov 2012 at 20:54

The final paragraph of Russinovich’s post is something DataSnap should consider: “For instance, the general goal for a scalable application is to keep the number of threads running equal to the number of CPUs (with NUMA changing this to consider CPUs per node) and one way to achieve that is to switch from using synchronous I/O to using asynchronous I/O and rely on I/O completion ports to help match the number of running threads to the number of CPUs.” There are also other techniques like queues and so on.
Pingback: Te Waka o Pascal · Of Threads, Stacks and RAM – Part 2
LDS says:

29 Nov 2012 at 23:27

Setting up a thread requires kernel calls, the setup of some internal Windows structures, and some security checks. In Windows threads are lighter than processes, but creating (and destroying) a thread still has some overhead. Russinovich’s “Windows Internals” has a detailed explanation of it.
Windows has a lighter implementation – fibers, but these are not scheduled by the kernel; it is up to the application using them to perform scheduling. They may be useful if the knowledge the application has about how to schedule them can really outsmart the OS scheduling, but for most tasks threads are better.
Wodzu says:

30 Nov 2012 at 01:47

This was a very interesting read. Keep them coming Deltics :)

Comments are closed.
