Roberto Schneiders recently drew my attention to the first post on his new blog (which I can recommend as a good read), presenting the results of some DataSnap performance testing that he had been involved with. The results proved to be very interesting (if initially somewhat disappointing).
But my post isn’t about that, at least not directly.
One of the characteristics noted about the Indy-based system on which DataSnap is implemented was its memory utilisation, which was significantly higher than that of the other frameworks in the comparison.
Further, commenter upow made the observation that Indy (by default) uses one thread per request and that Windows supports only 2,000 threads.
These two observations are more directly connected than you might at first think, although the 2,000 number is not correct (but its derivation can be explained).
Before we get into the whys and the wherefores, however, let us devise a simple test to determine what the thread limit actually is on Windows. This simple test application will tell us:
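program testlimit;

{$APPTYPE CONSOLE}

uses
  Classes,
  SysUtils,
  Windows;

// Thread entry point; declared stdcall to match what CreateThread expects.
function ThreadProc(aParam: Pointer): Integer; stdcall;
begin
  Sleep(INFINITE); // park the thread forever so that it stays alive
  result := 0;
end;

var
  i: Integer = 0;
  id: Cardinal;
begin
  try
    while TRUE do
    begin
      // CreateThread returns 0 once no further thread can be created
      if CreateThread(NIL, 0, @ThreadProc, NIL, 0, id) = 0 then
        ABORT;
      Inc(i);
    end;
  except
    on e: Exception do
      WriteLn(i, ' threads is the limit');
  end;
end.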
This test code uses the Windows API directly to create as many very simple threads as it possibly can. It is a very dirty application, not doing any clean-up, but that’s OK – the point here is to determine an absolute limit of Windows, not to write a well-behaved application (don’t worry, the threads will be “cleaned up” by Windows when the process terminates).
Each thread that is created runs the same code – a simple function that immediately puts the thread to sleep. Forever. Again, bear in mind that the point is not to find out how many threads can usefully execute simultaneously, but simply how many we can actually create. The number that can actually do anything useful will be less than this absolute limit.
So whilst we don’t want the threads to be doing anything, we do need them to remain around in the system – we cannot let them exit their thread functions. Putting them into an indefinite Sleep() state is the most efficient way of ensuring this.
Compiling and running this application for 32-bit Windows (I strongly recommend you don’t compile and run for 64-bit, at least not yet) in Delphi XE3, I get the following output:
1569 threads is the limit
Somewhat less than the 2,000 that upow suggested was the case. In fact, not even really very close at all. And it seems to be a very arbitrary number. It doesn’t even look very “limit-like”. How come ?
Before we get into that, let’s see what the limit is on Win64 – maybe Microsoft increased it ?
So, assuming you have XE2 or XE3 and a Windows 64-bit environment available, add Win64 as a platform, compile and run on Win64 (you might want to get yourself a tea or a coffee while you wait for it to finish). Actually, it shouldn’t take that long, but it will take a lot longer than the Win32 version, and your machine will quickly become unusable as the threads mount up and drain resources from your system (even sleeping threads have demands).
Eventually it will complete and you should get a far, far higher number of threads established as the limit in this case. In my case:
153354 threads is the limit
A one hundred-fold increase in the number of threads !! How do we explain the difference ?
The answer is actually very simple: Stack Size
Commit and Reservation
Every thread in a process requires a stack. On Windows (at least, and quite possibly as a universal rule – I don’t honestly know) the area of memory used for a stack must be contiguous. That is, a single block of memory.
And the size of a stack (the size of that block of memory) is important.
Too small and your code will run out of room in that stack and you will get an exception – an exception which provided the inspiration and indeed the name for the stackoverflow website!
Too big and there will be parts of the memory allocated for the stack that are never used and will be wasted.
Fortunately, you can tell Windows how big you need your stack to be. But (again, on Windows) this is something that goes into the header of your executable, so you have to give this information to your compiler so that it can make the appropriate entries when writing your EXE to disk.
These settings are in the Linker options of your project:
The default settings for a Delphi project specify (in bytes) a minimum stack size of 16384 (16 KB) and a maximum of 1048576 (1 MB).
The two values are important, but it is the maximum figure that is most important when it comes to explaining the number of threads we can create.
The minimum stack size is the amount of memory that will be physically allocated for our initial stack. This is called the “commit charge”. The maximum stack size is the amount of memory that will actually be reserved for the stack, in case our stack exceeds that minimum. This is the “reservation”.
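The difference between reserving and committing can be illustrated with VirtualAlloc. This is only a sketch of the two concepts, not of how Windows actually lays out a thread stack (a real stack also has a guard page and grows downwards); the program name and the constant names are made up for the illustration, and the sizes simply mirror the Delphi defaults mentioned above.

```pascal
program reservecommit;

{$APPTYPE CONSOLE}

uses
  Windows;

const
  // Illustrative names; values mirror the default Delphi stack settings.
  RESERVE_SIZE = 1024 * 1024; // 1 MB of address space to reserve
  COMMIT_SIZE  = 16 * 1024;   // 16 KB to commit up front

var
  base: Pointer;
begin
  // Reserve a contiguous range of address space. Nothing is usable yet,
  // but no other allocation can now be placed inside this range.
  base := VirtualAlloc(nil, RESERVE_SIZE, MEM_RESERVE, PAGE_NOACCESS);
  if base = nil then
    WriteLn('Reserve failed, error ', GetLastError)
  else
  begin
    // Commit only the first 16 KB of the reservation so it can be used.
    if VirtualAlloc(base, COMMIT_SIZE, MEM_COMMIT, PAGE_READWRITE) = nil then
      WriteLn('Commit failed, error ', GetLastError)
    else
      WriteLn('Reserved ', RESERVE_SIZE div 1024, ' KB, committed ',
        COMMIT_SIZE div 1024, ' KB');
    // Release the whole reservation (the size must be 0 with MEM_RELEASE).
    VirtualFree(base, 0, MEM_RELEASE);
  end;
end.
```

The point to take away is that the 1 MB reservation consumes address space even though only 16 KB of it is committed, and in a 32-bit process address space is the resource that runs out first.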
We can easily test the effect of these minimum and maximum sizes on our simple test application.
Just to make it easier, we can use compiler directives to set these values, instead of having to keep going into our Project Options, so add these two directives after the $APPTYPE directive:
{$APPTYPE CONSOLE}
{$MINSTACKSIZE 16384}
{$MAXSTACKSIZE 65536}
This leaves the initial commit charge for each thread’s call stack unchanged at 16 KB, but drastically reduces the maximum size – and thus the amount of memory reserved for each stack – to just 64 KB.
Recompile and run again (I suggest you do this on Win32, to save time). You should get something similar to this output:
6076 threads is the limit
NOTE: If the results you get are significantly different from those I am presenting here, bear in mind that each machine will be slightly different because the limit is driven by the particular hardware and software in each case.
For context, I am conducting these tests inside a Win64 virtual machine (hosted on a Mac) with 4GB of RAM allocated to that VM.
In any event, clearly there is no absolute limit of 2,000 threads on Win32.
There is a practical limit however, which is a function of the amount of addressable memory, the amount of available memory, the degree of fragmentation of that memory, and the reservation of memory required for the stack for each thread.
Even the implementation details of different memory managers will have an influence.
It is the interaction of these variables, combined with the impact of the various hidden behaviours of complex modern software that results in the increase in the number of threads being somewhat lower than we might have expected.
64 KB is just one sixteenth of 1 MB, but we do not increase the number of threads by a commensurate factor of 16, in fact achieving slightly less than 4 times as many threads.
The influence of memory managers can be seen by simply switching to a different one.
The 6076 result was achieved with both the default (FastMM) and also with the 4.99.1 release of FastMM. However when repeating the test using ScaleMM (2.12) the number fell slightly to 6073. Not a huge difference, but a difference nonetheless.
Even simply removing the Classes unit from the uses list (included in anticipation of the next post in this series, but not actually required by this test program) will have an impact. The figure increases slightly to 6086, as removing the Classes unit reduces the size of the exe loaded into the process memory, thus reducing the amount of memory used by the process, making that additional address space available for use as thread stack(s).
The limit of 2,000 threads that upow quoted most likely derives from the fact that, with a reservation of 1 MB for each stack and 2 GB of user address space per process, there is a “perfect”, theoretical limit of roughly 2,000 threads (2,048, to be exact) per process on 32-bit Windows. In practice, however, this perfect limit cannot be reached with that amount of stack per thread, since no process can start with 100% of its address space available to be dedicated purely to serving as stacks for its threads.
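The arithmetic behind these theoretical ceilings is easy to check. The following is just a sketch with illustrative constant names; it assumes the standard 2 GB of user address space for a 32-bit process and ignores everything else that competes for that space. The measured figures are the ones reported earlier in this post.

```pascal
program stacklimits;

{$APPTYPE CONSOLE}

const
  // Illustrative names; 2 GB is the default user address space on Win32.
  USER_ADDRESS_SPACE = Int64(2) * 1024 * 1024 * 1024;
  ONE_MB             = 1024 * 1024;
  SIXTY_FOUR_KB      = 64 * 1024;

begin
  // Address space divided by the per-thread stack reservation.
  WriteLn('1 MB reservation : ', USER_ADDRESS_SPACE div ONE_MB,
    ' threads in theory, 1569 measured');        // 2048 in theory
  WriteLn('64 KB reservation: ', USER_ADDRESS_SPACE div SIXTY_FOUR_KB,
    ' threads in theory, 6076 measured');        // 32768 in theory
end.
```

The gap between the theoretical and measured figures is far wider in the 64 KB case, which suggests that once the reservation is that small, the stack is no longer the dominant per-thread cost.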
All of this is very comprehensively explained by Mark Russinovich, so in the next post I shall look at the tools available to us to work within these limits.
-
Finer-grained control over stack size is available in the CreateThread API. If stack reserve size ever becomes an issue, it is best to choose reserve sizes per thread rather than forcing all threads in a process to use smaller stacks.
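A minimal sketch of what this could look like, assuming a 64 KB reservation is enough for the thread in question. The flag value is STACK_SIZE_PARAM_IS_A_RESERVATION from the Windows SDK; it is declared here under a made-up local name in case the Windows unit of a given Delphi version does not declare it. Without the flag, the size parameter is treated as the initial commit size rather than the reservation. The program and procedure names are illustrative only.

```pascal
program perthreadstack;

{$APPTYPE CONSOLE}

uses
  Windows;

const
  // Local, illustrative name for STACK_SIZE_PARAM_IS_A_RESERVATION.
  STACK_RESERVE_FLAG = $00010000;

// Thread entry point; stdcall to match what CreateThread expects.
function SmallStackProc(aParam: Pointer): Integer; stdcall;
begin
  Sleep(1000); // stand-in for work that needs very little stack
  Result := 0;
end;

var
  h: THandle;
  id: Cardinal;
begin
  // Ask for a 64 KB reservation for this one thread only; every other
  // thread in the process keeps the reservation from the EXE header.
  h := CreateThread(nil, 64 * 1024, @SmallStackProc, nil,
    STACK_RESERVE_FLAG, id);
  if h = 0 then
    WriteLn('CreateThread failed, error ', GetLastError)
  else
  begin
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    WriteLn('Thread ', id, ' ran with a 64 KB stack reservation');
  end;
end.
```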
-
Very nice article, I didn’t realise that limit came from the stack.
But the whole DataSnap performance issue is, IMO, the cost of creating and destroying threads, which kills the performance.
If DataSnap is as abstract, isolated and flexible as EMB says, then there could probably be another network layer for it: DataSnap over Synapse, or DataSnap over TWebsocket, or some other actor-based framework.
-
-
>Testlimit.exe -t 1000 -n 100
Testlimit v5.2 – test Windows limits
Copyright (C) 2012 Mark Russinovich
Sysinternals – http://www.sysinternals.com
Process ID: 4912
Creating threads…
Created 1534 threads. Lasterror: 8
Not enough storage is available to process this command.
-
To be honest, my comment did not have the goal of emphasizing the “limitation” of 2000 threads. 2000 was a good enough number to show a typical limit that exists in a common scenario on a Win32 system. I do not even count the case with the /3G boot switch – the topic of memory management and performance is huge enough. It was derived from Mark Russinovich’s “Pushing Windows to the limits” series.
The comment in Roberto Schneiders’ blog was a light copy of my original comment on stackOverflow: http://stackoverflow.com/a/13218180/1022219 that I made a few weeks ago.
I was so pleasantly surprised to see Roberto Schneiders (i.e. a developer) do such advanced tests and analysis of performance and reliability (well, I’m mostly a Windows guy with development as a hobby only).
The main point was that relying on “one task per thread” is a bad decision, resource consuming and leads to huge performance degradation no matter if you use thread pools or not. DataSnap “meets” all these **bad** requirements, mostly because it’s based on Indy.
I remember a good comment in a forum in the following form: “…if your program uses so many threads – then you probably missed something… It’s time to redesign it!…” – nice and true
**Please**, do not forget the other part: the nasty little system administrators (yeah, me too) – they will fight by all means not to admit software that consumes (mostly) all of the servers’ resources. They also need resources for support, maintenance and running other software on these servers.
EMBT should take **real** care of all these issues. I (as a customer) am not satisfied (as many others may be?), because now we pay Embarcadero at Enterprise level prices for suspiciously “Enterprise” solutions. This was an open secret, but now the truth was made public.
Anyway: +2000 for posting on “Threads, Stacks and RAM” – vital content, written in a clear and understandable way! Waiting for the next parts!!!
Best Regards!
Upow, iPath or just Petar
-
Btw, “…1569 threads is the limit, Somewhat less than the 2,000…” is not very precise – you count only the threads that your program has created. What about the existing ones? What does Task Manager show in Performance/Threads when you run the program? On my system there are about 400 – 600 existing threads?
-
Confusion of mine… I’ve incorrectly counted the threads from other processes. Thanks for pointing it out.
-
Yes, and I’ve a QC entry (#77203) asking to surface that in the TThread class creator – still open…
-
The final paragraph of Russinovich’s post is something DataSnap should consider: “For instance, the general goal for a scalable application is to keep the number of threads running equal to the number of CPUs (with NUMA changing this to consider CPUs per node) and one way to achieve that is to switch from using synchronous I/O to using asynchronous I/O and rely on I/O completion ports to help match the number of running threads to the number of CPUs.” There are also other techniques like queues and so on.
-
Setting up a thread requires kernel calls, the setup of some internal Windows structures, and some security checks. In Windows, threads are lighter than processes, but creating (and destroying) a thread still has some overhead. Russinovich’s “Windows Internals” has a detailed explanation of it.
Windows has a lighter implementation – fibers – but these are not scheduled by the kernel; it is up to the application using them to perform the scheduling. They may be useful if the knowledge the application has about how to schedule them can really outsmart the OS scheduling, but for most tasks threads are better.