Designing Applications for High Performance - Part 1

Now that processors won’t be getting dramatically faster each year, application developers must learn how to design their applications for scalability and efficiency on multiprocessor systems. I have spent the last 20 years in SQL Server development and the Windows Server Performance Group looking into multiprocessor performance and scalability problems. Over the years, I have encountered a number of recurring patterns that I would like designers to avoid. In this three-part series, I will go over these inefficiencies and suggest ways to avoid them in order to improve application scalability and efficiency. The guidelines are oriented toward server applications, but the basic principles apply to all applications.

The underlying problem is that processors are much faster than RAM and need hardware caches; without them, they would spend most of their time waiting for memory. The effectiveness of any cache depends on locality of reference. Poor locality can reduce performance by an order of magnitude, even on a single processor. The problem is worse with multiple processors because data is often replicated in different caches, and updates must be coordinated to give the illusion of a single copy (performing the magic of cache coherency is hard). Also, applications may generate information that needs to be shared across processors, which can overload the interconnect (e.g., the bus) and slow down all memory requests, even for “innocent bystanders”.
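To make locality of reference concrete, here is a minimal sketch (my own illustration, not from any particular product) that sums the same array twice: once in row-major order, where each cache line fetched from RAM is fully used before eviction, and once in column-major order, where nearly every access misses the cache. The dimensions are arbitrary assumptions; pick sizes larger than the last-level cache to see the effect.

    // Locality sketch: the same work with good vs. poor cache locality.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    constexpr size_t kRows = 4096;   // arbitrary; ~64 MB total exceeds typical caches
    constexpr size_t kCols = 4096;

    // Row-major traversal touches consecutive addresses, so each cache line
    // brought in from RAM is fully consumed before it is evicted.
    long long SumRowMajor(const std::vector<int>& a) {
        long long sum = 0;
        for (size_t r = 0; r < kRows; ++r)
            for (size_t c = 0; c < kCols; ++c)
                sum += a[r * kCols + c];
        return sum;
    }

    // Column-major traversal strides kCols * sizeof(int) bytes per access,
    // so nearly every access misses the cache and waits on RAM.
    long long SumColMajor(const std::vector<int>& a) {
        long long sum = 0;
        for (size_t c = 0; c < kCols; ++c)
            for (size_t r = 0; r < kRows; ++r)
                sum += a[r * kCols + c];
        return sum;
    }

    int main() {
        std::vector<int> a(kRows * kCols, 1);
        auto run = [&](long long (*fn)(const std::vector<int>&), const char* name) {
            auto t0 = std::chrono::steady_clock::now();
            long long s = fn(a);
            long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                               std::chrono::steady_clock::now() - t0).count();
            std::printf("%-12s sum=%lld, %lld ms\n", name, s, ms);
        };
        run(SumRowMajor, "row-major");
        run(SumColMajor, "column-major");
    }

On typical hardware the column-major pass runs several times slower, even with a single thread and no locks; the only difference is how well the traversal order cooperates with the cache.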

The following are some of the common pitfalls that can hurt overall performance:

· Using too many threads and doing frequent updates to shared data.  This results in a high number of context switches due to lock collisions when several threads try to update the protected data.  It also reduces cache effectiveness, because each thread’s data is seldom in the cache long enough before other threads push it out.  A minimal sketch of this pitfall, and a common fix, follows the list.
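Below is that sketch, assuming a hypothetical counting workload (the thread and iteration counts are arbitrary). In the first version, every increment takes the same lock, so the threads constantly collide, block, and context-switch. In the second, each thread accumulates into a private local and touches shared memory only once, at the end.

    // Contention sketch: shared counter under one lock vs. per-thread counters.
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    constexpr int kThreads = 16;                 // deliberately more threads than most machines have cores
    constexpr long long kIterations = 1000000;   // arbitrary workload size

    // Anti-pattern: every increment contends for the same lock, producing
    // lock collisions and a high rate of context switches.
    long long SharedCounter() {
        long long counter = 0;
        std::mutex m;
        std::vector<std::thread> pool;
        for (int t = 0; t < kThreads; ++t)
            pool.emplace_back([&] {
                for (long long i = 0; i < kIterations; ++i) {
                    std::lock_guard<std::mutex> guard(m);
                    ++counter;
                }
            });
        for (auto& th : pool) th.join();
        return counter;
    }

    // Better: each thread updates private state that stays in its own cache;
    // shared memory is written once per thread instead of once per increment.
    long long PrivateCounters() {
        std::vector<long long> partial(kThreads, 0);
        std::vector<std::thread> pool;
        for (int t = 0; t < kThreads; ++t)
            pool.emplace_back([&partial, t] {
                long long local = 0;             // no lock, no sharing
                for (long long i = 0; i < kIterations; ++i) ++local;
                partial[t] = local;              // one shared write per thread
            });
        for (auto& th : pool) th.join();
        long long total = 0;
        for (long long p : partial) total += p;
        return total;
    }

    int main() {
        std::printf("shared lock: %lld\n", SharedCounter());
        std::printf("private:     %lld\n", PrivateCounters());
    }

Both versions compute the same total, but the private-counter version avoids the lock collisions entirely. The general principle: keep data thread-private as long as possible and combine results as late as possible.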