Java Performance Tuning, Second Edition

By Jack Shirazi

Table of Contents

Chapter 1: Introduction
The trouble with doing something right the first time is that nobody appreciates how difficult it was.
—Fortune
There is a general perception that Java programs are slow. Part of this perception is pure assumption: many people assume that if a program is not compiled, it must be slow. Part of this perception is based in reality: many early applets and applications were slow, because of nonoptimal coding, initially unoptimized Java Virtual Machines (VMs), and the overhead of the language.
In earlier versions of Java, you had to struggle hard and compromise a lot to make a Java application run quickly. More recently, there have been fewer reasons why an application should be slow. The VM technology and Java development tools have progressed to the point where a Java application (or applet, servlet, etc.) is not particularly handicapped. With good designs and by following good coding practices and avoiding bottlenecks, applications usually run fast enough. However, the truth is that the first (and even several subsequent) versions of a program written in any language are often slower than expected, and the reasons for this lack of performance are not always clear to the developer.
This book shows you why a particular Java application might be running slower than expected, and suggests ways to avoid or overcome these pitfalls and improve the performance of your application. In this book I've gathered several years of tuning experiences in one place. I hope you will find it useful in making your Java application, applet, servlet, and component run as fast as you need.
Throughout the book I use the generic words "application" and "program" to cover Java applications, applets, servlets, beans, libraries, and really any use of Java code. Where a technique can be applied only to some subset of these various types of Java programs, I say so. Otherwise, the technique applies across all types of Java programs.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Why Is It Slow?
This question is always asked as soon as the first tests are timed: "Where is the time going? I did not expect it to take this long." Well, the short answer is that it's slow because it has not been performance-tuned. In the same way the first version of the code is likely to have bugs that need fixing, it is also rarely as fast as it can be. Fortunately, performance tuning is usually easier than debugging. When debugging, you have to fix bugs throughout the code; in performance tuning, you can focus your effort on the few parts of the application that are the bottlenecks.
The longer answer? Well, it's true that there is overhead in the Java runtime system, mainly due to its virtual machine layer that abstracts Java away from the underlying hardware. It's also true that there is overhead from Java's dynamic nature. These overheads can cause a Java application to run slower than an equivalent application written in a lower-level language (just as a C program is generally slower than the equivalent program written in assembler). Java's advantages (namely, its platform-independence, memory management, powerful exception checking, built-in multithreading, dynamic resource loading, and security checks) add costs in terms of an interpreter, garbage collector, thread monitors, repeated disk and network accessing, and extra runtime checks.
For example, hierarchical method invocation requires an extra computation for every method call because the runtime system has to work out which of the possible methods in the hierarchy is the actual target of the call. Most modern CPUs are designed to be optimized for fixed call and branch targets and do not perform as well when a significant percentage of calls need to be computed on the fly. On the other hand, good object-oriented design actually encourages many small methods and significant polymorphism in the method hierarchy. Compiler inlining is another frequently used technique that can significantly improve compiled code. However, this technique cannot be applied when it is too difficult to determine method calls at compile time, as is the case for many Java methods.
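The difference between a computed call target and a fixed one can be illustrated with a small sketch. The class and method names here (DispatchDemo, Shape, totalArea, squareArea) are hypothetical and chosen only for the example; the code shows where the two kinds of call arise, not how any particular VM or compiler treats them.

```java
class DispatchDemo {
  abstract static class Shape {
    abstract double area();
  }

  static final class Circle extends Shape {
    private final double radius;
    Circle(double radius) { this.radius = radius; }
    @Override double area() { return Math.PI * radius * radius; }
  }

  static final class Square extends Shape {
    private final double side;
    Square(double side) { this.side = side; }
    @Override double area() { return side * side; }
  }

  // The target of shape.area() depends on the runtime class of each
  // element, so it has to be worked out for every call.
  static double totalArea(Shape[] shapes) {
    double total = 0;
    for (Shape shape : shapes) {
      total += shape.area();
    }
    return total;
  }

  // A static method has exactly one possible target, so this is a
  // fixed call of the kind that is a candidate for inlining.
  static double squareArea(double side) {
    return side * side;
  }

  public static void main(String[] args) {
    Shape[] shapes = { new Circle(1.0), new Square(2.0), new Circle(0.5) };
    System.out.println("Total area: " + totalArea(shapes));
    System.out.println("Square area: " + squareArea(2.0));
  }
}
```

The loop in totalArea is typical of well-factored object-oriented code, which is why the cost of computing call targets matters in practice.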
The Tuning Game
Performance tuning is similar to playing a strategy game (but happily, you are usually paid to do it!). Your target is to get a better score (lower time) than the last score after each attempt. You are playing with, not against, the computer, the programmer, the design and architecture, the compiler, and the flow of control. Your opponents are time, competing applications, budgetary restrictions, etc. (You can complete this list better than I can for your particular situation.)
I once worked with a customer who wanted to know if there was a "go faster" switch somewhere that he could just turn on and make the whole application go faster. Of course, he was not really expecting one, but checked just in case he had missed a basic option somewhere.
There is no such switch, but very simple techniques sometimes provide the equivalent. Techniques include switching compilers, turning on optimizations, using a different runtime VM, finding two or three bottlenecks in the code or architecture that have simple fixes, and so on. I have seen all of these yield huge improvements to applications, sometimes a 20-fold speedup. Order-of-magnitude speedups are typical for the first round of performance tuning.
System Limitations and What to Tune
Three resources limit all applications:
  • CPU speed and availability
  • System memory
  • Disk (and network) input/output (I/O)
When tuning an application, the first step is to determine which of these is causing your application to run too slowly.
If your application is CPU-bound, you need to concentrate your efforts on the code, looking for bottlenecks, inefficient algorithms, too many short-lived objects (object creation and garbage collection are CPU-intensive operations), and other problems, which I will cover in this book.
If your application is hitting system-memory limits, it may be paging sections in and out of main memory. In this case, the problem may be caused by too many objects, or even just a few large objects, being erroneously held in memory; by too many large arrays being allocated (frequently used in buffered applications); or by the design of the application, which may need to be reexamined to reduce its running memory footprint.
On the other hand, external data access or writing to the disk can be slowing your application. In this case, you need to look at exactly what you are doing to the disks that is slowing the application: first identify the operations, then determine the problems, and finally eliminate or change these to improve the situation.
For example, one program I know of went through web server logs and did reverse lookups on the IP addresses. The first version of this program was very slow. A simple analysis of the activity being performed determined that the major time component of the reverse lookup operation was a network query. These network queries do not have to be done sequentially. Consequently, the second version of the program simply multithreaded the lookups to work in parallel, making multiple network queries simultaneously, and was much, much faster.
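The second version described above might look something like the following sketch. It is not the program from the anecdote: the class name ParallelLookup, the pool size, and the pluggable lookup function are assumptions made for illustration. A simulated lookup that sleeps for 100 milliseconds stands in for the network query so that the example runs without a network; in real use the function could wrap a call such as InetAddress.getByName(address).getHostName().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelLookup {

  // Resolve every address on a fixed pool of worker threads so that the
  // slow queries overlap instead of running one after another.
  static Map<String, String> resolveAll(List<String> addresses,
                                        Function<String, String> lookup,
                                        int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> pending = new ArrayList<>();
      for (String address : addresses) {
        Callable<String> task = () -> lookup.apply(address);
        pending.add(pool.submit(task));
      }
      // Collect the results in the original order.
      Map<String, String> results = new LinkedHashMap<>();
      for (int i = 0; i < addresses.size(); i++) {
        results.put(addresses.get(i), pending.get(i).get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for lookups", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Lookup failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  // Stand-in for a reverse lookup: waits 100 ms and returns a made-up name.
  static String simulatedLookup(String address) {
    try {
      Thread.sleep(100);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return "host-" + address;
  }

  public static void main(String[] args) {
    List<String> addresses = new ArrayList<>();
    for (int i = 1; i <= 20; i++) {
      addresses.add("10.0.0." + i);
    }
    long start = System.currentTimeMillis();
    Map<String, String> names =
        resolveAll(addresses, ParallelLookup::simulatedLookup, 20);
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(names.size() + " lookups in " + elapsed + " millis");
  }
}
```

Run sequentially, twenty simulated lookups would need at least two seconds; with one thread per lookup the waits overlap. The right pool size for real network queries depends on the resolver and the network, and is best found by measurement.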
In this book we look at the causes of bad performance. Identifying the causes of your performance problems is an essential first step to solving those problems. There is no point in extensively tuning the disk-accessing component of an application because we all know that "disk access is much slower than memory access" when, in fact, the application is CPU-bound.
A Tuning Strategy
Here's a strategy I have found works well when attacking performance problems:
  1. Identify the main bottlenecks (look for about the top five bottlenecks, but go higher or lower if you prefer).
  2. Choose the quickest and easiest one to fix, and address it (except for distributed applications where the top bottleneck is usually the one to attack: see the following paragraph).
  3. Repeat from Step 1.
This procedure gets your application tuned the quickest. The advantage of choosing the "quickest to fix" of the top few bottlenecks rather than the absolute topmost problem is that once a bottleneck has been eliminated, the characteristics of the application change, and the topmost bottleneck may not need to be addressed any longer. However, in distributed applications I advise you to target the topmost bottleneck. The characteristics of distributed applications are such that the main bottleneck is almost always the best to fix and, once fixed, the next main bottleneck is usually in a completely different component of the system.
Although this strategy is simple and actually quite obvious, I nevertheless find that I have to repeat it again and again: once programmers get the bit between their teeth, they just love to apply themselves to the interesting parts of the problems. After all, who wants to unroll loop after boring loop when there's a nice juicy caching technique you're eager to apply?
You should always treat the actual identification of the cause of the performance bottleneck as a science, not an art. The general procedure is straightforward:
  1. Measure the performance by using profilers and benchmark suites and by instrumenting code.
  2. Identify the locations of any bottlenecks.
  3. Think of a hypothesis for the cause of the bottleneck.
  4. Consider any factors that may refute your hypothesis.
  5. Create a test to isolate the factor identified by the hypothesis.
  6. Test the hypothesis.
  7. Alter the application to reduce the bottleneck.
  8. Test that the alteration improves performance, and measure the improvement (include regression-testing the affected code).
Perceived Performance
It is important to understand that the user has a particular view of performance that allows you to cut some corners. The user of an application sees changes as part of the performance. A browser that gives a running countdown of the amount left to be downloaded from a server is seen to be faster than one that just sits there, apparently hung, until all the data is downloaded. People expect to see something happening, and a good rule of thumb is that if an application is unresponsive for more than three seconds, it is seen as slow. Some Human Computer Interface authorities put the user patience limit at just two seconds; an IBM study from the early '70s suggested people's attention began to wander after waiting for more than just one second. For performance improvements, it is also useful to know that users are not generally aware of response time improvements of less than 20%. This means that when tuning for user perception, you should not deliver any changes to the users until you have made improvements that add more than a 20% speedup.
A few long response times make a bigger impression on the memory than many shorter ones. According to Arnold Allen, the perceived value of the average response time is not the average, but the 90th percentile value: the value that is greater than 90% of all observed response times. With a typical exponential distribution, the 90th percentile value is 2.3 times the average value. Consequently, as long as you reduce the variation in response times so that the 90th percentile value is smaller than before, you can actually increase the average response time, and the user will still perceive the application as faster. For this reason, you may want to target variation in response times as a primary goal. Unfortunately, this is one of the more complex targets in performance tuning: it can be difficult to determine exactly why response times are varying.
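The factor of 2.3 follows directly from the exponential distribution: with mean m, the fraction of values below x is 1 - exp(-x/m), so the 90th percentile is -m ln(1 - 0.9) = m ln 10, or about 2.3 times m. The sketch below computes that ratio and also shows one way to extract a 90th percentile from observed times. The class name, the nearest-rank percentile definition, and the sample data are assumptions made for the example.

```java
import java.util.Arrays;

class ResponsePercentile {

  // Ratio of the p-quantile to the mean for an exponential distribution:
  // solving 1 - exp(-x/mean) = p gives x/mean = -ln(1 - p).
  static double exponentialQuantileRatio(double p) {
    return -Math.log(1.0 - p);
  }

  // Nearest-rank percentile of observed response times, for p in (0, 1].
  static long percentile(long[] timesMillis, double p) {
    long[] sorted = timesMillis.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  static double mean(long[] timesMillis) {
    double sum = 0;
    for (long t : timesMillis) {
      sum += t;
    }
    return sum / timesMillis.length;
  }

  public static void main(String[] args) {
    System.out.println("90th percentile / mean (exponential): "
        + exponentialQuantileRatio(0.9));

    // Made-up response times with one slow outlier.
    long[] observed = {120, 150, 130, 140, 110, 160, 125, 135, 145, 900};
    System.out.println("Mean: " + mean(observed) + " millis");
    System.out.println("90th percentile: " + percentile(observed, 0.9) + " millis");
  }
}
```

Tracking the 90th percentile alongside the average makes it visible when a change reduces variation even though the average has not improved.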
If the interface provides feedback and allows the user to carry on other tasks or abort and start another function (preferably both), the user sees this as a responsive interface and doesn't consider the application as slow as he might otherwise. If you give users an expectancy of how long a particular task might take and why, they often accept this and adjust their expectations. Modern web browsers provide an excellent example of this strategy in practice. People realize that the browser is limited by the bandwidth of their connection to the Internet and that downloading cannot happen faster than a given speed. Good browsers always try to show the parts they have already received so that the user is not blocked, and they also allow the user to terminate downloading or go off to another page at any time, even while a page is partly downloaded. Generally, it is not the browser that is seen to be slow, but rather the Internet or the server site. In fact, browser creators have made a number of tradeoffs so that their browsers appear to run faster in a slow environment. I have measured browser display of identical pages under identical conditions and found browsers that are actually faster at full page display but seem slower because they do not display partial pages, download embedded links concurrently, and so on. Modern web browsers provide a good example of how to manage user expectations and perceptions of performance.
Starting to Tune
Before diving into the actual tuning, there are a number of considerations that will make your tuning phase run more smoothly and result in clearly achieved objectives.
Any application must meet the needs and expectations of its users, and a large part of those needs and expectations is performance. Before you start tuning, it is crucial to identify the target response times for as much of the system as possible. At the outset, you should agree with your users (directly if you have access to them, or otherwise through representative user profiles, market information, etc.) what the performance of the application is expected to be.
The performance should be specified for as many aspects of the system as possible, including:
  • Multiuser response times depending on the number of users (if applicable)
  • Systemwide throughput (e.g., number of transactions per minute for the system as a whole, or response times on a saturated network, again if applicable)
  • The maximum number of users, data, files, file sizes, objects, etc., the application supports
  • Any acceptable and expected degradation in performance between minimal, average, and extreme values of supported resources
Agree on target values and acceptable variances with the customer or potential users of the application (or whoever is responsible for performance) before starting to tune. Otherwise, you will not know where to target your effort, how far you need to go, whether particular performance targets are achievable at all, and how much tuning effort those targets may require. But most importantly, without agreed targets, whatever you achieve will tend to become the starting point.
The following scenario is not unusual: a manager sees horrendous performance, perhaps a function that was expected to be quick, but takes 100 seconds. His immediate response is, "Good grief, I expected this to take no more than 10 seconds." Then, after a quick round of tuning that identifies and removes a huge bottleneck, function time is down to 10 seconds. The manager's response is now, "Ah, that's more reasonable, but of course I actually meant to specify 3 seconds—I just never believed you could get down so far after seeing it take 100 seconds. Now you can start tuning." You do not want your initial achievement to go unrecognized (especially if money depends on it), and it is better to know at the outset what you need to reach. Agreeing on targets before tuning makes everything clear to everyone.
What to Measure
The main measurement is always wall-clock time. You should use this measurement to specify almost all benchmarks, as it's the real-time interval that is most appreciated by the user. (There are certain situations, however, in which system throughput might be considered more important than the wall-clock time, e.g., for servers, enterprise transaction systems, and batch or background systems.)
The obvious way to measure wall-clock time is to get a timestamp using System.currentTimeMillis( ) and then subtract this from a later timestamp to determine the elapsed time. This works well for elapsed time measurements that are not short. Other types of measurements have to be system-specific and often application-specific. You can measure:
  • CPU time (the time allocated on the CPU for a particular procedure)
  • The number of runnable processes waiting for the CPU (this gives you an idea of CPU contention)
  • Paging of processes
  • Memory sizes
  • Disk throughput
  • Disk scanning times
  • Network traffic, throughput, and latency
  • Transaction rates
  • Other system values
However, Java doesn't provide mechanisms for measuring these values directly, and measuring them requires at least some system knowledge, and usually some application-specific knowledge (e.g., what is a transaction for your application?).
You need to be careful when running tests with small differences in timings. The first test is usually slightly slower than any other tests. Try doubling the test run so that each test is run twice within the VM (e.g., rename main( ) to maintest( ), and call maintest( ) twice from a new main( )).
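A minimal sketch of the doubled test run, timed with System.currentTimeMillis( ), is shown here. The class name RepeatedTest and the string-building workload are hypothetical stand-ins for your own test; maintest( ) returns the elapsed time only so that the result can be inspected.

```java
class RepeatedTest {

  // The original entry point, renamed as suggested in the text.
  static long maintest(String[] args) {
    long start = System.currentTimeMillis();

    // Hypothetical workload standing in for the code under test.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 200000; i++) {
      sb.append(i % 10);
    }

    long elapsed = System.currentTimeMillis() - start;
    System.out.println("Run took " + elapsed + " millis ("
        + sb.length() + " chars built)");
    return elapsed;
  }

  public static void main(String[] args) {
    maintest(args); // first run in this VM, usually the slower one
    maintest(args); // second run in the same VM
  }
}
```

Comparing the two printed times shows how much of the first measurement is a one-off cost of starting up rather than a property of the code being tested.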
There are almost always small variations between test runs, so always use averages to measure differences and consider whether those differences are relevant by calculating the variance in the results.
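One way to put this into practice is sketched below: time the same workload several times, then report the mean together with the variance and standard deviation, so that a difference between two versions can be judged against the spread of the measurements. The class name TimingStats, the run count, and the workload are assumptions made for the example; the variance is the sample variance, which divides by n - 1.

```java
class TimingStats {

  static double mean(long[] samples) {
    double sum = 0;
    for (long s : samples) {
      sum += s;
    }
    return sum / samples.length;
  }

  // Sample variance (divides by n - 1); needs at least two samples.
  static double variance(long[] samples) {
    double m = mean(samples);
    double sumSquares = 0;
    for (long s : samples) {
      sumSquares += (s - m) * (s - m);
    }
    return sumSquares / (samples.length - 1);
  }

  // Hypothetical workload standing in for the test being measured.
  static long timeOnce() {
    long start = System.currentTimeMillis();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 200000; i++) {
      sb.append(i % 10);
    }
    return System.currentTimeMillis() - start;
  }

  public static void main(String[] args) {
    long[] runs = new long[10];
    for (int i = 0; i < runs.length; i++) {
      runs[i] = timeOnce();
    }
    System.out.println("Mean: " + mean(runs) + " millis");
    System.out.println("Variance: " + variance(runs));
    System.out.println("Standard deviation: " + Math.sqrt(variance(runs)) + " millis");
  }
}
```

If two versions differ by less than about one standard deviation, the difference is probably not meaningful without more runs.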
For distributed applications, you need to break down measurements into times spent on each component, times spent preparing data for transfer and from transfer (e.g., marshalling and unmarshalling objects and writing to and reading from a buffer), and times spent in network transfer. Each separate machine used on the networked system needs to be monitored during the test if any system parameters are to be included in the measurements. Timestamps must be synchronized across the system (this can be done by measuring offsets from one reference machine at the beginning of tests). Taking measurements consistently from distributed systems can be challenging, and it is often easier to focus on one machine, or one communication layer, at a time. This is usually sufficient for most tuning.
Don't Tune What You Don't Need to Tune
The most efficient tuning you can do is not to alter what works well. As they say, "If it ain't broke, don't fix it." This may seem obvious, but the temptation to tweak something just because you have thought of an improvement has a tendency to override this obvious statement.
The second most efficient tuning is to discard work that doesn't need doing. It is not at all uncommon for an application to be started with one set of specifications and to have some of the specifications change over time. Many times the initial specifications are much more generic than the final product. However, the earlier generic specifications often still have their stamps in the application. I frequently find routines, variables, objects, and subsystems that are still being maintained but are never used and never will be used because some critical aspect is no longer supported. These redundant parts of the application can usually be chopped without any bad consequences, often resulting in a performance gain.
In general, you need to ask yourself exactly what the application is doing and why. Then question whether it needs to do it in that way, or even if it needs to do it at all. If you have third-party products and tools being used by the application, consider exactly what they are doing. Try to be aware of the main resources they use (from their documentation). For example, a zippy DLL (shared library) that is speeding up all your network transfers is using some resources to achieve that speedup. You should know that it is allocating larger and larger buffers before you start trying to hunt down the source of your mysteriously disappearing memory. Then you can realize that you need to use the more complicated interface to the DLL that restricts resource usage rather than a simple and convenient interface. And you will have realized this before doing extensive (and useless) object profiling because you would have been trying to determine why your application is being a memory hog.
When benchmarking third-party components, you need to apply a good simulation of exactly how you will use those products. Determine characteristics from your benchmarks and put the numbers into your overall model to determine if performance can be reached. Be aware that vendor benchmarks are typically useless for a particular application. Break your application down into a hugely simplified version for a preliminary benchmark implementation to test third-party components. You should make a strong attempt to include all the scaling necessary so that you are benchmarking a fully scaled usage of the components, not some reduced version that reveals little about the components in full use.
Performance Checklist
  • Specify the required performance.
    • Ensure performance objectives are clear.
    • Specify target response times for as much of the system as possible.
    • Specify all variations in benchmarks, including expected response ranges (e.g., 80% of responses for X must fall within 3 seconds).
    • Include benchmarks for the full range of scaling expected (e.g., low to high numbers of users, data, files, file sizes, objects, etc.).
    • Specify and use a benchmark suite based on real user behavior. This is particularly important for multiuser benchmarks.
    • Agree on all target times with users, customers, managers, etc., before tuning.
  • Make your benchmarks long enough: over five seconds is a good target.
    • Use elapsed time (wall-clock time) for the primary time measurements.
    • Ensure the benchmark harness does not interfere with the performance of the application.
    • Run benchmarks before starting tuning, and again after each tuning exercise.
    • Take care that you are not measuring artificial situations, such as full caches containing exactly the data needed for the test.
  • Break down distributed application measurements into components, transfer layers, and network transfer times.
  • Tune systematically: understand what affects the performance; define targets; tune; monitor and redefine targets when necessary.
    • Approach tuning scientifically: measure performance; identify bottlenecks; hypothesize on causes; test hypothesis; make changes; measure improved performance.
    • Determine which resources are limiting performance: CPU, memory, or I/O.
    • Accurately identify the causes of the performance problems before trying to tune them.
    • Use the strategy of identifying the main bottlenecks, fixing the easiest, then repeating.
    • Don't tune what does not need tuning. Avoid "fixing" nonbottlenecked parts of the application.
    • Measure that the tuning exercise has improved speed.
    • Target one bottleneck at a time. The application running characteristics can change after each alteration.
    • Improve a CPU limitation with faster code, better algorithms, and fewer short-lived objects.
Chapter 2: Profiling Tools
If you only have a hammer, you tend to see every problem as a nail.
—Abraham Maslow
Before you can tune your application, you need tools that will help you find the bottlenecks in the code. I have used many different tools for performance tuning, and so far I have found the commercially available profilers to be the most useful. You can easily find several of these, together with reviews, by searching the Web using "java+optimi" and "java+profile" as your search term or by checking various computer magazines. I also maintain a list at http://www.JavaPerformanceTuning.com/resources.shtml. These tools are usually available free for an evaluation period, and you can quickly tell which you prefer using. If your budget covers it, it is worth getting several profilers: they often have complementary features and provide different details about the running code. I have included a list of profilers in Chapter 19.
All profilers have some weaknesses, especially when you want to customize them to focus on particular aspects of the application. Another general problem with profilers is that they frequently fail to work in nonstandard environments. Nonstandard environments should be rare, considering Java's emphasis on standardization, but most profiling tools work at the VM level, and there is not currently a VM profiling standard, so incompatibilities do occur. Even if a VM profiling standard is finalized, I expect there will be some nonstandard VMs you may have to use, possibly a specialized VM of some sort—there are already many of these.
When tuning, I normally use one of the commercial profiling tools, and on occasion when the tools do not meet my needs, I fall back on a variation of one of the custom tools and information-extraction methods presented in this chapter. Where a particular VM offers extra APIs that tell you about some running characteristics of your application, these custom tools are essential to access those extra APIs. Using a professional profiler and the proprietary tools covered in this chapter, you will have enough information to figure out where problems lie and how to resolve them. When necessary, you can successfully tune without a professional profiler, as the Sun VM contains a basic profiler, which I cover in this chapter. However, this option is not ideal for the most rapid tuning.
Measurements and Timings
When looking at timings, be aware that different tools affect the performance of applications in different ways. Any profiler slows down the application it is profiling. The degree of slowdown can vary from a few percent to a few hundred percent. Using System.currentTimeMillis( ) in the code to get timestamps is the only reliable way to determine the time taken by each part of the application. In addition, System.currentTimeMillis( ) is quick and has no effect on application timing (as long as you are not measuring too many intervals or ridiculously short intervals; see the discussion in Section 1.7).
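As an illustration of timing parts of an application with timestamps, the sketch below accumulates elapsed wall-clock time per named section. The SectionTimer class and the section names are hypothetical; only System.currentTimeMillis( ) comes from the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SectionTimer {
  private final Map<String, Long> totals = new LinkedHashMap<>();

  // Run the task and add its elapsed wall-clock time to the named section.
  void time(String section, Runnable task) {
    long start = System.currentTimeMillis();
    try {
      task.run();
    } finally {
      long elapsed = System.currentTimeMillis() - start;
      totals.merge(section, elapsed, Long::sum);
    }
  }

  // Total milliseconds recorded so far for a section (0 if never timed).
  long total(String section) {
    return totals.getOrDefault(section, 0L);
  }

  void report() {
    for (Map.Entry<String, Long> entry : totals.entrySet()) {
      System.out.println(entry.getKey() + ": " + entry.getValue() + " millis");
    }
  }

  public static void main(String[] args) {
    SectionTimer timer = new SectionTimer();

    // Two made-up sections standing in for parts of a real application.
    timer.time("build", () -> {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < 500000; i++) {
        sb.append(i % 10);
      }
    });
    timer.time("wait", () -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    timer.report();
  }
}
```

Keeping the timed sections coarse keeps the bookkeeping itself from distorting the results, in line with the caveat about measuring very short intervals.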
Another variation on timing the application depends on the underlying operating system. The operating system can allocate different priorities for different processes, and these priorities determine the importance the operating system applies to a particular process. This in turn affects the amount of CPU time allocated to a particular process compared to other processes. Furthermore, these priorities can change over the lifetime of the process. It is usual for server operating systems to gradually decrease the priority of a process over that process's lifetime. This means that the process has shorter periods of the CPU allocated to it before it is put back in the runnable queue. An adaptive VM (like Sun's HotSpot) can give you the reverse situation, speeding up code shortly after it has started running (see Section 3.7).
Whether or not a process runs in the foreground can also be important. For example, on a machine with the workstation version of Windows (most varieties including NT, 95, 98, and 2000), foreground processes are given maximum priority. This ensures that the window currently being worked on is maximally responsive. However, if you start a test and then put it in the background so that you can do something else while it runs, the measured times can be very different from the results you would get if you left that test running in the foreground. This applies even if you do not actually do anything else while the test is running in the background. Similarly, on server machines, certain processes may be allocated maximum priority (for example, Windows NT and 2000 server version, as well as most Unix server configured machines, allocate maximum priority to network I/O processes).
Garbage Collection
The Java runtime system normally includes a garbage collector. Some of the commercial profilers provide statistics showing what the garbage collector is doing. You can also use the -verbosegc option with the VM. This option prints out time and space values for objects reclaimed and space recycled as the reclamations occur. The 1.4 VM introduced an additional option to log the output to a file instead of standard error: the -Xloggc:<file> option. Printing directly to a file is slightly more efficient than redirecting the VM output to a file because the direct file write buffering is slightly more efficient than the piped redirect buffering. The printout includes explicit synchronous calls to the garbage collector (using System.gc( )) as well as asynchronous executions of the garbage collector, as occurs in normal operation when free memory available to the VM gets low.
System.gc( ) does not necessarily force a synchronous garbage collection. Instead, the gc( ) call is really a hint to the runtime that now is a good time to run the garbage collector. The runtime decides whether to execute the garbage collection at that time and what type of garbage collection to run. In more recent VMs, the effects of calling System.gc( ) can be completely disabled using the runtime flag -XX:+DisableExplicitGC.
It is worth looking at some output from running with -verbosegc. The following code fragment creates lots of objects to force the garbage collector to work, and also includes some synchronous calls to the garbage collector:
package tuning.gc;

public class Test {
  public static void main(String[] args)
  {
    int SIZE = 4000;
    StringBuffer s;
    java.util.Vector v;

    //Create some objects so that the garbage collector
    //has something to do
    for (int i = 0; i < SIZE; i++)
    {
      s = new StringBuffer(50);
      v = new java.util.Vector(30);
      s.append(i).append(i+1).append(i+2).append(i+3);
    }
    //null the variables so that the last objects created can be collected too
    s = null;
    v = null;
    System.out.println("Starting explicit garbage collection");
    long time = System.currentTimeMillis();
    System.gc();
    System.out.println("Garbage collection took " +
      (System.currentTimeMillis()-time) + " millis");

    int[] arr = new int[SIZE*10];
    //null the variable so that the array can be garbage collected
    arr = null;
    System.out.println("Starting explicit garbage collection");
    //start timing immediately before the call, as in the first measurement
    time = System.currentTimeMillis();
    System.gc();
    System.out.println("Garbage collection took " +
      (System.currentTimeMillis()-time) + " millis");
  }
}
Method Calls
Most profiling tools provide a profile of method calls, showing where the bottlenecks in your code are and helping you decide where to target your efforts. By showing which methods and lines take the most time, a good profiling tool can quickly pinpoint bottlenecks.
Most method profilers work by sampling the call stack at regular intervals and recording the methods on the stack. This regular snapshot identifies the method currently being executed (the method at the top of the stack) and all the methods below, to the program's entry point. By accumulating the number of hits on each method, the resulting profile usually identifies where the program is spending most of its time. This profiling technique assumes that the sampled methods are representative, i.e., if 10% of stacks sampled show method foo( ) at the top of the stack, then the assumption is that method foo( ) takes 10% of the running time. However, this is a sampling technique, so it is not foolproof: methods can be missed altogether or have their weighting misrecorded if some of their execution calls are missed. But usually only the shortest tests are skewed. Any reasonably long test (i.e., seconds rather than milliseconds) normally gives correct results.
This sampling technique can be difficult to get right. It is not enough to simply sample the stack. The profiler must also ensure that it has a coherent stack state, so the call must be synchronized across stack activities, possibly by temporarily stopping the thread. The profiler also needs to make sure that multiple threads are treated consistently and that the timing involved in its activities is accounted for without distorting the regular sample time. Also, too short a sample interval causes the program to become extremely slow, while too long an interval results in many method calls being missed and misrepresentative profile results being generated.
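To make the sampling idea concrete, here is a minimal sketch of a sampler that records only the method at the top of one thread's stack. It is an illustration under stated assumptions, not how any particular profiler is implemented: the class name StackSampler and its methods are invented for this example, and it relies on Thread.getStackTrace( ), which was added in Java 5 and so is not available in the 1.4 and earlier VMs discussed here. It leaves the consistency of the snapshot to the VM and makes no attempt to account for its own overhead.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sampler: counts how often each method is seen at the top of a stack. */
final class StackSampler {
    private final Map<String, Integer> hits = new HashMap<String, Integer>();
    private int samples;

    /** Take one snapshot of the target thread and record the executing method. */
    synchronized void sample(Thread target) {
        StackTraceElement[] stack = target.getStackTrace(); // the VM builds the snapshot
        if (stack.length == 0) {
            return; // thread not yet running, or already finished
        }
        String top = stack[0].getClassName() + "." + stack[0].getMethodName();
        Integer old = hits.get(top);
        hits.put(top, old == null ? 1 : old + 1);
        samples++;
    }

    /** Sample the target up to count times, sleeping intervalMillis between samples. */
    void profile(Thread target, int count, long intervalMillis) {
        try {
            for (int i = 0; i < count && target.isAlive(); i++) {
                sample(target);
                Thread.sleep(intervalMillis);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop sampling, keep the interrupt flag
        }
    }

    synchronized int sampleCount() {
        return samples;
    }

    /** Share of samples (0 to 100) in which the named method was executing. */
    synchronized double percent(String method) {
        Integer n = hits.get(method);
        return (n == null || samples == 0) ? 0.0 : 100.0 * n / samples;
    }

    /** Copy of the raw hit counts, keyed by "class.method". */
    synchronized Map<String, Integer> hits() {
        return new HashMap<String, Integer>(hits);
    }
}
```

A caller would start the work on its own thread, then run something like new StackSampler( ).profile(worker, 1000, 10) from another thread. The interval argument is where the trade-off described above shows up: a very small value slows the sampled program, and a large one misses short method calls.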
The JDK comes with a minimal profiler, obtained by running a program using the java executable with the -Xrunhprof option (
Object-Creation Profiling
Unfortunately, the SDK provides only very rudimentary object-creation statistics. Most profile tool vendors provide much better object-creation statistics, determining object numbers and identifying where particular objects are created in the code. My recommendation is to use a better (probably commercial) tool in place of the SDK profiler.
The Heap Analysis Tool, which can analyze the default profiling mode with Java 2, provides a little more information from the profiler output, but if you are relying on this, profiling object creation will require a lot of effort. To use this tool, you must use the binary output option:
% java -Xrunhprof:format=b <classname>
         
I have used an alternate trick when a reasonable profiler is unavailable, cannot be used, or does not provide precisely the detail I need. This technique is to alter the java.lang.Object class to catch most nonarray object-creation calls. This is not a supported feature, but it does seem to work on most systems because all constructors chain up to the Object class's constructor, and any explicitly created nonarray object calls the constructor in Object as its first execution point after the VM allocates the object on the heap. Objects that are created implicitly with a call to clone( ) or by deserialization do not call the Object class's constructor, and so are missed when using this technique.
Under the terms of the license granted by Sun, it is not possible to include or list an altered Object class with this book. But I can show you the simple changes to make to the java.lang.Object class to track object creation.
The change requires adding a line in the Object constructor to pass this to some object-creation monitor you are using. java.lang.Object does not have an explicitly defined constructor (it uses the default empty constructor), so you need to add one to the source and recompile. For any class other than Object, that is all you need to do. But there is an added problem in that Object does not have a superclass, and the compiler has a problem with this: the compiler cannot handle an explicit
Monitoring Gross Memory Usage
The JDK provides two methods for monitoring the amount of memory used by the runtime system: freeMemory( ) and totalMemory( ) in the java.lang.Runtime class.
totalMemory( ) returns a long, which is the number of bytes currently allocated to the runtime system for this particular VM process. Within this memory allocation, the VM manages its objects and data. Some of this allocated memory is held in reserve for creating new objects. When the currently allocated memory gets filled and the garbage collector cannot reclaim enough space, the VM requests more memory from the underlying system. If the underlying system cannot allocate any further memory, an OutOfMemoryError is thrown. Total memory can go up and down; some Java runtimes return sections of unused memory to the underlying system while still running.
freeMemory( ) returns a long, which is the number of bytes available to the VM to create objects from the section of memory it controls (i.e., memory already allocated to the runtime by the underlying system). The free memory increases when a garbage collection successfully reclaims space used by dead objects, and also increases when the Java runtime requests more memory from the underlying operating system. The free memory reduces each time an object is created and when the runtime returns memory to the underlying system.
SDK 1.4 added a new method, Runtime.maxMemory( ). This method simply gives the -Xmx value, and is of no use for monitoring heap usage.
It can be useful to monitor memory usage while an application runs: you can get a good feel for the hotspots of your application. You may be surprised to see steady decrements in the free memory available to your application when you were not expecting any change. This can occur when you continuously generate temporary objects from some routine; manipulating graphical elements frequently shows this behavior.
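One simple way to watch these values while an application runs is a background thread that prints them at intervals. The following is a minimal sketch only: the class name MemoryMonitor, the thread name, and the output format are assumptions made for this example, and the two Runtime calls are not taken atomically, so each figure is approximate.

```java
/** Hypothetical helper that reports Runtime.totalMemory() and Runtime.freeMemory(). */
final class MemoryMonitor {

    /** Bytes in use: memory allocated to the VM minus the part that is still free. */
    static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One line of output in the form "total=... free=... used=..." (all in bytes). */
    static String report() {
        Runtime rt = Runtime.getRuntime();
        long total = rt.totalMemory();
        long free = rt.freeMemory();
        return "total=" + total + " free=" + free + " used=" + (total - free);
    }

    /** Start a daemon thread that prints a report every intervalMillis milliseconds. */
    static Thread start(final long intervalMillis) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        System.out.println(report());
                        Thread.sleep(intervalMillis);
                    }
                } catch (InterruptedException e) {
                    // interrupted: stop monitoring
                }
            }
        }, "memory-monitor");
        t.setDaemon(true); // do not keep the VM alive just for monitoring
        t.start();
        return t;
    }
}
```

Calling MemoryMonitor.start(1000) at the beginning of main( ) prints one line per second. A "used" figure that climbs steadily between garbage collections, in code you expected to be idle, is the kind of unexpected temporary-object generation described above.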
Monitoring memory with freeMemory( ) and totalMemory( )
Client/Server Communications
To tune client/server or distributed applications, you need to identify all communications that occur during execution. The most important factors to look for are the number of transfers of incoming and outgoing data and the amount of data transferred. These elements affect performance the most. Generally, if the amount of data per transfer is less than about one kilobyte, the number of transfers is the factor that limits performance. If the amount of data being transferred is more than about a third of the network's capacity, the amount of data is the factor limiting performance. Between these two endpoints, either the amount of data or the number of transfers can limit performance, although in general, the number of transfers is more likely to be the problem.
As an example, web surfing with a browser typically hits both problems at different times. A complex page with elements from multiple sites can take longer to display completely than one simple page with 10 times more data. Many different sites are involved in displaying the complex page; each site must have its server name converted to an IP address, which can take many network transfers. Each site then needs to be connected to and downloaded from. The simple page needs only one name lookup and one connection, and this can make a huge difference. On the other hand, if the amount of data is large compared to the connection bandwidth (the speed of the Internet connection at the slowest link between your client and the server machine), the limiting factor is bandwidth, so the complex page may display more quickly than the simple page.
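The two quantities that matter here, the number of transfers and the amount of data, can be counted inside the application by wrapping the stream that writes to the connection. The sketch below is one way to do that, under stated assumptions: the class name CountingOutputStream is invented for this example, and it treats each write call as one transfer, which only approximates what reaches the network, because buffering layers and the operating system may merge or split writes.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical wrapper that counts write calls and bytes passed to another stream. */
final class CountingOutputStream extends FilterOutputStream {
    private long transfers; // number of write calls seen
    private long bytes;     // total bytes written

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        transfers++;
        bytes++;
    }

    // FilterOutputStream.write(byte[]) delegates to this method,
    // so whole-array writes are counted here exactly once.
    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        out.write(buf, off, len);
        transfers++;
        bytes += len;
    }

    long transfers() {
        return transfers;
    }

    long bytes() {
        return bytes;
    }
}
```

Wrapping a socket's output stream, for example new CountingOutputStream(socket.getOutputStream( )), and logging both counters at the end of a test run shows whether the application sends many small messages or a few large ones. A matching wrapper around the input stream would cover incoming data.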
Several generic tools are available for monitoring communication traffic, all aimed at system and network administrators (and quite expensive). I know of no general-purpose profiling tool targeted at application-level communications monitoring; normally, developers put their own monitoring capabilities into the application or use the trace mode in their third-party communications package, if they use one. (snoop,
Performance Checklist
  • Use system- and network-level monitoring utilities to assist when measuring performance.
  • Run tests on unloaded systems with the test running in the foreground.
    • Use System.currentTimeMillis( ) to get timestamps if you need to determine absolute times. Never use the timings obtained from a profiler as absolute times.
    • Account for performance effects of any caches.
  • Get better profiling tools. The better your tools, the faster and more effective your tuning.
    • Pinpoint the bottlenecks in the application: with profilers, by instrumenting code (putting in explicit timing statements), and by analyzing the code.
    • Target the top five to ten methods, and choose the quickest to fix.
    • Speed up the bottleneck methods that can be fixed the quickest.
    • Improve the method directly when the method takes a significant percentage of time and is not called too often.
    • Reduce the number of times a method is called when the method takes a significant percentage of time and is also called frequently.
  • Use an object-creation profiler together with garbage-collection statistics to determine which objects are created in large numbers and which large objects are created.
    • See if the garbage collector executes more often than you expect.
    • Determine the percentage of time spent in garbage collection and reduce that if over 15% (target 5% ideally).
    • Use the Runtime.totalMemory( ) and Runtime.freeMemory( ) methods to monitor gross memory usage.
  • Check whether your communication layer has built-in tracing features.
    • Check whether your communication layer supports the addition of customized layers.
  • Identify the number of incoming and outgoing transfers and the amounts of data transferred in distributed applications.
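For the checklist item on the percentage of time spent in garbage collection, a rough figure can be read from inside the VM. The sketch below assumes a Java 5 or later VM, since it uses the java.lang.management API, which is newer than the VMs covered here. The class name GcCost is invented for this example. The collection times the VM reports are approximate, and for collectors that do part of their work concurrently, not all of the reported time is time the application was stopped.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper that estimates the share of elapsed time spent in GC. */
final class GcCost {

    /** Total milliseconds the VM reports having spent in garbage collection so far. */
    static long gcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** GC time as a percentage of elapsed time; both arguments are in milliseconds. */
    static double gcPercent(long gcMillis, long elapsedMillis) {
        return elapsedMillis <= 0 ? 0.0 : 100.0 * gcMillis / elapsedMillis;
    }

    /** Percentage of the VM's uptime so far that was spent in garbage collection. */
    static double gcPercentSinceStart() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return gcPercent(gcMillis(), uptime);
    }
}
```

Printing GcCost.gcPercentSinceStart( ) at the end of a benchmark run gives a number to compare against the 15% and 5% figures in the checklist. Taking gcMillis( ) before and after a specific phase and passing the difference to gcPercent( ) isolates that phase.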
Chapter 3: Underlying JDK Improvements
Throughout the progressive versions of Java, improvements have been made at all levels of the runtime system: in the garbage collector, in the code, in the VM handling of objects and threads, and in compiler optimizations. It is always worthwhile to check your own application benchmarks against each version (and each vendor's version) of the Java system you try out. Any differences in performance need to be identified and explained; if you can determine that a compiler from one version (or vendor) together with the runtime from another version (or vendor) speeds up your application, you may have the option of choosing the best of both worlds. Standard Java benchmarks tend to be of limited use in deciding which VMs provide the best performance for your application. You are always better off creating your own application benchmark suite for deciding which VM and compiler best suit your application.
The following sections identify some points to consider as you investigate different VMs, compilers, and JDK classes. If you control the target Java runtime environment, i.e., with servlet and other server applications, more options are available to you. We will look at these extra options too.
Garbage Collection
The effects of the garbage collector can be difficult to determine accurately. It is worth including some tests in your performance benchmark suite that are specifically arranged to identify these effects. You can do this only in a general way, since the garbage collector is normally not under your control. (Sun does intend to introduce an API that will allow a pluggable garbage collector to replace the one delivered with the VM, but building your own garbage collector is not a realistic tuning option. Using a pluggable third-party garbage collector doesn't give you control over the garbage collector either.) The basic way to see what the garbage collector is up to is to run with the -verbosegc option. This prints out time and space values for objects reclaimed and space recycled. The printout includes explicit synchronous calls to the garbage collector (using System.gc( )) as well as asynchronous executions of the garbage collector, as occurs in normal operation when free memory available to the VM gets low. You can try to force the VM to execute only synchronous garbage collections by using the -noasyncgc option to the Java executable (no longer available from JDK 1.2). The -noasyncgc option does not actually stop the garbage-collector thread from executing; it still executes if the VM runs out of free memory (as opposed to just getting low on memory). Output from the garbage collector running with -verbosegc is detailed in Section 2.2.
The garbage collector usually works by freeing the memory that becomes available from objects that are no longer referenced or, if this does not free sufficient space, expanding the available memory space by asking the operating system for more memory (up to a maximum specified to the VM with the -Xmx/-mx option). The garbage collector's space-reclamation algorithm tends to change with each version of the JDK.
Sophisticated generational garbage collectors, which smooth out the impact of garbage collection, are now being used; HotSpot uses a state-of-the-art generational garbage collector. Analysis of object-oriented programs has shown that most objects are short-lived, fewer have medium lifespans, and very few objects are long-lived. Generational garbage collectors move objects through multiple spaces, each time copying live objects from one space to the next and reclaiming the space used by objects that are no longer alive. By concentrating on short-lived objects—the early spaces—and spending less time recycling space where older objects live, the garbage collector frees the maximum amount of space for the lowest impact.
Tuning the Heap
Heap size is important to Java application performance. Tuning the heap is a multistep process. First, we'll look at the big picture, with gross tuning steps that optimize the size of the heap, followed by advice for fine-tuning the heap. Next, we'll look at the impact of shared memory on tuning the heap.
Gross Tuning
The VMs provided by most vendors include the two main heap tuning parameters: -mx/-Xmx and -ms/-Xms. Respectively, these parameters set the maximum and starting sizes of the heap in bytes. They are typically available with every VM.
VMs vary as to whether they accept the -mx and -ms parameters, the -Xmx and -Xms parameters, or both. They also vary in whether they accept a space between the parameter and the number that follows it, and in whether they accept the shorthand notations K and M for kilobytes and megabytes, e.g., -Xmx32M. Check the documentation or simply try the various possibilities for your VM.
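When trying out the possibilities, it helps to have the VM report the heap sizes it actually applied. The following is a minimal sketch, assuming a 1.4 or later VM so that Runtime.maxMemory( ) is available. The class name HeapSettings is invented for this example, and some VMs report a maximum slightly below the value given on the command line.

```java
/** Hypothetical helper that prints the heap sizes the running VM is using. */
final class HeapSettings {

    /** Convert a byte count to whole megabytes, rounding down. */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Current and maximum heap sizes as reported by the Runtime class. */
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        return "current heap = " + toMegabytes(rt.totalMemory()) + "M, "
            + "maximum heap = " + toMegabytes(rt.maxMemory()) + "M";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it as, say, java -Xms16M -Xmx32M HeapSettings and comparing the output with the requested values shows quickly whether the VM accepted that spelling of the parameters.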
Tuning the heap with these two parameters requires trial and error, but is relatively simple. You don't need to consider the exact garbage-collection algorithm or how different parameters might affect each other. Instead, you can identify the cost of garbage collection to the application using the measurement techniques covered in Chapter 2. You can then simply alter the two parameters and remeasure using the same technique. Typically, you might want to use a range of values for the maximum heap size, keeping the starting heap size either absolutely constant (e.g., 1 megabyte) or relatively constant (e.g., half the maximum heap), and graph the result, looking for where garbage collection has the minimum cost.
Note that GC activity can take hours to settle into a regular pattern. If you are tuning a long-lived application, bear this in mind when looking at the GC output.
Gross heap tuning is fairly stable, in that moving to a different VM or tweaking the application usually won't invalidate the tuning parameters. They may no longer be the optimal sizes after such changes, but they should still be reasonable. The following sections describe some considerations for heap parameters.
The heap size should not become so large that physical memory is swamped and the system has to start paging. So keep the maximum heap size below the size of the physical memory (RAM) on the machine. Also, subtract from the RAM the amount of memory required for other processes that will be running at the same time as the Java application, and keep the maximum heap size below that value.
Fine-Tuning the Heap
In addition to the gross heap-tuning factors, a host of other parameters can be used for fine-tuning the VM heap. These other factors are usually strongly dependent on the garbage-collection algorithm being used by the VM, and the parameters vary for different VMs and different versions of VMs. In this section, I'll cover a few examples to give you an idea of the possibilities. Note that every VM and every version of the VM is different, and you need to retune the system with any change for this level of fine-tuning. Fine-tuning is probably worth doing only where every last microsecond is needed or for a really stable deployed system, i.e., one that needs no more development.
Note that the following sections refer to some of the internal heap areas used by the HotSpot generational garbage collector. Generally, the total VM heap consists of permanent space (Perm), old space (Old), young space (Young), and scratch or survivor space (Scratch). Parameters referring to "new" space (New), such as -XX:NewSize, refer to the combination Young+Scratch. The -Xmx parameter sizes Old+New. The full heap is Old+New+Perm.
Most garbage-collection algorithms do not immediately expand the heap if they need space to create more objects, even when the heap has not yet been expanded to the maximum allowable size. Instead, there is usually a series of attempts to free up space by reclaiming objects, compacting objects, defragmenting the heap, and so on. If all these attempts are exhausted, then the heap is expanded. Several GC algorithms try to keep some free space available so that temporary objects can be created and reclaimed without expanding the heap. For example, the Sun 1.3 VM allows the parameter -XX:MinHeapFreeRatio=num, where num is 0 to 100, to specify that the heap should be expanded if less than num% of the heap is free. Similarly, the -XX:MaxHeapFreeRatio parameter specifies when the heap should be contracted. The IBM VM uses -Xminf and -Xmaxf with decimal parameters between 0 and 1 (e.g., 20% is 20 for the Sun VM and 0.2 for the IBM VM). The Sun default is to try to keep the proportion of free space to living objects at each garbage collection within the 40%-70% range. That is, if less than 40% of the heap is free after a garbage collection (so more than 60% of the heap is full of objects), then the heap is expanded. Otherwise, the next garbage collection will likely occur sooner than desired. (IBM defaults are 0.3 min and 0.6 max.)
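The expand-or-contract rule described above can be written out as a small model. This is only a sketch of the rule as stated in the text, not the VM's actual code: the class name HeapResizeModel and its Action values are invented for this example, and a real VM also keeps the heap within the starting and maximum sizes it was given.

```java
/** Hypothetical model of the resize decision taken after a garbage collection. */
final class HeapResizeModel {

    enum Action { EXPAND, KEEP, CONTRACT }

    /**
     * Decide what to do given the heap size and the free space left after a
     * collection. minFreePercent and maxFreePercent are on the 0 to 100 scale.
     */
    static Action decide(long heapBytes, long freeBytes,
                         int minFreePercent, int maxFreePercent) {
        double freePercent = 100.0 * freeBytes / heapBytes;
        if (freePercent < minFreePercent) {
            return Action.EXPAND;   // too little free space: the next GC would come too soon
        }
        if (freePercent > maxFreePercent) {
            return Action.CONTRACT; // more free space than is worth holding on to
        }
        return Action.KEEP;
    }
}
```

With the 40% and 70% defaults quoted above, a 100 MB heap with 30 MB free after a collection gives EXPAND, 50 MB free gives KEEP, and 80 MB free gives CONTRACT.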
Sharing Memory
If you are running multiple VMs on the same machine, you have the option of sharing some of the memory between them. There is a proposal for VMs to share system memory automatically, and this is likely to happen in the future. But currently (as of the 1.4 release), if you want to share memory between VM processes, you need to run multiple pseudoprocesses within one VM process. The necessary techniques are actually quite complicated, as many subtle problems can arise when trying to run several applications in the same VM while keeping them independent of each other.
Fortunately, there is a free open source library called Echidna (available from http://www.javagroup.org/echidna/) that takes care of all the subtleties involved in running multiple applications independently within the same VM system process. The library also provides several management tools to help use Echidna effectively. If you want to know how Echidna works or need to use parts of the library within your project, I have written an article that covers the technology in some detail.
The shared-memory advantages from combining multiple applications into one VM are significant for applications with small memory requirements where the VM memory overhead is significant by comparison. But for applications that require large amounts of memory, there may be little benefit.
A shared-memory VM also provides a faster startup time, as the VM can already be running when the application is started. For example, a VM using the Echidna library can be a running system process with no Java application running (except for the Echidna library). The Echidna library can start any Java application in exactly the same way the VM would have started it, but without all the VM startup overhead.
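The basic mechanism behind starting an application inside an already running VM can be sketched with reflection. The code below is not Echidna's API, and the names InVmLauncher and EchoApp are invented for this example. It shows only the core step of invoking another class's main( ) on its own thread, and ignores the subtleties mentioned above, such as a launched application calling System.exit( ), static state shared between applications, and the shared System.in and System.out streams.

```java
import java.lang.reflect.Method;

/** Hypothetical launcher: runs another class's main() on a thread in this VM. */
final class InVmLauncher {

    /** Look up className's main method and invoke it on a new thread. */
    static Thread launch(final String className, final String[] args) throws Exception {
        final Method main = Class.forName(className).getMethod("main", String[].class);
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    // cast to Object so the array is passed as the single argument
                    main.invoke(null, (Object) args);
                } catch (Exception e) {
                    e.printStackTrace(); // a real launcher would report this to the caller
                }
            }
        }, className + "-main");
        t.start();
        return t;
    }
}

/** Tiny stand-in application used only to demonstrate the launcher. */
final class EchoApp {
    static volatile String lastArg;

    public static void main(String[] args) {
        lastArg = args.length > 0 ? args[0] : "";
    }
}
```

Because the VM is already running when launch( ) is called, the launched application avoids the VM startup cost, which is the benefit described above. Keeping such applications properly isolated from each other is the hard part, and this sketch does not attempt it.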
Replacing JDK Classes
It is possible for you to replace JDK classes directly. Unfortunately, you can't distribute these altered classes with any application or applet unless you have complete c