Wednesday, 9 July 2014

GC Policies

Introduction
Garbage collection (GC) is an integral part of the Java Virtual Machine (JVM) as it collects unused Java heap memory so that the application can continue allocating new objects. The effectiveness and performance of the GC play an important role in application performance and determinism. The IBM JVM provided with IBM WebSphere Application Server V8 (on supported platforms) provides four different GC policy algorithms:
·         -Xgcpolicy:optthruput
·         -Xgcpolicy:optavgpause
·         -Xgcpolicy:gencon
·         -Xgcpolicy:balanced
Each of these algorithms provides different performance and deterministic qualities. In addition, the default policy in WebSphere Application Server V8 has changed from -Xgcpolicy:optthruput to the -Xgcpolicy:gencon policy. Let’s take a look at each of these policies and see what this change in the default policy means.
The garbage collector
Different applications naturally have different memory usage patterns. A computationally intensive number crunching workload will not use the Java heap in the same way as a highly transactional customer-facing interface. To optimally handle these different sorts of workloads, different garbage collection strategies are required. The IBM JVM supports several garbage collection policies to enable you to choose the strategy that best fits your application.
The parallel mark-sweep-compact collector: optthruput
The simplest possible garbage collection technique is to continue allocating until free memory has been exhausted, then stop the application and process the entire heap. While this results in a very efficient garbage collector, it means that the user program must be able to tolerate the pauses introduced by the collector. Workloads that are only concerned about overall throughput might benefit from this strategy.
The optthruput policy (-Xgcpolicy:optthruput) implements this strategy (Figure 1). This collector uses a parallel mark-sweep algorithm. In a nutshell, this means that the collector first walks through the set of reachable objects, marking them as live data. A second pass then sweeps away the unmarked objects, leaving behind free memory that can be used for new allocations. The majority of this work can be done in parallel, so the collector uses additional threads (up to the number of CPUs by default) to get the job done faster, reducing the time the application remains paused.
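The two passes can be illustrated with a toy model. The sketch below is not the JVM's implementation: the Node class, its mark bit, and the list standing in for the heap are all hypothetical, and it runs on a single thread rather than in parallel.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy model of a stop-the-world mark-sweep pass; illustrative only. */
class MarkSweepSketch {

    /** Hypothetical heap object: a mark bit plus outgoing references. */
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        boolean marked;
        Node(String name) { this.name = name; }
    }

    /** Mark phase: walk everything reachable from the roots and flag it as live. */
    static void mark(List<Node> roots) {
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (n.marked) continue;          // already visited
            n.marked = true;
            pending.addAll(n.refs);
        }
    }

    /** Sweep phase: drop unmarked objects, reset marks on survivors, return the number freed. */
    static int sweep(List<Node> heap) {
        int before = heap.size();
        heap.removeIf(n -> !n.marked);
        heap.forEach(n -> n.marked = false);
        return before - heap.size();
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.refs.add(b);                       // a -> b is reachable from the root
        c.refs.add(d);                       // c and d only reference each other
        d.refs.add(c);
        List<Node> heap = new ArrayList<>(Arrays.asList(a, b, c, d));
        mark(Arrays.asList(a));
        System.out.println("freed " + sweep(heap) + ", live " + heap.size());
    }
}
```

Because marking starts from the roots, the unreachable cycle between c and d is reclaimed even though each object is still referenced by the other.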
Figure 1. Application and collector CPU usage: optthruput
The problem with a mark-sweep algorithm is that it can lead to fragmentation (Figure 2). There might be lots of free memory, but if it is in small slices interspersed with live objects then no individual piece might be large enough to satisfy a particular allocation.
The solution to this is compaction. In theory, the compactor slides all the live objects together to one end of the heap, leaving a single contiguous block of free space. This is an expensive operation because every live object might be moved, and every pointer to a moved object must be updated to the new location. As a result, compaction is generally only done when it appears to be necessary. Compaction can also be done in parallel, but it results in a less efficient packing of the live objects -- instead of a single block of free space, several smaller ones might be created.
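As a rough illustration of fragmentation and sliding compaction, the sketch below models the heap as an int array in which 0 is a free slot and any other value is a live object. This representation is an assumption made for the example, and it leaves out the expensive part the text mentions: updating every pointer to a moved object.

```java
/** Toy model of sliding compaction over an array of slots; illustrative only. */
class CompactionSketch {

    /** Length of the longest run of free (zero) slots. */
    static int largestFreeRun(int[] heap) {
        int best = 0, run = 0;
        for (int slot : heap) {
            run = (slot == 0) ? run + 1 : 0;
            best = Math.max(best, run);
        }
        return best;
    }

    /** Slide live (non-zero) entries to the front of the array, preserving their order. */
    static void compact(int[] heap) {
        int dst = 0;
        for (int src = 0; src < heap.length; src++) {
            if (heap[src] != 0) {
                int obj = heap[src];
                heap[src] = 0;
                heap[dst++] = obj;
            }
        }
    }

    public static void main(String[] args) {
        int[] heap = {1, 0, 2, 0, 0, 3, 0};  // four free slots, but split into small pieces
        System.out.println("largest free run before: " + largestFreeRun(heap));
        compact(heap);
        System.out.println("largest free run after:  " + largestFreeRun(heap));
    }
}
```

Before compaction the largest allocation that fits is two slots, even though four are free. Afterwards all four free slots form one contiguous block.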
Figure 2. Heap fragmentation
The concurrent collector: optavgpause
For applications that are willing to trade some overall throughput for shorter pauses, a different policy is available. The optavgpause policy (-Xgcpolicy:optavgpause) attempts to do as much GC work as possible before stopping the application, leading to shorter pauses (Figure 3). The same mark-sweep-compact collector is used, but much of the mark and sweep phases can be done as the application runs. Based on the program's allocation rate, the system attempts to predict when the next garbage collection will be required. When this threshold approaches, a concurrent GC begins. As application threads allocate objects, they will occasionally be asked to do a small amount of GC work before their allocation is fulfilled. The more allocations a thread does, the more it will be asked to help out. Meanwhile, one or more background GC threads will use idle cycles to get additional work done. Once all the concurrent work is done, or if free memory is exhausted ahead of schedule, the application is halted and the collection is completed. This pause is generally short, unless a compaction is required. Because compaction requires moving and updating live objects, it cannot be done concurrently.
Figure 3. Application and collector CPU usage: optavgpause
The generational collection: gencon
It has long been observed that the majority of objects created are only used for a short period of time. This is the result of both programming techniques and the type of application. Many common Java idioms create helper objects that are quickly discarded; for example, StringBuffer/StringBuilder objects, or Iterator objects. These are allocated to accomplish a specific task, and are rarely needed afterwards. On a larger scale, applications that are transactional in nature also tend to create groups of objects that are used and discarded together. Once a reply to a database query has been returned, then the reply, the intermediate state, and the query itself are no longer needed.
This observation led to the development of generational garbage collectors. The idea is to divide the heap up into different areas, and collect these areas at different rates. New objects are allocated out of one such area, called the nursery (or newspace). Since most objects in this area will become garbage quickly, collecting it offers the best chance to recover memory. Once an object has survived for a while, it is moved into a different area, called tenure (or oldspace). These objects are less likely to become garbage, so the collector examines them much less frequently. For the right sort of workload the result is collections that are faster and more efficient since less memory is examined, and a higher percentage of examined objects are reclaimed. Faster collections mean shorter pauses, and thus better application responsiveness.
IBM's gencon policy (-Xgcpolicy:gencon) offers a generational GC ("gen-") on top of the concurrent one described above ("-con"). The tenure space is collected as described above, while the nursery space uses a copying collector. This algorithm works by further subdividing the nursery area into allocate and survivor spaces (Figure 4). New objects are placed in allocate space until its free space has been exhausted. The application is then halted, and any live objects in allocate are copied into survivor. The two spaces then swap roles; that is, survivor becomes allocate, and the application is resumed. If an object has survived for a number of these copies, it is moved into the tenure area instead.
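The copy-and-swap cycle can be sketched as a toy model. Everything in it is hypothetical: the Obj class, the lists standing in for the allocate, survivor and tenure areas, the liveness predicate supplied by the caller, and the fixed TENURE_AGE threshold, which is not a documented JVM value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy model of a copying nursery with age-based tenuring; illustrative only. */
class NurserySketch {
    static final int TENURE_AGE = 3;         // hypothetical number of copies before tenuring

    static final class Obj {
        final int id;
        int age;                             // how many nursery collections this object survived
        Obj(int id) { this.id = id; }
    }

    List<Obj> allocate = new ArrayList<>();
    List<Obj> survivor = new ArrayList<>();
    final List<Obj> tenure = new ArrayList<>();

    /** One nursery collection: copy live objects out of allocate, then swap the two spaces. */
    void scavenge(Predicate<Obj> isLive) {
        for (Obj o : allocate) {
            if (!isLive.test(o)) continue;   // dead objects are never copied
            o.age++;
            if (o.age >= TENURE_AGE) tenure.add(o); else survivor.add(o);
        }
        allocate.clear();                    // everything left behind is garbage
        List<Obj> tmp = allocate;
        allocate = survivor;                 // survivor becomes the new allocate space
        survivor = tmp;
    }

    public static void main(String[] args) {
        NurserySketch heap = new NurserySketch();
        int nextId = 0;
        for (int cycle = 0; cycle < 3; cycle++) {
            for (int i = 0; i < 5; i++) heap.allocate.add(new Obj(nextId++));
            heap.scavenge(o -> o.id == 0);   // only object 0 stays reachable
        }
        System.out.println("tenured: " + heap.tenure.size() + ", nursery: " + heap.allocate.size());
    }
}
```

Note that the cost of each scavenge in this model depends on how many objects pass the liveness test, not on how many were allocated.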
Figure 4. Gencon in action
In theory, this means that half of the nursery space (that is, the survivor space) is unused at any point in time. In practice, the amount of memory set aside for survivor space is adjusted on the fly depending on the percentage of objects that survive each collection. If the majority of new objects are getting collected -- which is the expected case -- then the dividing line between allocate and survivor is tilted, increasing the amount of data that can be allocated before a collection is required.
This style of collector has an important benefit: by moving live objects with each collection, the nursery area is implicitly compacted with each collect. This results in the largest possible block of free space, but also tends to move objects that are closely related (for example, a String and its char[] data) into nearby memory locations. This improves the performance characteristics of the system memory cache, and, in turn, the application itself.
The cost of a nursery garbage collection is relative to the amount of data that survives (Figure 5). Because the expectation is that most objects will be garbage, a nursery collection generally results in a very brief pause. While the majority of objects should be collected quickly, some will not. This means that over time, the tenure area will fill up with long-lived objects and a garbage collection of the entire heap will be required. Most of the techniques described above for the concurrent collector still apply here. Marking of the tenure area will run concurrently as required while allocations and collections happen in the nursery. Sweeping of the tenure area is not done concurrently under gencon, but as part of the main tenure collection.
Figure 5. Application and collector CPU usage: gencon
The region-based collector: balanced
A new garbage collection policy has been added in WebSphere Application Server V8. This policy, called balanced (-Xgcpolicy:balanced), expands on the notion of having different areas of the heap. It divides the heap into a large number of regions, which can be dealt with individually. The details of region-based garbage collection in general, and the balanced policy in particular, will be discussed in Part 2.
Tuning heap settings for non-generational collectors
How to monitor and analyze the garbage collector
Monitoring garbage collector and Java heap usage can be achieved using one of two mechanisms:
·         The Health Center live monitoring tool provides very low performance overhead monitoring and analysis of garbage collection and other data, including method execution and lock contention.
·         The -verbose:gc command line option produces output to the native_stderr.log file that can be loaded into the Garbage Collection and Memory Visualizer tool. (It is also possible to direct -verbose:gc output to its own file using the -Xverbosegclog option.)
See Resources for details.
The first step to tuning the heap size for any application is to run your application using the default heap settings, which will enable you to gauge the out-of-the-box performance. At this point, if the heap is consistently less than 40% free or the GC pauses are taking more than 10% of the total run time, you should consider increasing the heap size. The minimum and maximum heap sizes can be modified by specifying -Xms<value> and -Xmx<value>, respectively.
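The two thresholds can be checked roughly from inside a running JVM with the standard java.lang.management API. This is a sketch under stated assumptions rather than a replacement for the tools above: it samples free heap at a single instant instead of after each collection, and the collection time reported by the MXBeans is an approximate accumulated figure that may include concurrent work, so verbose GC output remains the better source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Rough in-process check of the 40% free / 10% GC time rule of thumb; illustrative only. */
class HeapCheckSketch {

    /** True when the heap is under 40% free or GC takes over 10% of the run time. */
    static boolean shouldConsiderLargerHeap(double freeFraction, double gcTimeFraction) {
        return freeFraction < 0.40 || gcTimeFraction > 0.10;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Instantaneous sample only; "consistently" in the text implies watching this over time.
        double freeFraction = (double) rt.freeMemory() / rt.totalMemory();

        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcMillis += Math.max(0, gc.getCollectionTime());   // -1 means the value is undefined
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        double gcFraction = uptimeMillis > 0 ? (double) gcMillis / uptimeMillis : 0.0;

        System.out.printf("free heap: %.0f%%, time in GC: %.1f%%%n",
                freeFraction * 100, gcFraction * 100);
        System.out.println("consider a larger heap: "
                + shouldConsiderLargerHeap(freeFraction, gcFraction));
    }
}
```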
The time spent in GC pauses for the mark and sweep phases of a garbage collection is based on the number of live objects on the heap. As you increase the heap size on a consistent workload, the mark and sweep phases will continue to take approximately the same length of time. So, by increasing the heap size the interval between GC pauses will increase, which will give the application more time to execute.
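A back-of-the-envelope model makes this relationship concrete. It assumes a steady allocation rate, a constant amount of live data, and a fixed pause per collection. The figures in main (300 MB live, 100 MB/s allocated, 0.2 s pauses) are invented for illustration, not measurements.

```java
/** Simple model of GC interval and pause overhead versus heap size; illustrative only. */
class GcIntervalSketch {

    /** Seconds the application runs between collections, assuming steady allocation. */
    static double secondsBetweenCollections(double heapMB, double liveMB, double allocMBPerSec) {
        return (heapMB - liveMB) / allocMBPerSec;
    }

    /** Fraction of wall-clock time spent paused, for a fixed pause per collection. */
    static double pauseFraction(double pauseSec, double intervalSec) {
        return pauseSec / (pauseSec + intervalSec);
    }

    public static void main(String[] args) {
        double liveMB = 300, allocMBPerSec = 100, pauseSec = 0.2;   // hypothetical workload
        for (double heapMB : new double[] {512, 1024}) {
            double interval = secondsBetweenCollections(heapMB, liveMB, allocMBPerSec);
            System.out.printf("heap %.0f MB: collect every %.2f s, %.1f%% of time paused%n",
                    heapMB, interval, pauseFraction(pauseSec, interval) * 100);
        }
    }
}
```

In this model the pause length is the same for both heap sizes. Only the gap between pauses grows, so the share of time spent in GC falls.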
If the GC is performing compaction phases due to fragmentation issues, increasing the heap size might help alleviate the long pauses due to compaction. Compaction phases tend to significantly increase GC pause times, so if they are a regular occurrence tuning the heap settings can improve the application performance.
Fixed vs variable sized heap
Using a variable size heap enables the GC to use only the OS resources required by the application for the heap. As the application heap requirements change, the GC can respond by expanding or contracting the heap. The GC can only contract contiguous blocks of memory from the end of the heap, so a compaction might be required to contract the heap. The actual contraction and expansion phases are very quick and will not noticeably increase the GC pause times. By setting the maximum heap size larger than required for normal operations, the application will be able to handle extra workloads by expanding the heap.
Applications with consistent heap requirements might see GC pause time improvements by using a fixed heap size.
Tuning generational GC
How to set command line options in the admin console
Java command line options can be set in the WebSphere Application Server administrative console using the generic JVM arguments option in the Java Virtual Machine panel of the process definition. To find the Java Virtual Machine panel:
1.    Navigate to the admin console and select Servers > Server Types > WebSphere application servers from the left panel.
2.    Select your application server from the main panel.
3.    Expand the Java and Process Management option to the right of the main panel and select Process definition.
4.    Select the Java Virtual Machine option on the right.
5.    The Generic JVM arguments text box displays near the bottom of the main panel.
When the options are added, you will need to save and synchronize the changes before restarting the application server to put the changes in effect.
When tuning for generational garbage collection, the simplest approach is to treat the nursery space as a new Java heap area, in addition to the Java heap area used in the non-generational case. The Java heap for the non-generational case therefore becomes the tenured heap.
This approach is conservative: the expectation is that the occupancy of the tenured heap will drop as a result of introducing the nursery, but it provides a safe starting point, especially in the case of a migration from a non-generational policy. When the occupancy of the tenure heap after global (full) collections can be monitored, the size can then be adjusted as described earlier:
·         -Xmn<size> sets the initial and maximum size of the nursery, effectively setting both -Xmns and -Xmnx.
·         -Xmns<size> sets the initial size of the nursery to the specified value.
·         -Xmnx<size> sets the maximum size of the nursery to the specified value.
The size of the nursery heap should be fixed, and as a result only one of these options, -Xmn, is required. Therefore, you only need to understand how to correctly size the nursery heap.
Sizing the nursery heap
To correctly size the nursery, you first need to consider the mechanism that nursery collections use and the secondary characteristics that occur as a result:
·         Nursery collections work by copying data from allocate to survivor spaces. Copying data is a relatively expensive, time consuming task. As a result, the duration of a nursery collect is dominated by the amount of data being copied. That isn't to say that there is no effect from the number of objects being copied and the size of the nursery itself, but that these are relatively minor in comparison to the cost of copying the actual data. As a result, the duration of the nursery collect is proportional to the amount of data being copied.
·         Only a finite and fixed amount of data is "live" in any given collection. Once an application has completed startup and fully populated its caches, and so on, the amount of "live" data that needs to be copied in the nursery heap is fixed by the amount of work that is being done at that point in time. In a system that processes transactions, the amount of live data that needs to be copied will be equivalent to one set of live transactions. For example, if you have configured your application server with 50 WebContainer threads enabling 50 concurrent transactions to occur, then the amount of live data will be that associated with those 50 transactions.
This means that the duration of the nursery collect is set by the size of the data associated with the number of concurrent transactions occurring at the time of the collect, and not the size of the nursery. This also means that as the size of the nursery is made larger, there is an increase in time between nursery collects, without an increase in the duration of the collect. In effect, as the nursery is made larger, the overall time spent in garbage collection drops.
Figure 6 shows that if the size of the nursery is less than the live data associated with one set of transactions, and thereby the time between nursery collects is less than one transaction, then data has to be copied multiple times.
Figure 6. Average number of times data is copied versus time between nursery collects
As the nursery size is expanded and the time between nursery collects increases, you do less copying on average, and the overhead of garbage collection drops.
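The trend can be approximated with a deliberately simple model, which is an assumption of this sketch and not the formula behind Figure 6. If nursery collections happen at a regular interval, data that stays live for the length of one transaction is copied, on average, once per collection that falls within that lifetime. The model ignores the point at which repeatedly copied objects are moved to tenure.

```java
/** Simplified model of nursery copying cost versus nursery size; illustrative only. */
class NurseryCopySketch {

    /** Seconds between nursery collections for a given allocate space and allocation rate. */
    static double secondsBetweenCollects(double allocateSpaceMB, double allocMBPerSec) {
        return allocateSpaceMB / allocMBPerSec;
    }

    /** Average number of collections (and so copies) during one transaction's lifetime. */
    static double averageCopies(double transactionSec, double secondsBetweenCollects) {
        return transactionSec / secondsBetweenCollects;
    }

    public static void main(String[] args) {
        double transactionSec = 2.0, allocMBPerSec = 128;      // hypothetical workload
        for (double nurseryMB : new double[] {64, 256, 512}) {
            double interval = secondsBetweenCollects(nurseryMB, allocMBPerSec);
            System.out.printf("allocate space %.0f MB: collect every %.1f s, ~%.1f copies%n",
                    nurseryMB, interval, averageCopies(transactionSec, interval));
        }
    }
}
```

Once the interval between collects exceeds the transaction time, the average drops below one copy per object, matching the direction of the curve described above.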
Limitations on the nursery heap size
There are no direct limitations on the size of the nursery heap that are imposed by the IBM garbage collector or JVM; in fact, there are cases where the nursery is being set to sizes in the order of 10s and even 100s of gigabytes. There are, however, limitations imposed by the operating system that the Java process has to adhere to in terms of virtual memory and process address space, as well as the availability of sufficient physical memory (RAM). The operating system restrictions for each platform for a 32bit process are shown in Figure 7.
Figure 7. 32bit address spaces by operating system
The limits for a 64bit process are much, much larger. With addressable memory in the range of hundreds to billions of gigabytes, it is the available physical memory (RAM) limitation that becomes much more important.
Putting the two together
As discussed above, the simplest approach is to treat the nursery as an additional memory space. However, both the nursery and tenured heaps are in fact allocated as a single continuous segment of memory, the size of which is controlled by the -Xmx setting. If only the -Xmx setting is used, 25% of the -Xmx value is used for the maximum nursery size, and the size of the nursery is permitted to grow and shrink within that 25%. This gives the layout for the Java heap shown in Figure 8.
Figure 8. Default heap layout
You should, however, fix the nursery size at a large value to minimise time spent in garbage collection, and enable the tenured heap to resize itself according to occupancy to build in resilience. Therefore, the preferred layout for the Java heap is as shown in Figure 9.
Figure 9. Recommended heap layout
In order to achieve this layout, the individual values for the minimum and maximum heap size for both the nursery and tenured spaces should be set, with minimum and maximum nursery size settings equal to each other, and the minimum and maximum tenured space sizes set to different values.
For example, if you want to have a 256MB nursery heap size, and a tenured heap between 756MB and 1024MB, the values would be:
-Xmns256M
-Xmnx256M
-Xmos756M
-Xmox1024M
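After restarting the server, it can be useful to confirm that the JVM received these options. The sketch below reads the arguments through the standard RuntimeMXBean and checks whether the nursery is fixed. The helper names optionBytes, parseSize and nurseryIsFixed are hypothetical, and the parser handles only plain sizes with an optional K, M or G suffix.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Checks the JVM's own arguments for a fixed nursery size; illustrative only. */
class HeapOptionsSketch {

    /** Parses sizes such as "256M" or "1G" into bytes. */
    static long parseSize(String text) {
        char last = Character.toUpperCase(text.charAt(text.length() - 1));
        long unit = last == 'K' ? 1L << 10 : last == 'M' ? 1L << 20 : last == 'G' ? 1L << 30 : 1L;
        String digits = unit == 1L ? text : text.substring(0, text.length() - 1);
        return Long.parseLong(digits) * unit;
    }

    /** Returns the size given for an option such as "-Xmns", or null if it is absent. */
    static Long optionBytes(List<String> jvmArgs, String option) {
        for (String arg : jvmArgs) {
            // The digit check stops "-Xmn" from matching "-Xmns..." or "-Xmnx...".
            if (arg.startsWith(option) && arg.length() > option.length()
                    && Character.isDigit(arg.charAt(option.length()))) {
                return parseSize(arg.substring(option.length()));
            }
        }
        return null;
    }

    /** True when -Xmn is set, or when -Xmns and -Xmnx are both set to the same value. */
    static boolean nurseryIsFixed(List<String> jvmArgs) {
        if (optionBytes(jvmArgs, "-Xmn") != null) return true;
        Long min = optionBytes(jvmArgs, "-Xmns");
        Long max = optionBytes(jvmArgs, "-Xmnx");
        return min != null && min.equals(max);
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("nursery fixed: " + nurseryIsFixed(jvmArgs));
    }
}
```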
Migrating to generational
Since the default GC policy has changed from optthruput to gencon in WebSphere Application Server V8, previously chosen tuning parameters might need to be adjusted. The primary issue is changing the heap sizes to compensate for the nursery. A program that previously ran fine under optthruput with 1G of heap (i.e. -Xmx1G) might be unhappy running with only 768M of tenure space and 256M of nursery space. The techniques described above will help to choose new heap parameters.
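The arithmetic behind that example can be written out as follows. It takes the 25% default maximum nursery size described earlier and the conservative approach of adding the nursery on top of the old heap. The method names are hypothetical, and the resulting figure is only a starting point for the tuning described above.

```java
/** Heap arithmetic for moving from optthruput to gencon; illustrative only. */
class MigrationSketch {

    /** Default maximum nursery size when only -Xmx is given (25% of -Xmx, per the text). */
    static long defaultMaxNurseryMB(long xmxMB) {
        return xmxMB / 4;
    }

    /** Tenure space left once the nursery has taken its share of -Xmx. */
    static long tenureMB(long xmxMB, long nurseryMB) {
        return xmxMB - nurseryMB;
    }

    /** -Xmx needed so that tenure is as large as the old non-generational heap. */
    static long xmxToPreserveTenureMB(long oldHeapMB, long nurseryMB) {
        return oldHeapMB + nurseryMB;
    }

    public static void main(String[] args) {
        long oldHeapMB = 1024;                                   // previously -Xmx1G
        long nurseryMB = defaultMaxNurseryMB(oldHeapMB);
        System.out.println("nursery up to " + nurseryMB + " MB, tenure at least "
                + tenureMB(oldHeapMB, nurseryMB) + " MB");
        System.out.println("conservative starting -Xmx: "
                + xmxToPreserveTenureMB(oldHeapMB, nurseryMB) + " MB");
    }
}
```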
There are other less obvious situations where gencon may display different behaviour.
Because classes are generally long-lived objects, they are allocated directly into tenure space. As a result, class unloading can only be done as part of a tenure collection. If an application relies heavily on short-lived class loaders, and nursery collections can keep up with any other allocated objects, then tenure collections might not happen very frequently. This means that the number of classes and class loaders will continue increasing, which can increase the pressure on native memory and lead to very long tenure collections when they do happen, because there is so much class unloading work to be done.
If this issue becomes a problem, there are two solutions. The first is to encourage additional tenure collections when large numbers of class loaders are being created. The command line option -Xgc:classUnloadingKickoffThreshold=<number> tells the system to start a concurrent tenure collection every time <number> new class loaders have been created. So, for example, specifying -Xgc:classUnloadingKickoffThreshold=100 will start a concurrent tenure collect whenever a nursery collect notices that 100 new class loaders have been created since the last tenure collection. The second solution is to change to one of the other GC policies.
A similar issue can arise with reference objects (for example, subclasses of java.lang.ref.Reference) and objects with finalize() methods. If one of these objects survives long enough to be moved into tenure space before becoming unreachable, it could be a long time before a tenure collection runs and "realizes" that the object is dead. This can become a problem if these objects are holding on to large or scarce native resources. We've dubbed this an "iceberg" object: it takes up a small amount of Java heap, but below the surface lurks a large native resource invisible to the garbage collector. As with real icebergs, the best tactic is to steer clear of the problem wherever possible. Even with one of the other GC policies, there is no guarantee that a finalizable object will be detected as unreachable and have its finalizer run in a timely fashion. If scarce resources are being managed, manually releasing them wherever possible is always the best strategy.
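One common way to release such resources manually in Java is the try-with-resources statement (Java 7 and later), which calls close() as soon as the block exits instead of waiting for a collection. In the sketch below, NativeBuffer is a hypothetical stand-in for an iceberg object, and a static counter takes the place of the native resource.

```java
/** Releasing a scarce resource deterministically instead of relying on finalization. */
class IcebergSketch {

    /** Hypothetical wrapper around a scarce native resource. */
    static final class NativeBuffer implements AutoCloseable {
        static int open = 0;                 // stands in for native resources currently held
        private boolean closed;

        NativeBuffer() { open++; }

        @Override
        public void close() {
            if (!closed) {                   // safe to call more than once
                closed = true;
                open--;
            }
        }
    }

    /** Returns how many buffers were open while the resource was in use. */
    static int useBuffer() {
        try (NativeBuffer buf = new NativeBuffer()) {
            return NativeBuffer.open;        // the resource is held at this point
        }                                    // close() runs here, even if an exception is thrown
    }

    public static void main(String[] args) {
        int during = useBuffer();
        System.out.println("in use during: " + during + ", after: " + NativeBuffer.open);
    }
}
```

With this pattern the native resource is returned at a known point in the program, regardless of which area of the heap the wrapper object ends up in.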
Changing policies
The default policy should provide adequate performance for most workloads, but it may not be the ideal choice for a particular application.
An application that operates like a "batch job" sets up its initial state and loads the data on which it will operate. Most of these objects will live for the duration of the job, with only a few additional objects being created as the job runs. This sort of workload fits the optthruput model since the expectation is that there will be very little garbage until the task is complete. In a very similar case, jobs that complete very quickly or allocate very few objects might be able to run without requiring a garbage collection, given an appropriately sized heap. In these cases, the minimal overhead of the optthruput collector makes it a good choice.
By contrast, a transactional application is continually creating and discarding groups of objects. In this context, the term "transaction" can be very literal -- consider a database update or an e-commerce purchase -- or used in the far broader sense of a discrete unit of work. To take some examples, serving a web page can be considered a transaction. The client submits a URL, the server computes the content of the page and sends it to the client. Once the client has received the page, the server can discard the computed data.
To stretch the definition a little further, consider a standard user interface. The user clicks on the Save button and the system opens a file dialog so the user can navigate the filesystem and choose a location for the document. Once the user has dismissed the dialog, all of that intermediate state becomes unnecessary. Even some batch jobs are really transactional under the surface. A task that is creating thumbnails for a collection of large image files might appear to be a single large batch job, but internally it is processing images separately, each one forming a transaction of sorts. For any of these kinds of workloads, the gencon model should provide benefits.
The optavgpause model sits somewhere in the middle. Applications that have a lot of long-lived data that changes slowly as the program runs might fit this model. This is an unusual pattern for a workload; generally, long-lived data is either almost never changed or changed frequently. That said, a system with a slowly evolving data set that does not create many intermediate objects might benefit from this policy. Programs that are unable to run efficiently under gencon due to one of the problems discussed above might benefit from the concurrent nature of optavgpause.
Conclusion
This article presented a brief description of the garbage collection strategies available in the Java Virtual Machine in WebSphere Application Server V8. While the default setting should perform well in most cases, some tuning might be required to get the best performance. By matching the GC policy used to the type of workload and choosing suitable heap parameters, the impact of garbage collection on the application can be reduced.
Part 2 of this series will introduce a new region-based garbage collection strategy, called balanced, which is designed to improve scalability when deploying on large 64bit multi-core systems. The article will cover the motivation behind the new technology and the performance improvements it provides, as well as hints and tips on tuning this new option.
Resources
Learn
·         Wikipedia: Garbage collection definition


LINK: http://www.ibm.com/developerworks/websphere/techjournal/1106_bailey/1106_bailey.html
http://www.ibmsystemsmag.com/ibmi/administrator/websphere/Tuning-Garbage-Collection-With-IBM-Technology-for/
