
The Java OHR Project


Sub-millisecond pause times on a hundred Gig heap – and beyond!

Rudi Simic - Industrie IT Technology Labs

 

Find the code on GitHub:

https://github.com/industrieit/ohr

 

 

Should I bother reading this? Is it relevant to me?

This project is for those interested in minimal pause times for very large memory footprint java applications.

The table below compares OHR- and POJO-based versions of a population growth simulator, where each person is represented by an object – growing to a full 66-gigabyte process heap.

Max GC pause times come down from 22 seconds to 20 milliseconds when using OHR rather than POJOs for long-lived objects. OHR also packs more objects into the same amount of memory (almost double – 369 vs 189 million).

Note the max GC pause times listed in the table are absolute outliers. Average and median values are sub-millisecond or in the low milliseconds. Even better figures could be achieved.

Another important point is that increasing the heap size would have no discernible effect on pause times when using OHR for long-lived objects. I could have tested with a 200-gigabyte heap; the OHR stats would remain by and large the same, but the POJO stats would get a hell of a lot worse. I happened to use the biggest AWS instance available on spot availability in my local region.

Introduction

If I were to pick one thing that has increased the productivity of software development over the last decade or so – possibly even above the expressiveness of modern higher-level languages – it would be the automatic garbage collection baked into their runtimes. I recently blogged about Google's new Go language, and agreed with the decision to include a garbage collector – on the balance of things. The fact is that in a multi-core parallel world where objects and structures are passed between threads, perfectly maintaining all object life-cycles becomes a very difficult task.

But garbage collection is most definitely a double-edged sword.

This is not a criticism mind you. Modern garbage collectors are trying to efficiently solve a complex mathematical directed-graph problem. I’ll warrant those working on the problem are more capable than I.

The current state of play, however, is:

(i) in order to have garbage collection with a negligible CPU load, your application must reserve, say, 5 times its actual working memory, to give the collector space to fill between collections and do its thing.
(ii) there are instances where current collectors must 'stop the world': traverse every object reference (at least within one generation segment), mark it, sweep the unmarked refs to remove the unreferenced objects, then finally repack the remaining objects into a linear segment of memory to prevent fragmentation and speed up future mallocs to fixed linear offsets rather than more complex and time-consuming list traversal. Even on modern CPUs, when object numbers climb into the billions, this pause time easily reaches seconds, tens of seconds, or even MINUTES!

And while some are thinking of gigabyte heaps – I am looking further forward to the possibility of tens, or hundreds, of gigabytes – it won't be long before a not particularly cutting-edge server gives you a full terabyte of RAM. When nonvolatile memristor-based storage takes off, the distinction between 'RAM' and 'hard drive' storage fades to potential insignificance. Java heaps of such size are at the moment not feasible (depending on your definition) using current collection technology. To scale out on a machine with such a high resource profile, you'd currently have to run multiple JVMs on the one server. This adds inefficiency in the replicated virtual-machine resident memory, and additional complexity from inter-process communication.

This is the biggest thing that separates systems level languages from higher level languages. Without doubt you get massive benefits using higher level managed languages, but you pay a price. And when you push the envelope, the balance between pros and cons starts to shift more to the middle, and eventually backwards to your own goal line.

I've often wished that Java would allow me to opt an object reference out of garbage collection. There may be a very large, critical collection in your application. The ingress and egress of objects into and out of that collection may be defined by a set of well-known state transitions – for example startSession/endSession or something similar. In such a case – with well-known, simple state transition scenarios – it is NOT overly onerous to manage the object lifecycle manually. And opting that one large critical collection out of GC could improve your application's performance DRAMATICALLY. From dead in the water/completely unacceptable to 'you know what, we can live with that in preference to rewriting the whole %^&*( thing in C'.

And that's the main point – a long-running application is not infinitely long in code; it runs for a long time because it's looping. If this same app occupies a large amount of memory, it's unlikely you have defined thousands or millions of root object references in your source (int x1=0;………. int x9999999999=0;). Most likely you have defined a COLLECTION class as one of the root or transitive references, through which a vast web of reference links spiders out.

And once you start to include collections in your app, you can no longer completely abrogate your object tracking and cleanup responsibilities. Hence most of us have inadvertently created memory leaks in a Java app even though it provides full, always-on, non-optional garbage collection. You can also, of course, access a nulled-out object reference, which is really just a soft, controlled segfault.

The common Java collectors always have a fully locked mark-sweep phase, where all executing threads are suspended and the whole reference tree must be traversed (possibly in parallel) from the root references to determine which objects are unreferenced. Once the duds are marked, deallocation and compaction can be performed in the background. Modern collectors are more complex than this – they maintain separate 'generation' spaces to cut the pauses into traversals of smaller chunks of memory – but that's the general meat of it.
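The mark phase described above can be sketched as a plain reachability traversal over an object graph. This is a toy illustration of why pause time scales with the number of live references, not the JVM's actual collector:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy mark-sweep: walk the reference graph from the roots, mark every
// reachable node; anything left unmarked is garbage.
class MarkSweepSketch {
    static class Node {
        boolean marked;
        List<Node> refs = new ArrayList<>();
    }

    // Mark phase: iterative traversal from the root set. Every live
    // reference must be visited - hence pauses grow with heap size.
    static void mark(List<Node> roots) {
        Deque<Node> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (n.marked) continue; // already visited
            n.marked = true;
            stack.addAll(n.refs);
        }
    }

    // Sweep phase: count (in a real collector, free) the unmarked nodes.
    static int sweep(List<Node> heap) {
        int freed = 0;
        for (Node n : heap) if (!n.marked) freed++;
        return freed;
    }
}
```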

In very large heaps this traversal is a BIG task. Imagine a Java heap of many gigabytes. There may be hundreds of millions or even billions of references in the heap. Assuming each check and mark takes one CPU clock cycle (impossibly optimistic), you are looking at pause times on the order of SECONDS on a current fast CPU. This is unacceptable for many classes of applications. I am not necessarily talking about hard real time here. Hard real-time applications require a specialist stack from the ground up (including the OS). The flight control system of the Joint Strike Fighter doesn't use Java, and it doesn't use a desktop-type OS either. There are plenty of domains aside from hard real time where a pause in processing of seconds simply does not cut it.

Wouldn't it be great to be able to opt some of your larger object collections out of garbage collection, while still allowing the trivial short-lived flotsam you create and discard within your methods to be automatically collected? The best of both worlds?
As I've said, I've often wished for this, but it was previously not possible in Java without diving into JNI. This project is an attempt to come up with a nicer, all-Java solution.

Introducing the JAVA OHR project.

This project, (or one like it) has the potential to open up the use of Java to scenarios you previously wouldn’t have considered it for. Like what? How about a portable (*conditions apply) 100% Java application with 100 gigabytes of ‘heap’ and sub-millisecond garbage collection pause times? Interested? Read on.

OHR:

O – Off
H – Heap
R – Reification

I say off heap, but I really mean parallel heap reification. As you probably know, the Java Virtual Machine is (heh…) a virtual machine emulated (with great precision and performance) on top of an operating system process. The operating system provides a process heap, from which the JVM constructs its own managed heap. Except for stack-based allocation, everything is allocated on a heap of some sort. It's when you run machines inside machines that confusion arises, because you then have more than one. My nomenclature will be from the Java perspective. For the rest of the discussion, take note that when I refer to the 'heap', I am referring to the Java 'heap'. When I say something is allocated 'off heap', I mean off the Java heap (but on the OS process heap).

While investigating the LMAX Disruptor architecture (thanks Andrew C), I was recently made aware of the existence of the sun.misc.Unsafe class, an internal utility that allows you to manipulate JVM memory at the level of the JVM itself. I instantly recognised the powerful possibilities which could be realised using this class coupled with runtime bytecode generation.

I'll give you one guess why it's called Unsafe… yes – because using the class unwraps you from the protective arms of the JVM, and incorrect use WILL crash the JVM with a segmentation fault. Yes – the very same you get in any language that allows unrestrained pointer manipulation. What? Totally unacceptable? Possibly so – in which case accept the compromises of an always-on mandatory garbage collector. OHR is for those who need more than that, but are still reluctant to drop the familiarity of Java and do stuff in C or C++. For those who need to write Java apps that consume massive amounts of memory yet still require negligible pause times. Those who are considering an unmanaged language like C, but would prefer the productivity benefits of Java if they could just get rid of the biggest roadblock – garbage collection – and are prepared to be a little more careful with lifecycle management.

Off Heap Reified objects (or OHR objects, or reified objects) are defined as abstract classes with abstract getters and setters. I use runtime class generation magic (via Javassist) to fill in the blanks for you. The runtime generation is a compilation phase, the end result being that when you reify the class (via a factory method), you get what appears to be a plain old Java object complying with your abstract definition. The abstract class, though having abstract accessors, may include concrete method implementations that make use of those accessors. It smells the same as any other Java object. However, its backing memory is not on the Java heap: it delegates the storage and retrieval of bean properties to memory outside the auspices of the garbage collector (that is, to the operating system process heap rather than the JVM heap).

Now, I don't recommend YOU use Unsafe in your day-to-day development. But I must, to achieve the results I am aiming for. Be forewarned. OHR allows you to achieve feats not possible without it. But with great power comes great responsibility.

The upside of using sun.misc.Unsafe rather than a JNI library is that, at least for the HotSpot-based VM, the code is all Java and portable across platforms. The downside is that it's specifically not a java.* or javax.* package, so there's no guarantee it'll be around in the future. I wouldn't be too worried though. Some high-profile, high-performance libraries now use sun.misc.Unsafe – it's become a dirty secret for on-the-sly JVM poking. I understand it was recently put into OpenJDK for that reason.

Reifying a standard bean is easy, but it would be awesome to have a complete, parallel implementation of collection classes that mirror those from standard Java we've come to know and love, but operate on OHR objects instead. At this stage I provide implementations of an ArrayList and a SkipListHashMap (though I don't claim they are particularly elegant implementations). It should not be difficult to port the existing collections implementations. There is a question of how to maintain proper concurrency, however – which I'll discuss at the end of this article. Also, iteration support has not been implemented as yet.

As much as I want to follow the standard Java collections interfaces, there are two problems.

* It's time to bite the bullet on collection capacity. Without garbage collection (as we'll see), you can manage collections with many billions of entries no problem, so integer indexes are insufficient. All my collections use long indexes (and long hashes), which right off the bat breaks the standard Java interface definitions (all expecting 32-bit indexing).

* There are a few things the standard Java collections allow you to do which would be problematic for very large collections. ArrayList allows you to insert and delete entries, for example. ArrayList is backed by a linear array underneath. Insertion and deletion of non-tail entries requires a memory shift of all entries above the insertion point to make room for, or fill in, the new or deleted entry, thus maintaining the ability to reference any entry via a linear offset from the base. For small collections this is no problem. A large collection may require shifting gigabytes of data. Possible, but not really feasible for our ends – fast performance on large sets of data. To avoid performance problems it may be an idea to remove this option. My ArrayList implementation does support insertion and deletion, but I'm not sure it should. A list implementation that supports these functions is best written using sparse array techniques (or linked lists) built on higher-level underlying collections rather than linear arrays. Sparse arrays may support more efficiency for some functions at the expense of others (namely offset indexing).

As a trade-off, wrapping functionality to access the OHR collections through the standard interfaces from existing code should be provided (it's on the TODO list). In such usage some methods may throw a runtime exception if the capacity of the underlying collection exceeds the limits of 32-bit indexing.
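The wrapping idea might look something like the sketch below. LongIndexedList and IntViewList are hypothetical names of mine, not the actual OHR API – the point is only the overflow behaviour when a long-indexed collection is viewed through java.util.List:

```java
import java.util.AbstractList;

// Stand-in for a long-indexed OHR collection (hypothetical interface).
interface LongIndexedList<E> {
    E get(long index);
    long size();
}

// Adapter exposing a long-indexed collection through java.util.List,
// throwing once the contents exceed 32-bit indexing.
class IntViewList<E> extends AbstractList<E> {
    private final LongIndexedList<E> backing;

    IntViewList(LongIndexedList<E> backing) { this.backing = backing; }

    @Override public E get(int index) { return backing.get(index); }

    @Override public int size() {
        long n = backing.size();
        if (n > Integer.MAX_VALUE)
            throw new IllegalStateException("collection exceeds 32-bit indexing: " + n);
        return (int) n;
    }
}
```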
—–

As a quick look at how to use OHR, consider the following OHR-able abstract class definition:

public abstract class TC2 {

    @Reify
    public abstract long getX();

    @Reify
    public abstract void setX(long l);

    public String blah() {
        return "x is " + getX();
    }
}

Note it’s an abstract class that does not have to extend or implement a special base class or interface. We have abstract property accessors marked by a special annotation. We also have a real method that makes use of the abstract property accessors. Obviously we need some magic to realise (or reify) such a class definition. So the ‘new’ operator is not applicable.

I provide a factory object called Reifier.

Reifier has a method signature of

public Object reify(Class clazz);

From there the magic begins. In the case where the class has not been reified before, a 'compilation' phase is executed to produce a reified implementation class of the abstract base. The appropriately annotated accessors are introspected and, much like a traditional compilation phase, the total memory required to fit the variables is calculated; offsets are determined for each property, denoting where in the aggregate chunk of memory each virtual variable is stored. A runtime subclass is generated that implements the abstract accessors. This runtime class is then cached and used to reify any future instances of the class.
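The offset calculation in that compilation phase essentially amounts to summing field sizes, much like a C struct layout. A simplified sketch (my own illustration – a real layout would also account for alignment, and the header word reserved here for guard/class-id bookkeeping is an assumption):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Assigns each reified property a fixed byte offset within the
// object's off-heap memory block.
class LayoutSketch {
    private final Map<String, Long> offsets = new LinkedHashMap<>();
    private long size = 8; // reserve a header word for bookkeeping (assumed)

    // sizeInBytes: 8 for long/double, 4 for int/float, and so on.
    void addField(String name, int sizeInBytes) {
        offsets.put(name, size);
        size += sizeInBytes;
    }

    long offsetOf(String name) { return offsets.get(name); }

    // Total bytes to request from the allocator per instance.
    long totalSize() { return size; }
}
```

With a single long property, the first field lands at offset 8 – consistent with the `ohwrite(8L, …)` calls in the redacted generated class shown later.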

On instantiation, off-heap memory is allocated via sun.misc.Unsafe.allocateMemory() – essentially a malloc wrapper. The pointer to this memory is stored in a private variable of the reified class (a primitive long, and so not an object), and the accessors are implemented to access the appropriate offsets from this base pointer, within the allocated space requested from malloc. Normal Java objects do exactly the same thing, hidden inside the JVM implementation. sun.misc.Unsafe also provides methods to write primitives and byte arrays into arbitrary process memory. Note that writing to memory not previously allocated will result in the dreaded segmentation fault and process termination by the OS. It's exactly the same as accessing a stale pointer in an unmanaged language.
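The base-pointer-plus-offset access pattern can be illustrated safely with a direct ByteBuffer, which is likewise allocated outside the managed Java heap. This is an analogy of mine, not how OHR itself is implemented – OHR goes through Unsafe directly:

```java
import java.nio.ByteBuffer;

// Property access via fixed offsets from an off-heap memory block,
// using a direct buffer in place of a raw Unsafe pointer.
class OffHeapFieldSketch {
    // Offset computed at "compile" time for the single long property.
    private static final int X_OFFSET = 8;

    private final ByteBuffer block;

    OffHeapFieldSketch(int totalSize) {
        // allocateDirect reserves memory outside the Java heap
        this.block = ByteBuffer.allocateDirect(totalSize);
    }

    void setX(long v) { block.putLong(X_OFFSET, v); }

    long getX() { return block.getLong(X_OFFSET); }
}
```

Unlike Unsafe, an out-of-range offset here raises an IndexOutOfBoundsException rather than crashing the process – which is exactly the safety net OHR trades away for speed.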

Incidentally, it's the first time I've actually used Javassist, and I have to say I was pleasantly surprised by the ease with which it can be used. It's a full-blown macro processor. A bolt-on macro processor, not built into the language like, say, Lisp's macro facility, but powerful nonetheless. (It's cool libraries like Javassist that make a language like Java, with, shall we say, a non-spectacular, modest built-in feature set, still somewhat relevant today.)

For those interested in the internals – the reified implementation of the above looks like this (heavily redacted):

public class TC2 extends com.simic.ot.jassist.TC2 {
    private volatile long basePtr;
    private volatile int instmarker;
    private static Unsafe u;

    public void setX(long paramLong) {
        ohwrite(8L, paramLong);
    }

    public long getX() {
        return ohreadlong(8L);
    }

    public void ohwrite(long offset, long val) {
        u.putLong(this.basePtr + offset, val);
    }
    // .....
}

The implemented POJO used to access the OHR data space is therefore what I call a lightweight proxy. It contains only the base pointer from which the properties may be relatively indexed. If you let that lightweight proxy go out of scope – and you can't regain it from another reference, and you haven't created a handle – the off-heap memory allocated for the object is lost forever. The proxy will be garbage collected; the off-heap malloced memory remains a permanent memory leak. The object must be freed manually:

Reifier.freeOHR(myRef);

Now – what types are supported for reified properties? All Java primitives are supported (though the equivalent boxed objects are not). This is a performance-oriented library with the goal of minimising the garbage collection penalty. Primitive boxing, though convenient, comes at the cost of copious amounts of Java object creation. Find an alternative way (such as an additional boolean primitive property) to indicate a 'null' value with greater performance.
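The boolean-flag workaround might look like this. The class and method names are illustrative; the same pattern would apply to a pair of @Reify properties on an OHR abstract class:

```java
// Nullable long without boxing: pair the primitive with a boolean
// "is set" flag - two cheap primitive properties instead of a Long.
class NullableLongField {
    private long x;
    private boolean xSet;

    void setX(long v) { x = v; xSet = true; }

    void clearX() { xSet = false; } // "assign null"

    // Callers check hasX() instead of testing a reference against null.
    boolean hasX() { return xSet; }

    long getX() {
        if (!xSet) throw new IllegalStateException("x is null");
        return x;
    }
}
```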

Incidentally, the @Reify annotation has an optional argument to indicate the consistency model used for the property's manipulation:

@Reify(consistency = Consistency.NORMAL)
public abstract void setX(int x);
@Reify(consistency = Consistency.ORDERED)
public abstract void setX(int x);
@Reify(consistency = Consistency.VOLATILE)
public abstract void setX(int x);

(NOTE: still to be implemented – the Consistency options are disregarded and all accesses are currently normal)

For those unsure of what this means, I'll leave it as an exercise to review the documentation on the Java Memory Model. Suffice to say it relates to the timeliness with which data changes become visible between code running on different threads and processors. Stricter requirements such as VOLATILE impose limitations on how much the compiler can optimise code (such as by keeping intermediate calculations in registers), as well as controlling the various levels of on-chip caches and queues. The JMM is an abstraction that can be implemented on many architectures, all of which provide native processor-level instructions related to cache control. In the case of x86-64, there is no difference between ORDERED and VOLATILE. This is not necessarily the case on other architectures.

Being unfamiliar with the Java Memory Model is asking for trouble when writing multithreaded code on multicore machines. Needless to say, safe concurrent data access is difficult, and the best way to win the game is not to play at all (anyone getting that reference is old). This is why, in this new world, renewed effort is being put into constructing solutions via CSP (Communicating Sequential Processes) to ease this inter-thread communication. Google Go's channels are a good example. It should be noted that care needs to be taken when passing the lightweight proxies themselves between disparate threads (we may require an option to tighten up the setting of the pointer within the lightweight proxy itself).

Strings

If there's one class that creates craptastic amounts of needless garbage it's java.lang.String. String manipulation is common, and notwithstanding the OO purity of String immutability, care needs to be taken in its use. OHR provides an inline String-like representation that is allocated intrinsically with the defining object itself. However, since we preallocate a fixed amount of memory for each OHR object type, we force you to set a max length. With that limitation in place, we preallocate the string data much the same way we allocate inline arrays – the property is implicitly part of the containing object, unlike java.lang.String.

@Reify
public abstract String getStr();

@Reify
@InlineStringReify(length=10, trimOverflow=true, asciiOnly=false)
public abstract void setStr(String st);

(NOTE: Reify modifiers are always on the setter except for inline arrays which do not have setters)

Along with length you will notice a few other options in the annotation. trimOverflow, set to true, will silently truncate a string too big to fit into the allocated space; otherwise, by default, a runtime exception is thrown. asciiOnly is a mechanism to conserve space. In many cases you will be dealing with 8-bit character strings. Setting asciiOnly to true stores the data one byte per character, halving the required space. Passing a String containing a character above 255 will throw an exception on an asciiOnly string.
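The trimOverflow and asciiOnly semantics can be sketched like this – an illustration of the described behaviour over a fixed-size byte slot, not OHR's internal code:

```java
import java.nio.charset.StandardCharsets;

// Fixed-capacity, one-byte-per-character string slot with the
// trimOverflow semantics described above (asciiOnly behaviour).
class InlineStringSketch {
    private final byte[] slot; // preallocated with the containing object
    private final boolean trimOverflow;
    private int length;

    InlineStringSketch(int maxLength, boolean trimOverflow) {
        this.slot = new byte[maxLength];
        this.trimOverflow = trimOverflow;
    }

    void set(String s) {
        // asciiOnly: reject characters that don't fit in one byte
        for (int i = 0; i < s.length(); i++)
            if (s.charAt(i) > 255)
                throw new IllegalArgumentException("character above 255");
        if (s.length() > slot.length && !trimOverflow)
            throw new IllegalArgumentException("string too long: " + s.length());
        length = Math.min(s.length(), slot.length); // silently truncate
        for (int i = 0; i < length; i++)
            slot[i] = (byte) s.charAt(i);
    }

    String get() {
        // ISO-8859-1 maps bytes 0-255 back to the same char values
        return new String(slot, 0, length, StandardCharsets.ISO_8859_1);
    }
}
```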

Inline Strings are a great way to store database results. Database schemas already limit string sizes for the same reasons of efficient allocation as OHR.

One final optimisation. The above definition, using accessors of type java.lang.String, creates an object on each access due to java.lang.String's immutability. If you define the type to be CharSequence instead:

@Reify
public abstract CharSequence getStr4();

@Reify
@InlineStringReify(length=10, trimOverflow=false, asciiOnly=true)
public abstract void setStr4(CharSequence st);

OHR will use a custom CharSequence object for each access to avoid creating garbage. The custom char sequence directly accesses the underlying off-heap memory, allowing things like string comparison to be done with no garbage created. Do not hold this reference after using it, as changing the string via the setter will change the properties of the char sequence object.
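A garbage-free CharSequence can be sketched as a mutable view that is re-pointed at the data rather than reallocated. This is my own illustration of the idea, backed by a heap byte array instead of off-heap memory; it also shows why holding the reference across setter calls is dangerous – the view's contents change underneath you:

```java
// Mutable CharSequence view over a backing byte array: reading the
// "string" allocates nothing, and one view object is reused.
class ByteSeqView implements CharSequence {
    private byte[] backing;
    private int length;

    // Re-point the view at new data instead of allocating a String.
    void reset(byte[] backing, int length) {
        this.backing = backing;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int i) { return (char) (backing[i] & 0xFF); }

    @Override public CharSequence subSequence(int start, int end) {
        return toString().subSequence(start, end); // allocates; avoid on hot paths
    }

    @Override public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) sb.append(charAt(i));
        return sb.toString();
    }
}
```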

Arrays

Arrays are problematic because standard Java array definitions are too wedded to the runtime – you can't implement your own version of an array. I provide an equivalent interface. Here's the one for long arrays (there's one for each primitive):

public interface OHRLongArray {
    long get(long pos);
    void set(long pos, long val);
    long length();
}

You have the option to realise the array inside the actual object (an inline array), whereby space is allocated to fit the array inside the enclosing object (like an inline string). In the case of an external array, only a pointer is maintained to the externally allocated array – the two objects then have independent life-cycles. An inline array is born and dies along with its parent, as its allocated memory is encompassed by that of the owning object. External arrays are simply alternate OHR objects and are referenced through pointers (actually handles) the same way as any other OHR reference.

Inline arrays can only have a getter – you cannot replace an inline array as it belongs to the parent. You could replace all its elements with those of an external array but that’s not the same thing.

Here’s an example of defining an inline array (note – getter only)

@Reify
@InlineArrayReify(length=10)
public abstract OHRLongArray getLongarr(); //note - no setter

Here's an example of defining an external array:

@Reify
public abstract void setLa(OHRLongArray a);

@Reify
public abstract OHRLongArray getLa();

Since only a pointer is stored in the enclosing object, reassignment is possible, and so both getters and setters are permissible.

An obvious question is how one references objects between the Java heap and off-heap objects. There are a few points to make:

In the case of what I call back referencing – that is, referencing from off heap to the Java heap – YOU DON'T!!! The OHR compiler will throw an error if you attempt to assign a standard Java object reference to a reified setter. Only primitives and other OHR objects may be assigned to OHR properties. I originally had the concept of a back reference where normal Java objects would be tracked via an IdentityHashMap from the off heap back to the Java heap. This proved too inefficient. It also muddies the waters: if you inadvertently or naively maintain a one-to-one mapping between standard Java objects and OHR objects, you get the absolute worst of both worlds – lots of garbage-collected objects referencing off-heap objects that must be manually managed. I pulled out the baked-in implementation so that developers would be more conscious of what they are doing. If you must back-reference, do it indirectly. Place the Java object into a map keyed by a primitive key, then store that primitive in an OHR property:

Map<Long, Object> cache=new HashMap<Long,Object>();
long key=counter++;
cache.put(key,myJavaObject);
myOHRObject.setPojoPointer(key);

Because of collector compaction there is no way to obtain a constant pointer to a Java object (unless you pin it through JNI), which is why things like IdentityHashMap are relatively inefficient.

In the case of forward referencing – you've already done it, by instantiating a class through the Reifier:

TC2 t;
t = (TC2) Reifier.reify(TC2.class);

The reference t above, a concrete implementation of our abstract base class, is a standard POJO acting as a lightweight proxy that allows us to access the off-heap OHR object. It contains very few fields – the primary one being essentially a long pointer holding the base address of the OS process heap memory (not Java heap) reserved for our OHR object. The reified property accessors are HARD-coded methods that read and write at fixed offsets from this base pointer (predetermined during the compilation phase). The other property I'll discuss later.

As stated, OHR objects are allowed to reference primitives, inline strings, inline primitive arrays (represented by my OHRXArray equivalent interfaces), as well as other OHR references (external arrays and other collections are just more OHR object references). Taken together, this allows us to have – from the perspective of the JVM – ONE lightweight proxy object (20 bytes of Java heap space and one object reference) referencing ONE OHR object, which itself spiders out into a network of OHR objects and collections, of unlimited size, which PLAY NO PART in Java garbage collection, yet are fully referenceable from that one JVM proxy via accessor chaining:

A.getB().getC().getD()

This is the secret of OHR. The price you pay is manual lifecycle management of the OHR objects.

You do not, however, need to keep all lightweight proxies in Java scope. A proxy may be converted to a handle – a primitive long mangled with extra information – which can be used to 're-attach' back to a live lightweight proxy.

TC2 t = (TC2) Reifier.reify(TC2.class); //OHR object held via proxy
long handle = Reifier.getHandle(t); //convert proxy to handle
TC2 t2 = (TC2) Reifier.reattach(handle); //re-attach to a live proxy later

The OHR framework, in fact, uses exactly this mechanism to store OHR references between OHR objects.

Note – lightweight proxies are not unique to each OHR object (at least in the current implementation). That is to say, if t1 and t2 are lightweight proxies pointing to the SAME OHR object, t1 == t2 does not have to be true – though it may be. Non-unique proxies make reification easier at the expense of locking mechanisms – and it may be that this is not the right way to go in the long run.

Note that you will come to grief if you try to reattach a handle after the object has been freed. You may get a segmentation fault if the memory segment has not been reassigned to another OHR object. If it HAS been reassigned, the handle has enough info to determine the memory now belongs to a DIFFERENT OHR object, and a runtime exception will be thrown. In any event you MUST manually track and control the lifecycle of your OHR objects.

To get the best performance in a large OHR-based app, you need to limit the number of long-lived Java objects in the system and maximise the number of OHR objects. You could theoretically write an app with ONE long-lived Java reference – an OHR proxy referencing an OHR object that spiders out over the OHR memory space – with all other proxy references discarded within the method call in which they are accessed (to be cleaned by an incremental sweep).

Profiler analysis has shown that common collectors can still be swamped by fast creation of OHR objects, as the lightweight proxies still pile up in the JVM. I provide an option to recycle the proxies for re-use after they are no longer needed.

TC2 t = (TC2) Reifier.reify(TC2.class); //OHR object held via proxy
//... use the reference
//...
//no longer needed - recycle to avoid further POJO creation and cleanup
Reifier.recycleLWP(t);
...
TC2 t2 = (TC2) Reifier.reify(TC2.class); //possibly the same POJO proxy re-used rather than a new one

This is optional, and if there are no recycled proxies cached for that class a new one will be created. Ideally this should not be needed, but profiler stats shown later reveal a not insignificant performance improvement when you recycle proxy objects. Your app could theoretically become garbage-generation free in the steady state.
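Proxy recycling can be sketched as a simple per-class free list. This is a hypothetical structure of my own – the real API is the recycleLWP/reify pair shown above:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Per-class pool of recycled proxy objects: acquire() reuses a recycled
// instance when one is available, otherwise creates a new one.
class ProxyPool<T> {
    private final Deque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;

    ProxyPool(Supplier<T> factory) { this.factory = factory; }

    T acquire() {
        T t = free.poll();
        return t != null ? t : factory.get();
    }

    // Caller promises not to touch the proxy after recycling it -
    // the same object may be handed out for a different OHR object.
    void recycle(T proxy) { free.push(proxy); }
}
```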

Helper annotations

Now that you must maintain your own objects' life cycles, I've provided some annotations to help.

Imagine OHR object A references OHR object B. A conceptually owns B, and when A dies B must also die.

In such a case use the @Owned annotation:

public abstract class A {
    @Reify
    @Owned
    public abstract void setB(OtherOHR b);

    @Reify
    public abstract OtherOHR getB();
}

In the above case, freeing the parent automatically frees the owned properties:

A a = (A) Reifier.reify(A.class);
OtherOHR b = (OtherOHR) Reifier.reify(OtherOHR.class);
a.setB(b);
//free a...
Reifier.free(a);
b.doSomething(); //NO!!!!! - b is owned by a and was freed along with it - b is now a dangling pointer!!!!

The Handle

I mentioned the handle is a 'mangled' pointer to the off-heap allocated memory of the OHR object. Let me explain. While 32 bits of addressing (around 4 billion values) has finally become inadequate with the creep of technology, taking on another 32 bits to total 64 bits of addressing (4 billion times 4 billion – about 1.8 × 10^19) is ridiculous, and I envisage it will remain so for the remainder of my life (famous last words). That being the case, why don't we use the top bits to add extra information to prevent common error scenarios?

Here's a possible scenario we'd like to avoid. You instantiate a reified object. You later free it. The allocated memory is freed – which normally means the malloc algorithm keeps the segment in a list of free nodes. After the app's been running a while there'll be enough fragmentation to ensure the memory directly above and below this free chunk is allocated, so malloc won't be able to combine regions into bigger chunks. As I said before, just like a C struct, OHR objects have a fixed allocation size, so the next time you instantiate the same OHR type you have a better-than-random chance that the exact same memory chunk will be allocated for that exact same type. Malloc allocation algorithms vary (and C++ allows you to provide your own implementation), but if you ask for 78 bytes and there's a slot with exactly that available, you'll get it in preference to malloc cleaving the memory from a larger chunk.

Here’s the danger. Imagine you instantiate an OHR object, take its handle, and hold it. Due to some error you inadvertently free it. You now have a dangling pointer. Before you use it however, some other part of the code instantiates the same type. Malloc efficiently gives you back the same memory. You now have a valid object of the same type in the same memory position. If you re-attach the old handle – everything seems kosher. But you are now connected to a completely DIFFERENT instance of the class. This could lead to some very difficult to find bugs.

Which is why I introduced the instance guard count. Provided you do not need the full 64 bit address space (that is, around 17 billion gigabytes of memory – currently a safe bet), we put a guard number in the top bits, encoded into both the object space and the handle. On top of that we encode a class identifier in the object space as well.

Every time we reattach a handle (and this is done whenever we access OHR objects through other OHR accessors, since handles are stored as OHR object references), we check that the guard bits in the handle match the guard number stored in the object space:

1010110100000000000000000000000011111101111111011111110111111101
|-guard-|-------------------------- pointer ---------------------|

The chance of a match between different instances obviously depends on how many guard bits are allocated, but using 8 bits gives a 1 in 256 chance that an error slips through – good enough, I think. Most of the time the error will be caught, so even a modicum of testing will highlight issues. The guard bit count is a configurable setting, defaulting to 8 bits. Changing the guard bit allocation can only be done by setting a system property, and cannot be changed after it is initialised.
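The mangling scheme can be sketched as follows, assuming the default of 8 guard bits in the top byte (the class and method names here are my own, not the framework's):

```java
// Hypothetical sketch of handle mangling with 8 guard bits in the top byte.
// The bit counts follow the defaults described above; names are assumptions.
final class Handles {
    static final int GUARD_BITS = 8;
    static final int SHIFT = 64 - GUARD_BITS;        // 56 bits left for the address
    static final long ADDR_MASK = (1L << SHIFT) - 1; // low 56 bits

    // Combine a raw malloc'd address with the instance guard count.
    static long mangle(long address, int guard) {
        return ((long) (guard & 0xFF) << SHIFT) | (address & ADDR_MASK);
    }

    static long address(long handle) { return handle & ADDR_MASK; }

    static int guard(long handle) { return (int) (handle >>> SHIFT) & 0xFF; }
}
```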

As I also mentioned, each OHR class type is identified by a number when configured, and this number is also stored in the object space. So when you reattach a handle, OHR goes through the following steps:

* extract the guard bits and the base pointer from the handle
* read the class identifier and the stored guard bits at known offsets from the base pointer
* check that the guard bits match – if not, throw StaleHandleException (a runtime exception)
* check the class id is registered as a valid OHR instance type – if not, throw StaleHandleException
* if all’s OK, create or fetch a cached lightweight proxy instance for the type, reset its base pointer and return the proxy object
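The reattach checks can be sketched like this; the object space’s 8-byte preamble is simulated here with plain int arrays rather than raw off-heap memory, and all names are my own:

```java
import java.util.Set;

// A sketch of the reattach validation steps; the object space's 8-byte preamble
// is simulated with int arrays instead of raw off-heap memory (names assumed).
final class Reattach {
    static final int SHIFT = 56;                     // default: 8 guard bits
    static final long ADDR_MASK = (1L << SHIFT) - 1;

    // preamble[base] = {classId, guardCount}, indexed by the base "address"
    static long validate(long handle, int[][] preamble, Set<Integer> validClassIds) {
        int base = (int) (handle & ADDR_MASK);
        int handleGuard = (int) (handle >>> SHIFT) & 0xFF;
        if (preamble[base][1] != handleGuard)           // guard bits must match
            throw new IllegalStateException("StaleHandleException: guard mismatch");
        if (!validClassIds.contains(preamble[base][0])) // class id must be registered
            throw new IllegalStateException("StaleHandleException: unknown class id");
        return base; // a real implementation would bind and return a lightweight proxy
    }
}
```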

This provides some protections:

* If the object was freed and re-allocated to the same object type after the handle was taken, the guard bits will in all likelihood not match
* If the object was freed and re-allocated to another type, you may be peering into the body of a different object; the guard bits will likely not match and the class identifier will not be valid

The third error – reattaching into freed memory – will sadly crash your program with a segmentation fault. C’est la vie – the price you pay for playing in an unmanaged world.

Having played with the OHR framework a fair amount, I am actually pleasantly surprised how few segmentation faults you get. I was expecting more.

Instance Memory Layout

OHR objects are allocated in a common fashion: starting from the base pointer, an 8 byte preamble, followed by the property data, followed by any padding required for alignment:

bytes 0-3: class id
bytes 4-7: guard count
bytes 8-X: instance data as required
bytes X-Y: padding to an 8 byte word boundary as required

Instance data is normally packed in order of descending size – longs and doubles first, then on down – as recommended to achieve proper alignment for the architecture. Booleans are stored as a byte (better than Java, which normally stores a boolean as a full 32 bit int). Byte arrays, however, are packed for efficiency.

For x64, longs must be aligned to an 8 byte boundary, ints to a 4 byte boundary, and so on. Obviously we do lose some memory through padding, as objects themselves must be allocated on an 8 byte boundary. The object preamble – the data before the instance data starts – is itself 8 bytes through necessity, though we could have packed it into less space.
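Under the packing and padding rules above, the size of an instance can be sketched as follows (a hypothetical calculation, not the framework's actual code):

```java
import java.util.Arrays;

// A sketch of the instance-size calculation implied by the layout rules above:
// 8-byte preamble, fields packed in descending size, each field aligned to its
// own size, and the whole object padded to an 8-byte boundary.
final class Layout {
    static int instanceSize(int[] fieldSizes) {
        int[] sorted = fieldSizes.clone();
        Arrays.sort(sorted);                    // ascending...
        int offset = 8;                         // 4-byte class id + 4-byte guard count
        for (int i = sorted.length - 1; i >= 0; i--) {
            int size = sorted[i];               // ...walked backwards = descending
            offset = (offset + size - 1) / size * size; // align field to its own size
            offset += size;
        }
        return (offset + 7) / 8 * 8;            // pad whole object to 8-byte boundary
    }
}
```

So a long, an int and two booleans (8 + 4 + 1 + 1 bytes of data) would occupy 24 bytes including preamble and padding under these assumptions.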

You will find that OHR object density compares favourably to standard Java object density. In the sample Adam and Eve app we pack almost double the objects into the same OS allocated memory.

Performance

So – what’s the performance penalty in accessing the properties of a reified object? Property accesses are currently several times slower than POJO property accesses, though your app’s overall slowdown will be smaller than that, since you are doing computation other than property access. You may not notice an appreciable difference.

This also has to be balanced against the fact that you can handle much bigger memory footprints with OHR.
The benefits – no matter how many OHR objects you create, or how big your collections are, they play no part in, and do not affect, garbage collection of the Java heap (assuming you let the lightweight proxies go out of scope and/or recycle them).

The downsides – YOU are responsible for freeing your objects explicitly.

There is no free lunch. At least now, you have the choice between ease of use and performance.

In my experience playing around with OHR, there is a clear delineation in the objects you would consider for off heap reification: they are long lived, and they undergo well known state transitions. Taking responsibility for their full lifecycle management is not a particularly difficult additional burden.

Conversely – the throwaway stuff you create temporarily inside a method is the low hanging fruit for the garbage collector. Provided a reference doesn’t stray too far, escape analysis may flag it as dead and ready for reclamation before your method even returns. There is still a cost of cleanup and compaction, though nowhere near the hit you take once a reference gets into the old generation. These short lived objects shouldn’t really be considered for OHR.

When you take out the garbage collection, you see just what a supreme optimising JIT HotSpot is. With its ability to gain runtime feedback for further optimisation, HotSpot can optimise the most critical parts of your code to a level just not feasible for a static compiler (which doesn’t know enough about the execution behaviour of your application, and for which mega-optimising EVERYTHING would take too long and make the executable too big).

The optimised code generated by OHR via javassist can, in some cases, be JITed down to just a handful of machine instructions. Forget the function call hops – HotSpot will inline them out in short order.

Adam and Eve Stress Test

OK – let’s write a simple app that really taxes the garbage collector. I call it Adam and Eve, and you will find it in the source. It is a crude population growth modeller: we start at year zero with a fixed number of Adams and Eves – 100 non-bred people rather than just two.

Each person is represented by an individual object. Each person has the following properties:

* year of birth (int)
* a mother (person reference)
* a father (person reference)
* children (an ArrayList of persons)

The whole population is held in a root ArrayList, which contains one sub-ArrayList per year; each citizen born in that year is held in that year’s sublist. The original Adams and Eves have null parents.
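The per-person state can be sketched, for the plain-Java variant, like this (field names are my own; the OHR variant replaces this with a reified abstract class and @Reify accessors):

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of the POJO variant of the person object (field names are assumed).
class Person {
    final int yearOfBirth;
    final Person mother;   // null for the original Adams and Eves
    final Person father;
    final List<Person> children = new ArrayList<>();

    Person(int yearOfBirth, Person mother, Person father) {
        this.yearOfBirth = yearOfBirth;
        this.mother = mother;
        this.father = father;
        // Register the newborn with both parents, when they exist.
        if (mother != null) mother.children.add(this);
        if (father != null) father.children.add(this);
    }
}
```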

We then start the clock ticking. At age twenty, the people start ‘a breedin’…

The breeding rate is controlled by a variable that lets you adjust the rate of population increase. The main points of note: each person is tracked by a full object, and each person object takes up about 64 bytes of memory (including object reference pointers) – in the 64 bit world, pointer references really add to the object payload size. Also, although I’m not killing people off, past the age of 40 they stop breeding. This allows older objects (people) to be smoothly paged into swap if required, as the breeding action climbs up the beehive of the population graph.

We have a standard Java implementation and an OHR implementation.

Obviously, it’s only a matter of time before any system succumbs to the enormity of exponential population growth. Let’s hope our crude model isn’t a prescient omen for the future of our own planet. It is, however, a good stress test to compare OHR against standard Java. They’ll both break – we just want to find out how long each holds out, and the garbage collection profile of each.

The results of the test are in the table at the top of the blog. So for a 66 gig heap:

  • Max GC pause times are 22 seconds (POJO) compared to 20 milliseconds (OHR), with median and average pauses far more consistent (incrementals only) for OHR.
  • Almost double the number of OHR objects are packed into the same 66 gig space compared to POJO objects.

Comparing the standard Java and OHR implementations illustrates that using the OHR framework is by and large ‘just’ Java – you don’t code much differently. This is in contrast to, say, JNI, where there is a clear jump to a lower level technology.

I’m intentionally running the demo single threaded. The garbage collector has phases that do things in parallel, so keeping the app to a single thread gives us a better idea of just how much horsepower the garbage collector is using.

One thing you will notice in the Adam and Eve code is the call to:

Reifier.recycleLWP(o);

This is an optional feature to recycle the lightweight POJO proxy objects so they do not have to be garbage collected and recreated. In my stress tests I noticed that even rapid proxy creation can overwhelm the young GC generations, letting some references leak into the old generation. Recycling the proxies creates essentially garbage free code. It may be possible to set more aggressive collection settings to obviate this requirement – I haven’t experimented enough, and it would be nice to dispense with proxy recycling. It is, however, completely optional: if no proxies are recycled, the Reifier simply creates new ones.

It’s important to note: though we are playing with a 66 gigabyte total heap, doubling or quadrupling it would have negligible effect on the OHR implementation, but the standard POJO implementation of the test app would see pause times blow out further, in a most likely worse than linear fashion – to potentially MINUTES of lockup.

Which is why I say OHR allows you to maintain potentially sub millisecond pause times on hundreds of gigabyte heaps or more.

So where we at?

This is currently alpha quality code. With more refinement, this project or one like it could become a mainstay of a subset of high resource Java apps until (if ever, for the medium term) the capabilities of garbage collectors push through to a new level. The test cases show how to use the various features, such as inline arrays.

There are a few things still to be wrapped up

- Code cleanup. You can tell the codebase came about via a lot of experimentation.

- Luckily the API is incredibly light – the minimalist Reifier class. This is good news because we can make major changes to the underlying implementation without breaking much.

- Is the non-singleton lightweight proxy the correct way to go? It gives the best performance, but it appears to me that locking becomes trickier for collections.

- A complete set of off-heap collections classes that mirror the features of the standard Java collections. These will (must!) all be indexed to 64 bits to be essentially future proof (though a wrapper class will be provided to reference them through the standard collections interfaces where the total size remains below the 32 bit index limit, with a RuntimeException thrown on overflow). Although most of the second generation of Java collections classes provide no concurrency support, OHR collections MUST provide some level of locking, as a race condition could lead to not just indeterminate behaviour but possible segmentation faults and process termination.

- We’re going to have to think about a proper OHR object locking mechanism. This is where the non-unique, multiple-proxies-per-OHR-instance design becomes a sticking point, since we can’t lock on any proxy instance specific field. I’m not sure whether sun.misc.Unsafe provides enough to do OHR side rather than POJO side locking, other than some sort of spinlock using its compare-and-swap (test-and-set) support.

- Proper serialisation/externalisation support.

A few more ideas:

- OHR over memory mapped files: an OHR Java object space spanning MULTIPLE Java processes. Could open up some funky multi-process possibilities.
- OHR would be PERFECT as a backing store for MASSIVE Java cache implementations. Some solutions exist, like Terracotta’s BigMemory, but they actually serialise the object and store it off heap in hibernation. OHR provides a much more elegant solution, I think, if done correctly: it would allow a live object to remain off-heap and participate in, say, an MVCC transaction flow with efficient single property changes. Something to look at.
- More robust test suite

There is no doubt garbage collectors will improve, but I have a feeling the OHR framework could be useful for quite some time for those looking for more system level performance in Java applications. Along with projects like the LMAX Disruptor (which also makes use of the Unsafe class), I’m hoping off heap reification becomes a mainstay, allowing those interested in large memory footprint Java applications to take things to a higher (or is that lower?) level.

This project is released under an Apache License, so you are free to privately fork and use the code as you see fit.

cheers,
Rudi…

 

