(Note: this blog is no longer updated.) This diary is personal and does not represent the views of my employer. GitHub: takahashikzn






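  • In versions before Java 7u06, String.substring and similar methods reuse the char array that backs the original string when cutting out a substring, so even a small string can keep a reference to a large char array alive.
  • From Java 7u06 onward, this problem is avoided by simply copying only the required part of the char array.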
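
As an illustration of the first point, here is a minimal sketch (not from the original post) of the defensive idiom that was common on JVMs older than 7u06: wrapping the result of substring in new String(...) so that the small string owns its own backing array. The class name SubstringCopyDemo and the method firstToken are hypothetical. On Java 7u06 and later the extra copy is redundant, because substring already copies.

```java
// Hypothetical helper illustrating the pre-7u06 workaround.
final class SubstringCopyDemo {

    // Returns the first space-delimited token of a potentially huge line.
    // new String(...) forces a private copy of the characters; on JVMs
    // older than 7u06 this dropped the reference to the large backing array.
    static String firstToken(String line) {
        int space = line.indexOf(' ');
        String token = (space < 0) ? line : line.substring(0, space);
        return new String(token);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("key ");
        for (int i = 0; i < 1_000_000; i++) {
            sb.append('x');
        }
        String key = firstToken(sb.toString());
        System.out.println(key + " has length " + key.length());
    }
}
```

The observable result is the same on every Java version; only the amount of memory kept reachable by the returned string differs.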






Strings consume a lot of memory in any application. In particular, the char[] holding the individual UTF-16 characters accounts for most of that consumption, because each character takes up two bytes.
It is not uncommon to find 30% of the memory consumed by Strings, because not only are Strings the best format to interact with humans, but also popular HTTP APIs use lots of Strings. With Java 8 Update 20 we now have access to a new feature called String Deduplication, which requires the G1 Garbage Collector and is turned off by default.
String Deduplication takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can mess around with them.

Various strategies for String Deduplication have been considered, but the one implemented now takes the following approach:
Whenever the garbage collector visits String objects, it takes note of their char arrays. It computes a hash value for each array and stores it together with a weak reference to the array. As soon as it finds another String with the same hash code, it compares the two arrays char by char.
If they match as well, one String is modified to point to the char array of the other String. The first char array is then no longer referenced and can be garbage collected.

This whole process of course brings some overhead, but it is kept within tight limits. For example, if a string has not been found to have duplicates for a while, it will no longer be checked.





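  • When the garbage collector inspects the contents of a String object, it records the hash value of the underlying char array together with a weak reference to that array.
  • If it finds another String object with the same hash value, it compares the contents of the char arrays and, if they are identical, modifies the String objects so that both point to the same char array.
  • The char array that is no longer referenced then becomes eligible for garbage collection.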
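
The steps above can be sketched as a toy model in plain Java. This is only an illustration under simplifying assumptions: the real mechanism lives inside the JVM and operates on java.lang.String directly, and the names DedupSketch, Str, and visit are hypothetical. The sketch also ignores the age threshold and the limits on overhead mentioned earlier.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy model only; not how HotSpot is actually implemented.
final class DedupSketch {

    // Stand-in for a String: an object that points at a char array.
    static final class Str {
        char[] value;
        Str(String s) { this.value = s.toCharArray(); }
    }

    // Hash of the array contents -> weak references to arrays seen so far.
    private final Map<Integer, List<WeakReference<char[]>>> table = new HashMap<>();

    // Called when the "collector" visits a string. Returns true if the
    // string was redirected to an already known array with equal contents.
    boolean visit(Str s) {
        int hash = Arrays.hashCode(s.value);
        List<WeakReference<char[]>> bucket =
                table.computeIfAbsent(hash, k -> new ArrayList<>());
        for (Iterator<WeakReference<char[]>> it = bucket.iterator(); it.hasNext();) {
            char[] known = it.next().get();
            if (known == null) {          // the array has been collected
                it.remove();
                continue;
            }
            if (known == s.value) {       // already the shared array
                return false;
            }
            if (Arrays.equals(known, s.value)) { // char-by-char comparison
                s.value = known;          // the old array becomes garbage
                return true;
            }
        }
        bucket.add(new WeakReference<>(s.value));
        return false;
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        Str a = new Str("hello");
        Str b = new Str("hello");
        dedup.visit(a);
        dedup.visit(b);
        System.out.println("arrays shared: " + (a.value == b.value));
    }
}
```

The weak references matter: the table must not keep an array alive once no string points at it, otherwise the table itself would become a memory leak.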



-XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics
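
The sample program from the quoted article is not reproduced here. As a hypothetical stand-in for trying out these flags, the following sketch keeps many distinct String objects with identical contents alive and pauses between rounds so the collector gets some idle time. The class name DedupLoad and all sizes are assumptions. On Java 8u20 or later it could be started with something like java -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics DedupLoad; whether and when deduplication actually runs depends on heap size and GC activity, and the statistics will differ from the log shown below.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical load generator; not the code from the quoted article.
final class DedupLoad {

    // Builds n distinct String objects that all hold the same characters.
    static List<String> duplicates(String text, int n) {
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // A fresh char[] per string, so each one has its own backing array.
            out.add(new String(text.toCharArray()));
        }
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> keep = new ArrayList<>();
        for (int round = 0; round < 20; round++) {
            keep.addAll(duplicates("a moderately long duplicated value", 10_000));
            Thread.sleep(50); // leave some idle time between allocation bursts
        }
        System.out.println("retained strings: " + keep.size());
    }
}
```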


[GC concurrent-string-deduplication, 4658.2K->0.0B(4658.2K), avg 99.6%, 0.0165023 secs]
   [Last Exec: 0.0165023 secs, Idle: 0.0953764 secs, Blocked: 0/0.0000000 secs]
      [Inspected:          119538]
         [Skipped:              0(  0.0%)]
         [Hashed:          119538(100.0%)]
         [Known:                0(  0.0%)]
         [New:             119538(100.0%)   4658.2K]
      [Deduplicated:       119538(100.0%)   4658.2K(100.0%)]
         [Young:              372(  0.3%)     14.5K(  0.3%)]
         [Old:             119166( 99.7%)   4643.8K( 99.7%)]
   [Total Exec: 4/0.0802259 secs, Idle: 4/0.6491928 secs, Blocked: 0/0.0000000 secs]
      [Inspected:          557503]
         [Skipped:              0(  0.0%)]
         [Hashed:          556191( 99.8%)]
         [Known:              903(  0.2%)]
         [New:             556600( 99.8%)     21.2M]
      [Deduplicated:       554727( 99.7%)     21.1M( 99.6%)]
         [Young:             1101(  0.2%)     43.0K(  0.2%)]
         [Old:             553626( 99.8%)     21.1M( 99.8%)]
      [Memory Usage: 81.1K]
      [Size: 2048, Min: 1024, Max: 16777216]
      [Entries: 2776, Load: 135.5%, Cached: 0, Added: 2776, Removed: 0]
      [Resize Count: 1, Shrink Threshold: 1365(66.7%), Grow Threshold: 4096(200.0%)]
      [Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0]
      [Age Threshold: 3]
      [Dropped: 0]


For our convenience we do not need to add up all data ourselves but can use the handy totals calculation.
The above snippet shows the fourth execution of String Deduplication. It took 16 ms and looked at about 120k Strings.
All of them are new, meaning they had not been looked at before. These numbers look different in real apps, where strings are passed through multiple times, so some might be skipped or already have a hash code (as you might know, the hash code of a String is computed lazily).
In the above case all strings could be deduplicated, removing about 4.5 MB of data from memory.
The Table section gives statistics about the internal tracking table, and the Queue section lists how many requests for deduplication have been dropped due to load, which is one part of the overhead reduction mechanism.


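All of the strings have just been created and have not yet been inspected. (Translator's note: this refers to Known being 0.) In a real application, by contrast, strings are inspected many times, so some are excluded from inspection or already have a computed hash code (as you probably know, a String's hash code is computed lazily), and the numbers will not look like the ones above.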
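
To check the "about 4.5 MB" figure against the log, the summary line can be parsed with a few lines of Java. This is a sketch under assumptions: the class name DedupLogLine is hypothetical, the pattern is fitted only to the single summary line shown above, the size in parentheses is read as the amount of deduplicated data, and K, M, and G are taken to be powers of 1024.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the one-line summary shown in the log above.
final class DedupLogLine {

    private static final Pattern SUMMARY = Pattern.compile(
            "\\[GC concurrent-string-deduplication, "
            + "([0-9.]+)([BKMG])->([0-9.]+)([BKMG])\\(([0-9.]+)([BKMG])\\), "
            + "avg ([0-9.]+)%, ([0-9.]+) secs\\]");

    // Converts a number plus unit suffix to bytes (assumes 1K = 1024 bytes).
    static double toBytes(String number, String unit) {
        double v = Double.parseDouble(number);
        switch (unit) {
            case "K": return v * 1024;
            case "M": return v * 1024 * 1024;
            case "G": return v * 1024 * 1024 * 1024;
            default:  return v; // "B"
        }
    }

    // Returns the size given in parentheses, in MiB, or -1 if the line
    // does not match the expected format.
    static double reclaimedMiB(String line) {
        Matcher m = SUMMARY.matcher(line.trim());
        if (!m.matches()) {
            return -1;
        }
        return toBytes(m.group(5), m.group(6)) / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        String line = "[GC concurrent-string-deduplication, "
                + "4658.2K->0.0B(4658.2K), avg 99.6%, 0.0165023 secs]";
        System.out.printf("%.2f MiB%n", reclaimedMiB(line));
    }
}
```

For the line above, 4658.2K divided by 1024 gives roughly 4.55 MiB, which matches the figure quoted in the text.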


So how does this compare to String Interning? I blogged about how great String Interning is for memory efficiency. In fact, String Deduplication is almost like interning, except that interning reuses the whole String instance, not just the char array.
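
The difference is visible from application code only on the interning side. The following minimal sketch (the class name InternVsDedup is hypothetical) shows that intern() yields a single shared String instance, whereas two equal strings created separately remain distinct objects. Deduplication would leave them distinct as well and only merge their internal arrays, which cannot be observed through the public API.

```java
// Hypothetical demo class; uses only standard java.lang.String behavior.
final class InternVsDedup {

    // True if both strings map to the same canonical instance after interning.
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("Germany");
        String b = new String("Germany");

        System.out.println(a == b);                 // false: two separate objects
        System.out.println(a.equals(b));            // true: same characters
        System.out.println(sameAfterIntern(a, b));  // true: one pooled instance
    }
}
```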


The argument the creators of JDK Enhancement Proposal 192 make is that developers often do not know where the right place to intern strings would be, or that this place is hidden behind frameworks. As I wrote, you need some knowledge of where you typically encounter duplicates (like country names).
String Deduplication also benefits duplicate Strings across applications inside the same JVM, and thus also covers things like XML schemas, URLs, jar names, etc., which one would normally not expect to appear multiple times.

It also adds little overhead to the application threads, as it is performed asynchronously and concurrently by the garbage collector, while String Interning happens in the application thread. This also explains why we find that Thread.sleep() in the sample code. Without the sleep there would be too much pressure on the GC, so String Deduplication would not run at all. But this is a problem only for such sample code. Real applications usually find a few ms of spare time to run String Deduplication.

The authors of JDK Enhancement Proposal 192 point out that many developers do not know where string interning should be applied, or that the right place is hidden inside frameworks.