Tuesday, May 28, 2013

Java Concurrency : 10 things every programmer should know

1.  Synchronized

Everyone knows that the synchronized keyword is used to acquire a lock and serialize access to a block of code. A lesser known function of this keyword is to synchronize a thread's local memory with main memory. When a thread enters a synchronized block, it refreshes its local cache with data from main memory, so you can be sure you are reading any data written by other threads before they released the same lock. When a thread leaves a synchronized block, it flushes its writes to main memory, and that data is guaranteed to be seen by any other thread that subsequently synchronizes on the same lock and reads it.
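A minimal sketch of this visibility guarantee (the class and method names are illustrative): both the writer and the reader synchronize on the same lock, so the reader is guaranteed to see the latest value written before the writer released the lock.

public class SharedFlag {
    private boolean ready = false;

    public synchronized void markReady() {  // writes are flushed to main memory when the block exits
        ready = true;
    }

    public synchronized boolean isReady() { // the local cache is refreshed when the block is entered
        return ready;
    }
}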

2. Executors

Prior to JDK 5 and java.util.concurrent, the way to create threads was to extend java.lang.Thread and override the run method, or to implement Runnable and pass it to a Thread constructor. However, most applications need more than a single thread, and you had to write your own thread pool. Since JDK 5, the preferred way to create and use threads is the java.util.concurrent.Executors class, which creates thread pools.

ExecutorService tPool = Executors.newCachedThreadPool();
tPool.submit(new Runnable() {
    public void run() {
        // do work
    }
});

Executors can create different kinds of thread pools. ExecutorService is an interface that accepts Runnable or Callable tasks that need to be executed.
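For example, a few of the factory methods on Executors (a sketch of common choices, not an exhaustive list):

ExecutorService fixed = Executors.newFixedThreadPool(4);              // a pool with a fixed number of threads
ExecutorService single = Executors.newSingleThreadExecutor();         // one worker thread, tasks run in submission order
ScheduledExecutorService timer = Executors.newScheduledThreadPool(2); // for delayed or periodic tasks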

3. Callable and Future

Callable, like Runnable, is an interface that represents a task to be executed. The difference is that the call method of the Callable interface can return a value.

Future is an interface with methods to check the status of an executing task and to get the result when it is available.

Callable<List> work = new Callable<List>() {
    public List call() {
        List result = new ArrayList();
        // do some work and populate result
        return result;
    }
};

Future<List> future = executor.submit(work);

List result = future.get();

The get() method waits for the execution to complete and then returns the result.

Callable and Future make it convenient to code the interaction between tasks that produce results and tasks that wait for those results. Future also has methods to check whether a task has completed or been canceled, and you can cancel a task with the cancel method.

4. Thread Termination

Terminating a thread or thread pool gracefully is the responsibility of the application. A best practice is to provide a stop method that tells the thread to let already submitted work complete and then exit.

If you have created the thread directly, then your stop method needs to set a flag. The run method should check this flag and exit when it is set. Since a race condition is possible, care should be taken to synchronize setting and reading the flag. Once the flag is set, any new work should be rejected and the thread should exit after already submitted work is completed.
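A minimal sketch of this pattern (the class name Worker is illustrative):

public class Worker implements Runnable {
    private boolean stopRequested = false;

    public synchronized void stop() {            // synchronized so the write is visible to run()
        stopRequested = true;
    }

    private synchronized boolean stopRequested() {
        return stopRequested;
    }

    public void run() {
        while (!stopRequested()) {
            // complete already submitted work, then re-check the flag
        }
    }
}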

The ExecutorService discussed above has a shutdown method, which shuts down the thread pool after already submitted tasks complete. No new tasks are accepted once this method is called.

public void stop() throws InterruptedException {
    executor.shutdown();
    executor.awaitTermination(timeout, TimeUnit.SECONDS);
}

5. Thread Interruption

Interruption is cooperative. Calling the interrupt method on a thread merely sets the thread's interrupted status flag to true; it does not automatically stop the thread. Implementations of well behaved blocking methods or long running methods should check this flag and exit early. For blocking methods, exiting early typically means clearing the interrupted status and throwing an InterruptedException.

If your code calls a method that throws InterruptedException, it should either propagate the exception up the stack (so someone more appropriate can handle it) or catch the exception and restore the interrupted status by calling the interrupt method.

The isInterrupted method returns the current interrupted status without changing it. The static Thread.interrupted() method returns the status and also clears it. These method names are a little confusing.
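A minimal sketch of the second option above, assuming some blocking call such as BlockingQueue.take() (the method name runOneTask is illustrative):

public void runOneTask(BlockingQueue<Runnable> queue) {
    try {
        Runnable task = queue.take();           // blocking call that may throw InterruptedException
        task.run();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();     // restore the status so code up the stack can see it
    }
}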

6. ConcurrentHashMap

In concurrent programs, it is better to use ConcurrentHashMap than a HashMap wrapped with Collections.synchronizedMap. See the blog ConcurrentHashMap.
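One small example of why it is convenient: ConcurrentHashMap offers atomic compound operations such as putIfAbsent, so no external locking is needed.

ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<String, Integer>();
counts.putIfAbsent("page1", 0);   // atomic check-then-insert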

7. Explicit locks

Explicit locks have several advantages over the synchronized keyword. For details, read the blog When to use Explicit Locks?
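One example of such an advantage is tryLock, which lets a thread back off instead of blocking. A sketch using java.util.concurrent.locks.ReentrantLock:

private final ReentrantLock lock = new ReentrantLock();

public void update() {
    if (lock.tryLock()) {      // returns immediately instead of blocking the way synchronized does
        try {
            // modify shared state
        } finally {
            lock.unlock();     // always release in a finally block
        }
    }
}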

8. Compare and Swap

Locking in concurrent programs, whether with synchronized or with explicit locks, is expensive. A thread that is blocked waiting for a lock may be suspended by the operating system, and when it finally acquires the lock it has to be rescheduled for execution and wait for its time slice.

Modern CPUs support compound operations such as compare-and-swap, fetch-and-increment and test-and-set without locking. When multiple threads try to operate on the same data, only one thread succeeds, but the others do not block. This substantially increases the scalability of concurrent programs.

Since JDK 5, Java has taken advantage of these atomic compound operations supported by CPUs in the form of atomic variables and data structures like ConcurrentHashMap. The atomic classes discussed in item 9 have compound operations like compareAndSet that build on this support.

9. Atomic variables

Operations like incrementing a variable or check-then-update are compound operations: you first read, then increment or check, and lastly write. For this to be thread safe, locking would normally be required. As mentioned above, locking is expensive and does not scale. The java.util.concurrent.atomic package has a set of classes that let you perform thread safe, lock free operations on single variables using the techniques described in item 8.
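For example, with java.util.concurrent.atomic.AtomicInteger:

AtomicInteger counter = new AtomicInteger(0);
counter.incrementAndGet();                     // atomic read-modify-write, no lock taken
boolean swapped = counter.compareAndSet(1, 5); // set to 5 only if the current value is 1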

A get on an atomic variable reads the latest value written to memory, and a set on an atomic variable is immediately visible to other threads that read it. This is the same behavior as a volatile variable, as defined by the Java memory model described in item 10.

10. Java Memory model

The Java memory model defines the rules for how and when writes to variables become visible to other threads when those variables are written and read by multiple threads. It is a topic that is not well understood, and many programmers are not aware of it. Read about it in the blog Java Memory model.

Thursday, April 25, 2013

10 reasons for considering the Scala programming language

The Scala programming language has been around for a few years now and its popularity is increasing. Having programmed in Java for many years, I was initially skeptical about whether we needed another programming language on the JVM. But after trying out Scala and reading about the language, I have had a change of heart. Whether your background is Java, C/C++, Ruby, Python, C# or any other language, Scala has some very useful features that will force you to consider it if you are looking for a programming language. This blog just lists the useful features. Programming examples will follow in subsequent blogs.

1. Object oriented programming language (OOP)

Scala is an object oriented programming language. The benefits of OOP are well documented, and a majority of programs today are written in some OO language. If you come from a Java, C++ or C# background, then you already know the benefits. If you are currently using a language that is not OO, this might be one of the reasons to consider Scala. In Scala everything is an object, unlike Java, where primitives are not objects and the use of static methods lets you bypass the OO paradigm. OO programming enables you to write programs whose structure models the problem domain the program is written for, which helps produce programs that are easier to read and maintain.

2. Functional programming

In contrast to OO programming, functional programming encourages the use of functions that do some work without changing state or mutating the data they operate on. Data is immutable: functions take data as input and may produce new data as output. Additionally, a function is a type just like an Integer, a String or any class. The advantage of functional programming is that there are no side effects; a function takes input and produces output, that is all. This makes it easier to write error free programs that can scale or be executed in parallel. Scala has very good support for functional programming.

3. Static Types

In statically typed languages like C++, Java and Scala, every variable has a type, and the type determines what the program can do with the variable. If you try to multiply 2 Strings, the compiler will flag that as an error. Statically typed languages protect the programmer by detecting errors early and keep him from shooting himself in the foot. If you think static typing is annoying and leads to verbose code, you will be pleased to know that unlike Java, Scala supports type inference (the ability to infer types), which reduces verbosity.

4. Brevity

Scala has features that enable the programmer to write compact code as opposed to verbose code. Less code means fewer bugs and less time spent on maintenance.

//Java
public class Person {
  private String fname ;
  private String lname ;

  public Person(String first, String last) {
      fname = first ;
      lname = last ;
  }

}

In Scala the same class is written as

class Person(fname: String,lname: String) 

Scala supports type inference, which helps avoid verbose code.

// Java: String appears in the statement twice
public String[] stringarray = new String[5];

// Scala: the type is inferred as Array of String
val stringarray = new Array[String](5)

5. JVM language

Scala is compiled to bytecode that runs on the Java virtual machine. Since the JVM is available on every platform, your Scala code will run on Windows, Linux, Mac OS and any other platform for which a JVM is available.

Another advantage is the integration with Java. Java has a very rich class library, and there are several open source projects that provide additional libraries with very useful functions. Java code can be called from Scala very easily, which means all those feature rich libraries are available for your use in Scala.

val calendar = new java.util.GregorianCalendar()
print(java.lang.String.format("%1$ty%1$tm%1$td", calendar))

will print today's date in the format YYMMDD.

6.  Better support for concurrency

To write concurrent programs in Java, you have to deal with threads, the Java memory model, locking, synchronization, deadlocks and so on. Writing error free concurrent programs is difficult. Scala has an actor based programming model that shields the programmer from the issues you face in Java or C/C++. To write concurrent programs, you implement actors that send, receive and handle messages. The actor model lets the programmer avoid sharing data between threads and the issues related to locking shared data.

7. Scalable programs

By avoiding locking in concurrent programs, Scala is able to exploit parallelism in a way that Java cannot. In Java, a recommended best practice for writing scalable code is to use immutable objects. With the actor model in Scala, you use immutable objects as messages and have unsynchronized methods. Immutable objects are also at the heart of functional programming (2), which Scala promotes.

How many times have we heard of a Ruby or Python application that had to be rewritten in Java or C++ because it could not scale to the increased demands of users? With Scala, this will not be an issue.

8. Fast

Studies have shown that Scala is at least as fast as Java.
see http://research.google.com/pubs/pub37122.html

9. General purpose/multi-purpose

The brevity and compactness of Scala mean that it can be used for scripting or rapid application development a la Ruby or Python. But the fact that it runs on the JVM, together with its scalability features, means it can also be used for complex applications.

10. It is getting more popular

This is a non technical reason. Scala is getting more popular. More startups are moving to Scala, and many are skipping Java and going directly to Scala. If you are a Java programmer, learning Scala makes you more marketable. Even if you are not a Java programmer, learning Scala will open up a number of opportunities in the programming world.

Thursday, April 4, 2013

Using HBase Part 2: Architecture


In this blog, let us take a quick look at some architectural details of HBase.

For an introduction to NoSql and HBase, read the following blogs.
What is NoSql ?
Using HBase

Internally, HBase is a sparse, distributed, persistent, multidimensional sorted map. While that sentence seems complicated, reading each word individually gives clarity.
sparse - some cells can be empty
distributed - data is partitioned across many hosts
persistent - stored to disk
multidimensional - more than 1 dimension (key,value,version)
Map - key and value
sorted - maps are generally not sorted but this one is

HBase uses HDFS to store the data.

An HBase table has rows and columns. Columns are grouped into column families. There is a version for each value, so table, row key, column family, column name and version are used to get to a value. Both row keys and values are byte arrays.

A table is sorted by row key. Within a column family, the columns are sorted. Storage is per column family, so logically related columns should be in the same column family.

A Table is made of regions. A region has a subset of the rows in a table. A region can be described using tablename, start key, end key. A region is made up of one or more HDFS files.

The regions are managed by servers known as the region servers. There is a master server that assigns regions to region servers.

HBase has 2 catalog tables, -ROOT- and .META. The .META. table has information on all regions in the system, and -ROOT- has information on .META. When a client wants to access data, these 2 tables are consulted to determine which region server hosts the region needed for the request. The client then issues read/write requests to that region server directly.

HBase uses Zookeeper to maintain cluster state. A simple diagram below shows the components of an HBase cluster.

[Diagram: the components of an HBase cluster - the HBase master, region servers, Zookeeper and HDFS]

Logical view of a table:
The table in figure 2 has 2 column families: cf1 with columns ColA and ColB, and cf2 with columns ColC and ColD. The value in each cell is uniquely identified by row key, column family, column name and a timestamp or version.

Logical view of a RegionServer:

[Diagram: regions of a table distributed across region servers]

The rows of a table are stored in regions. A region is the unit of allocation and is identified by a start key and an end key. The regions are distributed across the region servers in the cluster.

Physical view of Region Server:

Each region has one or more stores; each store holds one column family. The MemStore is where changes are held in memory before being written to disk. The store file is the persistent store and is a file written to HDFS. The HFile format is described in the blog HFile.

Each RegionServer has a write ahead log (WAL). Writes are first written to the WAL. If the region server crashes before memory is flushed to disk, the WAL is used to recover. This implies data is held in memory and flushed to disk periodically. Changes are kept sorted while in memory.

Reads look for data in the MemStore first and then go to disk if necessary. Data is flushed to disk in 64 MB chunks; this size is configurable. HFiles are periodically merged into larger files. Sorting in memory and merging files on disk make the whole process work like a merge sort.

For a delete, the row is marked as deleted (as opposed to physically removing it).

HBase provides ACID semantics at the row level. HBase does multi version concurrent updates, which means updates happen by creating a new version as opposed to overwriting the existing row. Writers need to acquire a lock to write; readers do not acquire a lock. To ensure consistent reads without locking, HBase assigns a write number to each write, and a read returns data from the highest write number that is durable. Locks are stored in memory in the region server, which is sufficient because all values for a row are in one region server. Transactions are committed in serial order.

Sharding is automatic. Regions split when files reach a certain size.

A compaction step, which runs in the background, combines files and removes deleted data.

This concludes the introduction to HBase architecture.

Friday, March 15, 2013

Using HBase

HBase is a NoSQL database from the Hadoop family. The NoSQL concept is discussed in my blog What is NoSql ? HBase is a column oriented key value store based on Google's Bigtable.

To recap, you would be considering a NoSQL database because your RDBMS is probably not able to meet your requirements for one or more of the following reasons:
  • Your application deals with billions and billions of rows of data
  • The application does a lot of writes
  • Reads require low latency
  • Linear scalability with commodity hardware is required
  • You frequently need to add or remove columns
There are several NoSQL databases that can address one or more of these issues. In this article I provide an introduction to HBase. The goal is to help you start evaluating whether HBase would be appropriate for your problem. This is introductory material; more details will follow in subsequent blogs.

Main features of HBase are:

  • Built on Hadoop and HDFS. If you are already using Hadoop, then HBase can be viewed as an extension of your Hadoop infrastructure that provides random reads and writes.
  • A simple data model based on keys, values and columns. More on this later.
  • Scales linearly by adding commodity hardware
  • Automatic partitioning of tables as they grow larger
  • Classes available for integration with MapReduce
  • Automatic failover support
  • Supports row key range scans
Data Model
 
The main constructs of the model are  Table, rows, column family and columns.

Data is written and read from a Table. A Table has rows and column families. Each row has a key.

Each column family has one or more columns. Columns in a column family are logically related. Each column has a name and a value. When a table is created, the column families have to be declared, but the columns in each family do not need to be defined and can be added on demand. Each column is referred to using the syntax columnFamily:column. For example, an age column in a userprofile column family is referred to as userprofile:age. For each row, storage space is taken up only by the columns actually written in that row.

Let us design an HBase table to store user web browsing information. Each user has a unique id called userid. For each user we need to store:

(1) some profile information like sex, age, geolocation and membership,
(2) for each partner website he visits, the page types and products viewed,
(3) for each partner website he visits, the products purchased and the products put in the shopping cart but not purchased.

Our structure might look like

{
  userid1: {                    // row key
    profile: {                  // column family
      sex: male,                // column, value
      age: 25,
      member: Y
    },
    browsehistory: {            // column family
      partner1.hp: 23,          // visited partner1 homepage 23 times
      partner2.product.pr1: 4   // viewed product pr1 4 times
    },
    shoppinghistory: {          // column family
      partner3.pr3: 25.5        // purchased pr3 from partner3 for $25.5
    }
  }
}

Let us design an HBase table for the above structure.

Table name: usershoppingdata. Since we will look up data by user, the row key can be the userid.

(1) Column family profile for profile information. Columns would be sex, age, member, etc.
(2) Column family browsehistory for browsing data. Columns are dynamic, such as websitename.page or websitename.productid.
(3) Column family shophistory for shopping data. Columns are again dynamic.


The beauty is that you can add columns dynamically. If visualizing this as columns is difficult, just think of it as dynamically adding key value pairs. This kind of data is required in a typical internet shopper analytics application.

HBase is an appropriate choice because you have several hundred million internet shoppers, which means several hundred million rows. If you wanted to store data by date, you might make the key userid+date, in which case you might have even more rows, in the order of billions. Data is written as the user visits various internet shopping websites. Later, the data might need to be read with low latency to be able to show the user a promotion or advertisement based on his past history. A company I worked for in the past used a very popular RDBMS for such high volume writes, and whenever the RDBMS was flooded with write requests, it would grind to a halt.

Let us use HBase shell to create the above table, insert some data into it and query it. 

Step 1: Download and install HBase from http://hbase.apache.org

Step 2: Start hbase
$ ./start-hbase.sh
starting master, logging to /Users/jk/hbase-0.94.5/bin/../logs/hbase-jk-master-jk.local.out
 

Step 3: Start hbase shell
$ ./hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.5, r1443843, Fri Feb  8 05:51:25 UTC 2013
hbase(main):001:0>

 

Step4: Create the table
hbase(main):004:0> create 'usershoppingdata','profile','browsehistory','shophistory'
0 row(s) in 3.9940 seconds


Step5: Insert some data
hbase(main):003:0> put 'usershoppingdata', 'userid1','profile:sex','male'
0 row(s) in 0.1990 seconds

hbase(main):004:0> put 'usershoppingdata', 'userid1','profile:age','25'
0 row(s) in 0.0090 seconds

hbase(main):005:0> put 'usershoppingdata', 'userid1','browsehistory:amazon.hp','11'
0 row(s) in 0.0100 seconds

hbase(main):006:0> put 'usershoppingdata', 'userid1','browsehistory:amazon.isbn123456','3'
0 row(s) in 0.0070 seconds

hbase(main):007:0> put 'usershoppingdata', 'userid1','shophistory:amazon.isbn123456','19.99'
0 row(s) in 0.0140 seconds

 

Step 6: Read the data
hbase(main):008:0> scan 'usershoppingdata'
ROW                        COLUMN+CELL                                                                
 userid1                   column=browsehistory:amazon.hp, timestamp=1362784343421, value=11          
 userid1                   column=browsehistory:amazon.isbn123456, timestamp=1362786676092, value=3   
 userid1                   column=profile:age, timestamp=1362784243334, value=25                      
 userid1                   column=profile:sex, timestamp=1362784225141, value=male                    
 userid1                   column=shophistory:amazon.isbn123456, timestamp=1362786706557, value=19.99 
1 row(s) in 0.1450 seconds
 

hbase(main):010:0> get 'usershoppingdata', 'userid1'
COLUMN                     CELL                                                                       
 browsehistory:amazon.hp   timestamp=1362784343421, value=11                                          
 browsehistory:amazon.isbn timestamp=1362786676092, value=3                                           
 123456                                                                                               
 profile:age               timestamp=1362784243334, value=25                                          
 profile:sex               timestamp=1362784225141, value=male                                        
 shophistory:amazon.isbn12 timestamp=1362786706557, value=19.99                                       
 3456                                                                                                 
5 row(s) in 0.0520 seconds
 

hbase(main):011:0> get 'usershoppingdata', 'userid1', 'browsehistory:amazon.hp'
COLUMN                     CELL                                                                       
 browsehistory:amazon.hp   timestamp=1362784343421, value=11                                          
1 row(s) in 0.0360 seconds


Step 7: Add a few more rows

hbase(main):015:0> put 'usershoppingdata', 'userid2','profile:sex','male'
0 row(s) in 0.0070 seconds

hbase(main):016:0> put 'usershoppingdata', 'userid3','profile:sex','male'
0 row(s) in 0.0060 seconds

hbase(main):017:0> put 'usershoppingdata', 'userid4','profile:sex','male'
0 row(s) in 0.0330 seconds

hbase(main):018:0> put 'usershoppingdata', 'userid5','profile:sex','male'
0 row(s) in 0.0050 seconds


Step 8: Let us do some range scans on the row key
hbase(main):024:0> scan 'usershoppingdata', {STARTROW => 'u'}
ROW                        COLUMN+CELL                                                                
 userid1                   column=browsehistory:amazon.hp, timestamp=1362784343421, value=11          
 userid1                   column=browsehistory:amazon.isbn123456, timestamp=1362786676092, value=3   
 userid1                   column=profile:age, timestamp=1362784243334, value=25                      
 userid1                   column=profile:sex, timestamp=1362784225141, value=male                    
 userid1                   column=shophistory:amazon.isbn123456, timestamp=1362786706557, value=19.99 
 userid2                   column=profile:sex, timestamp=1362788377896, value=male                    
 userid3                   column=profile:sex, timestamp=1362788385501, value=male                    
 userid4                   column=profile:sex, timestamp=1362788392575, value=male                    
 userid5                   column=profile:sex, timestamp=1362788398087, value=male                    
5 row(s) in 0.0780 seconds


hbase(main):019:0> scan 'usershoppingdata', {STARTROW => 'userid3'}
ROW                        COLUMN+CELL                                                                
 userid3                   column=profile:sex, timestamp=1362788385501, value=male                    
 userid4                   column=profile:sex, timestamp=1362788392575, value=male                    
 userid5                   column=profile:sex, timestamp=1362788398087, value=male                    
3 row(s) in 0.0250 seconds

hbase(main):023:0> scan 'usershoppingdata', {STARTROW => 'userid3', STOPROW => 'userid5'}
ROW                        COLUMN+CELL                                                                
 userid3                   column=profile:sex, timestamp=1362788385501, value=male                    
 userid4                   column=profile:sex, timestamp=1362788392575, value=male                    
2 row(s) in 0.0160 seconds


The shell is very useful to play around with the data model and get familiar with HBase. There is more to HBase than this simple introduction; I will get into the internals and architecture in future blogs. In a real world application, you would write code in a language like Java, and a rough sketch of what that might look like follows.
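For example (a sketch based on the 0.94-era Java client API; the class name UserShoppingDataClient is made up and error handling is omitted), the same put and get from the shell example might look like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UserShoppingDataClient {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usershoppingdata");

        // write one cell: row userid1, column profile:age, value 25
        Put put = new Put(Bytes.toBytes("userid1"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("age"), Bytes.toBytes("25"));
        table.put(put);

        // read the cell back
        Get get = new Get(Bytes.toBytes("userid1"));
        Result result = table.get(get);
        String age = Bytes.toString(result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("age")));
        System.out.println("profile:age = " + age);

        table.close();
    }
}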

Friday, February 15, 2013

Hadoop Secondary Sort: Sorting values

Sorting is a core strength of the Hadoop MapReduce framework, but by default it sorts only the keys; the values for each key are not sorted. There are many use cases where the values for a key are required in sorted order. For example, say a web application logs user interactions: user id, time and other log data. The log data is distributed across log files from different servers. The requirement is to get each user's log records in chronological order.

The input to the map is the set of log records, say (userid, time) pairs in no particular order.

The output from the reducer needs to be sorted by user and, for each user, by time.

user1, t1
user1, t2
user1, tn
.
.
usern, t1
usern,tn

where for each user, t1 < t2 ..... < tn.

We know Hadoop sorts the keys. We could make the map output key a combination of user and time, in other words a composite key made up of userid and time. We need to write a comparator that Hadoop can use to sort on this composite key. However, a side effect is that Hadoop will send records for the same user to different reducers, which means you will not be able to process all of a user's records as a group.

Remember, the Hadoop framework sorts the output of a map by key and then partitions the output. Each partition is intended for a reducer, and all values for a key from a map are in the same partition. We want all the records for a user to go to the same reducer, which implies they need to be in the same partition. Fortunately, there is a way to influence partitioning: we tell Hadoop to put all records for a user in the same partition by implementing a partitioner that partitions based on the userid alone rather than on the composite key.

A reducer receives partitions from several mappers. Remember that the reducer is called with a key and the list of values for that key. Before the framework can call the reducer, it has to group the values for that key from all the partitions. We want the grouping to happen based on userid (not userid + time), so we need to implement a grouping comparator that Hadoop uses for grouping, and this comparator compares only userids. Since the records in each partition are already sorted by userid and time, grouping, which is a merge process, preserves the sort order, like a merge sort.

In summary you need to

Step 1: Make the map output key a composite of the natural key and the value
Make the map output key a composite of the natural key (userid) and the value (time). A composite key implements WritableComparable. You need to override the compareTo method to use both userid and time; this method is used to order the composite keys. You also need to override the write and readFields methods, which are called for serialization and deserialization.
You tell Hadoop to use this composite key by calling the method
job.setOutputKeyClass(UserTime.class);
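A minimal sketch of such a composite key (assuming the userid is a String and the time is a long; hashCode and equals are omitted for brevity):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class UserTime implements WritableComparable<UserTime> {
    private String userid;
    private long time;

    public void write(DataOutput out) throws IOException {    // serialization
        out.writeUTF(userid);
        out.writeLong(time);
    }

    public void readFields(DataInput in) throws IOException { // deserialization
        userid = in.readUTF();
        time = in.readLong();
    }

    public int compareTo(UserTime other) {                    // order by userid, then by time
        int cmp = userid.compareTo(other.userid);
        if (cmp != 0) return cmp;
        return time < other.time ? -1 : (time == other.time ? 0 : 1);
    }

    public String getUserid() { return userid; }
}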

Map output would be like

(user1,t1) , t1
(user1,t2),t2
.
(user2,t1),t1
(user2,t2),t2
 
Step 2: Partition the map output using only the natural key
To ensure all values for a user go to the same reducer, you need to implement a partitioner. This is a class that extends org.apache.hadoop.mapreduce.Partitioner. Override the getPartition method to return a partition based on the natural key, the userid. You tell Hadoop to partition using this partitioner with the call
job.setPartitionerClass(NaturalKeyPartitioner.class);
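A sketch of the partitioner (assuming the map output value is the time as a LongWritable):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class NaturalKeyPartitioner extends Partitioner<UserTime, LongWritable> {
    @Override
    public int getPartition(UserTime key, LongWritable value, int numPartitions) {
        // partition on the userid only, so all records for a user land in the same partition
        return (key.getUserid().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}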

Partition from Map 1:
(user1,t1),t1
(user1,t2),t2
(user2,t1),t1
(user2,t2),t2

Partition from Map 2:
(user1,t3),t3
(user1,t4),t4
(user2,t3),t3
(user2,t4),t4


Step 3: Group values using only the natural key
To ensure the reducer gets called with all the values for a user, you need to implement a WritableComparator based on the natural key, the userid. Override the compare method to compare based on userid. You tell Hadoop to use this comparator for grouping with the call
job.setGroupingComparatorClass(NaturalKeyGroupComparator.class);
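A sketch of the grouping comparator:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class NaturalKeyGroupComparator extends WritableComparator {
    protected NaturalKeyGroupComparator() {
        super(UserTime.class, true);   // true asks the comparator to instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // group on the userid only, ignoring time
        return ((UserTime) a).getUserid().compareTo(((UserTime) b).getUserid());
    }
}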

Input to Reducer 1:
key = (user1,t1), values = t1,t2,t3,t4

Input to Reducer 2:
key = (user2,t1), values = t1,t2,t3,t4


Output from Reducer1:
user1, t1
user1, t2
user1, t3
user1, t4

The complete sample source code is at SecondarySort.jar.

Wednesday, January 16, 2013

Generics : Array Creation

How do you write a method to convert a generic collection to an array? A naive implementation would be:
 
public static <T> T[] convertToArray(Collection<T> c) {
    T[] a = new T[c.size()]; // compilation error: generic array creation
    int i = 0;
    for (T x : c) {
        a[i++] = x;
    }
    return a;
}
The code does not compile because the element type is needed at runtime to create an array. An array is what is known as a reifiable type: a type is reifiable if its type information is fully available at runtime. Any ordinary Java class or primitive type is reifiable.
Generics, on the other hand, are implemented by erasure: the type parameter is erased at compile time and the runtime uses casts to get the appropriate behavior. So while
 
List<T> a = new ArrayList<T> () ; 
works, because T is erased. Under the hood just an ArrayList() is created and casts are added when getting elements out. However,
 
T[] a = new T[size] ; // compile error
 
will not work, because for arrays the type information is required.
The solution is to use reflection, which is what you would do if you wanted to dynamically create an instance of any reifiable type, such as a plain Java class. The method signature in our example changes a little to take an array of the desired type as an additional parameter, since it is easier to pass in an array of the required element type than a Class object.
 
public static <T> T[] convertToArray(Collection<T> c, T[] arry) {
    if (arry.length < c.size()) {
        // allocate a new array with the same component type, big enough for the collection
        arry = (T[]) java.lang.reflect.Array.newInstance(
                             arry.getClass().getComponentType(), c.size());
    }
    int i = 0;
    for (T x : c) {
        arry[i++] = x;
    }
    return arry;
}
The newInstance method on java.lang.reflect.Array creates an array of the required type, and getComponentType returns the type of the elements of the array. This is analogous to using reflection to create an instance of a class K, where you would do

K.class.newInstance();
 
In summary, in generic methods you can use the new operator to create non-reifiable types (e.g. List<T>) because the type information is erased during compilation (a plain List is created). But for a reifiable type like an array, you need to use reflection, because the type information is required and cannot be erased.
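A quick usage sketch of the method above (imports from java.util assumed): passing an empty array of the desired element type is enough for the method to discover the component type at runtime.

List<String> names = Arrays.asList("ann", "bob", "cathy");
String[] arr = convertToArray(names, new String[0]); // returns a String[] of size 3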

Wednesday, December 19, 2012

JAVA Garbage Collection

Garbage collection (GC) is the process by which the Java virtual machine frees up memory by releasing the memory taken up by objects that are no longer referenced by any other objects. Garbage collection is automatic. For simple applications, the developer does not even need to be aware of it. But for applications that have a large memory footprint, are long running, or have low latency requirements, some understanding is necessary to ensure that garbage collection does not interfere with the application. A common symptom of interference is that the application seems to stop responding or its response time goes up randomly. This article lists a few important points every Java developer needs to know about garbage collection.

1.0  Generational GC

Since JDK 5 , the garbage collectors are what are called generational collectors. The heap is divided into regions based on the age of the objects. The young generation has objects that are short lived. The tenured generation has objects that are long lived. All objects are first created in the young region and after a while if they are alive, they are moved to the tenured generation. Garbage collection of the young region happens frequently and is generally fast. GC for the tenured region happens less frequently. Since most objects are short lived, this makes the GC more efficient.

2.0  Types of collectors

Serial Collector: Garbage from both the young and tenured regions is collected serially, and while this happens your application is paused. This is the default collector on single CPU machines and for small heap sizes (less than 2 GB). It is fine if your application does not care about pauses.

Parallel Collector: This is the default collector on server class machines (multiple CPUs and more than 2 GB of heap). Multiple threads/CPUs are used to collect the young region in parallel, which makes collection faster. But the application is still paused while GC happens. For the tenured region, the GC is serial, as in the serial collector.

Parallel Compacting Collector: GC for the young region is the same as parallel collector and uses multiple threads. However GC for tenured region happens in parallel using multiple CPUs. Application is paused when GC happens.

Concurrent Mark Sweep Collector (CMS): For the young region, it is the same as the parallel collectors. For the tenured region, most of the GC work runs concurrently with the application. The application pauses during GC are expected to be much shorter than with the other collectors. This is an ideal choice for applications that cannot tolerate long pauses.

3.0 Understanding GC in your application

Before you try to tune your application's GC, it is important to understand when GC is happening, how much time it takes and how much memory it reclaims. The JVM provides the following options to log GC activity.

The -XX:+PrintGCDetails flag prints the GC details described below. The -XX:+PrintGCTimeStamps flag prints the time from the start of the JVM to when each GC happened. The -Xloggc:gcfilename.log flag writes the log to gcfilename.log.
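For example (a sketch; myapp.jar is a placeholder for your own application jar), the flags can be combined on the command line as:

java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gcfilename.log -jar myapp.jar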

In the gc log, you will see a number of lines like

11.561: [GC [PSYoungGen: 868524K->294158K(1198848K)] 1303221K->728855K(4694144K), 0.3640750 secs] [Times: user=1.44 sys=0.02, real=0.37 secs]

This indicates that a GC of the young region occurred 11.561 seconds after JVM start. The young region was reduced from 868524K to 294158K, a 66% reduction. The number in parentheses (1198848K) is the memory allocated to the young region. The total heap was reduced from 1303221K to 728855K, a 44% reduction. The number (4694144K) is the total heap size. This GC took 0.37 seconds.

You will see a few lines like

3602.170: [Full GC (System) [PSYoungGen: 16250K->0K(1662080K)] [PSOldGen: 1594630K->1578665K(3495296K)] 1610881K->1578665K(5157376K) [PSPermGen: 22314K->22314K(35904K)], 3.4836190 secs] [Times: user=3.45 sys=0.03, real=3.48 secs]

This indicates that a full GC occurred at 3602.17 secs from the start. The young region was reduced from 16250K to 0K. The old or tenured region was reduced from  1594630K to 1578665K. The total heap was reduced from 1610881K to 1578665K. The GC took 3.48 sec.

GCViewer is a free tool to view GC logs graphically.
GC log viewed in GCViewer

The very small black lines at the bottom indicate the small GCs. The tall black lines at the hourly mark are the Full GCs. The blue peaks are lines indicating how the used heap goes up and goes down after a GC. The ruby red line just below the blue spikes shows the growth of the tenured region. You can see that the tenured region drops after a full GC. Full GCs take a lot of time and you want to reduce the frequency with which they occur.

4.0 Tuning options

 The JVM offers a few knobs that one can turn to tune the GC in a way most suitable to your machine and your application.

The -Xms and -Xmx options are used to set the initial and maximum size of the heap. The maximum heap size should be less than the physical memory on the machine to avoid paging, and one should also leave aside memory for the operating system and other applications running on the same machine. While a bigger heap and more memory are good because the GC has to collect less often, when it does collect it has more work to do, so the GC pauses can be longer.

–XX:+UseSerialGC
–XX:+UseParallelGC
–XX:+UseParallelOldGC
–XX:+UseConcMarkSweepGC

These options are used to select the GC. SerialGC and ParallelGC are selected by default depending on machine type as described earlier.  Applications that have low latency requirements and cannot tolerate long GC pauses should consider switching to the Concurrent Mark Sweep GC.

-XX:NewSize=n is used to set the initial size of the young generation. Most applications have many short lived objects and few long lived objects. NewSize should be large enough that short lived objects fit into the young generation and are garbage collected in the minor GCs. If the young generation is too small, short lived objects get moved to the tenured region, which leads to longer full GCs.

-XX:MaxGCPauseMillis=n is a hint to the GC about the desired maximum pause time. This is just a hint and may or may not be honoured.

5.0 References

There are many other tuning options and the following documents from Oracle are good references on tuning options as well as garbage collection in general:

1. http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
2. http://www.oracle.com/technetwork/java/javase/tech/memorymanagement-whitepaper-1-150020.pdf