Wednesday, May 16, 2012

When to use explicit Locks in Java?

Prior to JDK 5, the only way to protect data from concurrent access was to use the synchronized keyword. The limitations of using synchronized are

(1) A thread that tries to acquire a lock blocks until it gets the lock. There is no way to time out.
(2) A thread that is blocked waiting for a lock cannot be interrupted.
(3) Since synchronized applies to a block of code, the lock has to be acquired and released in the same block. While this is fine most of the time, there are cases where you need the flexibility of acquiring and releasing the lock in different blocks.

The Lock interfaces and classes are well documented at java.util.concurrent.locks
The basic usage of the new Lock interface is

Lock lock = new ReentrantLock();
lock.lock();
try {
    // update
} finally {
    lock.unlock();
}

You might be tempted to say that this can be done using synchronized. However, the new Lock interface has several additional features.

1. Non-blocking method
The tryLock() method (without parameters) acquires the lock if it is available and returns true. If the lock is not available, it returns false immediately instead of blocking. This is very useful for avoiding deadlocks when you are trying to acquire multiple locks.
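For example, here is a minimal sketch (the class and method names are made up for illustration) of acquiring two locks with tryLock and backing off if both cannot be obtained, which avoids the classic lock-ordering deadlock:
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TransferExample {
    private final Lock fromLock = new ReentrantLock();
    private final Lock toLock = new ReentrantLock();

    // Returns true only if both locks were acquired and the update was done.
    public boolean tryTransfer() {
        if (fromLock.tryLock()) {
            try {
                if (toLock.tryLock()) {
                    try {
                        // ... update the state guarded by both locks ...
                        return true;
                    } finally {
                        toLock.unlock();
                    }
                }
            } finally {
                fromLock.unlock();
            }
        }
        return false; // could not get both locks; the caller can back off and retry
    }
}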

2. Timed
tryLock(long time, TimeUnit unit) acquires the lock if it becomes free within the given waiting time and returns true. Otherwise it returns false. The thread can also be interrupted while it waits.

This is useful when you have service time requirements, such as in real-time bidding. Say the method needs to respond within 10 milliseconds; otherwise the response is of no use because the bid is lost.
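A minimal sketch of a timed acquisition, assuming a hypothetical placeBid() method and a 10 millisecond budget:
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class BidProcessor {
    private final Lock lock = new ReentrantLock();

    public boolean placeBid(long amount) throws InterruptedException {
        // Wait at most 10 ms for the lock; give up if the budget is exceeded.
        if (!lock.tryLock(10, TimeUnit.MILLISECONDS)) {
            return false; // too late, the bid is of no use now
        }
        try {
            // ... record the bid ...
            return true;
        } finally {
            lock.unlock();
        }
    }
}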

3. Interruptible
The lockInterruptibly() method waits for the lock, but the waiting thread can be interrupted, in which case the method throws InterruptedException.
This is useful in implementing abort or cancel features.
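A small sketch of a cancellable task using lockInterruptibly; the class name and the work inside the lock are illustrative:
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class CancellableTask implements Runnable {
    private final Lock lock = new ReentrantLock();

    @Override
    public void run() {
        try {
            lock.lockInterruptibly(); // another thread can abort us via interrupt()
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag and give up
            return;
        }
        try {
            // ... do the work ...
        } finally {
            lock.unlock();
        }
    }
}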

4. Non-block-structured locking
You can acquire the lock in one method and release it in another. Or you can wrap the lock and unlock calls in your own domain-specific acquireLock() and releaseLock() methods.

This is useful for avoiding race conditions on read-update-save operations on data stored in caches. The synchronization provided by ConcurrentHashMap or a synchronized Map protects only the individual get and put operations, not the time in between while the data is being modified.

cache.acquireLock(key);
Data d = cache.get(key);
d.update1();
d.update2();
d.update3();
cache.put(key, d);
cache.releaseLock(key);

Acquiring and releasing the lock are abstracted away in the acquireLock and releaseLock methods.
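One possible way to implement such a cache wrapper is to keep one ReentrantLock per key in a ConcurrentHashMap. This is only a sketch under that assumption; the class and method names mirror the pseudocode above and are not from any particular library:
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

public class LockingCache<K, V> {
    private final ConcurrentMap<K, V> data = new ConcurrentHashMap<K, V>();
    private final ConcurrentMap<K, ReentrantLock> locks = new ConcurrentHashMap<K, ReentrantLock>();

    // Lazily creates and returns the lock associated with a key.
    private ReentrantLock lockFor(K key) {
        ReentrantLock lock = locks.get(key);
        if (lock == null) {
            ReentrantLock newLock = new ReentrantLock();
            ReentrantLock existing = locks.putIfAbsent(key, newLock);
            lock = (existing != null) ? existing : newLock;
        }
        return lock;
    }

    public void acquireLock(K key) { lockFor(key).lock(); }

    public void releaseLock(K key) { lockFor(key).unlock(); }

    public V get(K key) { return data.get(key); }

    public void put(K key, V value) { data.put(key, value); }
}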

5. Read/Write Locks
This is my favorite feature. The ReadWriteLock interface exposes two lock objects: a read lock and a write lock.

You acquire the read lock when all you are doing is reading. Multiple threads can hold the read lock at the same time. By allowing multiple readers, you achieve greater concurrency. A read lock cannot be acquired while a write lock is held by another thread.

You acquire the write lock when you need to write data. Only one thread can hold the write lock at a time. A write lock cannot be acquired while other threads hold read locks.
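A minimal sketch using ReentrantReadWriteLock to guard a shared map (the class and field names are illustrative):
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PriceBoard {
    private final Map<String, Double> prices = new HashMap<String, Double>();
    private final ReadWriteLock rwLock = new ReentrantReadWriteLock();

    // Many threads may read concurrently.
    public Double getPrice(String symbol) {
        rwLock.readLock().lock();
        try {
            return prices.get(symbol);
        } finally {
            rwLock.readLock().unlock();
        }
    }

    // Writers get exclusive access.
    public void updatePrice(String symbol, double price) {
        rwLock.writeLock().lock();
        try {
            prices.put(symbol, price);
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}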

Is the use of synchronized obsolete?
Not really. Synchronized blocks are simple to use and are widely used, and most programmers are very familiar with them. They are less error prone because the lock is automatically released. It is reasonable to continue using synchronized for the simpler locking use cases. But if you need any of the features described above, using explicit locks is well worth the extra coding. Performance-wise there is not much difference, though studies have shown that explicit locks are slightly faster.

Friday, April 20, 2012

Build distributed applications using Spring HTTP Invoker

Building distributed applications involves calling methods on objects that are remote - on different machines and/or in different JVMs. Code running on machine A invokes a method on an object running on machine B, and it works just as if the caller and the target were in the same JVM. In the past, CORBA, RMI and EJBs were the technologies used for remote invocation, but they are complicated to use. The protocols are binary and difficult to troubleshoot. They are also not suitable for use across intranets because they use ports that network admins hate to open.

Since 2000, SOAP-based web services have enabled remote invocation using HTTP as the transport and XML for the payload. While HTTP solved the problems of troubleshooting and firewalls, the performance of using XML was not very good. Some developers prefer web services using JSON over HTTP, but that requires modeling the data in JSON.

Spring HTTP Invoker is a remoting mechanism where the programming model is plain Java, but HTTP is used as the transport and the payload is created using Java serialization. Spring HTTP Invoker gives developers the benefit of HTTP without the performance overhead of XML-based web services. In the rest of the article, we explain remoting with Spring HTTP Invokers using a simple example.

For this tutorial you will need

(1) Spring
(2) JDK
(3) Eclipse
(4) Tomcat

In this example, we create a service AccountService, with a method getAccount. The service is deployed to Tomcat. We invoke the getAccount method from a J2SE client running in a different JVM. You may download the full source code for this sample at RemoteService

Step 1: create the service and implementation

Let us define an interface AccountService and its implementation AccountServiceImpl in plain Java.
public interface AccountService {
    public Account getAccount(int id);
}

public class AccountServiceImpl implements AccountService {
    @Override
    public Account getAccount(int id) {
        return new Account(id, "testacct", 100, 2999.99F);
    }
}
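The Account class itself is not shown here. A minimal version consistent with the constructor call above could look like the sketch below; the field names are assumptions, but note that the class must implement Serializable because the HTTP invoker payload uses Java serialization:
import java.io.Serializable;

public class Account implements Serializable {
    private static final long serialVersionUID = 1L;

    private final int id;
    private final String name;
    private final int number;    // assumed meaning of the third constructor argument
    private final float balance; // assumed meaning of the fourth constructor argument

    public Account(int id, String name, int number, float balance) {
        this.id = id;
        this.name = name;
        this.number = number;
        this.balance = balance;
    }

    @Override
    public String toString() {
        return id + " " + name + " " + number + " " + balance;
    }
}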
Step 2: Spring Application context for server side
The Spring application context is defined in the file remoting-servlet.xml.
<beans>
    <bean id="accountService" class="com.mj.account.AccountServiceImpl"/>
    <bean name="/AccountService" class="org.springframework.remoting.httpinvoker.HttpInvokerServiceExporter">
       <property name="service" ref="accountService"/>
       <property name="serviceInterface" value="com.mj.account.AccountService"/>
   </bean>
</beans>
The first bean, accountService, needs no explanation - it is a simple Spring bean. The second bean exports it under the name /AccountService using HttpInvokerServiceExporter, a Spring-provided class. The service exported is the accountService defined by the first bean. Since we will be invoking over HTTP, the bean name /AccountService also serves as the URL (by convention).

Step 3: package as war and deploy to tomcat

The classes and the context XML need to be packaged as a war and deployed to Tomcat. The standard Spring MVC DispatcherServlet needs to be wired into web.xml. Because the servlet is named remoting, the DispatcherServlet looks for the remoting-servlet.xml application context defined in Step 2.
<servlet>
        <servlet-name>remoting</servlet-name>
        <servlet-class>
            org.springframework.web.servlet.DispatcherServlet
        </servlet-class>
        <load-on-startup>1</load-on-startup>
</servlet>
<servlet-mapping>
        <servlet-name>remoting</servlet-name>
        <url-pattern>/*</url-pattern>
</servlet-mapping>
Step 4: Create an application context for the client with the following entry
<bean id="AccountProxy" class="org.springframework.remoting.httpinvoker.HttpInvokerProxyFactoryBean">
     <property name="serviceUrl" value="http://localhost:8080/remoteservice/AccountService"/>
     <property name="serviceInterface" value="com.mj.account.AccountService"/>
</bean>
This defines a bean AccountProxy backed by HttpInvokerProxyFactoryBean, which creates the HTTP invoker proxy. The URL to invoke is http://localhost:8080/remoteservice/AccountService: http://localhost:8080 is where the target web server is listening, /remoteservice is the Tomcat context (I deployed the service as remoteservice.war), and /AccountService is the URL we defined for our bean in remoting-servlet.xml.

Step 5: Remote invoke the service
public class AccountServiceClient {
    public static void main(String[] args) {
        ApplicationContext applicationContext = new ClassPathXmlApplicationContext("remoteclient.xml");
        AccountService testService = (AccountService) applicationContext.getBean("AccountProxy");
        Account a = testService.getAccount(25);
        System.out.println(a);
    }
}
You should see the output 25 testacct 100 2999.99

In summary, it is very easy to do remote invocation and distribute services using Spring HTTP invokers. You get the ease of plain Java programming, plus easier maintenance and troubleshooting because the transport is HTTP. There is simply no reason to use RMI-like protocols any more.

Sunday, March 18, 2012

High Availability for Web applications

As more mission-critical applications move to the cloud, making the application highly available becomes super critical. An application that is unavailable for any reason - web server down, database down and so on - means lost users and lost revenue, which can be devastating to your business. In this blog we examine some basic high availability concepts.

Availability means your web application is available for your users to use. We would all like our applications to be available 100% of the time, but for various reasons that does not happen. The goal of high availability is to make the application available as much as possible. Generally, availability is expressed as the percentage of time per year that the application is available - 99%, 99.9% and so on.

Redundancy and failover are the techniques used to achieve high availability. Redundancy is achieved by having multiple copies of your server. Instead of one Apache web server, you have two. One is the active server; it is monitored, and if for some reason it fails, you fail over to the second server, which becomes active. Another approach is to use a cluster of active servers, as is done in a Tomcat cluster. All servers are active, and a load balancer distributes load among the members of the cluster. If one or two members of the cluster go down, no users are affected because the other servers continue processing. Of course, the load balancer itself can become a point of failure and needs redundancy and failover.

If you were launching a new web application to the cloud, you might start off with a basic architecture as shown below, without any HA considerations.

Phase 1: Single Tomcat web server




Very soon you hit two issues. First, whenever the server goes down, by accident or by intent, your users cannot use the application. Second, as the number of users goes up, a single server is not able to handle the load.

Phase 2: Tomcat cluster

You add redundancy and scalability by using a Tomcat cluster as shown in the figure below. The cluster is fronted by an Apache web server with mod_proxy, which distributes requests to the individual servers. mod_proxy acts as the load balancer.


Now the application scales horizontally. Tomcat or application failure is not an issue because there are other servers in the cluster. But we have introduced a new point of failure: the load balancer. If Apache+mod_proxy goes down, the application is unavailable.

To read more about setting up a Tomcat cluster, see Tomcat clustering

To learn how to use a load balancer with Tomcat, see Loadbalancing with Tomcat

Phase 3: Highly available Tomcat cluster
The figure below shows how to eliminate the point of failure and make the load balancer highly available.
You add redundancy by adding a second Apache+mod_proxy. However, only one of the Apache servers is active. The second Apache does not handle any requests; it merely monitors the active server using a tool like Heartbeat. If for some reason the active server goes down, the passive server detects it, takes over the IP address and starts handling requests. How does this happen?

This is possible because the IP address advertised to the world for this application is shared by the two Apache servers. This is known as a virtual IP address. While the two servers share the virtual IP, packets are routed only to the active server. When the active server goes down, the passive server takes over the virtual IP so that packets intended for that address are routed to it. There are networking commands that let a server start and stop listening on the virtual IP address.

Tools like Heartbeat and UltraMonkey enable the servers to maintain a heartbeat with one another and fail over when necessary. With Heartbeat, there is a heartbeat process on each server, and config files hold information about the virtual IP address and the active and passive servers. There are several articles on the internet on how to set up Heartbeat.

In summary, you can build highly available applications using open source tools. The key concepts of HA - redundancy, monitoring and failover, and virtual IP addresses - apply to any service, not just web servers. You can use the same concepts to make your database server highly available.

Sunday, February 19, 2012

Java Generics #2 : what is "? super T" ?

Consider the merge method below that copies all elements from List source to List target.
public static <T> void merge(List<? super T> target, List<? extends T> source)
The <T> following static declares T as a new type variable for this method. We discussed "? extends T" in the blog Java Generics #1. Here, let us examine "? super T". One can guess that it is a wildcard that means any type that is a supertype of T (including T itself). If T is Integer, then List<? super T> could be List<Integer>, List<Number> or List<Object>. The code below shows the use of the merge method. In line 6, T is Integer and "? super T" is Number. In line 10, T is Number and "? super T" is Object.
1 List<Integer> aInts = new ArrayList<Integer>() ;
2 aInts.add(5) ;
3 aInts.add(7) ;
4 List<Number> aNums = new ArrayList<Number>() ;
5 aNums.add(12.5) ;
6 MCollections.<Integer>merge(aNums,aInts) ; // works
7 System.out.println(aNums.toString()) ; // aNums has 5,7,12.5
8 List<Object> aObjs = new ArrayList<Object>() ;
9 aObjs.add("hello") ;
10 MCollections.<Number>merge(aObjs,aNums) ; // works as well
11 System.out.println(aObjs.toString()) ; // aObjs has hello,5,7,12.5
We discussed in the last blog that if you have a Collection<? extends T> you can get values out of it, but you cannot put stuff into it. So what can you do with Collection<? super T> ?

In our merge example above, List<? super T> is the target, which means the implementation is putting/setting elements into it. "? super T" means any supertype of T. Logically it makes sense that you can put a T into a list whose element type is any supertype of T.

The implementation of merge could be
public class MCollections {
    public static <T> void merge(List<? super T> target,
                                 List<? extends T> source) {
        for (int i = 0; i < source.size(); i++) {
            T e = source.get(i);
            target.add(e);
        }
    }
}
But if you were to do a get on a List<? super T>, what would the returned type be? There would be no way to know. Hence it is not allowed, as shown in line 4 of the code below.
1 List<? super Integer> aNums = new ArrayList<Number>() ;
2 aNums.add(11) ;
3 aNums.add(12) ;   
4 Number n = aNums.get(0) ; // Compilation Error - not allowed 
5 Object o = aNums.get(0) ; // allowed -- No compile error 
The exception to the rule is getting the value as an Object, which is allowed because Object is a supertype of every other Java type.

In summary, you can enable subtyping using "? super T" when you need to put objects into the collection (but you can get them out only as Object). You can enable subtyping using "? extends T" when you need to get objects out of the collection. It follows that if you need to do both get and put, then you cannot use either of these wildcard mechanisms and you need to use an explicit type.
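As an illustration of that last point, a method that both reads and writes the same list, such as the hypothetical swap below, has to use an explicit type variable rather than a wildcard:
import java.util.List;

public class MoreCollections {
    // Both gets and puts elements of the list, so T must be an explicit type variable.
    public static <T> void swap(List<T> list, int i, int j) {
        T tmp = list.get(i);          // get as T
        list.set(i, list.get(j));
        list.set(j, tmp);             // put a T back
    }
}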

Sunday, January 15, 2012

Java Generics #1 : Subtyping using wildcard with extends

Generics is one of the more complicated language features in Java and is not well understood by many programmers. Many avoid it altogether, and not without reason: while writing your program, if you have to stop and think a lot about syntax, there is a good chance you will try to avoid that language construct. In this blog I discuss one kind of subtyping with generics that can be tricky.

In Java we know that Integer extends Number. In other words, Integer is a subtype of Number. Anywhere a Number is required, you can pass in an Integer. But does this mean that List<Integer> is a subtype of List<Number>?

Consider the code below. Will it work?
1 List<Integer> aList = new ArrayList<Integer>() ;
2 aList.add(11) ;
3 aList.add(13) ;
4 List<Number> nList = aList ;
5 nList.add(11.5) ;
aList is a list of Integers. nList is a list of Numbers. In line 4, nList is made to reference aList. In line 5 we add a Double through nList, but the underlying aList is a list of Integers. This is obviously not correct, and Java will not allow it: line 4 will cause a compilation error. But sometimes we want to be able to use subtypes. Generics have the concept of wildcards that enable subtyping where it is logically appropriate.

Consider the addAll method of the Collection interface.
interface Collection<T> {
    public boolean addAll(Collection<? extends T> x);
}
"? extends T" says that, given a collection of type T, you can add to it elements from any collection whose element type is a subtype of T. The following code is valid.
1 List<Number> aList = new ArrayList<Number>() ;
2 List<Integer> intList = Arrays.asList(11,12) ;
3 List<Double> dList = Arrays.asList(15.15) ;
4 aList.addAll(intList) ;
5 aList.addAll(dList) ;
The implementation of the addAll method gets elements from the collection passed in as a parameter and puts them into the target collection. Note that only a get operation is performed on the Collection<? extends T> parameter. A put on a Collection<? extends T> would not be allowed. To understand why, consider the code below.
1 List <? extends Number> numList ;
2 List<Integer> intList = Arrays.asList(11,12) ;
3 numList = intList ; // Will this work ?
4 numList.add(5.67) ; // Will this work ?
Should line 3 work ? What about line 4 ?
The Java compiler allows line 3 because List<Integer> is considered a subtype of List<? extends Number>. But line 4 is a compilation error because you should not be allowed to add a Double to a List<Integer>.

In summary, when you have a Collection<? extends T>, it is safe to get elements out of the collection but not safe to put elements into it. Hence the compiler does not allow it.

Sunday, December 18, 2011

Single Sign On for the cloud: SAML & OpenId

When accessing different applications owned by different organizations, having to authenticate every time you go from one application to another is annoying. Not only is it time-consuming, but you also have to remember multiple passwords, which are often lost.

Single sign on is the ability to authenticate once and be able to move between applications seamlessly using the authenticated identity.

Within an intranet or between applications that are under the control of one development organization, single sign on for web applications can be easily implemented by generating a sessionid and passing it around using cookies. However, such a solution is proprietary and will not work if you need to leave the intranet and access other applications on the cloud. To interoperate with applications on the cloud, a more standards based solution is required.

A related concept and benefit is federated identity. Organizations can agree on a common name to refer to users. The user and his attributes need to be created in only one place, and other parties can refer to this information.

In this blog, we briefly examine two popular protocols that can be used for single sign on on the cloud: SAML and OpenId.

OpenId

The problem OpenId solves is that you as a user do not have to maintain and provide a password to each and every site you visit.

You maintain your password or other identifying credential with one provider known as the OpenId provider.

The website or application that you visit and that requires proof of who you are, relies on the OpenId provider to verify that you are who you claim to be. This is known as the relying party.

The basics of the OpenId protocol are:

1. You visit a web application (relying party) and enter an OpenId

2. Based on your OpenId, the relying party determines who your OpenId provider is.

3. The relying party redirects your request to the OpenId provider.

4. The OpenId provider authenticates you by asking for a password or other information, and warns you that the relying party is requesting information about you. If you are already authenticated with the provider, this step is skipped.

5. The request is redirected back to the relying party where you are shown the URL you were trying to access.

The protocol does not require providers or relying parties to be registered anywhere. It uses plain HTTP requests and responses. The protocol messages are plain-text key-value pairs. The protocol works well with modern "Web 2.0" AJAX-style applications.

The OpenId protocol originated with consumer-oriented websites such as Google, Twitter and Facebook, and that is where it is most popular.

The OpenId specification is described at OpenId specification
There is a java implementation of OpenId at openid4java

SAML (Security Assertion Markup Language)

SAML is an XML-based protocol that enables web-based authentication, authorization and single sign on.

SAML involves a relying party requesting an assertion and a SAML provider responding with the assertion.

Examples of assertions are :
Authentication Assertion : This user was authenticated using such and such method at time t.
Attribute Assertion : This user has a title supermanager.
Authorization Assertion : This user has permission to delete file xyz.doc.

A typical SAML interaction would be as follows:

1. A user tries to access a URL or a web application which is the relying party
2. The relying party creates a SAML authentication request.
3. The relying party redirects the user's browser to a SAML provider. Embedded in the request is the SAML authentication request.
4. The SAML provider evaluates the SAML request and authenticates the user.
5. The SAML provider returns a SAML authentication response to the user's browser.
6. The browser forwards the SAML response back to the relying party.
7. The relying party verifies and interprets the SAML response.
8. If the response indicates successful authentication, the user is redirected to the URL he was originally trying to reach.

SAML has the concept of profiles. The interaction is different based on the profile. The interaction above is the Web SSO profile.

SAML has its origins in enterprise software, web services and B2B communication, and dates from the early 2000s when XML was very popular. In fact, SAML 1.x had only a SOAP binding.

The SAML specification is at SAML Specification
There is a SAML implementation at OpenSAML

Which protocol should I use ?

OpenId is a simpler protocol, but SAML has more features.

OpenId supports discovery of the OpenId provider; using SAML has generally required expensive, up-front integration projects.

OpenId supports only service-provider-initiated SSO: you go to a service provider's website, it needs authentication, and it starts the conversation with the OpenId provider. SAML can also support identity-provider-initiated SSO. Say you are authenticated into your company's portal, and your company has a partner travel website for business travel. With SAML, you can go from your company's portal (the SAML provider) to the partner website (the relying party) without needing to re-authenticate.

SAML has been around longer than OpenId. SAML is more popular in the enterprise where as OpenId is more popular in consumer oriented applications.

Both OpenId and SAML rely on external transport layer security protocols such as SSL for the security of protocol messages.

If you are starting a new website and want to accept users from other popular sites such as google or twitter, you might consider OpenId. However if you are an enterprise and you want your authenticated users to access your partner sites without re-authentication, you might need SAML.

In summary, SAML is a feature-rich protocol that is more popular in the enterprise. OpenId is a simpler protocol with some limitations.

Saturday, November 19, 2011

What is NoSQL ?

NoSQL is a term used to refer to a class of database systems that differ from traditional relational database management systems (RDBMS) in many ways. RDBMSs are accessed using SQL. Hence the term NoSQL implies not accessed by SQL - more specifically, not an RDBMS or, more accurately, not relational.

Some key characteristics of NoSQL databases are:
  • They are distributed, can scale horizontally and can handle data volumes of the order of several terabytes or petabytes, with low latency.
  • They have less rigid schemas than a traditional RDBMS.
  • They have weaker transactional guarantees.
  • As suggested by the name, these databases do not support SQL.
  • Many NoSQL databases model data as rows with column families, key-value pairs or documents.

To understand what non relational means, it might be useful to recap what relational means.

Theoretically, relational databases comply with Codd's 12 rules of the relational model. More simply, in an RDBMS a table is a relation and a database has a set of such relations. A table has rows and columns. Each table has constraints, and the database enforces the constraints to ensure the integrity of the data. Each row in a table is identified by a primary key, and tables are related using foreign keys. You eliminate duplicate data during the process of normalization by moving columns into separate tables but keeping the relationship using foreign keys. Getting data out of multiple tables requires joining the tables using the foreign keys. This relational model has been useful in modeling most real-world problems and has been in widespread use for the last 20 years.

In addition, RDBMS vendors have gone to great lengths to ensure that RDBMSs do a great job in maintaining ACID (atomic, consistent, isolated, durable) transactional properties for the data stored. Recovery from unexpected failures is supported. This has led to relational databases becoming the de facto standard for storing enterprise data.

If RDBMSs are so good, why does anyone need NoSQL databases?

Even the largest enterprises have users only in the order of thousands and data requirements in the order of a few terabytes. But when your application is on the internet, where you are dealing with millions of users and data in the order of petabytes, things start to slow down with an RDBMS. The basic operations with any database are reads and writes. Reads can be scaled by replicating data to multiple machines and load balancing read requests. However, this does not work for writes because data consistency needs to be maintained. Writes can be scaled only by partitioning the data, but that hurts reads because distributed joins can be slow and hard to implement. Additionally, to maintain ACID properties, databases need to lock data at the cost of performance.

The Googles, Facebooks and Twitters of the world have found that relaxing the constraints of RDBMSs and distributing data gives them better performance for use cases that involve:
  • Large datasets of the order of petabytes. Typically this needs to be stored across multiple machines.
  • The application does a lot of writes.
  • Reads require low latency.
  • Data is semi structured.
  • You need to be able to scale without hitting a bottleneck.
  • The application knows what it is looking for. Ad hoc queries are not required.


What are the NoSQL solutions out there ?

There are a few different types.

1. Key Value Stores

They allow clients to read and write values using a key. Amazon's Dynamo is an example of a key value store.

get(key) returns an object or a list of objects
put(key, object) stores the object as a blob

Dynamo uses hashing to partition data across the hosts that store the data. To ensure high availability, each write is replicated across several hosts. All hosts are equal and there is no master. The advantage of Dynamo is that the key-value model is simple and it is highly available for writes.
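To make the model concrete, a key-value store seen from Java code is essentially the following interface; this is a hypothetical sketch, not the API of Dynamo or any particular product:
// A minimal, illustrative key-value store contract.
public interface KeyValueStore {
    // Returns the stored value (an opaque blob) for the key, or null if absent.
    byte[] get(String key);

    // Stores the value as a blob under the key, overwriting any previous value.
    void put(String key, byte[] value);
}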

2. Document stores

The key-value pairs that make up the data are encapsulated as a document. Apache CouchDB is an example of a document store. In CouchDB, documents have fields; each field has a key and a value. A document could be:
"firstname " : " John ",
"lastname " : "Doe" ,
"street " : "1 main st",
"city " : "New york"
In CouchDB, distribution and replication are peer-to-peer. The client interface is RESTful HTTP, which integrates well with existing HTTP load balancing solutions.

3. Column based stores

Reads and writes are done using columns rather than rows. The best-known examples are Google's BigTable and the likes of HBase and Cassandra that were inspired by it. The BigTable paper describes BigTable as a sparse, distributed, persistent, multidimensional sorted map. While that sentence seems complicated, reading each word individually gives clarity.
sparse - some cells can be empty
distributed - data is partitioned across many hosts
persistent - stored to disk
multidimensional - more than 1 dimension
Map - key and value
sorted - maps are generally not sorted but this one is

This sample might help you visualize a BigTable map
{
row1:{
    user:{
          name: john
          id : 123
    },
    post: {
          title:This is a post    
          text : xyxyxyxx
    }
}
row2:{
    user:{
          name: joe
          id : 124
    },
    post: {
          title:This is a post    
          text : xyxyxyxx
    }
}
row3:{
    user:{
          name: jill
          id : 125
    },
    post: {
          title:This is a post    
          text : xyxyxyxx
    }
}

}
The outermost keys row1, row2 and row3 are analogous to rows. user and post are what are called column families. The column family user has columns name and id; post has columns title and text. columnfamily:column is how you refer to a column, for example user:id or post:text. In HBase, the column families need to be specified when you create the table, but columns can be added on the fly. HBase provides high availability and scalability using a master-slave architecture.


Do I need a NoSQL store?

You do not need a NoSQL store if

  • All your data fits on one machine and does not need to be partitioned.
  • You are doing OLTP, which requires the ACID transaction properties and data consistency that RDBMSs are good at.
  • You need ad hoc querying using a language like SQL.
  • You have complicated relationships between the entities in your applications.
  • Decoupling data from application is important to you.


You might want to start considering NoSQL stores if

  • Your data has grown so large that it can no longer be handled without partitioning.
  • Your RDBMS can no longer handle the load.
  • You need very high write performance and low latency reads.
  • Your data is not very structured.
  • You can have no single point of failure.
  • You can tolerate some data inconsistency.

The bottom line is that NoSQL stores are a new and complex technology. There are many choices and no standards. There are specific use cases for which NoSQL is a good fit, but an RDBMS does just fine for most vanilla use cases.