Friday, October 10, 2014

ServletContainerInitializer : Discovering classes in your Web Application

In my blog on java.util.ServiceLoader, we discussed how it can be used to discover third party implementations of your interfaces. This can be useful if your application is a container that executes code written by developers. In this blog, we discuss dynamic discovery and registration for Servlets.

Most Java web developers are already familiar with the javax.servlet.ServletContextListener interface. If you want to do initialization when the application starts or clean up when it is destroyed, you implement the contextInitialized and contextDestroyed methods of this interface.

The Servlet 3.0 specification added a couple of interesting features that support dynamic registration and are particularly useful to developers of libraries or containers.

(1) javax.servlet.ServletContainerInitializer is another interface that can notify your code of application start.

Library or container developers typically provide an implementation of this interface. The implementation should be annotated with the @HandlesTypes annotation. When the application starts, the Servlet container calls the onStartup method of this interface, passing in as a parameter the set of all classes that implement, extend or are annotated with the type(s) declared in the @HandlesTypes annotation.

(2) The specification also adds a number of methods on ServletContext (addServlet, addFilter and addListener) to dynamically register Servlets, filters and listeners. You will recall that previously, if you needed to add a new Servlet to your application, you needed to modify web.xml.

Combining (1) and (2), it should be possible to dynamically discover and add Servlets to a web application. This is a powerful feature that allows you to make the web application modular and spread development across teams without build dependencies. Note that this technique can be used to discover any interface, class or annotation. I am killing 2 birds with one stone by using this to discover servlets.

In the rest of the blog, we will build a simple web app, that illustrates the above concepts. For this tutorial you will need

(1) JDK 7.x or higher
(2) Apache Tomcat or any Servlet container
(3) Apache Maven

In this example we will

(1) Implement ServletContainerInitializer in a class called WebContainerInitializer and package it in a jar, containerlib.jar.
(2) Create a new annotation, @MyServlet, which will act like the @WebServlet annotation in the servlet specification. This makes the example more interesting. WebContainerInitializer will handle types that are annotated with @MyServlet.
(3) Write a simple web app that has a Servlet annotated with @MyServlet and has containerlib.jar in the lib directory. There are no entries in web.xml.
(4) Start the app. The servlet is discovered and registered, and you can invoke it from a browser.

Before we proceed any further, you may download the code from my GitHub repository, so you can look at the code as I explain. The code for this example is in the dynamicservlets directory.

Step 0: Get the code

git clone https://github.com/mdkhanga/my-blog-code.git

dynamicservlets has 2 subdirectories: containerlib and dynamichello.

The containerlib project has the MyServlet annotation and the WebContainerInitializer which implements ServletContainerInitializer.

dynamichello is a web application that uses the containerlib jar.

Step 1: The MyServlet annotation
MyServlet.java
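import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface MyServlet {
    String path() ;
}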

The annotation applies to classes and is used as
@MyServlet(path = "/someuri")
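
To see what the container gets to work with, here is a minimal sketch, using only the JDK, that reads the path attribute through reflection. SampleHandler and AnnotationDemo are hypothetical names made up for this illustration, and the annotation is redeclared so that the snippet compiles on its own.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Same shape as the MyServlet annotation in this post
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface MyServlet {
    String path();
}

// Hypothetical annotated class; a real one would extend HttpServlet
@MyServlet(path = "/someuri")
class SampleHandler {
}

class AnnotationDemo {
    // Returns the declared path, or null when the class is not annotated
    static String pathOf(Class<?> type) {
        MyServlet ann = type.getAnnotation(MyServlet.class);
        return ann == null ? null : ann.path();
    }

    public static void main(String[] args) {
        System.out.println(pathOf(SampleHandler.class)); // /someuri
        System.out.println(pathOf(String.class));        // null
    }
}
```

RetentionPolicy.RUNTIME is what makes this work: without it the annotation would not be visible to reflection when the application starts.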

Step 2: A Demo servlet
HelloWorldServlet.java
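import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@MyServlet(path = "/greeting")
public class HelloWorldServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        // Write a plain greeting to the response body
        PrintWriter p = response.getWriter() ;
        p.write(" hello world ");
        p.close();
    }

}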


This is a simple hello servlet that we discover and register. Nothing needs to be added to web.xml.

Step 3: WebContainerInitializer
WebContainerInitializer.java
This is the implementation of ServletContainerInitializer.

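import java.util.Set;
import javax.servlet.Servlet;
import javax.servlet.ServletContainerInitializer;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.ServletRegistration;
import javax.servlet.annotation.HandlesTypes;

@HandlesTypes({MyServlet.class})
public class WebContainerInitializer implements ServletContainerInitializer {

    @Override
    public void onStartup(Set<Class<?>> classes, ServletContext ctx)
            throws ServletException {
        // The container may pass null when no matching classes are found
        if (classes == null) {
            return;
        }
        for (Class<?> c : classes) {
            MyServlet ann = c.getAnnotation(MyServlet.class);
            // Register under the class name so that several servlets can coexist
            ServletRegistration.Dynamic d =
                    ctx.addServlet(c.getName(), c.asSubclass(Servlet.class));
            d.addMapping(ann.path());
        }
    }
}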


The implementation needs to be in a separate jar that is included in the lib directory of the application war. WebContainerInitializer is annotated with @HandlesTypes, which takes MyServlet.class as a parameter. When the application starts, the servlet container finds all classes that are annotated with MyServlet and passes them to the onStartup method. In the onStartup method, we go through each class found by the container, get the value of the path attribute from the annotation and register the servlet.
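
The discovery and registration loop can be tried without a servlet container. The sketch below is a stand-in and not the servlet API: a plain Map plays the role of the ServletContext, and HelloHandler, StatusHandler and RegistrationSketch are hypothetical names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface MyServlet {
    String path();
}

// Hypothetical classes standing in for real servlets
@MyServlet(path = "/greeting")
class HelloHandler { }

@MyServlet(path = "/status")
class StatusHandler { }

class RegistrationSketch {
    // Stand-in for onStartup: maps each declared path to its class
    static Map<String, Class<?>> register(Set<Class<?>> classes) {
        Map<String, Class<?>> mappings = new TreeMap<>();
        if (classes == null) {
            return mappings; // nothing was discovered
        }
        for (Class<?> c : classes) {
            MyServlet ann = c.getAnnotation(MyServlet.class);
            if (ann != null) {
                mappings.put(ann.path(), c);
            }
        }
        return mappings;
    }

    public static void main(String[] args) {
        Set<Class<?>> found = new LinkedHashSet<>();
        found.add(HelloHandler.class);
        found.add(StatusHandler.class);
        register(found).forEach((path, c) ->
                System.out.println(path + " -> " + c.getSimpleName()));
    }
}
```

In the real initializer the container builds the set of classes for you by scanning the application; here the set is filled in by hand.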

To make this work, we need one more thing: a file in the jar's META-INF/services directory named javax.servlet.ServletContainerInitializer, which contains one line, com.mj.WebContainerInitializer. If you are wondering why this is required, please see my blog on java.util.ServiceLoader.

Step 4: Build and run the app

To build,
cd containerlib
mvn clean install
cd dynamichello
mvn clean install

This builds dynamichello/target/dynamichello.war, which can be deployed to Tomcat or any servlet container.
When the application starts, you will see the following messages in the log:

Initializing container app .....
Found ...com.mj.servlets.HelloWorldServlet
path = /greeting


Point your browser to http://localhost:8080/hello/greeting. If the war is deployed under a different context path, use that path in place of hello.

The servlet will respond with a hello message.

In summary, this technique can be used to dynamically discover classes during application startup. It is typically used to implement libraries or containers, such as a JAX-RS implementation. Implementations can be provided by different developers, and nothing is hard wired.

Saturday, September 20, 2014

Discovering third party API/SPI implementations using java.util.ServiceLoader

One interface, many implementations is a very well known object oriented programming paradigm. If you write the implementations yourself then you know what  those implementations are and you can write a factory class or method that creates and returns the right implementation. You might also make this config driven and inject the correct implementation based on configuration.

What if third parties are providing implementations of your interface? If you know those implementations in advance, then you could do the same as in the case above. One downside is that a code change is required to add, use or remove implementations. You could come up with a configuration file where implementations are listed, and your code could use the list to determine what is available. The downside is that the configuration has to be updated by you, and the approach is non-standard: every API developer could come up with their own format for the configuration. Fortunately, JAVA has a solution.

JDK 6 introduced java.util.ServiceLoader, a class for discovering and loading implementations of a service type.

It has a static load method that can be used to create a ServiceLoader that will find and load all implementations of a particular type.

public static <T> ServiceLoader<T> load(Class<T> service)

You would use it as
ServiceLoader<SortProvider> sl = ServiceLoader.load(SortProvider.class) ;
This creates a ServiceLoader that can find and load every SortProvider in the classpath.

The iterator method returns an Iterator over the implementations found, which are loaded lazily.
Iterator<SortProvider> its = sl.iterator() ;

You can iterate over what is found and store it in a Map or somewhere else in memory.
while (its.hasNext()) {
            SortProvider sp = its.next() ;
            log("Found provider " + sp.getProviderName()) ;
            sMap.put(sp.getProviderName(),sp) ;
}
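
Putting the pieces together, a minimal self-contained sketch might look like the following. The SortProvider here is a one-method stand-in for the real SPI, and ProviderDiscovery is a hypothetical class name. Since nothing is registered under META-INF/services in this snippet, running it as is finds no providers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Minimal stand-in for the SPI; the real SortProvider has more to it
interface SortProvider {
    String getProviderName();
}

class ProviderDiscovery {
    // Loads every registered SortProvider and indexes it by name
    static Map<String, SortProvider> discover() {
        Map<String, SortProvider> providers = new LinkedHashMap<>();
        // ServiceLoader is Iterable, so a for-each loop drives the lazy lookup
        for (SortProvider sp : ServiceLoader.load(SortProvider.class)) {
            providers.put(sp.getProviderName(), sp);
        }
        return providers;
    }

    public static void main(String[] args) {
        System.out.println("Providers found: " + discover().keySet());
    }
}
```

The for-each form is equivalent to calling iterator() and looping with hasNext() and next(); it is only shorter to write.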

How does ServiceLoader know where to look?
  • Implementors package their implementation in a jar
  • The jar should have a META-INF/services directory
  • The services directory should have a file whose name is the fully qualified name of the type
  • The file has a list of fully qualified names of implementations of the type, one per line
  • The jar is installed on the classpath

I have a complete API/SPI example for a Sort interface that you can download at https://github.com/mdkhanga/my-blog-code. This sample is in the msort directory. You should download the code first, so that you can look at the code while reading the text below. This example illustrates how ServiceLoader is used to discover implementations from third party service providers. The Sort interface can be used for sorting data. Service providers can provide implementations of various Sort algorithms. In the example,

1. com.mj.msort.Sort is the main Sort API. It has 2 sort methods: one for arrays and one for collections. 2 implementations are provided - bubblesort and mergesort. But anybody can write additional implementations.

2. com.mj.msort.spi.SortProvider is the SPI. Third party implementors of Sort must also implement the SortProvider interface. The SPI provides another layer of encapsulation. We don't want to know the implementation details. We just want an instance of the implementation.

3. SPI providers need to implement Sort and SortProvider.

4. com.mj.msort.SortServices is a class that can discover and load SPI implementations and make them available to API users. It uses java.util.ServiceLoader to load SortProviders. Hence SortProvider also needs to be packaged as required by java.util.ServiceLoader for it to be discovered.

This is the class that brings everything together. It uses ServiceLoader to find all implementations of SortProviders and stores them in a Map. It has a getSort method that programmers can call to get a specific implementation or whatever is there.

5.  Sample Usage

Sort s = SortServices.getSort(...
s.sort(...

In summary, ServiceLoader is a powerful mechanism to find and load classes of a type. It can be used to build highly extensible and dynamic services. As an additional exercise, you can create your own implementation of SortProvider in your own jar and SortServices will find it as long as it is on the classpath.

Tuesday, August 26, 2014

Android programming tutorial

Android is an open source Linux based operating system for mobile devices like smart phones, tablets and other devices. The purpose of this blog is to introduce a developer to Android development. There are already many tutorials for Android. So why another? Mobile development is fun and easy. But despite lots of documentation from Google and several blogs, the initial startup for a new developer is not easy. There is substantial trial and error, even for an experienced programmer, before you get comfortable with the development process.

In the rest of the blog I will
  • Describe some android application concepts
  • Describe what SDKs and tools you need to download
  • Develop a very simple android application.
This blog will be most useful when used in conjunction with the official Android developers documentation. There are new terms like Activity or Layout that I describe only briefly. You should read more about it from the original documentation.

Concepts

  • Android applications are mostly developed in JAVA.
  • Android development is like any other event driven UI development: lay out UI elements on the screen and write code to handle events like the user tapping a button or a menu option.
  • An activity is a single screen of an application that a user interacts with.
  • An application may have many activities. Each activity has a layout that describes how the user interface widgets are laid out on the screen.
  • Activities communicate by sending Intents to each other. For example, if clicking a button needs to replace the current screen with another one, the current activity will send an intent to the activity that needs to come to the foreground.
  • Android SDK supports all the UI elements like text boxes, buttons, lists, menus, action bar etc. that are necessary to build a UI.
  • The layouts determine how the UI elements are positioned on the screen with respect to each other. With LinearLayout, the UI elements are positioned one after the other. With RelativeLayout, the UI elements are positioned relative to one another.
  • Additionally, there are APIs
    • to store data to a file or to a local SQLite relational database.
    • to make phone calls to other devices.
    • to send text messages to other devices.
    • to send messages to other applications.
  • Using HTTP, REST or other general purpose client libraries, you can make requests to remote servers.
  • Most JAVA libraries that you can use in any JAVA application are also usable in Android. (Of course, sometimes there are issues, such as supported JDK versions.)
Required Tools
  • JAVA SDK  
  • Android Studio
    • This has the Android SDK and an IntelliJ based IDE.
    • You could also use the Eclipse ADT or just the plain SDK with the command line.
    • For this tutorial I have used Android Studio 0.8.2.
  • Optional - A mobile device
    • Android SDK has emulators that you can run the app on. But they are slow.
    • Running on a real device gives more satisfaction. I used a Nexus 7. 
  • Optional - Download the source code for the tutorial below from https://sites.google.com/site/khangaonkar/home/android
In the rest of the blog we will work through a very simple tutorial to develop an android application.

Tutorial

Step 1: Download the android SDK

Download the Android SDK from http://developer.android.com/sdk/installing/index.html. The SDK is available in 3 flavors: Eclipse ADT, Android Studio (IntelliJ) and command line. For this tutorial, I used Android Studio because that seems to be the recommended direction from Google. But (except on Mac OS) Eclipse works fine as well.

Step 2 : Create a new project

Start Android Studio
Select File > New Project
Enter Application name and click next
Accept default for form factors and click next
Select the default blank activity and hit next
Select the defaults for the new activity and hit finish

You should see a project as shown below

[Screenshot: the new project open in Android Studio]

Step 3: Create an emulator
An emulator lets you test your application on a variety of devices without actually having the device. Let us create a Nexus 7 emulator.

Click Tools > Android > AVD Manager
Click Create and enter the information as shown below

[Screenshot: AVD settings for the Nexus 7 emulator]

Click Ok
Select the created device and hit Start
This will take a couple of minutes. The emulators are slow. Eventually you will see the window shown below

[Screenshot: the running emulator window]

In the main project, in the lower window, you should see that the emulator is detected.

[Screenshot: the emulator listed in the lower window of Android Studio]

Caution: Emulators are very slow and take a lot of time to start. The first time I install a new version of Android Studio or Eclipse ADT, they almost never work. It takes a little bit of trial and error to get them going.

Step 4 : Run the application

Click Run > Run App
When prompted, Select the emulator
The default app shows hello world on the screen

[Screenshot: the default app showing hello world]

Step 5: Review generated files

Under Greeting/app/src/main/java is the class com.mj.greeting.MyActivity. This is the main class that represents the logic around what is shown on the screen.
line 17  is setContentView(R.layout.activity_my);
This line sets the layout that is displayed on the screen. The layout is defined as an xml file Greeting/app/src/main/res/layout/activity_my.xml. The layout and any UI elements like edit boxes, buttons etc. and their properties are defined here. In this case, a RelativeLayout surrounds a TextView whose default value is Hello World.

Step 6: Add some new code
Let us add an EditText box and a button to the UI. The user can type a message in the edit box and then click the button. On clicking, the message replaces what is displayed in the TextView.

In the file Greeting/app/src/main/res/layout/activity_my.xml

add an android:id to the relativelayout
    xmlns:tools="http://schemas.android.com/tools"
    android:id="@+id/main"


add an android:id to the textview
        android:id="@+id/textview"
        android:text="@string/hello_world"


The ids will let us reference these widgets in code.

Add an edittext box
<EditText
        android:id="@+id/edittext"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_below="@id/textview"
        android:ems="10"
        android:layout_marginTop="10dp"
        android:text="greeting" android:inputType="text" />


and a button
<Button
        android:id="@+id/button"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_below="@+id/edittext"
        android:layout_marginTop="10dp"
        android:text="Update Greeting"
        android:onClick="onClick"/>


The android:onClick attribute names the method that is called when the user clicks the button. So we will need to add an onClick method implementation.

To the class com.mj.greeting.MyActivity add the method
public void onClick(View v) {
        View main = this.findViewById(R.id.main) ; // get a reference to the current view
        EditText edit = (EditText) main.findViewById(R.id.edittext) ; // get a reference to the edittext
        TextView tv= (TextView) main.findViewById(R.id.textview) ; // get the textview
        tv.setText(edit.getText()); // get the text entered in edittext and put it in the textview
    }


Run the application

[Screenshot: the app with the edit box and the Update Greeting button]

Step 7: Run on a real device

So far we have been running the application on an emulator. It is much more fun to run on a real device. Enable USB debugging on your device. On the Nexus 7, USB debugging is enabled by selecting the option in Settings/Developer Options.

Connect it to your development machine with a USB cable and select Run > Run App.

The application will be installed and run on the device.

[Screenshot: the app running on the device]

In summary, getting started with mobile development is simple and fun once you get comfortable with the concepts and tools.






Saturday, July 26, 2014

Distributed Systems : Consensus Protocols

Modern software systems scale by partitioning the data and distributing it across several machines. Systems are made highly available by replicating data across multiple machines. When multiple systems are involved in managing state, they need to agree when a particular piece of data needs to change.

You are familiar with the concept of a transaction in a relational database. A transaction is a unit of work (like an insert or update or some combination of multiple statements) that as a whole can be committed or aborted. What if the work involves updating multiple databases that are on different machines? To ensure consistent state across the system, all the databases should agree on what to do, whether to commit or abort the state change.

Modern distributed NoSql databases have a similar but slightly different problem. If you had a single server and set a value v=8 in the server, there is no doubt what the value of v is. Any client that connects to the server reads the value as 8. What if you had a cluster of 3 servers? Would a client connecting to one of the servers see the value as 8? Consensus is required to ensure all servers agree on what the value of v is.

Consider systems like Apache Zookeeper or Apache Cassandra. To ensure high availability, clients can connect to any node in the cluster and read  or write data. To ensure consistency in the cluster, some consensus is required among the nodes in the cluster when state changes.

In the rest of this blog we briefly cover some distributed protocols, starting with two phase commit, which users of relational databases are very familiar with. We will then talk about Paxos, ZAB and Raft. Paxos became popular because it was used by Google for its distributed systems. ZAB is used by Zookeeper, which is an important component of the Hadoop ecosystem. These protocols are hard to understand and no attempt is made to go into detail. The purpose is to introduce readers to some of the consensus concepts that are important in distributed systems.

1. Two phase commit

Used in databases to ensure all participants in distributed updates either commit or abort the changes.
One node called the co-ordinator originates the transaction.

1.1 Co-ordinator sends a prepare message to all participants.
1.2 Each participant replies with a yes if it can commit its part of the transaction or No otherwise.
1.3 If the co-ordinator receives a yes from all participants, it sends a commit message to the participants. Otherwise it sends an abort message.
1.4 If the participant receives a commit message, it commits its change. If it receives an abort message, it aborts the change. In both cases, it sends an acknowledgement back to the co-ordinator.
1.5 The transaction is complete when the co-ordinator receives all acknowledgments.

One limitation of this protocol is that if the co-ordinator crashes after the participants have voted yes, the participants do not know whether to commit or abort the transaction, as they do not know how the other participants responded. They have to wait until the co-ordinator recovers.
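
The happy path of steps 1.1 to 1.4 can be sketched in a few lines. This is a single process simulation under simplifying assumptions: there is no network, no timeouts and no crash recovery, and Participant, RecordingParticipant and TwoPhaseCommit are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical participant contract for this sketch
interface Participant {
    boolean prepare();   // vote: true = yes, false = no
    void commit();
    void abort();
}

// In-memory participant that records the outcome it was told to apply
class RecordingParticipant implements Participant {
    private final boolean vote;
    String outcome = "pending";

    RecordingParticipant(boolean vote) { this.vote = vote; }

    public boolean prepare() { return vote; }
    public void commit() { outcome = "committed"; }
    public void abort() { outcome = "aborted"; }
}

class TwoPhaseCommit {
    // Returns true when the transaction committed on every participant
    static boolean run(List<? extends Participant> participants) {
        // Phase 1: collect votes (steps 1.1 and 1.2)
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare()) {
                allYes = false;
            }
        }
        // Phase 2: broadcast the decision (steps 1.3 and 1.4)
        for (Participant p : participants) {
            if (allYes) {
                p.commit();
            } else {
                p.abort();
            }
        }
        return allYes;
    }

    public static void main(String[] args) {
        List<RecordingParticipant> nodes = new ArrayList<>();
        nodes.add(new RecordingParticipant(true));
        nodes.add(new RecordingParticipant(false));
        System.out.println("committed = " + run(nodes)); // committed = false
    }
}
```

A single no vote is enough to abort everywhere, which is what keeps the participants consistent with each other.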

2. Three phase commit

The protocol attempts to let the participants make progress even if the co-ordinator fails.

2.1 Co-ordinator sends a prepare message to all participants.
2.2 Each participant replies with a yes if it can commit its part of the transaction or No otherwise.
2.3 If the co-ordinator receives yes from all participants, it sends a pre-commit message to all participants.
2.4 When the co-ordinator receives an acknowledgment from a majority of participants, it sends a commit message to all participants.

If the co-ordinator fails, the participants can communicate with each other and determine whether to commit or abort.

3. PAXOS

Paxos was first published in the nineties, but it became more popular after Google implemented and used it in its distributed infrastructure. The protocol is notorious for being difficult to understand. Below is a very brief description. See references for more details.

There are nodes that propose values, called proposers, and nodes that accept values, called acceptors.

3.1 A proposer with a value to propose submits a proposal (v,n) with value v and sequence number n.

3.2 When an acceptor receives a proposal (v,n), it compares n with the highest sequence number it has agreed to so far. If n is higher, the acceptor replies agree and sends the value of any proposal it has previously accepted. If the acceptor has already agreed to a higher sequence number, it rejects the current proposal.

3.3 If the proposer receives agree from a majority of acceptors, it must use the value sent by the acceptors that belongs to the highest numbered accepted proposal. If the acceptors have not sent any value, it can pick its own value. It then sends a commit message with the chosen value to the acceptors. If a majority reject or do not respond, it aborts this proposal and tries another one with a higher sequence number.

3.4 When the acceptor receives a commit message, it agrees to commit if the sequence number is the highest it has agreed to or if the value is the same as the last accepted proposal. Otherwise it rejects the commit.

3.5 If a majority accept the commit, the proposal is complete. Otherwise abort and try again.

The key takeaway is that majorities are used to accept proposals. If there are multiple proposers competing, it is possible that no progress is made in accepting values. The solution is to elect a leader that proposes values. Other players in the system could be learners, who learn about accepted values from either the leader or other participants.
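
To make the majority rule concrete, here is a single process sketch of one round for a single value. It follows the common prepare/promise and accept formulation of Paxos, which corresponds only loosely to steps 3.1 to 3.5 above. Promise, Acceptor and Proposer are hypothetical names, and networking, failures and persistence are left out.

```java
import java.util.Arrays;
import java.util.List;

// Reply to a prepare request, carrying any value the acceptor already accepted
class Promise {
    final boolean ok;
    final long acceptedN;     // -1 when nothing has been accepted yet
    final String acceptedV;

    Promise(boolean ok, long acceptedN, String acceptedV) {
        this.ok = ok;
        this.acceptedN = acceptedN;
        this.acceptedV = acceptedV;
    }
}

class Acceptor {
    private long promisedN = -1;
    private long acceptedN = -1;
    private String acceptedV = null;

    // Phase 1: promise to ignore proposals numbered below n
    Promise prepare(long n) {
        if (n > promisedN) {
            promisedN = n;
            return new Promise(true, acceptedN, acceptedV);
        }
        return new Promise(false, -1, null);
    }

    // Phase 2: accept unless a higher numbered promise was made meanwhile
    boolean accept(long n, String v) {
        if (n >= promisedN) {
            promisedN = n;
            acceptedN = n;
            acceptedV = v;
            return true;
        }
        return false;
    }
}

class Proposer {
    // Runs one round; returns the chosen value, or null if no majority was reached
    static String propose(List<Acceptor> acceptors, long n, String myValue) {
        int majority = acceptors.size() / 2 + 1;
        int promises = 0;
        long highestSeen = -1;
        String value = myValue;
        for (Acceptor a : acceptors) {
            Promise p = a.prepare(n);
            if (!p.ok) {
                continue;
            }
            promises++;
            // Adopt the value of the highest numbered proposal already accepted
            if (p.acceptedV != null && p.acceptedN > highestSeen) {
                highestSeen = p.acceptedN;
                value = p.acceptedV;
            }
        }
        if (promises < majority) {
            return null;
        }
        int accepts = 0;
        for (Acceptor a : acceptors) {
            if (a.accept(n, value)) {
                accepts++;
            }
        }
        return accepts >= majority ? value : null;
    }

    public static void main(String[] args) {
        List<Acceptor> acceptors = Arrays.asList(new Acceptor(), new Acceptor(), new Acceptor());
        System.out.println(propose(acceptors, 1, "A")); // A
        System.out.println(propose(acceptors, 2, "B")); // A, the earlier value is kept
        System.out.println(propose(acceptors, 1, "C")); // null, sequence number is stale
    }
}
```

The second call shows why a later proposer cannot overwrite a value that a majority has already accepted: it learns the value during the prepare phase and has to propose it again.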

4. ZAB (Zookeeper Atomic Broadcast)

ZAB was developed for use in Apache Zookeeper due to limitations in PAXOS. In Zookeeper, the order in which changes are applied is important. In PAXOS, it is possible that updates get applied by acceptors out of order.

ZAB is similar to PAXOS in that a leader proposes values and values are accepted based on majority vote. The key difference is that strict order of updates is maintained. If the leader crashes and a new leader is elected, the updates are applied in the original order.

5. RAFT

RAFT is another distributed consensus protocol that claims to be simpler than PAXOS or ZAB.

A node can be a leader, a follower or a candidate.

5.1 By default all nodes are followers. When there is no leader, a node can make itself a candidate for leadership and solicit votes.

5.2 The candidate that gets majority votes is elected leader.

5.3 A client submits its updates to the leader. Leader updates a log (uncommitted) and sends the update to followers.

5.4 When the leader hears from a majority of followers that they have made the update, the leader commits the change and informs the followers of the commit.

5.5 Followers commit the update.

5.6 If a leader terminates for some reason, one of the followers turns itself into a candidate and gets elected as the leader.
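
Steps 5.3 and 5.4 boil down to counting copies. The sketch below shows only that majority rule, under heavy simplification: there are no terms, elections or log repair, a follower is either always reachable or never reachable, and Follower and Leader are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Follower {
    final List<String> log = new ArrayList<>();
    final boolean reachable;

    Follower(boolean reachable) { this.reachable = reachable; }

    // Appends the entry and acknowledges, unless the node is unreachable
    boolean append(String entry) {
        if (!reachable) {
            return false;
        }
        log.add(entry);
        return true;
    }
}

class Leader {
    final List<String> log = new ArrayList<>();
    int commitIndex = 0;   // number of committed entries
    private final List<Follower> followers;

    Leader(List<Follower> followers) { this.followers = followers; }

    // Replicates the entry and commits it once a majority of the cluster holds it
    boolean submit(String entry) {
        log.add(entry);                 // uncommitted on the leader
        int copies = 1;                 // the leader's own copy counts
        for (Follower f : followers) {
            if (f.append(entry)) {
                copies++;
            }
        }
        int clusterSize = followers.size() + 1;
        if (copies >= clusterSize / 2 + 1) {
            commitIndex = log.size();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 5 node cluster with 2 unreachable followers: 3 copies are a majority
        Leader leader = new Leader(Arrays.asList(
                new Follower(true), new Follower(true),
                new Follower(false), new Follower(false)));
        System.out.println(leader.submit("x=1")); // true
    }
}
```

With 5 nodes the cluster keeps committing as long as any 3 are up, which is why such clusters are usually sized with an odd number of nodes.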

We have given a brief description of some consensus protocols. If you use Hadoop, Cassandra, Kafka or similar distributed systems, you will run into these protocols. For more details, some references are provided below.

References:

1. Database Management Systems by Ramakrishnan and Gehrke
2. PAXOS made simple
3. PAXOS by example
4. The secret lives of data
5. Apache Zookeeper
6. Paxos paper trail
7. Raft Consensus

Friday, June 27, 2014

Apache Cassandra : Things to consider before choosing Cassandra

A lot has been written about NoSql databases. There is a lot of hype surrounding many of these databases. Unfortunately most written material either sings praises about a particular database or trashes it. I am also starting to see people pick databases for the wrong reasons. The purpose of this blog is to highlight the things to consider before choosing Cassandra as your database.

1. Scaling by partitioning data

Cassandra is designed to store large quantities of data - several hundreds of terabytes or petabytes that typically cannot be stored on a single machine. Cassandra solves the problem by partitioning the data across machines in a cluster using a consistent hash. When data is partitioned across several machines, some of the things we are used to in relational databases, like consistency and transactions, are difficult to implement. Hence those features are weak or in some cases not available. So the ability to scale comes at the expense of other features.
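
The idea behind consistent hashing can be sketched with a sorted map. This is an illustration of the technique and not Cassandra's partitioner: the choice of MD5, the absence of virtual nodes and replicas, and the HashRing class itself are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Maps a string to a ring position using the first 8 bytes of its MD5 digest
    static long position(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xff);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void addNode(String node) { ring.put(position(node), node); }

    void removeNode(String node) { ring.remove(position(node)); }

    // A key belongs to the first node at or after its position, wrapping around
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(position(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("node1");
        ring.addNode("node2");
        ring.addNode("node3");
        System.out.println("user:42 is stored on " + ring.nodeFor("user:42"));
    }
}
```

The useful property is that adding or removing a node only moves the keys next to that node on the ring; keys owned by other nodes stay where they are.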

The single biggest mistake people make is using Cassandra when their data is not large enough to merit partitioning. If, in the foreseeable future, your data size is a few hundred gigabytes, stick to MySQL or another relational database of your choice. Even if your data size grows in the future, you can always port to Cassandra when you reach the stage of a few TB. This is especially true if you are building a new application with limited resources. Do not let the complexity of Cassandra slow down the rest of your feature development.

2. High availability

The CAP theorem states that out of consistency, availability and partition tolerance , it is possible for a system to have only 2 of 3. Cassandra is designed for availability and partition tolerance.

If your application's primary requirement is high availability, Cassandra can be a great choice. With its shared nothing architecture, where all nodes are equal, multiple nodes can go down and the database is still available. Clients can connect to any node, and that node will get/put the data from/to the node that is required to handle that data. Replication ensures that if the primary node that handles the data goes down, a replica is able to service the request.

3. Replication

Replication has 2 purposes. First, it provides redundancy for data in case of failures. Second, it makes copies of data available closer to where it is consumed or served. In many databases, setting up replication is cumbersome. Not in Cassandra. Replication is core to the architecture. Replication is configured at a keyspace level by specifying a replication strategy and the number of replicas, and the data is replicated within the cluster or across data centers as required.

If replication is important, especially across data centers, Cassandra is a great choice.

4. Optimized for writes

Write operations append to a commit log on disk and update an in-memory data structure called a Memtable, then return immediately. Nothing is locked and there are no disk reads or seeks on the write path, so writes are very fast. When a Memtable reaches a certain size, it is flushed to disk to a file called an SSTable. Reads may have to go through multiple SSTables and aggregate changes to return correct data. For this reason, reads might not be that fast.
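
The shape of this write and read path can be sketched with in-memory maps. TinyStore is a hypothetical toy, not Cassandra code: flushed tables stay in memory instead of going to disk, there are no deletes, and it leaves out the on-disk commit log that Cassandra also keeps for durability.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TinyStore {
    private final int flushThreshold;
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Immutable flushed tables, oldest first
    private final List<Map<String, String>> sstables = new ArrayList<>();

    TinyStore(int flushThreshold) { this.flushThreshold = flushThreshold; }

    // Writes touch only the in-memory table; a full table is flushed as a unit
    void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            sstables.add(Collections.unmodifiableMap(memtable));
            memtable = new TreeMap<>();
        }
    }

    // Reads check the memtable, then flushed tables from newest to oldest
    String get(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String v = sstables.get(i).get(key);
            if (v != null) {
                return v;
            }
        }
        return null;
    }

    int sstableCount() { return sstables.size(); }

    public static void main(String[] args) {
        TinyStore store = new TinyStore(2);
        store.put("a", "1");
        store.put("b", "2");   // first flush
        store.put("a", "3");
        store.put("c", "4");   // second flush
        store.put("b", "5");   // stays in the memtable
        System.out.println(store.get("a") + " " + store.get("b") + " " + store.get("c")); // 3 5 4
        System.out.println(store.sstableCount()); // 2
    }
}
```

A write is a single map update, while a read may have to look in the memtable and in every flushed table, which mirrors the trade-off described above.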

If you have a workload that involves a lot of writes and few reads, then Cassandra is a suitable database. A common use case is storing data from log files of high volume production web servers that service several billion requests a day. An analytics application would potentially read the data, but the read volume is low because the reads are done by in house business analysts and not internet users.

5. Compaction

Over time several SSTables get created and reads have to go through multiple SSTables to get to data. Periodically Cassandra will asynchronously merge smaller SSTables into larger SSTables. People have complained that during compaction, things slow down and throughput degrades. There are probably ways to tune this, but you should be aware of compaction when using Cassandra.
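
At its core the merge is simple, as the following sketch suggests. It assumes each table is a sorted map and that tables are passed oldest first; the Compaction class is hypothetical, and real compaction also deals with timestamps, deletes and files on disk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class Compaction {
    // Merges tables given oldest first; later tables overwrite earlier ones
    static TreeMap<String, String> compact(List<? extends Map<String, String>> oldestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Map<String, String> table : oldestFirst) {
            merged.putAll(table);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> older = new TreeMap<>();
        older.put("a", "1");
        older.put("b", "2");
        Map<String, String> newer = new TreeMap<>();
        newer.put("a", "3");
        newer.put("c", "4");
        System.out.println(compact(Arrays.asList(older, newer))); // {a=3, b=2, c=4}
    }
}
```

After the merge a read for any key needs to look in one table instead of two, at the cost of the work done while merging.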

6. Limited querying capability

Cassandra supports a SQL like language called CQL. It is "SQL like" and not SQL. Many very basic things like aggregation operators are not supported. Joins of tables are not supported. Range queries on the partition key are not supported. Range queries are possible within a partition using the clustering columns, but it requires some additional data modeling.

The bottom line is that Cassandra partitions the data based on a consistent hash of the partition key, and look ups are possible based only on the partition key. Anything else requires additional modeling that involves what are called clustering columns.

7. Consistency model

Cassandra was inspired by Amazon's Dynamo database where the model was eventual consistency. When a client requested data and there was inconsistency between the values in the nodes  of a cluster, the server returned a vector clock to the client and it was the responsibility of the client to resolve any conflict.

Cassandra's model is tunable consistency. For a read or write operation, the client can specify a consistency level such as ANY, ALL, QUORUM, ONE, TWO etc. However, when there are concurrent writes, the order is determined based on machine time stamps. So it is important that clocks on nodes in the cluster be synchronized. Getting the consistency model to work requires time and effort on the part of the developer. If the kind of strong consistency we are used to in relational databases is important to you, Cassandra will not be a suitable choice.
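
A rule of thumb that helps when picking levels: if the number of replicas read plus the number of replicas written is greater than the replication factor, every read overlaps the latest acknowledged write on at least one replica. The small sketch below works through the arithmetic; ConsistencyMath is a hypothetical helper and not part of any Cassandra driver.

```java
class ConsistencyMath {
    // Replicas that must respond for QUORUM with the given replication factor
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when read and write replica sets are guaranteed to overlap (R + W > N)
    static boolean readsSeeLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("QUORUM for RF=3: " + q);                           // 2
        System.out.println("QUORUM/QUORUM: " + readsSeeLatestWrite(q, q, rf)); // true
        System.out.println("ONE/ONE: " + readsSeeLatestWrite(1, 1, rf));       // false
    }
}
```

Overlap only guarantees that a read reaches a replica holding the latest write. Which of two concurrent writes counts as the latest is still decided by timestamps.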
 
8. Frequent updates

Based on what is discussed in (4) and (7), Cassandra is not suitable for use cases where you update column values frequently. When concurrent updates happen, Cassandra uses timestamps to determine which update happened first, and you could sometimes encounter the lost update problem. To work around the problem, you have to append updates to a collection or wide columns and then aggregate the final value on reads. Again, this is additional work in data modeling and programming, and you might be better off using another database if frequent updates are an integral part of your use case.
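
The difference between the two approaches can be seen in a few lines. The classes below are hypothetical in-memory stand-ins: LastWriteWinsCell mimics a column resolved by timestamp, and AppendOnlyCounter mimics appending updates and aggregating on read.

```java
import java.util.ArrayList;
import java.util.List;

class LastWriteWinsCell {
    private long timestamp = Long.MIN_VALUE;
    private int value;

    // Keeps whichever write carries the larger timestamp
    void write(int newValue, long writeTimestamp) {
        if (writeTimestamp > timestamp) {
            timestamp = writeTimestamp;
            value = newValue;
        }
    }

    int read() { return value; }
}

class AppendOnlyCounter {
    private final List<Integer> deltas = new ArrayList<>();

    void append(int delta) { deltas.add(delta); }

    // Aggregates all appended updates on read
    int read() {
        int sum = 0;
        for (int d : deltas) {
            sum += d;
        }
        return sum;
    }
}

class LostUpdateDemo {
    public static void main(String[] args) {
        // Two clients both read 10, add 5 and write the result back
        LastWriteWinsCell cell = new LastWriteWinsCell();
        cell.write(10, 1);
        cell.write(15, 2); // client A: 10 + 5
        cell.write(15, 3); // client B: also started from 10, so one increment is lost
        System.out.println(cell.read()); // 15

        // Appending the individual updates keeps both increments
        AppendOnlyCounter counter = new AppendOnlyCounter();
        counter.append(10);
        counter.append(5);
        counter.append(5);
        System.out.println(counter.read()); // 20
    }
}
```

The append approach never overwrites anything, so there is nothing for the timestamps to arbitrate, but every read now pays for the aggregation.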

In summary, Cassandra is an excellent choice as a database when your data size is very large and high availability or replication are important. But it is not a general purpose database. Some of the scalability comes at a cost and you give up other features like consistency or querying.

For additional information on Cassandra, please check DataStax documentation. You can also read these blogs:
HBase vs Cassandra
Cassandra Data Model

Wednesday, May 28, 2014

Apache Kafka JAVA tutorial

In Apache Kafka introduction we discussed some key features of Kafka. In this tutorial we will set up a small Kafka cluster. We will send messages to a topic using a JAVA producer. We will consume the messages using a JAVA consumer.

For this tutorial you will need

(1) Apache Kafka
(2) Apache Zookeeper
(3) JDK 7 or higher. An IDE of your choice is optional
(4) Apache Maven
(5) Source code for this sample kafkasample.zip

Zookeeper is required as the Kafka broker uses Zookeeper to store topic configuration and consumer information.

We will set up a 3 node cluster. I will assume you have 3 machines: host1, host2, host3. We will run a Kafka broker on each host. We will run Zookeeper on host1 and host2. You can do this tutorial on one machine, but you will need to change the port numbers to avoid conflicts.

Step 1: Download the latest binaries

Apache Kafka can be downloaded at http://kafka.apache.org/downloads.html
Apache Zookeeper can be downloaded at http://zookeeper.apache.org/releases.html

Step 2: Install Zookeeper

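We will install zookeeper on host1 and host2. Note that a Zookeeper ensemble needs a majority of its servers to be up, so two servers keep this tutorial small but do not survive a failure. A production cluster should have at least 3 servers (an odd number) so that it is still usable if one of the servers goes down.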

For instructions on how to set up a zookeeper cluster see http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html. The instructions are simple and we will not repeat them here.

Step 3: Configure the Kafka Brokers

Assume you have unzipped the binaries to /usr/local/kafka.

Edit the config/server.properties on each host as follows:

Each broker needs a unique id. On host1 set
broker.id=1
On host2 set it to 2 and on host3 set it to 3

On each host, set the directory where the messages are stored
log.dirs=/srv/apps/kafka/data

Tell the brokers which zookeepers to connect to
zookeeper.connect=host1:2181,host2:2181

These are the only properties that need changes. All the other default properties in server.properties are good enough and you can change them as required.
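As a quick recap, the three overrides for each host can be written out with a few lines of Java. This is only a convenience sketch for checking what each server.properties should contain. The class and method names are made up for this example, and the values are the ones used in this step.

```java
import java.util.Properties;

class BrokerConfigSketch {
    // Builds the three overrides from this step for broker number id (1, 2 or 3).
    static Properties overridesFor(int id) {
        Properties p = new Properties();
        p.setProperty("broker.id", String.valueOf(id));
        p.setProperty("log.dirs", "/srv/apps/kafka/data");
        p.setProperty("zookeeper.connect", "host1:2181,host2:2181");
        return p;
    }

    public static void main(String[] args) {
        String[] names = {"broker.id", "log.dirs", "zookeeper.connect"};
        for (int id = 1; id <= 3; id++) {
            Properties p = overridesFor(id);
            System.out.println("# host" + id);
            for (String name : names) {
                System.out.println(name + "=" + p.getProperty(name));
            }
        }
    }
}
```

Only broker.id differs between hosts. The other two lines are identical everywhere.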

Step 4: Start the brokers

On each server, from /usr/local/kafka/bin execute

./kafka-server-start.sh ../config/server.properties &

Step 5: Create a topic

Let us create a topic mytopic with 3 partitions and a replication factor of 2.

/usr/local/kafka/bin$ kafka-topics.sh --create --zookeeper host1:2181 --replication-factor 2 --partitions 3 --topic mytopic

Step 5: Write a JAVA producer

The complete source code is at kafkasample.zip.

Create the properties that need to be passed to the producer.
       
        Properties props = new Properties();
        props.put("metadata.broker.list", "host2:9092,host3:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1");
        ProducerConfig config = new ProducerConfig(props);

metadata.broker.list is the list of brokers that the producer can try to connect to.
We are sending text messages, so we use the String encoder. Setting request.required.acks to 1 ensures that a publish message request is considered completed only when an acknowledgment is received from the leader.
            
        Producer<String, String> producer = new Producer<String, String>(config);
     

The above line creates a producer which will send key and value both of type String.
  
        String topic = "mytopic";
        for (int i = 1 ; i <= 1000 ; i++) {

            String msg = " This is message " + i ;

            // Key and value are both Strings; the key is the message number
            KeyedMessage<String, String> data = new KeyedMessage<String, String>(topic, String.valueOf(i), msg);
            producer.send(data);

        }
        producer.close();

       

The code above sends 1000 messages to the topic.
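Each message is sent with a key, and the key decides which of the 3 partitions the message lands in. The sketch below is an assumption about how a key-based partitioner behaves (a non-negative hash of the key modulo the partition count) and not a copy of Kafka's code. The class name and method are hypothetical. It shows roughly how the 1000 keys from the loop above would spread out.

```java
class KeyPartitionSketch {
    // Hypothetical stand-in for a key-based partitioner:
    // non-negative hash of the key, modulo the number of partitions.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        int[] counts = new int[numPartitions];
        // Same keys as the producer loop: "1" .. "1000"
        for (int i = 1; i <= 1000; i++) {
            counts[partitionFor(String.valueOf(i), numPartitions)]++;
        }
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " messages");
        }
    }
}
```

The same key always maps to the same partition, which is what lets Kafka preserve ordering for messages that share a key.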

Step 6: Write a JAVA consumer

Our topic has 3 partitions. Messages within a partition are delivered to a consumer in order. So one consumer per partition makes sense. In our consumer, we will create 3 threads, one for each partition. We use what is called the High level consumer API.

First setup the consumer configuration

        Properties props = new Properties();
        props.put("zookeeper.connect", "host1:2181,host2:2181");
        props.put("group.id", "mygroupid2");
        props.put("zookeeper.session.timeout.ms", "413");
        props.put("zookeeper.sync.time.ms", "203");
        props.put("auto.commit.interval.ms", "1000");
        ConsumerConfig cf = new ConsumerConfig(props) ;

zookeeper.connect tells the consumer which zookeeper to connect to. The consumer needs to connect to zookeeper to get topic information as well as to store consumer offsets. group.id is the name of the consumer group. Every message is delivered to each consumer group. auto.commit.interval.ms tells how often the consumer should commit offsets to zookeeper.

       ConsumerConnector consumer = Consumer.createJavaConsumerConnector(cf) ;
       String topic = "mytopic"  ;
       // Ask for 3 streams for this topic, one per partition
       Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
       topicCountMap.put(topic, Integer.valueOf(3));
       Map<String,List<KafkaStream<byte[],byte[]>>> consumerMap =
               consumer.createMessageStreams(topicCountMap);
       List<KafkaStream<byte[],byte[]>> streams = consumerMap.get(topic);


The code above creates a ConsumerConnector and gets a list of KafkaStreams. For this topic, we indicate 3 streams are required ( 1 for each partition) and the connector creates 3 streams.

        ExecutorService executor = Executors.newFixedThreadPool(3);
        int threadnum = 0 ;
        // One runnable per stream, each identified by its thread number
        for (KafkaStream<byte[],byte[]> stream : streams) {
            executor.execute(new KafkaPartitionConsumer(threadnum,stream));
            ++threadnum ;
        }


The code above creates a threadpool and submits a runnable KafkaPartitionConsumer that will read the stream. The code for the runnable is shown below.

        public static class KafkaPartitionConsumer implements Runnable {
            private int tnum ;
            private KafkaStream<byte[],byte[]> kfs ;
            public KafkaPartitionConsumer(int id, KafkaStream<byte[],byte[]> ks) {
                tnum = id ;
                kfs = ks ;
            }
            public void run() {
                System.out.println("This is thread " + tnum) ;
                // Typed iterator, so that message() returns byte[]
                ConsumerIterator<byte[],byte[]> it = kfs.iterator();
                int i = 1 ;
                while (it.hasNext()) {
                    System.out.println(tnum + " " + i + ": " + new String(it.next().message()));
                    ++i ;
                }
            }

        }

Each thread is reading from a stream from a particular partition. If there are no messages, the call hasNext() will block.
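The blocking behaviour can be tried out without a Kafka cluster. The sketch below is a stand-in that uses a BlockingQueue per partition in place of a KafkaStream. The class name, the END marker and the 10 second wait are all assumptions made so that the example terminates; the real consumer threads simply keep waiting for more messages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BlockingStreamSketch {
    static final String END = "__END__"; // hypothetical end marker so the sketch can stop

    // Starts one thread per queue and returns the number of messages consumed.
    static int consumeAll(List<BlockingQueue<String>> partitions) {
        ExecutorService executor = Executors.newFixedThreadPool(partitions.size());
        final AtomicInteger consumed = new AtomicInteger();
        int threadnum = 0;
        for (final BlockingQueue<String> partition : partitions) {
            final int tnum = threadnum++;
            executor.execute(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            String msg = partition.take(); // blocks while the queue is empty
                            if (END.equals(msg)) {
                                return;
                            }
                            System.out.println(tnum + ": " + msg);
                            consumed.incrementAndGet();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        executor.shutdown();
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumed.get();
    }

    public static void main(String[] args) {
        List<BlockingQueue<String>> partitions = new ArrayList<>();
        for (int p = 0; p < 3; p++) {
            BlockingQueue<String> q = new LinkedBlockingQueue<>();
            q.add("message " + p + "-a");
            q.add("message " + p + "-b");
            q.add(END);
            partitions.add(q);
        }
        System.out.println("consumed " + consumeAll(partitions));
    }
}
```

Within one queue the messages come out in the order they went in, while the output of the three threads interleaves freely. That mirrors the per-partition ordering described above.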

Step 7 : Start the consumer

I built my code using Maven. When there are dependencies on jars, it is also easier to use Maven to run the program, as it pulls all the dependencies into the classpath automatically.

mvn exec:java -Dexec.mainClass="com.mj.KafkaMultiThreadedConsumer"

Step 8 : Start the producer

mvn exec:java -Dexec.mainClass="com.mj.KafkaProducer"

You should see the consumer print out the messages.

You can start multiple consumers with a different groupid and they will each receive all the messages.



Friday, April 18, 2014

Apache Kafka Introduction : Should I use Kafka as a message broker ?

Asynchronous messaging is an important component of any distributed application. Producers and consumers of messages are de-coupled. Producers send messages to a queue or topic. Consumers consume messages from the queue or topic. The consumers do not have to be running when the message is sent. New consumers can be added on the fly. For Java programmers, JMS was and is the popular API for programming messaging applications. ActiveMQ, RabbitMQ and MQSeries (henceforth referred to as traditional brokers) are some of the popular message brokers that are widely used. While these brokers are very popular, they do have some limitations when it comes to internet scale applications. Generally their throughput will max out at a few tens of thousands of messages per second. Also, in many cases, the broker is a single point of failure.

A message broker is a little bit like a database. It takes a message from a producer and stores it. Later a consumer reads the message. The concepts involved in scaling a message broker are the same concepts as in scaling databases. Databases are scaled by partitioning the data storage, and we have seen that applied in Hadoop, HBASE, Cassandra and many other popular open source projects. Replication adds redundancy and failure tolerance.

A common use case in internet companies is that log messages from thousands of servers need to be sent to other servers that do number crunching and analytics. The rate at which messages are produced and consumed is several thousand per second, much higher than in a typical enterprise application. This needs message brokers that can handle internet scale traffic.

Apache Kafka is an open source message broker that claims to support internet scale traffic. Some key highlights of Kafka are
  • Message broker is a cluster of brokers. So there is partitioning and no single point of failure.
  • Producers send messages to Topics.
  • Messages in a Topic are partitioned among brokers so that you are not limited by machine size.
    • For each topic partition 1 broker is a leader
    • leader handles reads and writes
    • followers replicate
  • For redundancy, partitions can be replicated.
  • A topic is like a log file with new messages appended to the end.
  • Messages are deleted after a configurable period of time, unlike other messaging systems where a message is deleted after it is consumed. A consumer can re-consume messages if necessary.
  • Each consumer maintains the position in the log file where it last read.
  • Point to point messaging is implemented using consumer groups. A consumer group is a set of consumers with the same groupid. Within a group, each message is delivered to only one member of the group.
  • Every message is delivered at least once to every consumer group. You can get publish subscribe using multiple consumer groups.
  • Ordering of messages is preserved per partition. A partition is assigned to a consumer within a consumer group. If you have the same number of partitions and consumers in a group, then each consumer is assigned one partition and will get messages from that partition in order.
  • Message delivery: For a producer, once a message is committed, it will be available as long as at least one replica is available. For the consumer, by default, Kafka provides at least once delivery, which means that in case of a crash, the message could be delivered multiple times. However, with each consume, Kafka returns the offset in the logfile. The offset can be stored with the message consumed, and in the event of a consumer crash, the consumer that takes over can start reading from the stored offset. For both producer and consumer, acknowledgement from the broker is configurable.
  • Kafka uses zookeeper to store metadata.
  • Producer API is easy to use. There are 2 consumer APIs.
  • High level API is the simple API to use when you don't want to manage the read offset within the topic. ConsumerConnector is the consumer class in this API and it stores offsets in zookeeper.
  • What they call the Simple API is the hard to use API, to be used when you want low level control of read offsets.
  • Relies on filesystem for storage and caching. Caching is file system page cache.
  • O(1) reads and writes since messages are written to the end of the log and read sequentially. Reads and writes are batched for further efficiency.
  • Developed in the Scala programming language
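Several of the points above (a topic partition is an append-only log, each consumer group tracks its own read position, messages can be re-consumed) can be made concrete with a small in-memory sketch. This is a toy model for one partition, not Kafka code. PartitionLogSketch and its methods are names invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionLogSketch {
    private final List<String> log = new ArrayList<>();            // append-only messages
    private final Map<String, Integer> offsets = new HashMap<>();  // next offset per consumer group

    // New messages always go to the end of the log.
    void append(String message) {
        log.add(message);
    }

    // Returns the next unread message for the group, or null if the group is caught up.
    // Reading does not delete anything; it only advances the group's offset.
    String poll(String groupId) {
        int offset = offsets.containsKey(groupId) ? offsets.get(groupId) : 0;
        if (offset >= log.size()) {
            return null;
        }
        offsets.put(groupId, offset + 1);
        return log.get(offset);
    }

    // Rewind a group so that it can re-consume older messages.
    void seek(String groupId, int offset) {
        offsets.put(groupId, offset);
    }

    public static void main(String[] args) {
        PartitionLogSketch partition = new PartitionLogSketch();
        partition.append("m1");
        partition.append("m2");
        System.out.println(partition.poll("groupA")); // m1
        System.out.println(partition.poll("groupA")); // m2
        System.out.println(partition.poll("groupB")); // m1, groupB has its own offset
        partition.seek("groupA", 0);
        System.out.println(partition.poll("groupA")); // m1 again, re-consumed
    }
}
```

Two groups reading the same log give you publish subscribe. Several consumers sharing one group's offsets give you point to point delivery.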
Apache Kafka can be downloaded at http://kafka.apache.org/downloads.html.

They have a good starter tutorial at http://kafka.apache.org/documentation.html#quickstart. So I will not repeat it. I will however write a future tutorial for JAVA producers and consumers.

Apache Kafka is a suitable choice for a messaging engine when
  • You have a very high volume of messages - several billion per day
  • You need high throughput
  • You need the broker to be highly available
  • You need cross data center replication
  • Your messages are logs from web servers
  • Some loss of messages is tolerable
Some concerns that you need to be aware of are
  • Compared to JMS, the APIs are low level and hard to use
  • APIs are not well documented. Documentation does not have javadocs
  • APIs are changing and the product is evolving
  • Default delivery is at least once delivery. Once and only once delivery requires additional work for the application developer
  • Application developer needs to understand lower level storage details like partitions and consumer read offsets within the partition
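The additional work for once and only once processing usually comes down to remembering the last offset you processed and ignoring anything at or below it. Here is a minimal sketch of that idea for a single partition. The class is hypothetical, and in a real application the offset would have to be stored together with the results (for example in the same database transaction) for this to hold across crashes.

```java
class IdempotentConsumerSketch {
    private long lastProcessedOffset = -1; // in a real system, stored together with the results
    private long total = 0;

    // Applies the message only if its offset has not been processed before.
    boolean process(long offset, long amount) {
        if (offset <= lastProcessedOffset) {
            return false; // redelivered message, ignore it
        }
        total += amount;
        lastProcessedOffset = offset;
        return true;
    }

    long total() {
        return total;
    }

    public static void main(String[] args) {
        IdempotentConsumerSketch consumer = new IdempotentConsumerSketch();
        consumer.process(0, 10);
        consumer.process(1, 20);
        consumer.process(1, 20); // delivered a second time after a simulated crash
        consumer.process(2, 5);
        System.out.println(consumer.total()); // 35, the duplicate was not counted twice
    }
}
```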
It is useful to remember history and draw an analogy with NoSQL databases. 3 or 4 years ago NoSQL databases were hot and people wanted to use them everywhere. Today we know that traditional RDBMSs are not going anywhere and that NoSQL databases are suitable for some specialized use cases. In fact NoSQL databases are going in the direction of adding features that are available in RDBMSs. Kafka today is where NoSQL databases were a few years ago. Don't throw away your traditional message broker yet. While Kafka will be great for the cases mentioned above, a lot of the simpler messaging use cases can be done a lot more easily with a traditional message broker.