Thursday, April 25, 2013

10 reasons for considering the Scala programming language

Scala programming language has been around for a few years now and its popularity is increasing. Having programmed in Java for many years, I was initially skeptical whether we needed another programming language on the JVM. But after trying out Scala and reading about the language, I have had a change in heart. Whether your background is Java, C/C++, Ruby, python, C# or any other language, Scala has some very useful features that will force you to consider it, if you were looking for a programming language. This blog just lists the useful features. Programming examples will follow in subsequent blogs.

1. Objected oriented programming language (OOP)

Scala is a object oriented programming language. The benefits of OOP are well documented. A majority of programs today are written in some OO language. If you come from JAVA, C++, C# background, then you already know the benefits. If you are currently using a language that is not OO, then this might be one of the reasons for you to consider Scala. In Scala everything is an Object, unlike JAVA where primitives are not objects and the use of static methods lets you bypass the OO paradigm. OO programming enables you to write programs that have a structure that models that problem domain that the program is written for. This helps produce programs that are easier to read and maintain.

2. Functional programming

In contrast to OO programming, functional programming encourages the use of functions to do some work without changes in state or changes to the data it works on. Data is immutable. Functions take data as input and may produce new data as output. Additionally, a function is a type just like an Integer, String or any class. The advantage of functional programming is that there are no side effects - a function takes input and produces output , that is all. This make it easy to write error free programs that can scale or can be executed in parallel. Scala has very good support functional for programming. 

3. Static Types

In statically typed languages like C++, Java and Scala, every variable has a type and the type
determines what the program can do with the variable. If you try to multiply 2 Strings, the compilation process will flag that as error. Statically typed language protect the programmer
by detecting errors and from shooting himself in the foot. If you think strong typing is annoying and leads to verbose code, then you will be pleased to know that unlike Java, Scala supports type inference ( ability to detect type ) which reduces verbosity.

4. Brevity

Scala has features that enable the programmer to write compact code as opposed to verbose code. Less code mean fewer bugs and less time spent on maintenance.

//Java
public class Person {
  private String fname ;
  private String lname ;

  public Person(String first, String last) {
      fname = first ;
      lname = last ;
  }

}

In Scala the same class is written as

class Person(fname: String,lname: String) 

Scala supports type inference that helps avoid verbose code.  

// Java String is in the statement twice
public String[] stringarray = new String[5] ;

// Scala type is infered as Array of Strings
val stringarray = new Array[String](5)

5. JVM language

Scala is compiled to bytecode that runs on the Java virtual machine. Since the JVM is available on every platform, your scala code will run on windows , linux , mac os and any other platform for which a JVM is available.

Another advantage is the integration with Java. Java has a very rich class library. There are several open source projects that provide additional libraries for very useful functions. Java code can be called from Scala programs very easily, which means all those function rich libraries are available for your use in Scala.

val calendar = new java.util.GregorianCalendar()
print(java.lang.String.format("%1$ty%1$tm%1$td",cal))

will print todays date in format YYMMDD.

6.  Better support for concurrency

To write concurrent programs in JAVA, you had to deal with threads, the java memory model, locking ,  synchronization, deadlocks etc. Writing error free concurrent programs was difficult. Scala has a actor based programming model that shields the programmer from the issues you face in Java , C/C++. To write concurrent programs , you implement actors that send, receive and handle messages. The Actor model lets the programmer avoid sharing data between threads and the issues related to locking shared data.

7. Scalable programs

By avoiding locking in concurrent programs, Scala is able to exploit the parallelism in way that Java cannot. In Java, a recommended best practice for writing scalable code was to use immutable objects. With the Actor model is Scala, you use immutable objects as messages and have unsynchronized methods. Immutable object are also at the heart of functional programming (2) which Scala promotes.

How many times have we heard of a Ruby or Python application that has be rewritten in Java or C++ because it cannot scale to the increased demands of users ? With Scala, this will not be an issue.

8. Fast

Studies have shown that Scala is at least as fast as Java.
see http://research.google.com/pubs/pub37122.html

9. General purpose/multi-purpose

The brevity and compactness of Scale ensures that it can be used for scripting or rapid application development a la Ruby or Python. But the fact that it runs on JVM and its scalability features ensure that it can be used for complex applications.

10. It is getting more popular

This is a more non technical reason. Scala is getting more popular. More startups are moving to Scala. Many are skipping Java and going directly to Scala. If you are a Java programmer, learning Scala makes you more marketable. Even if you are not a Java programmer, learning Scala will open up a number of opportunities in the programming world. 

Thursday, April 4, 2013

Using HBase Part 2: Architecture


In this blog, let us take a quick look at some architectural details of HBase.

For an introduction to NoSql and HBase, read the following blogs.
What is NoSql ?
Using HBase

Internally HBase is a  a sparse, distributed, persistent, multidimensional sorted Map. While that sentence seems complicated, reading each word individually gives clarity.
sparse - some cells can be empty
distributed - data is partitioned across many hosts
persistent - stored to disk
multidimensional - more than 1 dimension (key,value,version)
Map - key and value
sorted - maps are generally not sorted but this one is

HBase uses HDFS to store the data.

An HBase table has rows and columns. Columns are grouped into column families. There is a version for each value. So table,row key, column family, column name, version are used to get to a value. Both row keys and values are byte[]s.

Table is sorted by row key, Within a column family, the columns are sorted. Storage is per column family. So logically related columns should be in a column family.

A Table is made of regions. A region has a subset of the rows in a table. A region can be described using tablename, start key, end key. A region is made up of one or more HDFS files.

The regions are managed by servers known as the region servers. There is a master server that assigns regions to region servers.

HBase has 2 catalog tables -ROOT- and .META. .META has information on all regions in the system. -ROOT- has information on .META. When a client wants to access data, these 2 tables are consulted to determine which region server has the region that should be used for this request. The client issues read/write requests to the region server directly.

HBase uses zookeeper to maintain cluster state. A simple diagram below shows the components of an HBase cluster.



















Logical view of a table:
The table is figure 2 has 2 column families: cf1 with columns colA and ColB, cf2 with columns ColC
and ColD. The value in each cell is uniquely identified by row key, column family, column name and a timestamp or version.

Logical view of RegionServer:




The rows of a table are in a Region. Region is the unit of allocation and is identified by a start key and end key. The regions are distributed across the region servers in the cluster.








Physical view of Region Server:

Each Region has one of more stores. Each Store is per column family. The memStore is where changes are stored in memory before writing to disk. The file store is the persistent store and is a file written to HDFS. The Hfile is described in the blog HFile.

Each RegionServer has a write ahead log (WAL) . Writes are first written to the WAL. If the region server crashed before memory is flushed to disk, the WAL is used to recover. This implies data is stored in memory and flushed to disk periodically. Changes are sorted while in memory.

Reads look for data in memStore first and then go to disk if necessary. Data is flushed to disk in 64 Mb chunks. This size is configurable. HFiles are merged to larger files. Sorting in memory and merging files makes it like a mergeSort.

For delete, the row is marked as deleted ( as opposed to physically removing it).

HBase provides ACID semantics at a row level. HBase does multi version concurrent updates, which means updates happen by creating a new version as opposed to overwriting existing row. Writers need to acquire a lock to write. Readers do not acquire a lock.To ensure consistent reads without locking, HBase assigns a write number to each write. The read returns data from the highest write number that is durable. Locks stored in memory in the region server. This is sufficient because all values for a row are in one region server. Transactions are committed in a serial order.

Sharding is automatic. Regions split when files reach a certain size.

Compaction step which run in background combines files, removes deleted data.

This concludes the introduction to HBase architecture.