Code & Cluster: Using HBase

HBase is a NoSQL database from the Hadoop family. The NoSQL concept is discussed in an earlier post on this blog, What is NoSql? HBase is a column-oriented key-value store modeled on Google's Bigtable.

To recap, you would consider a NoSQL database when your RDBMS cannot meet your requirements for one or more of the following reasons:

Your application deals with billions of rows of data

The application does a lot of writes

Reads require low latency

Linear scalability on commodity hardware is required

You frequently need to add or remove columns

Several NoSQL databases address one or more of these issues. This article is an introduction to HBase. The goal is to help you start evaluating whether HBase is appropriate for your problem. This is introductory material; more details will follow in subsequent posts.

The main features of HBase are:

Built on Hadoop and HDFS. If you are already using Hadoop, HBase can be viewed as an extension to your Hadoop infrastructure that provides random reads and writes.

A simple data model based on keys, values and columns. More on this later.

Scales linearly by adding commodity hardware

Automatic partitioning of tables as they grow larger

Classes available for integration with MapReduce

Automatic failover support

Supports row key range scans

Data Model

The main constructs of the model are tables, rows, column families and columns.

Data is written to and read from a table. A table has rows and column families. Each row has a key.

Each column family has one or more columns. Columns in a column family are logically related. Each column has a name and a value. When a table is created, its column families have to be declared, but the columns in each family do not need to be defined up front and can be added on demand. Each column is referred to using the syntax columnFamily:column. For example, an age column in a userprofile column family is referred to as userprofile:age. For each row, storage space is taken up only by the columns actually written in that row.
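One way to picture this model is as a map of maps: row key, then column family, then column name, then value. The Python sketch below is only an illustration of that idea and not HBase code; the class name ToyTable, its methods, the row keys and the sample values are all made up for this example.

```python
class ToyTable:
    """Toy in-memory picture of the data model (illustration only, not the HBase API)."""

    def __init__(self, name, *families):
        self.name = name
        # Column families are fixed when the table is created.
        self.families = set(families)
        # row key -> column family -> column name -> value
        self.rows = {}

    def put(self, row_key, column, value):
        # Columns are addressed as columnFamily:column.
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a declared family are created on demand.
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, column):
        family, _, qualifier = column.partition(":")
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)


table = ToyTable("users", "userprofile")
table.put("row1", "userprofile:age", "25")
table.put("row2", "userprofile:geolocation", "US")  # a column that row1 never stores

print(table.get("row1", "userprofile:age"))          # 25
print(table.get("row1", "userprofile:geolocation"))  # None: row1 holds nothing for it
```

Writing to a family that was not declared at creation time raises an error in this sketch, while a new column inside a declared family needs no setup at all.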

Let us design an HBase table to store user web browsing information. Each user has a unique id called userid. For each user we need to store

(1) Some profile information such as sex, age, geolocation and membership.
(2) For each partner website the user visits, the page types viewed and the products viewed.
(3) For each partner website the user visits, the products purchased and the products put in the shopping cart but not purchased.

Our structure might look like

{
userid1:{ // rowkey
    profile:{ // column family
          sex: male, // column, value
          age: 25,
          member: Y
    },
    browsehistory: { // column family
          partner1.hp: 23,    // visited partner1 homepage 23 times
          partner2.product.pr1: 4 // viewed product pr1 4 times
    },
    shophistory: { // column family
          partner3.pr3: 25.5 // purchased pr3 from partner3 for $25.5
    }
}
}

Let us design an HBase table for the above structure.

Table name: usershoppingdata. Since we will look up data by user, the row key can be the userid.

(1) Column family profile for profile information. Columns would be sex, age, member etc.
(2) Column family browsehistory for browsing data. Columns are dynamic, such as websitename.page or websitename.productid.
(3) Column family shophistory for shopping data. Columns are dynamic.

The beauty is that you can add columns dynamically. If visualizing this as columns is difficult, just think of it as dynamically adding key-value pairs. This kind of data is required in a typical internet shopper analytics application.
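To make the "dynamic key-value pairs" view concrete, here is a small Python sketch in which the column name is built at write time from the partner and page. It uses plain dictionaries as a stand-in for the table; the function name record_view and the sample visits are assumptions for illustration only.

```python
from collections import defaultdict

# row key -> column family -> column name -> value (toy stand-in for the table)
table = defaultdict(lambda: defaultdict(dict))


def record_view(userid, partner, page):
    """Count a page view under a column whose name is built at write time."""
    column = f"{partner}.{page}"  # e.g. partner1.hp
    history = table[userid]["browsehistory"]
    history[column] = history.get(column, 0) + 1


record_view("userid1", "partner1", "hp")
record_view("userid1", "partner1", "hp")
record_view("userid1", "partner2", "product.pr1")

print(table["userid1"]["browsehistory"])
# {'partner1.hp': 2, 'partner2.product.pr1': 1}
```

No column had to be declared before the first visit to a new partner or page; only the column family browsehistory is fixed.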

HBase is an appropriate choice because you have several hundred million internet shoppers, which means several hundred million rows. If you wanted to store data by date, you might make the key userid+date, in which case you might have even more rows, on the order of billions. Data is written as the user visits various internet shopping websites. Later the data might need to be read with low latency to show the user a promotion or advertisement based on past history. A company I worked for in the past used a very popular RDBMS for such high-volume writes, and whenever the RDBMS was flooded with write requests, it would grind to a halt.
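A userid+date key could be sketched as follows in Python. Because row keys are kept in sorted order, all rows of one user end up next to each other and can be fetched as one key range. The "|" separator, the YYYYMMDD date format and both function names are assumptions made for this illustration.

```python
from bisect import bisect_left
from datetime import date


def make_row_key(userid, day):
    """Build a userid+date row key (separator and date format are assumptions)."""
    return f"{userid}|{day:%Y%m%d}"


def rows_for_user(sorted_keys, userid):
    """Return the keys of one user from a sorted list of row keys."""
    start = bisect_left(sorted_keys, userid + "|")
    # '}' is the ASCII character right after '|', so it bounds the prefix range.
    stop = bisect_left(sorted_keys, userid + "}")
    return sorted_keys[start:stop]


keys = sorted([
    make_row_key("userid2", date(2013, 3, 9)),
    make_row_key("userid1", date(2013, 3, 9)),
    make_row_key("userid1", date(2013, 3, 8)),
])

print(rows_for_user(keys, "userid1"))
# ['userid1|20130308', 'userid1|20130309']
```

Putting the date in a fixed-width, most-significant-first format keeps each user's rows in chronological order within that range.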

Let us use the HBase shell to create the above table, insert some data into it and query it.

Step 1: Download and install HBase from http://hbase.apache.org

Step 2: Start HBase
$ ./start-hbase.sh
starting master, logging to /Users/jk/hbase-0.94.5/bin/../logs/hbase-jk-master-jk.local.out

Step 3: Start the HBase shell
$ ./hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.94.5, r1443843, Fri Feb 8 05:51:25 UTC 2013
hbase(main):001:0>

Step 4: Create the table
hbase(main):004:0> create 'usershoppingdata','profile','browsehistory','shophistory'
0 row(s) in 3.9940 seconds

Step 5: Insert some data
hbase(main):003:0> put 'usershoppingdata', 'userid1','profile:sex','male'
0 row(s) in 0.1990 seconds

hbase(main):004:0> put 'usershoppingdata', 'userid1','profile:age','25'
0 row(s) in 0.0090 seconds

hbase(main):005:0> put 'usershoppingdata', 'userid1','browsehistory:amazon.hp','11'
0 row(s) in 0.0100 seconds

hbase(main):006:0> put 'usershoppingdata', 'userid1','browsehistory:amazon.isbn123456','3'
0 row(s) in 0.0070 seconds

hbase(main):007:0> put 'usershoppingdata', 'userid1','shophistory:amazon.isbn123456','19.99'
0 row(s) in 0.0140 seconds

Step 6: Read the data
hbase(main):008:0> scan 'usershoppingdata'
ROW                        COLUMN+CELL
userid1                   column=browsehistory:amazon.hp, timestamp=1362784343421, value=11
userid1                   column=browsehistory:amazon.isbn123456, timestamp=1362786676092, value=3
userid1                   column=profile:age, timestamp=1362784243334, value=25
userid1                   column=profile:sex, timestamp=1362784225141, value=male
userid1                   column=shophistory:amazon.isbn123456, timestamp=1362786706557, value=19.99
1 row(s) in 0.1450 seconds

hbase(main):010:0> get 'usershoppingdata', 'userid1'
COLUMN                     CELL
browsehistory:amazon.hp   timestamp=1362784343421, value=11
browsehistory:amazon.isbn timestamp=1362786676092, value=3
123456
profile:age               timestamp=1362784243334, value=25
profile:sex               timestamp=1362784225141, value=male
shophistory:amazon.isbn12 timestamp=1362786706557, value=19.99
3456
5 row(s) in 0.0520 seconds

hbase(main):011:0> get 'usershoppingdata', 'userid1', 'browsehistory:amazon.hp'
COLUMN                     CELL
browsehistory:amazon.hp   timestamp=1362784343421, value=11
1 row(s) in 0.0360 seconds

Step 7: Add a few more rows

hbase(main):015:0> put 'usershoppingdata', 'userid2','profile:sex','male'
0 row(s) in 0.0070 seconds

hbase(main):016:0> put 'usershoppingdata', 'userid3','profile:sex','male'
0 row(s) in 0.0060 seconds

hbase(main):017:0> put 'usershoppingdata', 'userid4','profile:sex','male'
0 row(s) in 0.0330 seconds

hbase(main):018:0> put 'usershoppingdata', 'userid5','profile:sex','male'
0 row(s) in 0.0050 seconds

Step 8: Let us do some range scans on the row key
hbase(main):024:0> scan 'usershoppingdata', {STARTROW => 'u'}
ROW                        COLUMN+CELL
userid1                   column=browsehistory:amazon.hp, timestamp=1362784343421, value=11
userid1                   column=browsehistory:amazon.isbn123456, timestamp=1362786676092, value=3
userid1                   column=profile:age, timestamp=1362784243334, value=25
userid1                   column=profile:sex, timestamp=1362784225141, value=male
userid1                   column=shophistory:amazon.isbn123456, timestamp=1362786706557, value=19.99
userid2                   column=profile:sex, timestamp=1362788377896, value=male
userid3                   column=profile:sex, timestamp=1362788385501, value=male
userid4                   column=profile:sex, timestamp=1362788392575, value=male
userid5                   column=profile:sex, timestamp=1362788398087, value=male
5 row(s) in 0.0780 seconds

hbase(main):019:0> scan 'usershoppingdata', {STARTROW => 'userid3'}
ROW                        COLUMN+CELL
userid3                   column=profile:sex, timestamp=1362788385501, value=male
userid4                   column=profile:sex, timestamp=1362788392575, value=male
userid5                   column=profile:sex, timestamp=1362788398087, value=male
3 row(s) in 0.0250 seconds

hbase(main):023:0> scan 'usershoppingdata', {STARTROW => 'userid3', STOPROW => 'userid5'}
ROW                        COLUMN+CELL
userid3                   column=profile:sex, timestamp=1362788385501, value=male
userid4                   column=profile:sex, timestamp=1362788392575, value=male
2 row(s) in 0.0160 seconds
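The scans above show that STARTROW is inclusive and STOPROW is exclusive: the last scan returns userid3 and userid4 but not userid5. A minimal Python sketch of the same range logic over a sorted list of row keys, assuming nothing about how HBase implements it internally (the function name scan is just borrowed from the shell command):

```python
from bisect import bisect_left


def scan(sorted_keys, startrow="", stoprow=None):
    """Return the keys k with startrow <= k < stoprow from a sorted key list."""
    lo = bisect_left(sorted_keys, startrow)
    hi = len(sorted_keys) if stoprow is None else bisect_left(sorted_keys, stoprow)
    return sorted_keys[lo:hi]


keys = ["userid1", "userid2", "userid3", "userid4", "userid5"]

print(scan(keys, "u"))                   # all five keys
print(scan(keys, "userid3"))             # ['userid3', 'userid4', 'userid5']
print(scan(keys, "userid3", "userid5"))  # ['userid3', 'userid4']
```

Because the keys are sorted, a range scan only has to locate the start position and read forward until the stop key, which is why row key design matters so much for read patterns.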

The shell is very useful for playing around with the data model and getting familiar with HBase. In a real-world application, you would typically write code in a language like Java. There is more to HBase than this simple introduction. I will get into internals and architecture in future posts.

Friday, March 15, 2013
