HBase is a NoSQL database from the Hadoop family. The NoSQL concept is discussed in my earlier blog post, "What is NoSql ?". HBase is a column-oriented key-value store modeled on Google's Bigtable.
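To recap, you would consider a NoSQL database when your RDBMS cannot meet your requirements for one or more of the following reasons:
- Your application deals with billions of rows of data
- The application does a lot of writes
- Reads require low latency
- Linear scalability on commodity hardware is required
- You frequently need to add or remove columns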
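The main features of HBase are:
- Built on Hadoop and HDFS. If you already use Hadoop, HBase can be viewed as an extension to your Hadoop infrastructure that provides random reads and writes.
- A simple data model based on keys, values and columns. More on this below.
- Scales linearly by adding commodity hardware
- Automatic partitioning of tables as they grow larger
- Classes available for integration with MapReduce
- Automatic failover support
- Support for row key range scans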
The main constructs of the model are tables, rows, column families and columns.
Data is written to and read from a table. A table has rows and column families. Each row has a key.
Each column family has one or more columns. Columns in a column family are logically related. Each column has a name and a value. When a table is created, its column families have to be declared, but the columns in each family do not need to be defined up front and can be added on demand. A column is referred to using the syntax columnFamily:column. For example, an age column in a userprofile column family is referred to as userprofile:age. For each row, storage space is taken up only by the columns written in that row.
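The model described above can be sketched in a few lines of Python. This is only a toy in-memory illustration, not the HBase API: the class name SparseTable and its put and get methods are hypothetical names chosen for this example, and the row keys are made up. It shows the three ideas from the paragraph: families are fixed at creation, columns appear on demand, and a row stores only the columns written to it.

```python
class SparseTable:
    """Toy in-memory model of an HBase-style table (hypothetical, not the HBase API)."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # declared once, when the table is created
        self.rows = {}                 # row key -> {"family:column": value}

    def put(self, row_key, column, value):
        # A column is addressed as columnFamily:column
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if not qualifier:
            raise ValueError("column must be written as columnFamily:column")
        # Columns are created on demand; nothing is stored for unwritten columns
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        cells = self.rows.get(row_key, {})
        return dict(cells) if column is None else cells.get(column)


table = SparseTable("users", ["userprofile"])
table.put("row1", "userprofile:age", "25")
table.put("row2", "userprofile:membership", "Y")  # new column, no schema change

print(table.get("row1"))                     # only the columns written to row1
print(table.get("row2", "userprofile:age"))  # never written for row2
```

Writing to a family that was not declared raises an error in this sketch, which mirrors the rule that column families must be declared when the table is created.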
Let us design an HBase table to store user web browsing information. Each user has a unique id called userid. For each user we need to store:
(1) Some profile information such as sex, age, geolocation and membership.
(2) For each partner website the user visits, the page types viewed and the products viewed.
(3) For each partner website the user visits, the products purchased and the products put in the shopping cart but not purchased.
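Our structure might look like:
{
  userid1: {                      // row key
    profile: {                    // column family
      sex: male,                  // column, value
      age: 25,
      member: Y
    },
    browsehistory: {              // column family
      partner1.hp: 23,            // visited partner1 homepage 23 times
      partner2.product.pr1: 4     // viewed product pr1 4 times
    },
    shoppinghistory: {            // column family
      partner3.pr3: 25.5          // purchased pr3 from partner3 for $25.5
    }
  }
}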
Let us design an HBase table for the above structure.
Table name: UserShoppingData (created as usershoppingdata in the shell session below). Since we will look up data by user, the row key can be userid.
(1) Column family profile for profile information. Columns would be sex, age, member, etc.
(2) Column family browsehistory for browsing data. Columns are dynamic, such as websitename.page or website.productid.
(3) Column family shoppinghistory for shopping data (named shophistory in the shell session below). Columns are dynamic.
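To see how the nested structure maps onto the table design, here is a small Python sketch that flattens one user's data into (row key, columnFamily:column, value) cells. The flatten function is a hypothetical helper written for this post, and storing every value as a string is an assumption made to keep the example simple.

```python
def flatten(row_key, families):
    """Yield (row key, 'family:column', value) for each value in a nested row."""
    for family, columns in families.items():
        for qualifier, value in columns.items():
            yield row_key, f"{family}:{qualifier}", str(value)


# The example user from the structure above
user = {
    "profile": {"sex": "male", "age": 25, "member": "Y"},
    "browsehistory": {"partner1.hp": 23, "partner2.product.pr1": 4},
    "shoppinghistory": {"partner3.pr3": 25.5},
}

cells = list(flatten("userid1", user))
for row_key, column, value in cells:
    print(row_key, column, value)
```

Each printed line corresponds to one cell, which is the unit you write with a single put in the HBase shell.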
The beauty is that you can add columns dynamically. If visualizing this as columns is difficult, just think of it as dynamically adding key-value pairs.
This kind of data is required in a typical internet shopper analytics application.
HBase is an appropriate choice because you have several hundred million internet shoppers. That is several hundred million rows. If you wanted to store data by date, you might make the key userid+date, in which case you would have even more rows, on the order of billions. Data is written as the user visits various internet shopping websites. Later the data might need to be read with low latency to show the user a promotion or advertisement based on past history. A company I worked for in the past used a very popular RDBMS for such high-volume writes, and whenever the RDBMS was flooded with write requests, it would grind to a halt.
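The userid+date key mentioned above could be composed as in the following Python sketch. The row_key helper, the literal "+" separator, and the YYYYMMDD date layout are all assumptions made for illustration; the only point is that a fixed-width date makes the keys of one user sort together in date order, since row keys are kept in sorted order.

```python
from datetime import date


def row_key(userid, day):
    """Compose a 'userid+date' key; separator and YYYYMMDD layout are assumptions."""
    return f"{userid}+{day:%Y%m%d}"


keys = sorted([
    row_key("userid2", date(2013, 3, 8)),
    row_key("userid1", date(2013, 3, 9)),
    row_key("userid1", date(2013, 2, 28)),
])

# All rows for one user share a prefix and are adjacent after sorting
user1_keys = [k for k in keys if k.startswith("userid1+")]
print(keys)
print(user1_keys)
```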
Let us use the HBase shell to create the above table, insert some data into it and query it.
Step 1: Download and install HBase from http://hbase.apache.org
Step 2: Start hbase
$ ./start-hbase.sh
starting master, logging to /Users/jk/hbase-0.94.5/bin/../logs/hbase-jk-master-jk.local.out
Step 3: Start hbase shell
$ ./hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.94.5, r1443843, Fri Feb 8 05:51:25 UTC 2013
hbase(main):001:0>
Step 4: Create the table
hbase(main):004:0> create 'usershoppingdata','profile','browsehistory','shophistory'
0 row(s) in 3.9940 seconds
Step 5: Insert some data
hbase(main):003:0> put 'usershoppingdata', 'userid1','profile:sex','male'
0 row(s) in 0.1990 seconds
hbase(main):004:0> put 'usershoppingdata', 'userid1','profile:age','25'
0 row(s) in 0.0090 seconds
hbase(main):005:0> put 'usershoppingdata', 'userid1','browsehistory:amazon.hp','11'
0 row(s) in 0.0100 seconds
hbase(main):006:0> put 'usershoppingdata', 'userid1','browsehistory:amazon.isbn123456','3'
0 row(s) in 0.0070 seconds
hbase(main):007:0> put 'usershoppingdata', 'userid1','shophistory:amazon.isbn123456','19.99'
0 row(s) in 0.0140 seconds
Step 6: Read the data
hbase(main):008:0> scan 'usershoppingdata'
ROW COLUMN+CELL
userid1 column=browsehistory:amazon.hp, timestamp=1362784343421, value=11
userid1 column=browsehistory:amazon.isbn123456, timestamp=1362786676092, value=3
userid1 column=profile:age, timestamp=1362784243334, value=25
userid1 column=profile:sex, timestamp=1362784225141, value=male
userid1 column=shophistory:amazon.isbn123456, timestamp=1362786706557, value=19.99
1 row(s) in 0.1450 seconds
hbase(main):010:0> get 'usershoppingdata', 'userid1'
COLUMN CELL
browsehistory:amazon.hp timestamp=1362784343421, value=11
browsehistory:amazon.isbn timestamp=1362786676092, value=3
123456
profile:age timestamp=1362784243334, value=25
profile:sex timestamp=1362784225141, value=male
shophistory:amazon.isbn12 timestamp=1362786706557, value=19.99
3456
5 row(s) in 0.0520 seconds
hbase(main):011:0> get 'usershoppingdata', 'userid1', 'browsehistory:amazon.hp'
COLUMN CELL
browsehistory:amazon.hp timestamp=1362784343421, value=11
1 row(s) in 0.0360 seconds
Step 7: Add a few more rows
hbase(main):015:0> put 'usershoppingdata', 'userid2','profile:sex','male'
0 row(s) in 0.0070 seconds
hbase(main):016:0> put 'usershoppingdata', 'userid3','profile:sex','male'
0 row(s) in 0.0060 seconds
hbase(main):017:0> put 'usershoppingdata', 'userid4','profile:sex','male'
0 row(s) in 0.0330 seconds
hbase(main):018:0> put 'usershoppingdata', 'userid5','profile:sex','male'
0 row(s) in 0.0050 seconds
Step 8: Let us do some range scans on the row key
hbase(main):024:0> scan 'usershoppingdata', {STARTROW => 'u'}
ROW COLUMN+CELL
userid1 column=browsehistory:amazon.hp, timestamp=1362784343421, value=11
userid1 column=browsehistory:amazon.isbn123456, timestamp=1362786676092, value=3
userid1 column=profile:age, timestamp=1362784243334, value=25
userid1 column=profile:sex, timestamp=1362784225141, value=male
userid1 column=shophistory:amazon.isbn123456, timestamp=1362786706557, value=19.99
userid2 column=profile:sex, timestamp=1362788377896, value=male
userid3 column=profile:sex, timestamp=1362788385501, value=male
userid4 column=profile:sex, timestamp=1362788392575, value=male
userid5 column=profile:sex, timestamp=1362788398087, value=male
5 row(s) in 0.0780 seconds
hbase(main):019:0> scan 'usershoppingdata', {STARTROW => 'userid3'}
ROW COLUMN+CELL
userid3 column=profile:sex, timestamp=1362788385501, value=male
userid4 column=profile:sex, timestamp=1362788392575, value=male
userid5 column=profile:sex, timestamp=1362788398087, value=male
3 row(s) in 0.0250 seconds
hbase(main):023:0> scan 'usershoppingdata', {STARTROW => 'userid3', STOPROW => 'userid5'}
ROW COLUMN+CELL
userid3 column=profile:sex, timestamp=1362788385501, value=male
userid4 column=profile:sex, timestamp=1362788392575, value=male
2 row(s) in 0.0160 seconds
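The three scans above behave like slices over a sorted list of row keys: STARTROW is inclusive and STOPROW is exclusive, which is why the last scan returns userid3 and userid4 but not userid5. Here is a minimal Python sketch of that behaviour. The scan function is a hypothetical stand-in, not the HBase client API, and it compares Python strings, which matches byte-wise ordering only for ASCII keys like these.

```python
from bisect import bisect_left


def scan(sorted_keys, startrow=None, stoprow=None):
    """Return the keys k with startrow <= k < stoprow from a sorted list."""
    lo = 0 if startrow is None else bisect_left(sorted_keys, startrow)
    hi = len(sorted_keys) if stoprow is None else bisect_left(sorted_keys, stoprow)
    return sorted_keys[lo:hi]


keys = ["userid1", "userid2", "userid3", "userid4", "userid5"]

print(scan(keys, startrow="u"))                          # every key sorts after 'u'
print(scan(keys, startrow="userid3"))                    # from userid3 to the end
print(scan(keys, startrow="userid3", stoprow="userid5")) # stop row itself is excluded
```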
The shell is very useful for playing around with the data model and getting familiar with HBase. In a real-world application, you would typically write code in a language like Java. There is more to HBase than this simple introduction. I will get into internals and architecture in future blog posts.