Data Science: What difference of RDBMS and Hive?

Hive resembles a traditional database by supporting SQL interface but it is not a full database.

Hive can be better called as data warehouse instead of database.

Hive enforces schema on read time whereas RDBMS enforces schema on write time.

In RDBMS, a table’s schema is enforced at data load time, If the data being loaded doesn’t conform to the schema, then it is rejected. This design is called schema on write.But Hive doesn’t verify the data when it is loaded, but rather when a it is retrieved. This is called schema on read.

Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database’s internal format. The load operation is just a file copy or move.

Schema on write makes query time performance faster, since the database can index columns and perform compression on the data but it takes longer to load data into the database.

Hive is based on the notion of Write once, Read many times but RDBMS is designed for Read and Write many times.

In RDBMS, record level updates, insertions and deletes, transactions and indexes are possible. Whereas these are not allowed in Hive because Hive was built to operate over HDFS data using MapReduce, where fulltable scans are the norm and a table update is achieved by transforming the data into a new table.

In RDBMS, maximum data size allowed will be in 10’s of Terabytes but whereas Hive can 100’s Petabytes very easily.

As Hadoop is a batchoriented system, Hive doesn’t support OLTP (Online Transaction Processing) but it is closer to OLAP (Online Analytical Processing) but not ideal since there is significant latency between issuing a query and receiving a reply, due to the overhead of Mapreduce jobs and due to the size of the data sets Hadoop was designed to serve.

RDBMS is best suited for dynamic data analysis and where fast responses are expected but Hive is suited for data warehouse applications, where relatively static data is analyzed, fast

response times are not required, and when the data is not changing rapidly.

To overcome the limitations of Hive, HBase is being integrated with Hive to support record level operations and OLAP.

Hive is very easily scalable at low cost but RDBMS is not that much scalable that too it is very costly scale up.

source:http://hadooptutorial.info/

Data Science

Monday, December 22, 2014

What difference of RDBMS and Hive?

1 comment: