Friday, June 13, 2014

Hive vs Pig


Scenario/Feature
Pig
Hive
Remark
Utilizing SQL experience
Pig Latin's syntax is data-flow oriented
HiveQL syntax is very similar to SQL

Query Optimization
Developer has some control
Developer has little direct say in query optimization; the Hive optimizer is the final authority.

Coding Style
Verbose
SQL like

General Usage
Scheduled jobs to crunch massive data
ETL-like jobs
Ad hoc queries

Thought process
Think in terms of a flow chart
Think in an SQL-like declarative style

Underlying platform
Hadoop, Dryad
Hadoop

Connection to external world
Integrates with Hadoop Streaming, which makes Pig accessible from other languages
Connects to the external world through a Thrift server (e.g. via JDBC), which makes it easy to integrate existing BI tools

Temporary table concept
No such requirement

For complex tasks in Hive, you have to manually create temporary tables to store intermediate data.

Queries in case of complex data structure
Queries involving complex data structures are easier to write. Pig has Tuple and Bag data types.
Queries involving complex data structures are difficult to write

Meta data support
Pig has no built-in metadata support (it is optional; in future it may integrate HCatalog).
Hive stores its tables' metadata in a relational database

Ease of writing code
Writing a UDF in Pig is much easier
Writing a UDF in Hive is not easy.
My opinion (derived from the support for complex data structures)
Streaming of data
Pig allows one to load data and user code at any point in the pipeline. This can be particularly important if the data is streaming data, for example data from satellites or instruments.
Hive, which follows the RDBMS model, needs the data to be imported (or loaded) first, and only then can it be worked on. So if you were using Hive on streaming data, you would have to keep filling buckets (or files) and run Hive on each filled bucket, while using other buckets to store the newly arriving data.

Suitability for parallelization
Pig is well suited to parallelization, so it has an edge for systems where the datasets are huge, i.e. systems where throughput takes precedence over latency (the time to get any particular piece of the result).


Who is faster
Pig is faster at data import.
Hive is faster in execution.

Handling of Skewed data
Pig has a special join mode (skew join) that users can apply when the join keys are not evenly distributed in the data. It samples the data and uses that information to distribute the load evenly. Pig's order-by command similarly samples the data first. (Pig's 'order by' statement sorts the data globally in a scalable fashion, using multiple map/reduce tasks.)
Hive's sort-by sorts within each reduce task
If your data is not evenly distributed (e.g. across join or sort keys), this can greatly affect the runtime of the query: a few of the tasks can get a much larger share of the processing.
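
The sampling idea behind the skewed-data row can be illustrated with a small standalone sketch. This is not Pig or Hive code: it is plain Python, and the names choose_boundaries, range_partition and hash_partition are hypothetical helpers invented for this illustration. It assumes integer keys and four "reducers", and only shows why picking split points from a sample gives a global order across reducers, whereas sorting inside hash-assigned reducers gives only a per-reducer order.

```python
import bisect
import random


def choose_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 split points at evenly spaced quantiles of a sample."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // num_reducers] for i in range(1, num_reducers)]


def range_partition(keys, boundaries):
    """Send each key to the reducer whose key range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        parts[bisect.bisect_right(boundaries, key)].append(key)
    return parts


def hash_partition(keys, num_reducers):
    """Send each key to a reducer by hash (here simply key modulo reducer count)."""
    parts = [[] for _ in range(num_reducers)]
    for key in keys:
        parts[key % num_reducers].append(key)
    return parts


rng = random.Random(0)
# Skewed keys (illustrative data only): most values are small, a few are large.
keys = [int(rng.expovariate(1.0) * 100) for _ in range(10000)]

# Sample first, then derive the reducer key ranges from the sample.
sample = rng.sample(keys, 500)
boundaries = choose_boundaries(sample, 4)

# Each "reducer" sorts only its own partition.
range_sorted = [sorted(p) for p in range_partition(keys, boundaries)]
hash_sorted = [sorted(p) for p in hash_partition(keys, 4)]

# Concatenating reducer outputs in reducer order.
globally_ordered = [k for part in range_sorted for k in part]
locally_ordered = [k for part in hash_sorted for k in part]

print("range partition sizes:", [len(p) for p in range_sorted])
print("range-partitioned output globally sorted:", globally_ordered == sorted(keys))
print("hash-partitioned output globally sorted:", locally_ordered == sorted(keys))
```

Because the split points come from quantiles of the sample, the partitions tend to hold similar numbers of records even though the key values are skewed. The sketch does not cover splitting a single very frequent key across several reducers, which a skew join also has to deal with.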
