Scenario/Feature
|
Pig
|
Hive
|
Remark
|
Utilizing SQL experience
|
Pig Latin's syntax is data-flow
oriented
|
HiveQL syntax is very similar to SQL
|
|
Query Optimization
|
Developer has some control
|
Developer has little say in query
optimization; the Hive optimizer is the final authority.
|
|
Coding Style
|
Verbose
|
SQL-like
|
|
General Usage
|
Scheduled jobs that crunch massive data;
ETL-like jobs
|
Ad hoc queries
|
|
Thought process
|
Think in terms of a flow chart
|
Think in terms of an SQL-like declarative
style
|
|
Underlying platform
|
Hadoop, Dryad
|
Hadoop
|
|
Connection to external world
|
Integrates with Hadoop streaming, which
makes Pig accessible from other languages
|
Connects to the external world through a
Thrift server (e.g. JDBC), which makes it easy to integrate existing BI tools
|
|
Temporary table concept
|
No such requirement
|
For complex tasks in Hive, you have to
manually create temporary tables to store intermediate data.
|
|
Queries on complex data
structures
|
Queries involving complex data structures are easier to write. Pig has Tuple and Bag
data types.
|
Queries involving complex data
structures are difficult to write
|
|
Metadata support
|
Pig has no built-in metadata support (or it is optional; in future it may
integrate HCatalog).
|
Hive stores table metadata in a
relational database
|
|
Ease of writing code
|
Writing UDFs in Pig is much easier
|
Writing UDFs in Hive is not easy.
|
My opinion (derived from the support for
complex data structures)
|
Streaming of data
|
Pig allows one to load data and user code at any point in the pipeline. This can be
particularly important if the data is streaming data, for example data from satellites
or instruments.
|
Hive, which follows an RDBMS-like model, needs the data to be imported (or loaded) first;
only then can it be worked on. So if you were using Hive on streaming
data, you would have to keep filling buckets (or files) and run Hive on each filled
bucket, while using other buckets to store the newly arriving data.
|
|
Suitability for parallelization
|
Pig is well suited to parallelization, so it has an edge for systems where the
datasets are huge, i.e. systems where throughput takes precedence
over latency (the time to get any particular datum of the result).
|
|
|
Who is faster
|
Pig is faster at data import.
|
Hive is faster at execution.
|
|
Handling of skewed data
|
Pig has a special join mode (skew join) that users can apply when the join keys
are not evenly distributed in the data. It samples the data and uses that
information to distribute the load evenly. Pig's 'order by' statement
similarly samples the data first, which lets it sort the data globally
in a scalable fashion (across multiple map/reduce tasks).
|
Hive's sort-by sorts only within each reduce task.
|
If your data is not evenly distributed
(e.g. across join or sort keys), this can greatly affect the runtime of the
query: a few of the tasks can get a much larger share of the processing.
|
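To make the last row of the table concrete, here is a minimal Python sketch of the two sorting behaviours it describes. This is a plain in-memory simulation, not Pig or Hive code, and not how either engine is actually implemented. The function names `sort_within_reducers` and `sampled_range_sort`, the hash partitioning, the sample size and the seed are all assumptions made for illustration. It only shows sampled range partitioning; it does not split a single very hot key across reducers.

```python
import bisect
import random


def sort_within_reducers(records, num_reducers):
    """Per-reducer sort: hash-partition the records, then sort each partition."""
    parts = [[] for _ in range(num_reducers)]
    for r in records:
        parts[hash(r) % num_reducers].append(r)
    # Each partition is sorted, but the partitions overlap in key range.
    return [sorted(p) for p in parts]


def sampled_range_sort(records, num_reducers, sample_size=100, seed=0):
    """Global sort: sample keys, derive range boundaries, sort each range."""
    parts = [[] for _ in range(num_reducers)]
    if not records:
        return parts
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Pick num_reducers - 1 cut points at evenly spaced positions in the sample.
    cuts = [sample[(i * len(sample)) // num_reducers]
            for i in range(1, num_reducers)]
    for r in records:
        # Larger keys go to later partitions, so the key ranges never overlap.
        parts[bisect.bisect_right(cuts, r)].append(r)
    return [sorted(p) for p in parts]


if __name__ == "__main__":
    data = list(range(20))
    random.Random(1).shuffle(data)
    print("per-reducer sort:", sort_within_reducers(data, 3))
    print("sampled range sort:", sampled_range_sort(data, 3))
```

Reading the partitions of the per-reducer sort one after another does not give a globally ordered result, while the sampled range sort does, because the sampled cut points give each reducer its own non-overlapping key range.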
Friday, June 13, 2014
Hive vs Pig
Nice Explanation... Thanks for info...