Agile, Scrum, Kanban, Architecture, ...: Hive vs Pig

Friday, June 13, 2014

Hive vs Pig

Scenario/Feature	Pig	Hive	Remark
Utilizing SQL experience	Pig Latin’s syntax is data-flow oriented	HiveQL syntax is very similar to SQL
Query Optimization	Developer has some control	Developer has no say in query optimization. The Hive optimizer is the final authority.
Coding Style	Verbose	SQL-like
General Usage	Scheduled jobs that crunch massive data; ETL-like jobs	Ad hoc queries
Thought process	Think in terms of a flow chart	Think in terms of an SQL-like declarative style
Underlying platform	Hadoop, Dryad	Hadoop
Connection to external world	Integrated with Hadoop streaming, which makes Pig accessible to other languages	Connects to the external world through the Thrift server (e.g. JDBC), so existing BI tools are easy to integrate
Temporary table concept	No such requirement	For complex tasks in Hive, you have to manually create temporary tables to store intermediate data
Queries in case of complex data structure	Queries involving complex data structures are easier to write. Pig has Tuple and Bag data types.	Queries involving complex data structures are difficult to write
Meta data support	Pig has no metadata support (or it is optional; in future it may integrate HCatalog).	Hive stores table metadata in a relational database
Ease of writing code	Writing a UDF in Pig is much easier	Writing a UDF in Hive is not easy.	My opinion (derived from the support for complex data structures)
Streaming of data	Pig allows one to load data and user code at any point in the pipeline. This can be particularly important if the data is streaming data, for example data from satellites or instruments.	Hive, which follows an RDBMS-like model, needs the data to be imported (or loaded) first; only then can it be worked on. So if you were using Hive on streaming data, you would have to keep filling buckets (or files) and run Hive on each filled bucket while other buckets keep storing the newly arriving data.
Suitability for parallelization	Pig is well suited to parallelization, so it has an edge where the datasets are huge, i.e. in systems where throughput takes precedence over latency (the time to get any particular piece of the result)
Who is faster	Pig is faster in data import.	Hive is faster in execution
Handling of skewed data	Pig has a special join mode (skew join) that users can apply when the join keys are not evenly distributed. It samples the data and uses that information to distribute the load evenly. Pig's order-by command similarly samples the data first. (The Pig 'order by' statement does a global sort of the data in a scalable fashion, across multiple map/reduce tasks.)	Hive sort-by sorts within each reduce task	If your data is not evenly distributed (e.g. across join or sort keys), this can greatly affect the runtime of the query: a few of the tasks can get a much larger share of the processing.
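
To make the last row a bit more concrete, here is a small Python sketch of the two sorting behaviours the table contrasts: sorting within each reduce task versus a global sort that samples the data first to choose partition boundaries. This is only an illustration of the idea, not how Pig or Hive are actually implemented; the function names, the hash-based shuffle, and the quantile-based cut points are all my own assumptions.

```python
import bisect
import random


def hash_partition(keys, num_reducers):
    """Assign each key to a reducer by hash (an assumed, simplified shuffle)."""
    parts = [[] for _ in range(num_reducers)]
    for k in keys:
        parts[hash(k) % num_reducers].append(k)
    return parts


def sort_within_reducers(keys, num_reducers):
    """Each reducer sorts only its own share; there is no order across reducers."""
    return [sorted(p) for p in hash_partition(keys, num_reducers)]


def sampled_boundaries(keys, num_reducers, sample_size, rng):
    """Pick num_reducers - 1 cut points from the quantiles of a random sample."""
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    return [sample[(i * len(sample)) // num_reducers]
            for i in range(1, num_reducers)]


def total_order_sort(keys, num_reducers, sample_size=100, seed=0):
    """Range-partition keys using sampled cut points, then sort each range.

    Reading the reducers' outputs in order yields a globally sorted result.
    """
    rng = random.Random(seed)
    cuts = sampled_boundaries(keys, num_reducers, sample_size, rng)
    parts = [[] for _ in range(num_reducers)]
    for k in keys:
        # bisect_right maps k to the range whose upper cut point exceeds it
        parts[bisect.bisect_right(cuts, k)].append(k)
    return [sorted(p) for p in parts]


if __name__ == "__main__":
    data = list(range(20))
    random.Random(42).shuffle(data)
    print("sorted within reducers:", sort_within_reducers(data, 3))
    print("globally ordered ranges:", total_order_sort(data, 3))
```

Because the cut points come from a sample, the ranges follow the real distribution of the keys, which is the intuition behind sampling first. Note that in this simplified version a single very frequent key still lands in one range.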

1 comment:

Ravi Teja, February 12, 2015 at 9:42 PM
Nice Explanation... Thanks for info...

Disclaimer & Copyright

The entries in my blog are solely my opinions and do not represent the thoughts, intentions, plans or strategies of any third party, including my employer, except where explicitly stated. Needless to say, a weblog is a snapshot in time. Over time, as I interact with the community at large and/or learn more about various topics, my thoughts and opinions are subject to change. As such, you should not consider out-of-date posts to reflect my current thoughts and opinions. Java, Oracle, Oracle Fusion Middleware, TIBCO, Sun, Microsoft, IBM, WebSphere, SAP, NetWeaver, Cloudera, HortonWorks and any others mentioned are trademarks of their respective owners. © Copyright 2001-2015, Tushar Jain
