Monday, June 16, 2014

Blog: For my High Schooler - Compilation vs Interpretation



Yash: Today we were talking about Java. During the discussion, the instructor said that Java is both a compiled and an interpreted language.

Me: He is perfectly correct.

Yash: But what do "compiled" and "interpreted" mean?

Me: It is very easy to understand.

Me: Let's take a scenario. The recently elected Prime Minister of India, Narendra Modi, receives a letter of congratulations from the Japanese Prime Minister, Shinzo Abe. The Japanese PM wrote the letter in Japanese. The Indian PM does not understand Japanese, so one of the translators in the Indian PM's office translates the letter into Hindi and passes it on to Modi.
Notice that the translation from Japanese to Hindi happened before the letter reached Modi's desk.

Now consider a second scenario. Modi is travelling to Japan, and there is a meeting between Modi and Abe. As we know, Abe does not know Hindi and Modi has no clue about Japanese. So during the meeting there will be a translator, who will translate Hindi to Japanese and vice versa as they speak.

In this case, translation is happening in nearly real time.

Yash: OK, I got it. The first scenario is compilation, while the second is interpretation.

Me: Fantastic. Now my question is: why does Java use both compilation and interpretation?

Yash: That's easy. Java takes advantage of compilation by translating the English-like source code into an intermediate language.

Me: This intermediate language is called bytecode.

Yash: Yep. And since we have the concept of a virtual machine, each platform has its own virtual machine, which acts as the real-time translator (interpreter) for that bytecode. That is how Java becomes platform independent.

Me: Excellent. We will discuss the virtual machine tomorrow.
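
For readers who like to see it in code, here is a minimal sketch of the two-step idea from the conversation above: translate the whole program once into an intermediate form, then let a small "virtual machine" execute that form step by step. It is written in Python only for brevity. The instruction names (PUSH, ADD, MUL) and the functions compile_expr and run are made up for this illustration; this is not real Java bytecode and not how the JVM is implemented.

```python
# Toy illustration only: hypothetical instruction set, not real Java bytecode.
OPS = {"+": "ADD", "*": "MUL"}

def compile_expr(source):
    """Step 1 (like the translated letter): translate the whole text once, ahead of time.

    Accepts space-separated input such as "2 * 3 + 4" and evaluates
    strictly left to right (no operator precedence, to keep the sketch short).
    """
    tokens = source.split()
    bytecode = [("PUSH", int(tokens[0]))]
    # The remaining tokens come in (operator, number) pairs.
    for op, num in zip(tokens[1::2], tokens[2::2]):
        bytecode.append(("PUSH", int(num)))
        bytecode.append((OPS[op],))
    return bytecode

def run(bytecode):
    """Step 2 (like the live translator): execute one instruction at a time."""
    stack = []
    for instr in bytecode:
        if instr[0] == "PUSH":
            stack.append(instr[1])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if instr[0] == "ADD" else left * right)
    return stack.pop()

program = compile_expr("2 * 3 + 4")   # done once
result = run(program)                 # can be repeated on any machine that has a `run`
print(program)
print(result)  # 10
```

The list returned by compile_expr plays the role of bytecode: it does not depend on the machine, and only run would need to be rewritten for each platform.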

Saturday, June 14, 2014

HiveQL vs SQL


| Scenario/Feature | HiveQL | SQL | Remarks |
| --- | --- | --- | --- |
| Default join | "equi" join | Inner join | An "equi" join returns only the rows where the join condition is true, and returns no null values. |
| Join syntax | LEFT OUTER JOIN, RIGHT OUTER JOIN | LEFT JOIN, RIGHT JOIN | |
| Largest table last | Hive attempts to perform a map-side join, where it loads the first table into memory and reads the second table as normal input to the map function. | | When writing queries, try to facilitate this as much as possible: order the tables used in the join so that the largest table is last. |
| Data type | No interval types | | |
| | All queries must reference a table | 'dual' or table-less queries supported | |
| | No session-scoped temp tables | | |
| | No 'IN' predicate | | |
| | No 'FIND' string search function for producing the offset to a match | | |
| | No find/replace string functions for plain strings (i.e. not regex) | | |
| | No regular UNION, INTERSECT, or MINUS operators | | |
| | Null values are treated differently from empty strings and are exported differently, i.e. empty strings are exported as '\n' and nulls are exported as nulls. | | This isn't unique to Hive, but it is still annoying when exporting data from Hive into another system. |
| | No hierarchical/self-referencing querying | | Most distributed computing solutions can't do this, but it can be very handy. |
| | No UPDATE or DELETE statements | | |
| | No cost-based explain plans | | Running an explain plan generally just shows the path used to access the data. That is useful to some degree, but it would be great if it were more advanced and helped the user understand which steps cause the biggest slowdowns. |
| | Hive does not support queries that select from tables in more than one database | It is possible | |
| | Hive does not support sub-queries such as those connected by IN/EXISTS in the WHERE clause | | |
| | Hive does not support truncating data from a table | | |
| | No inequality joins | | |
| | group_concat() is missing in HiveQL | | It is available in Impala. |
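
To make the "equi join" and "largest table last" rows above more concrete, here is a small in-memory sketch of the general idea: keep the smaller table in memory as a lookup and stream the larger table past it one row at a time. It is written in Python and is only an illustration. The function name hash_join and the sample tables are made up, and this is not Hive's actual implementation.

```python
def hash_join(small_rows, large_rows, key):
    """Inner equi-join of two lists of dicts on `key`.

    small_rows is held in memory; large_rows is streamed row by row,
    which is why the largest table should come last.
    """
    # Build phase: index the small table by its join key.
    index = {}
    for row in small_rows:
        if row[key] is not None:          # a NULL key never equals anything
            index.setdefault(row[key], []).append(row)

    # Probe phase: only rows whose key has a match are returned.
    for row in large_rows:
        for match in index.get(row[key], []):
            yield {**match, **row}

# Hypothetical sample data.
departments = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "Ops"}]
employees = [
    {"name": "Asha", "dept_id": 1},
    {"name": "Ravi", "dept_id": 3},      # no matching department
    {"name": "Meena", "dept_id": None},  # NULL key
    {"name": "Kiran", "dept_id": 2},
]

joined = list(hash_join(departments, employees, "dept_id"))
for row in joined:
    print(row)
```

Only Asha and Kiran appear in the output. Ravi (no matching department) and Meena (NULL key) are dropped, which is the behaviour the "Default join" row describes.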

Friday, June 13, 2014

Hive vs Pig


| Scenario/Feature | Pig | Hive | Remark |
| --- | --- | --- | --- |
| Utilizing SQL experience | Pig Latin's syntax is data-flow oriented. | HiveQL syntax is very similar to SQL. | |
| Query optimization | The developer has some control. | The developer has no say in query optimization. The Hive optimizer is the final authority. | |
| Coding style | Verbose | SQL-like | |
| General usage | Scheduled jobs to crunch massive data; ETL-like jobs | Ad hoc queries | |
| Thought process | Think in terms of a flow chart. | Think in terms of a SQL-like declarative style. | |
| Underlying platform | Hadoop, Dryad | Hadoop | |
| Connection to external world | Integrated with Hadoop streaming, which makes Pig accessible to other languages. | Connects to the external world through the Thrift server (e.g. JDBC). Easy to integrate with existing BI tools. | |
| Temporary table concept | No such requirement | For complex tasks, you have to manually create temporary tables to store intermediate data. | |
| Queries on complex data structures | Queries involving complex data structures are easier to write. Pig has Tuple and Bag data types. | Queries involving complex data structures are difficult to write. | |
| Metadata support | Pig has no metadata support (or it is optional; in future it may integrate with HCatalog). | Hive stores table metadata in a relational database. | |
| Ease of writing code | Writing a UDF in Pig is much easier. | Writing a UDF in Hive is not easy. | My opinion (derived from the support for complex data structures). |
| Streaming of data | Pig allows one to load data and user code at any point in the pipeline. This can be particularly important if the data is streaming data, for example data from satellites or instruments. | Hive, which follows an RDBMS-like model, needs the data to be imported (or loaded) first; only then can it be worked on. So if you were using Hive on streaming data, you would have to keep filling buckets (or files) and run Hive on each filled bucket, while using other buckets to store the newly arriving data. | |
| Suitability for parallelization | Pig is well suited to parallelization, so it has an edge for systems where the datasets are huge, i.e. systems where throughput has higher precedence than latency (the time to get any particular datum of the result). | | |
| Who is faster | Pig is faster at data import. | Hive is faster at execution. | |
| Handling of skewed data | Pig has a special join mode (skewed join) that users can apply to data whose join keys are not evenly distributed. It samples the data and uses that information to distribute the load evenly. Pig's order-by command similarly samples the data first. (Pig's 'order by' statement sorts the data globally in a scalable fashion, using multiple map/reduce tasks.) | Hive's sort-by sorts within each reduce task. | If your data is not evenly distributed (e.g. across join or sort keys), this can greatly affect the runtime of the query: a few of the tasks can get a much larger share of the processing. |
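
The difference in the last row between sorting within each reduce task and a sampled global sort can be sketched as follows. This is a simplified, single-process illustration in Python. The function names, the round-robin distribution of rows, and the fixed sampling step are assumptions made for the example, not how Hive or Pig actually assign rows to reducers.

```python
import bisect

def sort_within_reducers(values, num_reducers):
    """Each 'reducer' sorts only its own share (the sort-by behaviour above)."""
    buckets = [[] for _ in range(num_reducers)]
    for i, v in enumerate(values):
        buckets[i % num_reducers].append(v)  # round-robin stand-in for the real distribution
    return [sorted(b) for b in buckets]

def sampled_global_sort(values, num_reducers, sample_step=2):
    """Sample first, pick range boundaries, then sort each range (the order-by idea above).

    Assumes `values` is non-empty.
    """
    sample = sorted(values[::sample_step])
    # num_reducers - 1 cut points taken at evenly spaced positions in the sample,
    # so each reducer should receive a similar number of rows.
    cuts = [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]
    buckets = [[] for _ in range(num_reducers)]
    for v in values:
        buckets[bisect.bisect_right(cuts, v)].append(v)
    return [sorted(b) for b in buckets]

values = [3, 8, 6, 1, 9, 4, 0, 7, 5, 2]

local = sort_within_reducers(values, 2)
print(local)        # each list is sorted, but the lists overlap in range

globally = sampled_global_sort(values, 2)
print(globally)     # each list covers its own range, so reading them in order is fully sorted
```

Because the cut points come from a sample of the data itself, a heavily repeated or clustered key range tends to be split where the data actually is, rather than at fixed positions. That is the intuition behind sampling before distributing the load.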