Agile, Scrum, Kanban, Architecture, ...: June 2014

Monday, June 16, 2014

Blog: For my High Schooler - Compilation vs Interpretation

Yash: Today we were talking about Java. During the discussion, the instructor said that Java is a compiled as well as an interpreted language.

Me: He is perfectly correct.

Yash: But what do compiled and interpreted mean?

Me: It is very easy to understand.

Me: Let’s take a scenario. The recently elected Prime Minister of India, Narendra Modi, receives a letter of congratulations from the Japanese Prime Minister, Shinzo Abe. The Japanese PM has written the letter in Japanese. The Indian PM does not understand Japanese, so one of the translators in the Indian PM's office translates the letter and passes it on to Modi.

Notice that the translation from Japanese to Hindi happened before the letter reached Modi’s desk.

Now consider a second scenario. Modi is travelling to Japan, and there is a meeting between Modi and Abe. As we know, Abe does not know Hindi and Modi has no clue about Japanese. So during the meeting there will be a translator who translates Hindi to Japanese and vice versa as they speak.

In this case, the translation happens in near real time.

Yash: OK, I got it. The first scenario is compilation, while the second is interpretation.

Me: Fantastic. Now my question is: why does Java have both compilation and interpretation?

Yash: This is easy. Java takes advantage of compilation by translating the English-like source code into an intermediate language.

Me: This intermediate language is called bytecode.

Yash: Yep. Since we have the concept of a virtual machine, and each platform has its own virtual machine that acts as a real-time translator (interpreter) for that bytecode, Java becomes platform independent.

Me: Excellent. We will discuss the virtual machine tomorrow.
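To make the two steps concrete, here is a toy sketch in Python. It is only an illustration of the idea in the conversation above: the instruction names (`PUSH`, `APPLY`) and the function names are made up for this example and have nothing to do with real Java bytecode or the real JVM. The first function plays the translator who works on the letter ahead of time, and the second plays the translator in the meeting room, handling one instruction at a time.

```python
import operator

# Hypothetical instruction set for this sketch; real Java bytecode is far richer.
OPERATIONS = {"+": operator.add, "-": operator.sub,
              "*": operator.mul, "/": operator.truediv}
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def compile_expression(source):
    """Translate a space-separated expression such as '2 + 3 * 4' into
    postfix instructions once, before anything is executed."""
    instructions, pending = [], []
    for token in source.split():
        if token in PRECEDENCE:
            # Emit waiting operators that bind at least as tightly as this one.
            while pending and PRECEDENCE[pending[-1]] >= PRECEDENCE[token]:
                instructions.append(("APPLY", pending.pop()))
            pending.append(token)
        else:
            instructions.append(("PUSH", float(token)))
    while pending:
        instructions.append(("APPLY", pending.pop()))
    return instructions


def run_on_virtual_machine(instructions):
    """Execute the instructions one at a time on a small stack machine."""
    stack = []
    for opcode, argument in instructions:
        if opcode == "PUSH":
            stack.append(argument)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATIONS[argument](left, right))
    return stack.pop()


bytecode = compile_expression("2 + 3 * 4")   # the "compile" step, done once
print(run_on_virtual_machine(bytecode))      # the "interpret" step
```

The same `bytecode` list could be handed to any machine that has its own `run_on_virtual_machine`, which is the platform-independence point Yash makes.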

Saturday, June 14, 2014

HiveQL vs SQL

Scenario/Feature	HiveQL	SQL	Remarks
Default join	Equi-join	Inner join	Equi-join: only rows for which the join condition is true are returned; unmatched rows are dropped rather than padded with NULLs.
Join syntax	LEFT OUTER JOIN, RIGHT OUTER JOIN	LEFT JOIN, RIGHT JOIN
Largest table last	Hive attempts to perform a map-side join where it loads the first table into memory and reads the second table in as normal input to the map function		When writing queries, try to facilitate this as much as possible and order the tables used in the join so that the largest table is last.
Data Type	No interval types
	All queries must reference a table	'dual' or table-less queries supported
	No session-scoped temp tables
	No 'IN' predicate
	No 'FIND' string search function for producing the offset to a match
	No find/replace string functions for plain strings (i.e. not regex)
	No regular UNION, INTERSECT, or MINUS operators
	NULL values are treated differently from empty strings and are exported differently: by default NULLs are written as '\N', while empty strings are written as empty strings		This isn't unique to Hive, but it is still annoying when exporting data from Hive into another system.
	No hierarchical/self-referencing querying		Most distributed computing solutions can't do this, but it can be very handy.
	No Update or Delete statements
	No cost-based explain plans		Explain plans generally just show the path used to access the data. That is useful to a degree, but it would be better if they helped the user understand which steps cause the biggest slowdowns.
	Hive does not support queries that select from tables in more than one database	Supported
	Hive does not support sub-queries such as those connected by IN/EXISTS in the WHERE clause
	Hive does not support the truncation of data from a table
	No inequality join
	group_concat() is missing in HiveQL		It is available in Impala
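The "largest table last" row is easier to picture with a small sketch. The Python below is not Hive's implementation; it only mimics the behaviour described in the table, where one table is held in memory and the other is read row by row. The function name `map_side_join` and the sample tables are hypothetical.

```python
from collections import defaultdict


def map_side_join(small_table, large_table, key):
    """Equi-join two lists of dict rows on `key`.

    The small table is indexed in memory; the large table is only
    iterated once, so it can be a stream that never fits in memory.
    """
    index = defaultdict(list)
    for row in small_table:
        if row[key] is not None:          # NULL keys never match in an equi-join
            index[row[key]].append(row)

    for row in large_table:               # read one row at a time
        for match in index.get(row[key], []):
            yield {**match, **row}


# Hypothetical data: a small lookup table and a (pretend) huge fact table.
departments = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "Ops"}]
employees = [
    {"dept_id": 1, "name": "Asha"},
    {"dept_id": 3, "name": "Ben"},        # no matching department
    {"dept_id": None, "name": "Chen"},    # NULL key, dropped by the join
    {"dept_id": 2, "name": "Dev"},
]

for joined in map_side_join(departments, iter(employees), "dept_id"):
    print(joined)
```

Memory use depends only on the table that is indexed, which is why it pays to make the streamed table the largest one.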

Friday, June 13, 2014

Hive vs Pig

Scenario/Feature	Pig	Hive	Remark
Utilizing SQL experience	Pig Latin’s syntax is data-flow oriented	HiveQL syntax is very similar to SQL
Query optimization	The developer has some control	The developer has no say in query optimization; the Hive optimizer is the final authority.
Coding style	Verbose	SQL-like
General usage	Scheduled, ETL-like jobs that crunch massive data	Ad hoc queries
Thought process	Think in terms of a flow chart	Think in a SQL-like declarative style
Underlying platform	Hadoop, Dryad	Hadoop
Connection to external world	Integrated with Hadoop Streaming, which makes Pig accessible to other languages	Connects to the external world through a Thrift server (e.g. for JDBC), so it is easy to integrate existing BI tools
Temporary table concept	No such requirement	For a complex task, you have to create temporary tables manually to store intermediate data
Queries on complex data structures	Queries involving complex data structures are easier to write. Pig has Tuple and Bag data types.	Queries involving complex data structures are difficult to write
Metadata support	Pig has no metadata support (or it is optional; in future it may integrate with HCatalog)	Hive stores table metadata in a relational database
Ease of writing code	Writing a UDF in Pig is much easier	Writing a UDF in Hive is not easy	My opinion (derived from the support for complex data structures)
Streaming of data	Pig allows one to load data and user code at any point in the pipeline. This can be particularly important if the data is streaming data, for example data from satellites or instruments.	Hive, which is modelled on an RDBMS, needs the data to be imported (or loaded) first; only after that can it be worked on. So if you were using Hive on streaming data, you would have to keep filling buckets (or files) and run Hive on each filled bucket, while using other buckets to store the newly arriving data.
Suitability for parallelization	Pig is well suited to parallelization, so it has an edge for systems where the datasets are huge, i.e. systems where throughput has higher precedence than latency (the time to get any particular piece of the result)
Who is faster	Pig is faster at data import	Hive is faster in execution
Handling of skewed data	Pig has a special join mode (skew join) that users can use to query data whose join keys are not evenly distributed. It samples the data and uses that information to distribute the load evenly. Pig's ORDER BY similarly samples the data first, and it sorts the data globally in a scalable fashion (multiple map/reduce tasks).	Hive's SORT BY sorts within each reduce task	If your data is not evenly distributed (e.g. across join or sort keys), this can greatly affect the runtime of the query: a few of the tasks can get a much larger share of the processing.
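The difference in the last row can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not code from Pig or Hive: the "reducers" are plain lists, the keys are integers, and both function names are hypothetical. The first function sorts inside each reducer only, as the Hive column describes; the second samples the data first to choose range boundaries, as the Pig column describes, so that reading the reducers in order yields a global sort.

```python
import bisect
import random


def sort_within_reducers(rows, num_reducers):
    """Spread rows over reducers by hash, then sort each reducer's share.
    Each output list is sorted, but their concatenation usually is not."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[hash(row) % num_reducers].append(row)
    return [sorted(partition) for partition in partitions]


def sampled_total_order(rows, num_reducers, sample_size=100, seed=0):
    """Sample the rows, pick range boundaries from the sample, send each row
    to the reducer that owns its range, then sort within each reducer."""
    if not rows:
        return [[] for _ in range(num_reducers)]
    rng = random.Random(seed)
    sample = sorted(rng.sample(rows, min(sample_size, len(rows))))
    # num_reducers - 1 cut points, spaced evenly through the sorted sample,
    # so that skewed data still ends up in ranges of similar size.
    step = len(sample) / num_reducers
    boundaries = [sample[int(step * i)] for i in range(1, num_reducers)]
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[bisect.bisect_right(boundaries, row)].append(row)
    return [sorted(partition) for partition in partitions]


rows = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
print(sort_within_reducers(rows, 2))
print(sampled_total_order(rows, 2))
```

Because the boundaries come from a sample of the actual data rather than from fixed ranges, a heavily skewed key distribution still gets split into roughly equal pieces, which is the point of the remark about runtime.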
