Saturday, June 14, 2014

HiveQL vs SQL


Scenario/Feature
HiveQL
SQL
Remarks
Default Join
"equi" join
Inner join
"equi" join - the only entries that are returned are the ones where the condition is true and returns no null values
Join syntax
LEFT OUTER JOIN
RIGHT OUTER JOIN
LEFT JOIN
RIGHT JOIN

Largest table last
Hive attempts to perform a map-side join where it loads the first table into memory and reads the second table in as normal input to the map function

When writing queries, try to facilitate this as much as possible and order the tables used in the join so that the largest table is last.
Data Type
No interval types



All queries must reference a table
'dual' or table-less queries supported


No session-scoped temp tables



No 'IN' predicate



No 'FIND' string search function for producing the offset to a match



No find/replace string functions for plain strings (i.e. not regex)



No regular UNION, INTERSECT, or MINUS operators



Null values are treated differently than empty string, and are exported differently.  IE, empty strings are exported as '\n' and nulls are exported as nulls

This isn't unique to Hive but still annoying when exporting data from Hive into another system.

No hierarchical/self-referencing querying

Most distributed computing solutions can't do this, but it can be very handy.

No Update or Delete statements



No cost-based explain plans. 

Running explain plans generally just shows the path of accessing data.  Useful to some degree but it would be great if it was more advanced in that it could help the user understand which steps are causing the biggest slowdowns

Hive Does not support the ability to run a query that select from tables in more than one database
It is possible


Hive does not support sub-queries such as those connected by IN/EXISTS in the WHERE clause



Hive does not support the truncation of data from a table



No inequality join



group_concat () is missing in Hive QL

it is available with Impala

No comments:

Post a Comment