Wednesday, October 2, 2013

Does unstructured data exist?



When talking to NoSQL enthusiasts, I often hear that NoSQL databases can handle unstructured data. Hadoop and Big Data devotees echo similar arguments. Are these people technically correct, or are they just using marketing hype to influence IT decision makers who are business savvy but dependent on others for technical judgment?

In my view, there is no such thing as unstructured data. NoSQL, Hadoop and Big Data enthusiasts call any dataset that does not fit in a relational database unstructured data. What do you think?

In the context of data, two attributes define complexity. The first is the relationships among objects (an object being the equivalent of a table in a relational database), and the second is the number of elements in each object (an element being the equivalent of a column in a table). Each attribute can be fixed or varying, which gives four possible combinations:

      * Both the number of elements in objects and the relationships among objects are fixed; they do not change over time.
a.       The number of elements in objects is fixed, and the relationships among objects are simple and can be described using relational algebra. This type of data is a prime candidate for a relational database.
b.      The number of elements in objects is fixed, but the relationships among objects are not simple and are difficult or nearly impossible to describe using relational algebra. For example, if the relationships among objects mimic a graph structure, then a graph database (e.g. Neo4j) is a better choice than a relational or any other type of database.

      * The number of elements in objects varies on an ad hoc basis. Irrespective of the complexity of the relationships among objects, a relational database is not the solution. You need a database that can accommodate a varying number of elements in objects, such as MongoDB.

      * The number of elements in objects is fixed, but the relationships among objects vary on an ad hoc basis. Again, a relational database is not the solution. You should explore HBase or MongoDB for this scenario.

      * Both the number of elements in objects and the relationships among objects vary on an ad hoc basis. Yep, you guessed correctly: a relational database is not part of the solution. For this scenario you can explore HBase or MongoDB.
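The four combinations can be written down as a small decision function. This is only a sketch of the reasoning in the list: the names `Store` and `suggest_store` are made up for illustration, and the sketch assumes the second case means varying elements with fixed relationships.

```python
from enum import Enum


class Store(Enum):
    """Database families named in the list (hypothetical enum for illustration)."""
    RELATIONAL = "relational database"
    GRAPH = "graph database (e.g. Neo4j)"
    DOCUMENT = "MongoDB"
    HBASE_OR_MONGODB = "HBase or MongoDB"


def suggest_store(elements_fixed: bool,
                  relationships_fixed: bool,
                  relationships_simple: bool = True) -> Store:
    """Map the two complexity attributes to the store suggested in the post."""
    if elements_fixed and relationships_fixed:
        # Case 1: everything is stable; only the shape of the relationships matters.
        return Store.RELATIONAL if relationships_simple else Store.GRAPH
    if not elements_fixed and relationships_fixed:
        # Case 2: elements vary (assumed here: relationships stay fixed).
        return Store.DOCUMENT
    # Cases 3 and 4: relationships vary, with or without varying elements.
    return Store.HBASE_OR_MONGODB


if __name__ == "__main__":
    for elements_fixed in (True, False):
        for relationships_fixed in (True, False):
            choice = suggest_store(elements_fixed, relationships_fixed)
            print(elements_fixed, relationships_fixed, "->", choice.value)
```

Note that the relational database appears in only one branch, and nothing in the function depends on data volume.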

In the above discussion, I have not considered the volume of data.

In truly unstructured data, the structure of the data would not be definable. If one cannot define a structure, then the structure does not exist from a programming perspective.

There is no unstructured data. Data has structure; we may simply not have been able to discover or comprehend it yet.
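To make the point concrete, here is a minimal sketch of discovering the structure of records whose fields vary from one record to the next. The function name `infer_structure` and the sample records are invented for illustration; real schema discovery would also have to deal with nesting, missing values and scale.

```python
from collections import defaultdict


def infer_structure(records):
    """Return {field name: set of type names seen for that field}."""
    fields = defaultdict(set)
    for record in records:
        for key, value in record.items():
            fields[key].add(type(value).__name__)
    return dict(fields)


# Hypothetical records with a varying number of elements.
records = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "email": "bob@example.com"},
    {"name": "Cy", "age": "unknown", "tags": ["x"]},
]

structure = infer_structure(records)
for field in sorted(structure):
    print(field, sorted(structure[field]))
```

Even though no two records share the same set of fields, a structure still emerges: a set of known fields, each with the types it has taken so far.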
