Monday, July 28, 2014

Book Review: Functional Thinking: Paradigm Over Syntax

Book Review: Functional Thinking: Paradigm Over Syntax by Neal Ford; Publisher: O'Reilly; ISBN-13: 978-1449365516

Functional Thinking: Paradigm Over Syntax is supposed to be a book about a paradigm shift, about the thought process of thinking functionally, but it completely fails at that goal. The book starts with the assumption that the reader already knows functional programming. But if a reader already knows functional programming, why would they be interested in this book?

Just one example: the book repeatedly uses the term "higher-order function", but at least through chapter three there is no explanation of what it means. After chapter three, I lost my patience.
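For context, since the book does not explain it early on: a higher-order function is simply a function that takes another function as an argument or returns one. A minimal Python sketch (the function names here are my own, not the book's):

```python
# A higher-order function: it accepts another function as an argument.
def apply_twice(f, x):
    """Apply f to x, then apply f to the result."""
    return f(f(x))

def add_three(n):
    return n + 3

print(apply_twice(add_three, 10))  # 10 + 3 + 3 = 16
```

That single idea (functions as ordinary values) underlies most of the techniques the book assumes the reader already knows.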

Disclaimer: I was not paid to review this book, and I do not stand to gain anything if you buy it. I have no relationship with the publisher or the author. I received an electronic copy of the book from the publisher for review.

Friday, July 18, 2014

Sizing of Name Node Ram and Physical Memory for Data Nodes

Recently, while working with one of my clients, I was asked to advise on the RAM requirement for the NameNode and the physical storage capacity for the DataNodes. This is one of the questions I am asked repeatedly, so to settle it once and for all, I would like to formalize the answer as a mathematical formula and take the ambiguity out of it.
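As a rough illustration (not the exact formula in the attached spreadsheet), a commonly cited Hadoop rule of thumb is that each file, directory, and block object costs on the order of 150 bytes of NameNode heap. A quick Python sketch of that estimate, with illustrative numbers of my own choosing:

```python
# Rough NameNode heap estimate using the widely cited
# ~150 bytes-per-namespace-object rule of thumb.
# Files, directories, and blocks each count as one object.

BYTES_PER_OBJECT = 150  # approximate heap cost per namespace object

def namenode_heap_bytes(num_files, num_dirs, num_blocks):
    """Estimate NameNode heap (in bytes) for a given namespace size."""
    return (num_files + num_dirs + num_blocks) * BYTES_PER_OBJECT

# Example: 10M files in 1M directories, ~1.5 blocks per file on average
heap = namenode_heap_bytes(10_000_000, 1_000_000, 15_000_000)
print(round(heap / 1024**3, 2))  # estimate in GiB, roughly 3.63
```

The real sizing exercise must also budget for replication (which multiplies DataNode storage, not NameNode heap), growth over time, and JVM overhead, which is what the spreadsheet formalizes.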

The associated Excel file is also available on Scribd.

Hadoop NameNode RAM and Physical Memory for DataNodes Sizing

Thursday, July 17, 2014

MapReduce – The Model

In the map-reduce programming model, work is divided into two phases: a map phase and a reduce phase. Both of these phases work on key-value pairs. What these pairs contain is completely up to you: they could be URLs paired with counts of how many pages link to them, or movie IDs paired with ratings. It all depends on how you write and set up your map-reduce job.
A map-reduce program typically acts something like this:
  1. Input data, such as a long text file, is split into key-value pairs. These key-value pairs are then fed to your mapper. (This is the job of the map-reduce framework.)
  2. Your mapper processes each key-value pair individually and outputs one or more intermediate key-value pairs.
  3. All intermediate key-value pairs are collected, sorted, and grouped by key (again, the responsibility of the framework).
  4. For each unique key, your reducer receives the key with a list of all the values associated with it. The reducer aggregates these values in some way (adding them up, taking averages, finding the maximum, etc.) and outputs one or more output key-value pairs.
  5. Output pairs are collected and stored in an output file (by the framework).
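The five steps above can be sketched in plain, single-process Python (no Hadoop), using a hypothetical word-count job as the example:

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, line):
    # Step 2: emit one (word, 1) pair per word in the input line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Step 4: aggregate all counts associated with one word.
    yield (key, sum(values))

def map_reduce(records):
    # Steps 1-2: run the mapper over every input key-value pair.
    intermediate = [pair for k, v in records for pair in mapper(k, v)]
    # Step 3: sort and group intermediate pairs by key (the "shuffle").
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = (v for _, v in group)
        # Steps 4-5: reduce each group and collect the output pairs.
        output.extend(reducer(key, values))
    return output

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(map_reduce(lines))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1),
#  ('quick', 1), ('the', 2)]
```

In a real framework the mapper and reducer are the only parts you write; the splitting, shuffling, and output collection are exactly the framework responsibilities noted in steps 1, 3, and 5.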

What makes this model so good for parallel programming should be apparent from the figure above: each key-value pair can be mapped or reduced independently. This means that many different processors, or even machines, can each take a section of the data and process it separately—a classic example of data parallelism. The only real step where synchronization is needed is during the collecting and sorting phase, which can be handled by the framework (and, when done carefully, even this can be parallelized).
So, when you can fit a problem into this model, parallelization becomes very easy. What may be less obvious is how to fit a problem into this model in the first place.

Real-life MapReduce

Tree of Maps: