Did you know that 90% of the world’s data has been produced in the past two years? That’s an impressive number, right?!
Companies of all sizes understand that there are opportunities to improve their customer relationships and product quality, and to spot trends, by analyzing this huge amount of data (hence the term “Big Data”). The problem is that much of this data is unstructured and cannot be analyzed with traditional tools. For instance, the data can come in the form of text, video, images, social media posts, you name it.
Driven by this challenge, a new application platform was born: Hadoop. In a nutshell, this platform takes a divide-and-conquer approach, using distributed components to process large amounts of data.
The main roles within the Hadoop 1.0 architecture are:
NameNode: in charge of managing the Hadoop Distributed File System (HDFS) namespace and data block placement;
DataNodes: store the data and retrieve it when necessary; you can scale out the number of nodes to store more data and provide data resiliency;
JobTracker: jobs are submitted to the JobTracker, which splits each job into tasks and sends them to the TaskTrackers for execution, controlling and scheduling the job’s execution;
TaskTracker: runs on each worker node, processing its tasks and requesting the data it needs from the DataNodes.
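To make the divide-and-conquer idea a bit more concrete, here is a minimal sketch of the classic word-count job, based on the canonical Apache Hadoop example and written against the standard org.apache.hadoop.mapreduce API (this is my own illustration, not part of BDE): the input is split across mappers running on the worker nodes, and reducers merge the partial counts per word. In Hadoop 1.0 terms, this is exactly the kind of job the JobTracker breaks into tasks and hands to the TaskTrackers.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper works on its own split of the input (the "divide")
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);  // emit (word, 1) for every token
      }
    }
  }

  // Reduce phase: partial counts from all mappers are merged per word (the "conquer")
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);  // final count for this word
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}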
This architecture had some scalability and efficiency problems, which is why an architectural change was made in Hadoop 2.0. The main change was splitting the JobTracker’s functionality into two new roles:
ResourceManager: now in charge of tracking resource usage and node liveness, enforcing allocation invariants, and arbitrating resource contention among tenants;
ApplicationMaster: responsible for coordinating the job’s execution plan, requesting resources from the ResourceManager, and coordinating the execution of its tasks.
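To give a feel for how this split works in practice, here is a minimal, hypothetical sketch of a client talking to the ResourceManager through the YarnClient API from hadoop-yarn-client (the class name, application name and echo command are just illustrative). It asks the ResourceManager for a new application id and submits a trivial ApplicationMaster container; a real ApplicationMaster would then negotiate further containers with the ResourceManager and coordinate the tasks itself.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster's ResourceManager
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager for a new application id
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
    context.setApplicationName("yarn-sketch");

    // Container that will run the ApplicationMaster; here it only runs a shell
    // command, a real ApplicationMaster would request and manage task containers
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("echo hello-from-am"));
    context.setAMContainerSpec(amContainer);

    // Resources the ApplicationMaster container needs
    Resource amResource = Records.newRecord(Resource.class);
    amResource.setMemory(256);
    amResource.setVirtualCores(1);
    context.setResource(amResource);

    // Submit to the ResourceManager, which schedules the ApplicationMaster
    ApplicationId appId = yarnClient.submitApplication(context);
    System.out.println("Submitted application " + appId);
    yarnClient.stop();
  }
}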
Of course, there's more to the architecture and how it works behind the scenes; if you want to know more, here's a good paper to digest.
As you can imagine, creating those Hadoop clusters with dozens of nodes, followed by installing and configuring the application on each one, takes a lot of time!! That’s where VMware comes into play.
VMware vSphere Big Data Extensions (BDE) is VMware’s solution to the problem of managing and deploying Hadoop efficiently and in an agile way. No matter how complex your Hadoop solution may be, BDE can deploy it, automating node creation, application installation and configuration in a much shorter time than the traditional way, saving you hours, if not days, of manual tasks.
To make it even better, BDE works with the distribution of your choice: Apache Hadoop, Cloudera, Pivotal, Hortonworks and MapR.
Now that we know the basics of BDE, we are ready to explore more advanced topics: