Did you know that 90% of the world’s data has been produced in the past two years? That’s an impressive number, right?!
Companies of all sizes understand that there are opportunities to improve their customer relationships and product quality, and to spot trends, by analyzing this huge amount of data (hence the term “Big Data”). The problem is that much of this data is unstructured and cannot be analyzed with traditional tools. For instance, the data can come in the form of text, video, images, social media posts, you name it.
Driven by this challenge, a new application platform was born: Hadoop. In a nutshell, this platform takes a divide-and-conquer approach, using distributed components to process large amounts of data.
The main roles within the Hadoop 1.0 architecture are:
NameNode: in charge of managing the Hadoop Distributed File System (HDFS) namespace and data block placement;
DataNodes: store the data and retrieve it when necessary; you can scale out the number of nodes to store more data and provide data resiliency;
JobTracker: jobs are submitted to the JobTracker, which splits each job into tasks and sends them to the TaskTrackers for execution, controlling and scheduling the job’s execution;
TaskTracker: runs on each worker node, processing its tasks and requesting the data it needs from the DataNodes.
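To make the divide-and-conquer idea a bit more concrete, here is a minimal sketch of the classic word-count job, based on the canonical Apache Hadoop example and written against the standard org.apache.hadoop.mapreduce API (this is my own illustration, not part of BDE): the input is split across mappers running on the worker nodes, and reducers merge the partial counts per word. In Hadoop 1.0 terms, this is exactly the kind of job the JobTracker breaks into tasks and hands to the TaskTrackers.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper works on its own split of the input (the "divide")
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);  // emit (word, 1) for every token
      }
    }
  }

  // Reduce phase: partial counts from all mappers are merged per word (the "conquer")
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);  // final count for this word
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}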
This architecture had some scalability and efficiency problems, which is why an architectural change was made in Hadoop 2.0. The main change was splitting the JobTracker’s functionality into two new roles:
ResourceManager: now in charge of tracking resource usage and node liveness, enforcing allocation invariants, and arbitrating resource contention among tenants;
ApplicationMaster: responsible for coordinating the job’s execution plan, requesting resources from the ResourceManager, and coordinating the execution of its tasks.
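To give a feel for how this split works in practice, here is a minimal, hypothetical sketch of a client talking to the ResourceManager through the YarnClient API from hadoop-yarn-client (the class name, application name and echo command are just illustrative). It asks the ResourceManager for a new application id and submits a trivial ApplicationMaster container; a real ApplicationMaster would then negotiate further containers with the ResourceManager and coordinate the tasks itself.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster's ResourceManager
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager for a new application id
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
    context.setApplicationName("yarn-sketch");

    // Container that will run the ApplicationMaster; here it only runs a shell
    // command, a real ApplicationMaster would request and manage task containers
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("echo hello-from-am"));
    context.setAMContainerSpec(amContainer);

    // Resources the ApplicationMaster container needs
    Resource amResource = Records.newRecord(Resource.class);
    amResource.setMemory(256);
    amResource.setVirtualCores(1);
    context.setResource(amResource);

    // Submit to the ResourceManager, which schedules the ApplicationMaster
    ApplicationId appId = yarnClient.submitApplication(context);
    System.out.println("Submitted application " + appId);
    yarnClient.stop();
  }
}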
Of course, there's more to the architecture and how it works behind the scenes; if you want to know more, here's a good paper to digest.
As you can imagine, creating those Hadoop clusters with dozens of nodes, followed by installing and configuring the application on each one, takes a lot of time!! That’s where VMware comes into play.
VMware vSphere Big Data Extensions (BDE) is VMware’s solution to the problem of managing and deploying Hadoop efficiently and in an agile way. No matter how complex your Hadoop solution may be, BDE can deploy it, automating node creation, application installation and configuration in a much shorter time than the traditional way, saving you hours, if not days, of manual tasks.
To make it even better, BDE works with the distribution of your choice: Apache Hadoop, Cloudera, Pivotal, Hortonworks and MapR.
Now that we know the basics of BDE, we are ready to explore more advanced topics: