Deploying Hadoop on VMware vSphere - Topology Matters

As always in virtual environments, there are at least two aspects you should take into consideration when designing your solutions:

- Performance: the closer two VMs are to each other, the better the performance between them. For example, an application server and its database server perform best when both run on the same host.
- Availability: on the other hand, if you place all VMs on the same host, a host failure will take down your entire application as well.

With that said, placement policies have become especially important as Hadoop overcomes the traditional challenge of monolithic applications by providing a distributed platform, where replicas are provisioned in multiple locations, keeping the data available everywhere while enhancing the performance of the solution.

But how do you deploy Hadoop clusters with performance and availability in mind?
People can argue that it's possible to manually deploy clusters with topology awareness. Well, it's true, but in a new world of self-service and automation, it just doesn't fit anymore.

vSphere Big Data Extensions (BDE) automates this topology-aware placement decision so you can get the best out of your virtualized Hadoop in terms of performance and availability.

When creating a Hadoop cluster you can choose the following topologies:

- HOST_AS_RACK: a simplified topology in which each host is treated as a rack, which avoids placing all HDFS data block replicas on the same physical host;

- RACK_AS_RACK: the standard topology, in which only rack and host information is exposed to Hadoop (you must supply rack information to BDE);

- Hadoop Virtualization Extensions (HVE): provides enhanced cluster reliability and performance through refined replica placement, because Hadoop has full awareness of the topology on which it is running (you must supply rack information to BDE).

There’s a slight but crucial difference between RACK_AS_RACK and HVE.

To explain the difference between them, we will work with a hypothetical environment of 4 hosts in 2 different racks.

Topology information when using RACK_AS_RACK

Rack01: esxi01, esxi02
Rack02: esxi03, esxi04

HVE, on the other hand, introduces a new layer, called group, where all VMs within the same group run on the same host. With awareness of the group layer, HVE can refine locality-based policies to optimize performance (this will become clear in the performance section).

Topology information when using HVE
Rack01: Group01, esxi01, esxi02
Rack02: Group02, esxi03, esxi04
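
The difference between the three views can be sketched in a few lines of Python. This is only an illustration, not how BDE or Hadoop builds its topology: the `RACKS` inventory, the `location` helper, the path format, and the choice of naming one group per physical host are all assumptions made for this example.

```python
# Hypothetical inventory matching the example: 4 ESXi hosts in 2 racks.
RACKS = {
    "Rack01": ["esxi01", "esxi02"],
    "Rack02": ["esxi03", "esxi04"],
}


def location(host: str, topology: str) -> str:
    """Return an illustrative location path for a VM running on `host`."""
    rack = next(r for r, hosts in RACKS.items() if host in hosts)
    if topology == "HOST_AS_RACK":
        return f"/{host}"  # every host is treated as its own rack
    if topology == "RACK_AS_RACK":
        return f"/{rack}"  # only the physical rack is exposed
    if topology == "HVE":
        # Assumption: one group per physical host, placed under its rack.
        return f"/{rack}/group-{host}"
    raise ValueError(f"unknown topology: {topology}")


for topology in ("HOST_AS_RACK", "RACK_AS_RACK", "HVE"):
    print(topology, [location(h, topology) for h in ("esxi01", "esxi02", "esxi03")])
```

Under the rack-only view, esxi01 and esxi02 look identical; only the HVE view keeps them apart while still showing that they share a rack.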

- Availability
One of the most critical components of a Hadoop solution is the data it stores, which is held by the data nodes (HDFS). As a distributed solution, Hadoop creates replicas of the data across data nodes, providing better performance and high availability.

Using RACK_AS_RACK method:
- Multiple replicas are not placed on the same data node;
- The 1st replica is on the host of the writer;
- The 2nd replica is on a remote rack, other than the 1st one's;
- The 3rd replica is on the same rack as the 2nd one;
- Additional replicas are placed randomly.

As you can see, nothing keeps the 2nd and 3rd replicas from being placed on the same host. Guess what happens in the case of a host failure...
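
To make the risk concrete, here is a minimal sketch of the choice for the 3rd replica when only rack information is available. The VM names, the two-VMs-per-host layout, and the `third_replica_candidates` helper are hypothetical; the real HDFS placement code is more involved.

```python
from typing import NamedTuple


class DataNode(NamedTuple):
    name: str
    host: str
    rack: str


# Hypothetical layout: two data node VMs per ESXi host.
NODES = [
    DataNode(f"dn-{host}-{i}", host, rack)
    for rack, hosts in {"Rack01": ["esxi01", "esxi02"],
                        "Rack02": ["esxi03", "esxi04"]}.items()
    for host in hosts
    for i in (1, 2)
]


def third_replica_candidates(second: DataNode) -> list[DataNode]:
    """Rack awareness only: any other node in the 2nd replica's rack qualifies."""
    return [n for n in NODES if n.rack == second.rack and n != second]


second = next(n for n in NODES if n.name == "dn-esxi03-1")
candidates = third_replica_candidates(second)
same_host = [n.name for n in candidates if n.host == second.host]
print("candidates:", [n.name for n in candidates])
print("share a host with the 2nd replica:", same_host)
```

In this layout one of the three candidates runs on the same ESXi host as the 2nd replica, so a single host failure could take out two of the three copies.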

Using HVE method:
- Multiple replicas are not placed on the same data node or in the same group;
- The 1st replica is on the host or group of the writer;
- The 2nd replica is on a remote rack, other than the 1st one's;
- The 3rd replica is on the same rack as the 2nd one, but not on the same host;
- Additional replicas are placed randomly.

As you can see, with full topology awareness the 3rd replica is not placed together with the 2nd one, increasing the availability of the solution.
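
The HVE rules listed above can be sketched as a small placement function. This is a simplified illustration under stated assumptions, not Hadoop's actual implementation: the `DataNode` type, the `place_replicas` helper, and the VM names are hypothetical, the writer is assumed to be a data node itself, and the group is assumed to be equivalent to the physical host.

```python
import random
from typing import NamedTuple


class DataNode(NamedTuple):
    name: str
    host: str  # assumption: with HVE, the host doubles as the group
    rack: str


def place_replicas(writer: DataNode, nodes: list[DataNode],
                   rng: random.Random) -> list[DataNode]:
    """Pick three replica locations following the HVE rules described above."""
    first = writer  # 1st replica stays with the writer
    second = rng.choice([n for n in nodes if n.rack != first.rack])
    third = rng.choice([n for n in nodes
                        if n.rack == second.rack and n.host != second.host])
    return [first, second, third]


# Hypothetical layout: two data node VMs per ESXi host.
nodes = [DataNode(f"dn-{h}-{i}", h, r)
         for r, hs in {"Rack01": ["esxi01", "esxi02"],
                       "Rack02": ["esxi03", "esxi04"]}.items()
         for h in hs for i in (1, 2)]

first, second, third = place_replicas(nodes[0], nodes, random.Random(0))
print(first.name, second.name, third.name)
```

Whatever the random choices are, the 2nd and 3rd replicas share a rack but never a host, so losing one host costs at most one of those two copies.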

- Performance

When talking about performance, we consider how fast an HDFS client can access the data from the HDFS data nodes.

Using RACK_AS_RACK method:
Since the replicas are spread across data nodes running on several hosts, there is an equal chance of reading the data from any of them.

As you can see, the data could come from the 2nd data node, which does not provide the same performance as getting the data from the 1st data node.

Using HVE method:
With HVE, the HDFS client has full topology awareness and always picks the data from the nearest data node to achieve the best possible read throughput, in this case reading from the data node hosted alongside it.
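
One way to picture why the extra layer helps is a simplified tree-distance model, assumed here purely for illustration: each VM gets a location path, and the distance between two VMs is the number of hops up to their closest common ancestor and back down. The paths, the `distance` function, and the `nearest` helper are hypothetical and do not reproduce Hadoop's actual code.

```python
def distance(a: str, b: str) -> int:
    """Hops from each location up to their closest common ancestor."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def nearest(client: str, replicas: list[str]) -> list[str]:
    """Return every replica tied for the shortest distance to the client."""
    best = min(distance(client, r) for r in replicas)
    return [r for r in replicas if distance(client, r) == best]


# Rack-only view: the co-hosted data node is indistinguishable from a rack peer.
rack_view = nearest("/Rack01/client", ["/Rack01/dn-a", "/Rack01/dn-b"])

# HVE view (assumed paths): the group layer exposes the shared host.
hve_view = nearest("/Rack01/esxi01/client",
                   ["/Rack01/esxi01/dn-a", "/Rack01/esxi02/dn-b"])

print(rack_view)  # both replicas tie
print(hve_view)   # only the co-hosted replica remains
```

In the rack-only view the client has no reason to prefer `dn-a`, even though it runs on the same host. With the group layer, the co-hosted data node is strictly closer and is the only candidate left.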

It’s clear that HVE is the recommended method, but it depends on support from your Hadoop distribution.

If your distribution does not support HVE, use RACK_AS_RACK since it has more benefits than HOST_AS_RACK.

Read more about Hadoop Virtualization Extensions here.

Check this post on how to update the topology on BDE (coming soon).
