Apache Hadoop Best Practices and Infrastructural Considerations

Enterprise infrastructure, in organizations large and small, now includes a range of big data applications running on both VMware and bare-metal servers. With big data technology and its applications constantly evolving, it is difficult for database administrators to keep up with the required resources, from conventional Hadoop nodes to advanced in-memory applications such as Apache Ignite. In this blog, we share some valuable insights for big data users on Hadoop best practices and infrastructure considerations.

A quick overview of do’s and don’ts

When it comes to big data administration, some quick do’s and don’ts for users include, but are not limited to, the following.

Do’s:

  • Consider redundancy
  • Use commodity servers
  • Start small and stay focused
  • Monitor closely
  • Establish an ongoing data integration process
  • Use compression
  • Build multiple environments (Dev, Test, Prod, etc.)

Don’ts:

  • Mix master nodes and data nodes
  • Virtualize any data nodes
  • Overbuild

Let’s discuss some of these in more detail.

Embracing redundancy and using commodity servers

One major problem in enterprise big data management is that the infrastructure team may be brought into the environment at a later stage and may not have enough knowledge of how the Hadoop ecosystem works. This can lead to an over-designed cluster that stretches the budget. The primary objective of Hadoop is to reduce cost while enabling redundant data storage and analytical processing, and this can be achieved with a set of low-cost JBOD (just a bunch of disks) servers.
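
As a small illustration, the HDFS replication factor is what provides that redundancy on commodity JBOD hardware. The sketch below assumes a plain Hadoop client with the cluster’s configuration on its classpath; the file path is purely illustrative, and in practice the default replication factor is normally set once in hdfs-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block replication for files created by this client;
        // cluster-wide defaults normally live in hdfs-site.xml.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // Raise replication for an especially important existing file
        // (the path here is illustrative only).
        fs.setReplication(new Path("/data/critical/events.parquet"), (short) 3);
        fs.close();
    }
}
```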

Starting small and staying focused

Statistics show how many business projects fail because of the complexity and expense of managing them. This is where Hadoop is genuinely useful: it lets users start small and add nodes as and when required, step by step. You can begin with a project of any size, however small, and then focus on developing the infrastructure as you become familiar with how this new technology works.

Monitoring closely

Even though Hadoop ensures redundancy at both the data level and the management level, many of its moving parts need to be closely monitored. There are many open-source packages available for this purpose. Out of the box, these applications can monitor the services and the different nodes in the cluster, and they also make it easy to run additional checks such as server disk health.
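
As an illustration of the kind of additional check mentioned above, the HDFS client API can report per-DataNode capacity alongside whatever dedicated monitoring stack is in place. The sketch below assumes a client with access to the cluster configuration; it simply prints the remaining space on each live DataNode and is not a substitute for a full monitoring package.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DataNodeReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // One line per DataNode: hostname plus remaining capacity.
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                long remainingGb = node.getRemaining() / (1024L * 1024 * 1024);
                System.out.printf("%s remaining: %d GB%n",
                        node.getHostName(), remainingGb);
            }
        }
        fs.close();
    }
}
```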

Ongoing data integration

One significant advantage of Hadoop is that it lets users populate data first and define the data structures later. It is fairly easy to move data in and out of a Hadoop ecosystem with tools like Flume and Sqoop; however, it is also essential to establish a strong data integration procedure upfront, covering staging layers, naming standards, and storage locations.
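
One way to make such a procedure concrete is to agree on the staging and curated paths before any data lands. The sketch below is only an example of such a convention; the zone names, source system ("crm"), and dataset name are hypothetical, and ingestion tools such as Sqoop or Flume would then be pointed at the resulting staging path.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.time.LocalDate;

public class StagingLayout {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical convention: /<zone>/<source-system>/<dataset>/ingest_date=<yyyy-MM-dd>
        String dataset = "orders";                       // illustrative dataset name
        String ingestDate = LocalDate.now().toString();  // partition by ingestion date

        Path staging = new Path(String.format(
                "/staging/crm/%s/ingest_date=%s", dataset, ingestDate));
        Path curated = new Path(String.format(
                "/curated/crm/%s/ingest_date=%s", dataset, ingestDate));

        // Ingestion tools land raw data in the staging path; downstream
        // jobs promote validated data to the curated path.
        fs.mkdirs(staging);
        fs.mkdirs(curated);
        fs.close();
    }
}
```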

Using compression

For many years, enterprises have had a hazy relationship with compression. Although it can save a great deal of space, its adverse impact on the performance of production systems made it a love-hate affair for enterprise database managers. Hadoop sidesteps this issue and thrives on heavy use of compression, which can cut storage requirements by up to about 80%. You may refer to RemoteDBA.com to see how it is done effectively.
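
For instance, a MapReduce job can compress both its intermediate map output and its final output. The sketch below uses the Snappy codec purely as an example; the actual codec choice and the space savings depend on the data and on which codecs are available on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output to cut shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-example");

        // Compress the final job output as well.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        // ... set mapper, reducer, and input/output paths before submitting ...
    }
}
```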

Building multiple environments

As with any other technical infrastructure project, Hadoop clients are well advised to build multiple environments. This is not only a general best practice; it is also vital given Hadoop’s nature. Projects within the Apache ecosystem are always changing, so maintaining a parallel non-production environment in which to test upgrades and new features is essential.

There is no doubt that Hadoop is a strong big data platform that continues to evolve and add new features to stay ahead of other vendors.

Hadoop disaster recovery

Even though it is a solid platform for big data, disasters can strike the Hadoop ecosystem just as they can any other technology. Sometimes it is a natural disaster or an extended power outage; other times it is human error, such as an administrator accidentally dropping a database or a bug corrupting entire data stores on HDFS.

Creating multiple replicas is an ideal way to protect Hadoop against disk or server failure. However, replicas alone do not prevent loss from natural calamities or human error. Here are some of the major considerations for protecting Hadoop against disaster scenarios.

  • Back up data regularly to a secondary storage location, a remote data center, or the cloud (see the sketch after this list). A backup kept at a remote location also offers protection against natural disasters.
  • Replicate data asynchronously from the Hadoop production cluster to a standby cluster. Replication mirrors data from production to a standby cluster in a different data center, but because it keeps no older copies of the data, it does not protect against loss caused by human error or application-level corruption. It can, however, safeguard against natural disasters, widespread power outages, and the like.
  • Another approach is to replicate data synchronously from the production cluster to an alternative Hadoop cluster in a different data center. Like asynchronous replication, this does not protect against human error, but it can safeguard against natural disasters, power outages, and similar events.
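
As a minimal sketch of the backup idea in the first point above, the snippet below copies a dataset from a production cluster to a standby cluster under a timestamped backup path. The NameNode addresses and paths are placeholders, and for datasets of any real size a distributed copy tool such as DistCp would normally be used instead of this single-process copy.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class BackupToStandby {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // NameNode URIs are placeholders for the production and standby clusters.
        FileSystem prod = FileSystem.get(URI.create("hdfs://prod-nn:8020"), conf);
        FileSystem standby = FileSystem.get(URI.create("hdfs://dr-nn:8020"), conf);

        Path source = new Path("/curated/crm/orders");
        Path target = new Path("/backups/crm/orders/" + System.currentTimeMillis());

        // Copy the dataset to the standby cluster without deleting the source.
        FileUtil.copy(prod, source, standby, target, false, conf);

        prod.close();
        standby.close();
    }
}
```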

In summary, although real-time replication solutions can offer the best possible recovery options for Hadoop disasters, they all come with their own limitations and considerations. Enterprises should implement a viable Hadoop disaster recovery solution based on how critical their applications are and on the return on investment it offers. Otherwise, it can become unnecessary overhead and can even hurt the availability of production Hadoop systems by consuming excessive resources to manage the environment, defeating its real purpose.