Friday, April 28, 2017

Prerequisites on each node before building the Hadoop cluster


Prerequisites on each node before building the cluster:

1.  Choose a supported operating system.
2.  Choose a supported Java version (see the tested versions on the Hadoop wiki):

https://wiki.apache.org/hadoop/HadoopJavaVersions
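
Before going further, it is worth a quick sanity check that the Java installed on each node matches one of the versions listed on the wiki page above (the exact output varies by JDK vendor and release):

java -version

The version string reported should be one of the tested releases from the link above.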

3.  Switch off iptables:


Netfilter is the host-based firewall framework built into the Linux kernel. It ships with most Linux distributions and is typically activated by default. The firewall is controlled through the iptables utility (iptables applies to IPv4), and it should be turned off on Hadoop nodes so that cluster traffic between the nodes is not blocked.

Type the following two commands (you must be logged in as the root user):

# /etc/init.d/iptables save
# /etc/init.d/iptables stop

To turn off firewall on reboot:
# chkconfig iptables off
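
To verify that the firewall is really stopped and will not come back after a reboot, check its status and its run-level configuration (these commands apply to RHEL/CentOS 6-style init scripts; newer releases that ship firewalld use systemctl instead):

# service iptables status
# chkconfig --list iptables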

4.  Disabling Transparent Hugepage Compaction

Transparent hugepage support automatically backs dynamically allocated memory with larger pages. This is known to degrade Hadoop performance, so leaving it enabled is not recommended.

For RHEL, depending on the version, disable transparent hugepage compaction by adding the following commands to /etc/rc.local (rc.local is the run-level script that gets executed after all the normal services are started):

echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled

When reading the above files (defrag/enabled), the active value is the one shown in square brackets:
[always] never means it is enabled
always [never] means it is disabled
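
You can confirm the change by reading the files back; which of the paths exist depends on the RHEL version, as noted above:

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

Both should show [never] once compaction is disabled.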

5.  vm.swappiness Linux Kernel Parameter


Set vm.swappiness to a value of 1.  We do not want the kernel swapping Hadoop processes out on the nodes, so swappiness needs to be as low as possible.

sysctl -w vm.swappiness=1
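
Note that sysctl -w only changes the value in the running kernel and is lost on reboot. To make the setting persistent, also add it to /etc/sysctl.conf and reload:

echo "vm.swappiness=1" >> /etc/sysctl.conf
sysctl -p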

General onboarding questions for Hadoop

When companies plan to onboard Apache Hadoop, some of these questions will arise:

1.  Prerequisites before building a cluster.
2.  Division of roles and responsibilities among the team.
3.  Infrastructure servers.
4.  Multi-tenancy environment.
5.  Security around it.
6.  Capacity planning.
7.  How users can access Hadoop.
8.  User convenience.
9.  What tools to use.
10. Tools for analytics.
11. Tools for data ingestion.
12. How to store data efficiently on the HDFS cluster.
13. Setting up a disaster recovery cluster.

Introduction to Hadoop


     We are all aware of the existence of Big Data and the role Apache Hadoop plays in this world. No, we are not going to talk about the 3 Vs, nor about big data and why Hadoop is needed, nor compare traditional BI systems with Hadoop, nor discuss its distributed-computing capabilities.

     These topics have already been discussed at great length elsewhere, and by now everyone should have a fairly thorough understanding of them.

This blog is about the practical implementation of Hadoop in the real world.

Before that, you may be interested in knowing why an organisation decides to tap into the resources of Big Data. Of course, to turn dormant data into a powerhouse of information from which the company can benefit immensely. The following are some of the outcomes organisations look for when they decide to reap the advantages of huge amounts of data.



1. Fraud detection.
2. Customer behavior and insights.
3. Business optimization.
4. Predictive analysis.
5. Targeted advertising.
6. Business realization of social media data and semi/unstructured data.
7. Staying ahead of the competitors.
8. Trend analysis.
9. Pattern identification.
