Hadoop 3.2.1 on Ubuntu 18.04 on Oracle VM VirtualBox
Hadoop is a framework for distributed storage and processing of large data sets across clusters of machines. Here we will use it to set up a 3-node cluster: one master and 2 slaves.
Setting up the network configuration on the master (or primary).


After installing Ubuntu, we update the package lists:
sudo apt update
Now we install SSH:
sudo apt install ssh
Now we add the following line to the end of the ~/.bashrc file:
export PDSH_RCMD_TYPE=ssh
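A quick way to do this from the terminal (assuming a bash shell) and reload the file right away is:
echo 'export PDSH_RCMD_TYPE=ssh' >> ~/.bashrc
source ~/.bashrc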
We generate an SSH key with:
ssh-keygen -t rsa -P ""
Then we append it to the authorized_keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
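Optionally, we can tighten the permissions on that file and confirm that passwordless login to the local machine now works (type exit to return to the original shell):
chmod 600 ~/.ssh/authorized_keys
ssh localhost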
Hadoop needs Java 8, so we install the OpenJDK 8 JDK:
sudo apt install openjdk-8-jdk
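To confirm the install, we can check the reported version, which should be a 1.8.0 release:
java -version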
Now we download Hadoop itself:
sudo wget -P ~ https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
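The archive still needs to be unpacked so that the hadoop folder used in the next steps exists; assuming it was downloaded to the home directory as above:
cd ~
sudo tar xzf hadoop-3.2.1.tar.gz
sudo mv hadoop-3.2.1 hadoop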
Now we need to edit some files.
In hadoop-env.sh we set JAVA_HOME:
sudo nano ~/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
Then we move the Hadoop folder to /usr/local:
sudo mv hadoop /usr/local/hadoop/
Now we need to update the system-wide environment variables.
sudo nano /etc/environment
To PATH we add:
:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
On a new line we add JAVA_HOME:
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
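For reference, the finished file should look something like this (the exact default PATH entries may differ on your install):
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"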
Now we need to create a new user and give it some privileges.
sudo adduser hadoopuser
sudo usermod -aG hadoopuser hadoopuser
sudo adduser hadoopuser sudo
Now we give it ownership of the Hadoop folder and adjust the permissions:
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
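We can confirm the new ownership with:
ls -ld /usr/local/hadoop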
Configuring host names.
sudo nano /etc/hosts
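The file needs a line for each node. The IP addresses below are just placeholders; replace them with the ones your VMs actually get (shown by running ip addr on each machine, and the slave entries can be filled in once the clones are up):
192.168.56.101 master
192.168.56.102 slave1
192.168.56.103 slave2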

Now we shut down the VM (power it off) and clone it twice.
In the clone dialog there is an Expert Mode button at the bottom; there we need to make sure new MAC addresses are generated for the clones.
Now on the primary machine we need to change its hostname to master.
sudo nano /etc/hostname
There we change whatever might be there to master.
On each slave we repeat the process, setting the hostname to slave1 and slave2 respectively.
And restart the machines to ensure the changes take effect.
Now on the primary machine we will change user to hadoopuser.
su - hadoopuser
We create a new SSH key:
ssh-keygen -t rsa
And we copy it to each machine with:
ssh-copy-id hadoopuser@master
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2
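If the keys were copied correctly, we can now log in to each machine from the master without being asked for a password, for example:
ssh hadoopuser@slave1
exit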
Still on the master, we edit the core-site.xml file:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following inside the <configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
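For reference, the file ships with an empty <configuration> element, so the finished core-site.xml should look roughly like this:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>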
Now we do the same for hdfs-site.xml:
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following inside the <configuration> tags:
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
Adding the workers:
sudo nano /usr/local/hadoop/etc/hadoop/workers
slave1
slave2
Now we copy these files to the slaves:
scp /usr/local/hadoop/etc/hadoop/* slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* slave2:/usr/local/hadoop/etc/hadoop/
To ensure that the machines are using the updated environment variables, we run on each of them:
source /etc/environment
Now we format HDFS (the Hadoop Distributed File System):
hdfs namenode -format
Configuring Yarn:
On the master machine:
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
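These exports only last for the current shell session; one option (not strictly required for the remaining steps) is to append them to hadoopuser's ~/.bashrc so they survive logouts:
cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
EOF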
On the slaves we have to edit the yarn-site.xml:
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
And we add the following inside the <configuration> tags:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
Now back on the master:
start-dfs.sh
start-yarn.sh
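To confirm that everything came up, jps (which ships with the JDK) lists the running Java processes; on the master we should see NameNode, SecondaryNameNode and ResourceManager, and on each slave DataNode and NodeManager:
jps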
To stop everything, just replace start with stop, and if you want you can use all instead of dfs or yarn.
Now we can check the cluster in a browser on the master machine at http://master:8088/cluster (the YARN ResourceManager web UI).
