Hadoop 3.2.1 on Ubuntu 18.04 on Oracle VM VirtualBox
Hadoop is a framework for distributed storage and processing of large data sets across clusters of machines. Here we will use it to set up a 3-node cluster: one master and 2 slaves.
Setting up the network configuration on the master (or primary).


After installing Ubuntu, we update the package lists:
sudo apt update
Now we install SSH:
sudo apt install ssh
Now we add the following line to the end of the ~/.bashrc file:
export PDSH_RCMD_TYPE=ssh
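A quick way to do this from the terminal (assuming a bash shell) and reload the file right away is:
echo 'export PDSH_RCMD_TYPE=ssh' >> ~/.bashrc
source ~/.bashrc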
We generate an SSH key with:
ssh-keygen -t rsa -P ""
Then we append it to the authorized_keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
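Optionally, we can tighten the permissions on that file and confirm that passwordless login to the local machine now works (type exit to return to the original shell):
chmod 600 ~/.ssh/authorized_keys
ssh localhost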
Hadoop needs Java 8, so we install the OpenJDK 8 JDK:
sudo apt install openjdk-8-jdk
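To confirm the install, we can check the reported version, which should be a 1.8.0 release:
java -version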
Now we download Hadoop itself:
sudo wget -P ~ https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
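The archive still needs to be unpacked so that the hadoop folder used in the next steps exists; assuming it was downloaded to the home directory as above:
cd ~
sudo tar xzf hadoop-3.2.1.tar.gz
sudo mv hadoop-3.2.1 hadoop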
Now we need to edit some files.
In hadoop-env.sh we set JAVA_HOME:
sudo nano ~/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
Then we move the Hadoop folder to /usr/local:
sudo mv hadoop /usr/local/hadoop/
Now we need to update the system-wide environment variables.
sudo nano /etc/environment
To PATH we add:
:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
On a new line we add JAVA_HOME:
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
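For reference, the finished file should look something like this (the exact default PATH entries may differ on your install):
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"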
Now we need to create a new user and give it some privileges.
sudo adduser hadoopuser
sudo usermod -aG hadoopuser hadoopuser
sudo adduser hadoopuser sudo
Now we give it ownership of the Hadoop folder and adjust the permissions:
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
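We can confirm the new ownership with:
ls -ld /usr/local/hadoop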
Configuring host names.
sudo nano /etc/hosts
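The file needs a line for each node. The IP addresses below are just placeholders; replace them with the ones your VMs actually get (shown by running ip addr on each machine, and the slave entries can be filled in once the clones are up):
192.168.56.101 master
192.168.56.102 slave1
192.168.56.103 slave2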

Now we shut down the VM (power it off) and clone it twice.
In the clone dialog there is an Expert Mode button at the bottom; there we need to make sure new MAC addresses are generated for the clones.
Now on the primary machine we need to change its hostname to master.
sudo nano /etc/hostname
There we change whatever might be there to master.
On each slave we repeat the process, setting the hostname to slave1 and slave2 respectively.
And restart the machines to ensure the changes take effect.
Now on the primary machine we will change user to hadoopuser.
su - hadoopuser
We create a new SSH key:
ssh-keygen -t rsa
And we copy it to each machine with:
ssh-copy-id hadoopuser@master
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2
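If the keys were copied correctly, we can now log in to each machine from the master without being asked for a password, for example:
ssh hadoopuser@slave1
exit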
Still on the master, we edit the core-site.xml file:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following inside the <configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
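For reference, the file ships with an empty <configuration> element, so the finished core-site.xml should look roughly like this:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>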
Now we do the same for hdfs-site.xml:
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following inside the <configuration> tags:
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
Adding the workers:
sudo nano /usr/local/hadoop/etc/hadoop/workers
slave1
slave2
Now we copy these files to the slaves:
scp /usr/local/hadoop/etc/hadoop/* slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* slave2:/usr/local/hadoop/etc/hadoop/
To ensure that the machines are using the updated environment variables, we run on each of them:
source /etc/environment
Now we format HDFS (the Hadoop Distributed File System):
hdfs namenode -format
Configuring Yarn:
On the master machine:
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
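These exports only last for the current shell session; one option (not strictly required for the remaining steps) is to append them to hadoopuser's ~/.bashrc so they survive logouts:
cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
EOF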
On the slaves we have to edit the yarn-site.xml:
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
And we add the following inside the <configuration> tags:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
Now back on the master:
start-dfs.sh
start-yarn.sh
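To confirm that everything came up, jps (which ships with the JDK) lists the running Java processes; on the master we should see NameNode, SecondaryNameNode and ResourceManager, and on each slave DataNode and NodeManager:
jps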
To stop everything, just replace start with stop, and if you want you can use all instead of dfs or yarn.
Now we can check the cluster in a browser on the master machine at http://master:8088/cluster (the YARN ResourceManager web UI).
