Introduction
Machine learning on Hadoop has evolved a lot since the debut of Apache Mahout. Back then, a data scientist had to understand at least the basics of Java programming and Linux shell scripting in order to assemble more complex MapReduce jobs and get the most out of Mahout. Mahout is also batch-oriented, which makes it ill-suited for interactive, exploratory data analysis. But now, machine learning at scale on top of Hadoop is getting a lot of traction with new-generation machine learning frameworks. By embracing YARN, these frameworks are no longer tied to the MapReduce paradigm and are capable of in-memory data processing. Some examples from this new generation are Apache Spark's MLlib and 0xdata's H2O. Moreover, Mahout is being ported to take advantage of those new frameworks, as shown in the "Mahout News" section of its home page.
In this tutorial I will show how to install H2O on a Hadoop cluster and run some basic machine learning tasks. For the Hadoop cluster environment I will use the Hortonworks HDP 2.1 Sandbox VM, and for H2O I will use the H2O driver specific to HDP 2.1 from the latest stable release.
1. Download and Install HDP Sandbox VM
Download and install the Hortonworks Sandbox VM. Note: as of this writing, H2O does not seem to run properly on the latest HDP Sandbox VM, which is version 2.2. For this tutorial I used HDP Sandbox version 2.1, which can be downloaded from here.
2. Download and Install H2O on the HDP Sandbox VM
Log in to your HDP Sandbox VM:
ssh root@<your-hdp-sandbox-ip-address> -p 22
Create a directory for H2O download and installation:
# cd /opt
# mkdir h2o
# cd h2o
Check for the latest H2O stable release here and click on the corresponding download link; this opens the download page, where you can copy the download link.
Download H2O:
# wget <your-h2o-download-link>
Unzip and launch H2O as a MapReduce process:
# unzip <your-h2o-downloaded-file>
# cd h2o-<version>/hadoop
# hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 1g -nodes 1 -output h2o_out
Note that the H2O job is launched with a memory limit of 1 GB. You can increase it, but first you have to configure YARN appropriately; you can do that by following the steps shown here. Also note that if you are on a real, multi-node cluster, you can specify '-nodes <n>', where n is the number of map tasks instantiated on Hadoop, each task being an H2O cluster node.
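As a rough sketch of what such a launch could look like (the 4 GB per mapper and 3 nodes below are illustrative values of my own, and they assume YARN's maximum container size has already been raised accordingly; also remember that the -output directory must not already exist, so remove it or choose a new name if you ran the 1-node example above):
# hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 4g -nodes 3 -output h2o_out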
If the cluster comes up successfully, you will get a message
similar to this:
H2O cluster (1 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...
Verify that the H2O cluster is up and running:
Open a web browser at http://<your-hdp-sandbox-ip-address>:8000/jobbrowser (HUE Job Browser) or http://<your-hdp-sandbox-ip-address>:8088/cluster (YARN Resource Manager web interface). You should see something similar to the image below (shown for the YARN Resource Manager UI):
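Alternatively, you can check from the sandbox shell that the H2O driver shows up among the applications currently running on YARN:
# yarn application -list
While the cluster is up, the H2O job should be listed there as a MAPREDUCE-type application.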
Notice that an H2O cluster runs as a map-only MapReduce process on top of Hadoop MapReduce v1 or Hadoop MapReduce on YARN. This is only for the purpose of starting up the cluster infrastructure; H2O tasks themselves do not run as Hadoop MapReduce jobs. If you are interested in understanding H2O's software architecture further, I suggest taking a look here.
3. Test #1: Run a Simple Machine Learning Task through the H2O Web Interface
Upload and prepare a small dataset in Hadoop: sometimes it is necessary to pre-process a dataset in order to prepare it for a machine learning task. In this case, we will use a dataset that is tab-separated, but some entries are separated by more than one tab. We will use a built-in Hive function that performs regular-expression find-and-replace, so that the outcome is the same dataset with entries separated by a single comma.
The first step is to download the dataset. We will use the
seeds dataset from the UCI Machine
Learning Repository. Go to the seeds
dataset repository here and click on “Data Folder”. Then copy the download link.
Download the seeds dataset to your sandbox and put the file into HDFS. We will use the guest user's home folder and HDFS folder for that:
# cd /home/guest
# wget <seeds-dataset-download-link>
# hdfs dfs -put seeds_dataset.txt /user/guest/seeds_dataset.txt
Now prepare the seeds dataset with help from Hive, as described
earlier:
# hive
hive> CREATE TABLE seeds_dataset(data STRING);
hive> LOAD DATA INPATH '/user/guest/seeds_dataset.txt' OVERWRITE INTO TABLE `seeds_dataset`;
hive> SELECT regexp_replace(data, '[\t]+', ',') FROM seeds_dataset LIMIT 10;
hive> INSERT OVERWRITE DIRECTORY '/user/guest/seeds_dataset' SELECT regexp_replace(data, '[\t]+', ',') FROM seeds_dataset;
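Before moving on, it is worth quitting Hive (quit;) and doing a quick sanity check that the comma-separated output landed where we expect it; the 000000_0 file name used in the next steps is simply what Hive produced for this job:
# hdfs dfs -ls /user/guest/seeds_dataset
# hdfs dfs -cat /user/guest/seeds_dataset/000000_0 | head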
Open the H2O web interface at: http://<your-hdp-sandbox-ip-address>:54321
Click on 'Data' -> 'Import Files':
Enter the file path on HDFS (hdfs://<your-hdp-sandbox-ip-address>:8020/user/guest/seeds_dataset/000000_0).
Then click on 'Submit':
If the upload is successful, your file will appear on the next
screen. Click on it:
In the next screen you have some options to parse the dataset before importing it into the H2O store. In this case, nothing else is necessary. Just click on 'Submit':
In the next screen, you can inspect the imported dataset and build machine learning models with it. Let's build a random forest classifier. Click on 'BigData Random Forest':
Now you have to enter your model parameters. In the 'destination key' field, enter a key value under which to save your model into the H2O store; in this case, the key used is 'RandomForest_Model'. In the 'source' field, the key value of the imported dataset should already be filled in; in this case, it is 'X000000_0.hex'. In the 'response' field, select the column that corresponds to the data class in the dataset, which is 'C8' in this case. Leave the other fields unchanged and click on 'Submit':
After running the model, you should see a screen with the results
containing a confusion matrix and some other useful model statistics. The
dataset has 210 rows equally divided into 3 distinct classes, therefore 70 rows
for each class. From the confusion matrix, we notice that the model correctly classified
61 out of 70 rows for class 1, 68 out of 70 for class 2, and 67 out of 70 for
class 3:
You can also inspect the H2O data store to see what you have so far. Click on 'Data' -> 'View All'. You should see two entries: one for the imported dataset and another for your random forest model:
4. Test #2: Run a Simple Machine Learning Task through the H2O R Package
Now, instead of using the H2O built-in web interface to drive the analysis, we will use the H2O R package. This allows a data scientist to drive a data analysis from an R client outside the Hadoop cluster. Notice that there is no R code running in the cluster: R is just a front-end that translates H2O's native Java interface into the familiar (for a growing number of data scientists) R syntax.
My approach for this scenario is to install R and RStudio Server (R's web IDE) in the HDP sandbox VM. But it should also work with a remote, local R instance accessing the H2O cluster, as the access protocol is just a REST interface over HTTP.
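Since that interface is plain HTTP on port 54321 (the same port as the H2O web UI), a quick way to convince yourself that a remote R session could reach the cluster is to check from your own machine that the port responds; any HTTP status code printed here is a good sign:
curl -s -o /dev/null -w '%{http_code}\n' http://<your-hdp-sandbox-ip-address>:54321/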
In the HDP sandbox VM, the first step is to enable the EPEL repository (if it is not already installed at its latest release), which is needed for a proper R installation. You also need to install the openssl package required by RStudio:
# yum install epel-release
# yum install R
# yum install openssl098e
Then, download and install the RStudio server package:
# wget http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm
# yum install --nogpgcheck rstudio-server-0.98.1091-x86_64.rpm
# rstudio-server verify-installation
Now, change the password for user guest, as you will use it to log in to RStudio Server, which requires a non-system user to log in:
# passwd guest
The next step is to install the H2O R package. But before that, you need to install the curl-devel package in your HDP sandbox VM:
# yum install curl-devel
Now, open RStudio in your browser at http://<your-hdp-sandbox-ip-address>:8787
and log in with user guest and the corresponding password you changed before.
In the console pane of RStudio, enter the command:
> install.packages(c('RCurl', 'rjson', 'statmod'))
After installation of those packages, you should see
somewhere in the output messages:
* DONE (RCurl)
* DONE (rjson)
* DONE (statmod)
Again in the console pane of RStudio, enter the command:
> install.packages('/opt/h2o/h2o-2.8.4.1/R/h2o_2.8.4.1.tar.gz', repos=NULL, type='source')
You should see somewhere in the output messages:
* DONE (h2o)
Finally, in the editor pane of RStudio, enter the following block
of code:
library(h2o)
localH2O <- h2o.init(ip = '192.168.100.130', port = 54321)
path <- 'hdfs://192.168.100.130:8020/user/guest/seeds_dataset/000000_0'
seeds <- h2o.importFile(localH2O, path = path, key = "seeds.hex")
summary(seeds)
head(seeds)
index.train <- sample(1:nrow(seeds), nrow(seeds)*0.8)
seeds.train <- seeds[index.train,]
seeds.test <- seeds[-index.train,]
seeds.gbm <- h2o.gbm(y = 'C8', x = c('C1','C2','C3','C4','C5','C6','C7'), data = seeds.train)
seeds.fit <- h2o.predict(object = seeds.gbm, newdata = seeds.test)
h2o.confusionMatrix(seeds.fit[,1], seeds.test[,8])
That block of R code does the following, in order of execution:
- Load the H2O R library in your R session
- Open a connection to your H2O cluster and store that connection in a local variable
- Define a local variable for the path of your dataset on HDFS
- Import your dataset from HDFS into H2O's in-memory key-value store and define a local variable that points to it
- Print a statistical summary of the imported dataset
- Print the first few lines of the imported dataset
- Create a vector with 80% of the row numbers randomly sampled from the imported dataset
- Create a training dataset containing the rows indexed by the vector above (80% of the entire dataset)
- Create a test dataset with the remaining rows (20% of the entire dataset)
- Train a GBM (Gradient Boosting Machine), an ensemble model for classification, on the training dataset
- Score the GBM model on the test dataset
- Print a confusion matrix showing the actual versus predicted classes from the test dataset
You can run the above block of code in its entirety, by selecting all of it in the editor pane of RStudio and clicking 'Run', or you can run it line by line, by positioning the cursor on the first line of code and clicking 'Run' repeatedly:
The seeds dataset defines 7 different measurements of seeds that appear to be useful for classifying them into 3 different groups. The idea is to learn rules or combinations of those measurements (features, variables, parameters, or covariates, in machine learning jargon) that are able to classify a given seed into one of those 3 classes. What is typically done is to separate the original dataset into training and test sets, train the model on the training set, and assess how well the model's predictive power generalizes on the test set.
By running the code above, we get the following output in the
console pane of RStudio after printing the confusion matrix:
         Predicted
Actual     1    2    3   Error
1         10    2    1   0.23077
2          0   14    0   0.00000
3          3    0   12   0.20000
Totals    13   16   13   0.14286
The output above shows a relatively good performance of the model, considering that the entire dataset is very small, with only 210 instances. Of these, 168 instances were used to train the model and 42 to test it. For class 1, the model correctly classified 10 out of 13 seeds; for class 2, all 14 seeds were correctly classified; and for class 3, 12 out of 15 seeds were correctly classified. Overall, 36 of the 42 test rows were classified correctly, i.e. an error rate of 6/42 ≈ 0.143, which matches the Totals row above.
It is also important to notice that this entire analysis ran on the H2O cluster. The only objects created locally in the R session were the vector defining the row indices used to build the training data and the variable holding the path to the dataset in HDFS. The remaining R objects in the session were merely pointers to the corresponding objects in the H2O cluster.