Tuesday, May 14, 2013

Hadoop plus Cloudera Manager - Auto Configuration and Deployment


How-to: Automate Your Hadoop Cluster from Java


One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. At Cloudera, which at any time maintains hundreds of test and development clusters in different configurations, this process presents a lot of operational headaches if not done in an automated fashion. In this post, I’ll describe an approach to cluster automation that works for us, as well as many of our customers and partners.

Taming Complexity

At Cloudera engineering, we have a big support matrix. We work on many versions of CDH (multiple release trains, plus things like rolling upgrade testing), CDH runs on a wide variety of OS distros (RHEL 5 & 6, Ubuntu Precise & Lucid, Debian Squeeze, and SLES 11), and it supports complex configuration combinations: highly available HDFS or simple HDFS, Kerberized or non-secure, YARN or MR1 as the execution framework, and so on. Clearly, we need an easy way to spin up a new cluster with the desired setup, which we can then use for integration, testing, customer support, demos, and so on.
This concept is not new; there are several other examples of Hadoop cluster automation solutions. For example, Yahoo! has its own infrastructure tools, and you can find publicly available Puppet recipes, with various degrees of completeness and maintenance. Furthermore, there are tools that work only with a particular virtualization environment. However, we needed a solution that is more powerful and easier to maintain.
Cloudera’s automation system for Hadoop cluster deployment provisions VMs on demand in our internal cloud. As cool as that capability sounds, it’s actually not the most interesting part of the solution. More important is that we can install and configure Hadoop according to precise specifications through a powerful yet simple abstraction: Cloudera Manager’s open source REST API.


Cloudera Manager API

This is what our automation system does:
  1. Installs the Cloudera Manager (CM) packages on the cluster and starts the CM server.
  2. Uses the API to add hosts, install CDH, and define the cluster and its services.
  3. Uses the API for configuration: tuning heap sizes, setting up HDFS HA, turning on Kerberos security and generating keytabs, customizing service directories and ports, and so on. Every configuration option available in Cloudera Manager is exposed in the API.
  4. Uses the API's management features, such as gathering logs and monitoring information, starting and stopping services, polling cluster events, and creating a DR replication schedule. We rely on these features extensively in our automated tests.
The end result is a system that has become an indispensable part of our engineering process. It makes the Hadoop setup easy to maintain. For example, the same API call retrieves logs from HDFS, HBase, or any other service, without the user worrying about the different log locations. The same API call stops any service, without the user worrying about service-specific steps (HBase, for example, needs to be shut down gracefully). And when Cloudera Manager adds support for more services (e.g. Impala), their setup flows are the same as the existing ones.

Use Cases from Partners and Customers

Many of our customers and partners have also adopted the Cloudera Manager API for cluster automation:
  • Some OEM and hardware partners, delivering Hadoop-in-a-box appliances, use the API to set up CDH and Cloudera Manager on bare metal in the factory.
  • Some of our high-growth customers are constantly deploying new clusters, and have automated that with a combination of Puppet and the Cloudera Manager API. Puppet does the OS-level provisioning, and the software installation. After that, the Cloudera Manager API sets up the Hadoop services and configures the cluster.
  • Others have found it useful to integrate the API with their reporting and alerting infrastructure. An external script can poll the API for health and metrics information, as well as the stream of events and alerts, to feed into a custom dashboard.

Code Samples

A previous blog post gave an example of setting up a CDH4 cluster using the Python API client. Instead of repeating that, let me introduce you to the Java API client. (Although our internal automation tool uses the Python client today, we plan to move to Java to work better with our other Java-based tools like jclouds.) To use the Java client, add this repository and dependency to your project’s pom.xml:

<project>
  <repositories>
    <repository>
      <id>cdh.repo</id>
      <url>https://repository.cloudera.com/content/groups/cloudera-repos</url>
      <name>Cloudera Repository</name>
    </repository>
    …
  </repositories>
  <dependencies>
    <dependency>
      <groupId>com.cloudera.api</groupId>
      <artifactId>cloudera-manager-api</artifactId>
      <version>4.5.2</version>
    </dependency>
    …
  </dependencies>
  ...
</project>


The Java client works like a proxy. It hides from the caller any details about REST, HTTP, and JSON. The entry point is a handle to the root of the API:
RootResourceV3 apiRoot = new ClouderaManagerClientBuilder()
            .withHost("cm.cloudera.com")
            .withUsernamePassword("admin", "admin")
            .build()
            .getRootV3();

From the root, you can traverse down to all other resources. (It’s called “v3” because the current Cloudera Manager API version is version 3, but the same builder can also return a v1 or v2 root.) Here is the tree view of some of the key resources and the interesting operations they support:
    * RootResourceV3
        * ClustersResourceV3: hosts membership, start cluster
            * ServicesResourceV3: config, get metrics, HA, service commands
                * RolesResource: add roles, get metrics, logs
                * RoleConfigGroupsResource: config
            * ParcelsResource: parcels management
        * HostsResource: hosts management, get metrics
        * UsersResource: users management
All of these are covered in the Javadoc and in the full API documentation. To give a short concrete example, here is the code to list the clusters and start the first one:

// List of clusters
ApiClusterList clusters = apiRoot.getClustersResource()
                                 .readClusters(DataView.SUMMARY);
for (ApiCluster cluster : clusters) {
  LOG.info("{}: {}", cluster.getName(), cluster.getVersion());
}

// Start the first cluster
ApiCommand cmd = apiRoot.getClustersResource()
                        .startCommand(clusters.get(0).getName());
while (cmd.isActive()) {
   Thread.sleep(100);
   cmd = apiRoot.getCommandsResource().readCommand(cmd.getId());
}
LOG.info("Cluster start {}", cmd.getSuccess() ?
            "succeeded" : "failed " + cmd.getResultMessage());
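
The loop above keeps polling for as long as the command stays active, so a stuck command would block the caller indefinitely. As a minimal sketch of one way to bound the wait, the helper below polls at a fixed interval and gives up after a timeout. `Poller`, `pollUntilDone`, and their parameters are hypothetical names introduced for illustration; they are not part of the Cloudera Manager client, and the helper uses only the Java standard library.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical helper (not part of the Cloudera Manager client). */
final class Poller {
  private Poller() {}

  /**
   * Re-reads a value until it is no longer "active", or fails once
   * timeoutMillis has elapsed.
   *
   * @param initial        the first observed state (e.g. a command handle)
   * @param refresh        fetches the latest state from the previous one
   * @param isActive       returns true while the operation is still running
   * @param intervalMillis pause between two refreshes
   * @param timeoutMillis  upper bound on the total wait
   */
  static <T> T pollUntilDone(T initial,
                             UnaryOperator<T> refresh,
                             Predicate<T> isActive,
                             long intervalMillis,
                             long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    T current = initial;
    while (isActive.test(current)) {
      // Compare via subtraction so the check stays correct if nanoTime wraps.
      if (System.nanoTime() - deadline >= 0) {
        throw new IllegalStateException("Timed out after " + timeoutMillis + " ms");
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for callers
        throw new IllegalStateException("Interrupted while polling", e);
      }
      current = refresh.apply(current);
    }
    return current;
  }
}
```

With the client shown earlier, a call might look like `Poller.pollUntilDone(cmd, c -> apiRoot.getCommandsResource().readCommand(c.getId()), ApiCommand::isActive, 1000, 600000)`, assuming `isActive()` never returns null. A one-second interval is also gentler on the server than the 100 ms sleep in the example.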

For a full example of cluster deployment using the Java client, see whirr-cm. Specifically, jump straight to CmServerImpl#configure to see the core of the action.
You may find it interesting that the Java client is maintained with very little effort. Using Apache CXF, the client proxy comes free, quite magically. It figures out the right HTTP call to make by inspecting the JAX-RS annotations in the REST interface, which is the same interface used by the Cloudera Manager API server. Therefore, new API methods are available to the Java client automatically.
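
To make the idea of an annotation-driven proxy concrete, here is a toy illustration built on the JDK's `java.lang.reflect.Proxy`. It is not how Apache CXF is implemented, and it does not use JAX-RS: the `@Get` annotation, the `ClustersApi` interface, and the host name are all made up for this sketch, and instead of sending a request the handler only returns a description of the call it would make.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.regex.Matcher;

final class AnnotationProxyDemo {
  private AnnotationProxyDemo() {}

  /** Hypothetical stand-in for JAX-RS annotations such as @GET and @Path. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  @interface Get {
    String value();
  }

  /** Hypothetical REST interface; the real API interfaces look different. */
  interface ClustersApi {
    @Get("/clusters")
    String readClusters();

    @Get("/clusters/{name}")
    String readCluster(String name);
  }

  /** Builds a client whose behavior is derived entirely from the annotations. */
  static ClustersApi newClient(String baseUrl) {
    InvocationHandler handler = (proxy, method, args) -> {
      Get get = method.getAnnotation(Get.class);
      if (get == null) {
        // Methods without @Get (including toString/equals/hashCode) are not handled here.
        throw new UnsupportedOperationException(method.getName());
      }
      String path = get.value();
      if (args != null) {
        // Simplification: fill {placeholders} in argument order.
        for (Object arg : args) {
          path = path.replaceFirst("\\{[^}]+\\}",
              Matcher.quoteReplacement(String.valueOf(arg)));
        }
      }
      // A real proxy would send the request and decode the JSON response.
      return "GET " + baseUrl + path;
    };
    return (ClustersApi) Proxy.newProxyInstance(
        ClustersApi.class.getClassLoader(),
        new Class<?>[] {ClustersApi.class},
        handler);
  }
}
```

Adding a method to `ClustersApi` makes it callable through the proxy without touching `newClient`, which is the property the paragraph above describes: the interface is the single source of truth for the client.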
