Deploying a Hadoop Cluster on Windows Azure

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. If you're looking for guidance on deploying a Hadoop cluster on Windows Azure, then be sure to check out the latest blog post, "Hadoop in Azure".

This post demonstrates how to create a typical cluster with a Name Node, a Job Tracker and a customizable number of Slaves. It also outlines how to dynamically change the number of Slaves using the Windows Azure Management Portal.

Follow these steps to create an Azure package for your Hadoop cluster:

Download all dependencies

  • This Visual Studio 2010 project is pre-configured with Roles for each Hadoop component. Don't worry if you don't have VS or don't want to install the Express edition; you can do everything from the command line.
  • The cluster configuration templates.
  • Install the latest Azure SDK. As of this writing, the latest version is 1.4.
  • The Hadoop binaries. I used version 0.21. Hadoop is distributed as a tar.gz file, so you'll need to convert it to a ZIP file. You can use 7-Zip for the task.
  • Install Cygwin and package it in a single ZIP file. Hadoop 0.21 requires Cygwin under Windows. It's fine if you don't know anything about it: Hadoop uses it behind the scenes, so you won't even need to launch it. There's an ongoing effort to remove this dependency for Hadoop 0.22, but it's not ready yet. Just run the Cygwin installer and accept all defaults. You should end up with Cygwin installed in c:\cygwin. Create a compressed folder of c:\cygwin called cygwin.zip.
  • Download the latest version of Yet Another Java Service Wrapper (YAJSW).
  • The last dependency is a Java VM to host Hadoop and YAJSW. If you don't want to update any of the configuration files in this guide, you'll need to bundle your favorite JVM in a ZIP file called jdk.zip. All JVM files must be in a folder also called jdk inside the ZIP file. If your JVM is installed under C:\Program Files\Java\jdk1.6.0_<revision>\, you'll need to rename (or copy) the jdk1.6.0_<revision> folder to jdk and zip it.
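
If you'd rather script the repackaging steps above than do them by hand, here is a minimal Python sketch using only the standard library. It is one possible approach under stated assumptions, not part of the original project: the helper names tar_gz_to_zip and zip_folder are hypothetical, and so are the example file names other than cygwin.zip and jdk.zip. It also assumes that cygwin.zip should contain a top-level cygwin folder, as a Windows compressed folder would.

```python
import os
import tarfile
import zipfile


def tar_gz_to_zip(tar_gz_path, zip_path):
    """Repack a .tar.gz archive as a .zip, keeping the member paths.

    Only regular files are copied; symlinks and empty directories
    in the tarball are skipped.
    """
    with tarfile.open(tar_gz_path, "r:gz") as tar, \
            zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for member in tar:
            if member.isfile():
                with tar.extractfile(member) as src:
                    zf.writestr(member.name, src.read())


def zip_folder(folder, zip_path, top_level_name):
    """Zip `folder` so its files sit under `top_level_name/` in the archive.

    This covers the "rename the folder to jdk and zip it" step without
    touching the installed folder on disk.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                relative = os.path.relpath(full_path, folder)
                # ZIP entries always use forward slashes.
                arcname = top_level_name + "/" + relative.replace(os.sep, "/")
                zf.write(full_path, arcname)


# Example calls (paths are placeholders; adjust them to your machine):
#   tar_gz_to_zip("hadoop.tar.gz", "hadoop.zip")
#   zip_folder(r"c:\cygwin", "cygwin.zip", "cygwin")
#   zip_folder(r"C:\path\to\your\jdk", "jdk.zip", "jdk")
```

The third argument of zip_folder sets the folder name inside the archive, so the JDK ends up under jdk/ regardless of what the installed folder is called.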

The rest of the article continues here.