For many Java programmers, Hadoop is on the short list of technologies to learn or integrate into their application architectures. Thanks to the software engineers contributing to the Spring Hadoop project, getting Hadoop to work with Spring runtimes isn’t as hard as you might expect.

Using Spring Hadoop in Spring Projects

Spring Hadoop, part of the Spring Data project, applies the same fundamental principles of simplicity and productive programming that you would expect from any Spring project. Here are some examples of what you can do with Spring Hadoop:

  • Manage batches of data or run batch processes such as calculations or formatting with Spring Batch, and load the results into or out of Hadoop workflows.
  • Build integration patterns with Spring Integration that check a directory or FTP folder for new information, trigger a workflow, send an email, publish an AMQP message, write a file, continuously query Pivotal GemFire, or poll Twitter, all while interacting with Hadoop.
  • Use Spring Data to interact with data from Redis, MongoDB, Neo4j, Pivotal GemFire, any JDBC-compatible database, Couchbase, FuzzyDB, Elasticsearch, or Solr and move it into or out of Hadoop.
  • Develop a user interface or other business logic that starts a MapReduce job or moves data into the Hadoop Distributed File System (HDFS) as part of a general Spring Framework interaction.

An Example of Using Spring Batch with Spring Hadoop

With Spring Hadoop, we can use dependency injection to run Hadoop as a standard Java application. We can also create and configure Java apps for MapReduce, Streaming, Hive, Pig, Cascading, or HBase. Spring Batch flows can now include HDFS data operations and the execution of Hadoop jobs, and we can restart jobs, configure commits and rollbacks, partition data, and manage the overall job.
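All of the configuration shown in the rest of this example lives in a standard Spring application context file. As a minimal sketch, the enclosing file declares the batch and hadoop XML namespaces and points Spring Hadoop at the cluster (the fs.default.name URI below is an assumption for a local, single-node setup):

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:batch="http://www.springframework.org/schema/batch"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.springframework.org/schema/batch http://www.springframework.org/schema/batch/spring-batch.xsd
                           http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- Registers the Hadoop Configuration bean used by the hdp: elements below;
         the NameNode URI is an assumption for a local setup. -->
    <hdp:configuration>
        fs.default.name=hdfs://localhost:9000
    </hdp:configuration>

    <!-- the job, tasklet, and script definitions shown below go here -->

</beans>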

Let’s say we have an application that needs to pull data from 10,000 log files, load them into HDFS, and kick off a MapReduce job to count results across the files.

To start, we create an XML file (shown below) that defines the Spring Batch job flow in two steps: an import step that loads the data using Groovy, and a wordcount step that runs the MapReduce job. Each step is a tasklet. Steps can be sequential, as they are here, but they can also be conditional, split, concurrent, or programmatically determined in other scenarios; a conditional variant is sketched after the job definition.

<batch:job id="job1">
    <batch:step id="import" next="wordcount">
        <batch:tasklet ref="script-tasklet"/>
    </batch:step>
    <batch:step id="wordcount">
        <batch:tasklet ref="wordcount-tasklet"/>
    </batch:step>
</batch:job>
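For example, a conditional variant could route on the import step’s exit status instead of always moving to the next step. A sketch (the transitions are illustrative, not part of the original flow):

<batch:job id="job1-conditional">
    <batch:step id="import">
        <batch:tasklet ref="script-tasklet"/>
        <!-- proceed to wordcount only if the import completed cleanly -->
        <batch:next on="COMPLETED" to="wordcount"/>
        <!-- any other exit status fails the job -->
        <batch:fail on="*"/>
    </batch:step>
    <batch:step id="wordcount">
        <batch:tasklet ref="wordcount-tasklet"/>
    </batch:step>
</batch:job>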

The first tasklet runs the import by executing a Groovy script through Spring Hadoop’s script-tasklet support. The script clears the directories and then places a text file into HDFS as part of the input path; it could also work off of a database or an XML file. The script makes use of the predefined variable fsh, which is the embedded Filesystem Shell that Spring for Apache Hadoop provides. It also uses Spring’s Property Placeholder functionality so that the input and output paths can be configured externally to the application, for example using property files, as shown after the script. The syntax for variables recognized by Spring’s Property Placeholder is ${key:defaultValue}, so in this case /user/gutenberg/input/word and /user/gutenberg/output/word are the default input and output paths.

<hdp:script-tasklet id="script-tasklet">
    <hdp:script language="groovy">
        inputPath = "${wordcount.input.path:/user/gutenberg/input/word/}"
        outputPath = "${wordcount.output.path:/user/gutenberg/output/word/}"
        if (fsh.test(inputPath)) {
            fsh.rmr(inputPath)
        }
        if (fsh.test(outputPath)) {
            fsh.rmr(outputPath)
        }
        inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
        fsh.put(inputFile, inputPath)
    </hdp:script>
</hdp:script-tasklet>
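To override those defaults without touching the code, the application context can resolve the placeholders from an external file. A minimal sketch (the context namespace declaration and the hadoop.properties file name are assumptions):

<context:property-placeholder location="classpath:hadoop.properties"/>

<!-- hadoop.properties (illustrative contents):
     wordcount.input.path=/user/gutenberg/input/word/
     wordcount.output.path=/user/gutenberg/output/word/
-->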

The wordcount Hadoop tasklet runs next and uses elements from the Spring Hadoop XML schema. The tasklet connects Spring Batch to Hadoop by referencing the MapReduce job, whose definition sets the input path, output path, mapper class, and reducer class. When this runs, the data is pushed through the MapReduce job with the output arriving on the other end.

<hdp:job-tasklet id="wordcount-tasklet" job-ref="wordcount-job"/>

<hdp:job id="wordcount-job"
    input-path="${wordcount.input.path:/user/gutenberg/input/word/}"
    output-path="${wordcount.output.path:/user/gutenberg/output/word/}"
    mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
    reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>
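To actually launch job1, the context also needs Spring Batch’s job repository and launcher infrastructure. A minimal in-memory sketch, suitable for experimentation rather than production restartability:

<bean id="transactionManager"
      class="org.springframework.batch.support.transaction.ResourcelessTransactionManager"/>

<bean id="jobRepository"
      class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
    <property name="transactionManager" ref="transactionManager"/>
</bean>

<bean id="jobLauncher"
      class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
    <property name="jobRepository" ref="jobRepository"/>
</bean>

With these beans in place, the job can be started from any Spring entry point, for example Spring Batch’s CommandLineJobRunner or an injected JobLauncher.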

As you can see, the approaches offered by Spring Hadoop open up a new way to run Hadoop jobs from Spring applications. The original article provides additional links, source downloads, a diagram, and pointers to other Hadoop 101 articles.