Hadoop, Let’s go Native: Part 2: Optimizing Hadoop with Native Task Framework

Avik Dey is Director of Worldwide Big Data Engineering at Intel. Avik and his team work on research and development for Apache Hadoop and make their work available to Intel customers through the Intel® Distribution for Apache Hadoop* software. Avik’s focus is on making Apache Hadoop enterprise-class software that works and plays well in today’s data center. Avik’s roots in Hadoop go back to his days as the Lead Program Manager for the Hadoop stack at Yahoo!, where he was responsible for managing delivery of Hadoop as a service to over 1,000 users hosted in over two dozen clusters, large and small, running more than 40,000 nodes. Avik was also the Program Chair for Hadoop Summit 2011. Prior to Intel, Avik worked at eBay and Yahoo!

This is Part 2 of a 2-part series. This part discusses how we used the native task framework to optimize Hadoop processing. Part 1 covered the objectives of bringing the native task framework to IDH.

The following graphic is an overview of the NativeTask framework:

Figure 1: NativeTask Framework

The details of the framework itself will be the topic of a later blog post. Let’s take a quick look at Phase 1 of our optimization work, where we focus on improving the Map task by delegating the Map output collector to a native Map output collector implemented using the NativeTask framework.

To handle this delegation, we introduced a task delegation interface at the start of MapTask in the Hadoop 1.x code line. A delegator handles task execution if it is configured to do so. In our case, Figure 3 shows the native Map output collector over JNI. Because the output collector is at the core of Map task execution, making it more efficient translates directly into higher throughput for the Map task.

Those of you who are familiar with Hadoop Streaming or Pipes will notice some similarities with the approach we take here. The key difference is the data exchange mechanism: Streaming uses stdin and stdout, while Pipes uses sockets. We use JNI for communication and data exchange, with synchronized block mode to increase performance.

Use NativeTask in MapReduce by adding the new MapReduce configuration in JobConf or setting it for the job at run time as follows:

    mapreduce.map.output.collector.delegator.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator

Use NativeTask in Hive by setting the MapReduce delegator class configuration in the Hive shell as follows:

    hive> SET mapreduce.map.output.collector.delegator.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator;

You don’t need to change anything else in the configuration or code to optimize using NativeTask.

Note that this is only Phase 1 of the optimization process. The NativeTask framework can replace further parts of the MapReduce execution engine with native tasks while maintaining compatibility with the current framework.

The following graph compares the performance of jobs optimized with NativeTask against the same jobs running on vanilla Apache Hadoop. This data was collected with an input data size of 1 TB, and the only configuration change made between runs was to enable or disable the native tasks.

You can see from the DFSIO benchmarks that there is little or no improvement when the job is purely I/O bound. At the other extreme, when the Map stage accounts for almost all of the job’s run time, as is the case with Wordcount, the Map-stage efficiency gains aren’t diluted by inefficiencies in subsequent stages of the job. You see more typical results with benchmarks such as PageRank and Hive-Aggregation, so don’t be surprised when your Hive job runs much quicker than it does today.

The NativeTask implementation is currently available in IDH 2.5.1 as a beta version, and you can download it at http://hadoop.intel.com. We have performed extensive validation of this implementation to assure its reliability and compatibility.
We are working with customers and partners to continue validating NativeTask in the field. An early version of the code is available at https://github.com/intel-hadoop/nativetask. We will refresh that code with the latest version, make it available under the Apache v2.0 license, and contribute it back to the Apache Hadoop project later this year.

One of my engineers will be at IEEE Big Data 2013 in the BPOE workshop (http://prof.ict.ac.cn/bpoe2013) on October 7, 2013, if you’re interested in finding out more about NativeTask. The paper is titled “NativeTask: A Hadoop Compatible Framework for High Performance.”

To find out more about the Intel® Distribution for Apache Hadoop* (IDH), please check out hadoop.intel.com.

~avik
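
A postscript for readers who want to picture the delegation step described above, where a delegator takes over only if it is configured to do so. The following is a minimal sketch in plain Java, not the NativeTask code itself. Only the property key comes from this post. The `OutputCollector` interface, both collector classes, and the use of `java.util.Properties` in place of Hadoop’s `JobConf` are hypothetical stand-ins chosen so the example runs without a Hadoop installation.

```java
import java.util.Properties;

class DelegationSketch {
    // Property key as given in the post; every other name here is hypothetical.
    static final String DELEGATOR_KEY =
            "mapreduce.map.output.collector.delegator.class";

    /** Hypothetical stand-in for a Map output collector. */
    interface OutputCollector {
        String collect(String key, String value);
    }

    /** Default path, used when no delegator class is configured. */
    public static class DefaultCollector implements OutputCollector {
        public String collect(String key, String value) {
            return "default:" + key + "=" + value;
        }
    }

    /** Hypothetical delegator; a real one would pass records to native code over JNI. */
    public static class FakeNativeDelegator implements OutputCollector {
        public String collect(String key, String value) {
            return "delegated:" + key + "=" + value;
        }
    }

    /** Returns the configured delegator if one is set, otherwise the default collector. */
    static OutputCollector createCollector(Properties conf) {
        String className = conf.getProperty(DELEGATOR_KEY);
        if (className == null || className.isEmpty()) {
            return new DefaultCollector();
        }
        try {
            // Load the configured class by name and instantiate it reflectively.
            Class<?> cls = Class.forName(className);
            return (OutputCollector) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create delegator " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // No delegator configured: the default collector handles the record.
        System.out.println(createCollector(conf).collect("word", "1"));

        // Delegator configured: the same call site now hands off to it.
        conf.setProperty(DELEGATOR_KEY, FakeNativeDelegator.class.getName());
        System.out.println(createCollector(conf).collect("word", "1"));
    }
}
```

The point of the sketch is that the calling code stays the same in both cases and a single configuration property decides which implementation runs, which is why enabling NativeTask requires no other change to the job.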