28 February 2014
PA’s Willem van Asperen, a data science specialist, is quoted in an article on Hadoop 2. The article looks at the capabilities of Hadoop and at the open-source technology stack for big data analysis, which has recently grown.
In the new version of Hadoop, MapReduce 2 is responsible for job processing, running on top of a new layer in the Hadoop stack, YARN (Yet Another Resource Negotiator), which handles resource management.
Willem comments that this reconfiguration means programmers can run multiple applications in Hadoop, including MapReduce for batch processing of data, all sharing the common resource management provided by YARN. He says that it will make a huge difference to programmers: “The old resource manager was tuned to batch jobs: you made sure all the data was available, you ran the job and downloaded the results.”
“With YARN, you’ve got a far more open and flexible application-programming interface. This means that it is now easy for other frameworks to use the resource-management layer - not just the batch processing of MapReduce, but a whole host of online and immediate-results frameworks are underway. They run on top of Hadoop, but give the user the immediate response that is vital to many use cases.”
Willem goes on to say that changes to the Hadoop Distributed File System (HDFS) – also included in Apache Hadoop 2 – provide new failover capabilities, for better availability of the stack. “All of a sudden, Hadoop has become a platform for fault-tolerant, resilient, online big-data analysis - it’s a big step forward,” he says.