Hadoop Ecosystem

Hadoop is not a single product, but rather a software family. Its common components consist of the following:

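  • Pig, a platform whose scripting language (Pig Latin) is used to quickly write MapReduce jobs that handle unstructured sources
  • Hive, a data warehouse layer used to impose structure on the data and query it with a SQL-like language (HiveQL)
  • HCatalog, a table and metadata layer used to provide interoperability between these internal systems
  • HBase, which is essentially a column-oriented database built on top of Hadoop
  • HDFS, the actual file system for Hadoop
  • Mahout, a machine-learning library
  • Bigtop, used for packaging Hadoop and its related components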

In short, Hive imposes structure on data stored in Hadoop so that it can be queried, while Pig makes it easy to process unstructured data.
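
Since Pig scripts are ultimately executed as MapReduce jobs, it may help to see the shape of such a job. The following is a minimal sketch in plain Python, not Hadoop or Pig code: it imitates the map, shuffle, and reduce phases for a word count over unstructured text lines. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, imitating what happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))
```

On a real cluster the input would be read from HDFS and the phases would run in parallel across nodes; here everything runs in one process, so the sketch shows only the data flow.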

Hadoop and Mongo


Amazon EMR Best Practices

Amazon EMR includes the following applications:

  • Ganglia
  • Hadoop
  • HBase
  • HCatalog
  • Hive
  • Hue
  • Mahout
  • Oozie
  • Phoenix
  • Pig
  • Presto
  • Spark
  • Sqoop
  • Tez
  • Zeppelin
  • ZooKeeper
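
Applications from this list are selected by name when a cluster is created. The sketch below only builds a request dictionary shaped like the keyword arguments of boto3's EMR run_job_flow call; it does not contact AWS. The helper name build_emr_request, the cluster name, the release label, the instance type, and the instance count are all assumptions for illustration, and which applications are available depends on the EMR release you choose.

```python
def build_emr_request(name, release_label, applications,
                      instance_type="m5.xlarge", instance_count=3):
    """Build keyword arguments shaped like an EMR run_job_flow request.

    Hypothetical helper: every default value here is an assumption.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Each application is requested by name, e.g. {"Name": "Hive"}.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names created by EMR; your account may use others.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request(
    "example-cluster",   # assumed cluster name
    "emr-5.36.0",        # assumed release label
    ["Hadoop", "Hive", "Pig", "Spark"],
)
```

With credentials configured, a dictionary like this could be passed as boto3.client("emr").run_job_flow(**request); keeping the request construction separate makes it easy to inspect before anything is launched.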