Hadoop is an open-source processing framework for distributed storage systems specific to managing data processing and data storage for big data applications. Part of the Hadoop ecosystem of technologies is its HDFS. The Hadoop Distributed File System (HDFS) is a scalable data storage system designed to provide high-performance data access within the Hadoop ecosystem
The main benefit of Hadoop is to efficiently store and process large data sets with significant advantages, listed here.
- Cost-effectiveness — Hadoop has two main cost advantages: first, data clusters consist of inexpensive commodity hardware. Second, Hadoop is open-source, so it is free to install, however, this cost may appear in other ways such as additional technical expertise.
- Storage of Large Data Sets — Hadoop can store unstructured and structured data sets of any size.
- System Resilience — Hadoop is made fault-tolerant through storage redundancy. Data redundancy is the practice of storing copies of data blocks in two or more places, not on the same device, so that if one copy cannot be accessed, another is available to access.
- Compatible and Portable — HDFS has been designed to be heterogeneously portable from one platform to another, and compatible with every major operating system.
- Streaming Data Access — Hadoop is designed to support high data speeds, perfect for accommodating the demands of streaming data.
HDFS is specifically designed to scale data storage across multiple physical devices organized into clusters. At the heart of HDFS is MapReduce, a tool that ingests data, breaks it down into separate blocks, and then distributes them accordingly to different nodes in various clusters. HDFS uses a primary NameNode, which tracks where those blocks are placed.
The nodes where these blocks are stored, called DataNodes, are built of inexpensive commodity hardware organized into a cluster. Each cluster usually is a single node. As well, blocks are also replicated across multiple nodes to improve redundancy and resilience using parallel processing.
The conjoined inner workings of the NameNode and DataNodes enables dynamic adaptation in server capacity. In real time, Hadoop can grow or shrink the number of nodes based on demand. This also creates a “self-healing” property within the system—as the NameNode discovers DataNode faults, it can assign tasks to other DataNodes with duplicate blocks.
The ability to split data amongst multiple nodes, and to replicate that data grants several features over traditional data centers.
- Fault Tolerance and Reliability — Through the use of data replication, Hadoop systems are made redundant. If faults are detected, then data copies stored on redundant nodes can be called up.
- High Availability (HA) — Stemming from data replication and fault tolerance is the practice of High Availability (HA), the goal of which is to make data and applications available with under 0.1% downtime.
- Scalability — The architecture style of Hadoop was designed to scale. As more data is generated and stored, more data clusters can be added and installed with no disruption.
- High Throughput — Parallel processing enhances the speed at which Hadoop can ingest and process data.
- Data Locality — DataNodes processes data where it resides, on the DataNode, rather than sending data to a computer to be processed, which significantly reduces network congestion, and increases throughput.
As with new technology, issues arise when similar terminology is used with seemingly similar technologies. HDFS vs. Cloud Object Storage is one of those cases, but HDFS and Cloud Object Storage are not necessarily comparable.
HDFS is a distributed file system for storing large data, within a file hierarchy. Object Storage, like Amazon’s S3, is a strategy for storing data as objects, and managing those objects with referent metadata, typically to store voluminous unstructured data. While they both deal with storing data in different storage architectures, the functionality of Hadoop makes it far more complex.
Another significant difference is that HDFS follows the premise of moving compute power to where data is, such as the DataNode, where processing can occur in order to reduce network congestion instead of sending data away to be computed. Computation is strictly not included in object storage, which only deals with storing data as objects.
Hadoop is a powerful platform for collecting, processing, analyzing and management storage and data. For scenarios involving big data sets, Hadoop has many applications.
- Website Analytics — Websites can generate tones of data, like traffic data, click-throughs, page visits, and digital marketing data. Designed for such things, Hadoop can store and analyze big data sets.
- Consumer Analytics — Consumer tracking is not new. It's a process that coordinates many data points to understand consumer behavior. Even if data is anonymized, it is still sizable, and in need of a system capable of understanding both structured and unstructured data. Consumer analytics can rely on disparate data like geographic location, and e-commerce history, to understand what may be most relevant to the consumer.
- Retail Analytics — Retail analytics is like consumer analytics, but in the backend. Today, in the retail world, omni-channel marketing is becoming the new business differentiator. For retailers, this means having a system that can ingest and dig through some of their biggest systems, including CRM, ERP, and the retail store databases. Together, they are being connected to provide the widest possible view of retail operations.
- Financial Analytics — The first applications of big data set systems, like Hadoop, is always finance. The pure mathematical nature of finance lends itself as an ideal case for Hadoop.
- Healthcare Management — Much like retail, the healthcare industry is interweaving its inventory, staff, and patient systems. An interworking of these systems helps to deliver better patient care, ensure that medicines and inventory are available, and helps coordinate the schedules of staff, patients, and equipment availability, enabling doctors and nurses to attend to more in need.
Hadoop applications are those designed to work within the Hadoop ecosystem, using the various services to solve problems using big data. Within the Hadoop ecosystem are four main components:
- Hadoop Common Library — Hadoop's common libraries and utilities that build the foundation for all other components.
- Hadoop Distributed File System — Hadoop’s java-based file distribution system designed to span multiple devices and easily and rapidly scale.
- Yet Another Resource Negotiator (YARN) — Hadoop’s main resource management component. Strictly this is the management component of MapReduce.
- MapReduce — MapReduce is the main processing module for big data sets using parallel, distributed algorithms on clusters. MapReduce is a two stage process, Map filters and sorts data, while Reduce summarizes that data.
To support these main components, several other components have also been included in the Hadoop ecosystem.
- Ambari — Provides functions for managing, configuring, and testing Hadoop services via a web interface.
- Flume — Provides collection, aggregation, and migration to HDFS features for large volumes of streaming data.
- HBase — A NoSQL database that runs on top of Hadoop, helpful for storing sparse data sets during operations.
- Hive — A database style programming language, like SQL, useful for presenting data in tables.
- Oozie — A Hadoop job scheduler.
- Pig — An alternative scripting language to MapReduce for HDFS, which can be compiled into MapReduce programs. MapReduce programs are written in Java.
- Solr — A powerful, fast, and scalable search tool for Hadoop.
- Spark — An open-source cluster computing framework with in-memory analytics that allows for faster processing.
- Sqoop — A migration tool that can translate relational database data into HDFS, and vice versa.
- Zookeeper — A tool for coordinating and synchronizing nodes and distributed processing.