What is PostgreSQL?

18th June 2020

Vector Assets in Modern Web Development

25th June 2020

What is Hadoop?

22nd June 2020

Tags

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.

The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance. Developed by Doug Cutting and Michael J. Cafarella, Hadoop uses the MapReduce programming model for faster storage and retrieval of data from its nodes. The framework is managed by Apache Software Foundation and is licensed under the Apache License 2.0.

As the World Wide Web grew in the late 1900s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results were returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).

Development Company Junagadh

One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting and Mike Cafarella. They wanted to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept – storing and processing data in a distributed, automated way so that relevant web search results could be returned faster

In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided – the web crawler portion remained as Nutch and the distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.

How Hadoop Improves on Traditional Databases

Hadoop solves two key challenges with traditional databases:

1. Capacity: Hadoop stores large volumes of data.

By using a distributed file system called an HDFS (Hadoop Distributed File System), the data is split into chunks and saved across clusters of commodity servers. As these commodity servers are built with simple hardware configurations, these are economical and easily scalable as the data grows.

2. Speed: Hadoop stores and retrieves data faster.

Hadoop uses the MapReduce functional programming model to perform parallel processing across data sets. So, when a query is sent to the database, instead of handling data sequentially, tasks are split and concurrently run across distributed servers. Finally, the output of all tasks is collated and sent back to the application, drastically improving the processing speed.

5 Benefits of Hadoop for Big Data

For big data and analytics, Hadoop is a life saver. Data gathered about people, processes, objects, tools, etc. is useful only when meaningful patterns emerge that, in-turn, result in better decisions. Hadoop helps overcome the challenge of the vastness of big data:

Resilience — Data stored in any node is also replicated in other nodes of the cluster. This ensures fault tolerance. If one node goes down, there is always a backup of the data available in the cluster.
Scalability — Unlike traditional systems that have a limitation on data storage, Hadoop is scalable because it operates in a distributed environment. As the need arises, the setup can be easily expanded to include more servers that can store up to multiple petabytes of data.
Low cost — As Hadoop is an open-source framework, with no license to be procured, the costs are significantly lower compared to relational database systems. The use of inexpensive commodity hardware also works in its favor to keep the solution economical.
Speed — Hadoop’s distributed file system, concurrent processing, and the MapReduce model enable running complex queries in a matter of seconds.
Data diversity — HDFS has the capability to store different data formats such as unstructured (e.g. videos), semi-structured (e.g. XML files), and structured. While storing data, it is not required to validate against a predefined schema. Rather, the data can be dumped in any format. Later, when retrieved, data is parsed and fitted into any schema as needed. This gives the flexibility to derive different insights using the same data.

The Hadoop Ecosystem: Core Components

Hadoop is not just one application, rather it is a platform with various integral components that enable distributed data storage and processing. These components together form the Hadoop ecosystem.

Some of these are core components, which form the foundation of the framework, while some are supplementary components that bring add-on functionalities into the Hadoop world.

The core components of Hadoop are:

HDFS: Maintaining the Distributed File System

HDFS is the pillar of Hadoop that maintains the distributed file system. It makes it possible to store and replicate data across multiple servers.

HDFS has a NameNode and DataNode. DataNodes are the commodity servers where the data is actually stored. The NameNode, on the other hand, contains metadata with information on the data stored in the different nodes. The application only interacts with the NameNode, which communicates with data nodes as required.

YARN: Yet Another Resource Negotiator

YARN stands for Yet Another Resource Negotiator. It manages and schedules the resources, and decides what should happen in each data node. The central master node that manages all processing requests is called the Resource Manager. The Resource Manager interacts with Node Managers; every slave datanode has its own Node Manager to execute tasks.

MapReduce

MapReduce is a programming model that was first used by Google for indexing its search operations. It is the logic used to split data into smaller sets. It works on the basis of two functions — Map() and Reduce() — that parse the data in a quick and efficient manner.

First, the Map function groups, filters, and sorts multiple data sets in parallel to produce tuples (key, value pairs). Then, the Reduce function aggregates the data from these tuples to produce the desired output.

The Hadoop Ecosystem: Supplementary Components

The following are a few supplementary components that are extensively used in the Hadoop ecosystem.

Hive: Data Warehousing

Hive is a data warehousing system that helps to query large datasets in the HDFS. Before Hive, developers were faced with the challenge of creating complex MapReduce jobs to query the Hadoop data. Hive uses HQL (Hive Query Language), which resembles the syntax of SQL. Since most developers come from a SQL background, Hive is easier to get on-board.

The advantage of Hive is that a JDBC/ODBC driver acts as an interface between the application and the HDFS. It exposes the Hadoop file system as tables, converts HQL into MapReduce jobs, and vice-versa. So while the developers and database administrators gain the benefit of batch processing large datasets, they can use simple, familiar queries to achieve that. Originally developed by the Facebook team, Hive is now an open source technology.

Pig: Reduce MapReduce Functions

Pig, initially developed by Yahoo!, is similar to Hive in that it eliminates the need to create MapReduce functions to query the HDFS. Similar to HQL, the language used — here, called “Pig Latin” — is closer to SQL. “Pig Latin” is a high-level data flow language layer on top of MapReduce.

Pig also has a runtime environment that interfaces with HDFS. Scripts in languages such as Java or Python can also be embedded inside Pig.

Hive Versus Pig

Although Pig and Hive have similar functions, one can be more effective than the other in different scenarios.

Pig is useful in the data preparation stage, as it can perform complex joins and queries easily. It also works well with different data formats, including semi-structured and unstructured. Pig Latin is closer to SQL but also varies from SQL enough for it to have a learning curve.

Hive, however, works well with structured data and is therefore more effective during data warehousing. It’s used in the server side of the cluster.

Researchers and programmers tend to use Pig on the client side of a cluster, whereas business intelligence users such as data analysts find Hive as the right fit.

Flume: Big Data Ingestion

Flume is a big data ingestion tool that acts as a courier service between multiple data sources and the HDFS. It collects, aggregates, and sends huge amounts of streaming data (e.g. log files, events) generated by applications such as social media sites, IoT apps, and ecommerce portals into the HDFS.

Development Company Junagadh

Flume is feature-rich, it:

Has a distributed architecture.
Ensures reliable data transfer.
Is fault-tolerant.
Has the flexibility to collect data in batches or real-time.
Can be scaled horizontally to handle more traffic, as needed.

Data sources communicate with Flume agents — every agent has a source, channel, and a sink. The source collects data from the sender, the channel temporarily stores the data, and finally, the sink transfers data to the destination, which is a Hadoop server.

Sqoop: Data Ingestion for Relational Databases

Sqoop (“SQL,” to Hadoop) is another data ingestion tool like Flume. While Flume works on unstructured or semi-structured data, Sqoop is used to export data from and import data into relational databases. As most enterprise data is stored in relational databases, Sqoop is used to import that data into Hadoop for analysts to examine.

Database admins and developers can use a simple command line interface to export and import data. Sqoop converts these commands to MapReduce format and sends them to the HDFS using YARN. Sqoop is also fault-tolerant and performs concurrent operations like Flume.

Zookeeper: Coordination of Distributed Applications

Zookeeper is a service that coordinates distributed applications. In the Hadoop framework, it acts as an admin tool with a centralized registry that has information about the cluster of distributed servers it manages. Some of its key functions are:

Maintaining configuration information (shared state of configuration data)
Naming service (assignment of name to each server)
Synchronization service (handles deadlocks, race condition, and data inconsistency)
Leader election (elects a leader among the servers through consensus)

The cluster of servers that the Zookeeper service runs on is called an “ensemble.” The ensemble elects a leader among the group, with the rest behaving as followers. All write-operations from clients need to be routed through the leader, whereas read operations can go directly to any server.

Zookeeper provides high reliability and resilience through fail-safe synchronization, atomicity, and serialization of messages.

Kafka: Faster Data Transfers

Kafka is a distributed publish-subscribe messaging system that is often used with Hadoop for faster data transfers. A Kafka cluster consists of a group of servers that act as an intermediary between producers and consumers.

In the context of big data, an example of a producer could be a sensor gathering temperature data to relay back to the server. Consumers are the Hadoop servers. The producers publish message on a topic and the consumers pull messages by listening to the topic.

A single topic can be split further into partitions. All messages with the same key arrive to a specific partition. A consumer can listen to one or more partitions.

By grouping messages under one key and getting a consumer to cater to specific partitions, many consumers can listen on the same topic at the same time. Thus, a topic is parallelized, increasing the throughput of the system. Kafka is widely adopted for its speed, scalability, and robust replication.

HBase: Non-Relational Database

HBase is a column-oriented, non-relational database that sits on top of HDFS. One of the challenges with HDFS is that it can only do batch processing. So for simple interactive queries, data still has to be processed in batches, leading to high latency.

HBase solves this challenge by allowing queries for single rows across huge tables with low latency. It achieves this by internally using hash tables. It is modelled along the lines of Google BigTable that helps access the Google File System (GFS).

HBase is scalable, has failure support when a node goes down, and is good with unstructured as well as semi-structured data. Hence, it is ideal for querying big data stores for analytical purposes.

What is Hadoop?

What is PostgreSQL?

Vector Assets in Modern Web Development

What is PostgreSQL?

Vector Assets in Modern Web Development

HDFS: Maintaining the Distributed File System

YARN: Yet Another Resource Negotiator

MapReduce

The Hadoop Ecosystem: Supplementary Components

Hive: Data Warehousing

Pig: Reduce MapReduce Functions

Hive Versus Pig

Flume: Big Data Ingestion

Sqoop: Data Ingestion for Relational Databases

Zookeeper: Coordination of Distributed Applications

Kafka: Faster Data Transfers

HBase: Non-Relational Database

Related posts

Comprehensive Guide: Step-by-Step Approach to Safely Upgrade React Native in Existing Projects

Why continuous feedback is vital for DevOps workflow success

Angular with NativeScript: Creating the Blackout Lighting Console