
Hadoop essential bits

Posted on February 2, 2015 (updated February 13, 2015) by admin

Below are answers to some tricky HDFS architecture and MapReduce questions.

How are the HDFS Blocks replicated in Hadoop?
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are three copies of each data block: two replicas are stored on DataNodes in the same rack, and the third on a DataNode in a different rack.
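
As a minimal sketch of the per-file knobs mentioned above (assuming the standard org.apache.hadoop.fs.FileSystem API; the path is hypothetical), replication can be set at file creation time and changed later:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/events.log"); // hypothetical path

        // Create the file with an explicit replication factor of 2.
        // Overload: create(path, overwrite, bufferSize, replication, blockSize)
        fs.create(file, true, 4096, (short) 2, 128 * 1024 * 1024L).close();

        // Raise the replication factor of the existing file to 3;
        // the NameNode schedules the extra copy asynchronously.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```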

How does the NameNode handle data node failures?
The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode for a certain amount of time, it marks that DataNode as dead. Since its blocks are now under-replicated, the system begins re-replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates this replication from one DataNode to another; the data transfer happens directly between DataNodes, and the data never passes through the NameNode.
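
How long "a certain amount of time" is depends on two settings. A small sketch, assuming the Hadoop 2.x property names dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval and their usual defaults, shows how the effective dead-node timeout is derived:

```java
import org.apache.hadoop.conf.Configuration;

public class DeadNodeTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Heartbeat interval in seconds (default 3s) and the NameNode's
        // recheck interval in milliseconds (default 300000 ms = 5 minutes).
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);

        // HDFS marks a DataNode dead after roughly
        // 2 * recheck-interval + 10 * heartbeat-interval.
        long timeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
        System.out.println("DataNode considered dead after ~" + timeoutMs / 1000 + "s");
        // With the defaults this works out to 630s (10 minutes 30 seconds).
    }
}
```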

What is DistCp and how is it used?
DistCp (distributed copy) is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into the input of map tasks, each of which copies a partition of the files specified in the source list. Its MapReduce pedigree has endowed it with some quirks in both its semantics and execution.
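
For illustration, here is a minimal DistCp driver, assuming the Hadoop 2.x org.apache.hadoop.tools.DistCp Java API (the constructor-style DistCpOptions shown here was deprecated in later releases) and hypothetical cluster addresses:

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class DistCpExample {
    public static void main(String[] args) throws Exception {
        // Equivalent command line:
        //   hadoop distcp hdfs://nn1:8020/src hdfs://nn2:8020/dest
        DistCpOptions options = new DistCpOptions(
                Collections.singletonList(new Path("hdfs://nn1:8020/src")),
                new Path("hdfs://nn2:8020/dest"));

        // DistCp launches a MapReduce job whose map tasks each copy
        // a partition of the expanded source file list.
        DistCp distCp = new DistCp(new Configuration(), options);
        distCp.execute(); // blocks until the copy job finishes
    }
}
```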

What is HDFS Block size? How is it different from traditional file system block size?
In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size, and each block is replicated multiple times (three times by default), with replicas stored on different nodes. HDFS uses the local file system on each DataNode to store every HDFS block as a separate file. The HDFS block size cannot be directly compared with the traditional file system block size: a traditional block is a few kilobytes (e.g. 4 KB), whereas HDFS blocks are orders of magnitude larger to keep seek overhead and NameNode metadata low, and a file smaller than one HDFS block does not occupy a full block's worth of disk space.
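
To inspect these properties in practice, a short sketch (standard FileSystem API; the file path is hypothetical) reads a file's block size and the DataNodes holding each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/big.log")); // hypothetical

        // Block size is a per-file attribute, e.g. 134217728 bytes (128 MB).
        System.out.println("Block size: " + status.getBlockSize());

        // Each BlockLocation lists the DataNodes holding one block's replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```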

Where is the Mapper’s Intermediate data stored?
The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location that the Hadoop administrator can set up in the configuration. The intermediate data is cleaned up after the Hadoop job completes.
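
As a reference point, this tiny sketch, assuming the Hadoop 2.x property name mapreduce.cluster.local.dir (older releases used mapred.local.dir), prints the directories a node would use for intermediate map output:

```java
import org.apache.hadoop.conf.Configuration;

public class LocalDirExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Comma-separated list of local (non-HDFS) directories used for
        // intermediate map output; cleaned up when the job completes.
        String localDirs = conf.get("mapreduce.cluster.local.dir",
                "${hadoop.tmp.dir}/mapred/local"); // default from mapred-default.xml
        System.out.println("Intermediate data directories: " + localDirs);
    }
}
```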
