
Moving large amounts of data from HDFS to AWS

Posted on November 2, 2014 (updated November 17, 2014) by admin

This post is aimed at advanced big data developers and data engineers.

Scenario:
A Hadoop cluster holds a large amount of data, stored locally or in another data center, that needs to be migrated to a new big data architecture on Amazon AWS. The data may include log files from web and application servers, as well as sensor data from home automation projects.

Solution:
S3DistCp is the ideal tool for copying the data to Amazon S3. By adding S3DistCp as a step in a job flow, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where subsequent steps in your EMR cluster can process it; you can also use it to copy data between Amazon S3 buckets or, as in this scenario, from HDFS to Amazon S3. The steps below walk through the HDFS-to-S3 case.

1. Launch a small Amazon EMR cluster (a single node).

elastic-mapreduce --create --alive --instance-count 1 --instance-type m1.small --ami-version 2.4.1
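
Note that the elastic-mapreduce Ruby CLI shown above has since been deprecated. A roughly equivalent launch with the current AWS CLI might look like the following sketch; the cluster name, release label, instance type, and key pair are placeholder assumptions, not values from the original setup:

# Sketch only: a modern AWS CLI rough equivalent of the legacy command above.
# --name, --release-label, the instance type, and KeyName are assumed placeholders.
aws emr create-cluster \
  --name "s3distcp-copy" \
  --release-label emr-5.36.0 \
  --instance-type m5.xlarge \
  --instance-count 1 \
  --ec2-attributes KeyName=YOUR_KEY_PAIR \
  --use-default-roles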

2. Copy the following jars from the Amazon EMR master node (/home/hadoop/lib) to the /lib directory of your local Hadoop installation path (for example: /usr/local/hadoop/lib). Depending on your Hadoop distribution, you may or may not already have these jars; the stock Apache Hadoop distribution does not include them. One way to script the copy over SSH is sketched after the list.

/home/hadoop/lib/emr-s3distcp-1.0.jar
/home/hadoop/lib/aws-java-sdk-1.3.26.jar
/home/hadoop/lib/guava-13.0.1.jar
/home/hadoop/lib/gson-2.1.jar
/home/hadoop/lib/EmrMetrics-1.0.jar
/home/hadoop/lib/protobuf-java-2.4.1.jar
/home/hadoop/lib/httpcore-4.1.jar
/home/hadoop/lib/httpclient-4.1.1.jar
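
If your local Hadoop master can reach the EMR master over SSH, the copy itself can be scripted. A minimal sketch, assuming the standard hadoop login user; EMR_MASTER_DNS and the key path are placeholders, not values from the original post:

# Sketch: pull the required jars from the EMR master to the local Hadoop lib directory.
# EMR_MASTER_DNS and the .pem path are placeholders; adjust for your environment.
for jar in emr-s3distcp-1.0.jar aws-java-sdk-1.3.26.jar guava-13.0.1.jar \
           gson-2.1.jar EmrMetrics-1.0.jar protobuf-java-2.4.1.jar \
           httpcore-4.1.jar httpclient-4.1.1.jar; do
  scp -i ~/.ssh/YOUR_KEY_PAIR.pem \
      hadoop@EMR_MASTER_DNS:/home/hadoop/lib/"$jar" /usr/local/hadoop/lib/
done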

3. Edit the core-site.xml file to insert your AWS credentials, then copy it to all of your Hadoop cluster nodes (a scripted push is sketched after the snippet). After copying the file, there is no need to restart any services or daemons for the change to take effect.

<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>YOUR_SECRETACCESSKEY</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>YOUR_ACCESSKEY</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>YOUR_SECRETACCESSKEY</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESSKEY</value>
</property>
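
Copying the file to every node can also be scripted. A minimal sketch, assuming passwordless SSH between nodes and a Hadoop 1.x-style layout where the node list lives in conf/slaves; both paths are assumptions, so adjust them to your installation:

# Sketch: push the updated core-site.xml to every node listed in the slaves file.
# The conf/ paths assume a Hadoop 1.x layout; newer versions use etc/hadoop/ instead.
for node in $(cat /usr/local/hadoop/conf/slaves); do
  scp /usr/local/hadoop/conf/core-site.xml "$node":/usr/local/hadoop/conf/
done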

4. Run S3DistCp using the following example (replace HDFS_PATH, YOUR_S3_BUCKET, and PATH):

hadoop jar /usr/local/hadoop/lib/emr-s3distcp-1.0.jar -libjars /usr/local/hadoop/lib/gson-2.1.jar,/usr/local/hadoop/lib/guava-13.0.1.jar,/usr/local/hadoop/lib/aws-java-sdk-1.3.26.jar,/usr/local/hadoop/lib/emr-s3distcp-1.0.jar,/usr/local/hadoop/lib/EmrMetrics-1.0.jar,/usr/local/hadoop/lib/protobuf-java-2.4.1.jar,/usr/local/hadoop/lib/httpcore-4.1.jar,/usr/local/hadoop/lib/httpclient-4.1.1.jar --src HDFS_PATH --dest s3://YOUR_S3_BUCKET/PATH/ --disableMultipartUpload
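
Once the job completes, a quick way to sanity-check the transfer is to list the destination through the same S3 filesystem you just configured:

# Verify the copied data is visible in the destination bucket
hadoop fs -ls s3://YOUR_S3_BUCKET/PATH/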

Credits: Parviz Deyhim, Databricks
s3distcp options: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html#UsingEMR_s3distcp.options

  • big data
  • bigdata
  • hadoop