
Moving large amounts of data from HDFS to AWS

2 November 2014

This post is for advanced Big Data developers and data engineers.

The scenario: a Hadoop cluster holds a large amount of data, stored locally or in another data center, and that data needs to be migrated to Amazon AWS as part of a new AWS Big Data architecture.
The data may be log files from web servers and application servers, plus some sensor data from home automation projects.

S3DistCp is well suited to copying the data to Amazon S3. Added as a step in a job flow, S3DistCp can efficiently copy large amounts of data from Amazon S3 into HDFS, where subsequent steps in your EMR cluster can process it. It can also copy data between Amazon S3 buckets or from HDFS to Amazon S3, which is the direction used here.

1. Launch a small Amazon EMR cluster (a single node).

elastic-mapreduce --create --alive --instance-count 1 --instance-type m1.small --ami-version 2.4.1

2. Copy the following jars from the Amazon EMR master node (/home/hadoop/lib) to the /lib directory under the Hadoop installation path on your local Hadoop master node (for example, /usr/local/hadoop/lib). Depending on your Hadoop installation, you may or may not already have these jars; the Apache Hadoop distribution does not include them.
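One way to do the copy is with scp from the local Hadoop master node. The sketch below only prints the scp commands so they can be reviewed before running. EMR_MASTER and KEY_FILE are hypothetical placeholders, the login user hadoop and the lower-case /home/hadoop/lib path are assumptions about the EMR master node, and the jar names are only those visible in the step 4 command, which may not be the complete set.

```shell
#!/usr/bin/env bash
# Sketch only: prints scp commands instead of running them.
# EMR_MASTER and KEY_FILE are hypothetical placeholders; replace them with
# your EMR master's public DNS name and your EC2 key pair file.
EMR_MASTER="${EMR_MASTER:-EMR_MASTER_PUBLIC_DNS}"
KEY_FILE="${KEY_FILE:-/path/to/your-key.pem}"
REMOTE_LIB="/home/hadoop/lib"       # assumed jar location on the EMR master
LOCAL_LIB="/usr/local/hadoop/lib"   # example local path used in this post

# Print one scp command per jar name given as an argument.
print_fetch_commands() {
  local jar
  for jar in "$@"; do
    echo "scp -i $KEY_FILE hadoop@$EMR_MASTER:$REMOTE_LIB/$jar $LOCAL_LIB/"
  done
}

# Jar names taken from the step 4 command; the list may be incomplete.
print_fetch_commands \
  EmrMetrics-1.0.jar \
  protobuf-java-2.4.1.jar \
  httpcore-4.1.jar \
  httpclient-4.1.1.jar
```

Once the printed commands look right, run them by hand or pipe the output into sh.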


3. Edit the core-site.xml file to insert your AWS credentials. Then copy the core-site.xml config file to all of your Hadoop cluster nodes. After copying the file, it is unnecessary to restart any services or daemons for the change to take effect.
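As a rough illustration of what the credential entries can look like, the sketch below prints property blocks to paste inside the configuration element of core-site.xml. It assumes the stock Hadoop property names for s3:// and s3n:// URIs (fs.s3.* and fs.s3n.*); print_s3_credential_properties is a hypothetical helper and the key values are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: prints <property> entries for core-site.xml.
# Assumption: fs.s3.* and fs.s3n.* are the credential properties your
# Hadoop version reads for s3:// and s3n:// URIs.

print_s3_credential_properties() {
  local access_key="$1" secret_key="$2" scheme
  for scheme in s3 s3n; do
    cat <<EOF
<property>
  <name>fs.${scheme}.awsAccessKeyId</name>
  <value>${access_key}</value>
</property>
<property>
  <name>fs.${scheme}.awsSecretAccessKey</name>
  <value>${secret_key}</value>
</property>
EOF
  done
}

# Placeholder values; substitute your own keys.
print_s3_credential_properties "YOUR_ACCESS_KEY_ID" "YOUR_SECRET_ACCESS_KEY"
```

Because the file then holds a secret key in plain text, it may be worth restricting its permissions on every node it is copied to.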


4. Run s3distcp using the following example (modify HDFS_PATH, YOUR_S3_BUCKET and PATH):

s3distcp-1.0.jar,/usr/local/hadoop/lib/EmrMetrics-1.0.jar,/usr/local/hadoop/lib/protobuf-java-2.4.1.jar,/usr/local/hadoop/lib/httpcore-4.1.jar,/usr/local/hadoop/lib/httpclient-4.1.1.jar --src HDFS_PATH --dest s3://YOUR_S3_BUCKET/PATH/ --disableMultipartUpload
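The comma-separated list of jar paths in the command above is easy to mistype. A minimal sketch of building it from a directory and a set of jar names follows; join_libjars is a hypothetical helper, not part of Hadoop or S3DistCp, and the jar names are just the ones visible in the command. The list is presumably the value given to Hadoop's generic -libjars option.

```shell
#!/usr/bin/env bash
# Sketch only: join_libjars is a hypothetical helper.
# Usage: join_libjars DIR JAR [JAR...]
join_libjars() {
  local dir="$1" list="" jar
  shift
  for jar in "$@"; do
    # Warn on stderr about any jar that step 2 did not copy over.
    [ -f "$dir/$jar" ] || echo "warning: $dir/$jar not found" >&2
    # Append with a comma separator, except before the first entry.
    list="${list:+$list,}$dir/$jar"
  done
  echo "$list"
}

join_libjars /usr/local/hadoop/lib \
  EmrMetrics-1.0.jar protobuf-java-2.4.1.jar httpcore-4.1.jar httpclient-4.1.1.jar
```

The warnings go to stderr, so the joined list on stdout can still be captured in a variable and reused when assembling the full command.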

Credits: Parviz Deyhim, Databricks
s3distcp options: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html#UsingEMR_s3distcp.options


