Moving large amounts of data from HDFS to AWS

This post is aimed at advanced Big Data developers and data engineers.

Scenario:
A Hadoop cluster holding a large amount of data, stored on-premises or in another data center, needs to be migrated to a new Big Data architecture on Amazon AWS.
The data may include log files from web servers and application servers, as well as sensor data from home-automation projects.

Solution:
S3DistCp is well suited to this task. Added as a step in an EMR job flow, it can efficiently copy large amounts of data from HDFS to Amazon S3 (as in this scenario), between Amazon S3 buckets, or from Amazon S3 into HDFS where subsequent steps in your EMR cluster can process it.

1. Launch a small Amazon EMR cluster (a single node).

elastic-mapreduce --create --alive --instance-count 1 --instance-type m1.small --ami-version 2.4.1

2. Copy the following jars from the Amazon EMR master node (/home/hadoop/lib) to the lib directory of your local Hadoop installation path (for example: /usr/local/hadoop/lib). Depending on your Hadoop distribution, you may or may not already have these jars; the stock Apache Hadoop distribution does not include them.

/home/hadoop/lib/emr-s3distcp-1.0.jar
/home/hadoop/lib/aws-java-sdk-1.3.26.jar
/home/hadoop/lib/guava-13.0.1.jar
/home/hadoop/lib/gson-2.1.jar
/home/hadoop/lib/EmrMetrics-1.0.jar
/home/hadoop/lib/protobuf-java-2.4.1.jar
/home/hadoop/lib/httpcore-4.1.jar
/home/hadoop/lib/httpclient-4.1.1.jar
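Before moving on, it can help to confirm that every required jar actually landed in the local lib directory. A minimal sketch in Python (the jar names and lib path come from this step; the helper function itself is mine, not part of the original post):

```python
import os

# Jars copied from the EMR master node in step 2.
REQUIRED_JARS = [
    "emr-s3distcp-1.0.jar",
    "aws-java-sdk-1.3.26.jar",
    "guava-13.0.1.jar",
    "gson-2.1.jar",
    "EmrMetrics-1.0.jar",
    "protobuf-java-2.4.1.jar",
    "httpcore-4.1.jar",
    "httpclient-4.1.1.jar",
]

def missing_jars(lib_dir, jars=REQUIRED_JARS):
    """Return the jars from `jars` that are not present in lib_dir.

    Illustrative helper only; run it against your Hadoop lib
    directory, e.g. missing_jars("/usr/local/hadoop/lib").
    """
    present = set(os.listdir(lib_dir))
    return [j for j in jars if j not in present]
```

An empty return value means all S3DistCp dependencies are in place.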

3. Edit the core-site.xml file to insert your AWS credentials, then copy it to all of your Hadoop cluster nodes. After copying the file, no services or daemons need to be restarted for the change to take effect.

<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>YOUR_SECRETACCESSKEY</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>YOUR_ACCESSKEY</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>YOUR_SECRETACCESSKEY</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESSKEY</value>
</property>
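Since the same four properties must reach every node, generating the block once and distributing it is less error-prone than hand-editing each file. A small sketch (the property names are the ones Hadoop's s3/s3n filesystems read; the helper is mine, for illustration only):

```python
def s3_credential_properties(access_key, secret_key):
    """Render the four fs.s3/fs.s3n credential properties for core-site.xml.

    Illustrative helper, not part of the original post; paste the
    result inside <configuration> in core-site.xml.
    """
    props = []
    for scheme in ("fs.s3", "fs.s3n"):
        props.append((scheme + ".awsAccessKeyId", access_key))
        props.append((scheme + ".awsSecretAccessKey", secret_key))
    lines = []
    for name, value in props:
        lines.append("<property>")
        lines.append("  <name>%s</name>" % name)
        lines.append("  <value>%s</value>" % value)
        lines.append("</property>")
    return "\n".join(lines)
```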

4. Run S3DistCp as in the following example (replace HDFS_PATH, YOUR_S3_BUCKET, and PATH with your own values):

hadoop jar /usr/local/hadoop/lib/emr-s3distcp-1.0.jar -libjars /usr/local/hadoop/lib/gson-2.1.jar,/usr/local/hadoop/lib/guava-13.0.1.jar,/usr/local/hadoop/lib/aws-java-sdk-1.3.26.jar,/usr/local/hadoop/lib/emr-s3distcp-1.0.jar,/usr/local/hadoop/lib/EmrMetrics-1.0.jar,/usr/local/hadoop/lib/protobuf-java-2.4.1.jar,/usr/local/hadoop/lib/httpcore-4.1.jar,/usr/local/hadoop/lib/httpclient-4.1.1.jar --src HDFS_PATH --dest s3://YOUR_S3_BUCKET/PATH/ --disableMultipartUpload
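The long -libjars argument is easy to mistype. One way to assemble the invocation programmatically, as a sketch (the lib path and jar names come from step 2; the function name is mine):

```python
LIB_DIR = "/usr/local/hadoop/lib"

# Jar list from step 2, installed under LIB_DIR.
JARS = [
    "gson-2.1.jar",
    "guava-13.0.1.jar",
    "aws-java-sdk-1.3.26.jar",
    "emr-s3distcp-1.0.jar",
    "EmrMetrics-1.0.jar",
    "protobuf-java-2.4.1.jar",
    "httpcore-4.1.jar",
    "httpclient-4.1.1.jar",
]

def s3distcp_command(src, dest, lib_dir=LIB_DIR, jars=JARS):
    """Assemble the `hadoop jar` command line for S3DistCp.

    Illustrative helper only: it returns the command string; run it
    yourself (e.g. via subprocess or a shell) on the Hadoop master.
    """
    libjars = ",".join("%s/%s" % (lib_dir, j) for j in jars)
    return (
        "hadoop jar %s/emr-s3distcp-1.0.jar -libjars %s "
        "--src %s --dest %s --disableMultipartUpload"
        % (lib_dir, libjars, src, dest)
    )
```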

Credits: Parviz Deyhim, Databricks
s3distcp options: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html#UsingEMR_s3distcp.options
