This post is for advanced Big Data developers and data engineers.
Scenario:
A Hadoop cluster holding a large amount of data, stored on premises or in another data center, needs to be migrated to AWS as part of a new AWS Big Data architecture. The data may consist of log files from web servers and application servers, along with sensor data from home automation projects.
Solution:
S3DistCp is the ideal tool for copying the data to Amazon S3. Added as a step in a job flow, S3DistCp can efficiently copy large amounts of data from Amazon S3 into HDFS, where subsequent steps in your EMR cluster can process it; it can also copy data between Amazon S3 buckets or from HDFS to Amazon S3.
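For illustration, adding S3DistCp as a step to an existing job flow with the legacy EMR CLI looks roughly like the sketch below; the job flow ID and the source and destination paths are placeholders, not values from this walkthrough:

elastic-mapreduce --jobflow j-XXXXXXXXXXXXX \
  --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://YOUR_S3_BUCKET/logs/,--dest,hdfs:///output'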
1. Launch a small Amazon EMR cluster (a single node).
elastic-mapreduce --create --alive --instance-count 1 --instance-type m1.small --ami-version 2.4.1
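Once the cluster is up, you can log in to its master node to locate the jars needed in the next step. A minimal sketch, assuming a placeholder key pair and master public DNS (substitute your own values; "hadoop" is the standard login user on EMR nodes):

# Log in to the EMR master and list the jars under /home/hadoop/lib.
ssh -i ~/YOUR_KEY.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com
ls /home/hadoop/lib/*.jar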
2. Copy the following jars from the Amazon EMR master node (/home/hadoop/lib) to the /lib directory of your Hadoop installation path on your local Hadoop master node (for example: /usr/local/hadoop/lib). Depending on your Hadoop distribution, these jars may or may not already be present; the stock Apache Hadoop distribution does not include them.
/home/hadoop/lib/emr-s3distcp-1.0.jar
/home/hadoop/lib/aws-java-sdk-1.3.26.jar
/home/hadoop/lib/guava-13.0.1.jar
/home/hadoop/lib/gson-2.1.jar
/home/hadoop/lib/EmrMetrics-1.0.jar
/home/hadoop/lib/protobuf-java-2.4.1.jar
/home/hadoop/lib/httpcore-4.1.jar
/home/hadoop/lib/httpclient-4.1.1.jar
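The copy can also be scripted from the local Hadoop master node. A minimal sketch, assuming the same placeholder key pair and master DNS as above:

for jar in emr-s3distcp-1.0.jar aws-java-sdk-1.3.26.jar guava-13.0.1.jar \
           gson-2.1.jar EmrMetrics-1.0.jar protobuf-java-2.4.1.jar \
           httpcore-4.1.jar httpclient-4.1.1.jar; do
  # Pull each jar from the EMR master into the local Hadoop lib directory.
  scp -i ~/YOUR_KEY.pem \
      hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:/home/hadoop/lib/$jar \
      /usr/local/hadoop/lib/
done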
3. Edit the core-site.xml file to insert your AWS credentials (see the properties below), then copy it to all of your Hadoop cluster nodes. No services or daemons need to be restarted for the change to take effect.
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRETACCESSKEY</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESSKEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRETACCESSKEY</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESSKEY</value>
</property>
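Distributing the edited file can likewise be scripted. A minimal sketch, assuming your worker hostnames are listed in /usr/local/hadoop/conf/slaves (the paths are assumptions; adjust them to your installation):

for node in $(cat /usr/local/hadoop/conf/slaves); do
  # Push the updated core-site.xml to every node in the cluster.
  scp /usr/local/hadoop/conf/core-site.xml "$node":/usr/local/hadoop/conf/
done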
4. Run S3DistCp using the following example (replace HDFS_PATH, YOUR_S3_BUCKET, and PATH):

hadoop jar /usr/local/hadoop/lib/emr-s3distcp-1.0.jar \
  -libjars /usr/local/hadoop/lib/gson-2.1.jar,/usr/local/hadoop/lib/guava-13.0.1.jar,/usr/local/hadoop/lib/aws-java-sdk-1.3.26.jar,/usr/local/hadoop/lib/emr-s3distcp-1.0.jar,/usr/local/hadoop/lib/EmrMetrics-1.0.jar,/usr/local/hadoop/lib/protobuf-java-2.4.1.jar,/usr/local/hadoop/lib/httpcore-4.1.jar,/usr/local/hadoop/lib/httpclient-4.1.1.jar \
  --src HDFS_PATH --dest s3://YOUR_S3_BUCKET/PATH/ --disableMultipartUpload
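After the job completes, a quick sanity check is to compare the source listing against the destination. The paths below are the same placeholders as above; s3n:// is assumed here as the native S3 filesystem scheme in Apache Hadoop of this vintage:

hadoop fs -ls HDFS_PATH
hadoop fs -ls s3n://YOUR_S3_BUCKET/PATH/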
Credits: Parviz Deyhim, Databricks
s3distcp options: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html#UsingEMR_s3distcp.options