Archive for August 2013

The other day I had 110GB of compressed log files that I wanted to import into Impala (Cloudera). Impala does not currently support compressed files, so I had to decompress them all. I wrote this handy script for the job and thought you might find it useful. I mounted the S3 bucket using s3fs, which I covered in my earlier post.

#!/bin/bash
# Utils
elapsed()
{
   # Run the given command and report how long it took
   (( seconds = SECONDS ))
   "$@"
   (( seconds = SECONDS - seconds ))
   (( etime_seconds = seconds % 60 ))
   (( etime_minutes = seconds / 60 % 60 ))
   (( etime_hours   = seconds / 3600 ))

   echo "Elapsed time: ${etime_hours}h ${etime_minutes}m ${etime_seconds}s"
}

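convert()
{
# The compressed file name is passed as the first argument
FILE="$1"

# Remove the .gz extension from the compressed file name
# (UFILE stays global so the caller can print it)
UFILE="${FILE%.gz}"

# Decompress the gz file and write the result back to HDFS;
# stop here if any stage of the pipeline fails so the original is kept
set -o pipefail
sudo -u hdfs hdfs dfs -cat "/user/hdfs/oms/logs/$FILE" |
sudo -u hdfs gunzip -d | sudo -u hdfs hdfs dfs -put - "/user/hdfs/oms/logs/$UFILE" || return 1

# Discard original gz file
sudo -u hdfs hdfs dfs -rm -skipTrash "/user/hdfs/oms/logs/$FILE"
sudo -u hdfs hdfs dfs -ls "/user/hdfs/oms/logs/$UFILE"
}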

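# Loop over the local .gz files; a glob avoids parsing the output of ls
for path in /media/ephemeral0/logs/*.gz
  do
    [ -e "$path" ] || continue
    FILE=$(basename "$path")
    elapsed convert "$FILE"
    echo "Decompressed $FILE to $UFILE on hdfs"
  done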

exit 0
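
The two bits of string and arithmetic handling in the script are easy to try in isolation. Here is a minimal sketch; the function names strip_gz and format_duration and the file name access_log.gz are made up for illustration and are not part of the script above.

```shell
#!/bin/bash
# Strip a trailing .gz suffix with parameter expansion (no subshell or echo needed).
strip_gz() {
    printf '%s\n' "${1%.gz}"
}

# Format a number of seconds as "Hh Mm Ss", using the same arithmetic as elapsed().
format_duration() {
    local total=$1
    local hours=$(( total / 3600 ))
    local minutes=$(( total / 60 % 60 ))
    local seconds=$(( total % 60 ))
    printf '%dh %dm %ds\n' "$hours" "$minutes" "$seconds"
}

strip_gz "access_log.gz"   # hypothetical file name
format_duration 3725       # 3725 seconds = 1h 2m 5s
```

Unlike slicing off the last three characters, the ${1%.gz} form leaves a name untouched when it does not end in .gz.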

s3fs is an open-source project that lets you mount your S3 storage locally, so you can work with your files at the file-system level. I use it to mount S3 buckets on my EC2 instances. Below, I go through the installation steps and document some of the problems I ran into, along with their workarounds.

Download the s3fs source code to your EC2 instance and decompress it:

[ec2-user@ip-10-xx-xx-xxx ~]$ wget http://s3fs.googlecode.com/files/s3fs-1.63.tar.gz
[ec2-user@ip-10-xx-xx-xxx ~]$ tar -xvf s3fs-1.63.tar.gz
--Make sure your libraries are installed/up-to-date
[ec2-user@ip-10-xx-xx-xxx ~]$ sudo yum install gcc libstdc++-devel gcc-c++ fuse fuse-devel curl-devel libxml2-devel openssl-devel mailcap
[ec2-user@ip-10-xx-xx-xxx ~]$ cd s3fs-1.63
[ec2-user@ip-10-xx-xx-xxx ~]$ ./configure --prefix=/usr

At this point you might get the following error, indicating that s3fs requires a newer version of FUSE (http://fuse.sourceforge.net/) than the one installed.

configure: error: Package requirements (fuse >= 2.8.4 libcurl >= 7.0 libxml-2.0 >= 2.6 libcrypto >= 0.9) were not met:
Requested 'fuse >= 2.8.4' but version of fuse is 2.8.3
Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

Alternatively, you may set the environment variables DEPS_CFLAGS
and DEPS_LIBS to avoid the need to call pkg-config.
See the pkg-config man page for more details.

To upgrade FUSE, follow the steps posted at http://fuse.sourceforge.net/:

[ec2-user@ip-10-xx-xx-xxx ~]$ wget http://downloads.sourceforge.net/project/fuse/fuse-2.X/2.8.4/fuse-2.8.4.tar.gz
[ec2-user@ip-10-xx-xx-xxx ~]$ tar -xvf fuse-2.8.4.tar.gz
[ec2-user@ip-10-xx-xx-xxx ~]$ cd fuse-2.8.4
[ec2-user@ip-10-xx-xx-xxx ~]$ sudo  yum -y install "gcc*" make libcurl-devel libxml2-devel openssl-devel
[ec2-user@ip-10-xx-xx-xxx ~]$ sudo ./configure --prefix=/usr
[ec2-user@ip-10-xx-xx-xxx ~]$ sudo make && sudo make install
[ec2-user@ip-10-xx-xx-xxx ~]$ sudo ldconfig
--Verify that the new version is now in place
[ec2-user@ip-10-xx-xx-xxx ~]$ pkg-config --modversion fuse
2.8.3
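
If pkg-config still reports a version older than 2.8.4, as in the output above, the s3fs configure step will fail again. The configure error itself suggests adjusting PKG_CONFIG_PATH, so one possible cause is an older fuse.pc being found first. Below is a rough sketch of a version check; version_ge and check_fuse are hypothetical helper names, and the comparison relies on the -V option of GNU sort.

```shell
#!/bin/bash
# Return success when version $1 is greater than or equal to version $2.
# Uses GNU sort's -V (version sort): the smaller version sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: check the FUSE version that pkg-config sees
# against the 2.8.4 minimum from the configure error.
check_fuse() {
    local found
    found=$(pkg-config --modversion fuse 2>/dev/null) || {
        echo "fuse not found by pkg-config"
        return 1
    }
    if version_ge "$found" "2.8.4"; then
        echo "fuse $found is new enough"
    else
        echo "fuse $found is too old; check PKG_CONFIG_PATH"
        return 1
    fi
}

# Usage: check_fuse
```

If the check fails, pointing pkg-config at the new install first may help, for example export PKG_CONFIG_PATH=/usr/lib/pkgconfig:$PKG_CONFIG_PATH. That path is my assumption based on --prefix=/usr, so confirm where fuse.pc actually landed on your instance.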

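With FUSE upgraded, we can return to the s3fs installation: re-run ./configure --prefix=/usr in the s3fs-1.63 directory, then build and install with make and sudo make install. Next, add your AWS credentials to /etc/passwd-s3fs in the following format: AWS Access Key:Secret Key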

[ec2-user@ip-10-xx-xx-xxx ~]$ sudo vi /etc/passwd-s3fs
-- Set file permission
[ec2-user@ip-10-xx-xx-xxx ~]$ sudo chmod 640 /etc/passwd-s3fs
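
If you would rather script this step than edit the file by hand, something like the following sketch could work. The helper name write_s3fs_passwd and the key values are placeholders of mine, and the example writes to a temporary file rather than /etc/passwd-s3fs so it is safe to try.

```shell
#!/bin/bash
# Hypothetical helper: write "accessKey:secretKey" to a credentials file
# and restrict its permissions to 640, as in the manual steps.
write_s3fs_passwd() {
    local file=$1 access_key=$2 secret_key=$3
    printf '%s:%s\n' "$access_key" "$secret_key" > "$file" || return 1
    chmod 640 "$file"
}

# Example with placeholder credentials, written to a temporary file.
demo_file=$(mktemp)
write_s3fs_passwd "$demo_file" "AKIAEXAMPLE" "exampleSecret"
cat "$demo_file"
```

For the real file you would run the helper as root and pass /etc/passwd-s3fs as the first argument.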

Now you should be able to mount your S3 bucket on a local directory as follows:

[ec2-user@ip-10-xx-xx-xxx ~]$ sudo s3fs  
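
The command above takes a bucket name and a mount point as its arguments. As a sketch of the general shape, here is a small wrapper with a dry-run mode. The bucket name my-bucket, the mount point /mnt/s3, and the helper mount_bucket are all placeholders I made up; passwd_file is the s3fs option for pointing at the credentials file.

```shell
#!/bin/bash
# Hypothetical helper: build the s3fs mount command for a bucket and mount point.
# With DRY_RUN=1 (the default here) it only prints the command instead of running it.
mount_bucket() {
    local bucket=$1 mount_point=$2
    local cmd=(s3fs "$bucket" "$mount_point" -o passwd_file=/etc/passwd-s3fs)
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "${cmd[*]}"
    else
        sudo mkdir -p "$mount_point" && sudo "${cmd[@]}"
    fi
}

# Placeholder bucket and mount point; prints the command only.
mount_bucket "my-bucket" "/mnt/s3"
```

Set DRY_RUN=0 to actually create the mount point and mount the bucket.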

That is about it and thanks for reading!