Catoira's Development Notes: Shell Scripts

There's no denying that scripting is powerful for the right kind of task. Even so, some people see bash scripts as a really complex technology. Earlier this year (2013), for example, I was asked to help with a proof of concept (PoC) for a specific client. The team was already working on the main tasks, but they were struggling with some validations and with automating the handling of the PoC files.

To validate the files and their contents, I wrote a couple of Java programs. For the file automation (copy, uncompress, and move to another file system, Hadoop), with some extra validations, I decided to write a generic bash script on the Linux box. Here are some really simple chunks of that script, to show how easy bash is to use and how simply it solves some problems:

Before starting to write the main logic, it's necessary to define the global variables. For example:

### Set main directory
rootDir=/
targetHadoopDir=/user/jdoe/client/data/
tempDir=/home/jdoe/tmp/

The first action is to check whether a parameter was passed on the command line. If not, a message is printed and execution stops:

### check if there's a directory name passed as parameter
if [ $# -ne 1 ]
then
    echo "Need a directory as a parameter!"
    exit 1
else
    rootDir=$1
fi
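
A slightly stricter variant of this check might also confirm that the argument really is a directory, and send the messages to standard error. This is only a sketch: `parseArgs` is a hypothetical helper that is not part of the original script, and it returns a status instead of exiting so that the caller decides what to do.

```shell
# Hypothetical helper (not in the original script): validate the single
# directory argument and store it in the global rootDir.
parseArgs() {
    if [ "$#" -ne 1 ]; then
        echo "Need a directory as a parameter!" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "'$1' is not a directory!" >&2
        return 1
    fi
    rootDir=$1
}
```

In the main logic it could be called as `parseArgs "$@" || exit 1`.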

After that, the script has to check whether the temporary directory exists. If it does, the script removes it so it can start clean; either way, it then creates the directory. Calling the subroutine that performs this logic takes nothing more than its name:

### check if a temporary directory exists
checkTemp

The routine has to be defined before the main logic that calls it:

### check if temp directory exists, and create it or clean it
checkTemp() {
    if [ -d "${tempDir}" ]
    then
        rm -r "$tempDir"
        if [ $? -gt 0 ]
        then
            ### error
            echo "Error removing directory '$tempDir'!"
            exit 1
        fi
    fi
    mkdir "$tempDir"
    if [ $? -gt 0 ]
    then
        ### error
        echo "Error creating temp directory '$tempDir'!"
        exit 1
    else
        echo "Temp directory '$tempDir' ready to be used!"
    fi
}
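
The same checks can also be written by letting `if` test the command directly, instead of inspecting `$?` afterwards. The sketch below is a variation, not the script I actually ran: `prepareTemp` is a hypothetical name, it takes the directory as an argument, and it returns 1 rather than exiting.

```shell
# Sketch: remove-then-create logic like checkTemp, with each command
# tested directly by "if". The directory is passed as $1.
prepareTemp() {
    # Remove the directory only if it exists; fail if the removal fails.
    if [ -d "$1" ] && ! rm -r "$1"; then
        echo "Error removing directory '$1'!" >&2
        return 1
    fi
    if ! mkdir "$1"; then
        echo "Error creating temp directory '$1'!" >&2
        return 1
    fi
    echo "Temp directory '$1' ready to be used!"
}
```

A caller would use it as `prepareTemp "$tempDir" || exit 1`.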

You might have noticed how simple it is to run a command and check whether it succeeded. Another tip is to print the steps and comment your script. Try to be clear, since you might forget later why you were performing certain steps.

The last step in the main logic is to process the files. In fact, all the steps above were basically preparation for this. So, in the main logic I called the main subroutine with the root directory as a parameter:

### run down the directory tree and work with the files
processDir "$rootDir"
echo "Done!"

This routine is recursive. It goes through every entry under the directory it receives, calling itself when the entry is a directory and processing the entry when it is a regular file:

### process directory search, recursively
processDir() {
    for file in "$1"/*
    do
        if [ -f "$file" ]
        then
            ### it's a file
            processFile "$file"
        else
            if [ -d "$file" ]
            then
                ### it's a directory
                processDir "$file"
            fi
        fi
    done
}
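
To watch the recursion work, the routine can be pointed at a small throwaway tree. In this sketch `processFile` is only a stub that prints the path (the real one does much more), the loop variable is named `entry`, and `demoRoot` is a hypothetical temporary directory created just for the demo.

```shell
# Stub standing in for the real processFile (hypothetical body).
processFile() {
    echo "file: $1"
}

# Same walk as above, written with elif.
processDir() {
    for entry in "$1"/*; do
        if [ -f "$entry" ]; then
            processFile "$entry"
        elif [ -d "$entry" ]; then
            processDir "$entry"
        fi
    done
}

# Build a small tree: three files, two levels deep, plus an empty directory.
demoRoot=$(mktemp -d)
mkdir -p "$demoRoot/a/b" "$demoRoot/empty"
touch "$demoRoot/top.txt" "$demoRoot/a/one.txt" "$demoRoot/a/b/two.txt"

found=$(processDir "$demoRoot")
echo "$found"
rm -rf "$demoRoot"
```

The empty directory is handled without extra code: by default an unmatched `*` stays as a literal string, which is neither a file nor a directory, so both tests fail and the entry is skipped.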

I won't reproduce the whole processFile routine and the other subroutines it calls here, but I'll point out some interesting pieces of code in it:

How to get the parameter within the subroutine:

### process each file
processFile() {
    fileName=$1
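
Inside a function, `$1` and `$#` refer to the arguments passed to that function, not to the ones given to the script. A minimal sketch of this, using a hypothetical `describeFile` function and a made-up file path:

```shell
# Hypothetical function illustrating positional parameters in a function.
describeFile() {
    local fileName=$1                 # first argument passed to the function
    local baseName=${fileName##*/}    # strip everything up to the last "/"
    echo "args=$# name=$baseName"
}

describeFile /data/in/2013_food_01.gz
# prints: args=1 name=2013_food_01.gz
```

Declaring the variables with `local` keeps them from leaking into the rest of the script, which helps once several routines call each other.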

Comparing the file name against a string:

if echo "$fileName" | grep -q "_food_"
then
    prefix="f"
    echo " it's a food file!"
else
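
The same substring test can be done without starting a grep process, by matching the name against a `case` pattern. This is a sketch rather than what the script did: `prefixFor` is a hypothetical helper, only the `_food_` rule comes from the script, and the `?` fallback is a placeholder for the other file types.

```shell
# Hypothetical helper: map a file name to a one-letter prefix.
prefixFor() {
    case $1 in
        *_food_*) echo "f" ;;   # name contains "_food_"
        *)        echo "?" ;;   # placeholder: other file types go here
    esac
}

prefix=$(prefixFor "sales_food_2013.gz")
echo "prefix=$prefix"
# prints: prefix=f
```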

All the other steps are basically Linux and Hadoop commands to safely perform the copy.
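
I'm not listing those commands here, but a step of that kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `copyToHadoop` name, gzip-compressed input, the rule of never overwriting a file already in Hadoop, and a `hadoop` command available on the PATH. It only uses `hadoop fs -test -e` and `hadoop fs -put`.

```shell
# Sketch only, under the assumptions stated above.
# Usage: copyToHadoop <file.gz> <local work dir> <hadoop target dir>
copyToHadoop() {
    local src=$1
    local workDir=$2
    local destDir=$3
    local name
    name=$(basename "$src" .gz)

    # Uncompress into the work directory, leaving the source untouched.
    if ! gunzip -c "$src" > "$workDir/$name"; then
        echo "Error uncompressing '$src'!" >&2
        return 1
    fi
    # Refuse to overwrite a file that is already in Hadoop.
    if hadoop fs -test -e "$destDir/$name"; then
        echo "'$destDir/$name' already exists, skipping." >&2
        rm -f "$workDir/$name"
        return 1
    fi
    if ! hadoop fs -put "$workDir/$name" "$destDir/"; then
        echo "Error copying '$name' to Hadoop!" >&2
        return 1
    fi
    # Clean up the local uncompressed copy once the upload succeeded.
    rm "$workDir/$name"
}
```

Checking every command this way is what lets a script like this run unattended: a failure is reported with the file name instead of being silently carried into the next step.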

After performing some tests, and adjusting some minor issues, I left it running for the whole weekend, and got the results on the following Monday.

The PoC ended up being a success, and this step of the process, which could have become a huge manual task, was solved with a simple bash script.

So, in my opinion, if you are looking for a way to handle repetitive tasks or to orchestrate complex ones, you should consider writing a script. It can be the easiest and most efficient solution.

Thursday, June 27, 2013
