Thursday, June 27, 2013

Shell Scripts

We can't deny that scripting is powerful when it fits the task at hand. But some people look at bash scripts as a really complex technology. Early this year (2013), for example, I was asked to help with a proof of concept (PoC) for a specific client. The team was already working on the main tasks, but they were struggling with some validations and with automating the handling of the PoC files.

So, to validate the files and their contents, I wrote a couple of Java programs. But for the file automation (copy, uncompress, and move to another file system, Hadoop), with some extra validations, I decided to write a generic bash script on the Linux box. Here I'll show some really simple chunks of that code to illustrate how easy bash is to use, and how it solves some issues in a simple way:

  • Before starting to write the main logic, it's necessary to define the global variables. For example:
### Set main directory
rootDir=/
targetHadoopDir=/user/jdoe/client/data/
tempDir=/home/jdoe/tmp/
  • The first action is to check whether a parameter was passed on the command line. If not, a message is printed and the execution stops:
### check if there's a directory name passed as parameter
if [ $# -ne 1 ]
then
  echo "Need a directory as a parameter!"
  exit 1
else
  rootDir=$1
fi
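As a possible refinement, the same check can also verify that the argument really is a directory, and send the messages to stderr. This is only a sketch: the parseRootDir name and the usage text are made up for illustration and are not part of the original script.

```shell
# Hypothetical helper (not in the original script): validates the argument
# list and prints the directory name. It returns non-zero instead of exiting,
# so the caller decides whether to stop.
parseRootDir() {
  if [ "$#" -ne 1 ]
  then
    echo "Usage: $0 <directory>" >&2
    return 1
  fi
  if [ ! -d "$1" ]
  then
    echo "'$1' is not a directory!" >&2
    return 1
  fi
  printf '%s\n' "$1"
}
```

In the main logic it could be used as `rootDir=$(parseRootDir "$@") || exit 1`.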
  • After that, the script has to check whether a temporary directory exists. If it's there, the script cleans it up. If not, it creates the directory. The subroutine that performs this logic is called simply by using its name:
### check if a temporary directory exists
checkTemp
  • The routine has to be defined before the main logic that calls it:
### check if temp directory exists, and create it or clean it
checkTemp() {
  if [ -d "${tempDir}" ]
  then
    ### it exists: remove it so it can be recreated empty
    rm -r "$tempDir"
    if [ $? -gt 0 ]
    then
      ### error
      echo "Error removing directory '$tempDir'!"
      exit 1
    fi
  fi
  mkdir "$tempDir"
  if [ $? -gt 0 ]
  then
    ### error
    echo "Error creating temp directory '$tempDir'!"
    exit 1
  else
    echo "Temp directory '$tempDir' ready to be used!"
  fi
}
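The exit status can also be tested directly in the `if`, without going through `$?`. Here is a minimal sketch of that style, assuming a hypothetical resetTempDir routine that takes the directory as a parameter and returns an error code instead of exiting. It is a variation for illustration, not the routine used in the PoC.

```shell
# Hypothetical variant of checkTemp: the directory comes in as $1, and the
# function returns 1 on failure so the caller chooses how to react.
resetTempDir() {
  if [ -z "$1" ]
  then
    echo "No temp directory given!" >&2
    return 1
  fi
  # Remove the directory only if it is already there.
  if [ -d "$1" ] && ! rm -r -- "$1"
  then
    echo "Error removing directory '$1'!" >&2
    return 1
  fi
  # Recreate it empty; -p also creates missing parent directories.
  if ! mkdir -p -- "$1"
  then
    echo "Error creating temp directory '$1'!" >&2
    return 1
  fi
  echo "Temp directory '$1' ready to be used!"
}
```

It could be called as `resetTempDir "$tempDir" || exit 1`.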
  • You might have noticed how simple it is to run a command and check whether it succeeded. Another tip is to print the steps and comment your script. Try to be clear, since you might forget later why you were performing certain steps.
  • The last step of the main logic is to process the files. In fact, all the steps above were basically preparation for this. So, in the main logic I called the main subroutine with the root directory as a parameter:
### run down the directory tree and work with the files
processDir "$rootDir"
echo "Done!"
  • This routine is recursive. It goes through every entry in the directory it receives, calling itself for each subdirectory and processing each regular file:
### process directory search, recursively
processDir() {
  for file in "$1"/*
  do
    if [ -f "$file" ]
    then
      ### it's a file
      processFile "$file"
    else
      if [ -d "$file" ]
      then
        ### it's a directory
        processDir "$file"
      fi
    fi
  done
}
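For comparison, the same walk over the tree can be delegated to `find`, which does the recursion itself. This is a sketch of a different technique, not what the PoC script did: processTree is a hypothetical name, and processFile here is only a stand-in that prints the file name. Note that `find` also returns hidden files, which the `*` glob in processDir skips, and that this loop assumes file names without newlines.

```shell
# Stand-in for the real processFile routine, just for this example.
processFile() {
  echo "Processing file '$1'"
}

# Hypothetical alternative to processDir: find lists every regular file
# under $1, and the loop hands them to processFile one by one.
# The loop runs in a subshell, so variables set inside it are not
# visible after the pipeline ends.
processTree() {
  find "$1" -type f | while IFS= read -r file
  do
    processFile "$file"
  done
}
```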
I won't reproduce the whole processDir routine and all the subroutines it calls, but I'll point out some interesting pieces of code in them:
  • How to get the parameter within the subroutine:
### process each file
processFile() {
  fileName=$1
  • Comparing the file name against a string:
if echo "$fileName" | grep -q "_food_"
then
  prefix="f"
  echo "  it's a food file!"
else
All the other steps are basically Linux and Hadoop commands to safely perform the copy.
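The file name comparison shown above can also be written with `case`, which matches patterns in the shell itself instead of starting `grep`. A minimal sketch, assuming a hypothetical filePrefix helper. Only the "_food_" to "f" mapping comes from the script; the empty default is an assumption.

```shell
# Hypothetical helper: prints the prefix for a file name.
# "_food_" anywhere in the name gives "f"; anything else gives an
# empty prefix (assumed default, not from the original script).
filePrefix() {
  case "$1" in
    *_food_*) echo "f" ;;
    *)        echo ""  ;;
  esac
}
```

Inside processFile it could be used as `prefix=$(filePrefix "$fileName")`.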

After running some tests and fixing some minor issues, I left it running for the whole weekend, and got the results on the following Monday.

The PoC ended up being a success, and this step of the process, which could have become a huge manual task, was solved with a simple bash script.

So, in my opinion, if you are looking for a solution for repetitive tasks or for orchestrating complex ones, you should consider writing a script. It can be the easiest and most efficient solution.