Optimizing shell scripts
Introduction
I consider myself a fairly competent shell scripter. I prefer to program for readability, but in the end I tend to write terse code.
Readability matters because other people (including my future self) will need to read the code and figure out what is going on.
I suspect that because I wrote perl for a long time, I gravitate towards fairly terse code. This is not something to be proud of.
The other day, I was writing a small shell script to generate a markdown menu of links from a simple, specially formatted text file.
Input Data
See input:
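The actual input file is not reproduced here. Judging from the parsing code later in this post, which splits fields on >, a hypothetical input could look like this:

Home > https://example.com/
Docs > https://example.com/docs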
Desired Output
See output:
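Again, the actual output is not reproduced; for the hypothetical input above, the generated markdown menu would look something like:

- [Home](https://example.com/)
- [Docs](https://example.com/docs)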
Approach
I initially wrote it on my PC, which has a comparatively fast CPU. The response time on my PC was about one second. When I transferred the script to my web server (with a small CPU), the response time was 15 seconds to generate the menu. This surprised me, as I did not expect the performance difference between my PC and my web server to be so massive.
I figured that this needed to be optimized. The simplest way to do that is to add caching, which I did very quickly. But in reality, I needed to improve the code.
The first thing you need to do when optimizing code is to measure the effect of your changes. In order to do that, I created a simple test harness.
The options to the harness are:
-1 or --no-count
: run the code once, essentially to check that the output is correct.
--count=int
: run the code the specified number of times.
1 or 2
: run implementation 1 or implementation 2.
Run this with the time built-in to measure running times.
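The harness code itself is not shown above; a minimal sketch of what it could look like (the impl1 and impl2 function names are my assumption, standing in for the two implementations being compared):

#!/bin/sh
# th.cgi -- hypothetical sketch of the test harness
impl1() { : ; }   # implementation 1 goes here
impl2() { : ; }   # implementation 2 goes here

count=1
case "$1" in
-1|--no-count) count=1 ; shift ;;
--count=*) count="${1#--count=}" ; shift ;;
esac

i=0
while [ "$i" -lt "$count" ] ; do
case "$1" in
1) impl1 ;;
2) impl2 ;;
esac
i=$((i+1))
done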
So running:
sh th.cgi -1 1
or
sh th.cgi -1 2
can be used to make sure that implementations 1 and 2 are correct.
Then, you can run:
time sh th.cgi --count=100 1
or
time sh th.cgi --count=100 2
This will show how fast (or slow) the code performs. For testing, I would also test with busybox sh, as that is the shell I have on my web server. On my PC, sh is bash.
So the general approach is to measure the original performance, then make some changes and measure if the performance improves (or not).
Original Script
The original un-optimized code is:
Running this on my PC takes almost 10 minutes.
Optimized Script
The optimized version is:
This version executes in 16 seconds.
Overall, that is a pretty good performance improvement. The approach I took was first to isolate which part of the code was the slowest. The script essentially runs in a loop with two halves: the first half scans lines, the second half outputs markdown.
Commenting out the different parts of the code, I was able to determine that most of the time was spent in the scanning half.
The next step was to review the code and refactor it into faster versions:
Moving invariant code out of loops
Move invariant code outside the loop. Assigning elem2o and elem2c was originally done in the loop; this was moved outside the loop. It did not yield much improvement, but in general this is always a good optimization.
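As a rough illustration (only the variable names elem2o and elem2c come from the script; the values and the loop body are made up):

# before: assigned on every iteration
while read -r line ; do
elem2o="<li>"    # invariant: the same on every pass
elem2c="</li>"
# ... use $elem2o and $elem2c ...
done

# after: assigned once, before the loop
elem2o="<li>"
elem2c="</li>"
while read -r line ; do
# ... use $elem2o and $elem2c ...
done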
Replacing if/then/else with case
I was doing:
if (echo "$ARGS" | grep -q '>') ; then
...
else
...
fi
This actually forks and execs two additional commands and a subshell. This was replaced with:
case "$ARGS" in
*\>*)
...
;;
*)
...
;;
esac
Since this runs entirely inside the shell, it avoids spawning external commands and forking subshells.
Using IFS for parsing
I replaced:
local i=1
# Peel one '>'-separated field per iteration, spawning echo, cut,
# xargs, and expr every time around the loop.
while [ -n "$(echo "$ARGS" | cut -d'>' -f$i-)" ] ; do
set - "$@" "$(echo "$ARGS" | cut -d'>' -f$i | xargs)"   # xargs trims the field
i=$(expr $i + 1)
done
This is a terrible way to parse a line. It was replaced with:
Adding -e 's/[ ]*>[ ]*/>/g' to the sed command at the top, in front of the loop, and then...
IFS=">" ; set - $ARGS; IFS="$oIFS"
The performance difference here is massive.
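For a self-contained picture of the pattern (the input line here is hypothetical, and oIFS is assumed to hold the saved original IFS):

#!/bin/sh
ARGS='Home > https://example.com/ > Example site'   # hypothetical input line
ARGS=$(echo "$ARGS" | sed -e 's/[ ]*>[ ]*/>/g')     # strip spaces around '>'
oIFS="$IFS"                                         # save the original IFS
IFS=">" ; set - $ARGS ; IFS="$oIFS"                 # split on '>' into $1, $2, ...
echo "1: $1"   # Home
echo "2: $2"   # https://example.com/
echo "3: $3"   # Example site

The split itself happens entirely inside the shell; the per-field echo/cut/xargs pipelines are gone.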
Conclusion
So there you have it. It turns out that my scripting skills are poor. On the other hand, it could be that I am so used to writing shell scripts that I am not using the best tools for the job. In this particular example, I shouldn't have started with a shell script, but used something more suitable for text manipulation, such as perl or awk. Most scripting languages come with useful string-processing capabilities and would have done the job well.