In this lab we will extend our knowledge of shell scripting. We will introduce control flow constructs and other bits that make shell scripts more powerful, and we will also learn how to detect bugs in our scripts without even running them.

As before, we will use a running example for learning the new shell constructs, and we will take a look at some networking tools too.

Preflight checklist

  • You can use shell variables and command substitution.
  • You remember what Pandoc was used for.
  • You have uploaded your public key to one of the 05/key.[0-9].pub files.

Reading network configuration

Before diving into the main topic, we will take a small detour to a practical skill that comes in very useful: how to view the network configuration of your machine from the command line.

For the following text we will assume your machine is connected to the Internet (this includes your virtualized installation of Linux).

The basic command for setting and reading network configuration is ip.

Probably the most useful one for us at the moment is ip addr.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 54:e1:ad:9f:db:36 brd ff:ff:ff:ff:ff:ff
3: wlp58s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 44:03:2c:7f:0f:76 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.105/24 brd 192.168.0.255 scope global dynamic noprefixroute wlp58s0
       valid_lft 6209sec preferred_lft 6209sec
    inet6 fe80::9ba5:fc4b:96e1:f281/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
8: vboxnet0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0a:00:27:00:00:00 brd ff:ff:ff:ff:ff:ff

It lists four interfaces (lo, enp0s31f6, wlp58s0 and vboxnet0) that are available on the machine. Your list will differ, as will the naming.

The name signifies the interface type.

  • lo is the loopback device that will always be present. With the loopback device, you can test network applications even without “real” connectivity.
  • enp0s31f6 (often also eth*) is a wired Ethernet adapter.
  • wlp58s0 is a wireless adapter.
  • vboxnet0 is a virtual network card used by VirtualBox when you create a virtual subnet for your virtual machines (you will probably not have this one).

If you are connected via VPN, you might also see a tun0 interface.

The state of the interface (up and running or not) is on the same line as the adapter name.

The link/ entries denote the MAC address of the adapter. Lines with inet specify the IP address assigned to the interface, including the network. In this example, lo has 127.0.0.1/8 (obviously), enp0s31f6 is without an address (state DOWN) and wlp58s0 has the address 192.168.0.105/24 (i.e., 192.168.0.105 with netmask 255.255.255.0).

Your addresses will be slightly different, but you will typically see a private address (behind a NAT), as you are probably connecting to your ISP through a router.
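
If you want to inspect a single interface only, you can name it explicitly (lo from the example above exists on every machine):

ip addr show dev lo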

Running example

We will return to our example with web generation again. The sources are again in our examples repository but as you can see in 08/web, there are many more pages now and we have split the files into multiple subdirectories.

There is src with the input files in Markdown, there is static with the CSS file and possibly other files that would be copied as-is to the web server, and there is also templates with Pandoc templates for our pages.

We will now build a decent shell script that is able to build this web site and also copy it to a web server so that it is publicly available.

We acknowledge that there are special tools for exactly this purpose. They are called static site generators (or just SSGs) and there is a huge number of them available. But their task offers the right playground for showing what the shell is capable of :-).

We will start with the trivial builder that is basically a copy of the one from one of the previous labs.

We highly recommend that you copy the fragments from our repository to your repository (feel free to use the submission one) and commit each version. Use git from the command line and use proper commit messages.

And if you create a separate issue for each part that you later close from a commit message, you will also practice good software engineering skills.

Script modularization and configuration loading (. and source)

So far, our scripts were always self-contained in a single file. No surprise: they were all quite short. But sometimes it makes sense to split the code into multiple files to allow sharing code across multiple scripts.

That is fine if the shared code is a standalone script. Imagine we create a file called msg.sh with the following content:

#!/bin/bash

echo "$( date '+%Y-%m-%d %H:%M:%S |' )" "$@" >&2

Then in other scripts we can call into msg.sh to print logging messages.

...

./msg.sh "Starting computation..."
...
./msg.sh "Computation done."

That might work well, but such short code would probably be better as a function, and we might want to have multiple functions inside a single file. Especially for one-liners, having a separate file for each “function” is somewhat impractical.

We will thus store the function in a separate file logging.sh.


msg() {
    echo "$( date '+%Y-%m-%d %H:%M:%S |' )" "$@" >&2
}

To use the functions in a different script, we need to instruct the shell to include our file using the source or . (yes, a standalone dot) construct.

# Both lines are equal, only one of them would be used in reality
. logging.sh
source logging.sh

...

msg "Starting computation"

Why would simply calling the shell script not work? Answer.

There are other uses of source. We have seen that we can use it to load shared code but it is also very often used to load configuration.

Imagine we want to allow the user to define a directory where result files are stored.

results=/home/intro/results/

Certainly we can try to parse this file using cut and grep, but it is actually much easier to simply load this file using source and access $results directly.

The truth is that we give the user much more than a plain configuration file: we do not need to think about a specific format, advanced users can include more shell magic, while beginners will see it as a file in the var=value format.
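
As a minimal sketch (the file name config.rc and the default path are ours for illustration), loading such a configuration could look like this:

#!/bin/bash

# Default used when the configuration file does not override it.
results=/tmp/results

# Load the user configuration if it exists.
if [ -f config.rc ]; then
    . config.rc
fi

echo "Results will be stored in $results" >&2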

This is actually how your shell is configured. Recall that we have updated ~/.bashrc with the EDITOR variable. This file is also sourced when Bash is started, and it should be clear by now why it often contains the following snippet:

if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

If the given file exists (we will get to the proper syntax later in this lab), we source it, i.e. import its content here. Therefore we import the global Bash configuration stored in /etc directory.

The source construct behaves as if the content of the included file was really pasted in place of the source line. Bash does not have any fancy namespace support or similar.

Files that are expected to be included via source usually do not have a shebang and are not executable. That is mostly to emphasize the fact that they are not standalone executables but rather “libraries”.

The same also applies to Python modules: you will usually see a shebang in the main program (and the x bit set), while actual modules (that you import) are often shebang-less and rw- only.

Advancing the running example

We will rework our example to a versatile solution where the user will provide a site configuration that our script will read.

We will create the following ssg.rc inside our directory with the webpage.

# My site configuration

site_title="My site"

build_page "index.md"
build_page "rules.md"
build_page "alpha.md"

And we will modify our main script to look like this.

#!/bin/bash

set -ueo pipefail

msg() {
    echo "$( date '+%Y-%m-%d %H:%M:%S | SSG |' )" "$@" >&2
}

get_version() {
    git rev-parse --short HEAD 2>/dev/null || echo unknown
}

build_page() {
    local input_file="$1"
    local output_file="public/$( basename "$input_file" ".md" ).html"
    msg "Generating $input_file => $output_file"
    pandoc \
        --template templates/main.html \
        --metadata site_title="$site_title" \
        --metadata page_version="$( get_version )" \
        "src/$input_file" >"$output_file"
}

site_title="$( whoami )'s site"

mkdir -p public

source ssg.rc

cp -R static/* public/

What have we created? Our configuration file ssg.rc actually contains a trivial domain-specific language (DSL) that drives the website generation. Our main script provides the build_page function that is called from the configuration file.

Inside this function we compute the output filename (try what basename input.md .md does!) and run Pandoc.
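
For reference, basename strips the directory part of a path and, when given a suffix as its second argument, removes that suffix as well:

basename input.md .md        # prints: input
basename src/page.md .md     # prints: page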

Actually, it is a very straightforward piece of code, but we managed to split the configuration and the actual generation into separate files and create a reusable tool. Compare how much work this would be in a different language. Just imagine how much work it would be to parse the configuration file…

Before moving on make sure you understand how the above code works. You should be able to answer the following questions:

  1. Why is ssg.rc not executable (and why does it not need to be)?
  2. Why is the $site_title variable set before sourcing ssg.rc?
  3. What would happen if we sourced ssg.rc before the call to mkdir -p public?

Control flow in shell scripts

Before diving into control flow in shell scripts, let us mention that multiple commands can be separated by ; (the semicolon). While in shell scripts it is preferable to write one command per line, interactive users often find it easier to have multiple commands on one line (even if only to allow faster history browsing with the up arrow).

We will see semicolons at various places in the control flow structures, serving as a separator.
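
For example, the following one-liner runs three commands in sequence (the names are purely illustrative):

mkdir -p logs; cd logs; touch today.log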

for loops

For loops in the shell always iterate over a set of values provided when the loop starts.

The general format is as follows:

for VARIABLE in VAL1 VAL2 VAL3; do
    body of the loop
done

Typical uses include iterating over a list of files, often generated by expanding wildcards.

Let us see an example that counts the number of digits in all *.txt files:

for i in *.txt; do
    echo -n "$i: "
    tr -c -d '0-9' <"$i" | wc -c
done

Notice that the for statement is given the variable name i without a $. We also see that variable expansion can be used in redirection of stdin (or stdout).

When writing this in the shell, the prompt would change to plain > (probably, depending on your configuration) to signal that you are expected to enter the rest of the loop.

Squeezing the whole loop into one line is also possible (but useful only for fire-and-forget type of scripts):

for i in *.txt; do echo -n "$i: "; tr -c -d '0-9' <"$i" | wc -c; done

When we want to iterate over values with spaces, we need to quote them. Wildcard expansion is safe in this respect and works regardless of spaces in the filenames.

for i in one "two three"; do
    echo "$i";
done

What does the following code print assuming there is no *.txxt file in the current directory?

for i in *.txxt; do
    echo "$i"
done

Hint.

Answer.

if and else

The if condition in the shell is a bit trickier.

The essential thing to remember is that the condition is always a command to be executed and its outcome (i.e., the exit code) determines the result.

So the condition is actually never in the traditional format of a equals b as it is always the exit code that controls the flow.

The general syntax of if-then-else is this:

if command_to_control_the_condition; then
    success
elif another_command_for_else_if_branch; then
    another_success
else
    the_else_branch_commands
fi

Note that if has to be terminated by fi and that elif and else branches are optional.

Simple conditions can be evaluated using the test command that we already know. See man test to inspect what things can be tested.

Let us see how to use if with test to check whether we are inside a Git project:

if test -d .git; then
    echo "We are in the root of a Git project"
fi

In fact, there exists a more elegant syntax: [ (left bracket) is a synonym for test which does the same, except that it requires that the last argument is ]. Using this syntax, our example can look as follows:

if [ -d .git ]; then
    echo "We are in the root of a Git project"
fi

Still, [ is just a regular command whose exit code determines what if shall do.

By the way, look into /usr/bin to see that the application file is really named [. But Bash (our shell) also implements [ as a builtin, so it is a little bit faster than executing an external program.
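
You can verify both claims yourself:

ls -l '/usr/bin/['    # the external program
type [                # Bash replies: [ is a shell builtin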

You can also encounter the following snippet:

if [[ -d .git ]]; then
    echo "We are in the root of a Git project"
fi

This [[ ... ]] is a different construct, related to the $(( ... )) syntax for arithmetic expressions: the condition is evaluated by Bash itself. This syntax is a little bit more powerful, but it is specific to Bash, so it is unlikely to work in other shells.

We will be using the traditional variant with [ only.

while loops

While loops have the following form:

while command_to_control_the_loop; do
    commands_to_be_executed
done

Again, the condition is true if the command_to_control_the_loop returns with exit code 0.

The following example finds the first available name for a log file. Note that this code is not immune to races when executed concurrently. That is, it assumes it can be run multiple times, but never in more processes at the same time.

counter=1
while [ -f "/var/log/myprog/main.$counter.log" ]; do
    counter=$(( counter + 1 ))
done
logfile="/var/log/myprog/main.$counter.log"
echo "Will log into $logfile" >&2

To make the program race-resistant (i.e., against concurrent execution), we would need to use mkdir that fails when the directory already exists (i.e., it is atomic enough to distinguish if we were successful and are not just stealing someone else’s file).

Note that the loop below uses the exclamation mark ! to invert the command outcome.

counter=1
while ! mkdir "/var/log/myprog/log.$counter"; do
    counter=$(( counter + 1 ))
done
logfile="/var/log/myprog/log.$counter/main.log"
echo "Will log into $logfile" >&2

break and continue

As in other languages, the break command is available to terminate the currently executing loop. You can use continue as usual, too.
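
As a small illustration, the following loop skips empty *.txt files and stops completely at the first unreadable one (both conditions are just examples):

for i in *.txt; do
    if ! [ -r "$i" ]; then
        break
    fi
    if ! [ -s "$i" ]; then
        continue
    fi
    echo "$i is readable and non-empty"
done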

Switch (a.k.a. case ... esac)

When we need to branch our program based on a variable value, the shell offers the case construct. It is somewhat similar to the switch construct in other languages, but it has a bit of shell specifics mixed in.

The overall syntax is the following:

case value_to_branch_on in
    option1) commands_for_option_one ;;
    option2) commands_for_option_two ;;
    *) the_default_branch ;;
esac

Notice that like with if, we terminate with the same keyword reversed and that there are two semicolons ;; to terminate the commands for a particular option.

The options can contain wildcards and | to make the matching a bit more flexible.

A simple example can look like this:

case "$EDITOR" in
    mc|mcedit) echo 'Midnight Commander rocks' ;;
    joe) echo 'Small but powerful' ;;
    emacs|vi*) echo 'Wow :-)' ;;
    *) echo "Someone really uses $EDITOR?" ;;
esac

Advancing the running example

Armed with the knowledge about control flow available, we can make our script for site generation even better.

We will remove the burden of specifying the list of files manually and find the files ourselves. Therefore, the user configuration file would be completely optional.

We will change our script as follows.

build_page() {
    local input_file="$1"
    local output_file="public/$( basename "$input_file" ".md" ).html"
    msg "Generating $input_file => $output_file"
    pandoc \
        --template templates/main.html \
        --metadata site_title="$site_title" \
        --metadata page_version="$( get_version )" \
        "$input_file" >"$output_file"
}

...

if [ -f ssg.rc ]; then
    source ssg.rc
fi

for page in src/*.md; do
    if ! [ -f "$page" ]; then
        continue
    fi

    build_page "$page"
done

We have modified build_page to not prepend src when running pandoc and we iterate over the Markdown files by ourselves.

The ! inverts the meaning of the exit code, i.e., it behaves as a boolean not.

Why do we test for -f inside the loop? Answer.

Redirection of bigger shell portions

The whole control structure (e.g., for, if, or while with all the commands inside) behaves as a single command. So you can apply redirection to the whole structure.

To illustrate this, we can transform the message to upper case like this.

if test -d .git; then
    echo "We are in a root of a Git project"
else
    echo "This is not a root of a Git project"
fi | tr 'a-z' 'A-Z'
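
The same works when redirecting into a file (the file name is illustrative):

for i in *.txt; do
    wc -l "$i"
done >line_counts.txt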

Script parameters and getopt

Recall that when a shell script receives parameters, we can access them via the special variables $1, $2, $3, etc. There is also $@ for accessing all parameters; recall that $@ must be quoted to work properly (the explanation is beyond the scope of this course).

The special variable $# contains the number of arguments on the command-line and $0 refers to the actual script name.
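
A minimal sketch showing these variables in action (save it, e.g., as args.sh and run it with a few arguments):

#!/bin/bash

echo "Script $0 was called with $# argument(s)."
for arg in "$@"; do
    echo " - $arg"
done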

getopt

When our script needs one argument, accessing $1 directly is fine. When you want to recognize options, parsing of the arguments becomes more complicated. There is a getopt command that is able to handle command-line parsing for you.

We will not describe all the details of this command. Instead, we show an example that you can modify to your own needs.

Parsing command-line options is unfortunately not very standardized across different flavours of Unix. The approach shown here works well on any recent Linux and provides a good user experience, but it is not portable to other similar systems. There is also getopts (yes, the difference is only the extra s at the end) that is much more portable but much more limited in its features.

The main arguments controlling getopt behavior are -o and -l, that contain description of the switches for our program.

Let us assume that we would want to handle options --verbose to make our script a bit more descriptive and --output to specify an alternate output file. We would also like to handle short versions of these options: -o and -v. With --version, we want to print the version of our script. And we should not forget about --help too. Non-option arguments will be interpreted as names of input files.

The specification of the getopt switches is simple:

getopt -o "vho:" -l "verbose,version,help,output:"

Single-letter switches are specified after -o, long options after -l, and a colon : after the option denotes that it expects an argument.

After that, we add -- followed by the actual parameters. Let us try:

getopt -o "vho:" -l "verbose,version,help,output:" -- --help input1.txt --output=file.txt
# prints: --help --output 'file.txt' -- 'input1.txt'
getopt -o "vho:" -l "verbose,version,help,output:" -- --help --verbose -o out.txt input2.txt
# prints: --help --verbose -o 'out.txt' -- 'input2.txt'
...

As you can see, getopt is able to parse the input and convert the parameters to a unified form, moving the non-option arguments to the end of the list.

The following “magical” line (you do not need to understand it to use it) resets $1, $2 etc. to contain the values as parsed by getopt.

eval set -- "$( getopt -o "vho:" -l "verbose,version,help,output:" -- "$@" )"

The actual processing is then quite straightforward:

#!/bin/bash

set -ueo pipefail

opts_short="vho:"
opts_long="verbose,version,help,output:"

# Check for bad usage first (notice the ||)
getopt -Q -o "$opts_short" -l "$opts_long" -- "$@" || exit 1

# Actually parse them (we are here only if they are correct)
eval set -- "$( getopt -o "$opts_short" -l "$opts_long" -- "$@" )"

be_quiet=true
output_file=/dev/stdout

while [ $# -gt 0 ]; do
    case "$1" in
        -h|--help)
            echo "Usage: $0 ..."
            exit 0
            ;;
        -o|--output)
            output_file="$2"
            shift
            ;;
        -v|--verbose)
            be_quiet=false
            ;;
        --version)
            echo "$0 version 1.0.0"
            exit 0
            ;;
        --)
            shift
            break
            ;;
        *)
            echo "Unknown option $1" >&2
            exit 1
            ;;
    esac
    shift
done

$be_quiet || echo "Starting the script"

for inp in "$@"; do
    $be_quiet || echo "Processing $inp into $output_file ..."
done

Several parts of the script deserve explanation.

true and false are not boolean values, but they can be used as such. Recall how we have used them in lab 06 (there really are /bin/true and /bin/false).

exit immediately terminates a shell script. The optional parameter denotes the exit code of the script.

shift is a special command that shifts the positional parameters $1, $2, … by one. After shift, $3 becomes $2, $2 becomes $1 and the original $1 is lost. "$@" is modified accordingly. Thus, the whole loop processes all options until encountering -- that separates options from the other arguments. The user is not required to provide the -- themselves: getopt inserts it when unifying the parameters (check the output above). The for loop then iterates over the remaining arguments.
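
A quick illustration of shift (set -- replaces the positional parameters, so you can try this directly in your shell):

set -- one two three
echo "$1"    # prints: one
shift
echo "$1"    # prints: two
echo "$#"    # prints: 2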

Advancing the running example

We will modify our script to accept -w or --watch so that the program keeps watching the src/*.md for modifications and regenerates the web on each such change.

We forgot to include the required inotifywait program in our Linux image. Execute the following command (it will ask for your password first) to install this program into your Fedora.

sudo dnf install -y inotify-tools

We first include the change to use getopt and then we add the support for --watch. Notice also the LOGGER=: trick below: by default, log messages go to : (a built-in command that does nothing), and --verbose switches $LOGGER to our msg function.

#!/bin/bash

set -ueo pipefail

msg() {
    echo "$( date '+%Y-%m-%d %H:%M:%S | SSG |' )" "$@" >&2
}

get_version() {
    git rev-parse --short HEAD 2>/dev/null || echo unknown
}

build_page() {
    local input_file="$1"
    local output_file="public/$( basename "$input_file" ".md" ).html"
    $LOGGER "Generating $input_file => $output_file"
    pandoc \
        --template templates/main.html \
        --metadata site_title="$site_title" \
        --metadata page_version="$( get_version )" \
        "$input_file" >"$output_file"
}

generate_web() {
    for page in src/*.md; do
        if ! [ -f "$page" ]; then
            continue
        fi
        build_page "$page"
    done

    cp -R static/* public/
}

opts_short="vwh"
opts_long="verbose,version,help,watch"

getopt -Q -o "$opts_short" -l "$opts_long" -- "$@" || exit 1
eval set -- "$( getopt -o "$opts_short" -l "$opts_long" -- "$@" )"

LOGGER=:
watch_for_changes=false

while [ $# -gt 0 ]; do
    case "$1" in
        -h|--help)
            echo "Usage: $0 ..."
            exit 0
            ;;
        -v|--verbose)
            LOGGER=msg
            ;;
        -w|--watch)
            watch_for_changes=true
            ;;
        --)
            ;;
        *)
            echo "Unknown option $1" >&2
            exit 1
            ;;
    esac
    shift
done


site_title="$( whoami )'s site"

mkdir -p public

if [ -f ssg.rc ]; then
    source ssg.rc
fi

generate_web

To actually support --watch we will use inotifywait, which is a special program that receives a list of files and terminates when one of the files is modified. Therefore, the script will effectively do nothing until a file is modified, as inotifywait will “block” its execution.

We will add the following to our script to run indefinitely, watching for changes and rebuilding the web automatically. Hit Ctrl-C to actually terminate the execution when started with --watch.

...

if [ -f ssg.rc ]; then
    source ssg.rc
fi

generate_web

if $watch_for_changes; then
    while true; do
        $LOGGER "Waiting for file change..."
        inotifywait -e modify src/* src static static/*
        generate_web
    done
fi

The read command

So far our scripts either did not need standard input at all or they passed it completely to other programs.

But it is possible to also read standard input line by line in shell if you need to process lines separately.

When a shell script needs to read from stdin into a variable, there is the read built-in command:

read FIRST_LINE <input.txt
echo "$FIRST_LINE"

Typically, read is used in a while loop to iterate over the whole input. read is also able to split the line into fields on whitespace and assign each field to a different variable.

Consider we have input in the following format:

/dev/sdb 1008
/dev/sdb 1676
/dev/sdc 1505
/dev/sdc 4115
/dev/sdd 999

The following loop computes the average of the numbers:

count=0
total=0
while read device duration; do
    count=$(( count + 1 ))
    total=$(( total + duration ))
done
echo "Average is about $(( total / count ))."

As you can guess from the above snippet, read returns 0 as long as it is able to read into the variables. Reaching the end of the file is announced by a non-zero exit code.

read can be sometimes too smart about certain inputs. For example, it interprets backslashes. You can use read -r to suppress this behavior.

Other notable parameters are -t or -p: use help read to see their description.
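
For illustration, in interactive use the following prompts the user and gives up after ten seconds (on timeout, read fails and the else branch runs):

if read -r -t 10 -p 'Your name: ' name; then
    echo "Hello, $name"
else
    echo 'No input, moving on...' >&2
fi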

If we want to read from a specific file (assuming its filename is stored in variable $input), we can also redirect input to the whole loop and write the script like this:

while read device duration; do
    count=$(( count + 1 ))
    total=$(( total + duration ))
done <"$input"

That is actually quite a common use of the while read pattern.

Check that you understand how read works

Assume we have the following text file data.txt.

ONE
TWO

We also have the following script reader.sh:

#!/bin/bash

set -ueo pipefail

read -r data_one <data.txt
read -r data_two <data.txt
read -r stdin_one
read -r stdin_two

echo "data_one=${data_one}"
echo "data_two=${data_two}"
echo "stdin_one=${stdin_one}"
echo "stdin_two=${stdin_two}"

Select all true statements about the output of the following invocation:

./reader.sh <data.txt

Bigger exercise I

Imagine we have input data with match results in the following format (team, goals shot, colon, goals shot by the other team, other team).

alpha 2 : 0 bravo
bravo 0 : 1 charlie
alpha 5 : 4 charlie

Write a shell script that prints a table with summarized results.

Assign 3 points for a victory, 1 point for a tie. Your program does not need to handle the situation when two teams have the same number of points.

Solution

We start with a function that receives two arguments – goals shot by each side – and prints the number of points assigned.

get_points() {
    local goals_mine="$1"
    local goals_opponent="$2"
    if [ "$goals_mine" -eq "$goals_opponent" ]; then
        echo 1
    elif [ "$goals_mine" -gt "$goals_opponent" ]; then
        echo 3
    else
        echo 0
    fi
}

Another function then computes the points for each match.

preprocess_scores() {
    local team_one team_two
    local goals_one goals_two

    while read -r team_one goals_one colon goals_two team_two; do
        if [ "$colon" != ":" ]; then
            echo "WARNING: ignoring invalid line $team_one $goals_one $colon $goals_two $team_two" >&2
            continue
        fi
        echo "$team_one" "$( get_points "$goals_one" "$goals_two" )"
        echo "$team_two" "$( get_points "$goals_two" "$goals_one" )"
    done
}

These two functions together transform the input into the following:

alpha 3
bravo 0
bravo 0
charlie 3
alpha 3
charlie 0

On this, we can call our well-known group_sum.py script or write it in shell ourselves. For the shell implementation, we will expect that the data are already sorted by key, to keep the implementation simple.

sum_by_sorted_keys() {
    local key value
    local prev_key=""
    local sum=0

    while read -r key value; do
        if [ "$key" != "$prev_key" ]; then
            if [ -n "$prev_key" ]; then
                echo "$prev_key $sum"
            fi
            prev_key="$key"
            sum=0
        fi
        sum=$(( sum + value ))
    done
    if [ -n "$prev_key" ]; then
        echo "$prev_key $sum"
    fi
}

Why do we need to expect data sorted? Can’t we just sort them ourselves? Would the following modification (only this one line changed) work?

    # replacing "while read -r key value; do"
    sort | while read -r key value; do
Answer.

What change inside this function would work then? Answer.

Together these functions provide the building blocks to solve the whole puzzle:

preprocess_scores | sum_by_keys | sort -n -k 2 -r | column -t

It is a matter of opinion if this task would be better solved in a different programming language. It all depends on the context and on other requirements.

Shell usually excels in situations where we need to combine data from multiple files that are in some textual (preferably line-oriented) format.

The advantage of shell is in its interactivity. Even the functions can be defined interactively (i.e. not stored in any file first) and one can easily build the final pipeline incrementally, checking the output after adding each step.

Sidenote: how web pages are published

We will now perform a small detour to the area of (the history of) website publishing. Publishing a website today generally means renting webspace where you can either upload your HTML (or PHP) files, or even renting a configured instance of your web application, such as WordPress.

Traditionally, you often also received webspace as part of your Unix account on some shared machine. The setup was usually done in such a way that whatever appeared in your $HOME/public_html was available under example.com/~LOGIN.

You might have encountered such pages, typically for university pages of individual professors.

With the advance of virtualization (and the cloud) it became easier to not give users access as real users, but to insert another layer where the user can manipulate only certain files without having shell access at all.

Web pages on lab machines

Our lab machines (e.g. u-pl* ones) also offer this basic functionality.

SSH into one of these (recall the list from 05) and create a directory ~/WWW.

Create a simple HTML file in WWW (skip if you already uploaded some files before).

echo '<html><head><title>Hello, World!</title><body><h1>Hello, World!</h1></body></html>' >index.html

Its content will be available as http://www.ms.mff.cuni.cz/~LOGIN/.

Note that you will need to add the proper permissions for the AFS filesystem using the fs setacl command.

fs setacl ~/WWW www read
fs setacl ~/. www l

SCP & rsync

In order to copy files between two Linux machines, we can use scp. Internally, it establishes an SSH connection and copies the files over it.

The syntax is very simple and follows the semantics of a plain cp:

scp local_source_file.txt user@remote_machine:remote_destination_file.txt
scp user@remote_machine:remote_source_file.txt local_destination_file.txt

SCP issues

For those who care about security, we should note that the SCP protocol has some security vulnerabilities. These can be used to attack your local computer when connecting to a malicious server.

SCP is actually a very old protocol, which is showing its age. Better replacements include SFTP (beware that this is different from FTPS – FTP over SSL/TLS) and Rsync.

More information on this topic can be found on LWN.net.

Rsync

A much more powerful tool for copying files is rsync. Similarly to scp, it runs over an SSH connection, but it has to be installed on both sides (usually that is not a problem).

It can copy whole directory trees, handle symlinks, access rights, and other file attributes. It can also detect that some of the files are already present at the other side (either exactly or approximately) and transfer just the differences.

The syntax of a simple copy follows cp and scp, too:

rsync local_source_file.txt user@remote_machine:remote_destination_file.txt
rsync local_source_file.txt user@remote_machine:remote_destination_directory/
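
rsync can also copy whole directory trees. Note the trailing-slash semantics: public/ means “the contents of public”, while public (without the slash) would copy the directory itself (the destination path is illustrative):

rsync -av public/ user@remote_machine:WWW/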

Advancing the running example

Use the rsync manual page and extend the running example to include --upload that will upload the generated files to a remote server specified in $rsync_target.

You can use LOGIN@u-plNNNN.ms.mff.cuni.cz:WWW/nswi177 as a reasonable target for uploading to the Rotunda machines.

Solution.

Bigger exercise II

We have created the script to compute the scoring table. It would be nice to generate it during web generation.

Extend our running example with the following functionality: each *.bin file in src/ would be treated as a script that will be executed, with its output stored to an HTML file of the same name.

Try this on your own first before looking at our solution.

Recall that the file extension is not important and .bin is generic enough to hide any (interpreted) programming language (as long as the script has a proper shebang). As a matter of fact, it will work for compiled (C, Rust and similar) programs too.

Feel free to return to this task later on: it is okay to skip it and read the next section first :-).

Solution

The change is relatively simple. We have also renamed build_page to build_markdown_page for better clarity.

build_dynamic_page() {
    local input_file="$1"
    local output_file="public/$( basename "$input_file" ".bin" ).html"
    $LOGGER "Generating $input_file => $output_file"
    "$input_file" >"$output_file"
}

generate_web() {
    local page
    for page in src/*.md; do
        if ! [ -f "$page" ]; then
            continue
        fi
        build_markdown_page "$page"
    done

    local script
    for script in src/*.bin; do
        if ! [ -f "$script" -a -x "$script" ]; then
            continue
        fi
        build_dynamic_page "$script"
    done

    cp -R static/* public/
}

And we can extend our table generation script into the following.

...

as_markdown_table() {
    echo
    echo '| Team | Points |'
    echo '| ---- | -----: |'
    while read -r team score; do
        echo '|' "$team" '|' "$score" '|'
    done
    echo
}

. ssg.rc

(
    echo '---'
    echo 'title: Scoring table'
    echo '---'

    echo '# Scoring table'

    preprocess_scores <scores.txt | sum_by_keys | sort -n -k 2 -r | as_markdown_table
)  | pandoc \
        --template templates/main.html \
        --metadata site_title="$site_title" \
        --metadata page_version="$( git rev-parse --short HEAD 2>/dev/null || echo unknown )"

Improving the script further

Does it look good?

Hardly. There is the repeated fragment of calling Pandoc. Our site generator is not perfect.

Let us improve it.

As a second version, extend it so that it distinguishes *.md.bin and *.html.bin scripts: those with the .html.bin extension are expected to generate HTML directly, while .md.bin scripts generate Markdown that we will process ourselves.

Solution

...

pandoc_as_filter() {
    pandoc \
        --template templates/main.html \
        --metadata site_title="$site_title" \
        --metadata page_version="$( get_version )" \
        "$@"
}

build_markdown_page() {
    local input_file="$1"
    local output_file="public/$( basename "$input_file" ".md" ).html"
    $LOGGER "Generating $input_file => $output_file"
    pandoc_as_filter "$input_file" >"$output_file"
}

build_dynamic_html_page() {
    local input_file="$1"
    local output_file="public/$( basename "$input_file" ".html.bin" ).html"
    $LOGGER "Generating $input_file => $output_file"
    "$input_file" >"$output_file"
}

build_dynamic_markdown_page() {
    local input_file="$1"
    local output_file="public/$( basename "$input_file" ".md.bin" ).html"
    $LOGGER "Generating $input_file => $output_file"
    "$input_file" | pandoc_as_filter >"$output_file"
}

generate_web() {
    local page
    for page in src/*.md; do
        if ! [ -f "$page" ]; then
            continue
        fi
        build_markdown_page "$page"
    done

    local script
    for script in src/*.md.bin; do
        if ! [ -f "$script" -a -x "$script" ]; then
            continue
        fi
        build_dynamic_markdown_page "$script"
    done
    for script in src/*.html.bin; do
        if ! [ -f "$script" -a -x "$script" ]; then
            continue
        fi
        build_dynamic_html_page "$script"
    done

    cp -R static/* public/
}

...

And our table generation script table.md.bin can be significantly simplified.

...

echo '---'
echo 'title: Scoring table'
echo '---'

echo '# Scoring table'

preprocess_scores <scores.txt | sum_by_keys | sort -n -k 2 -r | as_markdown_table

Last improvement

As a last exercise, extend our script so that the *.bin scripts can use reusable helper scripts: before we run them, we should extend $PATH with the bin/ directory inside our SSG directory.

Why do we want to do that? At this moment, the scoring table path is hard-coded inside the script, so the script is not usable for multiple tables (imagine there are two groups running). If we have the script in $PATH, we can store the scores as a script with the following shebang and thus reuse the script for multiple tables.

#!/usr/bin/env score_table.sh
alpha 2 : 0 bravo
bravo 0 : 1 charlie
alpha 5 : 4 charlie

Certainly, this is bordering on the abuse of shebangs, as we are turning a data file into a script, but there might be use cases other than our primitive SSG where such an extension makes sense.

Here, take it as an exercise to refresh your memory about env, shebangs and $PATH.

Solution

The changes are actually trivial.

build_dynamic_html_page() {
    ...
    env PATH="$PATH:$PWD/bin" "$input_file" >"$output_file"
}

build_dynamic_markdown_page() {
    ...
    env PATH="$PATH:$PWD/bin" "$input_file" | pandoc_as_filter >"$output_file"
}

And bin/score_table.sh would be modified on a single line too.

grep -v '#' "$1" | preprocess_scores | sum_by_keys | sort -n -k 2 -r | as_markdown_table

We drop all lines containing #, which certainly drops the shebang, and we do not expect a team name to contain a hash sign (later on, we will see regular expressions that would allow more precise filtering, but this is fine for now).

Source code linting with ShellCheck

You have already written quite a lot of shell scripts. It is thus time to introduce you to ShellCheck.

ShellCheck is a tool that checks your shell scripts for common issues. These issues are neither syntax errors nor logical errors. The issues raised by ShellCheck are patterns that are well known to cause unexpected behavior, degrade performance, or may even be hiding some nasty surprises.

One such example could be if your script contains the following snippet.

cat input.txt | cut -d: -f 3

Do you know what could possibly be wrong?

Technically, this code is correct and by itself does not contain any bug. However, the first cat is redundant as it prints one file only: the code can be reduced to the following form without any change in functionality:

cut -d: -f 3 <input.txt

As you can see, this is essentially harmless.

But it might mean that you wanted to cat multiple files or that cat is a left-over from a previous version. Thus ShellCheck will warn you.

Another issue where ShellCheck helps is the following code:

dir_name=results/
if [ -d $dri_name ]; then
    echo "$dir_name already exists."
fi

Here ShellCheck will actually detect the typo as dri_name was not assigned before.

Another trap awaits in the following code:

if [ -d ]; then
    echo "$dir_name already exists."
fi

Of course, this is completely wrong. But guess what: test (or [) will accept this and evaluate it as true. This looks crazy, but test with exactly one argument actually checks whether that argument is non-empty.

We have test -n these days, but once we did not have it and backward compatibility must be kept. See this page for details.

So this is a correct piece of shell code, but it likely does not do what you wanted it to. Here comes ShellCheck to help you.

ShellCheck is able to warn you about hundreds of possible issues, as can be seen on this page. Get into the habit of running it on your shell scripts regularly.

In our practice, ShellCheck seldom gives false positives, but it has saved us many times.

Some of the graded tasks that you submit will be checked by ShellCheck, too (and we might penalize your solutions if your scripts are not ShellCheck-error-free).

Running ShellCheck

Running ShellCheck is really easy.

shellcheck ssg.sh

If you also want to see style hints, add -o all, or use -i for more selective checks.

shellcheck -o all ssg.sh

Exercise

Go back to your submitted shell scripts and run ShellCheck on them. Fix all the errors found, or justify why leaving them in is alright.

Other languages

Similar tools exist for other languages.

Pylint is one such tool for Python: it can detect plenty of issues and is also highly customizable.

As an exercise, find such tooling for your own language and start using it regularly. Many tools also contain IDE extensions for better user experience.

The important takeaway

Start using ShellCheck, Pylint or any other tool for your favorite language.

It will not detect logical errors (at least not all of them), but it will surely detect so-called code smells: places in your code that often lead to errors, undefined behaviour, or similar issues.

This is doubly important if you are new to a language: the chances are that you misunderstood some feature rather than that the tool is wrong.

Tasks to check your understanding

We expect you will solve the following tasks before attending the labs so that we can discuss your solutions during the lab.

Automated tests are already available.

The program convert from ImageMagick can convert images between formats using convert source.png target.jpg (it understands almost any file extension).

Convert all PNG images (with extension .png) in the current directory to JPEG (extension .jpg).

Solution.

Extend the previous example to not overwrite existing files.

Solution.

Last example with ImageMagick. With -resize 800x600 it can resize the image to fit in the given envelope.

Create a tool that creates thumbnails from files provided on the command-line, transforming file name from dir/filename.ext to dir/filename.thumb.ext.

Solution.

Create a factorial function that computes the factorial of a given number.

Solution.

Extend our scoring table script from the running example to correctly handle situations when the number of points is the same and we need to distinguish teams by the number of goals shot (which is quite a common rule in many tournaments).

Therefore, from the following data we would like to compute a slightly different table.

alpha 2 : 0 bravo
bravo 7 : 1 charlie
alpha 1 : 4 charlie

| Team | Points | Goals |
| ---- | -----: | ----: |
| bravo | 3 | 7 |
| charlie | 3 | 5 |
| alpha | 3 | 3 |

Solution.

Write a shell script for drawing a labeled barplot. The user would provide data in the following format:

12  First label
120 Second label
1 Third label

The script will print a graph like this:

First label (12)   | #
Second label (120) | #######
Third label (1)    |

The script will accept the input filename as the first argument and will adjust the width of the output to the current screen width. It will also align the labels as can be seen in the plot above.

You can safely assume that the input file will always exist and that it will be possible to read it multiple times. No other arguments need to be recognized.

Hints

Screen width is stored in the variable $COLUMNS. Default to 80 if the variable is not set. (You can assume it will either be empty (not set) or contain a valid number.)

The plot should be scaled to fill the whole width of the screen (i.e. scaled up or down).

You can squeeze all consecutive spaces into one (even for labels); the first and second column are separated by space(s).

See what wc -L does.

Note that the first tests use labels of the same length to simplify writing the first versions of the script.

Consider using printf for printing the aligned labels.

The following ensures that bc computes with fractional numbers but the result is displayed as an integer (which is useful for further shell computations).

echo 'scale=0; (5 * 2.45) / 1' | bc -l

Examples

2 Alpha
4 Bravo
# COLUMNS=20
Alpha (2) | ####
Bravo (4) | ########

2 Alpha
4 Bravo
16 Delta
# COLUMNS=37
Alpha (2)  | ###
Bravo (4)  | ######
Delta (16) | ########################

This example can be checked via GitLab automated tests. Store your solution as 08/barplot.sh and commit it (push it) to GitLab.

Create a script for listing file sizes.

The script partially mimics the behaviour of ls: without arguments it lists information about files in the current directory; when some arguments are provided, they are treated as a list of files to print details about.

Example run can look like this:

./08/dir.sh /dev/random 08/dir.sh 08
/dev/random  <special>
08/dir.sh          312
08               <dir>

The second column will display file size for normal files, <dir> for directories and <special> for any other file. File size can be read through the stat(1) utility.

Nonexistent files should be announced as FILENAME: no such file or directory. to stderr.

You can safely assume that you will have access to all files provided on the command-line.

You will probably find the column utility useful, especially the following invocation:

column --table --table-noheadings --table-columns FILENAME,SIZE --table-right SIZE

You can assume that the filenames will be reasonable (e.g., without spaces). To simplify things, we will not check that the exit code differs when some of the files were not found.

This example can be checked via GitLab automated tests. Store your solution as 08/dir.sh and commit it (push it) to GitLab.

ping is a tool that sends ICMP packets and is often used as a basic test that a remote machine is up. The truth is that a machine may decide to filter ICMP requests and not respond to them at all (hence behaving as if it were down) and, vice versa, a machine responding to ping might have all other services down. But it is still a useful tool for checking and debugging basic connectivity issues.

Try running ping d3s.mff.cuni.cz to see its output. The tool sends the packets forever; terminate it with Ctrl-C.

Your task is to create a tool that accepts the following arguments and prints host status based on ping (of course, you need to use ping in a mode where it sends a single request only and times out quickly).

  • -d or --delimiter that accepts a string used to delimit the output columns; defaults to a space
  • -v or --verbose that prints the output of ping to standard error (by default the output of ping is not printed at all)
  • -w to specify a different timeout than the default of one second

Non-option parameters are DNS names or IP addresses to contact via ping and print the status for.

The tool's exit code denotes the number of DOWN machines (you can safely assume that there will never be more than 126 parameters, so you do not have to handle whether the exit code is a signed or unsigned byte, etc.).

We expect you will use getopt to handle the command-line options.

The following examples show invocations with different parameters and the expected output.

Default execution

08/ping.sh seznam.cz google.com google.comx
seznam.cz UP
google.com UP
google.comx DOWN

Use of -d and --verbose

08/ping.sh seznam.cz -d : google.com --verbose

Note that the output mixes stdout and stderr.

PING seznam.cz (77.75.77.222) 56(84) bytes of data.
64 bytes from www.seznam.cz (77.75.77.222): icmp_seq=1 ttl=56 time=4.46 ms

--- seznam.cz ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 4.460/4.460/4.460/0.000 ms
seznam.cz:UP
PING google.com (142.251.36.78) 56(84) bytes of data.
64 bytes from prg03s10-in-f14.1e100.net (142.251.36.78): icmp_seq=1 ttl=114 time=3.64 ms

--- google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.642/3.642/3.642/0.000 ms
google.com:UP

This example can be checked via GitLab automated tests. Store your solution as 08/ping.sh and commit it (push it) to GitLab.

Learning outcomes

Learning outcomes provide a condensed view of fundamental concepts and skills that you should be able to explain and/or use after each lesson. They also represent the bare minimum required for understanding subsequent labs (and other courses as well).

Conceptual knowledge

Conceptual knowledge is about understanding the meaning of given terms and putting them into context. Therefore, you should be able to …

  • explain what a linter and a style checker are

  • explain what kind of issues can be detected by style checkers

  • explain concurrency issues that can occur when using temporary files

  • explain how program exit code is used to drive control flow in shell scripts

  • explain what commands are executed and how a shell construct such as if true; then echo "true"; fi is evaluated

  • explain what considerations are important when deciding between using shell and Python

Practical skills

Practical skills are usually about usage of given programs to solve various tasks. Therefore, you should be able to …

  • use temporary files securely in shell scripts

  • use control flow in shell scripts (for, while, if, case)

  • use read command

  • use getopt for parsing command line arguments

  • use . and source to load functions from different files

  • use and interpret results of ShellCheck

  • use scp to copy individual files to (or from) a remote machine

  • optional: use rsync to synchronize whole directories

This page changelog

  • 2024-04-04: Published automated tests.