

In this lab we will complete our tour of shell scripting: you will learn about conditions, loops, and variables in the shell. And much more.

It is important to note that all of the topics below are usable both in scripts (i.e., non-interactively) and when using the terminal interactively. For loops over lists of files, in particular, are often used directly on the command line without enclosing them in a full-fledged script.

ShellCheck

Before diving into shell programming, we need to introduce you to ShellCheck. ShellCheck is a tool that checks your shell scripts for common issues. These issues are neither syntax errors nor logical errors. The issues raised by ShellCheck are patterns that are well known to cause unexpected behavior, degrade performance, or may even be hiding some nasty surprises.

One such example could be if your script contains the following snippet.

cat input.txt | cut -d: -f 3

Do you know what could possibly be wrong? Technically, this code is correct and by itself does not contain any bug. However, the cat is redundant as it prints only one file, and the code can be reduced to the following form without any change of functionality.

cut -d: -f 3 <input.txt

As you can see, this is essentially harmless. But it might mean that you wanted to cat multiple files or that cat is a left-over from a previous version. So ShellCheck warns you.

ShellCheck is able to warn you about hundreds of possible issues, as can be seen on this page. Get into the habit of running it on your shell scripts regularly.

Some of the graded tasks that you submit will be checked by ShellCheck too (and we might penalize your solutions if your scripts are not ShellCheck-error-free).

Other linters and checkers

Note that ShellCheck is not the only tool available. Virtually every programming language has similar tools or tools that check your style.

For example, Python has Pylint that you should definitely try and run regularly too. It can detect real bugs as well as make your code more Pythonic – a feature that is immensely important for teamwork.

Shell variables

Variables in the shell are often called environment variables as they are (unlike variables in most other languages) visible in other programs too. We have already set the variable EDITOR that is used by Git to determine which editor to launch.

Variables are assigned by the following construct:

MY_VARIABLE="value"

Note that there can be no spaces around = as otherwise the shell would consider it a call of the program MY_VARIABLE with arguments = and value.

The value is usually enclosed in quotes; you can omit them if the value contains no spaces or other special characters. Generally, it is safer to always quote the value.

The value of a variable is retrieved by prefixing its name with the dollar sign $. To print the value, we can execute the following:

echo "Variable MY_VARIABLE is $MY_VARIABLE."
# Prints Variable MY_VARIABLE is value.

Note that environment variables (i.e. those that will be read by other applications) are usually named in upper case; for purely shell variables you may prefer lower-case names. In both cases, the usual convention is snake_case.

Unlike in other languages, variables in shell are always strings.
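As a tiny illustration: assignments always store strings, and no arithmetic happens unless you use the special construct shown later.

```shell
a=3
b=4
echo "$a + $b"    # prints: 3 + 4 (the + is just text)
echo "$a$b"       # prints: 34 (plain string concatenation)
```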

Uninitialized values and similar caveats

If you try to use a variable that was not initialized, the shell will pretend it was initialized to an empty value. That can often be very useful, but it can also be a source of nasty surprises.

As we mentioned earlier, you should always start your shell scripts with set -u to warn you about such situations.

However, sometimes you may want to read from an uninitialized variable to check whether a value was set. For example, you want to read from $EDITOR (to check if the user has some preference) and, if it is not set, use some default. This is easily done in the shell with the ${VAR:-default_value} notation. If VAR was set, its value is used; otherwise default_value is used.

Thus, you can encounter the following idiom to set a variable's default value. Note that the :- operator is fine even with set -u as we expect that the variable might not have been set yet.

EDITOR="${EDITOR:-mcedit}"

And later in the script, we may call the editor:

"$EDITOR" file-to-edit.txt

Note that the variable may contain the program to launch, and even here we could have used the :- operator:

"${EDITOR:-mcedit}" file-to-edit.txt

Note that it is possible to use the syntax ${EDITOR} to explicitly delimit the variable name. It is useful if you want to print a variable immediately followed by other text.

file_prefix=nswi177-
echo "Will store into ${file_prefix}log.txt"   # prints: Will store into nswi177-log.txt
echo "Will store into $file_prefixlog.txt"     # shell looks up a variable named file_prefixlog (!)

Reading environment variables in Python (and export)

If you want to read a shell variable in Python, you can use os.getenv(). Note that this function has an optional argument (apart from the variable name) for a default value. Always use it or explicitly check for None – there is no guarantee that the variable was set.

Note that you can also use os.environ.
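As a small sketch (assuming python3 is installed), the following shows both reading an exported variable and falling back to the default. The names GREETING and NO_SUCH_VAR are made up for this example.

```shell
export GREETING="hello from shell"
python3 -c 'import os; print(os.getenv("GREETING", "unset"))'      # prints: hello from shell
python3 -c 'import os; print(os.getenv("NO_SUCH_VAR", "unset"))'   # prints: unset
```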

By default, the shell does not make all variables available to Python (or any other application, for that matter). Only so-called exported variables are visible outside the shell. To make your variable visible, simply run one of the following (the first call assumes VAR was already set).

export VAR
export VAR="value"

It is also possible to export an environment variable for a single command only, using the following shortcut.

VAR=value command_to_run ...

This does not change the variable in the script itself.
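A minimal sketch demonstrating that the per-command value is visible only to the launched command:

```shell
VAR="outer"
VAR="inner" sh -c 'echo "child sees: $VAR"'   # prints: child sees: inner
echo "the shell still has: $VAR"              # prints: the shell still has: outer
```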

Variable expansion (and wild-cards too)

It is essential to understand how variables are expanded. Instead of describing the formal process (which is quite complicated), we will show several examples demonstrating typical situations. Note that viewing variable expansion as simple string replacement is often the best way to understand the behavior.

We will call args.py to demonstrate what happens – replace it with your path to the Python utility we used in Lab #3.

Parameters are prepared (split) after variable expansion.

VAR="value with spaces"
args.py "$VAR"
args.py $VAR

Prepare files named one.sh and 'with space.sh' for the following example.

VAR="*.sh"
args.py "$VAR"
args.py $VAR
args.py "\$VAR"
args.py '$VAR'

Run the above again but remove one.sh after assigning VAR.

Tilde expansion (your home directory) is a bit more tricky.

VAR=~
echo "$VAR" '$VAR' $VAR
VAR="~"
echo "$VAR" '$VAR' $VAR

Arithmetic in the shell

The shell is capable of basic arithmetic operations directly. However, if you need to do more advanced calculations, it is better to use a different language. The shell is fine for computing sums or keeping track of the number of processed files, not for computing standard deviations or square roots.

Simple calculations are done inside a special $(( )) environment.

COUNTER=1
COUNTER=$(( COUNTER + 1 ))

Note that variables do not need to be prefixed with $ and it is generally better to use this “bare” format (see SC2004 for an example).
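For illustration, a few more operations that work inside $(( )) – keep in mind it is integer-only arithmetic:

```shell
a=17
b=5
echo $(( a + b ))   # prints: 22
echo $(( a / b ))   # prints: 3 (integer division truncates)
echo $(( a % b ))   # prints: 2 (remainder)
echo $(( a * b ))   # prints: 85
```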

What happens if you pass an invalid number (e.g. xyz) to the arithmetic block?

Special variables and declare and env

If you want to see a list of exported variables, you can use env, which prints their names as well as their values.

For a list of all variables, you can execute declare (or declare -p).

You might be wondering how a command such as declare can see variables that are not exported when they should not be available to other applications. The truth is that some commands (such as declare or cd) are so-called built-ins – they are handled by the shell itself, and the shell does not launch an external (i.e., standalone) program.

Why is cd a built-in? Answer.

Note that some built-ins do not have their own man page but are instead described in man bash – in the manual page of the shell we are using.

There are several variables worth knowing that are usually present in any shell on any Linux installation.

$HOME refers to your home directory. Useful if you do not want to use the tilde ~.

$PWD contains your current working directory.

$USER contains the name of the current user (e.g. intro).

$RANDOM contains a random number, different in each expansion (try echo $RANDOM $RANDOM $RANDOM).

Command substitution (a.k.a. capturing stdout into a variable)

Often, we need to store the output of a command in a variable. This also includes storing the content of a file (or a part of it) in a variable.

A prominent example is the use of the mktemp(1) command. You have already seen several scripts where we created temporary files (often inside /tmp) and we have also warned you against the dangers of overwriting someone else's content. mktemp is able to create a file (or a whole directory) with a guaranteed unique name. This name is printed to stdout. Obviously, to use it in later commands, we need to store this value in a variable.

Shell offers the following syntax for the so-called command substitution.

MY_TEMP="$( mktemp -d )"

The command mktemp -d is run and its output is stored in the variable $MY_TEMP.

Where is stderr stored? Answer.

How would you capture stderr then? Answer.

For example, the following snippet downloads a tarball and unpacks it in a temporary directory.

MY_TEMP="$( mktemp -d )"
wget "https://d3s.mff.cuni.cz/f/teaching/nswi177/tests.tar.gz" -O "$MY_TEMP/tests.tar.gz"
cd "$MY_TEMP" && tar xzf tests.tar.gz

Command substitution is often used in logging or when transforming filenames (use manpages to learn what date, basename, and dirname do).

echo "I am running on $( uname -m ) architecture."

INPUT_FILENAME="/some/path/to/a/file.sh"
BACKUP="$( dirname "$INPUT_FILENAME" )/$( basename "$INPUT_FILENAME" ).bak"
OTHER_BACKUP="$( dirname "$INPUT_FILENAME" )/$( basename "$INPUT_FILENAME" .sh ).bak.sh"
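For illustration, this is what basename and dirname print for the path used above (the extra .sh argument tells basename to strip the suffix):

```shell
basename /some/path/to/a/file.sh        # prints: file.sh
basename /some/path/to/a/file.sh .sh    # prints: file
dirname /some/path/to/a/file.sh         # prints: /some/path/to/a
```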

Control flow in shell scripts

Before diving into the control flow structures of the shell, we need to mention that multiple commands can be separated by ;. While in shell scripts it is preferable that each command is on its own line, for interactive use it is often easier to have multiple commands on one line (even if only to allow faster history browsing with the up arrow).

We will see the semicolon at various special places in the control flow structures, where it denotes the end of the control part of a condition or a loop.

for loops

for loops in the shell always iterate over a list of values. These values have to be provided when the loop starts. Typically, such a loop iterates over a list of files. Note that we can use shell wildcards here as well.

The general format is the following.

for VARIABLE in VAL1 VAL2 VAL3; do
    body of the loop
done

To count the number of digits in all *.txt files, the following loop could be used.

for i in *.txt; do
    echo -n "$i: "
    tr -c -d '0-9' <"$i" | wc -c
done

Notice that the variable is named without $ in the loop header. Why? Answer.

Note that we can use variables to redirect stdin (or stdout).

If you were writing this in the interactive shell, the prompt would change to a plain > (probably, depending on your configuration) and you would be able to enter the rest of the loop. Squeezing it into one line is also possible (but useful only for fire-and-forget types of scripts).

for i in *.txt; do echo -n "$i: "; tr -c -d '0-9' <"$i" | wc -c; done

What happens when no *.txt file exists and why? Answer.

if and else

The if condition in the shell is a bit more tricky. The essential thing to remember is that the condition is always a command to be executed and its outcome (i.e., the exit code) determines the result. So the condition is actually never in the traditional format of a equals b; it is always the exit code that controls the flow.

The general syntax of the condition is the following.

if command_to_control_the_condition; then
    success
elif another_command_for_else_if_branch; then
    another_success
else
    the_else_branch_command
fi

Notice that if has to be terminated by fi and that the elif and else branches are optional.

Our example from last lab can be thus rewritten as follows.

if test -d .git; then
    echo "We are in the root of a Git project"
fi

Because using test in conditions is quite common, test is also named [ (left bracket). With this invocation, the last parameter has to be ] to make the code look more like a classic condition. But mind you, it is still a program to be executed, and the result is determined by its exit code.

By the way, look into /usr/bin to see that the application file is really named [. And how do you think the source code ensures that the same code can be used for both the test and [ filenames? Hint. Answer.

Therefore, you would usually encounter the following snippet instead of the one above.

if [ -d .git ]; then
    ...
fi

Note that there are more options that can be used to test expressions (e.g. comparing two numbers); try man [ yourself.
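As a sketch, numeric comparison uses the -gt, -eq, and similar operators (see the manual for the full list):

```shell
x=7
if [ "$x" -gt 5 ]; then
    echo "x is greater than 5"
fi
if [ "$x" -eq 7 ]; then
    echo "x equals 7"
fi
```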

while loops

The general skeleton of the while loop in shell is the following code.

while command_to_control_the_loop; do
    commands_to_be_executed
done

The following example finds the first available name for a log file. Note that this code is not immune to races when executed concurrently. That is, it assumes it can be run multiple times, but never in more than one process at the same time.

counter=1
while [ -f "/var/log/myprog/main.$counter.log" ]; do
    counter=$(( counter + 1 ))
done
logfile="/var/log/myprog/main.$counter.log"
echo "Will log into $logfile" >&2

To make the program race-resistant (i.e., safe against concurrent execution), we would need to use mkdir, which fails when the directory already exists (i.e., it is atomic enough to distinguish whether we succeeded or are just stealing someone else's file).

Note that the snippet below uses the exclamation mark ! to invert the command outcome.

counter=1
while ! mkdir "/var/log/myprog/log.$counter"; do
    counter=$(( counter + 1 ))
done
logfile="/var/log/myprog/log.$counter/main.log"
echo "Will log into $logfile" >&2

Note that there is also an until loop in the shell. If you need such a loop, please consult the manual for details.

break and continue

As in other languages, the break command is available to terminate the currently executing loop. You can use continue as usual, too.
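A minimal sketch of both:

```shell
for i in 1 2 3 4 5; do
    if [ "$i" -eq 2 ]; then
        continue    # skip the rest of this iteration
    fi
    if [ "$i" -eq 4 ]; then
        break       # terminate the whole loop
    fi
    echo "$i"       # prints only 1 and 3
done
```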

Switch (a.k.a. case ... esac)

When we need to branch our program based on a variable value, the shell offers the case construct. It is somewhat similar to the switch construct in other languages but has a bit of shell specifics mixed in.

The overall syntax is the following.

case value_to_branch_on in
    option1) commands_for_option_one ;;
    option2) commands_for_option_two ;;
    *) the_default_branch ;;
esac

Notice that, as with if, we terminate with the same keyword reversed and that there are two semicolons ;; terminating the commands for a particular option.

The following example abuses the fact that /etc/os-release contains only shell-style variable assignments and uses a simple trick to print one of the values. Then we use case to print a wannabe funny note (note that multiple values can be delimited by | and that simple wildcard-style matching is available too):

os_id="$( echo 'echo $ID' | cat /etc/os-release - | sh )"
case "$os_id" in
    arch) echo 'Good choice :-)' ;;
    debian|ubuntu) echo 'Is Python 3 already the default?' ;;
    fedora*) echo 'Have you ever worn it on your head?' ;;
    gentoo) echo 'Are you finished compiling, yet?' ;;
    *) echo "Wow, is $os_id really a distribution?" ;;
esac

Note that nothing prevents us from merging the first two lines into

case "$( echo 'echo $ID' | cat /etc/os-release - | sh )" in

It is like with any other programming language: think about the maintainability and readability of your scripts. Unreadable source code is rarely a consequence of the language used but of the programmer who wrote that piece of source code (surely, there are exceptions).

Redirection

Note that redirection can be applied to the whole control structure.

For example, the following is a valid construct.

if test -d .git; then
    echo "We are in the root of a Git project"
else
    echo "This is not the root of a Git project"
fi | tr 'a-z' 'A-Z'
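The same works for loops; for example, the output of a whole for loop can be piped at the done keyword:

```shell
for i in 1 2 3; do
    echo "item $i"
done | wc -l
# counts the lines produced by the loop (3)
```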

$PATH

We will now finish talking about $PATH, which we have already mentioned several times but without a thorough explanation.

There are two basic situations when the shell is asked to run a command. The command is specified via a path – either relative or absolute (e.g., ./script.sh or 01/longest.py or /bin/bash) – or without any path (e.g., ls).

In the first case, the shell resolves the path using the working directory (if needed) and executes the program. Assuming the executable bit is set, of course.

If the command is specified without any path, the shell looks into the directories specified in the environment variable $PATH and tries to find the program there. If the program is in none of the directories listed in $PATH, the shell announces a failure.

The directories in $PATH are separated by a colon : and typically $PATH contains at least /usr/local/bin, /usr/bin, and /bin. Find out what your $PATH looks like. Hint.

Note that something like $PATH exists on other operating systems too. However, installed programs are not always placed into the directories listed in it, and thus you typically cannot run them from the command line easily. Extra pro hint for Windows users: if you use Chocolatey, the programs will be in the $PATH and installing new software via choco will make the experience at least a bit less painful :-).

It is possible to add . (i.e., the current directory) to $PATH. Do not do that (even if it is a modus operandi on other systems). This thread explains several reasons why it is a bad idea. A typical scenario involves the fact that programmers often name their temporary scripts just test, which is the name of a shell command. Having . as the first component of $PATH allows an attacker to override standard commands; having it as the last component can lead to confusion as standard commands could be used instead of your script.

However, it is very useful to create a special directory, typically ~/bin, where you collect all the useful scripts that you want to run from anywhere.

And in your ~/.bashrc, extend the $PATH with $PATH:$HOME/bin.
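A typical line for ~/.bashrc could look like this (a sketch; adjust the directory to your liking):

```shell
# ~/bin is searched last, after all the system directories
export PATH="$PATH:$HOME/bin"
```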

For example, the args.py we have used is one candidate for a script to be placed into ~/bin. Naming it show_argv would certainly prevent name collisions, and you would have it handy whenever variable expansion is not working as you would expect.

Shebangs (why we need env)

A shebang needs an absolute path to work. However, one often needs to be more flexible.

Because of this, Python scripts use the /usr/bin/env python3 shebang, where env launches the program specified as its first argument (i.e., python3), looking for it in $PATH.

Note that the script filename is appended as another argument, so everything works as one could expect.

This is something we have not mentioned earlier – a shebang can have up to one argument; the filename (of the interpreted program) is always added after that.

Therefore, the env-style shebang causes the program env to run with parameters python3, path-to-the-script.py, and all other arguments. env then finds python3 in $PATH, launches it, and passes path-to-the-script.py as the first argument.

Note that this is the same env command we have used to print environment variables. Without any arguments, it prints the variables. With arguments, it runs the command.

The read command

When a shell script needs to read from stdin into a variable, there is the read built-in command.

read FIRST_LINE <input.txt
echo "$FIRST_LINE"

Typically, read is used in a while loop to iterate over the whole input. read is also able to split the input on whitespace into multiple variables.

Considering we have input of the following format on stdin, the following loop computes the average of the numbers.

/dev/sdb 1008
/dev/sdb 1676
/dev/sdc 1505
/dev/sdc 4115
/dev/sdd 999

count=0
total=0
while read device duration; do
    count=$(( count + 1 ))
    total=$(( total + duration ))
done
echo "Average is about $(( total / count ))."

As you can guess from the above snippet, read returns 0 as long as it is able to read into the variables; reaching end-of-file is announced by a non-zero return code.

read can sometimes be too smart about certain inputs. Calling it with -r disables the special treatment of backslash escapes.
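A small demonstration of the difference (the input contains a literal backslash):

```shell
printf 'a\\b\n' | { read line; echo "$line"; }      # backslash acts as an escape: ab
printf 'a\\b\n' | { read -r line; echo "$line"; }   # -r keeps it verbatim: a\b
```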

Other notable parameters are -t or -p: use help read to see their description.

Script parameters and getopt

When a shell script receives parameters, we can access them via the special variables $1, $2, $3, and so on.

Check with the following script:

echo "$#"
echo "${0}"
echo "${1:-parameter one not set}"
echo "${2:-parameter two not set}"
echo "${3:-parameter three not set}"
echo "${4:-parameter four not set}"
echo "${5:-parameter five not set}"

and run as

./script.sh
./script.sh one
./script.sh one two
./script.sh one two three
./script.sh one two three four
./script.sh one two three four five
./script.sh one two three four five six

If you want to access all parameters, there is a special variable $@ for that. Try adding show_argv "$@" to the script above and re-execute. What is the difference between quoting "$@" and not doing so? Answer.

The special variable $# contains the number of arguments on the command line and $0 refers to the actual script name (like sys.argv[0] in Python).

getopt

When our script needs one argument, accessing $1 directly is fine. When the invocation is more complicated, it is better to use a more standardized way. The shell offers a getopt command that is able to simplify command-line parsing for you.

We will not describe all the details of this command but instead show an example that you can modify to your own needs.

The main arguments controlling getopt behavior are -o and -l, which contain the description of the switches for our program.

Let us assume that we want to handle the argument --verbose to make our script a bit more descriptive and --output to specify an alternate output file. We would like to handle the short versions -v and -o too. For --version, we want to print the version, and we shall not forget about --help either. All other arguments are names of input files.

The specification of the getopt switches is simple:

getopt -o "vho:" -l "verbose,version,help,output:"

Single-letter switches are specified after -o, long ones after -l, and a colon : after an option denotes that it takes an argument.

Run the above command appending -- and then the actual parameters to see what is happening.

getopt -o "vho:" -l "verbose,version,help,output:" -- --help input1.txt --output=file.txt
getopt -o "vho:" -l "verbose,version,help,output:" -- --help --verbose -o out.txt input2.txt
...

As you can see, getopt is able to parse the input and convert the parameters to a unified form, moving the normal arguments to the end of the list.

The following magical line (you do not need to understand it to use it) resets $1, $2 etc. to contain the values as parsed by getopt.

eval set -- "$( getopt -o "vho:" -l "verbose,version,help,output:" -- "$@" )"

Note that at this moment we assume the user would not invoke the command incorrectly. We will fix that later.

The actual processing is then quite straightforward.

#!/bin/bash

set -ueo pipefail

eval set -- "$( getopt -o "vho:" -l "verbose,version,help,output:" -- "$@" )"

be_verbose=false
output_file=/dev/stdout

while [ $# -gt 0 ]; do
    case "$1" in
        -h|--help)
            echo "Usage: $0 ..."
            exit 0
            ;;
        -o|--output)
            output_file="$2"
            shift
            ;;
        -v|--verbose)
            be_verbose=true
            ;;
        --)
            shift
            break
            ;;
        *)
            echo "Unknown option $1" >&2
            exit 1
            ;;
    esac
    shift
done

$be_verbose && echo "Starting the script"

for inp in "$@"; do
    $be_verbose && echo "Processing $inp into $output_file ..."
done

Several parts of the script require a bit more explanation.

true and false are actually very simple programs that merely terminate with the appropriate exit code. In other words – they are not boolean values, but they can be used as such. Note how we use them to drive the logging.

exit immediately terminates a shell script. The optional parameter denotes the exit code of your script.

shift is a special command that shifts the positional parameters $1, $2, … by one. Hence, after shift, $3 becomes $2, $2 becomes $1, and the original $1 is lost. "$@" is modified accordingly. Thus, the whole loop processes all options until it encounters --, which delimits the normal arguments. These are processed separately.
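A sketch showing shift in action on a hand-set parameter list (set -- replaces $1, $2, … for demonstration purposes):

```shell
set -- one two three
echo "$1 (arguments left: $#)"   # prints: one (arguments left: 3)
shift
echo "$1 (arguments left: $#)"   # prints: two (arguments left: 2)
```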

Functions in shell

Functions in shell are defined like this.

function_name() {
    commands
}

Note that arguments are not specified in the definition, and the return type is always an integer as it is, in fact, the exit code.

Arguments inside the function are available as $1, $2 etc.

Please, consult the following section on variable scoping for details about which variables are visible inside a function.

A simple logging function could look like this.

msg() {
    echo "$( date '+%Y-%m-%d %H:%M:%S |' )" "$@" >&2
}

It prints the current date and then the actual message, all to stderr.

Recall our example with case above. Reading the variable could be encapsulated into a function for example like this.

get_value() {
    echo "echo \$$1" | cat /etc/os-release - | sh
}

this_os_id="$( get_value ID )"
this_os_name="$( get_value NAME )"

Note how the function's stdout is captured into a variable.

Calling return terminates the function execution; its optional parameter becomes the exit code.

is_shell_script() {
    case "$( head -n 1 "$1" 2>/dev/null )" in
        \#!/bin/sh|\#!/bin/bash)
            return 0
            ;;
        *)
            return 1
            ;;
    esac
}

Such function can be used in if like this, for example.

if is_shell_script "$1"; then
    echo "$1 is a shell script"
fi

Note how good naming simplifies reading the final program. It is also a good idea to give a name to the function arguments rather than using $1, $2, and so on. We can do that with local (see the following sections).

is_shell_script() {
    local file="$1"
    case "$( head -n 1 "$file" 2>/dev/null )" in
        \#!/bin/sh|\#!/bin/bash)
            return 0
            ;;
        *)
            return 1
            ;;
    esac
}

You might notice that aliases, functions, built-ins, and regular commands are all called the same way. Therefore, the shell has a fixed order of precedence: aliases are checked first, then functions, then built-ins, and finally regular commands from $PATH. In this regard, the built-ins command and builtin might be of use (e.g. inside functions of the same name).
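As a sketch, command lets a function reach the regular command it shadows (the logging wrapper below is a made-up example):

```shell
echo() {
    # without "command", calling echo here would recurse forever
    command echo "[log]" "$@"
}
echo "hello"            # prints: [log] hello
command echo "hello"    # prints: hello
unset -f echo           # remove the wrapper again
```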

Subshell and variable scoping

This section describes a few rules and facts about variable scoping and why some constructs might not work as you would expect.

Shell variables are always global: all variables are visible in all functions, and a variable modification inside a function modifies the global variable.

The only exception is when a variable is declared as local inside a function.

When a variable is modified in a program (be it Python script or another shell script), this value remains local to that process (similar to working directory).

Using pipe is – in terms of variable scoping – equivalent to launching a new shell: variables set inside a piped environment are not propagated to the outer code.

Enclosing part of our script in ( .. ) creates so-called subshell that behaves as if another script was launched. Again, variables modified inside this subshell are not visible to the outer one.

Read and run the following code to understand the mentioned issues.

global_var="one"

change_global() {
    echo "change_global():"
    echo "  global_var=$global_var"
    global_var="two"
    echo "  global_var=$global_var"
}

change_local() {
    echo "change_local():"
    echo "  global_var=$global_var"
    local global_var="three"
    echo "  global_var=$global_var"
}

echo "global_var=$global_var"
change_global
echo "global_var=$global_var"
change_local
echo "global_var=$global_var"

(
    global_var="four"
    echo "global_var=$global_var"
)

echo "global_var=$global_var"

echo "loop:"
(
    echo "five"
    echo "six"
) | while read value; do
    global_var="$value"
    echo "  global_var=$global_var"
done
echo "global_var=$global_var"

Regular expressions (a.k.a. regex)

We already mentioned that systems of the Unix family are built on top of text files. The utilities we have seen so far offered basic operations, but none of them was really powerful. The use of regular expressions will change that.

We will not cover the theoretical details – there are special courses for that. We will look at regular expressions as a means to capture patterns in text.

Such patterns could be that we are interested in

  • lines starting with date and containing HTTP code 404,
  • files containing our login,
  • or a line preceding a line with a valid filename

While regular expressions are very powerful, their use is complicated by the fact that different tools use slightly different syntax. Keep this in mind when using grep and sed, for example. And each programming language has some exceptions too.

We will be using the following text (assuming it is stored in lorem.txt):

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed iaculis, magna at
varius scelerisque, nisl diam bibendum risus, eget vulputate massa dolor vel
dolor. In eu tempor mauris. Nulla turpis nisi, magna ut 1980-02-05 nisi sit
amet, condimentum maximus tortor. Etiam volutpat vel massa eu semper.

Praesent sollicitudin fringilla neque, non faucibus eros fringilla a. Quisque
cursus. Etiam a elit vitae urna porttitor laoreet. Vestibulum pharetra sodales
neque non iaculis. Nulla vitae luctus quam. 2020-02-30 Curabitur in massa id felis
volutpat cursus nec mattis libero. Sed at luctus sem.

Nullam aliquet magna ut risus gravida mattis. Etiam ut magna ipsum. Nam
cursus ipsum primis in faucibus. Vivamus quis dolor id ipsum euismod dignissim at
2017-12-03 id tellus 2018-06-14.

Proin a libero vehicula, convallis nisl at, pharetra est. Proin quis mi id
orci dui, hendrerit vitae risus vitae, volutpat dictum odio. Donec malesuada
augue eget felis venenatis vehicula. 1951-04-31 Donec egestas vel mauris a lobortis.
Mauris posuere a mi nec convallis.

The general tool for using regular expressions is called grep (legend says that g stands for globally – i.e., across the whole file, re for regex, and p for print).

In its basic form, it takes two arguments: the regular expression to match against and the text file where to apply the search.

It can be used to search for a specific word; e.g., the following will print all lines in /etc/passwd that contain the word system.

grep system /etc/passwd

Note that grep supports --color which might be quite useful.

If we want to search for lines starting with a certain word, we need to add the anchor ^.

grep "^cursus" lorem.txt

If the line is supposed to end with a pattern, we need to use the $ anchor. Note that it is safer to use single quotes to prevent any variable expansion.

grep 'at$' lorem.txt

Regular expressions are more powerful than that. We can find all lines starting with either r, s, or t using the [...] list.

grep '^[rst]' lorem.txt

Finding all three-digit numbers is also possible:

grep '[0-9][0-9][0-9]' lorem.txt

Note that this does not prevent grep from matching four-digit (or longer) numbers too – the pattern matches inside them.

We can also find lines not starting with any letter between r and z.

grep '^[^r-z]' lorem.txt

It is also possible to specify that the first four characters have to be between a and p.

grep '^[a-p]\{4\}' lorem.txt

Note that this does not require that all four characters are the same.

As a special case, * denotes that the previous part of the pattern can appear multiple times or not at all. Together with the dot . that matches any single character (except newline), we can create patterns with a required prefix and suffix and arbitrary content in the middle.

grep '^c.*t$' lorem.txt

Using groups we can apply the \{COUNT\} to longer parts of the pattern (otherwise, it applies only to the last specifier).

grep '[a-m][n-z][a-m][n-z]' lorem.txt
grep '\([a-m][n-z]\)\{2\}' lorem.txt

Use the manual page to find how to print line numbers, how to print context lines and how to specify multiple patterns.
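As a starting point, a sketch of a few of those switches (the sample file is created only for the demonstration):

```shell
printf 'alpha\nbeta\ngamma\n' >demo.txt
# -n prefixes each matching line with its line number
grep -n 'beta' demo.txt              # prints: 2:beta
# -C 1 prints one line of context before and after each match
grep -C 1 'beta' demo.txt
# -e specifies multiple patterns; a line matches if any of them matches
grep -e 'alpha' -e 'gamma' demo.txt
rm demo.txt
```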

Text substitution

The full power of regular expressions is unleashed when we use them to substitute patterns. We will show this on sed (a stream-editor) that can perform regular expression-based text transformations.

In its simplest form, sed replaces one word with another. The command reads: s for substitute, then a delimiter character – typically one uses : or #, but it can be any character that does not appear (unescaped) in the rest of the command – followed by the text to be replaced, the delimiter again, and the replacement.

sed 's:magna:angam:' lorem.txt

Note that this replaced only the first occurrence on each line. Adding g modifier (for global) replaces all occurrences.

sed 's:magna:angam:g' lorem.txt

The text to be matched can be any regular expression.

sed 's:[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]:DATE-REDACTED-OUT:g' lorem.txt

Typically, the substitution uses grouping \(...\) to reuse parts of the original text.

The following example transforms the date into the Czech form. Note how \3 refers to the third group.

sed 's:\([0-9][0-9][0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\):\3. \2. \1:g' lorem.txt
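With the -E switch, sed uses extended regular expressions, where ( ) and { } group and repeat without backslashes – the same transformation, a bit more readable:

```shell
# -E enables extended regular expressions: no backslash-escaping of the
# grouping and repetition operators is needed
echo 'Released on 2021-03-26.' \
    | sed -E 's:([0-9]{4})-([0-9]{2})-([0-9]{2}):\3. \2. \1:g'
# prints: Released on 26. 03. 2021.
```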

Testing in Shell with BATS

In this section we will briefly describe BATS – the testing system that we use for automated tests that are run on every push to GitLab.

Generally, automated tests are the only reasonable way to ensure your software is not slowly rotting and decaying. Good tests will capture regressions, ensure bugs are not reappearing and often serve as documentation of the expected behavior.

The motto “write tests first” may often seem exaggerated and difficult, but it contains a lot of truth (several reasons are listed, for example, in this article).

BATS is a testing system written in shell that targets shell scripts or any programs with a CLI interface. If you are familiar with other testing frameworks (e.g., Python Nose), you will probably find BATS very similar and easy to use.

Generally, every test case is one shell function and BATS offers several helper functions to structure your tests.

Let us look at the example from BATS homepage.

#!/usr/bin/env bats

@test "addition using bc" {
  result="$(echo 2+2 | bc)"
  [ "$result" -eq 4 ]
}

The @test "addition using bc" part is a test definition; internally, BATS translates it into a function (indeed, you can imagine it as running a simple sed script over the input and piping the result to sh), and the body is normal shell code.

BATS uses set -e to terminate the code whenever any program terminates with a non-zero exit code. Hence, if [ terminates with non-zero, the test fails.

Apart from this, there is nothing more about it in its basic form. Even with this basic knowledge, you can start using BATS to test your CLI programs.
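The failing-[ mechanism can be tried without BATS at all – a minimal sketch in plain shell (using shell arithmetic instead of bc so the example does not depend on bc being installed):

```shell
#!/bin/sh
set -e
# under set -e, a failing [ aborts the script right here -
# exactly how a BATS test body fails
result=$((2 + 2))
[ "$result" -eq 4 ]
echo "test passed, result=$result"
```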

Executing the tests is simple – make the file executable and run it. You can choose from several outputs and with -f you can filter which tests to run. Look at bats --help or here for more details.

Better assertions

BATS offers extensions for writing more readable tests.

Thus, instead of calling test directly, we can use assert_equal which produces a nicer message.

assert_equal "expected-value" "$actual"

NSWI177 tests

Our tests are bundled with the assert extension plus several helpers of our own. All of them are part of the tests.tar.gz that is downloaded by run_tests.sh in your repositories.

Feel free to execute the [01][0-9].bats files directly if you want to run just certain tests locally (i.e., not on GitLab).

Commented example of a bigger shell script

To put the knowledge from this lab into perspective, let us have a look at a bigger program. It will be a skeleton of the scoring program we mentioned in Lab #03. But with a few extra bits.

Please, first try to understand the program yourself and then look at our comments for things that might have escaped your attention.

#!/bin/bash

set -ueo pipefail

msg() {
    echo "INFO: " "$@" >&2
}

usage() {
    echo "Usage: $2 [options]"
    echo "..."
    exit "$1"
}

getopt_options="-o vho: -l verbose,version,help,output:"
# shellcheck disable=SC2086 # we want option splitting
getopt -Q $getopt_options -- "$@" || usage 1 "$0"
# shellcheck disable=SC2086 # we want option splitting
eval set -- "$( getopt -q $getopt_options -- "$@" )"

logger=:
output_file=/dev/stdout

while [ $# -gt 0 ]; do
    case "$1" in
        -h|--help)
            usage 0 "$0"
            ;;
        -o|--output)
            output_file="$2"
            shift
            ;;
        -v|--verbose)
            logger=msg
            ;;
        --version)
            echo "scoring.sh 0.1"
            exit 0
            ;;
        --)
            shift
            break
            ;;
        *)
            echo "Unknown option $1" >&2
            exit 1
            ;;
    esac
    shift
done

my_temp="$( mktemp -d )"

$logger "Starting the script"

echo -n "" >"$output_file"
for inp in "$@"; do
    while read -r cmd args; do
        case "$cmd" in
            \#*)
                ;;
            add)
                echo "$args" | (
                    read -r team _ score
                    $logger "Read $team has $score..."
                    echo "$score" >>"$my_temp/$team.txt"
                )
                ;;
            summary)
                echo "$args" >>"$output_file"
                for team_file in "$my_temp/"*.txt; do
                    echo "  $( basename "$team_file" .txt ): $( paste -sd+ <"$team_file" | bc )" >>"$output_file"
                done
                ;;
            *)
                ;;
        esac
    done <"$inp"
done

rm -rf "$my_temp"

The tricky lines with getopt are there to handle a wrong invocation (such as an unrecognized switch) without repeating the list of switches. Look up the individual getopt switches if you are interested in the details. The comments disable ShellCheck warnings because here we actually want the variable to be split into multiple arguments (usually, you do not want this – recall the section about variables above). Otherwise, feel free to copy this.

The tricky line logger=: assigns the special : command that does nothing. (Unlike a comment, it “hides” only the first command of a pipeline, so : echo Hello | cat is practically equivalent to echo -n "" | cat.) The trick is that without --verbose, logger messages are actually prefixed with : and hence not printed. With --verbose, $logger contains msg and the messages are printed (to stderr, so as not to disrupt the normal output).
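You can try the trick directly in a terminal – a minimal sketch:

```shell
# : is the null command: it ignores its arguments and exits with 0
logger=:
$logger "this message is silently discarded"
# pointing the variable at a real command turns the logging on
logger=echo
$logger "this message is printed"
# prints: this message is printed
```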

The call to mktemp and the rm at the end of the script should be obvious – we want to clean up after ourselves.

The special files /dev/stdin, /dev/stdout and /dev/stderr represent the three standard input/output streams of the current program (regardless of any redirections). Recall that in Linux, everything is a file.

echo -n "" >"$output_file" ensures that we start with an empty output file. For /dev/stdout (i.e., standard output) this makes no difference; when the user specifies -o, it truncates any existing content of that file. Note that we have to use /dev/stdout explicitly to allow redirection to a file (i.e., storing &1 into the variable would not work).

Note how the whole while read cmd args loop uses standard input redirection.

The weird construct for the add command works around the fact that a variable assigned inside a pipeline is not visible once the pipeline terminates (each part of a pipeline runs in its own subshell). That is, the following construct does not work.

echo "$args" | read team _ score
$logger "Read $team has $score..."

Note that we use _ to mark a variable we will not use.
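If you target bash specifically, a here-string sidesteps the subshell problem entirely – a sketch (here-strings are a bash extension, not POSIX sh):

```shell
#!/bin/bash
args="alpha 1 42"
# <<< feeds the string to read without a pipeline, so read runs in the
# current shell and the assignments stay visible afterwards
read -r team _ score <<<"$args"
echo "Read $team has $score..."
# prints: Read alpha has 42...
```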

You can compare the complexity with Python. Some operations are simpler, some are more tricky but generally it is a lot about personal taste. When deciding which programming language to use, starting with shell is often a good idea because writing the prototype is fast. Especially if the task requires a lot of (text) file manipulations. Once the program becomes too big or you need structured variables (e.g., lists of dictionaries etc.), it is perhaps time to switch to a more sophisticated language. But often the shell solution would be just good enough to remain for years ;-).

Exercises

Mass image conversion

The program convert from ImageMagick can convert images between formats using convert source.png target.jpg (with almost any file extensions). Convert all PNG images (with extension .png) in current folder to JPG. Answer.
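One possible sketch – a for loop iterates over the matching files and ${f%.png} strips the extension (assuming convert from ImageMagick is installed):

```shell
# iterate over all *.png files in the current directory
for f in *.png; do
    # ${f%.png} removes the .png suffix; convert infers the formats
    # from the file extensions
    convert "$f" "${f%.png}.jpg"
done
```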

By the way, ImageMagick offers plenty of operations; one worth remembering is resizing.

convert DSC0000.jpg -resize 800x600 thumbs/DSC0000.jpg

Standard input or arguments?

Write fact.sh with a function which computes the factorial of a given number. Create two variants: 1) read the input from stdin, 2) read the input from the first argument ($1). Answer.

What version was easier to write? Which makes more sense?

Ad-hoc processing of CSV files

Write a script csv_sum.sh that reads a CSV file from stdin. Sum all the numbers in the column that is specified as the only argument. Do not forget to exit with non-zero code and proper error message if no argument is provided.

Consider the following file named file.csv.

family_name,first_name,age,points,email
Doe,Joe,22,1,joe_doe@some_mail.com
Fog,Willy,38,8,ab@some_mail.com
Zdepa,Pepa,10,1,pepa@some_mail.com

The output of the command ./csv_sum.sh points <file.csv should be 10. Answer.

Barplots (the shell style)

Write bar_plot.sh which prints a horizontal bar plot. The input numbers determine the bar sizes. Decide which way of passing the input is more viable for you. Example:

$ ./bar_plot.sh 7 1 5
7: #######
1: #
5: #####

If the largest value is greater than 60, rescale the whole plot so that the maximum number of # characters is 60.

Answer.

Finding anomalies in web server logs

Assume that you have a file apache.log with:

Sep  7 18:33:21 ewait apache2[access_per_ip]: 1220805201 79.187.241.62 20 172 "OPTIONS * HTTP/1.0"
Sep 29 00:29:12 ewait apache2[access_per_ip]: 1222640952 217.79.176.125 37 238 "HEAD / HTTP/1.0"
Sep 10 14:03:58 ewait apache2[access_per_ip]: 1221048237 147.32.80.98 22 451 "OPTIONS * HTTP/1.1"
Oct 20 12:38:53 ewait apache2[access_per_ip]: 1224499133 147.32.80.98 842 354 "POST /y36aws/examples/03-cgi/formular/param.bin HTTP/1.1"
Oct 20 12:38:29 ewait apache2[access_per_ip]: 1224499109 147.32.80.98 852 1670 "POST /y36aws/examples/03-cgi/formular/cgi-bin/param HTTP/1.1"
Jul 22 12:24:16 ewait apache2[access_per_ip]: 1216722256 147.32.83.234 456 821 "GET /y36aws/examples/03-log/access_per_ip/2008/07/22.log HTTP/1.1"
Jul 22 12:24:08 ewait apache2[access_per_ip]: 1216722248 147.32.83.234 456 1446 "GET /y36aws/examples/03-log/access_per_ip/2008/07/22.log HTTP/1.1"
Dec  5 03:03:24 ewait apache2[access_per_ip]: 1228442604 193.142.127.145 19 238 "HEAD / HTTP/1.0"
Dec  1 18:36:21 ewait apache2[access_per_ip]: 1228152981 90.47.147.34 107 1218 "CONNECT ircvoila1.tchat.orange.fr:6667 HTTP/1.0"

Moreover, you have requests.txt with:

CONNECT
GET
HEAD
HELP
OPTIONS
POST
SEARCH

Write a script/command which filters the file apache.log so that you see only the lines that contain one of the words from requests.txt. In the full log file, the majority of requests are GET or POST. Filter them out as well (we do not want them), but do NOT modify requests.txt!

Answer.

Miscellaneous

As usual, a few unrelated notes.

Git and directories

You have perhaps noticed this yourself already: Git does not track directories, only files.

Hence, if you create an empty directory, Git will completely ignore it and calling git add on that directory will have no effect.

Once the directory contains at least one file, Git will show this new directory as an untracked file.

If you need to track an empty directory, you have to make it non-empty artificially – typically by creating a hidden file, usually named .gitkeep or just .keep, and adding this file to be tracked by Git (i.e., git add .keep).
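A minimal sketch of the workflow (run in a throw-away repository so nothing of yours gets touched):

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
mkdir assets                # git status shows nothing: the directory is empty
touch assets/.gitkeep       # make the directory non-empty
git add assets/.gitkeep     # now the directory is part of the index
git status --short          # prints: A  assets/.gitkeep
```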

Faster command-line editing

Obviously, you use Tab completion and up/down arrows for browsing through recent commands.

But there are plenty of other shortcuts that are worth knowing. A must is Ctrl-R for searching in history but there are plenty of others as described here or here.

Graded tasks

05/scoring.sh (15 points)

Extend the scoring skeleton above to support also the podium and csv commands as described for 03/scoring.py (but keep the existing switches).

Note that calling your scoring.py implementation is prohibited (i.e., do not create a shell wrapper around your Python implementation).

05/backup.sh (30 points)

Create a shell script for performing simple backups. The script takes files (or directories) as arguments and creates their copies in the directory given by an environment variable BACKUP_DIR (or ~/backup if not set) with names in the form YYYY-MM-DD_hh-mm-ss_~ABSOLUTE~PATH~TO~FILE, containing current date and time and the absolute path to the original file with / replaced by ~ (tilde). It should print the path to each successfully created backup. (realpath may be useful.)

Example use:

export BACKUP_DIR=~/my_backup
cd /home/intro/my_dir
../path/to/05/backup.sh a.zip b/c/

Example output:

/home/intro/my_backup/2021-03-26_10-01-23_~home~intro~my_dir~a.zip
/home/intro/my_backup/2021-03-26_10-01-23_~home~intro~my_dir~b~c

You may use the script for quick temporary backups of your current work, clearing the backup directory from time to time.

Note that we expect use of cp -R if the parameter is a directory.

05/mails_from_web.sh (20 points)

Write a script which takes a URL as its only parameter and prints all clickable e-mail addresses on that web page, one per line, in alphabetical order. For simplicity, look only for parts of the form href="mailto:EMAIL@ADDRESS", where EMAIL@ADDRESS has to contain at least one character from the set of all lowercase and uppercase letters of the English alphabet extended by dash, dot and digits, followed by @ and at least one character of the same set; other characters are forbidden. For downloading the web page use curl, not wget.

Example content of the web page:

Write mail
to <a href="mailto:bob@example.org">me</a> or <a href="mailto:Alice25@HerWork.com">Alice</a>
or <a href="mailto:nobody">nowhere</a>.
Definitely do not contact eve@TheMiddleOfTheWire.com.

Example script output:

Alice25@HerWork.com
bob@example.org

Note that curl understands this path too: file:///home/intro/Downloads/page.html.

05/interactive_calc.sh (20 points)

Write a shell script for summing, subtracting, multiplying and dividing numbers. On each line, the user specifies an operation in the format + NUM, - NUM, * NUM, or / NUM where NUM is an integer. If the script is executed with one argument, it reads the input from that file. If there is no argument, or the argument is - (dash), it reads the input from stdin. If the input file does not exist, the program exits with error code 1.

The program terminates when it reaches the end of the input, printing the result in the format = result. You may assume that the input will always be well formatted. The initial value is always 0.

Example session:

+ 5
* 2
/ 5
^D
= 2

The ^D symbolizes end of input here (e.g., Ctrl-D in interactive mode).

UPDATE: The operator is always applied to the last intermediate result (i.e., left associativity regardless of the actual priority). That is, you do not need to store the whole history and evaluate it at the end with respect to operator priority. Quite the opposite – compute everything on the fly.

Formally, the program should compute this: (((0 OP1 NUM1) OP2 NUM2) OP3 NUM3) and not this: 0 OP1 NUM1 OP2 NUM2 OP3 NUM3. Sorry about the confusion.

05/user_vars.sh (15 points)

Write a shell script that prints the names of environment variables whose values contain the username of the current user, sorted alphabetically. The name of the current user can be retrieved by the whoami command. Print only the variable names, not their content (and search for the username in their values, not in their names). Typically, you should see the following output.

HOME
LOGNAME
MAIL
PATH
USER

And also probably some other environment variables related to your desktop environment (such as GNOME_KEYRING_CONTROL or XAUTHORITY). You might also see PWD, depending on the directory you run the script from.

NOTE: You may assume that the environment does not contain any exported functions and that the values do not contain line breaks (i.e., you can parse by lines).

Deadline: April 19, AoE

Solutions submitted after the deadline will not be accepted.

Note that at the time of the deadline we will download the contents of your project and start the evaluation. Anything uploaded/modified later on will not be taken into account!

Note that we will be looking only at your master branch (unless explicitly specified otherwise), do not forget to merge from other branches if you are using them.