
Do not forget that the Before class reading is mandatory and there is a quiz that you are supposed to complete before coming to the labs.

Before class reading

Regular expressions (a.k.a. regex)

We already mentioned that systems from the Unix family are built on top of text files. The utilities we have seen so far offered basic operations, but none of them was really powerful. Use of regular expressions will change that.

We will not cover the theoretical details – see the course on Automata and grammars for that. We will view regular expressions as simple tools for matching of patterns in text.

For example, we might be interested in:

  • lines starting with date and containing HTTP code 404,
  • files containing our login,
  • or a line preceding a line with a valid filename.

While regular expressions are very powerful, their use is complicated by the unfortunate fact that different tools use slightly different syntax. Keep this in mind when using grep and sed, for example. Libraries for matching regular expressions are also available in most programming languages, but again beware of variations in their syntax.

The most basic tool for matching files against regular expressions is called grep. (There is a legend that the g in the name stands for “globally”, meaning the whole file, while re is regex, and p is print). If you run grep regex file, it prints all lines of file which match the given regex. We will try a lot of examples during the labs.
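
For example (using /etc/passwd just as a handy sample file), the following command prints every line of the file that contains the string bash:

grep 'bash' /etc/passwd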

Regex syntax

In its simplest form, a regex searches for the given string (usually in a case-sensitive manner).

system

This matches all occurrences of the substring system in the text. In grep, this means that all lines containing system will be printed.

If we want to search for lines starting with this word, we need to add the ^ anchor.

^system

If the line is supposed to end with a pattern, we need to use the $ anchor. Note that it is safer to use single quotes in the shell to prevent any variable expansion.

system$

Moreover, we can find all lines starting with either r, s or t using the [...] list.

^[rst]

This looks like a wildcard, but regexes are more powerful and the syntax differs a bit.

Let us find all three-digit numbers:

[0-9][0-9][0-9]

This matches all three-digit numbers, but also four-digit ones: regular expressions without anchors do not care about surrounding characters at all.

We can also find lines not starting with any letter between r and z. (The first ^ is an anchor, while the second one negates the set in [].)

^[^r-z]

The quantifier * denotes that the previous part of the regex can appear multiple times or never at all. For example, this finds all lines which consist of digits only:

^[0-9]*$

Note that this does not require that all digits are the same.

A dot . matches any single character (except newline). So the following regex matches lines starting with super and ending with ious:

^super.*ious$

When we want to apply the * to a more complex subexpression, we can surround it with (...). The following regex matches bana, banana, bananana, and so on:

ba(na)*na

If we use + instead of *, at least one occurrence is required. So this matches all decimal numbers:

[0-9]+

The vertical bar ("|" a.k.a. the pipe) can separate alternatives. For example, we can match lines composed of Meow and Quork:

^(Meow|Quork)*$

The [abc] construct is therefore just an abbreviation for (a|b|c).

Another useful shortcut is the {N} quantifier: it specifies that the preceding regex is to be repeated N times. We can also use {N,M} for a range. For example, we can match lines which contain 4 to 10 lower-case letters enclosed in quotation marks:

^"[a-z]{4,10}"$

Finally, the backslash character changes whether the next character is considered special. The \. matches a literal dot, \* a literal asterisk. On the other hand, many regex dialects (including grep without further options) require +, (, |, and { to be escaped in order to be recognized as regex operators. (You can run grep -E or egrep to activate extended regular expressions, where all special characters act as operators without backslashes.)
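
For illustration, here is the same pattern written in both flavors (fruits.txt is just a hypothetical input file); both commands match banana, bananana, and so on:

grep 'ba\(na\)\+na' fruits.txt
grep -E 'ba(na)+na' fruits.txt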

Text substitution

The full power of regular expressions is unleashed when we use them to substitute patterns. We will show this with sed (a stream editor), which can perform regular-expression-based text transformations.

In its simplest form, sed replaces one word by another. The command reads: substitute (s), then a single-character delimiter, followed by the text to be replaced (the left-hand side of the substitution), again the same delimiter, then the replacement (the right-hand side), and one final occurrence of the delimiter. (The delimiter is typically :, /, or #, but generally it can be any character that is not used without escaping in the rest of the command.)

sed 's:magna:angam:' lorem.txt

Note that this replaces only the first occurrence on each line. Adding a g modifier (for global) at the end of the command causes it to replace all occurrences:

sed 's:magna:angam:g' lorem.txt

The text to be replaced can be any regular expression, for example:

sed 's:[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]:DATE-REDACTED-OUT:g' lorem.txt

The right-hand side can refer to the text matched by the left-hand side. We can use & for the whole matched text or \n for the text matched by the n-th group (...) of the left-hand side.

The following example transforms the date into the Czech form. Similarly to grep, we have to escape the ( and ) characters to make them act as grouping operators instead of literal ( and ).

sed 's:\([0-9][0-9][0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\):\3. \2. \1:g'
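
For instance, piping a sample line through this command (the date is purely illustrative):

echo 'Updated on 2021-03-25.' | sed 's:\([0-9][0-9][0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\):\3. \2. \1:g'

should print Updated on 25. 03. 2021.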

We will see more examples during the labs, stay tuned.

ShellCheck

You have already written quite a lot of shell scripts. It is thus time to introduce you to ShellCheck. ShellCheck is a tool that checks your shell scripts for common issues. These issues are neither syntax errors nor logical errors. The issues raised by ShellCheck are patterns that are well known to cause unexpected behavior, degrade performance, or even hide some nasty surprises.

One such example could be if your script contains the following snippet.

cat input.txt | cut -d: -f 3

Do you know what could possibly be wrong? Technically, this code is correct and by itself does not contain any bug. However, the first cat is redundant as it prints a single file only: the code can be reduced to the following form without any change in functionality:

cut -d: -f 3 <input.txt

As you can see, this is essentially harmless. But it might mean that you wanted to cat multiple files or that cat is a left-over from a previous version. So ShellCheck warns you.

Another issue where ShellCheck helps is the following code:

dir_name=results/
if [ -d $dri_name ]; then
    echo "$dir_name already exists."
fi

Here ShellCheck will actually detect the typo, because dri_name was never assigned before its use.

Another trap awaits in the following code:

if [ -d ]; then
    echo "$dir_name already exists."
fi

Of course, this is completely wrong. But guess what: test (or [) will accept this and evaluate it as true. This looks crazy, but test with exactly one argument actually checks whether that argument is non-empty. We have test -n these days, but once we did not have it, and backward compatibility must be kept. See this page for details.

So this is a correct piece of shell code, but it likely does not do what you wanted. This is where ShellCheck comes to help you.

ShellCheck is able to warn you about hundreds of possible issues, as can be seen on this page. Get into the habit of running it on your shell scripts regularly.
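
Running it is as simple as passing your script to it (the filename is just an example):

shellcheck my_script.sh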

In our experience, ShellCheck seldom gives false positives, but it has saved us many times.

Some of the graded tasks that you submit will be checked by ShellCheck, too (and we might penalize your solutions if your scripts are not ShellCheck-error-free).

Other linters and checkers

Note that ShellCheck is not the only tool available. Virtually every programming language has similar tools or tools that check your style.

For example, Python has Pylint that you should definitely try and run regularly too. It can detect real bugs as well as make your code more Pythonic – this is quite important for teamwork.
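
The invocation is analogous to ShellCheck (again, the filename is only an example):

pylint my_script.py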

The important takeaway

Start using ShellCheck, Pylint or any other tool for your favorite language.

It will not detect logical errors (at least not all of them), but it will surely detect so-called code smells: places in your code that often lead to errors, undefined behaviour, or similar issues.

This is doubly important if you are new to some language: the chances are that you misunderstood some feature rather than the tool being wrong.

Before class quiz

The quiz file is available in the 08 folder of this GitLab project.

Copy the right language mutation into your project as 08/before.md (i.e., you will need to rename the file).

The questions and answers are part of that file; fill in the answers between the **[A1]** and **[/A1]** markers.

The before-08 pipeline on GitLab will test that your answers are in the correct format. It does not check for actual correctness (for obvious reasons).

Submit your before-class quiz before the start of lab 08.

Testing in Shell with BATS

In this section we will briefly describe BATS – the testing system that we use for automated tests that are run on every push to GitLab.

Generally, automated tests are the only reasonable way to ensure your software is not slowly rotting and decaying. Good tests will capture regressions, ensure bugs are not reappearing and often serve as documentation of the expected behavior.

The motto "write tests first" may often seem exaggerated and difficult to follow, but it contains a lot of truth (several reasons are listed, for example, in this article).

BATS is a system written in shell that targets shell scripts or any program with a CLI interface. If you are familiar with other testing frameworks (e.g. Python Nose), you will probably find BATS very similar and easy to use.

Generally, every test case is one shell function and BATS offers several helper functions to structure your tests.

Let us look at the example from BATS homepage:

#!/usr/bin/env bats

@test "addition using bc" {
  result="$(echo 2+2 | bc)"
  [ "$result" -eq 4 ]
}

The @test "addition using bc" is a test definition. Internally, BATS translates this into a function (indeed, you can imagine it as running a simple sed script over the input and piping it to sh) and the body is normal shell code.

BATS uses set -e to terminate the code whenever any program terminates with a non-zero exit code. Hence, if [ terminates with a non-zero exit code, the test fails.

Apart from this, there is not much more to it in its basic form. Even with this basic knowledge, you can start using BATS to test your CLI programs.

Executing the tests is simple – make the file executable and run it. You can choose from several outputs and with -f you can filter which tests to run. Look at bats --help or here for more details.
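
For example, assuming the test file from above was saved as test.bats and made executable, either of the following runs it (the second form runs only the tests whose names match the filter):

./test.bats
bats -f addition test.bats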

Commented example

Let’s write a test for our factor.py program. We will use the version that reads the number from argv.

#!/usr/bin/env bats

@test "Factorize 7" {
    run ./factor.py 7
    [ "$output" = "7" ]
}

@test "Factorize 17" {
    run ./factor.py 17
    [ "$output" = "17" ]
}

We use the special BATS command run to execute our program; it also captures the program's stdout into a variable named $output.

And then we simply verify the correctness.

Let’s add another test case:

@test "Factorize 8" {
    run ./factor.py 8
    [ "$output" = "2 2 2" ]
}

This will fail, but the error message is not very helpful.

   (in test file factor.bats, line 15)
     `[ "$output" = "2 2 2" ]' failed

This is because BATS is a very thin framework that basically checks only the exit codes and not much more.

But we can improve that.

#!/usr/bin/env bats

check_it() {
    run ./factor.py "$1"
    [ "$output" = "$2" ]
}

@test "Factorize 7" {
    check_it 7 7
}

@test "Factorize 17" {
    check_it 17 17
}

@test "Factorize 8" {
    check_it 8 "2 2 2"
}

The error message is not much better but the test is much more readable this way.

Let’s improve the check_it function a bit more.

check_it() {
    run ./factor.py "$1"
    if [ "$output" = "$2" ]; then
        return 0
    fi
    echo >&2
    echo "-- Actual output --" >&2
    echo "$output" >&2
    echo "-- Expected output --" >&2
    echo "$2" >&2
    return 1
}

Let’s run the test again:

   (from function `check_it' in file factor.bats, line 13,
    in test file factor.bats, line 25)
     `check_it 8 "2 2 2"' failed

   -- Actual output --
   2
   2
   2
   -- Expected output --
   2 2 2

So basically our test was wrong all the time :-).

But this is actually usable for debugging our program.

We simply need to change our test a bit:

@test "Factorize 8" {
    check_it 8 "2
2
2"
}

Yes, shell strings can span multiple lines just fine.

Adding more test cases is now a piece of cake. After this trivial update, our test suite will actually start making sense. And it will be useful to us.

Better assertions

BATS offers extensions for writing more readable tests.

Thus, instead of calling test directly, we can use assert_equal, which produces a nicer message.

assert_equal "expected-value" "$actual"
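
As a sketch – assuming the extension is available to the test file and following the argument order shown above – our check_it helper could then shrink to:

check_it() {
    run ./factor.py "$1"
    assert_equal "$2" "$output"
}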

NSWI177 tests

Our tests are bundled with the assert extension plus several helpers of our own. All of them are part of the repository that is downloaded by run_tests.sh in your repositories.

Feel free to execute the *.bats file directly if you want to run just a certain test locally (i.e., not on GitLab).

grep and sed

We have already mentioned these commands. The first one prints lines matching a given regular expression, the other one is able to change the lines according to the provided regular expression and its replacement.

Warning: the two commands use slightly different regex syntaxes. Always check with the man page if you are not sure. Generally, the biggest differences across tools/languages are in the handling of the special characters for repetition or grouping ((), {}).

Exercises

Find all lines in /etc/passwd that contain the digit 9.

Accounts with /sbin/nologin in /etc/passwd are generally system accounts not used by a human user. Print the list of these accounts. Solution.

Find all lines in /etc/passwd that start with any of the letters A, B, C or D (case-insensitive). Solution.

Find all lines which contain an even number of characters. Solution.

Find all e-mail addresses. Assume that a valid e-mail address has the format <s1>@<s2>.<s3>, where each sequence <sN> is a non-empty string of characters from the English alphabet and the sequences <s1> and <s2> may also contain digits or a dot .. Solution.

Print all lines containing a word (in the English alphabet) which begins with a capital letter while all other letters are lowercase. Test that the word TeX will not be matched. Solution.

Remove all trailing spaces and tabulators. Solution.

Put every word (non-empty sequence of characters of the English alphabet) in parentheses. Solution.

Replace “Name Surname” by “Surname, N.”. Solution.

Delete all empty lines. Hint. Solution.

Reformat input to contain each sentence on a separate line. Assume that each sentence begins with a capital English letter and ends with ., !, or ?; there may be any number of spaces between sentences. Hint. Solution.

Bigger script example

We will describe the following script in a bit more detail to explain typical idioms you can encounter. We will also build the script incrementally to give you an idea how to approach building bigger scripts.

But we provide the complete script as well so that you can check that you have built it from the fragments correctly.

Task description

Write a script that prints basic system information (hardware platform, kernel version, number of CPUs, and RAM size). The user should be able to choose different output formats.

Solution.

Solution description

The core of our script is simple.

echo "Hardware platform: $( uname -m )"
echo "Kernel version: $( uname -r )"
echo "CPU count: $( nproc )"
echo "RAM size: $( sed -n 's#^MemTotal:[ ]*\([0-9]*\) kB#\1#p' </proc/meminfo )"

This output is useful for a human reader but not for machine processing. So let's add a version that prints the output as assignments to shell variables that can be used later, i.e., in the following format.

PLATFORM="x86_64"
KERNEL_VERSION="5.10.16-arch1-1"

Of course, duplicating the script to contain the following is not a nice solution.

if [ "$format" = "shell" ]; then
    echo "PLATFORM=$( uname -m )"
    ...
else
    echo "Hardware platform: $( uname -m )"
    ...
fi

But it is possible to convert between these two formats. Let’s convert our script like this:

if [ "$format" = "shell" ]; then
    column_no=1
else
    column_no=2
fi
(
    echo "PLATFORM:Hardware platform:$( uname -m )"
    echo "KERNEL_VERSION:Kernel version:$( uname -r )"
    echo "CPU_COUNT:CPU count:$( nproc )"
    echo "RAM_TOTAL:RAM size:$( sed -n 's#^MemTotal:[ ]*\([0-9]*\) kB#\1#p' </proc/meminfo )"
) | cut '-d:' -f $column_no,3-

Not perfect, but we are getting there. Let's hide the conversion in separate shell functions.

format_normal() {
    cut '-d:' -f 2,3
}

format_shell() {
    cut '-d:' -f 1,3 | sed 's#:\(.*\)#="\1"#'
}

Then the script would contain the following pipeline:

(
    ...
    echo "RAM_TOTAL:RAM size:$( sed -n 's#^MemTotal:[ ]*\([0-9]*\) kB#\1#p' </proc/meminfo )"
) | "format_${format}"

In a sense, we have used polymorphism in our script, as the $format variable is technically a replacement for a virtual method table.

Adding JSON is a bit more complicated, but still doable. Note that we down-case the variable names for nicer output. The final sed is used to remove the trailing comma (JSON is a very strict format).

format_json() {
    local varname
    local varvalue
    echo "{"
    cut '-d:' -f 1,3 | sed 's#:# #' | while read -r varname varvalue; do
        echo -n "$varname" | tr 'A-Z' 'a-z' | sed 's#.*#  "&": #'
        echo "\"$varvalue\"",
    done | sed '$s#,$##'
    echo "}"
}
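
With purely illustrative values, the resulting output could look roughly like this:

{
  "platform": "x86_64",
  "kernel_version": "5.10.16-arch1-1",
  "cpu_count": "4",
  "ram_total": "16263008"
}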

We can certainly use getopt to allow the user to select the output format but we will opt for using a configuration file or setting an environment variable. Then, the default format can be specified in "$HOME/.nswi177/sysinfo.rc" or the script can be launched with:

FORMATTER=json ./sysinfo.sh

Many programs offer you all three options: the script first loads the settings from a configuration file, optionally overrides them with an environment variable, and getopt can override these.

The loading in the script then looks like this (we switched to capitals to emphasize that the variable comes from the user and thus will be exported).

if [ -r "$HOME/.nswi177/sysinfo.rc" ]; then
    . "$HOME/.nswi177/sysinfo.rc"
fi

if [ -z "${FORMATTER:-}" ]; then
    FORMATTER="${DEFAULT_FORMATTER:-normal}"
fi
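
A minimal configuration file could then contain just the default format (a sketch, reusing the DEFAULT_FORMATTER variable from the snippet above):

# Contents of $HOME/.nswi177/sysinfo.rc
DEFAULT_FORMATTER="json"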

Graded tasks (deadline: Apr 17)

IMPORTANT NOTE #1: the tasks below use intentionally simplified assumptions and target well-formatted input. If behaviour is not defined by the text, it is defined by the tests. Many cases are intentionally not defined and not tested – use common sense to define the behaviour in these cases.

IMPORTANT NOTE #2: do not forget to check your implementation with ShellCheck.

08/timeconv.sh (20 points)

Write a shell script that converts time in AM/PM format to 24-hour format.

The script reads stdin and prints the result to stdout. No arguments will be given and no arguments are expected to be recognized.

The script will find all occurrences of hh:mmAM or hh:mmPM and replace them with their 24-hour format equivalents.

Example input may look like this:

The event starts at 03:25PM and is expected to end at 06:17PM.
Registration will be opened from 09:00AM until 06:00 PM.

And the corresponding output:

The event starts at 15:25 and is expected to end at 18:17.
Registration will be opened from 09:00 until 06:00 PM.

We expect that you will use separate expressions for individual PM hours as converting 03 to 15, 04 to 16 etc. directly in sed is not very straightforward.

But feel free to generate parts of the script if you like. Hint:

echo "49 50 51 52 53 54" | sed -e "$( for i in 50 51 52; do echo "s:$i:$(( i - 50 )):g"; done )"

08/ip.sh (20 points)

Download here an excerpt of an Apache access log. Basically, it is a list of files a web server was asked for (e.g. a user typed their URL or clicked a link). This log file contains successful requests but also entries where the request was not satisfied, i.e. the file was not present (a.k.a. HTTP 404).

Some of the entries are genuine typos but some of them actually reveal that bots were trying to break into a WordPress installation (that was never present on the server anyway).

Each line contains the IP address of the originator of the request, the date, the requested URL (together with the method), the HTTP status code, the response size, and the user agent (browser identification).

Your script should read such a file on stdin and print the IP address of the machine that tried to access non-existent pages (look for 404) the most. No arguments will be given and no arguments are expected to be recognized.

Note that the tests operate on small fragments of the actual log file to simplify debugging. The link mentioned above serves as a demonstration of what you can actually encounter.

In a real-world setup, you would use a specialized tool for processing such logs in a more automated and structured way. However, grep and sed are perfect fits for a hobby server or if you need to operate in an isolated environment.

Note that we have randomly modified the IP addresses to preserve anonymity.

By the way, for the full log, the most offending (anonymized) IP address is 62.150.128.144.

08/normalize.sh (20 points)

Write a script that normalizes a given path.

The script will accept a single argument: the path to normalize. You can safely assume that the argument will always be provided.

The script will normalize the provided path in the following way:

  • references to the current directory ./ will be removed as they are redundant
  • references to the parent directory (..) will be removed in such a way that the actual meaning of the path does not change (possibly repeatedly)
  • the script will not convert a relative path to an absolute one or vice versa
  • the script will not check whether the file actually exists

The following examples illustrate the expected behaviour.

  • /etc/passwd → /etc/passwd
  • a/b/././c/d → a/b/c/d
  • /a/b/../c → /a/c
  • /usr/../etc/ → /etc/

You can assume that components of the path will not contain new-lines or other special characters such as :, ", ' or any kind of escape sequences.

Hint: sed ':x; s/abb/ba/; tx' causes s/abb/ba/ to be applied repeatedly as long as a substitution is performed (:x defines a label while tx is a conditional jump to that label, taken if the previous substitution changed the input). Try with echo 'abbbb' | sed ....
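
For instance (just to see the loop in action):

echo 'abbbb' | sed ':x; s/abb/ba/; tx'

This should print bba: the first substitution yields babb, the second yields bba, and then no abb remains.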

08/markdown.sh (40 points)

Write a simple Markdown convertor to HTML.

We again intentionally simplify the syntax a lot: a full-fledged parser would generally work better here, but the point of this task is to exercise your knowledge of basic regular expressions.

The convertor must support the following styles:

  • Text with _emphasis of several words_. will be rendered as Text with <em>emphasis of several words</em>.
  • Text with *strong emphasis*. will be rendered as Text with <strong>strong emphasis</strong>.
  • Any >, < or & must be converted to HTML entities.
  • Links in the form of [http://...|link text] will be converted to <a href="http://...">link text</a>.
    • URL will always start with http:// or https://
    • Characters <, >, & and " must be escaped inside the URL, i.e. they must be converted to their respective HTML entities.

The convertor shall ignore other common Markdown features such as paragraph detection or (ordered or unordered) list formatting.

We do not require and we will not test nesting of any of the above mentioned markups. Therefore, it is not necessary to handle situations such as some _emphasis *inside* another_ one or _special > characters_ etc.

You can also safely assume that formatting marks never span multiple lines. There can, however, be several of them on one line, but without overlaps.

The script reads stdin and prints the result to stdout. No arguments will be given and no arguments are expected to be recognized.

Learning outcomes

Conceptual knowledge

Conceptual knowledge is about understanding the meaning of the given terms and putting them into context. Therefore, you should be able to …

  • explain what a regular expression is

  • explain why linters and style checkers should be used for source code checks

Practical skills

Practical skills are usually about the usage of given programs to solve various tasks. Therefore, you should be able to …

  • create and use simple regular expressions to filter text with grep

  • use sed to perform text substitution

  • use . and source

  • use and interpret results of ShellCheck

  • use and interpret results of Pylint

  • execute BATS-based tests

  • read a BATS test

  • create simple BATS tests (optional)