
This lab focuses on build systems – tools that streamline the path from source code to a publishable artefact. This includes creating an executable binary from a set of C sources, generating HTML from Markdown sources, or creating thumbnails in different resolutions from the same photo.

Before class reading

The before-class reading will be a little different this time. First we introduce a helper tool that we will be using during the lab; we want you to get familiar with it before the lab starts. The second part of the reading is the standard introduction, this time to build systems.

Setup

To try the commands and examples yourself, please update your clone of the examples repository and move into the 13/pandoc subdirectory.

Ensure you have pandoc installed (sudo dnf install pandoc).

Pandoc

Pandoc is a universal document converter that can convert between various formats, including HTML, Markdown, DocBook, LaTeX, Word, LibreOffice, or PDF.

Basic usage

Start by running the following (inside the 13/pandoc directory of the examples repository):

cat example.md
pandoc example.md

As you can see, the output is a conversion of the Markdown file into HTML, though without an HTML header.

Markdown can be combined with HTML directly (useful if you want more complicated HTML code: Pandoc will copy it as-is).

<p>This is an example page in Markdown.</p>
<p>Paragraphs as well as <strong>formatting</strong> are supported.</p>
<p>Inline <code>code</code> as well.</p>
<p class="alert alert-primary">
Third paragraph with <em>HTML code</em>.
</p>

If you add --standalone, it generates a full HTML page. Let’s try it (both invocations will have the same end result):

pandoc --standalone example.md >example.html
pandoc --standalone -o example.html example.md

Try opening example.html in your web browser, too.

As mentioned, Pandoc can create OpenDocument, too (the format used mostly in the OpenOffice/LibreOffice suite).

pandoc -o example.odt example.md

Note that we have omitted --standalone here as it is needed only for HTML output. Check what the generated document looks like in LibreOffice/OpenOffice, or even import it into some online office suite.

Note that you should not commit example.odt into your repository as it can be generated at any time. That is a general rule for any file that can be generated from other (versioned) files.
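
One common way to avoid committing generated files by accident is to list them in a .gitignore file. A sketch (the exact list depends on what you generate):

example.html
example.odt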

Sidenote about LibreOffice

Did you know that LibreOffice can be used from the command line, too? For example, we can ask LibreOffice to convert a document to PDF via the following command:

soffice --headless --convert-to pdf example.odt

The --headless prevents opening any GUI and --convert-to should be self-explanatory.

Combined with Pandoc, three commands are enough to create an HTML page and PDF output from a single source.
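
For example, the whole pipeline could be just this sketch (reusing example.md from above):

pandoc --standalone -o example.html example.md
pandoc -o example.odt example.md
soffice --headless --convert-to pdf example.odt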

Pandoc templates

By default, Pandoc uses its own default template for the final HTML. But we can change this template, too.

Look inside template.html. When the template is expanded (or rendered), the parts between dollars will be replaced with actual content.
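
If you do not have the repository at hand, a hypothetical minimal template could look something like this (the actual template.html in the examples repository is more elaborate):

<!DOCTYPE html>
<html>
<head>
<title>$title$</title>
</head>
<body>
$body$
</body>
</html>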

Let’s try it with Pandoc.

pandoc --template template.html index.md >index.html

Check what the output looks like. Notice how $body$ and $title$ were replaced.

Running example

We will use the above-mentioned files as a running example: a starting point for a very simple website that displays information about a tournament.

Notice that for index and rules, there are Markdown files to generate the HTML from. The page teams.html – which is mentioned in the menu – does not have a corresponding Markdown file.

Instead, it is generated via the following command:

./bin/teams2md.sh teams.csv | cat header-teams.md - | pandoc --template template.html >teams.html

Note that items in teams.csv are separated by spaces and they are intended to be read with read -r team_id team_color team_name.
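
For illustration only, here is a hypothetical sketch of what a script like bin/teams2md.sh could do (the actual script in the repository may format its output differently):

#!/bin/bash

set -ueo pipefail

# Read space-separated records and emit one Markdown list item per team.
while read -r team_id team_color team_name; do
    echo "* $team_name ($team_color, id $team_id)"
done <"$1"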

Further uses of Pandoc

Pandoc can be used even in more sophisticated ways. It supports conversion to and from LaTeX and plenty of other formats (try with --list-output-formats and --list-input-formats).

It can also be used as a universal Markdown parser with -t json (feel free to omit | json_reformat or replace it with | jq if you do not have json_reformat installed).

echo 'Hello, **world**!' | pandoc -t json | json_reformat

Build systems

The above steps for creating the simple tournament website serve as the main motivation for this lab.

While the above steps do not build an executable from sources (as is the typical case for software development), they represent a typical scenario.

Building software usually consists of many steps that can include actions as different as:

  • compiling source files to some intermediate format
  • linking the final executable
  • creating bitmap graphics in different resolutions from a single vector image
  • generating source-code documentation
  • preparing localization files with translation
  • creating a self-extracting archive
  • deploying the software on a web server
  • publishing an artefact in a package repository

Almost all of them are simple by themselves. What is complex is their orchestration. That is, how to run them in the correct order and with the right options (parameters).

For example, before an installer can be prepared, all other files have to be prepared. Localization files often depend on precompilation of some sources but have to be prepared before the final executable is linked. And so on.

Even for small projects, the number of steps can be quite high, yet they are – in a sense – unimportant: you do not want to remember them, you just want to build the whole thing!

Note that your IDE can often help you with all of this – with a single click. But not everybody uses the same IDE and you may not even have a graphical interface at all. Furthermore, you typically run the build as part of each commit – the GitLab pipelines we use for tests are a typical example: they execute without a GUI yet we want to build the software (and test it too). Codifying the build in a script simplifies things for virtually everyone. (Note that we will discuss GitLab pipelines in the next lab.)

To illustrate the above points, let’s imagine what a shell script for building the example site would look like.

#!/bin/bash

set -ueo pipefail
set -x

mkdir -p out
pandoc --template template.html index.md >index.html
pandoc --template template.html rules.md >rules.html
...

This looks like a nice improvement: a new member of the team would not need to investigate all the tiny details and could just run a single build.sh script.

The script is nice but it overwrites all files even if there was no change. In our small example, it is no big deal (you have a fast computer, after all).

But in a bigger project where we, for example, compile thousands of files (e.g. look at the source tree of the Linux kernel, Firefox, or LibreOffice), it matters. If an input file has not changed (e.g. we modified only rules.md), we do not need to regenerate the other files (e.g., we do not need to re-create index.html).

Let’s extend our script a little bit (look up -nt in man test).

#!/bin/bash

set -ueo pipefail
set -x

mkdir -p out
[ "index.md" -nt "index.html" ] \
    && pandoc --template template.html index.md >index.html
[ "rules.md" -nt "rules.html" ] \
    && pandoc --template template.html rules.md >rules.html
...

We can do that for every command to speed up the web generation.

But.

That is a lot of work. And the time saved would probably all be wasted on rewriting our script. Not to mention the fact that the result looks horrible. And it is rather expensive to maintain.

Also, we often need to build just a part of the project: e.g., regenerate the documentation only (without publishing the result, for example). Although extending the script in the following way is possible, it certainly is not viable for large projects.

if [ -z "$1" ]; then
    ... # build here
elif [ "${1:-}" = "clean" ]; then
    rm -f index.html rules.html teams.html
elif [ "${1:-}" = "publish" ]; then
    cp index.html rules.html teams.html /var/www/web-page/teams/
else
    ...

Luckily, there is a better way.

There are special tools, usually called build systems that have a single purpose: to orchestrate the build process. They provide the user with a high-level language for capturing the above-mentioned steps for building software.

In this lab, we will focus on make. make is a relatively old build system, but it is still widely used. It is also one of the simplest tools available: you need to specify most things manually, but that is great for learning. You will have full control over the process and you will see what is happening behind the scenes.

We will talk about make during the labs.

As a teaser, the following fragment of so-called Makefile builds the files index.html and rules.html from the corresponding Markdown sources.

all: index.html rules.html

.PHONY: all clean

index.html: index.md template.html
	pandoc --template template.html index.md >index.html

rules.html: rules.md template.html
	pandoc --template template.html rules.md >rules.html

clean:
	rm -f index.html rules.html

Before class quiz

The quiz file is available in the 13 folder of this GitLab project.

Copy the right language version into your project as 13/before.md (i.e., you will need to rename the file).

The questions and answers are part of that file; fill in the answers between the **[A1]** and **[/A1]** markers.

The before-13 pipeline on GitLab will test that your answers are in the correct format. It does not check for actual correctness (for obvious reasons).

Submit your before-class quiz before the start of lab 13.

Sidenote: programs xargs and find (and parallel)

The following programs can often come in handy, but we were unable to squeeze them into the previous labs about shell scripting. Hence they appear here, as the build-system topic is quite short and simple.

xargs

xargs in its simplest form reads standard input and converts it to program arguments for a user-specified program.

Assume we have the following files in a directory:

2022-04-10.txt  2022-04-16.txt  2022-04-22.txt  2022-04-28.txt
2022-04-11.txt  2022-04-17.txt  2022-04-23.txt  2022-04-29.txt
2022-04-12.txt  2022-04-18.txt  2022-04-24.txt  2022-04-30.txt
2022-04-13.txt  2022-04-19.txt  2022-04-25.txt
2022-04-14.txt  2022-04-20.txt  2022-04-26.txt
2022-04-15.txt  2022-04-21.txt  2022-04-27.txt

The following script removes files that are older than 20 days:

cutoff_date="$( date -d "20 days ago" '+%Y%m%d' )"
for filename in 2022-[01][0-9]-[0-3][0-9].txt; do
    date_num="$( basename "$filename" .txt | tr -d '-' )"
    if [ "$date_num" -lt "$cutoff_date" ]; then
        echo rm "$filename"
    fi
done

This means that the program rm would be called several times, always removing just one file. Note that we use echo rm there so that the files are not actually removed and the operation is merely demonstrated. The overhead of starting a new process for every file could become a serious bottleneck for larger scripts (think about thousands of files, for example).

It would be much better to call rm just once, giving it the whole list of files to remove (i.e., as multiple arguments).

xargs is the solution here. Let’s modify the program a little bit:

cutoff_date="$( date -d "20 days ago" '+%Y%m%d' )"
for filename in 2022-[01][0-9]-[0-3][0-9].txt; do
    date_num="$( basename "$filename" .txt | tr -d '-' )"
    if [ "$date_num" -lt "$cutoff_date" ]; then
        echo "$filename"
    fi
done | xargs echo rm

Instead of removing each file right away, we just print its name and pipe the whole loop to xargs, whose own (normal) arguments specify the program to be launched.

Instead of many lines with rm ... we will see just one long line with a single invocation of rm.

Of course, tricky filenames can still cause issues as xargs assumes that arguments are delimited by whitespace. (Note that for above, we were safe as the filenames were reasonable.) That can be changed with --delimiter.

If you are piping input to xargs from your program, consider delimiting items with zero byte (i.e., the C string terminator, \0). That is the safest option as this character cannot appear anywhere inside any argument. And tell xargs about it via -0 or --null.
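
For example, a sketch of producing a zero-delimited list in the shell and consuming it with xargs (printf repeats its format string for every argument; echo rm again only demonstrates the operation):

printf '%s\0' *.txt | xargs -0 echo rm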

Note that xargs is smart enough to realize when the command-line would be too long and splits it automatically (see manual for details).

It is also good to remember that xargs can execute the command in parallel (i.e., split the stdin into multiple chunks and call the program multiple times with different chunks) via -P. If your shell scripts are getting slow but you have plenty of CPU power, this may speed things up quite a lot for you.
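
As a sketch, the following would compress files in up to four parallel gzip processes (the file names and the -P value are just an example; -n 1 makes xargs start a separate gzip for each file):

printf '%s\0' *.log | xargs -0 -P 4 -n 1 gzip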

parallel

This program can be used to execute multiple commands in parallel, hence speeding up the execution.

parallel behaves almost exactly as xargs but runs the individual jobs (commands) in parallel.
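
For example, a rough equivalent of the xargs -P sketch above (GNU parallel understands the zero-byte delimiter via -0/--null as well):

printf '%s\0' *.log | parallel -0 gzip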

Please refer to parallel_tutorial(1) (yes, that is a man page) and to parallel(1) for more details.

find

While ls(1) and wild-card expansion are powerful, sometimes we need to select files using more sophisticated criteria. This is where the find(1) program comes in useful. Without any arguments, it lists all files in the current directory, including files in nested directories. Do not run it on the root directory (/) unless you know what you are doing (and definitely not on the shared linux.ms.mff.cuni.cz machine).

With the -name parameter you can limit the search to files matching a given wildcard pattern. The following command finds all alpha.txt files in the current directory and in any subdirectory (regardless of depth).

find -name alpha.txt

Why would the following command for finding all *.txt files not work?

find -name *.txt

Hint. Answer.

find has many options – we will not duplicate its manpage here but mention those that are worth remembering.

-delete immediately deletes the found files. Very useful and very dangerous.

-exec runs a given program on every found file. You have to use {} to specify the found filename and terminate the command with ; (since ; terminates commands in shell too, you will need to escape it).

find -name '*.md' -exec wc -l {} \;

Note that for each found file, a new invocation of wc happens. This can be altered by changing the command terminator (\;) to +. See the difference between the invocations of the following two commands:

find -name '*.md' -exec echo {} \;
find -name '*.md' -exec echo {} +

Caveats

By default, find prints one filename per line. However, a filename can even contain the newline character (!) and thus the following idiom is not 100% safe.

find -options-for-find | while read filename; do
    do_some_complicated_things_with "$filename"
done

If you want to be really safe, use -print0 and IFS= read -r -d $'\0' filename as that would use the only safe delimiter – \0 (recall what you have heard about C strings – and how they are terminated – in your Arduino course). Alternatively, you can pipe the output of find -print0 to xargs --null.
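
Putting it together, the safer variant of the loop above would look like this (a sketch reusing the same placeholders):

find -options-for-find -print0 | while IFS= read -r -d $'\0' filename; do
    do_some_complicated_things_with "$filename"
done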

However, if you are working with your own files or the pattern is safe, the above loop is fine (just do not forget that directories are files too and their names can also contain \n).

Note that shell allows you to export a function and call back to it from inside xargs.

#!/bin/bash

# Print the arguments the function was called with.
my_function() {
    echo ""
    echo "\$0 = $0"
    echo "\$@ =" "$@"
}
# Export the function so that it is visible in the bash launched by xargs.
export -f my_function

# Run the exported function once (-n 1) for every file found.
find . -print0 | xargs -0 -n 1 bash -c 'my_function "$@"' arg_zero arg_one

make

Move into the 13/make directory first, please. The files in this directory are virtually the same as before, but there is one extra file: Makefile. Notice that Makefile is written with a capital M so that it is easily distinguishable (ls in a non-localized setup sorts uppercase letters first).

This file is a control file for a build system called make that does exactly what we tried to imitate in the previous example. It contains a sequence of rules for building files.

We will get to the exact syntax of the rules soon, but let us play with them first. Execute the following command:

make

You will see the following output (if you have executed some of the commands manually, the output may differ):

pandoc --template template.html index.md >index.html
pandoc --template template.html rules.md >rules.html

make prints the commands as it executes them. It has built the website for us: notice that the HTML files were generated.

Execute make again.

make: Nothing to be done for 'all'.

As you can see, make was smart enough to recognize that since no file was changed, there is no need to run anything.

Update index.md (touch index.md would work too) and run make again. Notice how index.html was rebuilt while rules.html remained untouched.

pandoc --template template.html index.md >index.html

This is called an incremental build (we build only what was needed instead of building everything from scratch).

As we mentioned above, this is not very interesting in our tiny example. However, once there are thousands of input files, the difference is enormous.

It is also possible to execute make index.html to ask for rebuilding just index.html. Again, the build is incremental.

If you wish to force a rebuild, execute make with -B. Often, this is called an unconditional build.

In other words, make allows us to capture the simple individual commands needed for a project build (no matter if we are compiling and linking C programs or generating a web site) into a coherent script. It takes care of dependencies and executes only commands which are really needed.

Makefile explained

Makefile is a control file for the build system named make. In essence, it is a domain-specific language to simplify setting up the script with the [ ".." -nt ".." ] constructs we mentioned above.

Important: Unlike typical programming languages, make makes a difference between tabs and spaces. All indentation in the Makefile must be done using tabs. You have to make sure that your editor does not expand tabs to spaces. It is also a common issue when copying fragments from a web-browser. (Usually, your editor will recognize that Makefile is a special file name and switch to tabs-only policy by itself.) If you use spaces instead, you will typically get an error like Makefile:LINE_NUMBER: *** missing separator. Stop..

The Makefile contains a sequence of rules. A rule looks like this:

rules.html: rules.md template.html
	pandoc --template template.html rules.md >rules.html

The name before the colon is the target of the rule. That is usually a file name that we want to build. Here, it is rules.html.

The rest of the first line is the list of dependencies – files from which the target is built. In our example, the dependencies are rules.md and template.html.

The third part is the following lines, which have to be indented with a tab. They contain the commands that have to be executed to build the target. Here, it is the call to pandoc.

make runs the commands if the target is out of date. That is, either the target file is missing, or one or more dependencies are newer than the target.

The rest of the Makefile is similar. There are rules for other files and also several special rules.

Special rules

The special rules are all, clean, and .PHONY. They do not specify files to be built, but rather special actions.

all is a traditional name for the very first rule in the file. It is called the default rule and it is the one built when you run make with no arguments. It usually has no commands and it depends on all files which should be built by default.

clean is a special rule that has only commands, but no dependencies. Its purpose is to remove all generated files if you want to clean up your work space. Typically, clean removes all files that are not versioned (i.e., under Git control).

This can be considered a misuse of make, but one with a long tradition. From the point of view of make, the targets all and clean are still treated as file names. If you create a file called clean, the special rule will stop working, because the target will be considered up to date (it exists and no dependency is newer).

To avoid this trap, you should explicitly tell make that the target is not a file. This is done by listing it as a dependency of the special target .PHONY (note the leading dot).

Generally, you can see that make has plenty of idiosyncrasies. It is often so with programs which started as a simple tool and underwent 40 years of incremental development, slowly accruing features. Still, it is one of the most frequently used build systems. Also, it often serves as a back-end for more advanced tools – they generate a Makefile from a more friendly specification and let make do the actual work.

Exercise

One. On your own, extend the Makefile to call the generating script (the script is described in the before-class text). Do not forget to update the all and clean rules.

Solution.

Two. Notice that there is an empty out/ subdirectory (it contains only .gitignore that specifies that all files in this directory shall be ignored by Git and thus not shown by git status). Update the Makefile to generate files into this directory. The reasons are obvious:

  • The generated files will not clutter your working directory (you do not want to commit them anyway).
  • When syncing to a webserver, we can specify the whole directory to be copied (instead of specifying individual files).
Solution.

Three. Add a phony target upload that will copy the generated files to a machine in Rotunda. Create (manually) a ~/WWW directory there. Its content will be available at http://www.ms.mff.cuni.cz/~LOGIN/.

Note that you will need to add the proper permissions for the AFS filesystem using the fs setacl command.

Solution.

Four. Add generation of PDF from rules.md (using LibreOffice). Note that soffice supports a --outdir parameter.

Think about the following:

  • Where to place the intermediate ODT file?
  • Shall there be a special rule for the generation of the ODT file or shall it be done with a single rule with two commands?
Solution.

Improving the maintainability of the Makefile

The Makefile is starting to contain too much repeated code.

But make can help you with that too.

Let’s remove all the rules for generating out/*.html from *.md and replace them with:

out/%.html: %.md template.html
	pandoc --template template.html -o $@ $<

That is a pattern rule that captures the idea that HTML is generated from Markdown. Here, the percent sign represents the so-called stem – the variable (i.e., changing) part of the pattern.

In the command part, we use the make variables $@ and $< (they start with a dollar sign, as in the shell). $@ is the actual target and $< is the first dependency. For example, when building out/index.html, $@ expands to out/index.html and $< to index.md.

Run make clean && make to verify that even with pattern rules, the web is still generated.

Apart from pattern rules, make also understands variables. They can improve readability as you can separate configuration from commands. For example:

PAGES = \
      out/index.html \
      out/rules.html \
      out/teams.html

all: $(PAGES) ...
...

Note that unlike in the shell, variables are expanded by the $(VAR) construct.

Non-portable extensions

make is a very old tool that exists in many different implementations. The features mentioned so far should work with any version of make. (At least a reasonably recent one. Old makes did not have .PHONY or pattern rules.)

The last addition will work in GNU make only (but that is the default on Linux, so there should not be any problem).

We will change the Makefile as follows:

PAGES = \
      index \
      rules \
      teams

PAGES_TMP=$(addsuffix .html, $(PAGES))
PAGES_HTML=$(addprefix out/, $(PAGES_TMP))

We keep only the basename of each page and we compute the output path. $(addsuffix ...) and $(addprefix ...) are calls to built-in functions. Formally, all function arguments are strings, but here the whitespace-separated names inside the second argument are treated as a list.

Note that we added PAGES_TMP only to improve readability when using this feature for the first time. Normally, you would assign PAGES_HTML directly:

PAGES_HTML=$(addprefix out/, $(addsuffix .html, $(PAGES)))

This will prove even more useful when we want to generate a PDF for each page, too. We can add a pattern rule and build the list of PDFs using $(addsuffix .pdf, $(PAGES)).
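
As a sketch (assuming the PDFs also go into out/ and are generated via intermediate ODT files, as in the exercise above), the additions could look something like this:

PAGES_PDF = $(addprefix out/, $(addsuffix .pdf, $(PAGES)))

out/%.pdf: out/%.odt
	soffice --headless --convert-to pdf --outdir out $<

out/%.odt: %.md
	pandoc -o $@ $<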

Graded tasks (deadline: May 22)

13/Makefile (100 points)

UPDATE #1: the variable is supposed to be named COURSES (sorry for the typo).

UPDATE #2: we expect you would use GNU Make and use its extensions (such as addprefix).

UPDATE #3: you may find automatic variable $* useful when generating course webpage.

Convert the web page generation in 13/graded into a Makefile-driven project.

The whole web generation is driven by a bin/build.sh script. Convert it to a Makefile to ensure that pages are not recreated needlessly.

Feel free to create helper shell scripts where it makes sense.

Your Makefile must also support the clean target to remove all generated files (public_html directory can be committed with .gitignore and does not need to be created automatically).

You must ensure that modification of one of the pages does not trigger a rebuild of all of them. However, when metadata are modified for any of the pages, the menu is rebuilt (and hence the whole website needs to be rebuilt).

Copy the src/ folder as-is to your project (e.g. you will have file 13/src/NSWI177.meta in your repository).

To allow testing, specify the list of courses in a variable inside the Makefile. With that in place, it is possible to rebuild the website with a subset of courses with a simple make "COURSES=NSWI177" (just use the variable; if you stick with the simplest solution, it will probably work out of the box without extra work).

The list of courses is expected to be specified manually inside your Makefile. That is, feel free to start your Makefile with the following (note that the list can be generated with a bit of ls ... | sed ... so that you do not have to copy all the codes manually):

COURSES = \
    NAIL062 \
    NDBI025 \
    NDMI002 \
    ...
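
For example, the hinted ls ... | sed ... pipeline could be a sketch like the following (assuming the course codes come from the src/*.meta file names):

ls src/*.meta | sed -e 's:^src/::' -e 's:\.meta$::'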

Even if you are an experienced make-r, use this form to simplify testing (or make sure your solution works with our tests).

Learning outcomes

Conceptual knowledge

Conceptual knowledge is about understanding the meaning of the given terms and being able to put them into context. Therefore, you should be able to …

  • name a few steps that are required to turn sources into distributable software

  • explain why making these steps reproducible (i.e. codifying them in some way) is useful

  • explain why it makes sense to have a special programming language for such codification

  • explain the basic concepts of such programming languages

Practical skills

Practical skills are usually about using given programs to solve various tasks. Therefore, you should be able to …

  • use Pandoc to convert between various text formats

  • use xargs

  • use find

  • build make-based projects with default options

  • create a Makefile that builds web pages from Markdown sources

  • create a Makefile that uses wildcard rules

  • improve the readability of a Makefile by using variables (optional)

  • create a trivial template for Pandoc for customized conversion (optional)

  • use basic GNU Make extensions to simplify the Makefiles (optional)

  • use parallel