So I thought you wanted to write a Static Site Generator
published on 2025-01-29
I did!
It’s been a while now since I posted about why I decided to write my own static site generator, what I chose to use for templating and teased that I would post more about it.
So it has been pretty much done now for a while (at least for my current use case) – I just never felt like writing anything about it. Let’s do a quick (maybe not so quick after all) rundown of how it works and what I ended up writing.
Part 1: Where I ramble about Umbrella Projects
I decided pretty early on to split the sitegen into two separate Elixir applications:
- The sitegen itself. This way I can easily run it in CI with the minimal dependencies necessary to build my site.
- The development server. It still doesn’t have a functional file watcher, but it’s nice during development to quickly have a preview of what the site looks like (even if I have to restart it for every build). Its dependencies are marked as `only: :dev` so they can be skipped if one (read: CI) is only interested in the sitegen itself.
Both are contained in the same repository as part of an Elixir umbrella project. I fought for a while with how this works, because there are a couple issues with it if you want to run the applications independently.
Umbrella projects are designed to start all applications they contain; there’s just no good way around it. If I do a `mix run` in the repository root it will always start both the sitegen and the development server (as well as all applications I added as dependencies). This means if I install only the production dependencies for the sitegen in CI (`mix deps.get --only prod`) it will fail to compile, because Mix (the Elixir build tool) will try to compile and run all applications.
I played around a bit with ways around this, including `mix cmd --app ocarrd mix run` (which is somewhat deprecated?) and the newer `mix do --app ocarrd run`, but neither offered a satisfying solution.
- The `mix cmd` approach actually worked. This is because it changes to the directory of the application to execute the given command – which can be any command and not just those provided by Mix. The only issue I had with it is that color in the application log didn’t work.
- The `mix do` approach did not work. It is mainly meant to chain multiple Mix tasks, like performing different build steps, and (I assume) less for actually running the applications.
Since I run everything through a `justfile` anyway I eventually decided to just `cd` to the sitegen directory in CI, this way easily solving the issue I had with `mix cmd` instead of spending more hours coming up with a way more complicated solution.
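In the justfile that looks roughly like this (the recipe and directory names here are made up for illustration, not copied from my actual justfile):

```just
# Build the site in CI from inside the sitegen app,
# so Mix never tries to start the development server.
build-ci:
    cd apps/sitegen && mix deps.get --only prod && mix run
```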
The actual communication between the applications is more straightforward (if we skip over the part where I overcomplicated it for a while). But first I gotta confess that–
Part 2: There’s actually three applications
This is not surprising at all if you think about it.
Apart from the sitegen and the development server you also need a site itself! I could’ve come up with a configuration-only approach, but having the site as its own application gives me way more flexibility in deciding what a specific site needs. More on that in a bit, but first we have to finish talking about the development server. So let’s break down what each of the applications does when it starts:
- The sitegen starts the builder process and listens for requests (it doesn’t build anything yet).
- The development server registers itself as an observer with the sitegen. It then waits for a message from the sitegen telling it that there’s something to build. Finally it asks the sitegen to (re)build whatever it has (a rough sketch of this handshake follows the list). This could support building multiple sites in the future. It would even work with new site applications being started while the sitegen and development server are still running. I still haven’t implemented a file watcher, but if it were implemented the development server could simply ask the sitegen for another rebuild.
- The actual site. It always tells the sitegen “these are the things you need to do to build me”. Currently it has logic that checks whether the development server is available. If the server is available, it doesn’t do anything (apart from telling its requirements to the sitegen). If it is not, it triggers a build. I should probably refactor this logic so it is part of the sitegen. Hmmm.
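Here’s what that handshake could look like from the development server’s side. This is only a sketch based on the description above – the module and message names (`Sitegen.Builder`, `:site_registered`) are made up, not my actual code:

```elixir
defmodule DevServer do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(_) do
    # Register ourselves as an observer with the sitegen's builder.
    Sitegen.Builder.register_observer(self())
    {:ok, nil}
  end

  @impl true
  def handle_info({:site_registered, site}, state) do
    # The sitegen told us there's something to build: ask for a (re)build.
    Sitegen.Builder.build(site)
    {:noreply, state}
  end
end
```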
Now that we’ve gotten the development server out of the picture we can focus more on what the sitegen actually does.
Part 3: So what about the sitegen
The sitegen contains a couple modules that a site can use to build itself.
The Builder
This is the only actually required module.
The site defines a list of tasks (function references) that it hands over to the Builder. For my current site these are the following:
- Pages: This uses another module from the sitegen to build all its pages.
- Static Images: These are static images that should be copied to the output directory as is.
- Sass: This runs the Sass compiler to generate all the CSS. I’m barely using any Sass features, but this was the easiest way to inject some config and also minify the CSS. (Without adding NodeJS dependencies. I really do not want to do that.)
What I like about this is that apart from what is done inside the Pages task, everything else is defined entirely in the site itself. The site defines how to copy the images to the output directory. The site defines how to run the Sass compiler and inject SCSS into it to set the base URL for image references etc.
It probably makes sense to move some of these functions into modules inside the sitegen at some point. Not all sites need images or custom stylesheets or–. If I were to move my personal website (the one you’re on right now) away from Pandoc and a bunch of shell scripts I could, for example, easily add a task that generates an RSS feed. And if that RSS generator works well enough I might move it into the sitegen itself, create a separate application, or just plug in an existing generator. I could also migrate only some parts and create a task that calls Pandoc to still render the blog posts.
If you remember my frustrations with Gatsby it probably makes sense (I hope) why it ended up this way. I wanted to have pluggable building blocks that just do a thing and don’t require you to implement stuff that interacts with The Framework™.
Anyway. The Builder takes this list of tasks and executes them in parallel. Since my current tasks don’t depend on each other this seemed like a good way to do things. And if your site (or more likely my next one) has tasks that depend on each other you can just do your own thing.
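“Take a list of tasks and execute them in parallel” maps naturally onto `Task.async_stream` in Elixir. This is a sketch of the general pattern, not my exact Builder code (the module and task names are placeholders):

```elixir
defmodule Sitegen.Builder do
  # Runs every task (a zero-arity function reference) concurrently
  # and waits for all of them to finish.
  def run_tasks(tasks) do
    tasks
    |> Task.async_stream(fn task -> task.() end, timeout: :infinity)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# A site hands over its tasks as function references:
Sitegen.Builder.run_tasks([
  &MySite.build_pages/0,
  &MySite.copy_static_images/0,
  &MySite.run_sass/0
])
```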
The Pages
Pages are mainly a registry that a site can register its pages in. Each page consists of a path that it should be served under and a function that generates it.
The path is automatically sanitized of symbols that shouldn’t appear in URLs, and if it contains a plain name (without `.html` at the end) the registry assumes that it should create a directory with an `index.html` file instead.
The function simply returns a string containing the final HTML. Due to the nature of the Taggart library (see my previous post for more details) all you have to do is call a function that contains a template. Or you can do whatever you want! While my sitegen has other modules that depend on Taggart, it does not require it at all to generate the pages. You can load your own HTML file from disk and return its contents inside the function.¹ Or load any other file you want to serve and maybe run some preprocessing on.
Finally you simply call `Pages.write_all()` to write all pages inside the registry to the output directory.
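A hypothetical usage sketch – `Pages.register/2` is my guess at the registration function’s name; only `Pages.write_all/0` is actually named above:

```elixir
# "about" has no .html suffix, so it becomes about/index.html.
Pages.register("about", &MySite.AboutPage.render/0)

# Pages can also be registered under an explicit file name.
Pages.register("posts/hello-world.html", fn -> MySite.Post.render("hello-world") end)

# Write every registered page to the output directory.
Pages.write_all()
```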
One last thing the Pages registry does is let you verify that a page actually exists. For example, if you use the Link component provided by the sitegen, it will throw an error if you link to a page that does not exist.
Components
We’re finally getting to the things that I actually wrote the sitegen for.
I wanted to easily add new media formats that weren’t supported by Gatsby plugins, so I wrote an Image and a Video component (as well as the mentioned Link component) to offer a simple way to add responsive media of different file types to a page. A “component” in this case is simply a function that accepts a map of all the different media, plus additional data like the alt text to attach to the HTML element.
For example the function for the Image component looks as follows (slightly simplified in order to take up less space):
```elixir
def image(media_map, alt: alt, sizes: sizes) do
  {fallback, media_map} = pop_fallback(media_map)
  {src, srcset} = to_fallback_src(fallback)

  picture do
    for {type, media_list} <- sort_by_prio(media_map) do
      source(type: type, srcset: to_srcset(media_list), sizes: sizes)
    end

    img(alt: alt, src: to_absolute_url(src), srcset: srcset, sizes: sizes)
  end
end
```
A media map is structured as follows:
```elixir
%{
  "image/png": [
    %Media{width: …, height: …, abs_url: …, …},
    …
  ],
  "image/avif": […]
}
```
First the component removes the fallback type – the one that should be displayable in all browsers – from the media map and then uses Taggart to create a `picture` element. This contains a `source` element for each of the remaining types (sorted by the priority in which they should be used) as well as a regular `img` element that’s used for the fallback type.
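The generated markup then looks roughly like this (the paths and widths are illustrative values, not actual output):

```html
<picture>
  <source type="image/avif" srcset="/media/abc123/def456.avif 640w, /media/abc123/9876fe.avif 1280w" sizes="100vw">
  <img alt="A description" src="/media/abc123/0011aa.png" srcset="/media/abc123/0011aa.png 640w, /media/abc123/bb22cc.png 1280w" sizes="100vw">
</picture>
```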
But where does the media map come from?
Media State and Transformer
Let’s start with the Media State because that’s the part a page interacts with directly.
In order to get the media map that should be passed to a component you simply ask the State to get it for you:
```elixir
%MediaWrap{category: category, type: type, value: media} =
  MediaState.get_or_compute(file, opts)
```
As you can see you pass the path to the media file, as well as a list of options (for example `rendering: "pixelated"` to use nearest-neighbor scaling). It returns a `MediaWrap` struct that contains the detected category (image or video), the type of the data (a single “unresponsive” file or a media map) and the value itself.
Since the function is named `get_or_compute` it computes the result the first time and then caches it for all subsequent calls. The media state itself isn’t that interesting: it registers callers as observers, starts a process in which the Transformer does all the actual work, and then tells everyone that the result is ready.
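To make that concrete, here is a minimal sketch of the get-or-compute-with-observers pattern. The `Transformer.transform/2` call is taken from the description below; the GenServer layout, state shape, and message names are my assumptions, not the actual MediaState code:

```elixir
defmodule MediaState do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  def get_or_compute(file, opts) do
    GenServer.call(__MODULE__, {:get_or_compute, file, opts}, :infinity)
  end

  @impl true
  def init(_), do: {:ok, %{cache: %{}, observers: %{}}}

  @impl true
  def handle_call({:get_or_compute, file, opts}, from, state) do
    key = {file, opts}

    case Map.fetch(state.cache, key) do
      {:ok, wrap} ->
        # Already computed once: answer straight from the cache.
        {:reply, wrap, state}

      :error ->
        # Start the work only once; later callers just join as observers.
        unless Map.has_key?(state.observers, key) do
          parent = self()
          Task.start(fn -> send(parent, {:computed, key, Transformer.transform(file, opts)}) end)
        end

        {:noreply, update_in(state.observers[key], &[from | &1 || []])}
    end
  end

  @impl true
  def handle_info({:computed, key, wrap}, state) do
    # Tell every waiting observer that the result is ready, then cache it.
    Enum.each(state.observers[key] || [], &GenServer.reply(&1, wrap))

    {:noreply,
     %{state | cache: Map.put(state.cache, key, wrap), observers: Map.delete(state.observers, key)}}
  end
end
```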
Transformer is a behaviour that can be implemented to provide transformations for a specific file type. It also provides a generic `transform(file, opts)` function that tries all known transformers until it finds one that works. From the arguments you can see that this is pretty much exactly what also gets passed to the Media State.
Because it is a behaviour, it specifies the following functions that each transformer should implement:
```elixir
extensions()
get_category(path)
get_metadata(path)
create_intermediates(input)
transform_new(input, intermediates)
```
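Declared as an Elixir behaviour this could look roughly like the following – the type specs are my guesses, not copied from the sitegen:

```elixir
defmodule Sitegen.Transformer do
  # Hypothetical spec sketch; the real types may differ.
  @callback extensions() :: [String.t()]
  @callback get_category(path :: String.t()) :: :image | :video
  @callback get_metadata(path :: String.t()) :: %{width: pos_integer(), height: pos_integer()}
  @callback create_intermediates(input :: term()) :: [term()]
  @callback transform_new(input :: term(), intermediates :: [term()]) :: [term()]
end
```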
`extensions/0` returns the list of file extensions that this transformer supports. The generic transform function has a list of all transformers and calls this function in order to decide which transformer to use.
`get_category/1` returns the category that a given file has. This is the same category used by the Media State earlier, and can currently be either `image` or `video` (and requires the file path as an argument since one transformer could support both images and videos).
`get_metadata/1` returns metadata about a file such as its width and height.
Finally, there are the two functions involved in the actual transformation.
`create_intermediates/1` is the main function that gets called every time to create a list of “intermediates” out of some input data (the file name and associated options like `rendering: "pixelated"`). Each intermediate corresponds to one output file that will later be written when doing the actual transformation. There are two important things happening here:
- It sets the actual transformation options based on the input options (for example turning `rendering: "pixelated"` into the proper name of the resampling kernel to use). This happens on a per-output basis because it may be desired to interpret the input options differently for specific image sizes or formats.
- It generates the output file name based on the input file and all the options. This has the general form `<input-hash>/<output-hash>.<ext>` and ensures that the same input will always generate the same output (one way to derive such a name is sketched below).
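The hashing scheme itself isn’t spelled out above, so here is one hedged way such a deterministic path could be derived – the module name, hash choice, and truncation are all my assumptions:

```elixir
defmodule OutputName do
  # Hash an Erlang term with SHA-256 and keep a short hex prefix.
  defp hash(term) do
    :crypto.hash(:sha256, :erlang.term_to_binary(term))
    |> Base.encode16(case: :lower)
    |> binary_part(0, 16)
  end

  # The same input file and options always yield the same
  # <input-hash>/<output-hash>.<ext> path.
  def path(input_file, opts, ext) do
    "#{hash(File.read!(input_file))}/#{hash({input_file, opts})}.#{ext}"
  end
end
```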
Afterwards the transformer can split the list of intermediates into ones for which files have already been generated (by checking if the output path exists) and ones that have yet to be generated (by calling `transform_new/2` on the intermediates). This is especially useful when two sets of inputs overlap in the types of files that should be generated. Imagine one that wants images in the scales `[0.25, 0.5, 0.75]` and one that wants them in the scales `[0.5, 1.0]` – the 0.5-scale image only needs to be generated once and the output path will be identical for both.
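The split itself can be as simple as the following sketch (the field and variable names are assumed):

```elixir
# Reuse intermediates whose output file already exists; only the rest
# are handed to transform_new/2.
{existing, missing} =
  Enum.split_with(intermediates, fn i -> File.exists?(i.output_path) end)

media = existing ++ transform_new(input, missing)
```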
`transform_new/2` mostly calls a program or library with the necessary options for the transformation. Libraries like vips allow you to create a kind of pipeline that first transforms the input image to the desired output sizes and then saves each one in all the desired image formats. This way not every intermediate has to go through the whole transformation process and the shared steps are only performed once. Since I decided pretty early on to first implement a transformer around vips and wanted to support this feature, that is also the reason this function accepts a list of intermediates instead of performing the transformation only for a single output at a time.
At the end the transformer concats the lists of existing and newly generated media and wraps them in `MediaWrap`.
Conclusion?
I hope this gave you some insight into how I decided to implement my sitegen. I tried to minimize the amount of code shown to only complement the text, but maybe that made things more difficult to understand. I actually sat on the last section for a couple of weeks because I wasn’t sure how detailed my description of the transformer itself should be. Overall I had a lot of fun working on the sitegen and learned some new things about Elixir that I didn’t get to properly use before. As mentioned in the beginning, it pretty much does what I need it to do now. I can think of a couple things that could be improved, but since it is just a hobby project I’ll get to it when I get to it and focus on other stuff in the meantime.
Between starting this blog post and finishing it I actually implemented two new features that I had been thinking about for a while! The first one is setting an image “placeholder” to avoid content shift while the images are still loading (all without JavaScript). The second one is a new component that automatically downloads fonts from an external stylesheet, together with a new transformer that uses the features described earlier to transform the stylesheet into a local one if that hasn’t happened already. And going over everything again for this post gave me a couple more ideas.
Who knows if I’ll write a third part. For now I’m happy that I managed to collect my thoughts on both the issues I had with existing sitegens and how I decided to solve them (for my particular use case).
If you’re interested in more than just my attempt to summarize everything you can check out the code in my repository or visit the example site where I try out the features before implementing them in my own site!
¹ As long as it returns `{:safe, your_string_like_thing}`. Hey, this is pretty much the same footnote I wrote in the previous post. I guess I just think it’d be too much noise to mention this in the main body and instead decided to put it in a footnote that most people are gonna click on anyway.