Infrastructure as Code (IaC) is an approach in which infrastructure is described as code, and that code is then applied to make the necessary changes. IaC does not dictate exactly how to write the code; it only provides the tools. Good examples are Terraform, Ansible, and Kubernetes itself, where you do not say what to do but rather declare the state you want your infrastructure to reach.
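To make the difference between "saying what to do" and "declaring the desired state" more concrete, here is a minimal sketch in Python. It is a toy model and not how Terraform, Ansible, or Kubernetes are actually implemented; the function name `plan` and the resource names are made up for illustration.

```python
def plan(current: dict, desired: dict) -> list:
    """Compare the current state with the desired one and list the actions needed."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical resources: the author only edits `desired`, never the actions.
current = {"web": {"size": "small"}, "old-cache": {"size": "small"}}
desired = {"web": {"size": "large"}, "db": {"size": "medium"}}

print(plan(current, desired))
# [('update', 'web'), ('create', 'db'), ('delete', 'old-cache')]
```

The point of the sketch is that the person writing the code only describes `desired`; working out the individual steps is left to the tool.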
Keep the infrastructure code readable. Your colleagues should be able to understand it easily and, if necessary, extend or test it. Obvious as it seems, this point is often forgotten, resulting in “write-only code”: code that can be written but cannot be read. Even its author is unlikely to understand what they wrote and figure out how it all works just a few days later.
An example of good practice is keeping all variables in a separate file. This is convenient because you do not have to search for them throughout the code. Just open the file and immediately find what you need.
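The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `VARIABLES` mapping stands in for a dedicated variables file, and the variable names and the template are hypothetical. A side benefit of having one place for variables is that it becomes easy to check whether a template uses something that was never defined.

```python
from string import Template

# Stand-in for a dedicated variables file (hypothetical names and values).
VARIABLES = {
    "app_port": "8080",
    "app_user": "deploy",
}

# Hypothetical config template that only refers to variables by name.
CONFIG_TEMPLATE = Template("listen ${app_port};\nuser ${app_user};\n")


def missing_variables(template: Template, variables: dict) -> set:
    """Return placeholder names used in the template but not defined."""
    used = set()
    for match in template.pattern.finditer(template.template):
        name = match.group("named") or match.group("braced")
        if name:
            used.add(name)
    return used - set(variables)


print(missing_variables(CONFIG_TEMPLATE, VARIABLES))  # set()
print(CONFIG_TEMPLATE.substitute(VARIABLES))
```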
Adhere to a consistent coding style. For example, you may want to keep the maximum line length somewhere between 80 and 120 characters. If the lines are very long, the editor starts wrapping them. The wrapped lines break up the overall view and get in the way of understanding the code. One has to spend a lot of time just figuring out where a line starts and where it ends.
It is good to have the coding style check automated, at the very least by running it in the CI/CD pipeline. Such a pipeline could have a lint step: static analysis of what has been written, which helps to identify potential problems before the code is applied.
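As a rough illustration of what such a check does, here is a minimal line-length check in Python. The function name `find_long_lines` and the default limit are assumptions made for this sketch; in practice an existing linter for the tool in question would normally do this job.

```python
def find_long_lines(text: str, max_length: int = 120) -> list:
    """Return (line_number, length) for every line longer than max_length."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_length:
            problems.append((number, len(line)))
    return problems


sample = "short line\n" + "x" * 130 + "\nanother short line\n"
print(find_long_lines(sample))  # [(2, 130)]
```

A pipeline step built around a check like this would fail the build whenever the returned list is not empty, so the problem is caught before the code is applied.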
Use Git repositories the same way developers do. By that I mean working in branches, linking branches to tasks, reviewing what has been written, opening Pull Requests before making changes, etc.
To a solo maintainer these practices may seem redundant; it is common for people to simply come in and start committing. However, even in a small team it can be difficult to understand who made a change, when, and why. As the project grows, such practices will increasingly help you understand what is happening and keep the work from turning into a mess. Therefore, it is worth investing some time in adopting development practices for working with repositories.
Infrastructure as Code tools are typically associated with DevOps. We know DevOps engineers as specialists who not only handle maintenance but also help developers work: they set up pipelines, automate test runs, etc. All of this also applies to IaC.
Automation should be applied to Infrastructure as Code as well: lint rules, testing, automatic releases, etc. Having repositories with, say, Ansible or Terraform code that is rolled out manually (by an engineer starting a task by hand) is not good enough. First, it is difficult to track who launched it, why, and when. Second, it is impossible to understand afterwards how the run went and to draw conclusions.
With everything kept in the repository and controlled by an automatic CI/CD pipeline, we can always see when the pipeline was launched and how it performed. We can also control the parallel execution of pipelines, identify the causes of failures, quickly find errors, and much more.
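The kind of record a pipeline keeps for every run can be sketched as follows. This is a simplified model, not the API of any real CI/CD system; `RunRecord`, `run_tracked`, the user names, and the task identifiers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class RunRecord:
    user: str
    reason: str
    started_at: datetime
    succeeded: bool
    error: str = ""


def run_tracked(task: Callable[[], None], user: str, reason: str,
                history: List[RunRecord]) -> RunRecord:
    """Run a task and record who ran it, why, when, and how it ended."""
    started_at = datetime.now(timezone.utc)
    try:
        task()
        record = RunRecord(user, reason, started_at, succeeded=True)
    except Exception as exc:  # keep the failure in the history instead of losing it
        record = RunRecord(user, reason, started_at, succeeded=False, error=str(exc))
    history.append(record)
    return record


def failing_rollout() -> None:
    raise RuntimeError("package not found")


history: List[RunRecord] = []
run_tracked(lambda: None, "alice", "TASK-1: open port 8080", history)
run_tracked(failing_rollout, "bob", "TASK-2: install cache", history)

for record in history:
    print(record.user, record.reason, record.succeeded, record.error)
```

With a manual rollout none of this is written down anywhere; with a pipeline it comes for free on every run.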
You can often hear from maintainers that they do not test the code at all, or that they just run it somewhere on dev first. This is not the best practice, because nothing guarantees that dev matches prod. With Ansible or other configuration management tools, typical testing looks something like this:
- they launch a test run on dev;
- the rollout on dev crashes with an error;
- they fix the error;
- the test is not really run again, because dev is already in the state they were trying to bring it to.
It seems that the error has been fixed and you can roll out to prod. But what will happen on prod? It is always a matter of luck: hit or miss. If something fails again somewhere in the middle, the error gets fixed and everything is restarted.
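One way to take luck out of this is to start every test from a known clean state and to check that a second run changes nothing. The sketch below shows the idea on a toy model; `apply_role` and the host dictionary are hypothetical stand-ins for a real role and a real machine.

```python
import copy


def apply_role(host: dict) -> list:
    """Toy stand-in for a role: bring the host to the desired state, return the changes made."""
    changes = []
    if host.get("nginx") != "installed":
        host["nginx"] = "installed"
        changes.append("install nginx")
    if host.get("port") != 8080:
        host["port"] = 8080
        changes.append("set port")
    return changes


def check_role(clean_host: dict) -> None:
    """Apply the role twice to a copy of a clean host and verify the result."""
    host = copy.deepcopy(clean_host)  # never reuse a host that earlier runs have touched
    first = apply_role(host)
    second = apply_role(host)
    assert first, "the first run on a clean host should change something"
    assert second == [], f"the second run should change nothing, got {second}"
    assert host["nginx"] == "installed" and host["port"] == 8080


check_role({})
print("role converges from a clean state and is idempotent")
```

Because the check always starts from a copy of the clean state, a fix is verified against the same starting point that prod will see, not against a dev host that earlier attempts have already modified.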
But infrastructure code can and should be tested. Yet even specialists who know about different testing methods often cannot use them. The reason is that Ansible roles or Terraform files are usually written without any thought for how they will be tested.
In an ideal world, a developer knows at the moment of writing the code what needs to be tested. Accordingly, the developer plans how to test the code before starting to write it, an approach commonly known as TDD (test-driven development). Untested code is low-quality code.
Exactly the same applies to infrastructure code: once it is written, you should be able to test it. Decent testing reduces the number of errors and makes life easier for colleagues who will later work on your Ansible roles or Terraform files.
A few words about automation. A common situation when working with Ansible is that even when something could be tested, the testing is not automated. Usually it goes like this: someone creates a virtual machine, takes a role written by colleagues, and launches it. Then that person realizes that something new has to be added to the role, adds it, and launches the role again on the same virtual machine. Then they realize that even more changes are required, and also that the current virtual machine has already been brought to some intermediate state, so it has to be destroyed, a new virtual machine created, and the role rolled out onto it. If something does not work, this procedure has to be repeated until all errors are eliminated.
Usually the human factor comes into play: after the N-th repetition, one becomes too lazy to delete the VM and create it again. Once everything seems to work as it should (this time), it is tempting to freeze the changes and roll them out to the prod environment. In reality, errors can still occur, and that is why automation is needed. When changes go through automated pipelines and Pull Requests, bugs are identified faster and are prevented from reappearing.
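The create, apply, verify, destroy loop described above is exactly the part that is worth handing over to a machine. Here is a minimal sketch of such a loop, assuming a simulated VM; `fresh_vm`, `roll_out_role`, and `run_role_check` are hypothetical names, and a real setup would call a virtualization or cloud API instead of building a dictionary.

```python
from contextlib import contextmanager

created, destroyed = [], []


@contextmanager
def fresh_vm(name: str):
    """Hypothetical stand-in for creating a VM and always destroying it afterwards."""
    vm = {"name": name, "packages": set()}
    created.append(name)
    try:
        yield vm
    finally:
        destroyed.append(name)  # teardown runs even if the role or the checks fail


def roll_out_role(vm: dict) -> None:
    """Toy stand-in for rolling a role out onto the VM."""
    vm["packages"].add("nginx")


def run_role_check(run_id: int) -> bool:
    """One automated iteration: new VM, apply the role, verify, destroy."""
    with fresh_vm(f"test-{run_id}") as vm:
        roll_out_role(vm)
        return "nginx" in vm["packages"]


results = [run_role_check(i) for i in range(3)]
print(results, created == destroyed)  # [True, True, True] True
```

Since the teardown is part of the loop itself, nobody has to decide on the N-th repetition whether recreating the VM is worth the effort.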