State machine miracle

How I learned what a state machine is and that good design is an essential part of programming.

🇨🇿 Česká verze je níže / Scroll down for Czech version.

Choosing the right design when coding is pretty much limited by the number of tricks I've already seen in my nine-month career. However, my job is anything but repetitive. (Not complaining!) Most of the things I do, I do for the first time. That includes numerous failures. I have gradually developed a sixth sense for when to stop coding and start asking. That's when Aleš's eyes start shining, I can be sure my code is all wrong, and there's a brand new concept I will learn.

Tracking the state of the microservices

At TeskaLabs, we follow a microservice architecture. Deploying and monitoring all the microservices requires some automation, which has been my task for a few months. I was so happy that I could finally collect correct data about all the services in the cluster when I presented my work to Jakub.
"This is nice. But I don't really care about the data. All I need is to get an alert when some service in the cluster crashes."
Obviously.
I realized that I had entirely skipped the main feature of my growing microservice. Not only do I have to collect the data, but I also have to evaluate them and provide the user with the current state of the microservice.

Ok, no problem, I thought. I know all the data that go through. I will just write a few conditions. If the data say a microservice is running, it is running. No big deal. However, there was more than one data source, and the number of conditions grew geometrically. When I raised this problem in review, Aleš simply said: "You will need a state machine. Draw a picture first."

My code is not a toaster

State machine. Sounds cool. I googled it. It is a mathematical concept introduced by E. F. Moore in 1956. As a biologist, I got especially interested in his futurist paper about Artificial Living Plants, a wish for self-reproducing mechanical machines that use organic sources to produce probably anything and migrate to a factory when ready for harvest. (Wonderful pictures included!) I hoped he had been a better mathematician than a biochemist.

A finite state machine can be in exactly one of a finite number of states at a time. The change from one state to another is called a transition, and it is triggered by an event. All the tutorials show a schema of a toaster. But my code is not a toaster! How should I make software out of this toaster theory? I realized I should start by listing all possible states and events. And I thought how clever I was! I read something about UML diagrams and started drawing. Looking at the two long lists, I decided to start with the happy flow. But the arrows didn't lead anywhere. I couldn't connect all the states or even justify the connections I had already drawn. The data sources somehow didn't click in my mind.
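To make the toaster theory a bit more tangible, here is a minimal sketch of the definition above in plain Python. The state names, event names, and the `Toaster` class are made up for illustration; they come from no particular tutorial or library.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    TOASTING = "toasting"


# (current state, event) -> next state; any pair not listed is not allowed.
TRANSITIONS = {
    (State.IDLE, "lever_down"): State.TOASTING,
    (State.TOASTING, "timer_done"): State.IDLE,
    (State.TOASTING, "cancel"): State.IDLE,
}


class Toaster:
    def __init__(self):
        # The machine is in exactly one state at a time.
        self.state = State.IDLE

    def handle(self, event):
        """Apply an event and return the new state, or raise if it is not allowed."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not allowed in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state
```

Calling `Toaster().handle("lever_down")` moves the machine from `IDLE` to `TOASTING`; an event that has no arrow from the current state raises an error instead of silently doing nothing.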

The next day I had to admit my drawing didn't work, and I needed help.
"Why do you have so many states?"
"Well, when I start the service, I know it is starting, then I get a message that it has started, but I don't know whether it is running," I said, losing my self-confidence with every word.
"What is the difference between starting and restarting?"
"Ehm... Don't know..."
"And what is the unknown state?" asked Aleš, slightly laughing.
"I sometimes just cannot decide."
After a few more rather humiliating questions, there was a picture on a whiteboard with only four bubbles and not many more arrows. In the end, I realized I simply needed an arrow from each state to each state. So the picture looked a bit like this:
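In code, "an arrow from each state to each state" could be sketched roughly as follows. The four state names are hypothetical, since the bubbles on the whiteboard are not named in the text, and leaving out arrows from a state to itself is an assumption of this sketch.

```python
from itertools import permutations

# Hypothetical state names; the four bubbles are not named in the text.
STATES = ("starting", "running", "stopped", "failed")

# One arrow from every state to every other state (self-loops are left out,
# which is an assumption of this sketch).
ARROWS = set(permutations(STATES, 2))


class Twin:
    def __init__(self, initial="stopped"):
        self.state = initial

    def move_to(self, dest):
        """Follow the arrow to dest; return True if the state changed."""
        if dest == self.state:
            return False  # already there, nothing to report
        if (self.state, dest) not in ARROWS:
            raise ValueError(f"no arrow from {self.state!r} to {dest!r}")
        self.state = dest
        return True
```

With four states this gives twelve arrows. The incoming data decide which state comes next, and the return value of `move_to` says whether anything actually changed, which is the moment an alert would make sense.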

If I had done what I should have done

I was happy. For the first time, I had clear instructions on implementing a concept. I chose the small and easy pysm library. All the others seemed like overkill to me, and I got discouraged by their rather lengthy documentation. I was eager to start coding. After several rounds of code review, there was beautiful code, a textbook example of a state machine. Almost like a toaster. Beautiful, but not working.

Had I written unit tests earlier, I would have realized it immediately. Instead, it took me some time to find out that the pysm library wasn't suitable for my case. In my code, there is not only one state machine. There is one state machine for each microservice in the cluster, creating its "digital twin". The states are common to all the state machines, but every state machine is driven by the behavior of its own microservice. I needed each state machine to remember its state and trigger a defined transition when its time came. However, digging into the pysm library code, I realized that the State object (the definition of a state) "remembers" its state machine. This reverse awareness prevented me from sharing the same state definitions among several state machines.
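The shape I was after can be sketched in plain Python like this. It is only an illustration of "shared state definitions, one machine per microservice", not the API of pysm or of any other library; `StateDef`, `ServiceTwin`, the state names, and the service ids are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateDef:
    """A state definition that knows nothing about any machine using it."""
    name: str
    alert: bool = False


# Hypothetical states, defined once and shared by every twin.
RUNNING = StateDef("running")
STOPPED = StateDef("stopped")
FAILED = StateDef("failed", alert=True)
STATES = {s.name: s for s in (RUNNING, STOPPED, FAILED)}


class ServiceTwin:
    """One state machine per microservice; only the twin remembers its state."""

    def __init__(self, service_id, initial=STOPPED):
        self.service_id = service_id
        self.current = initial
        self.alerts = []

    def observe(self, state_name):
        """Move to the observed state and record an alert if that state asks for one."""
        new = STATES[state_name]
        if new is self.current:
            return
        self.current = new
        if new.alert:
            self.alerts.append(f"{self.service_id} is {new.name}")


# One twin per (hypothetical) service, all sharing the same state definitions.
twins = {sid: ServiceTwin(sid) for sid in ("api", "worker")}
twins["api"].observe("running")
twins["worker"].observe("failed")
```

Because a `StateDef` holds no reference back to a machine, the same three objects can serve any number of twins, and each twin moves independently of the others.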

I had to choose another library and write the state machine implementation from scratch. However, this time I knew my criteria for selecting the library, and the documentation that I had found rather complicated at the beginning turned out to be quite readable. Using the transitions library allowed me to refactor my code into something magical, testable, and hopefully sustainable.

I found using the state machine concept miraculous. I enjoyed how the initial image, drawn on the whiteboard, appeared printed in the code. Identifying a state machine in the task gave me a blueprint for my Python code. When I got a bit more familiar with the concept, I learned that other problems of changing types of behavior in response to events could be solved using a state machine. And much more. I'll let you know when I understand how parsers work. :)

🇨🇿 Zázrak konečného automatu

Aneb jak jsem se naučila, co to je „state machine“, a že esenciální součástí programování je dobrý design.

Má schopnost zvolit správný design je ostře limitovaná množstvím triků, které jsem zatím stihla za svou devítiměsíční kariéru pochytit. Jenže poslední, na co bych si mohla stěžovat, je repetitivní zaměstnání. Každý den dělám něco poprvé. (Nestěžuji si!) Stává se mi tedy docela často, že mé dovednosti a znalosti zase nestačí. Za těch pár měsíců jsem si ale vyvinula jakýsi šestý smysl, který mi říká, že mám přestat psát a začít se ptát. To je moment, kdy se Alešovi rozsvítí oči a nenechá mě ani domluvit. Je to chvíle, kdy si můžu být jistá, že můj kód je úplně špatně, a nezbývá než se naučit nový trik.

Jak sledovat stav mikroservis

O mikroservisové architektuře, které se v TeskaLabs držíme, již bylo něco napsáno. V posledních pár měsících je mým úkolem automatizace monitorování a nasazování mikroservis. Měla jsem takovou radost, když jsem konečně dokázala dostat správná data o všech mikroservisách v klastru, a tak jsem ukázala výsledek své práce Jakubovi.
„To je moc hezký,“ řekl Jakub, a já věděla, že přijde „ale“.
„... , ale já se na to nechci koukat. Chci dostat upozornění, když nějaký kontejner spadne.“
No jasně!
Úplně jsem přeskočila hlavní funkci svého rostoucího softwaru. Nejen, že musím data sbírat, ale jsem také zodpovědná za jejich vyhodnocování. Uživateli pak musím komunikovat stav jednotlivých mikroservis.

Nevadí, říkala jsem si. Znám všechna data, tak jen napíšu pár podmínek. Když data říkají, že kontejner běží, tak prostě běží. Co na tom? Jenže data nepřicházejí jen z jednoho zdroje, je jich víc. Množství podmínek tak narůstalo geometrickou řadou. Když jsem na review blekotala o tom, kde jsem se zasekla, Aleš mě přerušil a řekl jenom: „Potřebuješ state machine. Začni obrázkem.“

Můj kód není topinkovač

State machine. Začala jsem googlit. Přestože mé vyhledávání i většina mých myšlenek již dlouho ubíhá v angličtině, dovolte mi trochu obrozenecké nálady a použití pojmu konečný automat. Tento matematický koncept představil v roce 1956 E. F. Moore. Jako vystudovanou bioložku mě zaujal jeho futuristický článek o uměle vytvořených, ale živých rostlinách (Artificial Living Plants). Pan Moore v tomto článku popisuje svou vizi mechanických strojů, které by se samy replikovaly, byly schopné využívat organické látky, dokázaly z nich vyrábět prakticky cokoliv, a dokonce samy migrovat do zpracovatelských závodů, když přijde čas sklizně. (Článek doporučuji už jen pro ty úžasné obrázky!) Doufala jsem jen, že byl lepším matematikem, než biochemikem.

Konečný automat se může nacházet vždy jen v jednom z konečného počtu stavů. Přechod z jednoho stavu do druhého je spuštěn pouze s příchodem specifického vstupu. Každý internetový návod začíná popisem topinkovače. Jenže můj kód není topinkovač! Jak mám opékání převést v software sledující stav mikroservis? Rozhodla jsem se začít tím, že vypíšu všechny možné stavy a vstupy. A říkala jsem si, jak jsem strašně chytrá! Přečetla jsem si něco málo o UML diagramech a dala se do kreslení. Když jsem se podívala na ty dva dlouhé seznamy, začala jsem s „happy flow“ – nejčastějším scénářem. Nakreslila jsem stavy jako bubliny. To bylo jednoduché. A pak jsem je začala spojovat šipkami. Jenže ty šipky vůbec nikam nevedly. Nedařilo se mi pospojovat všechny stavy v jednu síť a ani už jsem si nebyla jistá, co všechny ty šipky znázorňují. Nějak se datové zdroje v mé hlavě nespojily.

Nazítří mi tedy nezbylo nic jiného než přiznat, že můj obrázek nefunguje a potřebuji pomoct.
„Proč máš tolik stavů?“
„No, přece když zapnu kontejner, tak vidím, že startuje (starting), pak dostanu zprávu, že nastartoval (started), ale ještě nevím, jestli opravdu běží,“ obhajovala jsem se, ale s každým dalším slovem jsem si byla méně a méně jistá svou úvahou.
„Jaký je rozdíl mezi starting a restarting?“
„Já... vlastně ani nevím.“
„A co znamená stav unknown?“ A to už se Aleš otevřeně smál.
„Prostě se někdy neumím rozhodnout,“ odpovídala jsem bezradně.
Po pár dalších podobně nepříjemných otázkách jsem tedy dostala dobrou radu. Na tabuli jsme nakreslili jen čtyři bubliny a ne o mnoho víc šipek. Při bližším ohledání jsem zjistila, že jednoduše potřebuji šipku (přechod) z každého stavu do každého. Takže nakonec ten obrázek vypadal nějak takhle:

Kdybych bývala byla chytřejší

Byla jsem spokojená. Poprvé jsem měla v ruce jasný návod, jak uchopit nějaký koncept. Vybrala jsem si malou a jednoduchou knihovnu pysm pro implementaci konečného automatu v Pythonu. Všechny ostatní mě odrazovaly svou nekonečnou dokumentací, funkcemi, které jsem nepotřebovala, a (tedy) nutností proniknout knihovně hlouběji do duše. Po několika kolech pokusů, omylů a code review jsem se dostala k něčemu, co vypadalo nádherně. Byl to téměř učebnicový příklad konečného automatu. Skoro jako topinkovač. Nádherný, ale nefunkční.

Kdybych si napsala unit testy, přišla bych na to hned. Místo toho jsem ale strávila pár hodin hledáním, proč se mi knihovna pysm vlastně vůbec nehodí. Neobsluhuji totiž jen jediný konečný automat, ale tolik automatů, kolik je mikroservis v klastru. Pro každý kontejner vytvářím jakési „digitální dvojče“. Stavy jsou stejné pro všechny přítomné automaty, ale každý z nich sám řídí, ve kterém stavu se nachází podle chování mikroservis „tam venku“. Nakonec jsem zjistila, že v knihovně pysm si objekt State (tedy definice stavu) pamatuje svůj konečný automat (digitální dvojče), ke kterému patří. Tento oboustranný vztah mezi objekty stavu a samotného automatu mě nakonec donutil vybrat si jinou knihovnu.

Napodruhé už jsem byla obezřetnější. Už jsem znala kritéria a dokumentace, která se mi prve zdála neprostupná, najednou vypadala docela čitelně. Knihovna transitions mi nakonec umožnila úplně zázračně refaktorovat kód, napsat testy, a vytvořit tak snad udržitelný způsob, jak monitorovat stav mikroservis v našem ekosystému.

Koncept konečného automatu se mi zalíbil. Vypadalo to jako kouzlo, když se obrázek nakreslený na tabuli najednou zjevil v kódu. Nalezení vzoru v dané úloze mi dalo jasný návod, jak ji řešit. Bylo to trochu jako stavět skříň z Ikey. Když jsem si osvojila konečný automat jako kladivo, najednou všechno kolem mě začalo vypadat jako hřebíky. Mnoho situací, kde dochází ke změnám typů chování na základě nějakých vstupů, lze řešit pomocí konečných automatů. Až se naučím, jak fungují parsery, dám vám vědět. :)

About the Author

Eliška Novotná

Junior backend developer at TeskaLabs. Python and unicorns lover.


Tags: development, tech, eliska
