
Anyone who gives a self-learning algorithm control over a large commercial grid connection will sooner or later face the question we posed at the end of the first article: how do you prevent a system that constantly seeks rewards from making decisions that jeopardize operational reliability? The short answer: not by deploying AI everywhere, but by carefully choosing where to apply it and where not to. A self-learning system excels at handling uncertainty and volatile market prices. For specific customer topologies and strict physical limits, however, AI is precisely the wrong tool.

This principle requires three things: a system that remains under control in all circumstances, a well-considered choice in how you train the algorithm, and an architecture that scales to hundreds of locations without customisation becoming dominant. We'll start with the most important aspect: control.
A common concern with Deep Reinforcement Learning (DRL) is its black box nature: the algorithm receives data as input and provides an action as output, but the logic behind these actions is difficult to comprehend from the outside. What if the model makes a decision that no one understands? Or worse, one that jeopardizes the grid connection?
We have built the architecture around three control layers that structurally address this risk.
Explainability
In practice, a DRL model is not an absolute black box, but complete transparency is also an illusion. For individual decisions, we can determine which input data likely carried the most weight. This doesn't provide mathematical proof, but it does offer a good estimate. Enough for our engineers to understand why a particular charging or discharging strategy was chosen.
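One simple way to estimate which inputs carried the most weight in a single decision is to nudge each input and watch how much the action changes. The sketch below is only an illustration of that idea, not the attribution method used in the product: the function name `input_sensitivity`, the dictionary-based observation, and the toy policy are all assumptions made for the example.

```python
from typing import Callable, Dict


def input_sensitivity(
    policy: Callable[[Dict[str, float]], float],
    obs: Dict[str, float],
    rel_step: float = 0.05,
) -> Dict[str, float]:
    """Estimate how strongly each input influences one decision.

    Each input is increased by `rel_step` (relative) while the others are
    held fixed; the score is the absolute change in the policy's action.
    This is a rough local estimate, not a mathematical proof.
    """
    base_action = policy(obs)
    scores: Dict[str, float] = {}
    for name, value in obs.items():
        bumped = dict(obs)
        # Use an absolute step when the value is zero, so the input still moves.
        bumped[name] = value * (1.0 + rel_step) if value != 0 else rel_step
        scores[name] = abs(policy(bumped) - base_action)
    return scores


def toy_policy(o: Dict[str, float]) -> float:
    """Hypothetical stand-in for a trained policy (battery power in kW)."""
    return 4.0 * o["price"] + 1.0 * o["load"]


scores = input_sensitivity(toy_policy, {"price": 1.0, "load": 1.0}, rel_step=0.5)
```

In this toy case the price input scores higher than the load input, which is the kind of ranking an engineer could use to sanity-check a single charging decision.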
For long-term strategies, this is more difficult. The reason the model deploys the battery differently over several weeks than a domain expert algorithm would is something that emerges over millions of training steps, and it cannot simply be attributed to a single cause.
Hard Limits
The most critical layer involves setting boundaries for the AI. The algorithm has the freedom to act, but only within strictly defined limits. In the background, prediction models run as a continuous safety net. They calculate the absolute boundaries within which the AI must operate, based on two strict preconditions:
If the AI chooses a strategy that violates these parameters, the safety net immediately blocks the action. Unsafe decisions never reach the physical infrastructure.
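A minimal sketch of such a safety net is shown below. It assumes the prediction models deliver a lower and an upper bound on battery power for the coming interval; the names `Limits` and `apply_safety_net` are hypothetical. Replacing a blocked proposal with the nearest allowed value is also an assumption of this sketch, since a fixed safe fallback would be an equally valid design.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Limits:
    """Hypothetical bounds computed by the prediction models for one interval."""
    min_power_kw: float  # most negative allowed battery power (discharging)
    max_power_kw: float  # most positive allowed battery power (charging)


def apply_safety_net(proposed_kw: float, limits: Limits) -> Tuple[float, bool]:
    """Return the setpoint that may reach the hardware and a 'blocked' flag.

    A proposal outside the bounds is never passed on. In this sketch it is
    replaced by the nearest allowed value (an assumption, see lead-in).
    """
    allowed_kw = min(max(proposed_kw, limits.min_power_kw), limits.max_power_kw)
    blocked = allowed_kw != proposed_kw
    return allowed_kw, blocked


limits = Limits(min_power_kw=-50.0, max_power_kw=30.0)
safe_kw, was_blocked = apply_safety_net(80.0, limits)  # the AI asks for too much
```

The key design point is that the check sits between the model and the physical infrastructure, so it holds regardless of what the model has learned.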
Continuous Benchmarking
A DRL model sometimes takes actions that initially seem counter-intuitive but prove to be a better long-term strategy than the obvious choice. To allow the model to demonstrate that value, we continuously benchmark its performance against domain expert algorithms (the rule-based logic from Part 1) that run in parallel. After a fixed period, for example two weeks, we compare the results.
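The comparison itself can be kept very simple. The sketch below assumes that both strategies yield one financial result per day over the same period; the function name `compare_strategies`, the metric names, and the sample figures are illustrative assumptions, not real results.

```python
from statistics import mean
from typing import Dict, Sequence


def compare_strategies(
    drl_daily_eur: Sequence[float],
    baseline_daily_eur: Sequence[float],
) -> Dict[str, float]:
    """Compare daily results of the DRL model and the rule-based baseline."""
    if not drl_daily_eur or len(drl_daily_eur) != len(baseline_daily_eur):
        raise ValueError("both series must cover the same, non-empty period")
    # Positive difference means the DRL model did better on that day.
    diffs = [d - b for d, b in zip(drl_daily_eur, baseline_daily_eur)]
    return {
        "drl_total_eur": sum(drl_daily_eur),
        "baseline_total_eur": sum(baseline_daily_eur),
        "mean_daily_gain_eur": mean(diffs),
        "days_drl_better": sum(1 for x in diffs if x > 0),
    }


# Made-up numbers for three days, purely to show the shape of the output.
report = compare_strategies([10.0, 12.0, 8.0], [9.0, 12.0, 9.0])
```

Looking at the total over the whole period rather than at individual days is what gives a counter-intuitive action the chance to pay off.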
With control in place, the second design choice arises: how do we train the model? This choice directly determines whether the system is viable for one location or a hundred. Within DRL, two approaches exist: model-free and model-based.
In practice, data scarcity at new locations is a structural problem. At a new location, years of detailed consumption data at 15-minute intervals are often lacking. Where a model-free approach gets stuck here, model-based RL achieves a working result with significantly less data. Building and maintaining an accurate model is complex and domain-specific. However, for energy systems, where the physical limits are known, this is a manageable problem.
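Why a known model reduces the need for data can be illustrated with a toy example. When the physical behaviour of the battery is written down explicitly, candidate actions can be evaluated in simulation instead of being learned from years of measurements. The sketch below is not the training procedure itself; the lossless battery model, the three candidate power levels, and the exhaustive search over a short horizon are all simplifying assumptions.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def simulate(
    soc_kwh: float,
    actions_kw: Sequence[float],
    prices_eur_per_kwh: Sequence[float],
    capacity_kwh: float,
    dt_h: float = 0.25,  # 15-minute intervals
) -> Optional[float]:
    """Roll the known battery model forward; return the energy cost,
    or None if the plan would leave the physical state-of-charge range."""
    cost = 0.0
    for power_kw, price in zip(actions_kw, prices_eur_per_kwh):
        soc_kwh += power_kw * dt_h  # positive power charges the battery
        if soc_kwh < 0.0 or soc_kwh > capacity_kwh:
            return None
        cost += power_kw * dt_h * price  # charging costs, discharging earns
    return cost


def plan(
    soc_kwh: float,
    prices_eur_per_kwh: Sequence[float],
    capacity_kwh: float,
    candidates_kw: Sequence[float] = (-40.0, 0.0, 40.0),
) -> Optional[Tuple[float, Tuple[float, ...]]]:
    """Try every candidate sequence in simulation and keep the cheapest feasible one."""
    best: Optional[Tuple[float, Tuple[float, ...]]] = None
    for seq in product(candidates_kw, repeat=len(prices_eur_per_kwh)):
        cost = simulate(soc_kwh, seq, prices_eur_per_kwh, capacity_kwh)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, seq)
    return best


best_plan = plan(soc_kwh=10.0, prices_eur_per_kwh=[0.25, 0.5], capacity_kwh=20.0)
```

In this example the simulator needs no historical consumption data at all: the limits and dynamics are known in advance, which is the property that makes the model-based route manageable for energy systems.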
Choosing a model-based approach solves one problem. A second dilemma remains: how do you scale this to hundreds of locations without starting from scratch for each one?
Beyond the choice of learning method, the system's architecture is crucial for long-term effectiveness. If an EMS operates at dozens or hundreds of business locations, a design dilemma arises:
We solve this by providing the model with the right context for each location. A single model learns generic patterns and receives location-specific information such as connection capacity, production profile, and consumption pattern. This way, it applies the same learned patterns to the specific situation. As we make the models more complex, we envision model families: variants for specific location types, such as severe grid congestion or solar panel overcapacity. New locations then directly benefit from what has already been learned within their category.
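The idea of one shared model that receives location-specific context can be sketched as follows. The class `LocationContext`, its three fields, and the choice to express production and consumption relative to the connection capacity are assumptions for illustration; the actual feature set is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LocationContext:
    """Hypothetical, known-in-advance facts about one site."""
    connection_capacity_kw: float
    peak_production_kw: float
    mean_consumption_kw: float


def build_observation(market_features: List[float], ctx: LocationContext) -> List[float]:
    """Append location context to the generic market features.

    Production and consumption are expressed relative to the connection
    capacity so that sites of different sizes look comparable to one model.
    """
    if ctx.connection_capacity_kw <= 0:
        raise ValueError("connection capacity must be positive")
    return list(market_features) + [
        ctx.connection_capacity_kw,
        ctx.peak_production_kw / ctx.connection_capacity_kw,
        ctx.mean_consumption_kw / ctx.connection_capacity_kw,
    ]


site = LocationContext(
    connection_capacity_kw=100.0,
    peak_production_kw=50.0,
    mean_consumption_kw=25.0,
)
observation = build_observation([0.1, 0.2], site)
```

Because the context is passed in as data, adding a new site means filling in a record rather than training a new model.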
Model families work because they apply learning capabilities where learning adds value: recognizing patterns in uncertain data such as energy prices, weather, and demand. What they don't do is learn the specific installation setup of a client. You don't build that knowledge through trial-and-error; you know it beforehand, or you conduct measurements and record it.
This is a principle that permeates the entire design. AI excels at handling uncertainty: price forecasts, demand prediction, and market strategy across day-ahead, intraday, and imbalance markets. For certainties such as physical limits, connection contracts, and topology, we rely on the domain expert. Not because AI couldn't learn it, but because the resulting custom AI solution would be expensive and not scalable.
The combination of three control layers, model-based learning, and model families makes it possible to deploy a self-learning system that optimizes returns without compromising operational reliability. AI learns, but within boundaries we define, and only in areas where learning truly adds value. For us, that's the difference between AI as a marketing term and AI as reliable operational technology.