March 9, 2026
How to safely grant AI control over a high-capacity business connection
Reinout de Jongh

Anyone who gives a self-learning algorithm control over a high-capacity business connection will sooner or later face the question we posed at the end of the first article: how do you prevent a system that constantly seeks rewards from making decisions that jeopardize operational reliability? The short answer: not by deploying AI everywhere, but by carefully choosing where to apply it and where not to. A self-learning system excels at handling uncertainty and volatile market prices. For customer-specific topologies and strict physical limits, however, AI is precisely the wrong tool.

This principle requires three things: a system that remains under control in all circumstances, a well-considered choice in how you train the algorithm, and an architecture that scales to hundreds of locations without customization becoming dominant. We'll start with the most important aspect: control.

The Three Control Layers

A common concern with Deep Reinforcement Learning (DRL) is its black box nature: the algorithm receives data as input and provides an action as output, but the logic behind these actions is difficult to comprehend from the outside. What if the model makes a decision that no one understands? Or worse, one that jeopardizes the grid connection?

We have built the architecture around three control layers that structurally address this risk.

Explainability

In practice, a DRL model is not an absolute black box, but complete transparency is also an illusion. For individual decisions, we can determine which input data likely carried the most weight. This doesn't provide mathematical proof, but it does offer a good estimate, enough for our engineers to understand why a particular charging or discharging strategy was chosen.
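One common way to estimate which inputs carried the most weight in a single decision is perturbation-based sensitivity: nudge one input at a time and see how much the proposed action changes. The sketch below illustrates that idea only. The article does not specify which attribution method is used in practice, and the `policy` callable, the feature names, and the `delta` step are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def perturbation_sensitivity(
    policy: Callable[[Sequence[float]], float],
    observation: Sequence[float],
    feature_names: List[str],
    delta: float = 0.05,
) -> Dict[str, float]:
    """Rank inputs by how much a small nudge changes the proposed action.

    `policy` is a hypothetical stand-in for a trained model: it maps an
    observation vector to one battery power setpoint. The scores are a
    local estimate around this observation, not a mathematical proof.
    """
    if len(observation) != len(feature_names):
        raise ValueError("one name per input feature is required")

    base_action = policy(observation)
    scores = {}
    for i, name in enumerate(feature_names):
        perturbed = list(observation)
        perturbed[i] += delta  # nudge exactly one input
        scores[name] = abs(policy(perturbed) - base_action)

    # Largest effect first, so the most influential input is listed on top.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


def toy_policy(obs: Sequence[float]) -> float:
    """Toy linear policy for illustration only: reacts strongly to price."""
    price, soc_fraction = obs
    return -3.0 * price + 0.5 * soc_fraction


ranking = perturbation_sensitivity(toy_policy, [0.2, 0.6], ["price", "soc_fraction"])
```

For the toy policy, `price` ends up first in the ranking because the action reacts six times more strongly to it than to the state of charge. Such a ranking says nothing about long-term strategy, which is exactly the limitation described next.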

For long-term strategies, this is more difficult. The reasons why the model deploys the battery differently over several weeks than a domain expert algorithm would emerge over millions of training steps, so they cannot simply be attributed to a single cause.

Hard Limits

The most critical layer involves setting boundaries for the AI. The algorithm has the freedom to act, but only within strictly defined limits. In the background, prediction models run as a continuous safety net. They calculate the absolute boundaries within which the AI must operate, based on two strict preconditions:

  • Minimum reserve capacity: The system always reserves sufficient battery capacity to absorb both expected and unexpected power peaks.
  • Grid connection: The system operates on a quarter-hourly basis: for each upcoming quarter-hour, it first calculates how much power remains available within the contracted capacity after baseline consumption. The system may only propose actions that fit within this remaining power. Every proposed charging or discharging action is checked against it before being executed. If the action doesn't fit, it is blocked. An exceedance is therefore prevented not by post-hoc correction, but by mathematical exclusion upfront.

If the AI chooses a strategy that violates these parameters, the safety net immediately blocks the action. Unsafe decisions never reach the physical infrastructure.
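The two preconditions above can be sketched as a small pre-execution check. This is a simplified illustration under stated assumptions, not the production safety net: all names are hypothetical, positive power means charging, and the sketch ignores conversion losses, export limits, and inverter ratings.

```python
from dataclasses import dataclass
from typing import Tuple

QUARTER_HOUR_H = 0.25  # length of one settlement interval in hours


@dataclass(frozen=True)
class QuarterHourLimits:
    """Hypothetical inputs for one upcoming quarter-hour."""
    contracted_kw: float  # contracted grid capacity
    baseline_kw: float    # forecast baseline consumption
    soc_kwh: float        # energy currently stored in the battery
    reserve_kwh: float    # minimum energy kept back for power peaks
    capacity_kwh: float   # usable battery capacity


def allowed_power_range(lim: QuarterHourLimits) -> Tuple[float, float]:
    """Return (min_kw, max_kw); positive = charging, negative = discharging."""
    # Charging adds to grid import, so it must fit in the remaining headroom
    # and in the empty part of the battery.
    headroom_kw = max(0.0, lim.contracted_kw - lim.baseline_kw)
    fillable_kw = max(0.0, lim.capacity_kwh - lim.soc_kwh) / QUARTER_HOUR_H
    max_charge_kw = min(headroom_kw, fillable_kw)

    # Discharging may only use energy above the minimum reserve.
    spare_kwh = max(0.0, lim.soc_kwh - lim.reserve_kwh)
    max_discharge_kw = spare_kwh / QUARTER_HOUR_H

    return (-max_discharge_kw, max_charge_kw)


def is_allowed(action_kw: float, lim: QuarterHourLimits) -> bool:
    """Check a proposed action before it reaches the hardware."""
    low_kw, high_kw = allowed_power_range(lim)
    return low_kw <= action_kw <= high_kw


limits = QuarterHourLimits(
    contracted_kw=500.0, baseline_kw=420.0,
    soc_kwh=100.0, reserve_kwh=40.0, capacity_kwh=200.0,
)
```

With these example numbers, 80 kW of headroom remains on the connection and 60 kWh sits above the reserve, so the allowed range is -240 kW to 80 kW. A proposal to charge at 100 kW is rejected before execution, while 80 kW passes.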

Continuous Benchmarking

A DRL model sometimes takes actions that initially seem counter-intuitive but prove to be a better long-term strategy than the obvious choice. To allow the model to demonstrate that value, we continuously benchmark its performance against domain expert algorithms (the rule-based logic from Part 1) that run in parallel. After a fixed period, for example two weeks, we compare the results.
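A minimal sketch of such a comparison is shown below. It assumes that both strategies produce one financial result per day over the same period; the function name, the choice of daily euro results as the metric, and the example figures are illustrative assumptions, not measured data.

```python
from typing import Dict, Sequence


def compare_over_period(
    drl_daily_eur: Sequence[float],
    baseline_daily_eur: Sequence[float],
) -> Dict[str, float]:
    """Compare two strategies over the same fixed period.

    Each sequence holds one (hypothetical) daily result in euros. The
    comparison looks at the whole period, because a strategy that loses
    on individual days can still come out ahead overall.
    """
    if len(drl_daily_eur) != len(baseline_daily_eur):
        raise ValueError("both strategies must cover the same days")

    drl_total = sum(drl_daily_eur)
    baseline_total = sum(baseline_daily_eur)
    days_drl_ahead = sum(
        1 for d, b in zip(drl_daily_eur, baseline_daily_eur) if d > b
    )
    return {
        "drl_total": drl_total,
        "baseline_total": baseline_total,
        "difference": drl_total - baseline_total,
        "days_drl_ahead": days_drl_ahead,
    }


# Made-up example: the DRL result trails on two of three days.
result = compare_over_period([10.0, -5.0, 30.0], [12.0, 8.0, 9.0])
```

In this made-up example the DRL strategy is ahead on only one of three days, yet its total over the period is higher. That is the reason to judge it over a fixed window rather than per decision.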

Rapid Learning at New Locations

With control in place, the second design choice arises: how do we train the model? This choice directly determines whether the system is viable for one location or a hundred. Within DRL, two approaches exist. 

  • A model-free approach learns purely through experimentation. The agent tries out millions of actions in a virtual environment and discovers what works based on direct feedback. This flexibility comes at a price: the method is extremely data-intensive. To recognize reliable patterns, it needs years of detailed historical data at 15-minute intervals for each location.
  • A model-based approach combines that virtual environment with an internal model of the environment. We define the consequences of an action for the algorithm in advance: that charging fills the battery and directly equates to purchasing electricity, and that discharging reduces the available buffer capacity. As a result, the agent does not have to derive these fundamental relationships from the data itself. It learns faster, requires less historical data, and can simulate the consequences of an action before that action is executed.
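To make the idea of pre-defined consequences concrete, the sketch below encodes them as a small transition function that an agent could call to simulate an action before executing it. It is a deliberately simplified illustration: the names are hypothetical, and it assumes a lossless battery and a single price per quarter-hour.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BatteryState:
    """Hypothetical minimal state: energy currently stored in the battery."""
    soc_kwh: float


def simulate_step(
    state: BatteryState,
    power_kw: float,
    price_eur_per_kwh: float,
    capacity_kwh: float,
    dt_h: float = 0.25,
) -> Tuple[BatteryState, float]:
    """Predict the outcome of one action without executing it.

    Positive power charges the battery and counts as purchased electricity;
    negative power discharges it and reduces the available buffer.
    Returns the next state and the energy cost in euros (negative cost
    means revenue or avoided purchase).
    """
    energy_kwh = power_kw * dt_h
    next_soc_kwh = state.soc_kwh + energy_kwh
    if not 0.0 <= next_soc_kwh <= capacity_kwh:
        raise ValueError("action would leave the physical battery range")

    cost_eur = energy_kwh * price_eur_per_kwh
    return BatteryState(soc_kwh=next_soc_kwh), cost_eur


start = BatteryState(soc_kwh=100.0)
after_charge, charge_cost = simulate_step(start, 80.0, 0.25, capacity_kwh=200.0)
after_discharge, discharge_cost = simulate_step(start, -80.0, 0.25, capacity_kwh=200.0)
```

Charging at 80 kW for one quarter-hour adds 20 kWh and costs 5 euros at the assumed price; discharging at the same power removes 20 kWh of buffer. Because these relationships are given, the agent only has to learn what is genuinely uncertain, such as prices and demand.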

In practice, data scarcity is a structural problem: at a new location, years of detailed consumption data at 15-minute intervals are often lacking. Where a model-free approach gets stuck, model-based RL achieves a working result with significantly less data. The trade-off is that building and maintaining an accurate model is complex and domain-specific. For energy systems, however, where the physical limits are known, this is a manageable problem.

Choosing a model-based approach solves one problem. A second dilemma remains: how do you scale this to hundreds of locations without starting from scratch for each one?

Scaling without compromising performance

Beyond the choice of learning method, the system's architecture is crucial for long-term effectiveness. If an EMS operates at dozens or hundreds of business locations, a design dilemma arises:

  • Individual models per client: excel at customization, but training and maintenance are time-consuming. What the algorithm learns at location A is not automatically applied to location B.
  • One universal model: establishes connections across all locations and requires less maintenance, but performs suboptimally at an individual level. Any adjustment for one specific situation directly affects behavior at all other locations. 

We solve this by providing the model with the right context for each location. A single model learns generic patterns and receives location-specific information such as connection capacity, production profile, and consumption pattern. This way, it applies the same learned patterns to the specific situation. As we make the models more complex, we envision model families: variants for specific location types, such as severe grid congestion or solar panel overcapacity. New locations then directly benefit from what has already been learned within their category. 
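One way to picture a single model that receives location context is an observation vector that joins the time-varying signals with fixed, location-specific features. The sketch below shows only that input construction. The field names, the scaling constant, and the choice of features are assumptions for illustration, based on the examples named above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LocationContext:
    """Hypothetical static description of one site."""
    connection_kw: float        # contracted connection capacity
    peak_production_kw: float   # summary of the production profile
    mean_consumption_kw: float  # summary of the consumption pattern


def build_observation(
    price_eur_per_kwh: float,
    soc_fraction: float,
    ctx: LocationContext,
    scale_kw: float = 1000.0,
) -> List[float]:
    """Join dynamic signals with scaled location context.

    The same model sees the same layout at every site; only the context
    values differ, so learned patterns can be applied per location.
    """
    return [
        price_eur_per_kwh,
        soc_fraction,
        ctx.connection_kw / scale_kw,
        ctx.peak_production_kw / scale_kw,
        ctx.mean_consumption_kw / scale_kw,
    ]


site_a = LocationContext(connection_kw=500.0, peak_production_kw=250.0, mean_consumption_kw=125.0)
site_b = LocationContext(connection_kw=2000.0, peak_production_kw=0.0, mean_consumption_kw=750.0)

obs_a = build_observation(0.25, 0.5, site_a)
obs_b = build_observation(0.25, 0.5, site_b)
```

Under identical market conditions, the two sites produce observations that share the first two entries and differ only in the context entries. A model family, as described above, would amount to training separate variants on subsets of such contexts.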

Where AI does and does not belong

Model families work because they apply learning capabilities where learning adds value: recognizing patterns in uncertain data such as energy prices, weather, and demand. What they don't do is learn the specific installation setup of a client. You don't build that knowledge through trial-and-error; you know it beforehand, or you conduct measurements and record it.

This is a principle that permeates the entire design. AI excels at handling uncertainty: price forecasts, demand prediction, and market strategy across day-ahead, intraday, and imbalance markets. For certainties such as physical limits, connection contracts, and topology, we rely on the domain expert. Not because AI couldn't learn it, but because the resulting custom AI solution would be expensive and not scalable. 

Conclusion

The combination of three control layers, model-based learning, and model families makes it possible to deploy a self-learning system that optimizes returns without compromising operational reliability. AI learns, but within boundaries we define, and only in areas where learning truly adds value. For us, that's the difference between AI as a marketing term and AI as reliable operational technology. 
