Problem Statement
Currently, in the EVE software, ZedAgent is responsible for top-level orchestration, baseos upgrade validation, and cloud connectivity for configuration and status.
During EVE node boot, ZedAgent and its associated modules are spawned only after network connectivity is established (through nim and waitforaddr) and the device is registered (zedclient).
For baseos upgrade validation, this leaves a gap between node boot and the point where zedagent actually invokes the baseos upgrade transition process. Any failure in this window, from device boot until zedagent starts, may leave the device stuck in an indefinite state and render it non-functional.
Proposal
The zedagent module will be split up. Baseos upgrade validation, overall connectivity, and device health will be managed by DevAgent. DevAgent will be one of the first modules to be spawned, along with ledmanager, and will persist for the whole lifetime of the EVE node. ZedAgent will be responsible only for cloud connectivity, configuration parsing, and status/metrics publication. Baseos upgrade validation will be handled by the DevAgent module, covering all intermediate states of device boot.
EVE Node Health Monitor Function
EVE node health check functionality consists of the following:
Pillar agent(s) run state and responsiveness
Each agent's health is monitored through a watchdog timer.
Controller connectivity
Controller connectivity for the EVE node is evaluated as follows:
Reset Time
In normal operation, on loss of controller connectivity, the EVE node is rebooted after the reset timer interval.
Fallback Time
During the validation phase of a baseos upgrade, on loss of controller connectivity, the EVE node falls back to the fallback image after the fallback time interval.
Current Implementation
The EVE node reset and fallback timer functionality is currently part of the ZedAgent module.
Proposal for Refactoring
Baseosmgr Module
ZedAgent Module
DevAgent Module
DevAgent will listen to the following:
- ledBlinker Status, for EVE node registration and controller connectivity change events
- Zboot Status
- Zedagent Status
DevAgent will publish the following:
- Zboot Config
- DevAgent Status
ZedAgent will additionally listen to the following:
- DevAgent Status
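The proposed listen/publish relationships can be written down as a small topology table. This is a sketch only: the topic identifiers are the names above with spaces removed, the map layout and the `Subscribers` helper are hypothetical, and only the relationships stated in this proposal are listed.

```go
package main

import (
	"fmt"
	"sort"
)

// subscriptions maps each agent to the topics it listens to,
// as listed in this proposal.
var subscriptions = map[string][]string{
	"devagent": {"LedBlinkerStatus", "ZbootStatus", "ZedAgentStatus"},
	"zedagent": {"DevAgentStatus"},
}

// publications maps each agent to the topics it publishes,
// as listed in this proposal.
var publications = map[string][]string{
	"devagent": {"ZbootConfig", "DevAgentStatus"},
}

// Subscribers returns, in sorted order, the agents listening to topic.
func Subscribers(topic string) []string {
	var agents []string
	for agent, topics := range subscriptions {
		for _, t := range topics {
			if t == topic {
				agents = append(agents, agent)
			}
		}
	}
	sort.Strings(agents)
	return agents
}

func main() {
	// For every topic DevAgent publishes, show who consumes it.
	for _, topic := range publications["devagent"] {
		fmt.Println(topic, "->", Subscribers(topic))
	}
}
```

In this table nothing subscribes to ZbootConfig, because the proposal does not name its consumer.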
PS.
Currently, the scope of device health, as defined above, does not include the following:
- CPU usage health
- disk space usage health
- network usage health
- basic functionality check for each agent