Problem Statement

Currently, in the EVE software, ZedAgent is responsible for top-level orchestration, baseos upgrade validation, and cloud connectivity for configuration and status.

...

For baseos upgrade validation, this leaves a gap between node boot-up and the point where zedagent invokes the actual baseos upgrade transition process. Any failure in between, from device boot-up until zedagent starts, may leave the device stuck in an indefinite state and may turn it into a non-functional unit.


Proposal

The zedagent module will be broken up. Baseos upgrade validation, overall connectivity, and device health will be managed by NodeAgent. NodeAgent will be one of the first modules to be spawned, along with ledmanager, and will persist for the whole lifetime of the EVE node. ZedAgent will be responsible only for cloud connectivity, configuration parsing, and status/metrics publication. Baseos upgrade validation will be handled by the NodeAgent module, covering all the intermediate states of device boot-up. Baseosmgr will interact with NodeAgent for baseos upgrade installation and validation.

EVE Node Health Monitor Function

EVE node health check functionality currently consists of the following:

...

 Watchdog Time : For Pillar Agent(s)

...

Health and responsiveness

...

.

Each agent's health is monitored through a software watchdog timer.

 Controller connectivity

Controller connectivity for the EVE node is evaluated as follows:

Reset Timer Function:

In a normal operation scenario, if an agent does not touch its pid file again within the watchdog time interval, the device is rebooted.

Reset Time: For controller connectivity health in normal operation mode

On loss of controller connectivity, the EVE node is rebooted once the reset timer interval elapses.

 Fallback

...

Time: For controller connectivity health during baseos upgrade

...

validation

On loss of controller connectivity, the EVE node reboots and falls back to the fallback image once the fallback timer interval elapses.

Current Implementation

The watchdog time handler is based on the wdctl utility and is part of device-steps.sh.

The EVE node reset and fallback timer functionality is currently part of the ZedAgent module.

Proposal for Refactoring

Baseosmgr Module

ZedAgent Module

NodeAgent Module

NodeAgent will listen to the following:

    - ledBlinker Status, for EVE node registration and controller connectivity change events

    - Zboot Status

    - Zedagent Status

NodeAgent will publish the following:

    - Zboot Config

    - NodeAgent Status

ZedAgent will additionally listen to the following:

    - NodeAgent Status

P.S.

Currently, the scope of device health, as defined above, does not include the following:

    - cpu usage health

    - disk space usage health

    - network usage health

...

Refactoring Details

The watchdog time functionality will remain as is. The reset and fallback time functionality will be moved into a new agent called NodeAgent. The whole baseos upgrade validation orchestration will also be moved into the NodeAgent module. NodeAgent will be spawned along with ledmanager. To orchestrate baseos upgrade validation, NodeAgent will listen to the ledmanager ledblinker config messages to determine controller connectivity status, and to the time stamps of successful configuration pulls published by zedagent. NodeAgent will be the owner of zboot config and will publish it for use by Baseosmgr. Also, on successful baseos installation and on reset/fallback timer expiry, device reboot operations will be triggered through the "NodeAgent status" pubsub topic.

The zedagent module will be responsible only for controller connectivity functionality, such as pulling the latest configuration blob from the controller and publishing status/info/metrics messages to the controller. It will publish this information through the "zedagent status" pubsub topic. Zedagent will subscribe to the "NodeAgent status" pubsub topic to execute device reboot commands.

Baseosmgr will listen to the "zboot config" messages published by the NodeAgent module, handle them, and update "zboot status" for baseos installation and upgrade validation orchestration.

In a nutshell, the following are the changes in event handling per module.

Baseosmgr Module

Baseosmgr will subscribe to the following topic:

  • "zboot config", generated by NodeAgent

        For baseos installation and upgrade validation

ZedAgent Module

Zedagent will subscribe to the following topic:

  • "NodeAgent status", generated by NodeAgent

        For executing the device reboot command

        For publishing the remaining test time to the controller, for baseos upgrade validation

Zedagent will publish the following topic:

  • "zedagent status"

        Time stamp of the last successful configuration pull from the controller

NodeAgent Module

NodeAgent will subscribe to the following topics:

  • "ledBlinker config", generated by zedclient/zedagent, etc.

        For EVE node registration and controller connectivity change events

  • "zboot status", generated by baseosmgr

        For baseos installation and upgrade validation orchestration

  • "zedagent status", generated by zedagent

        For the time stamp of the last successful config fetch from the controller

NodeAgent will publish the following topics:

  • "zboot config"

        Zboot partition information

  • "NodeAgent status"

        For the device reboot event, on baseos installation and on reset/fallback timer expiry

        Remaining test time, for publication to the controller (consumed by zedagent)


P.S.

For completeness and future work scope, the following items are noted for EVE node health. This list is not exhaustive, and the necessary actions for these items need to be defined.

  • cpu usage health
  • disk space usage health
  • network usage health
  • each agent's basic functionality check, (on upgrade)
  • controller driven testing and marking the baseos as active