Some Fit and Finish items for EVE for smooth operation of Edge Devices
Given that Fit and Finish takes priority over new features, the following are some of the things that need to be improved:
Logging / Debuggability
Linux state (iptables rules, disk contents, ifconfig -a, ip route, etc.)
Kernel information needed to debug broken hardware
PubSub state
Ability to run some pre-defined diagnostic commands
Better visibility into each agent state
Internal Agent state
Internal Agent counters
etc.
App Instance Create / Delete / Refresh / Purge / Restart, etc. For example:
Verifying?
Auto Retriable Error
User intervention needed.
App Instance Create received (name, time, etc.)
Downloading / Verifying Images
Downloading / Verifying Images Done
Error
Copying for RW image
Reserving Resources
Starting Instance
Instance Create Done
Use a structured format to log such events as INFO events, which the user can then inspect to learn the actual details from the device
These are also very useful from the developer perspective.
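One way the structured INFO events for the create flow could look is sketched below. This is only an illustration under assumptions: the type `AppInstanceEvent`, the stage constants, the JSON field names, and the helper functions are all hypothetical and are not existing EVE identifiers. Each event is rendered as a single JSON line so it can be searched by app name and stage.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Stage names one step of the app instance create flow.
// The identifiers are illustrative, not existing EVE constants.
type Stage string

const (
	StageCreateReceived Stage = "CREATE_RECEIVED"
	StageDownloading    Stage = "DOWNLOADING_VERIFYING"
	StageDownloadDone   Stage = "DOWNLOAD_VERIFY_DONE"
	StageCopyingRW      Stage = "COPYING_RW_IMAGE"
	StageReserving      Stage = "RESERVING_RESOURCES"
	StageStarting       Stage = "STARTING_INSTANCE"
	StageCreateDone     Stage = "CREATE_DONE"
	StageError          Stage = "ERROR"
)

// Error kinds, set only when Stage is StageError.
const (
	ErrAutoRetriable    = "AUTO_RETRIABLE"
	ErrUserIntervention = "USER_INTERVENTION_NEEDED"
)

// AppInstanceEvent is a hypothetical structured INFO event.
type AppInstanceEvent struct {
	Time      time.Time `json:"time"`
	Severity  string    `json:"severity"`
	AppName   string    `json:"app_name"`
	Stage     Stage     `json:"stage"`
	ErrorKind string    `json:"error_kind,omitempty"`
	Detail    string    `json:"detail,omitempty"`
}

// NewEvent stamps an INFO event for the given app and stage.
func NewEvent(app string, stage Stage) AppInstanceEvent {
	return AppInstanceEvent{Time: time.Now().UTC(), Severity: "INFO", AppName: app, Stage: stage}
}

// Encode renders the event as a single JSON line.
func (e AppInstanceEvent) Encode() (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

// Decode parses a line produced by Encode.
func Decode(line string) (AppInstanceEvent, error) {
	var e AppInstanceEvent
	err := json.Unmarshal([]byte(line), &e)
	return e, err
}

func main() {
	// Example values only; the app name and detail text are made up.
	ev := NewEvent("demo-app", StageError)
	ev.ErrorKind = ErrAutoRetriable
	ev.Detail = "example detail text"
	line, err := ev.Encode()
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Keeping the error kind as a separate field lets a UI distinguish an auto-retriable error from one that needs user intervention without parsing free text.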
Kernel state / Disk state / Temp, etc.
The main thrust is to improve the content of logging, so that 95% of the issues can be debugged without reproducing them or accessing the device using console / ssh.
Ability to prioritize messages based on severity level (CRIT / ERROR / INFO / DEBUG, in that order) and, when messages must be dropped, to drop the lowest-priority ones first
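The severity-based dropping described above can be sketched as a bounded buffer. This is a minimal illustration, not how EVE's logging is implemented: the names `BoundedLog`, `Message`, and the `Sev*` constants are hypothetical, and the tie-breaking rule (evict the oldest message among those of the lowest severity) is an assumption.

```go
package main

import "fmt"

// Severity orders messages; a lower value is dropped earlier.
type Severity int

const (
	SevDebug Severity = iota // lowest priority, dropped first
	SevInfo
	SevError
	SevCrit // highest priority, dropped last
)

// Message is one log entry.
type Message struct {
	Sev  Severity
	Text string
}

// BoundedLog holds at most Cap messages in arrival order.
type BoundedLog struct {
	Cap  int
	msgs []Message
}

// Add stores m. When the buffer is full it evicts the oldest message of
// the lowest severity present, unless m is lower than everything that is
// buffered, in which case m itself is dropped. It returns the dropped
// message, if any.
func (b *BoundedLog) Add(m Message) (dropped Message, didDrop bool) {
	if len(b.msgs) < b.Cap {
		b.msgs = append(b.msgs, m)
		return Message{}, false
	}
	// Find the oldest message with the lowest severity.
	victim := -1
	for i, cur := range b.msgs {
		if victim == -1 || cur.Sev < b.msgs[victim].Sev {
			victim = i
		}
	}
	if victim == -1 || m.Sev < b.msgs[victim].Sev {
		return m, true // incoming message is the least important
	}
	dropped = b.msgs[victim]
	b.msgs = append(b.msgs[:victim], b.msgs[victim+1:]...)
	b.msgs = append(b.msgs, m)
	return dropped, true
}

// Messages returns a copy of the buffered messages in arrival order.
func (b *BoundedLog) Messages() []Message {
	return append([]Message(nil), b.msgs...)
}

func main() {
	b := &BoundedLog{Cap: 2}
	b.Add(Message{Sev: SevDebug, Text: "d1"})
	b.Add(Message{Sev: SevInfo, Text: "i1"})
	if d, ok := b.Add(Message{Sev: SevCrit, Text: "c1"}); ok {
		fmt.Println("dropped:", d.Text)
	}
	fmt.Println("kept:", b.Messages())
}
```

With a capacity of two, the CRIT message pushes out the DEBUG message and the INFO message survives, so the most important entries remain when space runs out.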
Proper Device Events, logged as INFO: provide the various events, as seen from the device, for each trigger:
Kernel coredump in case of kernel crashes
Information to debug broken hardware
Device Reboot reason
First boot after install
User Triggered Reboot - Time
Upgrade - Time
Upgrade Failure - Rollback
Unexpected Reboot
Agent Crash - Details
Agent Watchdog Timeout - Details
Hardware Watchdog - Details
Kernel Crash
Power Failure
In these cases, log the message details with log.Errorf() so that they go into Kibana
The UI should present this from the user's perspective and hide the details of the crashes
Current reboot reason is more for the developer.
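The split between a developer view and a user view of the reboot reason could be sketched as follows. The reason names come from the list above. Everything else is an assumption for illustration: the type and method names are hypothetical, and the user-facing wording is a placeholder, not the wording this document intends.

```go
package main

import "fmt"

// RebootReason enumerates the reasons listed in this section.
type RebootReason int

const (
	FirstBootAfterInstall RebootReason = iota
	UserTriggered
	Upgrade
	UpgradeFailureRollback
	AgentCrash
	AgentWatchdogTimeout
	HardwareWatchdog
	KernelCrash
	PowerFailure
)

var reasonNames = []string{
	"FirstBootAfterInstall", "UserTriggered", "Upgrade",
	"UpgradeFailureRollback", "AgentCrash", "AgentWatchdogTimeout",
	"HardwareWatchdog", "KernelCrash", "PowerFailure",
}

// String gives the developer-facing name of the reason.
func (r RebootReason) String() string {
	if r < 0 || int(r) >= len(reasonNames) {
		return fmt.Sprintf("RebootReason(%d)", int(r))
	}
	return reasonNames[r]
}

// userText holds the expected reasons. Any reason not listed here is
// shown to the user as an unexpected reboot. Wording is a placeholder.
var userText = map[RebootReason]string{
	FirstBootAfterInstall:  "First boot after install",
	UserTriggered:          "User triggered reboot",
	Upgrade:                "Upgrade",
	UpgradeFailureRollback: "Upgrade failure, rolled back",
}

// RebootRecord is a hypothetical record of one reboot.
type RebootRecord struct {
	Reason  RebootReason
	Time    string // when the reboot happened, as reported by the device
	Details string // crash or watchdog details, for developers only
}

// Unexpected reports whether the reboot was not planned.
func (r RebootRecord) Unexpected() bool {
	_, expected := userText[r.Reason]
	return !expected
}

// UserSummary hides crash details and keeps only what a user needs.
func (r RebootRecord) UserSummary() string {
	if text, ok := userText[r.Reason]; ok {
		return fmt.Sprintf("%s at %s", text, r.Time)
	}
	return fmt.Sprintf("Unexpected reboot at %s", r.Time)
}

// DeveloperLog keeps the full details for the error log.
func (r RebootRecord) DeveloperLog() string {
	return fmt.Sprintf("reboot reason=%s time=%s details=%q", r.Reason, r.Time, r.Details)
}

func main() {
	// Example values only.
	rec := RebootRecord{Reason: AgentWatchdogTimeout, Time: "2020-01-02T03:04:05Z", Details: "example details"}
	fmt.Println(rec.UserSummary())
	fmt.Println(rec.DeveloperLog())
}
```

The same record feeds both outputs, so the UI text and the error log cannot drift apart.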
For the user - Reboot reason should be seen as follows:
Device Events
These don’t always correlate with what is going on in the system, especially after a reboot
We see events toggling between Init / unknown / init / downloading, etc.
These items are really bug fixes
More visibility into the state of bigger triggers
Include more granular information, like Reboot Started, etc.
Currently, there is no visibility into when the device received the reboot request, when it has finished shutting down all app instances, and when it is actually rebooting. This is all useful information to an admin waiting for systems to come up.
Some states are:
Shutdown applications
Starting Reboot (still from the older image)
Bootup (as soon as possible)
Or maybe the existing message is good enough
Upgrade - We can provide more visibility to the user:
Image Downloaded
Verification
Install
Shutting down applications
Reboot
Booted up as part of Upgrade
Testing in progress ( Update time remaining )
Testing Done / Upgrade Successful
Upgrade FAILURE Fallback - cases:
Shutdown applications
Reboot to Fallback image Started
Bootup of Fallback image
.. This part is common to bootup
Reboot Device
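The granular stage reporting for bigger triggers could be sketched as an ordered list of stages with a tracker that emits one progress message per stage. The stage names are taken from the lists above. The `Progress` type, its constructors, and the message format are hypothetical, and strict in-order advancement is an assumption.

```go
package main

import "fmt"

// Progress tracks how far a bigger trigger has advanced.
type Progress struct {
	trigger string
	stages  []string
	next    int
}

// NewUpgradeProgress lists the upgrade stages from this document.
func NewUpgradeProgress() *Progress {
	return &Progress{trigger: "Upgrade", stages: []string{
		"Image Downloaded",
		"Verification",
		"Install",
		"Shutting down applications",
		"Reboot",
		"Booted up as part of Upgrade",
		"Testing in progress",
		"Testing Done / Upgrade Successful",
	}}
}

// NewRebootProgress lists the reboot stages from this document.
func NewRebootProgress() *Progress {
	return &Progress{trigger: "Reboot", stages: []string{
		"Shutdown applications",
		"Starting Reboot",
		"Bootup",
	}}
}

// Advance records that stage was reached and returns the message to
// report. It rejects stages that arrive out of order.
func (p *Progress) Advance(stage string) (string, error) {
	if p.next >= len(p.stages) {
		return "", fmt.Errorf("%s: already complete", p.trigger)
	}
	if want := p.stages[p.next]; stage != want {
		return "", fmt.Errorf("%s: got stage %q, expected %q", p.trigger, stage, want)
	}
	p.next++
	return fmt.Sprintf("%s: %d/%d %s", p.trigger, p.next, len(p.stages), stage), nil
}

func main() {
	p := NewRebootProgress()
	for _, s := range []string{"Shutdown applications", "Starting Reboot", "Bootup"} {
		msg, err := p.Advance(s)
		if err != nil {
			panic(err)
		}
		fmt.Println(msg)
	}
}
```

Rejecting out-of-order stages is one way to avoid reporting states that toggle back and forth; whether the device should enforce this or only report it is left open here.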
Advertise capabilities to the cloud to enable smooth upgrades and to let the cloud easily deal with multiple versions of devices.
For example, the cloud can send encrypted secrets or clear text; it doesn’t need to send both
Are these used by the cloud to change behavior dynamically, or just for inventory analysis?
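If capabilities are used to change cloud behavior dynamically, the secrets example could work roughly as sketched below. All names here are hypothetical (`Capability`, `CapEncryptedSecrets`, `DeviceCapabilities`, `SecretPayload`, `BuildSecretPayload`), and the encryption function is passed in because the actual mechanism is not specified in this document.

```go
package main

import "fmt"

// Capability is a feature a device advertises to the cloud.
type Capability string

// CapEncryptedSecrets is a hypothetical capability name.
const CapEncryptedSecrets Capability = "encrypted-secrets"

// DeviceCapabilities is the set advertised by one device.
type DeviceCapabilities map[Capability]bool

// SecretPayload carries exactly one form of the secret.
type SecretPayload struct {
	Encrypted []byte
	ClearText string
}

// BuildSecretPayload is the cloud-side choice: send the encrypted form
// only to devices that advertise support, clear text otherwise.
func BuildSecretPayload(caps DeviceCapabilities, secret string, encrypt func(string) []byte) SecretPayload {
	if caps[CapEncryptedSecrets] {
		return SecretPayload{Encrypted: encrypt(secret)}
	}
	return SecretPayload{ClearText: secret}
}

func main() {
	// Placeholder transform for the demo; it is NOT encryption.
	fakeEncrypt := func(s string) []byte { return []byte("enc(" + s + ")") }

	newDevice := DeviceCapabilities{CapEncryptedSecrets: true}
	oldDevice := DeviceCapabilities{}

	p := BuildSecretPayload(newDevice, "example", fakeEncrypt)
	fmt.Printf("new device: encrypted=%q clear=%q\n", p.Encrypted, p.ClearText)
	p = BuildSecretPayload(oldDevice, "example", fakeEncrypt)
	fmt.Printf("old device: encrypted=%q clear=%q\n", p.Encrypted, p.ClearText)
}
```

A device that advertises nothing is treated as an older version, so the cloud falls back to the behavior that device already understands.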
EVE images on Docker Hub.
Fallback interface configuration (lower priority)