Workload Lifecycle Event Hooks Feature Design Candidate

Status: Draft

Sponsor User: IBM

Date of Submission: Aug 4, 2024 

Submitted by: @Joseph Pearson 

Affiliation(s): IBM

<Please fill out the above fields, and the Overview, Design and User Experience sections below for an initial review of the proposed feature.>

Scope and Signoff: (to be filled out by Chair)

Overview

<Briefly describe the problem being solved, not how the problem is solved, just focus on the problem. Think about why the feature is needed, and what is the relevant context to understand the problem.>

Enable code to be called by anax on a target edge node before and/or after deploying a workload, and before and/or after removing a workload. Allow optional implicit and explicit configuration to be passed to that code from a workload's Service Definition or Deployment Policy. Determine whether events are also needed for node restart, workload upgrade, and workload rollback.

Design

<Describe how the problem is fixed. Include all affected components. Include diagrams for clarity. This should be the longest section in the document. Use the sections below to call out specifics related to each aspect of the overall system, and refer back to this section for context. Provide links to any relevant external information.>

Interesting Moments (Events)

Minimum Required Events

  • onBeforeWorkloadRun - after the AgBot completes negotiations and the edge node receives all the information about the service and deployment files, immediately before the Agent calls the Docker API to run the image, call this event and pass it all of the required information (to be specified below). An example of a third-party service that would need this event is a runtime security service that applies a policy locking down resource access.

  • onAfterWorkloadRun - after the Agent successfully calls the Docker API to run the image, as soon as it is running (and able to react?), call this event and pass it all of the required information. An example of a third-party service that would use this event is a data collection or data streaming service.

  • onBeforeWorkloadTerminate - just before the Agent stops a running service, call this event and pass it the required information. An example service that could use this would be a data collection service.

  • onAfterWorkloadTerminate - just after the Agent stops a running service, and upon confirmation that it has stopped, call this event and pass it the required information. Example uses include removing a security policy and terminating any monitoring processes to prevent false alarms.
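
The ordering of the four minimum events around the run and terminate actions can be sketched as follows. This is only an illustration in Python, not anax code: `LifecycleEvent`, `HookRegistry`, `run_workload`, and `terminate_workload` are hypothetical names, the `start` and `stop` callables stand in for the Agent's calls to the Docker API, and the `info` dictionary is a placeholder for the required information that is still to be specified.

```python
from enum import Enum
from typing import Callable, Dict, List


class LifecycleEvent(Enum):
    """The four minimum required events named in this design."""
    BEFORE_RUN = "onBeforeWorkloadRun"
    AFTER_RUN = "onAfterWorkloadRun"
    BEFORE_TERMINATE = "onBeforeWorkloadTerminate"
    AFTER_TERMINATE = "onAfterWorkloadTerminate"


# A hook receives the (yet to be specified) event information.
Hook = Callable[[dict], None]


class HookRegistry:
    """Hypothetical in-memory registry mapping each event to its hooks."""

    def __init__(self) -> None:
        self._hooks: Dict[LifecycleEvent, List[Hook]] = {
            event: [] for event in LifecycleEvent
        }

    def subscribe(self, event: LifecycleEvent, hook: Hook) -> None:
        self._hooks[event].append(hook)

    def fire(self, event: LifecycleEvent, info: dict) -> None:
        # Hooks run in subscription order.
        for hook in self._hooks[event]:
            hook(info)


def run_workload(registry: HookRegistry, info: dict,
                 start: Callable[[dict], None]) -> None:
    registry.fire(LifecycleEvent.BEFORE_RUN, info)
    start(info)  # stand-in for the Docker API call that runs the image
    # Reached only if start() did not raise, i.e. the run succeeded.
    registry.fire(LifecycleEvent.AFTER_RUN, info)


def terminate_workload(registry: HookRegistry, info: dict,
                       stop: Callable[[dict], None]) -> None:
    registry.fire(LifecycleEvent.BEFORE_TERMINATE, info)
    stop(info)  # stand-in for stopping the service and confirming it stopped
    registry.fire(LifecycleEvent.AFTER_TERMINATE, info)
```

In this sketch the "after" events fire only when the wrapped action returns without raising, which mirrors the "successfully calls" and "upon confirmation" wording in the event descriptions above.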

Nice-to-have Events

  • onBeforeWorkloadRestart - just before the Agent restarts a running service (to perform an update or other change), call this event and pass it the required information.

  • onAfterWorkloadRestart - just after successfully restarting a service, call this event and pass it the required information.

  • similar events for models?

User Experience

  1. As a solution architect, I want developers to be able to call third-party components before or after specific workload lifecycle events so that the component can take action at the right time (before a change happens or immediately after the change takes effect).

  2. As a solution architect, I want third party components to be loosely (not tightly) coupled to Open Horizon through an event subscription mechanism so that Open Horizon's functionality is not hampered by a potential error condition in the environment or the third-party component.

  3. As a solution architect or developer, I expect the eventing solution used by Open Horizon to use one or more common standards for the event "hooks" so that integrating with third-party components is easy to do, well-understood, and well-supported by existing libraries.

  4. As a solution architect, I want Open Horizon to pause deployment while subscribed third-party components take action so that there is minimal opportunity for race conditions and resource contention.

  5. As a solution architect, I want any pause in deployment to have a reasonable timeout so that deployments cannot be blocked indefinitely by a misconfigured or malicious process subscribed to an event.

  6. As a third-party component developer or integrator, I expect events to pass a payload containing information about the workload, workload configuration, host, node configuration, and optionally information about the third-party solution's desired configuration so that the third-party component can completely understand the operating environment and context to take appropriate action. 

  7. As a developer, I expect documentation to be provided describing the structure and contents of an event payload so that the feature is easy to understand and use. 

  8. As a developer, I expect documentation to list all of the interesting moments when events are raised so that I know what events exist.

  9. As a developer, I expect to find examples showing how to subscribe to events and use the provided payloads so that I can learn by example.

  10. As a QA engineer, I anticipate using the examples to create unit and end-to-end tests so that the expected usage has ample test coverage.
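
Stories 2, 4, 5, and 6 together suggest a dispatcher that pauses deployment while subscribers act, bounds that pause with a timeout, and keeps a failing subscriber from breaking the deployment. The Python sketch below shows one possible shape under stated assumptions; it is not a proposed implementation. `EventPayload`, its field names, `dispatch`, the status strings, and the default timeout are all hypothetical, and the real payload structure remains to be specified.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class EventPayload:
    """Hypothetical payload covering the categories listed in story 6."""
    event: str
    workload: dict
    workload_config: dict
    host: dict
    node_config: dict
    # Optional desired configuration for the third-party solution.
    subscriber_config: dict = field(default_factory=dict)


Subscriber = Callable[[EventPayload], None]


def dispatch(payload: EventPayload,
             subscribers: Dict[str, Subscriber],
             timeout_seconds: float = 5.0) -> Dict[str, str]:
    """Run every subscriber, waiting at most timeout_seconds in total.

    Returns a status per subscriber name: "ok", "timeout", or "error: ...".
    Subscriber failures are recorded rather than propagated (story 2).
    """
    results: Dict[str, str] = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(subscribers)))
    try:
        futures = {name: pool.submit(fn, payload)
                   for name, fn in subscribers.items()}
        # One shared deadline bounds the whole pause (stories 4 and 5).
        deadline = time.monotonic() + timeout_seconds
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                future.result(timeout=remaining)
                results[name] = "ok"
            except FutureTimeout:
                results[name] = "timeout"
            except Exception as exc:  # isolate subscriber errors
                results[name] = f"error: {exc}"
    finally:
        # Do not wait for stragglers. Python cannot forcibly stop a thread,
        # so a timed-out subscriber keeps running in the background here.
        pool.shutdown(wait=False)
    return results
```

The caller could then decide, per event, whether a "timeout" or "error" status should abort the deployment or merely be logged; that policy is an open design question rather than something this sketch settles. Because in-process threads cannot be killed, an actual design might prefer invoking subscribers out of process so that a timeout can really terminate them.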

 

Command Line Interface

<Describe any changes to the hzn CLI, including before and after command examples for clarity. Include which users will use the changed CLI. This section should flow very naturally from the User Experience section.>

 

External Components

<Describe any new or changed interactions with components that are not the agent or the management hub.>

 

Affected Components

<List all of the internal components (agent, MMS, Exchange, etc) which need to be updated to support the proposed feature. Include a link to the github epic for this feature (and the epic should contain the github issues for each component).>

 

Security

<Describe any related security aspects of the solution. Think about security of components interacting with each other, users interacting with the system, components interacting with external systems, permissions of users or components>

 

APIs

<Describe any new/changed/deprecated APIs, including before and after snippets for clarity. Include which components or users will use the APIs.>

 

Build, Install, Packaging

<Describe any changes to the way any component of the system is built (e.g. agent packages, containers, etc), installed (operators, manual install, batch install, SDO), configured, and deployed (consider the hub and edge nodes).>

 

Documentation Notes

<Describe the aspects of documentation that will be new/changed/updated. Be sure to indicate if this is new or changed doc, the impacted artifacts (e.g. technical doc, website, etc) and links to the related doc issue(s) in github.>

 

Test

<Summarize new automated tests that need to be added in support of this feature, and describe any special test requirements that you can foresee.>