- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Programs on CDAP have broad state definitions that don't clearly indicate what a program is doing during execution. We want clearer definitions for our program states so that they accurately reflect a program's true behavior during a run.
Redefine program statuses to accurately represent the current status of any program at any time.
- As a user with access to a small cluster, I want to see when my workflow is in a state of acquiring resources so that I know why my program is taking so long to begin running.
- As a user, if I stop a flow, I want to see the flow marked as 'stopped', not as 'killed' or 'failed'.
- As a user, when I start a service, I want to be able to send requests to it as soon as the service is running.
- As a user, if I kill a program through the YARN resource manager, I want to see the same state reflected on CDAP.
- As a user, when I start a Spark job, I want to see the Spark job marked as running only after it has finished requesting resources, so that I know when my Spark job is actually running.
Add more granular state definitions of programs to aid in debugging and clearer program state distinction.
Add transitions between these granular state definitions at the relevant sections in the code so that program states accurately represent the current status of the program.
Use TMS to broadcast any user-facing program status transitions so that other services can subscribe to these notifications.
See the parent JIRA and the related issues here: https://issues.cask.co/browse/CDAP-4157
- Before we attempt to refactor program states, we should understand how they are defined in CDAP today. Program states are currently defined in three ways:
- ProgramController.State: This is the internal representation of a program's state, used only by program runners and program controllers. This is not exposed to the user.
- ProgramStatus: This is the user-facing representation of a program's state.
- ProgramRunStatus: This is used for querying program runs.
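For reference, a minimal sketch of the approximate shapes of these three representations (the member lists are illustrative approximations, not the actual CDAP source):

```java
// Illustrative sketch only: approximate shapes of the three representations
// described above, not the actual CDAP source.
public class ProgramStateSketch {

  // ProgramController.State: internal, used by program runners and controllers.
  public enum ControllerState {
    STARTING, ALIVE, SUSPENDING, SUSPENDED, RESUMING, STOPPING, COMPLETED, KILLED, ERROR
  }

  // ProgramStatus: user-facing representation of a program's state.
  public enum UserFacingStatus {
    RUNNING, STOPPED
  }

  // ProgramRunStatus: used when querying program run records.
  public enum RunStatus {
    ALL, RUNNING, COMPLETED, FAILED, KILLED
  }
}
```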
- Below are two diagrams that propose more granular program states.
Program State Diagram
- The yellow boxes represent the states that would be exposed to the user if the program status was requested at any given point.
- The blue boxes represent the internal states of the program.
*Workflows have different behavior that is not fully depicted in the diagram. A workflow is defined to be running when it is executing its first node. Since the first node of a workflow can be a Spark job, an MR job, or a custom action, that node is itself executing a program, so internally that node's state will loop back to the CONFIGURING_JOB state for a Spark or MR job. The workflow itself, however, would remain in a running state. This is for three reasons:
- It is difficult to propagate the internal program's states back to the workflow.
- There can be parallel executions of nodes.
- There isn't a use case currently for knowing the substate (node's state) of a workflow.
If any node fails during a workflow's run, the failure will be propagated to the workflow state.
User-Facing Program States Diagram
Note that a workflow will go directly from INITIALIZING to RUNNING since the workflow itself will not be requesting resources for a job, but its nodes will. The nodes themselves may be Spark or MR jobs, which individually request job resources to begin their jobs. However, the workflow stays in a RUNNING status. See the note above this diagram for more information.
These are the states that should be persisted to the store and notifications should be sent through TMS for every one of these transitions.
- The above two flow charts represent the program state transitions. Failure states that may occur along each step are not explicitly shown. For instance, if we fail to request containers, the program status will transition to FAILED.
- Whenever a program state is about to be updated, we should first determine whether the state transition is valid, and only then update the state.
- A message should be sent through a TMS topic (program.status.event.topic) for every user-facing program status transition (the second diagram).
- We no longer make HTTP requests to the store with an updated status of a program run. Instead, we can subscribe to the TMS topic and persist the appropriate state changes into the store when necessary.
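The validate-then-publish step described above could be sketched as follows. The state names follow the user-facing statuses proposed in this document, but the transition table, the assumed state ordering, the message format, and the MessagePublisher interface are all illustrative assumptions, not CDAP's actual TMS API:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Sketch of validating a user-facing status transition before publishing it
// to the TMS topic. Everything except the topic name and state names is an
// illustrative assumption.
public class ProgramStatusTransitions {

  public enum Status {
    PREPARING, REQUESTING_RESOURCES, INITIALIZING, REQUESTING_JOB_RESOURCES,
    RUNNING, COMPLETED, FAILED, STOPPED
  }

  // Forward edges of the user-facing diagram (assumed ordering; workflows
  // skip REQUESTING_JOB_RESOURCES and go directly from INITIALIZING to RUNNING).
  private static final Map<Status, EnumSet<Status>> VALID = new EnumMap<>(Status.class);
  static {
    VALID.put(Status.PREPARING, EnumSet.of(Status.REQUESTING_RESOURCES));
    VALID.put(Status.REQUESTING_RESOURCES, EnumSet.of(Status.INITIALIZING));
    VALID.put(Status.INITIALIZING, EnumSet.of(Status.REQUESTING_JOB_RESOURCES, Status.RUNNING));
    VALID.put(Status.REQUESTING_JOB_RESOURCES, EnumSet.of(Status.RUNNING));
    VALID.put(Status.RUNNING, EnumSet.of(Status.COMPLETED, Status.STOPPED));
  }

  // Stand-in for the TMS client used to publish to program.status.event.topic.
  public interface MessagePublisher {
    void publish(String topic, String message);
  }

  // Any non-terminal state may transition to FAILED; other transitions must
  // follow the table above.
  public static boolean isValidTransition(Status from, Status to) {
    if (to == Status.FAILED) {
      return from != Status.COMPLETED && from != Status.STOPPED;
    }
    return VALID.getOrDefault(from, EnumSet.noneOf(Status.class)).contains(to);
  }

  // Validates the change, then publishes a notification for subscribers.
  public static void transition(MessagePublisher publisher, String runId, Status from, Status to) {
    if (!isValidTransition(from, to)) {
      throw new IllegalStateException("Invalid transition " + from + " -> " + to);
    }
    publisher.publish("program.status.event.topic", runId + ":" + to);
  }
}
```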
Publishing Program Status Changes
- Before defining more granular program statuses, we first need to refactor the store: publish the existing program statuses to TMS so that the store can subscribe to those topics.
- We will move all store persistence (Flows, Services, Workers) to the program type's respective containers to unify where the state changes occur for each program (CDAP-2013).
- Then, we will replace the store persistence calls with program status notification calls that will validate the program state change before publishing the notification.
- The RemoteRuntimeStore can be replaced with a store that subscribes to the program status notifications.
- If the container dies, or if there is any error with publishing program statuses, then the record fixer needs to handle the program state inconsistency and publish a notification to correct the incomplete program state.
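The store-side subscriber that replaces RemoteRuntimeStore's HTTP persistence could be sketched as below. The "runId:STATUS" message format and both interfaces are hypothetical stand-ins, not CDAP's actual TMS or store APIs:

```java
import java.util.List;

// Sketch of a store-side subscriber that consumes program status
// notifications and persists them, replacing direct HTTP persistence calls.
// RunRecordStore and the message format are illustrative assumptions.
public class ProgramStatusSubscriber {

  public interface RunRecordStore {
    void setStatus(String runId, String status);
  }

  private final RunRecordStore store;

  public ProgramStatusSubscriber(RunRecordStore store) {
    this.store = store;
  }

  // Called with a batch of messages fetched from the program status topic.
  public void processMessages(List<String> messages) {
    for (String message : messages) {
      int sep = message.indexOf(':');
      if (sep < 0) {
        continue; // skip malformed messages rather than failing the whole batch
      }
      store.setStatus(message.substring(0, sep), message.substring(sep + 1));
    }
  }
}
```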
What does it mean for a program to be running?
For every program type, the program should be marked running when it is actually running:
- Service is marked running after the HTTP server is up and ready to receive requests.
- Spark job is running after it has been submitted by the SparkSubmitter (and in AbstractTwillProgramController, we get an event from the
- MapReduce job is running after the job has been submitted (and in AbstractTwillProgramController, we get an event from the
- Workflow is running when the first node is running (an action, MapReduce job, or Spark job).
- Flow is running when it is ready to process events.
- Worker is running when worker.run() has been called.
- What about when 80% of a worker's instances are running and 20% are not? What will the state be classified as then?
- In this case, not all instances are running, but since some are, the Worker should be marked as running.
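The aggregation rule above (a multi-instance Worker is reported running as soon as at least one instance is running) could be sketched as a small predicate. The class and enum names are illustrative, not CDAP's actual types:

```java
// Sketch of the "any instance running" aggregation rule for multi-instance
// Workers. Names are illustrative assumptions, not CDAP classes.
public class WorkerStatusAggregator {

  public enum InstanceState { STARTING, RUNNING, STOPPED, FAILED }

  // The Worker is considered running if at least one instance is running.
  public static boolean isWorkerRunning(Iterable<InstanceState> instances) {
    for (InstanceState state : instances) {
      if (state == InstanceState.RUNNING) {
        return true;
      }
    }
    return false;
  }
}
```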
We wish to have 3 terminal states:
- COMPLETED: When a batch program successfully completes.
  - Ex: any successful batch program run.
- FAILED: When a program terminates with an error. This can occur for both batch and realtime programs.
  - Ex: A MapReduce job fails with an error, or a container dies.
- STOPPED: When a user explicitly stops a program (both realtime and batch programs can be stopped). Currently, we map a stopped program to KILLED; this would no longer happen under this proposal. If there is an error while trying to stop a program, then this can transition to FAILED accordingly. A program that is stopped can be started again (although under a different run), so it doesn't make sense for a stopped program to be marked killed, as they represent different things.
  - Ex: When a user stops a flow or a worker.
- What happens if Yarn / Twill returns killed? Do we map it to stopped now?
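The mapping from termination cause to the three proposed terminal states, including the failed-stop edge case, could be sketched as follows. TerminationCause and the method are illustrative assumptions, not CDAP types:

```java
// Sketch of mapping a run's termination cause to the three proposed terminal
// states. TerminationCause is an illustrative enum, not a CDAP type.
public class TerminalStateMapper {

  public enum TerminationCause { NATURAL_COMPLETION, ERROR, USER_STOP }

  public enum TerminalState { COMPLETED, FAILED, STOPPED }

  public static TerminalState map(TerminationCause cause, boolean stopSucceeded) {
    switch (cause) {
      case NATURAL_COMPLETION:
        return TerminalState.COMPLETED;
      case USER_STOP:
        // Per the proposal, an error while stopping transitions to FAILED.
        return stopSucceeded ? TerminalState.STOPPED : TerminalState.FAILED;
      default:
        return TerminalState.FAILED;
    }
  }
}
```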
Program Statuses (External)
We wish to add a few more user-facing program statuses, exposed to the user, to make it clearer at a higher level what state a program is in:
- PREPARING: Represents building and starting an application environment for a program.
- REQUESTING_RESOURCES: Represents the program requesting resources to start the application master and the basic containers needed to run a Spark or MR job. Useful since requesting containers can take a long time.
- REQUESTING_JOB_RESOURCES: Represents the job requesting containers to start its packaged program. This is different from REQUESTING_RESOURCES because this state is active after the application master has already started. This state would only be reached by Spark and MR jobs.
- INITIALIZING: Represents building the program jar and localizing program resources and configuration before running a program.
New Programmatic APIs
An internal State enum must be composed:
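A hedged sketch of what such an internal state enum might contain, inferred from the states discussed in this document (the CONFIGURING_JOB member comes from the workflow note above; the exact member list is an assumption, not the final API):

```java
// Illustrative sketch of the internal (non-user-facing) program states
// discussed in this document; not the final member list.
enum InternalProgramState {
  PREPARING,
  REQUESTING_RESOURCES,
  CONFIGURING_JOB,          // reached by Spark/MR nodes, per the workflow note
  REQUESTING_JOB_RESOURCES,
  INITIALIZING,
  RUNNING,
  SUSPENDED,
  COMPLETED,
  FAILED,
  STOPPED
}
```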
The user-facing ProgramStatus enum will look like:
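A hedged sketch of the user-facing enum, combining the new statuses, the three terminal states, and SUSPENDED (raised as a missing status in the open questions below); the exact member list is an assumption, not the final API:

```java
// Illustrative sketch of the user-facing program statuses proposed in this
// document; not the final member list.
enum ProgramStatus {
  PREPARING,
  REQUESTING_RESOURCES,
  REQUESTING_JOB_RESOURCES,
  INITIALIZING,
  RUNNING,
  SUSPENDED,   // proposed in the open questions for suspended workflows
  COMPLETED,
  FAILED,
  STOPPED
}
```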
Deprecated Programming APIs
- We wish to deprecate ProgramRunStatus in favor of ProgramStatus. To maintain the existing functionality of querying program runs regardless of status (i.e., ProgramRunStatus.ALL), users will simply omit the status, and all statuses will be assumed instead.
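The proposed query semantics (an omitted status filter means all statuses, replacing the explicit ALL value) could be sketched as below; RunRecord and the filter method are illustrative, not CDAP's actual store API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of run-record querying where a null status filter means "all
// statuses". RunRecord and this method are illustrative assumptions.
public class RunQuery {

  public static class RunRecord {
    public final String runId;
    public final String status;

    public RunRecord(String runId, String status) {
      this.runId = runId;
      this.status = status;
    }
  }

  public static List<RunRecord> getRuns(List<RunRecord> allRuns, String statusFilter) {
    if (statusFilter == null) {
      return new ArrayList<>(allRuns); // no status specified: all statuses assumed
    }
    List<RunRecord> matching = new ArrayList<>();
    for (RunRecord run : allRuns) {
      if (run.status.equals(statusFilter)) {
        matching.add(run);
      }
    }
    return matching;
  }
}
```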
New Status Endpoint
| Path | Method | Description | Request Body | Response Code | Response |
| --- | --- | --- | --- | --- | --- |
|  |  | To get the status of a Program |  | 200 - On success; 404 - When application is not available; 500 - Any internal server errors |  |
Deprecated REST API
- We still need to add backwards compatibility to the API, supporting both the new external program status definitions and the current ones.
- Can we distinguish when a Twill program is initializing and when a Twill program is actually running? It doesn't seem to be the case, since the Yarn controller and Yarn Application State don't have an
  twillController only subscribes to
- Answer: No, but it would be nice if we could.
- When a workflow is suspended, we seem to mark it as STOPPED, but shouldn't it instead be marked as SUSPENDED? ProgramStatus currently doesn't have SUSPENDED as a status (but it should).
- Answer: We should mark workflows as suspended when they are suspended.
- How long is the abort time when requesting containers?
- Answer: It is configurable.
- Does it now seem redundant to have both ProgramStatus and ProgramRunStatus? I don't believe so, as the ProgramStatus enum will be used for other things, like program status based scheduling. It wouldn't be appropriate to use ProgramRunStatus, as that should be associated only with querying records.
- Answer: Yes, it is redundant. Let's look to remove ProgramRunStatus.
- The internal program state enum needs to live somewhere other than ProgramController, since the lifetime of the ProgramController no longer matches the lifetime of the states. Ideas on where to define it?
- Should we deprecate proto's ProgramStatus, an enum that just shows running or stopped? It is used as part of the API we are deprecating.
- ProgramController.State must be internal, and its internal state will not be exposed. Currently, it doesn't seem to be exposed either, so there is no change here.
- ProgramRunStatus can represent the status of a specific program run for the user. This will continue to be exposed to the user, but the user-exposed states will be somewhat more granular.
Impact on Infrastructure Outages
If TMS is down, program state notifications cannot be sent. No additional outages should occur.
| Test ID | Test Description | Expected Results |
| --- | --- | --- |
- Work #1
- Work #2
- Work #3