- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
A CDAP pipeline is composed of plugins that users configure as the pipeline is developed. While building a pipeline, a developer can supply an invalid plugin configuration or schema. For example, the BigQuery sink plugin can have an output schema that does not match the underlying BigQuery table. Pipeline developers can use the new validation endpoint to validate stages before deploying the pipeline. To fail fast and give a better user experience, the validation endpoint should return all validation errors for a given stage in a single call.
The data pipeline app exposes various error types for plugin validation, and future releases may introduce new ones. In the current implementation, the validation errors are defined in the data pipeline app itself. As a result, whenever plugins with a new error type are pushed to the hub, the data pipeline artifacts must be updated as well. A better approach is to modify the data pipeline app so that app artifacts do not need to be replaced for every new type of error.
- To fail fast and improve the user experience, introduce a new API that collects multiple validation error messages from a stage at configure time
- Decouple validation error types from the data pipeline app
- Instrument plugins to use this API to return multiple error messages from the validation endpoint
- As a CDAP pipeline developer, when I validate a stage, I expect all invalid config properties and input/output schema fields to be highlighted in the CDAP UI with an appropriate error message and corrective action.
- As a plugin developer, I should be able to capture all validation errors while configuring the plugin so that they can all be surfaced in the CDAP UI.
- As a plugin developer, I should be able to use new validation error types without replacing data pipeline app artifacts.
API Changes for Plugin Validation
Collect Multiple errors from plugins
To collect multiple validation errors from a stage, StageConfigurer, MultiInputStageConfigurer and MultiOutputStageConfigurer can be modified. The current implementation does not expose the stage name to the plugin in the configurePipeline method, but plugins need the stage name to create stage-specific errors. The stage name will therefore be exposed to plugins through the stage configurer.
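The interface changes themselves are not reproduced here. As a rough illustration only, the sketch below shows one way a configurer could expose the stage name and accumulate failures. The names `StageConfigurerSketch`, `getStageName`, `addFailure` and `getFailures` are assumptions made for this example and are not taken from the CDAP API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: the type and method names are assumptions, not the actual CDAP API.
interface StageConfigurerSketch {
  /** Name of the stage being configured, so plugins can build stage-specific errors. */
  String getStageName();

  /** Records a validation failure instead of failing on the first one. */
  void addFailure(String message);

  /** All failures recorded for this stage so far. */
  List<String> getFailures();
}

// Simple in-memory implementation used only to illustrate the idea.
class DefaultStageConfigurerSketch implements StageConfigurerSketch {
  private final String stageName;
  private final List<String> failures = new ArrayList<>();

  DefaultStageConfigurerSketch(String stageName) {
    this.stageName = stageName;
  }

  @Override
  public String getStageName() {
    return stageName;
  }

  @Override
  public void addFailure(String message) {
    // Prefix each message with the stage name so the error is stage specific.
    failures.add(stageName + ": " + message);
  }

  @Override
  public List<String> getFailures() {
    return Collections.unmodifiableList(failures);
  }
}
```

A plugin would call `addFailure` once per problem it finds in its configure method, and the validation endpoint would then return everything in `getFailures()`.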
Decouple plugin error types from data pipeline app
Approach - 1
To carry error information, a new ValidationFailure class is introduced so that multiple validation failures can be collected in the stage configurer. A failure is built with a ValidationFailureBuilder, which only allows string properties, and exposes methods to get its message, type and properties. The failures are collected in a ValidationException. When a plugin finds an invalid property that other checks depend on, it can throw a ValidationException carrying all the errors collected so far instead of continuing with checks that can no longer succeed. This keeps plugin validation code much simpler.
API usage in plugins
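The original usage example is not available here. The following is a minimal sketch of how a plugin might use Approach 1, written from the description above. The constructor and method signatures, the `SleepConfigValidator` class, and the property names `name`, `millis` and `timeoutMillis` are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Approach 1. Shapes and signatures are assumptions, not the real CDAP API.
final class ValidationFailure {
  private final String message;
  private final String type;
  private final Map<String, String> properties;

  ValidationFailure(String message, String type, Map<String, String> properties) {
    this.message = message;
    this.type = type;
    this.properties = Collections.unmodifiableMap(new LinkedHashMap<>(properties));
  }

  String getMessage() { return message; }
  String getType() { return type; }
  Map<String, String> getProperties() { return properties; }
}

// Builder that only accepts string properties.
final class ValidationFailureBuilder {
  private final String message;
  private final String type;
  private final Map<String, String> properties = new LinkedHashMap<>();

  ValidationFailureBuilder(String message, String type) {
    this.message = message;
    this.type = type;
  }

  ValidationFailureBuilder addProperty(String name, String value) {
    properties.put(name, value);
    return this;
  }

  ValidationFailure build() {
    return new ValidationFailure(message, type, properties);
  }
}

// Carries every failure collected so far, so a plugin can stop early without losing errors.
class ValidationException extends RuntimeException {
  private final List<ValidationFailure> failures;

  ValidationException(List<ValidationFailure> failures) {
    super(failures.size() + " validation failure(s)");
    this.failures = Collections.unmodifiableList(new ArrayList<>(failures));
  }

  List<ValidationFailure> getFailures() { return failures; }
}

// Example plugin-side validation; the config properties are illustrative.
class SleepConfigValidator {
  static void validate(String name, long millis, long timeoutMillis) {
    List<ValidationFailure> failures = new ArrayList<>();
    if (name == null || name.isEmpty()) {
      failures.add(new ValidationFailureBuilder("'name' is missing.", "InvalidProperty")
          .addProperty("property", "name").build());
    }
    if (millis <= 0) {
      failures.add(new ValidationFailureBuilder("'millis' is invalid.", "InvalidProperty")
          .addProperty("property", "millis")
          .addProperty("correctiveAction", "Make sure 'millis' is greater than 0.").build());
      // 'timeoutMillis' is checked against 'millis', so stop here with what was collected.
      throw new ValidationException(failures);
    }
    if (timeoutMillis < millis) {
      failures.add(new ValidationFailureBuilder("'timeoutMillis' is less than 'millis'.", "InvalidProperty")
          .addProperty("property", "timeoutMillis").build());
    }
    if (!failures.isEmpty()) {
      throw new ValidationException(failures);
    }
  }
}
```

Calling `SleepConfigValidator.validate("", 0, 5)` throws a `ValidationException` that holds two failures, one for `name` and one for `millis`. The dependent `timeoutMillis` check is skipped.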
Approach - 2
A validation error represents an error with one or more causes, each with its own attributes. For example, when the type of an input schema field does not match the underlying sink schema, the cause is an input field mismatch with attributes such as stage name, field name and suggested type. An error message can be associated with more than one cause. This happens for plugins such as joiner and splitter, where a stage has multiple input or output schemas. For example, when the input schemas for a joiner are not compatible, the causes include the mismatching fields from the input schemas of the incoming stages. A validation error can therefore be represented as a list of causes, where each cause is a map from cause attribute to its value.
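The list-of-causes structure described above could be sketched as follows. The class name `CauseBasedFailure`, the `withCause` helper and the attribute names (`stage`, `inputStage`, `inputField`) are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Approach 2: a failure is a message plus a list of causes,
// and each cause is a map of attribute name to value.
final class CauseBasedFailure {
  private final String message;
  private final List<Map<String, String>> causes = new ArrayList<>();

  CauseBasedFailure(String message) {
    this.message = message;
  }

  // Adds one cause; attributes are passed as alternating name/value pairs.
  CauseBasedFailure withCause(String... nameValuePairs) {
    if (nameValuePairs.length % 2 != 0) {
      throw new IllegalArgumentException("Expected alternating attribute names and values");
    }
    Map<String, String> cause = new LinkedHashMap<>();
    for (int i = 0; i < nameValuePairs.length; i += 2) {
      cause.put(nameValuePairs[i], nameValuePairs[i + 1]);
    }
    causes.add(Collections.unmodifiableMap(cause));
    return this;
  }

  String getMessage() { return message; }

  List<Map<String, String>> getCauses() {
    return Collections.unmodifiableList(causes);
  }
}

class JoinerFailureExample {
  // One failure with two causes, one per incoming stage whose field type disagrees.
  static CauseBasedFailure incompatibleJoinKey() {
    return new CauseBasedFailure("Join key 'id' has different types in the input schemas.")
        .withCause("stage", "joiner", "inputStage", "customers", "inputField", "id")
        .withCause("stage", "joiner", "inputStage", "orders", "inputField", "id");
  }
}
```

Because both causes belong to the same failure, a consumer such as the UI can highlight the `id` field in both input schemas for a single error message.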
All the attributes of a cause can be tracked in a central location.
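One simple way to track the attributes centrally is a constants class, sketched below. The class name and the attribute names are assumptions. Only `pluginId` appears in the JSON fragments of this document, and the rest are guesses based on the attributes mentioned in the text.

```java
// Hypothetical central registry of cause attribute names.
// The actual attribute names are not given in this document.
final class CauseAttributes {
  static final String STAGE = "stage";
  static final String CONFIG_PROPERTY = "configProperty";
  static final String INPUT_STAGE = "inputStage";
  static final String INPUT_FIELD = "inputField";
  static final String OUTPUT_PORT = "outputPort";
  static final String OUTPUT_FIELD = "outputField";
  static final String SUGGESTED_TYPE = "suggestedType";
  static final String PLUGIN_ID = "pluginId";

  private CauseAttributes() {
    // Constants holder; not meant to be instantiated.
  }
}
```

Keeping the names in one place means that plugins and the code serving the UI refer to the same attribute keys.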
API usage in plugins
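The original usage example is not available here. As a sketch under stated assumptions, a plugin could report failures under Approach 2 through a small fluent helper. `FailureCollectorSketch`, `FailureSketch`, their methods and the attribute names are hypothetical. The scenario follows the BigQuery sink schema mismatch mentioned in the introduction, with schemas simplified to maps of field name to type name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical failure with a message, a corrective action and a list of causes.
final class FailureSketch {
  private final String stageName;
  private final String message;
  private final String correctiveAction;
  private final List<Map<String, String>> causes = new ArrayList<>();

  FailureSketch(String stageName, String message, String correctiveAction) {
    this.stageName = stageName;
    this.message = message;
    this.correctiveAction = correctiveAction;
  }

  FailureSketch withConfigProperty(String property) {
    return withCause("configProperty", property);
  }

  FailureSketch withOutputSchemaField(String field) {
    return withCause("outputField", field);
  }

  // Every cause carries the stage name plus one attribute identifying what is invalid.
  private FailureSketch withCause(String attribute, String value) {
    Map<String, String> cause = new LinkedHashMap<>();
    cause.put("stage", stageName);
    cause.put(attribute, value);
    causes.add(cause);
    return this;
  }

  String getMessage() { return message; }
  String getCorrectiveAction() { return correctiveAction; }
  List<Map<String, String>> getCauses() { return causes; }
}

// Hypothetical collector handed to the plugin at configure time.
final class FailureCollectorSketch {
  private final String stageName;
  private final List<FailureSketch> failures = new ArrayList<>();

  FailureCollectorSketch(String stageName) {
    this.stageName = stageName;
  }

  FailureSketch addFailure(String message, String correctiveAction) {
    FailureSketch failure = new FailureSketch(stageName, message, correctiveAction);
    failures.add(failure);
    return failure;
  }

  List<FailureSketch> getFailures() { return failures; }
}

class SinkValidationExample {
  // Compares the configured output schema with the table schema (field name -> type name).
  static void validate(FailureCollectorSketch collector, String table,
                       Map<String, String> outputSchema, Map<String, String> tableSchema) {
    if (table == null || table.isEmpty()) {
      collector.addFailure("Table name is missing.", "Provide a table name.")
          .withConfigProperty("table");
    }
    for (Map.Entry<String, String> field : outputSchema.entrySet()) {
      String expected = tableSchema.get(field.getKey());
      if (expected != null && !expected.equals(field.getValue())) {
        collector.addFailure(
                "Field '" + field.getKey() + "' is of type '" + field.getValue() + "'.",
                "Schema should be of type '" + expected + "'.")
            .withOutputSchemaField(field.getKey());
      }
    }
  }
}
```

The plugin never throws on the first problem. It keeps adding failures, and each failure names the config property or schema field that a consumer such as the UI can highlight.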
Impact on UI
|Type||Description||Scenario||Approach - 1 - Json Response||Approach - 2 - Json Response|
|StageError||Represents a validation error while configuring the stage||If an error occurs while connecting to the sink to get the actual schema|
"correctiveAction" : "Make sure correct driver is available.",
|InvalidProperty||Represents invalid configuration property||If config property value contains characters that are not allowed by underlying source or sink|
"correctiveAction" : "Make sure 'millis' is greater than 0.",
|PluginNotFound||Represents plugin not found error for a stage. This error will be added by the data pipeline app||If the plugin was not found. This error will be thrown from the data pipeline app|
"correctiveAction" : "Please make sure the 'Mock' plugin is installed.",
"pluginId" : "Mock"
|InvalidInputSchema||Represents an invalid schema field in the input schema||If the input schemas for the joiner plugin are of different types|
|InvalidOutputSchema||Represents invalid schema field in output schema||If the output schema for the plugin is not compatible with underlying sink|
"correctiveAction" : "Schema should be of type 'string' at output port 'port'",
|InvalidFieldInProperty||Represents an invalid field in a property list||If the property represents a list of fields, the failure should include the property name along with the invalid field|
There are two contracts in this design: a programmatic contract between the data pipeline app and plugins, and another between the data pipeline app and the UI. Approach 2 does not introduce the concept of a failure type, so the contract with the UI is based on cause attributes rather than on types. As a result, if a plugin creates a custom failure that uses any of the UI-compatible attributes, the UI can still highlight it. Approach 2 also associates related causes with each other, which represents a failure better when several causes contribute to it. Hence, Approach 2 is suggested.