Page tree
Skip to end of metadata
Go to start of metadata

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

CDAP currently captures lineage at the dataset level. With lineage, users can tell the program that read from or wrote to a dataset. It can help users determine which program wrote to/read from a dataset in a given timeframe.

However, as a platform, CDAP understands schemas for most datasets. Schemas contain fields. It would be useful to be able to drill into how a field in a particular dataset was used (CREATE/READ/WRITE/DELETE) in a given time period.

Goals

  • Provide CDAP platform support (in the form of API and storage) to track field level lineage.
  • Pipelines can then expose this functionality to the plugins.
  • Plugins (such as wrangler) will need to be updated to use this feature.

Use Cases 

Id
Use Case
FLL-1

As a data governance reviewer or information architect at a financial institution,

I would like to generate a report of how a PII field UID from the dataset DailyTransactions was consumed in the specified time period

so that

  1. UID is PII data, and I would like to know which processes accessed it in the given time period
  2. If the data is breached/compromised or improperly generated, I can understand its impact, and take appropriate remedial action
  3. I can also understand which fields may have been generated from the UID field in downstream processes, and judge the impact on those fields to take steps towards remediation
  4. I can generate compliance reports to certify that my organization does not breach contracts with third-party data providers. My licenses with third party providers require me to enforce strict retention policies on their data and any generated downstream data. I would like to ensure that those retention policies are adhered to at all times.
FLL-2

As a data scientist at a healthcare organization,

I would like to trace the provenance of the field patient_medical_score in the dataset PatientRecords over the last month

so that

  1. I can understand the true source(s) from which the patient_medical_score was generated
  2. I can understand the operations that were performed on various source fields to generate the patient_medical_score
  3. I can use this information to determine if patient_medical_score is a suitable field that can be relied upon for generating my ML model.

User Stories

  1. Spark program can perform various transformations on the input fields of the dataset to generate new fields. For example concatenate the first_name and last_name fields of the input dataset, so that resultant dataset only has Name as field. As a developer of CDAP program(for example CDAP Spark program), I should be able to provide these transformations so that I will know later about how the field Name was generated.
  2. Similar transformations on the fields can be done in the CDAP plugins as well. Plugin developer should be able to provide such transformations.
  3. Few plugins such as Javascript transform, Python transform etc execute the custom code provided by the pipeline developer. Pipeline developer in this case should be able to provide the field transformations through the plugin config UI.

Design

Consider a following pipeline: 

 
File------->Parser------->Name-Concatenator------->IdGenerator------->Store
 
 

Consider a following sample structured record which was processed by the pipeline along with the operations that happened on the fields:

Pipeline StageFields Emitted with valuesField Level Operations
Filebody: John,Smith,31,Santa Clara,CA

operation: read

input: file

output: body

description: read the file to generate the body field

Parser

first_name: John

last_name: Smith

age: 31

city: Santa Clara

state: CA

operation: parse

input: body

output: first_name, last_name, age, city, state

description: parsed body field

Name-Concatenator

name: John Smith

age: 31

city: Santa Clara

state: CA

operation: concat

input: first_name, last_name

output: name

description: concatenate first_name and last_name fields

 

operation: drop

input: first_name

description: delete first_name

 

operation: drop

input: last_name

description: delete last_name

IdGenerator

id: JohnSmith007

name: John Smith

age: 31

city: Santa Clara

state: CA

operation: create

input: name

output: id

description: generated unique id

Store

id: JohnSmith007

name: John Smith

age: 31

city: Santa Clara

state: CA

No field level operations performed

There are two ways to show the lineage graph for the field id as -

  1. Simple view: In this view, the lineage graph for each field has only 2 types of nodes - one belonging to the source datasets from which the field is ultimately created and the field itself. The edge between these nodes represents the set of operations that have been performed on the source nodes. Any intermediate generated fields will not be shown.

    											parse
    										   	concat
    										   	create		
    									body--------------->id
  2. Detailed view: In this view, each node of the graph represents the field (including any intermediate field generated) and each edge represents the single operation which transforms the input node into the corresponding output node.

    	
    									|---first_name---->|
    							 (parse)|				   | (concat)     (create)
    						 body------>|				   |-------->name---------->id
    									|				   |	
    									|---last_name----->|

CDAP Platform API changes

We will add FieldOperation class in the cdap-api to represent individual field operation.

// Identifies the source from where the data is read
public class Source {
   String namespace;
   String name; // could be the name of the dataset, Kafka topic etc.			
   String description; // Optional description associated with the source
}


// Represents the input for the operation
public interface Input {
  // Name of the input
  String getName(); 
 
  // Optional description associated with the input
  String getDescription();
 
  // Get the type associated with the input
  // Probably can be used to distinguish between different types of inputs such as source input or field input.
  Type getType(); 
}
 
// Represent Source as an input for the Field operation
public class SourceInput extends AbstractInput {
   Source source;
}
 
// Represent Field as an input for the Field operation
public class FieldInput extends AbstractInput {
   Schema.Field field;   
}
 
public class FieldOperation {
   // Operation name
   String name;
 
   // Optional detailed description about the operation
   String description;
 
   // Set of input fields participate in the operation	
   Set<Input> inputs;
 
   // Set of output fields generated as a part of this operation
   // Question: Can this be null? For example in case of Drop operation. 
   // However if the field is dropped and its not present in the destination dataset can it be reached?
   Set<Schema.Field> outputs;
 
   // Builder for the FieldOperation
   public static Builder {
      String name;
      String description;
      Set<Input> inputs;
      Set<Schema.Field> outputs;
 
      private Builder() {
        inputs = new HashSet<>();
        outputs = new HashSet<>();
      }
 
      public Builder setName(String name) {
         this.name = name;   
         return this;
      }
 
      public Builder setDescription(String description) {
         this.description = description;
         return this;
      }
 
      public Builder addInputSource(Source source) {
         Input input = new SourceInput(source);
         inputs.add(input);
         return this;
      }
 
      public Builder addInputField(Field field) {
         Input input = new FieldInput(field);
         inputs.add(input);
         return this;
      }
 
      public Builder addOutput(Field field) {
         outputs.add(field);
         return this; 
      }    
   }		
}

 

Once constructed following platform api can be used to record these field operations:

/**
 * This interface provides methods that will allow programs to record the field level operations.
 */
public interface LineageRecorder {
    /**
 	 * Record the field level operations against the given destination.
 	 *
 	 * @param destination the destination for which to record field operations
 	 * @param fieldOperations The list of field operations.
 	*/
    void record(Destination destination, List<FieldOperation> fieldOperations);
}
 
public class Destination {
   String namespace;
   String id;
   String description;
}

 

LineageRecorder will be available in the initialize method of the CDAP programs through program runtime context.

Sample program code to record the operations as done for the above pipeline from ETLMapReduce. Note that this code is called from the CDAP program and not from the plugin.

public class ETLMapReduce extends AbstractMapReduce {
...
   public void initialize() throws Exception {
      MapReduceContext context = getContext();
	  ...
	  List<FieldOperation> operations = new ArrayList<>();
      Source source = Source.from("myns", "user_data", "files contains user information");
 
      FieldOperation.Builder builder = new FieldOperation.Builder();
      builder
         .setName("read")
         .setDescription("Reading the data from the file")
         .addInputSource(source)
         .addOutput(offsetField)
         .addOutput(bodyField);      
      operations.add(builder.build());
       
      builder = new FieldOperation.Builder();
      builder
         .setName("parse")
         .setDescription("parsing the input body field")
         .addInputField(bodyField)
         .addOutput(firstNameField)
         .addOutput(lastNameField)
         .addOutput(ageField)
         .addOutput(cityField)
         .addOutput(stateField);
      operations.add(builder.build());


      builder = new FieldOperation.Builder();
      builder
         .setName("concat")
         .setDescription("concatenating the first name and last name")
         .addInputField(firstNameField)
         .addInputField(lastNameField)
         .addOutput(nameField);
      operations.add(builder.build());


      builder = new FieldOperation.Builder();
      builder
         .setName("drop")
         .setDescription("deleting the first name field")
         .addInputField(firstNameField)
      operations.add(builder.build());
 
      builder = new FieldOperation.Builder();
      builder
         .setName("drop")
         .setDescription("deleting the last name field")
         .addInputField(lastNameField)
      operations.add(builder.build());

      builder = new FieldOperation.Builder();
      builder
         .setName("create")
         .setDescription("creating unique id from the name field")
         .addInputField(nameField)
         .addOutput(idField)
      operations.add(builder.build());

      context.record(Destination.from("myns", "mytableds"), operations);
      ...    
   }
...
}

Storage

We will store the field level lineage information in the FieldLevelLineage system dataset. Common access pattern for the field level lineage dataset is "How field X in dataset Y is generated during the certain time window".

To satisfy such query we will have the row key as

// Storage format for row keys
// ---------------------------
//
// ------------------------------------------------------------------------------------
// <id.namespace> | <id.dataset> | <field-name> | <inverted-run-start-time> | <id.run>
// ------------------------------------------------------------------------------------

JSON serialized lineage information would be stored as a value corresponding to the above row key.

{
   "destination": {
      "namespace": "lineage",
      "name": "KafkaSink",
      "description": "Kafka broker running at address localhost:9092"  
   },
   "source": {
      "namespace": "myns",
      "name": "user_data",
      "description": "files contains user information"
   },
   "operations": [
      {
         "inputs": ["source"],
         "outputs": ["offset", "body"], 
         "name": "read",
         "description": "Reading the data from the file"  
      },
      {
	      "inputs": ["body"],
          "outputs": ["firstName", "lastName", "age", "city", "state"],
          "name": "parse",
          "description": "parsing the input body field"
      },
      {
          "inputs": ["firstName", "lastName"],
          "outputs": ["name"],
          "name": "concat",
          "description": "concatenating the first name and last name"  
      },
      {
          "inputs": ["name"],
          "outputs": ["uniqueid]",
          "name": "create",
          "description": "creating unique id from the name field"  
      }         
   ]
}

Notes about the storage format described above:

  1. inputs and outputs are field names represented as string. However they will be the complete path in the structured record for unique identification of the field for example "employees/name" where "name" is nested in the "employee" record.
  2. CDAP platform will receive the list of field operations as a part of the API call and then extract the individual field operations in order to store them in the HBase.

Retrieval of the Field level lineage information:

Http handler (FieldLevelLineageHandler) will be responsible for retrieving the lineage information. Following is the only endpoint required to provide the information:

GET /v3/namespaces/<namespace-id>/datasets/<dataset-id>/fields/<field-name>/lineage?start=<start-ts>&end=<end-ts>
 
Where:
namespace-id: namespace name
dataset-id: dataset name
field-name: name of the field for which lineage information to be retrieved
start-ts: starting timestamp(inclusive) in seconds
end-ts: ending timestamp(exclusive) in seconds for lineage

 

Sample response:

{
  "fieldName": "uniqueId",
  "startTimeInSeconds": 1442863938,
  "endTimeInSeconds": 1442881938,
  "paths": [
   ....
       list of paths which represent the different ways field is created  
   ....
  ]  
}
 
Each path will look as follows:
 
{
   "operations": [
      {
         "inputs": ["source"],
         "outputs": ["offset", "body"], 
         "name": "read",
         "description": "Reading the data from the file"  
      },
      {
	      "inputs": ["body"],

          "outputs": ["firstName", "lastName", "age", "city", "state"],
          "name": "parse",
          "description": "parsing the input body field"
      },
      {
          "inputs": ["firstName", "lastName"],
          "outputs": ["name"],
          "name": "concat",
          "description": "concatenating the first name and last name"  
      },
      {
          "inputs": ["name"],
          "outputs": ["uniqueid]",
          "name": "create",
          "description": "creating unique id from the name field"  
      }         
   ],
 
   "runs": [
      ....
          list of run ids which resulted in this path
      ....
   ]
}
 
 

 

Following section still need to be updated. Please do not review.

In order for plugins to be able to record this information at runtime we can update the StageContext interface with the following method.

public interface StageContext {
...
    /**
 	 * Record the field level mutations.
 	 *
 	 * @param fieldMutations The list of field mutations
 	 /
	void recordFieldMutations(List<FieldMutation> fieldMutations); 
...
}

This method will then be available to the plugins in the prepareRun method where field mutations can be recorded. Note that we do not need to expose the dataset name parameter in this method. Since plugins operate on the individual records and they do not have any information about the dataset where the current record will land up. However data pipeline app knows about the datasets used in the pipeline and also the information about the DAG. It is possible to figure out from DAG which stage will lead to which dataset. So data pipeline can figure out this information and call the platform method with dataset name as an argument and field mutations accumulated from the various stages.

Pipeline config changes

For some of the plugins such as Javascript transform or Python transform, the prepareRun method is written by the plugin developer, however transform method is supplied by the pipeline developer. Since pipeline developer knows more about the transform method and can provide the field level mutations better, we will need an ability to provide the mutations via pipeline config as well.

Following are the proposed changes for the pipeline configs - 

{
 ...
    "stages" : [
		{
                "name": "Projection",
                "plugin": {
                    "name": "Projection",
                    "type": "transform",
                    "label": "Projection",
                    "artifact": {
                        "name": "core-plugins",
                        "version": "1.8.5-SNAPSHOT",
                        "scope": "SYSTEM"
                    },
                    "properties": {
                        "drop": "headers",
                        "rename": "ts:seqId,body:f1"
                    }
                },
 
                "fieldMutations": [
                    {
						"inputFields": [
							{\"name\":\"ts\",\"type\":\"long\"}
						],
						"mutation": "rename",
                        "description": "renaming timestamp to the sequence id",
						"outputField": {\"name\":\"seqId\",\"type\":\"long\"}						
					},
                    {
						"inputFields": [
							{\"name\":\"body\",\"type\":\"string\"}
						],
						"mutation": "rename",
                        "description": "renaming body to the f1",
						"outputField": {\"name\":\"f1\",\"type\":\"string\"}
					},
                    {
						"inputFields": [
							{\"name\":\"headers\",\"type\":{\"type\":\"map\",\"keys\":\"string\",\"values\":\"string\"}
						],
						"mutation": "delete",
                        "description": "deleting headers"
					}
                ],
                "outputSchema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"seqId\",\"type\":\"long\"},{\"name\":\"f1\",\"type\":\"string\"}]}",
                "inputSchema": [
                    {
                        "name": "Stream",
                        "schema": "{\"name\":\"etlSchemaBody\",\"type\":\"record\",\"fields\":[{\"name\":\"ts\",\"type\":\"long\"},{\"name\":\"headers\",\"type\":{\"type\":\"map\",\"keys\":\"string\",\"values\":\"string\"}},{\"name\":\"body\",\"type\":\"string\"}]}"
                    }
                ],
                "type": "transform",
                "label": "Projection",
                "icon": "icon-projection",
            }
    ]
 ...
}

Scenarios for Schema propagation

  1. Configure time schema: When the pipeline is configured with the static schema, prepareRun method of the plugin will have access the schema. In this case the plugin will be able to emit the required lineage information.
  2. Dynamic schema through macro: It is possible that the schema is provided as a macro during configure time and the value for which is supplied through runtime arguments. In this case in order for schema to be available in the prepareRun method we might need some changes.[TBD]
  3. Unknown schema: It is possible to implement the custom plugins which work independent of the schema. For example consider the following pipeline - 

    File-------->CSV Parser--------->Javascript Transform-------->Custom Table Sink

    Assume that the Javascript transform just looks for the fields in the input StructuredRecord which contains "secure" in it, for example "secure_id", "secure_account" and drops them. Custom table sink takes all the fields from the input StructuredRecord and writes to a Table dataset with each individual field corresponds to the column name. In this case the schema of the Table will be variable based on the individual record. In this case even the pipeline developer wont be able to correctly specify the field mutations.

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

PathMethodDescriptionResponse CodeResponse
/v3/apps/<app-id>GETReturns the application spec for a given application

200 - On success

404 - When application is not available

500 - Any internal errors

 

     

Deprecated REST API

PathMethodDescription
/v3/apps/<app-id>GETReturns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages 

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test IDTest DescriptionExpected Results
   
   
   
   

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work

  • No labels

14 Comments

  1. Sagar Kapare - We need to expose all the capabilities for business metadata - Tags, Properties and Lineage. Can we update the APIs to do that?

    1. Sreevatsan Raman I have renamed the FieldAccess class to FieldMetadata and added more information to it for storing tags and properties along with the AccessType.

  2. However

    Can you explain a little more the motivation? Field level lineage is not about the type of access (read vs write) but about how a field was produced, what transformations were applied, what was the field in the source etc. 

    An example would also help illustrate this. 

  3. I would like to generate a report of how a PII field UID from the dataset DailyTransactions was consumed in the specified time period

     

    This is not the primary use case. We really want to trace back a field in the sink to its origin. 

  4. How does the FieldMetadata class represent a transformation that created a field? A program/plugin/directive should be able to emit a sequence of "mutations", each of which has

    • the name of one or more input fields
    • the name of one or more output fields
    • the name of the operation applied
    • a human readable description of what that operation is
  5. Option #2: Add new versions of the getDataset methods

    I don't think I understand the purpose of these APIs. This would instantiate a dataset and at the same time tag the dataset and its fields with new meta data? Do you have an example use case for that?

  6. I don't understand why the field transformation lineage need to associate with a dataset (as suggested by the record API that a dataset name is needed). For a transformation stage, it doesn't know anything about dataset, isn't it?

    Also, how is it different than metadata, given that metadata now can be associated with any key/value pairs? Why need new platform API for that? (I agree to have ETL-api for plugins to use).

    1. I don't understand why the field transformation lineage need to associate with a dataset (as suggested by the record API that a dataset name is needed). For a transformation stage, it doesn't know anything about dataset, isn't it?

      >> The record API accepting the name of the dataset is from platform, but not from the plugin perspective. Since multiple datasets can have field with the same name we want to distinguish between those fields, which is why the field transformations are associated with the datasets. In case of data pipeline transformation stage, it wont know the dataset name, but the app would know it based on the DAG. App will then call the platform method with the appropriate dataset as a parameter.

      1. But then what the dataset name will be if there are multiple inputs to a transform?

        1. The mutations will be recorded from the perspective of the sink datasets.

          For example consider a pipeline which joins customer and product datasets. After joining the data it does the projection to remove the fields like ssn and store the data in a sink dataset called transactions.

          The app will call the platform API with the dataset name as transactions. I imagine when we show this information on the UI, user would first see the transactions dataset and fields within it. Then user can drill down further to know how the fields in this transaction datasets are generated. 

  7. Is the lineage information stored for each run of a program, or for each run of a pipeline? If the pipeline is a workflow, is it stored for the workflow? Or for each stage in the workflow? How would the lineage be composed for the entire pipeline if it is stored for each stage?

    1. Lineage information is stored for each run of the program. For a pipeline, it will be stored against the Workflow runid. It is not stored at the stage level.

  8. The same pipeline will run many times and produce the same field lineage over and over again. How do we deduplicate that?

    1. For every run we will store Field Level Lineage as a new row in the store. However while retrieving the lineage in the given time window we will group the lineage together along with the runids responsible for generating that lineage information. Example is added above for the response. Please take a look and let me know.