Proposal

Currently, CDAP with the Spark engine uses the RDD API. We propose adding support for the Spark DataFrame/Dataset API for CDAP data processing.

Benefits of this change

  • Performance benefits of the Spark DataFrame API (Tungsten, filter pushdown, optimized serialization, and reduced GC pressure, to name a few)
  • Eliminating the DataFrame/Dataset-to-RDD conversion that each plugin currently performs, which is a major overhead in pipeline runtime (see the sketch after this list)
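A minimal sketch of the conversion overhead, written against plain Spark Java APIs rather than CDAP's plugin interfaces; the field names, schema, and filter are illustrative. The first path mirrors today's flow, where the RDD is turned into a DataFrame and back at a plugin boundary; the second stays in Dataset<Row>, so Tungsten's binary format and Catalyst optimizations carry across stages.

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

public class DataFrameVsRdd {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("dataframe-vs-rdd").master("local[*]").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.LongType, false),
        DataTypes.createStructField("name", DataTypes.StringType, true)
    });

    // Stand-in for the records a source plugin would emit as an RDD today.
    JavaRDD<Row> recordRdd = jsc.parallelize(Arrays.asList(
        RowFactory.create(1L, "a"), RowFactory.create(2L, "b")));

    // Current pattern: a plugin that wants DataFrame semantics must build a
    // DataFrame from the RDD, then convert back for the next RDD-based stage.
    Dataset<Row> df = spark.createDataFrame(recordRdd, schema);
    JavaRDD<Row> backToRdd = df.filter(col("id").gt(1L)).javaRDD();

    // Proposed pattern: keep the data as Dataset<Row> between stages, so no
    // per-boundary conversion is paid and the optimizer sees the whole plan.
    Dataset<Row> stayInDf = spark.createDataFrame(recordRdd, schema)
        .filter(col("id").gt(1L));

    System.out.println(backToRdd.count() + " / " + stayInDf.count());
    spark.stop();
  }
}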

How to implement it?

Option 1: Drop support for RDD<StructuredRecord> and move to DataFrame (Dataset<Row>), with the necessary changes in CDAP core.
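A hypothetical sketch of what Option 1 could mean at the plugin level; DataFrameCompute and FilterCompute are illustrative names, not existing CDAP classes. The point is that a compute-style plugin would accept and return Dataset<Row> directly, so no per-plugin RDD conversion is needed and Catalyst can push the predicate down to the source where possible.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;

// Illustrative replacement for today's RDD<StructuredRecord>-based transform contract.
interface DataFrameCompute {
  Dataset<Row> transform(Dataset<Row> input) throws Exception;
}

// Example plugin body: the filter stays entirely within the DataFrame API.
class FilterCompute implements DataFrameCompute {
  @Override
  public Dataset<Row> transform(Dataset<Row> input) {
    return input.filter(col("status").equalTo("ACTIVE"));
  }
}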

Option 2: Add a new engine, "Spark DataFrame", alongside the existing one and implement it with the necessary changes in CDAP core.

