

 

 

Goals

  1. Custom Namespace Mapping: Ability to map a namespace to user-provided underlying storage namespaces
  2. Cross Namespace access: Cross-namespace dataset/stream access (read and write only)

Checklist

  • User stories documented (Rohit)
  • User stories reviewed (Nitin)
  • Design documented (Rohit)
  • Design reviewed (Andreas/Terence)
  • Feature merged (Rohit)
  • Blog post 

User Stories

  1. As a CDAP user, I would like to specify the namespace in an underlying storage (e.g. HBase namespace, HDFS directory, Hive database) to use for a particular CDAP namespace.
  2. As a CDAP admin, I want to allow users to access (read/write) a dataset/stream from a program in a different namespace, as long as the user is authorized to perform that operation on the dataset/stream.

Design

Design 1: Store Custom Namespace Mapping in NamespaceConfig


To support mapping a namespace to user-provided storage provider namespaces (HBase, Hive, HDFS), we will accept these custom mappings during the namespace create operation (the UI will need to change to support this) and then store them. The mappings will be stored in NamespaceConfig, which currently stores the custom YARN queue name ("scheduler.queue.name") and is used by NamespaceMeta. We will add three additional fields to it:

NamespaceConfig
/**
 * Represents the configuration of a namespace. This class needs to be GSON serializable.
 */
public class NamespaceConfig {

  @SerializedName("scheduler.queue.name")
  private final String schedulerQueueName;

  @SerializedName("hbase.namespace")
  private final String hbaseNamespace;

  @SerializedName("hdfs.directory")
  private final String hdfsDirectory;

  @SerializedName("hive.database")
  private final String hiveDatabase;

  NamespaceConfig(@Nullable String schedulerQueueName, @Nullable String hbaseNamespace, @Nullable String hdfsDirectory,
                  @Nullable String hiveDatabase) {
    this.schedulerQueueName = schedulerQueueName;
    this.hbaseNamespace = hbaseNamespace;
    this.hdfsDirectory = hdfsDirectory;
    this.hiveDatabase = hiveDatabase;
  }

  NamespaceConfig(String schedulerQueueName) {
    this(schedulerQueueName, null, null, null);
  }
}

// NoteToSelf: Rename NamespaceConfig to NamespaceProperties and mark schedulerQueueName as Nullable.

 

The NamespaceConfig is exposed through NamespaceMeta.

NamespaceMeta
/**
 * Represents metadata for namespaces
 */
public final class NamespaceMeta {

  private final String name;
  private final String description;
  private final NamespaceConfig config;
}

Design Decisions:

  1. The user is responsible for managing the lifecycle of the custom HDFS directory, Hive database, and HBase namespace. CDAP will not create or delete any custom namespaces. These custom namespaces in the underlying storage provider must be empty.
    We are making this decision to keep the behavior consistent. Also, allowing users to use a namespace that already has data in it can lead to various issues. For example, we create a queue table for every namespace in HBase; what if a table with the same name already exists? Furthermore, we don't see a use case where a user would want CDAP to handle the lifecycle of custom namespaces or to have external data in them.
    // NoteToSelf: Sync with Ali Anwar for impersonation to figure out if we need to pre-check that the underlying namespace exists and is empty.

  2. Users can provide a custom namespace for one, several, or all storage providers. CDAP will be responsible for managing the lifecycle of the namespace in every storage provider for which the user did not provide a custom value.
    This gives users the flexibility to use custom namespaces only for the storage providers that need them and to let CDAP handle the others.

  3. The custom namespace mapping is final and immutable. It can only be provided during the creation of the namespace and cannot be changed afterwards.
    This keeps the design simple for now. Supporting a mutable mapping would require answering many other questions: What do we do with existing data? Will we need a migration tool? How do we migrate HBase, Hive, and HDFS data for CDAP?

  4. An underlying storage namespace can be mapped to only one cdap namespace. Users will not be allowed to create two cdap namespaces that use the same underlying storage namespace or a part of it (a subdir). During namespace creation we will explicitly check that no other cdap namespace is using the custom storage namespace. We will also check that the directory is not a subdir of a directory used by some other namespace.
    We are making this design decision because sharing an underlying namespace would have many unexpected consequences, since programs would be sharing datasets, metadata, etc. For example, deleting a dataset from one namespace would delete it from the other one too.
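
The directory check in design decision 4 could look roughly like the sketch below. CustomDirectoryValidator and findConflict are hypothetical names used only for illustration, and java.nio.file.Path stands in for HDFS paths; the actual check would run against the stored NamespaceMeta of the existing namespaces.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of CDAP. */
final class CustomDirectoryValidator {

  private CustomDirectoryValidator() { }

  /**
   * Returns the name of an existing namespace whose directory is the same as,
   * a parent of, or a subdirectory of the candidate directory, if there is one.
   */
  static Optional<String> findConflict(String candidateDir, Map<String, String> existingDirsByNamespace) {
    Path candidate = Paths.get(candidateDir).normalize();
    for (Map.Entry<String, String> entry : existingDirsByNamespace.entrySet()) {
      Path existing = Paths.get(entry.getValue()).normalize();
      // Path.startsWith compares whole name elements, so /data/ns1 does not match /data/ns10.
      if (candidate.startsWith(existing) || existing.startsWith(candidate)) {
        return Optional.of(entry.getKey());
      }
    }
    return Optional.empty();
  }
}
```

Checking both directions matters: a new namespace must not be placed inside an existing one, and it must not enclose an existing one either.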

 

User Story 1 and User Story 2 are related. To address User Story 2, we need to allow the user to specify a namespace for a dataset/stream rather than implicitly looking in the namespace where the program is running. Once we have identified the cdap namespace of a dataset/stream, we fall back to User Story 1, i.e. using the custom namespace mapping to access it in the underlying storage.

 

Creating and Deleting Underlying Custom Mapped Namespaces

DatasetFramework, which is responsible for creating/deleting a namespace in the storage providers, will be modified to take NamespaceMeta rather than Id.Namespace.

DatasetFramework: Current APIs
public interface DatasetFramework {
  /**
   * Creates a namespace in the Storage Providers - HBase/LevelDB, Hive and HDFS/Local File System.
   *
   * @param namespaceId the {@link Id.Namespace} to create
   */
  void createNamespace(Id.Namespace namespaceId) throws DatasetManagementException;

  /**
   * Deletes a namespace in the Storage Providers - HBase/LevelDB, Hive and HDFS/Local File System.
   *
   * @param namespaceId the {@link Id.Namespace} to delete
   */
  void deleteNamespace(Id.Namespace namespaceId) throws DatasetManagementException;
}
DatasetFramework: Proposed APIs
public interface DatasetFramework {
  void createNamespace(NamespaceMeta namespaceMeta) throws DatasetManagementException;
  void deleteNamespace(NamespaceMeta namespaceMeta) throws DatasetManagementException;
}

The underlying implementation will be changed to create namespaces in underlying storage with the custom name rather than cdap namespace name.

// NoteToSelf: DatasetFramework should not have methods to create/delete namespace. Move it out to an independent interface.

Retrieving Dataset/Streams from Custom Mapped Namespaces

HBase: Dataset

To support cross-namespace access, the DatasetContext and Input classes will be changed to accept a namespace along with the dataset name. We will add the following new APIs to these classes:

DatasetContext: Existing APIs
public interface DatasetContext {

  /**
   * Get an instance of the specified Dataset.
   *
   * @param name The name of the Dataset
   * @param <T> The type of the Dataset
   * @return An instance of the specified Dataset, never null.
   * @throws DatasetInstantiationException If the Dataset cannot be instantiated: its class
   *         cannot be loaded; the default constructor throws an exception; or the Dataset
   *         cannot be opened (for example, one of the underlying tables in the DataFabric
   *         cannot be accessed).
   */
  <T extends Dataset> T getDataset(String name) throws DatasetInstantiationException;  
  ...
  // other overloaded getDataset(...)
  ...
}
DatasetContext: Proposed APIs
public interface DatasetContext {
 
  <T extends Dataset> T getDataset(String namespace, String name) throws DatasetInstantiationException;  
  ...
  // same for all other overloaded getDataset(...)
  ...
}

 

We will also modify the Input class to take namespaced datasets/streams. This can be achieved in several ways, which are listed below:

 Agreed Upon API Change:

We will make ofDataset and ofStream return DatasetInput and StreamInput respectively, and these classes will have a new fromNamespace method which can be used to specify the namespace. If a user does not use fromNamespace, the dataset/stream will be looked up in the current namespace.

Input: Existing APIs
public abstract class Input {
	public static Input ofDataset(String datasetName) {
		...
	}
 
	public static Input ofStream(String streamName) {
		...
	}
 
    ...
	// other overloaded methods for ofDataset(...) and ofStream(...)
    ...
}
Input: Proposed APIs
public abstract class Input {
	public static DatasetInput ofDataset(String datasetName) {
		...
	}
 
	public static StreamInput ofStream(String streamName) {
		...
	}
 
    ...
	// Same changes in all overloaded methods for ofDataset(...) and ofStream(...)
    ...
 
	
	public static class StreamInput extends Input {

	  private String namespace;

	  @SuppressWarnings("unchecked")
	  public <T extends Input> T fromNamespace(String namespace) {
	    this.namespace = namespace;
	    return (T) this;
	  }
	}
 
	public static class DatasetInput extends Input {

	  // this field was missing in the original sketch
	  private String namespace;

	  @SuppressWarnings("unchecked")
	  public <T extends Input> T fromNamespace(String namespace) {
	    this.namespace = namespace;
	    return (T) this;
	  }
	}
}
 
// Use it in code somewhere as: 
Input dsInput = Input.ofDataset("someDataset").fromNamespace("ns1");
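
For comparison, here is a minimal, self-contained sketch of the same fluent shape in which fromNamespace returns a new object instead of mutating the receiver and uses covariant return types instead of an unchecked cast. This is only one possible variant, not the agreed-upon implementation; the constructors and the getName/getNamespace accessors are assumptions added so that the example compiles on its own.

```java
/** Simplified stand-in for the proposed Input class; not the CDAP implementation. */
abstract class Input {
  private final String name;
  // null means "the namespace in which the program is running"
  private final String namespace;

  Input(String name, String namespace) {
    this.name = name;
    this.namespace = namespace;
  }

  String getName() {
    return name;
  }

  String getNamespace() {
    return namespace;
  }

  static DatasetInput ofDataset(String datasetName) {
    return new DatasetInput(datasetName, null);
  }

  static StreamInput ofStream(String streamName) {
    return new StreamInput(streamName, null);
  }

  static final class DatasetInput extends Input {
    DatasetInput(String name, String namespace) {
      super(name, namespace);
    }

    // Returns a copy bound to the given namespace; the receiver is left unchanged.
    DatasetInput fromNamespace(String namespace) {
      return new DatasetInput(getName(), namespace);
    }
  }

  static final class StreamInput extends Input {
    StreamInput(String name, String namespace) {
      super(name, namespace);
    }

    // Returns a copy bound to the given namespace; the receiver is left unchanged.
    StreamInput fromNamespace(String namespace) {
      return new StreamInput(getName(), namespace);
    }
  }
}
```

With this variant, Input.ofDataset("someDataset").fromNamespace("ns1") still reads the same at the call site, and an Input that has already been handed to another component cannot change namespace afterwards.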

Other proposed ways to achieve this were: 

  1. Move the StreamId and DatasetId classes to the cdap-api module.
    1. We already have StreamId and DatasetId classes in cdap-proto, which take a namespace (string) and a dataset/stream name (string).
    2. We can add additional APIs which take a StreamId/DatasetId to support cross-namespace access.
    3. We will keep the existing APIs, though, so that users can still access a stream/dataset in the current namespace by providing just its name.

      Input.java
      /**
       * Defines input to a program, such as MapReduce.
       */
      public abstract class Input {
        private final EntityId entityId;

        /**
         * Returns an Input defined by a dataset.
         *
         * @param datasetName the name of the input dataset
         */
        public static Input ofDataset(DatasetId datasetID) {
          ...
        }

        ....
        ....

        /**
         * Returns an Input defined with the given stream name with all time range.
         *
         * @param streamName Name of the stream.
         */
        public static Input ofStream(StreamId streamId) {
          ...
        }
      }



  2. Add a new minimal Stream/Dataset classes in cdap-api
    We will modify the Input APIs to take a namespaced Dataset/Stream instead of dataset/stream names as strings.

    Stream
    public class Stream {
      private final String namespace;
      private final String stream;

      // to access stream in the current namespace
      public Stream(String stream) {
        this.namespace = null;
        this.stream = stream;
      }

      // to access stream in a different namespace
      public Stream(String namespace, String stream) {
        this.namespace = namespace;
        this.stream = stream;
      }
    }
    Dataset
    public class Dataset {
      private final String namespace;
      private final String dataset;

      // to access dataset in the current namespace
      public Dataset(String dataset) {
        this.namespace = null;
        this.dataset = dataset;
      }

      // to access dataset in a different namespace
      public Dataset(String namespace, String dataset) {
        this.namespace = namespace;
        this.dataset = dataset;
      }
    }

    Now we can add additional APIs in Input class to take these namespaced Dataset/Stream. This is similar to approach 1 but it does not require moving classes from cdap-proto to cdap-api. But it does introduce some duplication as these classes are almost same as StreamId/DatasetId.

  3. Take Namespace in Input class as string
    For all existing APIs in Input class for stream/dataset we can add additional ones which can take namespace as string and store in Input

    Input.java
    /**
     * Defines input to a program, such as MapReduce.
     */
    public abstract class Input {
      private final String name;
      private final String namespace;

      /**
       * Returns an Input defined by a dataset.
       *
       * @param datasetName the name of the input dataset
       */
      public static Input ofDataset(String datasetName) {
        ...
      }

      public static Input ofDataset(String namespace, String datasetName) {
        ...
      }

      ....
      ....

      /**
       * Returns an Input defined with the given stream name with all time range.
       *
       * @param streamName Name of the stream.
       */
      public static Input ofStream(String streamName) {
        ...
      }

      public static Input ofStream(String namespace, String streamName) {
        ...
      }

      /**
       * Returns an Input defined by an InputFormatProvider.
       *
       * @param inputName the name of the input
       */
      public static Input of(String inputName, InputFormatProvider inputFormatProvider) {
        ...
      }
    }

Other ways to access a Dataset/Stream:

UseDataset Annotation: UseDataset is an annotation which we provide to users for convenience. Since it is a Java annotation, its value is fixed at compile time, so the user would need to know the namespace of the dataset while writing the application code. Since CDAP applications are self-contained entities that are not bound to a namespace, we will not support cross-namespace access through the UseDataset annotation. The UseDataset annotation should always be used to access a dataset in the current namespace, i.e. the one in which the application is deployed.

Access in beforeSubmit: A program can access a stream/dataset in beforeSubmit() by its name. But these methods are now deprecated, and we expect users to use the APIs which take the Input class (changes proposed above). Since cross-namespace access is a new feature, we will not support it through deprecated APIs. The deprecated methods will soon be replaced by initialize/destroy APIs, and we will support accessing datasets across namespaces in those.

Admin interface for Dataset: We allow users to perform various dataset operations through the Admin interface, for example exists(String name), getProperties(String name), etc. We will need to support cross-namespace lookup through this interface too, but since we don't have a use case for it in this release, we are keeping it out of scope.

 

 

Once we can accept a namespace from a user (User Story 2, addressed above), the problem becomes similar to User Story 1: mapping a cdap namespace to the underlying custom namespace, if one has been provided. This will be addressed by fetching the NamespaceMeta for the given cdap namespace on every DatasetContext.getDataset call through a caching version of NamespaceAdmin.get(Id.Namespace namespaceId). CachingNamespaceAdmin will be injected into all the classes which implement DatasetContext (for example, BasicMapReduceContext, BasicFlowletContext, etc.) so that they can fetch the namespace meta and map to the internal storage provider namespace.
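
As a rough illustration of the caching lookup described above, the sketch below wraps an arbitrary lookup function and only calls it on a cache miss. CachingNamespaceLookup, its type parameter M (standing in for NamespaceMeta), and the invalidate method are assumptions made for this example; they are not the actual CachingNamespaceAdmin.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical caching wrapper for illustration; not a CDAP class. */
final class CachingNamespaceLookup<M> {

  // The expensive lookup, for example a remote NamespaceAdmin.get call.
  private final Function<String, M> delegate;
  private final Map<String, M> cache = new ConcurrentHashMap<>();

  CachingNamespaceLookup(Function<String, M> delegate) {
    this.delegate = delegate;
  }

  /** Returns the metadata for the namespace, calling the delegate only on a cache miss. */
  M get(String namespace) {
    return cache.computeIfAbsent(namespace, delegate);
  }

  /** Drops a cached entry, for example when the namespace is deleted. */
  void invalidate(String namespace) {
    cache.remove(namespace);
  }
}
```

Because the mapping is immutable for the lifetime of a namespace, entries never need to be refreshed, but a namespace can be deleted and recreated with a different configuration, so some form of invalidation or expiry is still needed.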

The TableId class will be modified to store the underlying HBase namespace as a string (the custom one, if provided) rather than the cdap namespace. Everyone creating an instance of TableId will be responsible for providing the underlying HBase namespace. We will achieve this with CachingNamespaceAdmin in a similar way as mentioned above.

TableId: Current Implementation
public class TableId {
  private final Id.Namespace namespace;
  private final String tableName;
}
TableId: Proposed Change
public class TableId {
  private final String namespace; // underlying hbase namespace
  private final String tableName;
}

 

HDFS (Streams/FileSets):

Streams and FileSets use NamespacedLocationFactory to get the base location for streams and filesets. A CachingNamespaceAdmin will be injected into the implementations of NamespacedLocationFactory so that they can look up the NamespaceMeta containing the underlying custom mapping while generating the base location.

NamespacedLocationFactory
public class DefaultNamespacedLocationFactory implements NamespacedLocationFactory {

  private final LocationFactory locationFactory;
  private final String namespaceDir;
  private final NamespaceAdmin nsAdmin; // a caching namespace admin
  
  ...  

  ....

  @Override
  public Location get(Id.Namespace namespaceId, @Nullable String subPath) throws IOException {
	// nsAdmin : look up custom namespace mapping if one exists else use namespaceDir 
	// return the path
  }
}
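
The lookup described in the comments above might resolve the base location along these lines. NamespaceLocationResolver is a hypothetical name, java.nio.file.Path stands in for Location, and the default layout of root directory, namespaces directory, then cdap namespace name is an assumption made for this sketch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical resolver for illustration; java.nio paths stand in for HDFS locations. */
final class NamespaceLocationResolver {

  private NamespaceLocationResolver() { }

  /**
   * Resolves the base location of a namespace.
   *
   * @param rootDir root directory of the installation (assumed layout)
   * @param namespaceDir directory under the root which holds all namespaces (assumed layout)
   * @param cdapNamespace name of the cdap namespace
   * @param customHdfsDirectory custom mapping from the namespace config, or null if none was given
   * @param subPath optional path below the namespace location, or null
   */
  static Path resolve(String rootDir, String namespaceDir, String cdapNamespace,
                      String customHdfsDirectory, String subPath) {
    // Use the custom directory if one was provided, otherwise fall back to the default layout.
    Path base = customHdfsDirectory != null
        ? Paths.get(customHdfsDirectory)
        : Paths.get(rootDir, namespaceDir, cdapNamespace);
    return subPath == null ? base : base.resolve(subPath);
  }
}
```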

 

Hive: Explore Tables

Custom Namespace: To support a custom namespace for Hive in CDAP, we will inject a CachingNamespaceAdmin into BaseHiveExploreService to get the custom Hive database if one has been provided. We will change the implementation of getHiveDatabase to support this custom mapping.

BaseHiveExploreService
public abstract class BaseHiveExploreService extends AbstractIdleService implements ExploreService {

	private String getHiveDatabase(@Nullable String namespace) {
	  // null namespace implies that the operation happens across all databases
	  if (namespace == null) {
	    return null;
	  }
	  // 1. Look up NamespaceMeta through NamespaceAdmin
	  // 2. Return the custom hive database if one exists, else:
	  String tablePrefix = cConf.get(Constants.Dataset.TABLE_PREFIX);
	  return namespace.equals(Id.Namespace.DEFAULT.getId()) ? namespace : String.format("%s_%s", tablePrefix, namespace);
	}
}
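
A self-contained sketch of the same fallback logic is shown below. HiveDatabaseResolver is a hypothetical name; the custom database and table prefix are passed in as parameters instead of being read from NamespaceMeta and cConf, and the default namespace id is assumed to be "default".

```java
/** Hypothetical stand-alone version of the getHiveDatabase logic; not the CDAP implementation. */
final class HiveDatabaseResolver {

  // Assumed value of Id.Namespace.DEFAULT.getId().
  static final String DEFAULT_NAMESPACE = "default";

  private HiveDatabaseResolver() { }

  /**
   * @param namespace the cdap namespace, or null if the operation spans all databases
   * @param customHiveDatabase the custom mapping from the namespace config, or null if none was given
   * @param tablePrefix the configured table prefix used for CDAP-managed databases
   */
  static String getHiveDatabase(String namespace, String customHiveDatabase, String tablePrefix) {
    // A null namespace implies that the operation happens across all databases.
    if (namespace == null) {
      return null;
    }
    // A custom mapping, if present, always wins.
    if (customHiveDatabase != null) {
      return customHiveDatabase;
    }
    // Otherwise fall back to the CDAP-managed database name.
    return DEFAULT_NAMESPACE.equals(namespace) ? namespace : String.format("%s_%s", tablePrefix, namespace);
  }
}
```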

 

Users can run cross-namespace explore queries by specifying hive_database_name.table_name in the query. This is already supported in our current release. Note: the database (namespace) given in the query must be the Hive database name, not the cdap namespace. Supporting cross-namespace access with the cdap namespace rather than the Hive database name would require pre-processing explore queries (which we don't do right now) to find the cdap namespace and replace it with the mapped Hive database (namespace). We are keeping this out of scope for 3.5.

Design 2: Store Custom Namespace Mapping in NamespaceId

 

We could also store the custom namespace mapping for the underlying storage providers in the NamespaceId itself. This approach has some advantages and disadvantages.

Advantage:

  • Storing the custom mapping in NamespaceId itself removes the need to inject a CachingNamespaceAdmin into the places mentioned above, as they will no longer be responsible for fetching the custom mapping. They can simply get it from the NamespaceId class.

Disadvantage:

  • This approach stores the custom mapping in NamespaceId, which is ugly.
  • We will need to make an RPC call to fetch the namespace meta to get the custom mapping whenever a NamespaceId is created from a namespace name. We can optimize this by caching the mapping for namespaces, though.

JIRA

CDAP-6153

CDAP-6157

 

Out-of-scope User Stories (4.0 and beyond)

  1. Support for accessing entities other than datasets/streams in a different namespace. For example, as a cdap user in namespace ns1, I should be able to create an application app1 using an artifact artifact2 which is present in namespace ns2.
  2. Cross-namespace access in explore queries using the cdap namespace. Currently, users can do cross-namespace access by providing the underlying Hive database name.
  3. The Admin interface for Dataset should be able to perform cross-namespace access.

References


Appendix A: API changes


Changes for dataset
// Dataset Context:
<T extends Dataset> T getDataset(String namespace, String name) 
<T extends Dataset> T getDataset(String namespace, String name, Map<String, String> arguments) 



// Add APIs to programs to support accessing dataset from a different namespace: 

// MapReduce: 
context.addInput(Input.ofDataset("dataset").fromNamespace("ns")); 

// Spark: 
public <K, V> JavaPairRDD<K, V> fromDataset(String namespace, String datasetName) 

public <K, V> JavaPairRDD<K, V> fromDataset(String namespace, String datasetName, Map<String, String> arguments) 

public abstract <K, V> JavaPairRDD<K, V> fromDataset(String namespace, String datasetName, Map<String, String> arguments, @Nullable Iterable<? extends Split> splits); 	
Changes for stream
// Add APIs for different programs to support accessing stream from another namespace: 

// MapReduce: 
context.addInput(Input.ofStream("stream").fromNamespace("ns")); 


// Flowlet: 
void connectStream(String namespace, String stream, Flowlet flowlet) 
void connectStream(String namespace, String stream, String flowlet) 


// Spark: 
JavaRDD<StreamEvent> fromStream(String namespace, String streamName, long startTime, long endTime); 

JavaPairRDD<Long, V> fromStream(String namespace, String streamName, Class<V> valueType) 

JavaPairRDD<Long, V> fromStream(String namespace, String streamName, long startTime, long endTime, Class<V> valueType); 

JavaPairRDD<K, V> fromStream(String namespace, String streamName, long startTime, long endTime, Class<? extends StreamEventDecoder<K, V>> decoderClass, Class<K> keyType, Class<V> valueType); 

JavaPairRDD<Long, GenericStreamEventData<T>> fromStream(String namespace, String streamName, FormatSpecification formatSpec, long startTime, long endTime, Class<T> dataType); 

10 Comments

  1. Is it worth it to store NamespaceMeta instead of the namespace string in TableId? Is TableId only used for HBase - can it be used anywhere else?

    1. As far as I can see now, storing the HBase namespace will be enough. I think the TableId class should not be aware of the cdap namespace or its meta at all. If we see a need for this during the implementation phase, we will add it.

  2. // NoteToSelf: Rename NamespaceConfig to NamespaceProperties and mark schedulerQueueName as Nullable.

    This is a cdap-proto class so be careful before doing this.

  3. Design decision #4: Do not only check that two namespaces don't use the same directory; also check that one directory is not a subdir of the other.

    1. Good point. I made a note of it in the design decision. 

  4. The proposed implementation of fromNamespace makes the Input objects mutable. It would be better to return a new Input object.

  5. beforeSubmit()/onFinish() are deprecated but they are being replaced by initialize()/destroy(). Within these methods we need to support accessing datasets in other namespaces.

    1. Added a note in the beforeSubmit section about this. I will work on adding cross namespace access once we have the APIs. 

  6. CachingNamespaceAdmin is caching forever because namespaces are immutable? Is that the idea? However, a namespace can still be deleted and recreated with a different configuration.