- Custom Namespace Mapping: Ability to map a namespace to user-provided underlying storage namespaces
- Cross Namespace access: Cross-namespace dataset/stream access (read and write only)
- User stories documented (Rohit)
- User stories reviewed (Nitin)
- Design documented (Rohit)
- Design reviewed (Andreas/Terence)
- Feature merged (Rohit)
- Blog post
- As a CDAP user, I would like to specify the namespace in an underlying storage (e.g. HBase namespace, HDFS directory, Hive database) to use for a particular CDAP namespace.
- As a CDAP admin, I want to allow users to access (read/write) a dataset or stream from a program in a different namespace, as long as the user is authorized to perform that operation on the dataset or stream.
Design 1: Store Custom Namespace Mapping in NamespaceConfig
To support mapping a namespace to user-provided storage namespaces (HBase, Hive, HDFS), we will accept these custom mappings during the namespace create operation (the UI will need to change to support this) and then store them. The mappings will be stored in NamespaceConfig, which currently stores the custom YARN queue name ("scheduler.queue.name") and is used by NamespaceMeta. We will add three additional fields to it:
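A minimal sketch of what the extended config could look like. The class name, field names, and accessors here are assumptions for illustration, not the final CDAP API; a null value is assumed to mean "no custom mapping, CDAP manages this storage namespace".

```java
// Sketch only: names are hypothetical and do not claim to match the real NamespaceConfig.
final class NamespaceConfigSketch {
  // Existing setting ("scheduler.queue.name"); may be null.
  private final String schedulerQueueName;
  // The three proposed additions; each is null when the user gave no custom value.
  private final String rootDirectory;   // custom HDFS directory
  private final String hbaseNamespace;  // custom HBase namespace
  private final String hiveDatabase;    // custom Hive database

  NamespaceConfigSketch(String schedulerQueueName, String rootDirectory,
                        String hbaseNamespace, String hiveDatabase) {
    this.schedulerQueueName = schedulerQueueName;
    this.rootDirectory = rootDirectory;
    this.hbaseNamespace = hbaseNamespace;
    this.hiveDatabase = hiveDatabase;
  }

  String getSchedulerQueueName() {
    return schedulerQueueName;
  }

  String getRootDirectory() {
    return rootDirectory;
  }

  String getHbaseNamespace() {
    return hbaseNamespace;
  }

  String getHiveDatabase() {
    return hiveDatabase;
  }
}
```

All fields are final, which matches the decision below that the mapping cannot change after namespace creation.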
// NoteToSelf: Rename NamespaceConfig to NamespaceProperties and mark schedulerQueueName as Nullable.
The NamespaceConfig is exposed through NamespaceMeta.
- The user is responsible for managing the lifecycle of the custom HDFS directory, Hive database, and HBase namespace. CDAP will not create or delete any custom namespaces. These custom namespaces in the underlying storage provider must be empty.
We are making this decision to keep the behavior consistent. Also, allowing users to use a namespace which already has data in it can lead to various issues. For example, for every namespace in HBase we create a queue table; what if a table with the same name already exists? Furthermore, we don't see a use case where a user will want CDAP to handle the lifecycle of custom namespaces or to have external data in them.
// NoteToSelf: Sync with Ali Anwar for impersonation to figure out if we need to pre-check for underlying namespace to be existing and empty.
- Users can provide a custom namespace for one, several, or all storage providers. CDAP will be responsible for managing the lifecycle of the namespace in every storage provider for which the user did not provide a custom value.
This gives users the flexibility to use custom namespaces only for the storage providers that need them and let CDAP handle the others.
- Namespace custom mapping is final and immutable. It can only be provided during the creation of the namespace and cannot be changed afterwards.
This is done to keep the design simple for now. Supporting mutable mappings requires answering many other questions: What do we do with existing data? Will we need a migration tool? How do we migrate HBase, Hive, and HDFS data for CDAP?
- An underlying storage namespace can be mapped to only one CDAP namespace. Users will not be allowed to create two CDAP namespaces which use the same underlying storage namespace or a part of it (a subdirectory). During namespace creation we will explicitly check that no other CDAP namespace is using the custom storage namespace. We will also check that the directory is not a subdirectory of a directory used by some other namespace.
We are making this design decision because sharing an underlying namespace leads to surprising consequences, since programs would be sharing datasets, metadata, etc. For example, deleting a dataset from one namespace would delete it from the other one too.
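The uniqueness check on custom directories described above can be sketched as follows. This is an illustration under assumptions: the class and method names are hypothetical, java.nio.file.Path stands in for whatever HDFS path type the real implementation would use, and the check is applied in both directions (the candidate may be neither a parent nor a child of an existing directory).

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collection;

// Sketch only: hypothetical helper, not part of CDAP.
final class StorageMappingConflicts {
  private StorageMappingConflicts() {
  }

  /** Returns true if the two directories are equal or one is nested inside the other. */
  static boolean overlaps(String dirA, String dirB) {
    // normalize() removes "." and ".." segments so equivalent paths compare equal.
    Path a = Paths.get(dirA).normalize();
    Path b = Paths.get(dirB).normalize();
    // Path.startsWith compares whole path elements, so /data/ns1 does not "start with" /data/ns.
    return a.startsWith(b) || b.startsWith(a);
  }

  /** Returns true if the candidate directory clashes with any directory already mapped. */
  static boolean conflictsWithAny(String candidate, Collection<String> existingDirs) {
    for (String existing : existingDirs) {
      if (overlaps(candidate, existing)) {
        return true;
      }
    }
    return false;
  }
}
```

Comparing by path element rather than by string prefix matters here: a plain string prefix test would wrongly reject /data/ns10 because /data/ns1 is already in use.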
User Story 1 and User Story 2 are related. To address User Story 2 we need to allow the user to specify a namespace for a dataset/stream rather than implicitly looking in the namespace where the program is running. Once we identify the CDAP namespace for a dataset/stream, we fall back to User Story 1, i.e. using the custom namespace mapping to access it on the underlying storage.
Creating and Deleting Underlying Custom Mapped Namespaces
DatasetFramework, which is responsible for creating/deleting a namespace in the storage provider, will be modified to take NamespaceMeta rather than Id.Namespace.
The underlying implementation will be changed to create namespaces in underlying storage with the custom name rather than cdap namespace name.
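As a rough sketch of the name resolution this implies: use the custom name when one was provided, otherwise fall back to a CDAP-derived name. The class, the method names, and the "cdap_" prefix used for the default are assumptions made for illustration only.

```java
// Sketch only: hypothetical helper; the default naming scheme shown is an assumption.
final class StorageNamespaceResolver {
  private StorageNamespaceResolver() {
  }

  /** Name CDAP would use when the user supplied no custom mapping (hypothetical scheme). */
  static String defaultName(String cdapNamespace) {
    return "cdap_" + cdapNamespace;
  }

  /** Returns the custom name if one was provided, otherwise the CDAP-derived default. */
  static String resolve(String cdapNamespace, String customName) {
    if (customName == null || customName.isEmpty()) {
      return defaultName(cdapNamespace);
    }
    return customName;
  }
}
```

Here customName would come from the NamespaceMeta passed in, which is why the method needs more than the namespace id alone.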
// NoteToSelf: DatasetFramework should not have methods to create/delete namespace. Move it out to an independent interface.
Retrieving Dataset/Streams from Custom Mapped Namespaces
To support cross-namespace access, the DatasetContext and Input classes will be changed to accept a namespace along with the dataset name. We will add the following new APIs to these classes:
We will also modify the Input class to take namespaced datasets/streams. This can be achieved in several ways, which are listed below:
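A simplified sketch of the kind of overload meant for DatasetContext. The interface below is a hypothetical stand-in (it returns Object instead of a typed dataset and omits dataset arguments), and the in-memory implementation exists only to make the lookup behavior concrete.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: simplified, hypothetical stand-in for DatasetContext.
interface NamespacedDatasetContext {
  /** Existing style: look the dataset up in the namespace the program runs in. */
  Object getDataset(String name);

  /** Proposed style: look the dataset up in an explicitly named namespace. */
  Object getDataset(String namespace, String name);
}

// Toy implementation used only to illustrate the two lookups.
final class InMemoryDatasetContext implements NamespacedDatasetContext {
  private final String currentNamespace;
  private final Map<String, Object> datasets = new HashMap<>();

  InMemoryDatasetContext(String currentNamespace) {
    this.currentNamespace = currentNamespace;
  }

  void register(String namespace, String name, Object dataset) {
    datasets.put(key(namespace, name), dataset);
  }

  @Override
  public Object getDataset(String name) {
    // The single-argument form delegates with the current namespace.
    return getDataset(currentNamespace, name);
  }

  @Override
  public Object getDataset(String namespace, String name) {
    Object dataset = datasets.get(key(namespace, name));
    if (dataset == null) {
      throw new IllegalArgumentException("No dataset " + name + " in namespace " + namespace);
    }
    return dataset;
  }

  private static String key(String namespace, String name) {
    return namespace + ":" + name;
  }
}
```

Keeping the single-argument method and delegating to the new one means existing programs behave exactly as before.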
Agreed Upon API Change:
We will make ofDataset and ofStream return DatasetInput and StreamInput respectively, and these classes will have a new fromNamespace method which can be used to specify the namespace. If a user does not use fromNamespace, the dataset/stream will be looked up in the current namespace.
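The agreed-upon shape can be illustrated with a small self-contained sketch. The classes below are hypothetical stand-ins carrying a "Sketch" suffix; only the method names ofDataset, ofStream, and fromNamespace come from the text above, and the rest is assumed.

```java
// Sketch only: simplified stand-ins for Input, DatasetInput, and StreamInput.
abstract class InputSketch {
  private final String name;
  private String namespace; // null means "use the namespace the program runs in"

  InputSketch(String name) {
    this.name = name;
  }

  static DatasetInputSketch ofDataset(String datasetName) {
    return new DatasetInputSketch(datasetName);
  }

  static StreamInputSketch ofStream(String streamName) {
    return new StreamInputSketch(streamName);
  }

  String getName() {
    return name;
  }

  /** Returns the explicitly chosen namespace, or the given current namespace if none was set. */
  String getNamespaceOrDefault(String currentNamespace) {
    return namespace == null ? currentNamespace : namespace;
  }

  void setNamespace(String namespace) {
    this.namespace = namespace;
  }
}

final class DatasetInputSketch extends InputSketch {
  DatasetInputSketch(String name) {
    super(name);
  }

  DatasetInputSketch fromNamespace(String namespace) {
    setNamespace(namespace);
    return this;
  }
}

final class StreamInputSketch extends InputSketch {
  StreamInputSketch(String name) {
    super(name);
  }

  StreamInputSketch fromNamespace(String namespace) {
    setNamespace(namespace);
    return this;
  }
}
```

With this shape a caller would write something like InputSketch.ofDataset("purchases").fromNamespace("ns2"), while InputSketch.ofDataset("purchases") alone keeps today's behavior.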
Other proposed ways to achieve this were:
- Approach 1: Move the StreamId and DatasetId classes to the cdap-api module. We already have StreamId and DatasetId classes in cdap-proto which take a namespace (string) and a dataset/stream name (string). We can add additional APIs which take a StreamId/DatasetId to support cross-namespace access, while keeping the existing APIs so that users can still access a stream/dataset in the current namespace by just providing its name.
- Approach 2: Add new minimal Stream/Dataset classes in cdap-api. We will modify the Input APIs to take a namespaced Dataset/Stream instead of dataset/stream names as strings, and then add additional APIs in the Input class to take these namespaced Dataset/Stream classes. This is similar to approach 1 but does not require moving classes from cdap-proto to cdap-api. It does, however, introduce some duplication, as these classes are almost the same as StreamId/DatasetId.
- Approach 3: Take the namespace in the Input class as a string. For every existing stream/dataset API in the Input class we can add an additional one which takes the namespace as a string and stores it in Input.
Other ways to access a Dataset/Stream:
UseDataset Annotation: UseDataset is an annotation which we provide to users for convenience. Since it is a Java annotation, the processing takes place at compile time, so the user would be required to know the namespace of the dataset while writing the application code. Since CDAP applications are self-contained entities and are not bound to a namespace, we will not support cross-namespace access through the UseDataset annotation. The UseDataset annotation should always be used for accessing a dataset in the current namespace, i.e. the one in which the application will be deployed.
Access in beforeSubmit: A program can access a stream/dataset in beforeSubmit() by its name. But these methods are deprecated now, and we expect users to use the APIs which take the Input class (changes proposed above). Since cross-namespace access is a new feature, we will not support it through deprecated APIs. Initialize/destroy APIs will be added soon in place of the deprecated ones, and we will support accessing datasets across namespaces in these.
Admin interface for Dataset: We allow users to perform various dataset operations through the Admin interface, for example exists(String name), getProperties(String name), etc. We will need to support cross-namespace lookup through this interface too, but since we don't have any use case for it in this release, we are keeping it out of scope.
Once we can accept a namespace from a user (User Story 2, addressed above), the problem becomes similar to User Story 1, which is being able to map a CDAP namespace to the underlying custom namespace, if one has been provided. This will be addressed by fetching the NamespaceMeta for the given CDAP namespace on every DatasetContext.getDataset call through a caching version of NamespaceAdmin.get(Id.Namespace namespaceId). CachingNamespaceAdmin will be injected into all the classes which implement DatasetContext (for example, BasicMapReduceContext, BasicFlowletContext, etc.) so that they can fetch the namespace meta and map to the internal storage provider namespace.
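The caching lookup described above could look roughly like this. It is a generic sketch under assumptions: the class name is hypothetical, the delegate function stands in for the real NamespaceAdmin.get call, and namespaces are identified by plain strings rather than Id.Namespace.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Sketch only: hypothetical caching wrapper; M stands in for NamespaceMeta.
final class CachingNamespaceLookup<M> {
  private final Function<String, M> delegate; // stands in for the real (remote) lookup
  private final ConcurrentMap<String, M> cache = new ConcurrentHashMap<>();

  CachingNamespaceLookup(Function<String, M> delegate) {
    this.delegate = delegate;
  }

  /** Returns the cached meta, calling the delegate only on the first request per namespace. */
  M get(String namespace) {
    return cache.computeIfAbsent(namespace, delegate);
  }

  /** Drops a cached entry, e.g. after the namespace has been deleted. */
  void invalidate(String namespace) {
    cache.remove(namespace);
  }
}
```

Because the custom mapping is immutable once the namespace is created, a cached entry only needs to be dropped when the namespace itself is deleted.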
TableId class will be modified to store the custom hbase namespace as a string rather than cdap namespace if one has been provided. Everyone creating an instance of TableId will be responsible for providing the underlying HBase Namespace. We will achieve this with CachingNamespaceAdmin in a similar way as mentioned above.
Streams and Filesets use NamespacedLocationFactory to get the base location for streams and filesets. A CachingNamespaceAdmin will be injected into the implementations of NamespacedLocationFactory so that they can look up the NamespaceMeta containing the underlying custom mapping while generating the base location.
Hive: Explore Tables
Custom Namespace: To support custom namespace for Hive in CDAP we will inject a CachingNamespaceAdmin to BaseHiveExploreService to get the custom hive namespace if one has been provided. We will change the implementation of getHiveDatabase to support this custom mapping.
Users can run cross-namespace explore queries by specifying hive_database_name.table_name in the query. This is already supported in our current release. Note: The database (namespace) given in the query should be the Hive database name, not the CDAP namespace. Supporting cross-namespace access with the CDAP namespace rather than the Hive database name would require pre-processing explore queries (we don't pre-process explore queries right now) to find the CDAP namespace and replace it with the mapped Hive database (namespace). We are keeping this out of scope for 3.5.
Design 2: Store Custom Namespace Mapping in NamespaceId
We can also store the custom namespace mapping for the underlying storage handlers in the NamespaceId itself. This approach has some advantages and disadvantages.
- Advantage: Storing the custom mapping in NamespaceId itself removes the need to inject a CachingNamespaceAdmin into the places mentioned above, as they will no longer be responsible for fetching the custom mapping. They can simply get it from the NamespaceId class.
- Disadvantage: This approach stores the custom mapping in NamespaceId, which is ugly. We will also need to make an RPC call to fetch the namespace meta and get the custom mapping whenever a NamespaceId is created from a namespace name, although we can optimize this by caching the mapping for namespaces.
Out-of-scope User Stories (4.0 and beyond)
- Support for accessing entities other than datasets/streams in a different namespace. For example, as a CDAP user in namespace ns1, I should be able to create an application app1 using an artifact artifact2 which is present in namespace ns2.
- Cross namespace access in explore queries with cdap namespace. Currently, users can do cross namespace access by providing the underlying hive database name.
- The Admin interface for Dataset should be able to perform cross-namespace access.