


  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post


Implement configuring impersonation at the application level. Enable impersonation in Explore queries.


Application Impersonation: As a part of CDAP-6131, we implemented impersonation for programs and data operations, but this could only be configured at the namespace level. We need the ability to configure this at the application level, so that we can run programs as different users without having to manage an additional namespace for each additional user.

Explore Impersonation: As a part of CDAP-6587, we implemented impersonation in Hive for Explore queries, which impersonates the namespace user if one was provided. For better security, we would like to run Explore queries as the user who submits them.

Entity ownership (CDAP-8065): Entities created by applications should be owned by the application owner. Access permissions to those entities can be granted to other users at creation time or later.

User Stories 

  • (Similar to Secure Impersonation user stories, but with application-level impersonation)
    1. As a CDAP admin, I would like to map an application (and the entities it contains) to a Kerberos principal. When CDAP programs of this application are submitted to YARN, the applications should be run as that user.
    2. As a CDAP application developer, my application should access HDFS, HBase, Hive, and other resources as the user/principal configured for it, instead of the global 'cdap' (or other, configured) user. 
  • As a CDAP admin, I would like Explore queries to run as the user submitting the Explore query.
  • As a CDAP application/dataset/stream owner, I would like to give other users access to the application/dataset/stream during creation or afterwards.


Scenario 1: App Creation

Alice is a human user who would like to create an app using an artifact. Alice has ADMIN access on the CDAP instance. She specifies a Kerberos principal, Louis, as the owner. After the app has been created, she would like the following to be true:
  1. Louis should get all the privileges (READ/WRITE/EXECUTE) on all the entities created by the deployed application, with CDAP authorization.
  2. Louis should own all streams/datasets created by the deployed app, i.e. he will own the HDFS files, HBase tables, and Hive tables.
  3. All programs should run with Louis' credentials (e.g. Kerberos ticket), i.e. if another user, Bob, who has sufficient privileges to run a program (EXECUTE on the program and READ on the namespace, if CDAP Authorization is turned on), starts the program, then the program should run as Louis.

Scenario 2: Dataset Creation/Maintenance

  1. Alice is a human user. Alice would like to create a dataset without deploying an application, and during creation she wants to specify an owner who will own the dataset, i.e. the HDFS files/HBase tables/Hive tables. She specifies the principal for a headless user, Louis, whose account she has access to, as the owner.
  2. Alice would like to perform dataset maintenance operations (truncate, delete, update) from the REST APIs, CLI, or UI, and she would like these operations to be performed as the dataset owner, Louis.
  3. Another user, Bob, who has sufficient privileges to administer the dataset, can perform the maintenance operations. All operations will be performed as the dataset owner, Louis.

Scenario 3: Access Control

  1. Jules is a human user who does not have CDAP credentials and wants to run a Hive query outside of CDAP. Her access to the data can be controlled by group permissions.
  2. Mary is a headless user who owns a CDAP program that reads from a dataset owned by Louis. An admin adds Mary to the group for the dataset. The program owned by Mary can now read the dataset.
  3. Eve is a human user who has both LDAP and kerberos credentials. She logs into the CDAP UI with her LDAP credentials and submits a query. While submitting the query she provides her kerberos principal and password. The query should be run as her kerberos principal.


Currently, whenever we need to perform a data operation or launch a program in YARN, we look up the namespace that the entity exists in and, based upon the principal mapping for that namespace, impersonate that principal. If there is no mapping, we perform actions as the current user (the cdap system user). Now, we will also need to maintain a principal mapping for entities such as applications, streams, and datasets.
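
The lookup described above can be sketched as follows. This is only an illustration: the class and method names are hypothetical (not from the CDAP code base), and the fallback from an entity-level mapping to the namespace-level mapping is an assumption about how the two levels would combine.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names come from the CDAP code base.
final class PrincipalResolver {
  // Entity-level mappings, keyed by "<namespace>/<entity>".
  private final Map<String, String> entityOwners = new HashMap<>();
  // Namespace-level mappings, keyed by namespace name.
  private final Map<String, String> namespaceOwners = new HashMap<>();

  void setEntityOwner(String namespace, String entity, String principal) {
    entityOwners.put(namespace + "/" + entity, principal);
  }

  void setNamespaceOwner(String namespace, String principal) {
    namespaceOwners.put(namespace, principal);
  }

  /**
   * Returns the principal to impersonate, or empty if the action should
   * run as the current (cdap system) user.
   */
  Optional<String> resolve(String namespace, String entity) {
    String owner = entityOwners.get(namespace + "/" + entity);
    if (owner == null) {
      // Assumed fallback: use the namespace mapping when the entity has none.
      owner = namespaceOwners.get(namespace);
    }
    return Optional.ofNullable(owner);
  }
}
```

An empty result corresponds to the existing behavior of acting as the cdap system user.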

Entity ownership

The ownership information for entities will be stored in an "owner.meta" table. The table will store the mapping from an entity to its owner's Kerberos principal (as a string). This information, along with the permissions on the entity, will be pushed down to the storage provider and used to control access (future work).

This will introduce an additional step during entity creation. An entry will need to be made to the owner.meta table. 

The table will not be used to store ACLs for this release as that will be handled by the storage provider but in future releases, we can expand this to manage the ACLs. This feature will be useful for storage providers that don't support ACLs. It will also be useful in providing a layer of abstraction over authorization backends like Apache Sentry and Apache Ranger.

Note: If an entity exists with an associated owner and the same entity is being created by some other user, then this operation will fail. Also, if this entity creation was triggered by some other operation, then the complete operation will fail too. For example, Alice has deployed an app in CDAP which created a dataset called 'employees'. Now if Bob tries to deploy another app which creates the same 'employees' dataset, then the app deployment will fail. If Bob wants to read the 'employees' dataset from his app, then he should get the 'employees' dataset in his program dynamically. He will be able to read this dataset if the conditions of Scenario 3.2 are met.
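
A minimal in-memory sketch of this rule is shown below. It is a stand-in for the owner.meta table, not the actual store: the class name and the string entity ids are hypothetical, and treating a repeated add by the same principal as a no-op is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the owner.meta table.
final class InMemoryOwnerStore {
  private final Map<String, String> owners = new ConcurrentHashMap<>();

  /** Records the owner; fails if the entity already has a different owner. */
  void addOwner(String entityId, String principal) {
    String existing = owners.putIfAbsent(entityId, principal);
    if (existing != null && !existing.equals(principal)) {
      throw new IllegalStateException(
          entityId + " is already owned by " + existing);
    }
  }

  /** Returns the owner's principal, or null if no owner is recorded. */
  String getOwner(String entityId) {
    return owners.get(entityId);
  }

  /** Idempotent: deleting a missing entry is not an error. */
  void deleteOwner(String entityId) {
    owners.remove(entityId);
  }
}
```

In the example above, Alice's deployment records her as the owner of 'employees', and Bob's attempt to create the same dataset fails at addOwner, which in turn fails his app deployment.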


Rows in owner.meta will be of the format:

rowkey: {<created from entity id>}, column: {'c'}, value: the owner's principal

The row key will be constructed from the entity id and will capture the entity hierarchy, e.g. for a stream it will be constructed using the namespace and stream id.
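
One way such a row key could be constructed is sketched below. The layout is an assumption, not the agreed format: the version prefix "v1", the ':' separator, and the class name are all hypothetical. The version prefix follows the suggestion in the comments on this page to build the key explicitly, with version information, rather than relying on EntityId.toString.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical row-key builder for owner.meta; the layout is an assumption.
final class OwnerRowKeys {
  // Assumed version prefix so the key layout can change without ambiguity.
  private static final String VERSION = "v1";
  private static final char SEPARATOR = ':';

  /** Builds "v1:<entityType>:<part1>:<part2>..." from the entity hierarchy. */
  static String rowKey(String entityType, String... hierarchy) {
    StringBuilder key = new StringBuilder(VERSION).append(SEPARATOR).append(entityType);
    for (String part : hierarchy) {
      if (part.indexOf(SEPARATOR) >= 0) {
        // Reject parts that would make the key ambiguous.
        throw new IllegalArgumentException("Separator not allowed in: " + part);
      }
      key.append(SEPARATOR).append(part);
    }
    return key.toString();
  }

  /** Encodes the key as UTF-8 bytes, as an HBase-style row key. */
  static byte[] rowKeyBytes(String entityType, String... hierarchy) {
    return rowKey(entityType, hierarchy).getBytes(StandardCharsets.UTF_8);
  }
}
```

For a stream st1 in the default namespace, rowKey("stream", "default", "st1") yields "v1:stream:default:st1".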

User management

To allow headless users access to the system, other authorized users need to impersonate them. To allow this impersonation we set the following convention:

  • All keytabs are present on the local filesystem on which CDAP master is running. 
  • These keytabs are located under a path that needs to be specified in cdap-security.xml:
    1. /<dir1>/<dir2>/${name}/${name}.keytab
  • ${name} will be replaced with the short name of the owner's principal. It can be used anywhere in the path, e.g. /home/${name}/kerberos/keytabs/${name}.keytab
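
The substitution can be illustrated with the sketch below. The class and method names are hypothetical, and the short-name derivation is a simplification (everything before the first '/' or '@'); a real deployment would normally derive the short name through Hadoop's configured principal-to-local-name rules.

```java
// Hypothetical helper; names are illustrative only.
final class KeytabPaths {
  /**
   * Simplified short name: the part of the principal before the first '/'
   * or '@'. Real clusters may apply configured mapping rules instead.
   */
  static String shortName(String principal) {
    int end = principal.length();
    int slash = principal.indexOf('/');
    int at = principal.indexOf('@');
    if (slash >= 0) {
      end = Math.min(end, slash);
    }
    if (at >= 0) {
      end = Math.min(end, at);
    }
    return principal.substring(0, end);
  }

  /** Replaces every literal ${name} in the template with the short name. */
  static String resolve(String template, String principal) {
    return template.replace("${name}", shortName(principal));
  }
}
```

For example, resolving /home/${name}/kerberos/keytabs/${name}.keytab for the principal louis/host1.example.com@EXAMPLE.COM gives /home/louis/kerberos/keytabs/louis.keytab.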




Pushing permissions to storage engines after creation (Out of 4.1 Scope)

The permissions assigned for entities will need to be pushed down to storage providers so that access outside the system will have the same restrictions. Both HBase and HDFS support ACLs and they will be used to assign finer grained permissions to the underlying tables or files. 

Directory permissions

The directory structure will be as follows. CDAP will own the parent directories for the namespace. The directories will be group writable, and everyone who has app deployment privileges will be part of that group so that they can create subdirectories. For any cleanup, for example, when the namespace is being deleted, the system user will impersonate the subdirectory owners to do the deletion. With this impersonation in place, the system user will not need access permissions on user directories. 


The groups for the directories will be specified while the entry is being created and once the directory is created the system will do a chgrp to change it to the provided group.



drwxrwxr-x   - cdap supergroup          0 2017-01-16 04:39 /cdap/namespaces/

To be able to create a namespace the user will need to be a part of the "supergroup".

A group can also be specified in cdap-security.xml with the property "namespace.creators". If a group is specified for this property, then CDAP will change the group of /cdap/namespaces to the specified group, allowing users in that group to create namespaces.


The namespace directory will be owned by the namespace owner.

During the creation of a namespace, a group can be specified. This group will have write and execute permissions on the namespace directory, allowing the users of this group to deploy applications in the namespace. Note: This will require a change to our existing namespace creation API.

drwxrwxr-x   - accountadmin accountgroup          0 2017-01-16 04:39 /cdap/namespaces/account

To be able to create anything under that namespace, the user will have to be a part of the "accountgroup".
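
The effect of these mode strings can be checked with a small sketch. It models only the basic POSIX-style owner/group/other evaluation that HDFS follows (no superuser bypass and no extended ACLs); the class and method names are hypothetical.

```java
import java.util.Set;

// Hypothetical helper; models basic owner/group/other permission checks only.
final class DirAccess {
  /**
   * Returns true if the user may create entries in a directory whose
   * listing shows the given mode (e.g. "drwxrwxr-x"), owner, and group.
   */
  static boolean canCreateIn(String mode, String owner, String group,
                             String user, Set<String> userGroups) {
    // Positions 1-3 are the owner bits, 4-6 the group bits, 7-9 the other bits.
    int base;
    if (user.equals(owner)) {
      base = 1;
    } else if (userGroups.contains(group)) {
      base = 4;
    } else {
      base = 7;
    }
    // Creating a child requires both write and execute on the parent.
    return mode.charAt(base + 1) == 'w' && mode.charAt(base + 2) == 'x';
  }
}
```

With the listing above (drwxrwxr-x, accountadmin, accountgroup), a user in "accountgroup" can create entries under /cdap/namespaces/account, while a user outside the group falls into the "other" class (r-x) and cannot.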



drwxr-xr-x   - account1 accountgroup          0 2017-01-17 02:41 /cdap/namespaces/account/streams/st1


All the directories will be owned by the headless users, whose keytabs need to be present so that they can be impersonated. Additionally, during the creation of an app, stream, or dataset, the user can specify a group, and CDAP will change the group of the associated files on HDFS and tables in HBase and Hive so that the given group has read access. 


Explore Impersonation

For Explore impersonation we won't be using keytabs. A human user will log in using their credentials, and to run Explore queries they will have to provide a Kerberos username and password. The system will authenticate with the KDC on behalf of the user and use the TGT to create a UGI for the user through the static method:

static UserGroupInformation getUGIFromTicketCache(java.lang.String ticketCache, java.lang.String user)

This UGI will then be used to impersonate the queries.


The RemoteUGIProvider provides methods that are called when a UGI is needed to impersonate a user. During the call to RemoteUGIProvider#createUGI, the Kerberos TGT can be obtained from the master through a REST API (/impersonation/credentials).


class ImpersonationInfo currently contains a principal and their keytab. This will change to include the path to the ticket cache for the user.




The explore window shows up when the user clicks on the explore icon on any explorable entity. If kerberos is enabled in the cluster then a modal window will show up the first time the explore icon is clicked. Through this window, the user can provide the Kerberos principal that the explore query should run as and the TGT for that principal.

The UI forwards the principal and the TGT to the router which forwards it to CDAP master. Both these routes support SSL. Once master has the TGT it can be serialized to HDFS with permissions set to 600.

Explore container can then use the TGT on HDFS to create a UserGroupInformation object and use that to impersonate the principal for running the query. The UGI once created will be cached.


The user will need to do a kinit before they are able to launch an Explore query from the CLI. The CLI will then pick up the TGT, and the rest of the flow is the same as for the UI.



For running Explore queries through the REST APIs the user will need to provide the TGT and the principal along with the query.

Upgrade tool


Open Questions

  1. Currently, Hive impersonation does not work when the engine is set to Spark (CDAP-7700). Do we need to fix this in 4.1?


  1. The principal configured for an application MUST have privileges to create tables in the (HBase) namespace it is deployed in. What happens if cdap is the entity creating this HBase namespace? How will the custom principal have CREATE privileges in that namespace?
  2. We will use AuthorizationHandler and PrivilegesManager for managing ACLs on the entities during and after creation. 
  3. The specification for impersonation is at Secure Impersonation Specification


API changes

New Programmatic APIs

New internal APIs:
Impersonation Store: Stores the user keytab information

public class ImpersonationStore {
  public void addImpersonationInfo(final ImpersonationInfo impersonationInfo) throws IOException {  }

  public ImpersonationInfo getImpersonationInfo(final String principal) throws IOException, ImpersonationInfoNotFound {  }

  // idempotent
  public void delete(final String principal) throws IOException {  }
}

Permission Store: Stores the entity ownership information.

public class PermissionStore {
  public void addOwner(final EntityId entityId, final String principal) throws IOException {  }
  public ImpersonationInfo getOwner(final EntityId entityId) throws IOException, NotFoundException {  }
  // idempotent
  public void deleteOwner(final EntityId entityId) throws IOException {  }
}

public final class ImpersonationInfo {
  private final String principal;
  private final String keytabURI;
}

Potential new external APIs (TBD):
Allowing group and permissions for FileSets/Streams/(other?) 


Entity Ownership:

Please see Secure Impersonation Specification#EntityOwnership

Remote Owner Service

We need a remote implementation of OwnerAdmin so that a program container or CDAP service container that performs requests under impersonation (as the namespace, app, dataset, or stream owner) can look up owner information internally if needed.

For example, an Explore query on a stream is handled by ExploreQueryExecutorHttpHandler. The handler impersonates the namespace owner. When the query actually runs, it might need to look up other CDAP resources (for example, the stream configuration). This call itself impersonates by doing a doAs for the resource involved (in this case, the stream). The Impersonator, which is responsible for providing the UGI to be impersonated for this call, tries to look up owner information for the resource and fails, since it tries to access the owner.meta table, which is a system table and cannot be accessed under user impersonation.

This requires adding a remote implementation of OwnerAdmin, which program containers and CDAP service containers can use to get the owner information. We will also need to add a handler in cdap-app-fabric to serve the requests from the remote client. Since this handler will reside inside CDAP master, it can query the owner store through OwnerAdmin, as it will be running as the cdap user.
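
A rough sketch of the client side is given below. It is not the actual OwnerAdmin interface: OwnerLookup and CachingRemoteOwnerLookup are hypothetical names, the remote HTTP call to the master is represented by a plain function, and the client-side cache (without any invalidation) is an assumption based on the caching idea raised in the comments on this page.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical, simplified view of the owner lookup used inside containers.
interface OwnerLookup {
  /** Returns the owner's principal, or null if the entity has no owner. */
  String getOwner(String entityId);
}

// Hypothetical remote implementation: delegates to the master and caches hits.
final class CachingRemoteOwnerLookup implements OwnerLookup {
  // Stand-in for the HTTP call to the handler running inside CDAP master.
  private final Function<String, String> remoteCall;
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  CachingRemoteOwnerLookup(Function<String, String> remoteCall) {
    this.remoteCall = remoteCall;
  }

  @Override
  public String getOwner(String entityId) {
    // computeIfAbsent stores nothing when the remote call returns null,
    // so entities without an owner are looked up again next time.
    return cache.computeIfAbsent(entityId, remoteCall);
  }
}
```

Because the container never touches owner.meta directly, the lookup works regardless of which user the container is impersonating. A real implementation would also need cache expiry so that ownership changes become visible.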

We will expose the following endpoints: (Note: Currently, we only support owner for namespace, app, artifact, stream, dataset)

Adding Owner
  Request Body:
    "namespacedEntityId": {},
  Response Code:
    200 - On success
    409 - If owner information for the entity already exists
    500 - Any internal errors

Deleting Owner
  Request Body:
    "stream": "stream",
    "namespace": "default",
    "entity": "STREAM"
  Response Code:
    200 - On success
    500 - Any internal errors

Getting Owner
  Request Body:
    "stream": "stream",
    "namespace": "default",
    "entity": "STREAM"
  Response Code:
    200 - On success
    500 - Any internal errors
  Response Body:
    "principal": "user/"

Getting Impersonation Information
  Request Body:
    "namespacedEntityId": {},
  Response Code:
    200 - On success
    500 - Any internal errors
  Response Body:
    "principal": "user/",

Entity Creation:

Please see: Secure Impersonation Specification#EntityCreation

CLI Impact or Changes

  • CDAP-8079 (Open): Provide a way for the user to specify Kerberos credentials while launching an Explore query through the CLI in an impersonated environment.
  • (optional) Create CLI for the above REST APIs

UI Impact or Changes

  • CDAP-8078 (Open): Provide a way for the user to specify Kerberos credentials while launching an Explore query through the UI in an impersonated environment.
  • (optional) Create UI for the above REST APIs

Security Impact 

We will need to implement authorization on the above REST APIs (which manage the impersonation metadata). Authorization will also need to be added when programmatically accessing this metadata (such as when launching the programs or performing dataset operations involving impersonation).

Impact on Infrastructure Outages 

This will rely on HBase for storing metadata (Similar to how we store all sorts of other metadata for applications). Without HBase (and dataset service), this will definitely not work.

Test Scenarios

Test ID | Test Description | Expected Results
IMP100 | (default namespace) Deploy an application from an artifact, for principal X, and run a program. | The program should run as X. Datasets/streams should have their HDFS/HBase storage owned by X.
IMP101 | (default namespace) Deploy another application from the same artifact, without specifying a principal, and run a program. | The program should run as the cdap system user. Datasets/streams should have their HDFS/HBase storage owned by the cdap system user.
IMP102 | Run IMP100 and IMP101 in a custom namespace that doesn't have impersonation. | Expectations should be the same.
IMP103 | Run IMP100 and IMP101 in a namespace that already has impersonation configured. | < Expected behavior TBD >


Release 4.1.0

Related Work



Future work






  1. impersonation

    This should either be post, or a put with the principal in the uri.

  2. Impact on Authorization? Will we grant privileges to all the entities of an app to the app "owner" upon app deploy?

    We already do this don't we ? 

    1. We do this for namespace-level impersonation, yes. We will need to do it separately for app-level owner upon app deploy.

  3. Scenarios

    We need the additional scenarios:

    • Alice is a headless user who cannot login. Jules is an operator who has access to Alice's account via a keytab file and needs to enact scenarios 1-3 on behalf of Alice.
    • George and Derek are both operators for the CDAP instance. Alice (or Jules) wants to allow Derek to query her data, but not curious George. Alice wants to control access to her data through group membership: Only users who are in the group that she used for her data should have read access, independent of whether they have privileges to operate her application.  


  4. Solution 1:

    Looks like this is a hybrid of two solutions:

    1. CDAP has permission to create and chown 
    2. CDAP owns parent dir, it has group write, and all app owners are in that group so that they can create subdirs. 

    I think either 1. or 2. is needed, but not both


  5. log handling

    In a separate effort for this release, we are redesigning the log saver. There is a high probability that this will result in saving the logs to files owned by CDAP, with read protected by CDAP authn and authz. That is, it will hopefully not be a concern for this effort

  6. application

    If I understand this correctly, the artifacts can be managed separately. (They could all be owned by a single user). As long as Alice, when she creates the app from the artifact, has privileges to read the namespace and execute the artifact, she should be fine. 

  7. How will users guarantee cross-application read/writes?

    Yes, ACLs have to be propagated to the storage engines. 

  8. We currently grant all privileges to the deploying user on all entities being created by the app deployment

    I don't think that is true. We grant privileges to the user that owns the namespace, right?

    1. We grant privileges in CDAP (CDAP authorization through Sentry). Though, I guess I got it confused. The context here is about the privileges in the underlying storage providers. When an app is deployed with an app owner, we will do the deployment impersonating that user. This will make the app owner the owner of the dataset files/HBase tables, as currently happens inside an impersonated namespace, I think. A dataset can also be created by hitting an endpoint; in that case we will need to be able to accept a dataset owner and create the dataset while impersonating that user. We haven't added this to the scenarios as of now. I will add it to the scenarios too. 

  9. Explore Impersonation:

    I think this is not correct in the current implementation. Running a query as the user that owns the namespace (btw, if a query spanned multiple namespaces, which user would we impersonate?) means that anyone who can run explore queries can see all data. Explore queries must be impersonated as the user who is authenticated, never as the user who owns the data. 

  10. How do we get the keytab of the logged in user?

    We should not use keytabs. Typically, keytabs are used for headless users, whereas here we need to impersonate a living human (at least until the Singularity happens). A human user should have two options:

    • Provide user/password to the UI/CLI
    • kinit and provide his TGT to the REST API (first option maps to this)



  11. Does the namespace have an owner? Could he be the one creating the user dirs and then chown them?

    This is related to question 1. If the namespace is not owned by CDAP, then the namespace owner (who can be impersonated by CDAP) fulfills the role of CDAP in my comment to q1.

  12. (User management)? Will it become a bottleneck?

    I think we need that. It will be a bottleneck as much as the namespace service is right now. With proper client(container)-side caching, this should not be an issue. 

  13. How would user programs access system data?

    They should not. They should emit events through TMS, and a system service would consume them and perform all the operation as system. Not sure whether TMS is ready for that in 4.1


  14. Since we have decided that we will need to do user/keytab management. I think it leads to two components: 1. Keytab Management 2. App Impersonation. It might be better if we move the design of these in independent sections for better understanding and also to design keytab management service in a generic way rather than the requirement for app impersonation. 

  15. hadoop.proxyuser.$admin.hosts

    Are we going to use/support proxyuser?? I thought we are using kerberos keytab instead.

    1. Proxy users seem to be a viable way to impersonate users submitting an explore query, as they may not have keytabs in the system.

      1. Who may not have keytabs in the system? The user who submits the query request, or the Hadoop user to use for running the query?

        1. The user who logged in to the UI using their LDAP credentials and wants to submit a query.

          1. Then how does proxy user help? I think one of the requirements is to have user A (an LDAP user, no KDC presence) submit a query and have user B (user B is in KDC and has a keytab) actually run the query.

            1. The query should run as user A though, correct?

              1. No, it should be run as user B on Hadoop (e.g. the MR job should be launched and execute as user B). User A may not be a hadoop user.

  16. What does the table schema of the permissions.meta table look like? Does it support the entity id hierarchy?

    1. Yes, the rowkey would be EntityId.toString

      1. Don't use the toString as key, as that's what causing all the incompatibility issue and requiring running upgrade tool. The key should be constructed, with version information in there.

          1. meaning don't use `EntityId.toString` as row key.

            1. Heh! That was supposed to be a thumbs up, but it shows up as question marks. I got what you are saying and was agreeing to it.