Authorization in CDAP
Currently, CDAP provides only an authentication service: it verifies that users are who they claim to be. Once a user has access to CDAP, however, he or she can do anything in CDAP: create namespaces, stop programs, delete apps, delete datasets, and so on.
Apache Sentry Primer:
Apache Sentry, part of the CDH distribution, specifically addresses authorization for Hadoop ecosystem tools. At a high level, Sentry has these concepts:
Resource - An entity that you want to regulate access to, for example a namespace, application, program, or dataset
Privilege - A permitted action on a resource, for example Read access (read-only explore, reads in CDAP programs) to a CDAP Dataset
Role - A collection of privileges, for example a data_analyst role with read access to dataset A and write access to dataset B
Group - A collection of users (LDAP/OS groups); one or more roles can be assigned to a group
User - A user can belong to one or more groups
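The relationship between these concepts can be illustrated with a small in-memory model. This is only a sketch of how a user's privileges resolve through groups and roles; it is not Sentry's API. The class names (Privilege, Role, PolicyStore), the action names ("READ", "WRITE"), the resource ids ("dataset:A"), and the user and group names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Privilege:
    """A permitted action on a resource, e.g. READ on dataset A."""
    action: str    # hypothetical action name, e.g. "READ" or "WRITE"
    resource: str  # hypothetical resource id, e.g. "dataset:A"


@dataclass
class Role:
    """A named collection of privileges."""
    name: str
    privileges: set = field(default_factory=set)


class PolicyStore:
    """In-memory stand-in for a policy store (illustrative, not Sentry)."""

    def __init__(self):
        self.roles = {}        # role name -> Role
        self.group_roles = {}  # group name -> set of role names
        self.user_groups = {}  # user name -> set of group names

    def add_role(self, role):
        self.roles[role.name] = role

    def assign_role(self, group, role_name):
        self.group_roles.setdefault(group, set()).add(role_name)

    def add_user_to_group(self, user, group):
        self.user_groups.setdefault(user, set()).add(group)

    def privileges_of(self, user):
        """Resolve user -> groups -> roles -> privileges."""
        result = set()
        for group in self.user_groups.get(user, ()):
            for role_name in self.group_roles.get(group, ()):
                role = self.roles.get(role_name)
                if role is not None:
                    result |= role.privileges
        return result

    def is_allowed(self, user, action, resource):
        return Privilege(action, resource) in self.privileges_of(user)


# The data_analyst role from the example above, assigned to a hypothetical group.
store = PolicyStore()
store.add_role(Role("data_analyst", {
    Privilege("READ", "dataset:A"),
    Privilege("WRITE", "dataset:B"),
}))
store.assign_role("analysts", "data_analyst")
store.add_user_to_group("alice", "analysts")
```

Note that privileges are never attached to a user directly: the user only acquires them through group membership, which is why revoking a role from a group affects every member at once.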
The admin wants to give user Super_Dev full access to the Dev namespace but give user Data_Analyst only restricted (TBD) access. The admin should be able to easily grant and revoke privileges using a framework/UI that is already familiar and already in use for controlling access to other tools in the cluster
The admin should be able to grant both high-level and low-level access, for example the ability to start/stop a specific program, or read/write access (explore/programmatic) to datasets
The admin also wants access to audit logs of access requests, covering both grants and denials
Ideally, fully restricting a user's access to, say, Dataset A should also prevent that user from reading or writing the dataset by bypassing CDAP and going to the storage layer (for example, by scanning the HBase table directly)
Alternatively, allow only the CDAP user to access the storage layer objects, and manage all access through CDAP (may not be feasible, since existing code reads directly from HDFS/HBase)
Apache Sentry, as described above, provides authorization control for Hadoop tools in CDH. We can thus delegate ACL management to Sentry.
Note that Sentry provides only authorization services; we need to handle authentication ourselves
The admin needs to use Sentry directly to set ACLs. TBD: In the CDAP UI, we need to decide whether to simply hide namespaces/apps that users don’t have access to.
The Sentry service creates and maintains an audit log trail (TBD: Figure out how the admin can access it. Does Hue provide access to it?)
The admin goes to Apache Sentry and grants specific groups access to namespaces/applications
The user provides an auth token with each request to CDAP, from which we can determine the user name
The Router forwards the request to the appropriate system service (once the auth token is verified)
The system service HTTP handler (say, AppFabric) checks with the Sentry service whether the user is authorized to perform the requested action. It gets a yes/no response and accepts or denies the request accordingly. Checking at the handler also lets us return partial responses, for example hiding the apps in a namespace that the user doesn’t have access to
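The steps above can be sketched as a single handler function. This is a minimal illustration under stated assumptions, not CDAP or Sentry code: the token table, the grant set, the entity id format ("namespace:dev", "app:dev.purchase"), and the function names user_from_token, sentry_check, and list_apps are all hypothetical. In particular, sentry_check stands in for the remote yes/no call to the Sentry service.

```python
class Unauthenticated(Exception):
    """The auth token could not be verified."""


class Forbidden(Exception):
    """The user is authenticated but not authorized."""


# Hypothetical token table; in CDAP the token is verified before the
# request reaches the system service.
TOKENS = {"token-123": "alice"}

# Hypothetical ACLs as (user, action, entity) triples.
GRANTS = {
    ("alice", "READ", "namespace:dev"),
    ("alice", "READ", "app:dev.purchase"),
}


def user_from_token(token):
    """Determine the user name from the auth token."""
    try:
        return TOKENS[token]
    except KeyError:
        raise Unauthenticated("invalid auth token")


def sentry_check(user, action, entity):
    """Stand-in for the yes/no authorization call to the Sentry service."""
    return (user, action, entity) in GRANTS


def list_apps(token, namespace, all_apps):
    """Handler sketch: deny the request outright, or return a partial list."""
    user = user_from_token(token)
    # Deny the whole request if the user cannot read the namespace.
    if not sentry_check(user, "READ", f"namespace:{namespace}"):
        raise Forbidden(f"{user} may not read namespace {namespace}")
    # Partial response: hide apps the user has no access to.
    return [
        app for app in all_apps
        if sentry_check(user, "READ", f"app:{namespace}.{app}")
    ]
```

For example, `list_apps("token-123", "dev", ["purchase", "billing"])` returns only `["purchase"]`, while the same call against a namespace the user cannot read raises Forbidden. One design consequence worth noting is that filtering costs one authorization check per listed entity, so a real implementation would likely need batching or caching.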
Scope for 3.3:
Namespace authorization: a user either has access to a specific namespace or does not.
The authorization client in CDAP should be pluggable by design, so that in the future we can drop in a Sentry or Ranger implementation without much modification.
We need to figure out how plugging a CDAP policy engine/data model into the Sentry service will work for new and existing Sentry installations (since managing Sentry is outside the scope of a CDAP installation)
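One way to picture the pluggable client for the namespace-only scope is an abstract interface with interchangeable backends selected by name. The sketch below is an assumption about the shape such a design could take, not the actual CDAP interface: Authorizer, AllowAllAuthorizer, InMemoryAuthorizer, the AUTHORIZERS registry, and create_authorizer are all hypothetical names. A Sentry- or Ranger-backed class would be registered the same way.

```python
from abc import ABC, abstractmethod


class Authorizer(ABC):
    """Hypothetical plug-in point: one yes/no question per namespace."""

    @abstractmethod
    def is_authorized(self, user, namespace):
        """Return True if the user may access the namespace."""


class AllowAllAuthorizer(Authorizer):
    """Mirrors current behavior: any authenticated user can do anything."""

    def is_authorized(self, user, namespace):
        return True


class InMemoryAuthorizer(Authorizer):
    """Illustrative backend holding (user, namespace) grants in memory."""

    def __init__(self, grants=()):
        self._grants = set(grants)

    def is_authorized(self, user, namespace):
        return (user, namespace) in self._grants


# Registry keyed by a hypothetical configuration value; a Sentry or Ranger
# implementation would be added here without touching the callers.
AUTHORIZERS = {
    "allow-all": AllowAllAuthorizer,
    "in-memory": InMemoryAuthorizer,
}


def create_authorizer(name, **kwargs):
    """Instantiate the backend selected by configuration."""
    try:
        cls = AUTHORIZERS[name]
    except KeyError:
        raise ValueError(f"unknown authorizer: {name}")
    return cls(**kwargs)


authorizer = create_authorizer("in-memory", grants=[("super_dev", "dev")])
```

Because handlers depend only on the Authorizer interface, switching backends becomes a configuration change rather than a code change, which is the property the scope item asks for.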
Future Sentry Integrations:
ACLs should be pushed down to the underlying storage layers. For example, restricting a user's access to a specific Dataset should also restrict that user's access in HBase
CDAP Programs (such as a Worker) should inherit the Dataset access control (by impersonating the user who starts the program)
Dataset operations in DatasetOpExecutor suffer from the issues above