Cloudera Navigator tackles a varied set of problems, one of which is metadata management. In the CDAP Release 3.3 timeframe, we accomplished the propagation of CDAP entities’ metadata to Navigator, so users of Navigator can search for metadata tags/properties and CDAP entities will show up in the search results. There are still a few rough edges that need to be ironed out on the Navigator client side (https://github.com/cloudera/navigator-sdk/issues/created_by/gokulavasan). We also have a feature request to allow adding custom SourceTypes and EntityTypes, and to allow defining and adding technical metadata (we will open a GitHub issue for that). That will improve both the user experience and the data modeling.
Improvements to Navigator-Integration-App:
i) Entity deletion: Marking entities as deleted in Navigator. This will be a slight hack, since there is no direct way to get notified when an entity is deleted. Instead, we can use metadata changes in the system scope with a deletion operation as an indicator that an entity was deleted (and mark the entity as deleted). This can be introduced in the app once CDAP-4720 is fixed.
ii) Integration testing: Currently, unit testing of the app is virtually nonexistent, primarily because there is no in-memory Navigator implementation that we can depend upon. So the only way to test it would be in integration tests. That means our integration tests should have the ability to install Navigator as part of the cluster setup. The app already has a Query that can be used to search Navigator for metadata tags/properties.
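The entity-deletion workaround in item i) can be sketched as a filter over metadata change records. This is only a minimal sketch in Python: the `MetadataChange` record, its field names, and the `"SYSTEM"`/`"DELETE"` values are hypothetical stand-ins, not the actual CDAP metadata change format or any Navigator API.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class MetadataChange:
    """Hypothetical shape of one metadata change record (not the real CDAP type)."""
    entity_id: str
    scope: str       # assumed values: "SYSTEM" or "USER"
    operation: str   # assumed values: "ADD" or "DELETE"


def entities_to_mark_deleted(changes: Iterable[MetadataChange]) -> Set[str]:
    """Return ids of entities whose system-scope metadata was deleted.

    A system-scope deletion is treated as the signal that the entity itself
    was deleted; user-scope deletions are ordinary tag/property removals.
    """
    return {
        c.entity_id
        for c in changes
        if c.scope == "SYSTEM" and c.operation == "DELETE"
    }


changes = [
    MetadataChange("stream:purchases", "USER", "DELETE"),
    MetadataChange("dataset:history", "SYSTEM", "DELETE"),
    MetadataChange("app:PurchaseHistory", "SYSTEM", "ADD"),
]
print(entities_to_mark_deleted(changes))  # only dataset:history qualifies
```

The app would then mark each returned entity as deleted in Navigator; how that is done through the Navigator client is not shown here.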
Possible tighter integration options:
i) Two-way Metadata Integration: Note that the current integration only provides a read-only experience for CDAP metadata tags. That is, if the user adds metadata tags/properties to CDAP entities in Navigator, they stay only in Navigator and are not propagated back to the CDAP Metadata system. This is because there is currently no easy callback API available in Navigator to fetch metadata changes. Navigator does have a way for users to set a policy and get JMS notifications on metadata updates, but it is all manual at this point. The other option is to poll the entities periodically for metadata updates. We have to work with the Navigator team to figure out a good way to do it.
- Navigator requirements: Find out the best practice for callbacks on business metadata changes (incremental metadata extraction, policy setting with JMS notification, or polling).
- CDAP Platform requirements: None
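The polling option mentioned in item i) could look roughly like the following: take periodic snapshots of the tags Navigator holds for each entity and diff them against the previous snapshot to find what needs to be propagated back to CDAP. This is a hedged sketch; the snapshot shape (entity id mapped to a set of tags) is an assumption, and fetching the snapshot from Navigator is left out entirely.

```python
from typing import Dict, Set, Tuple

Snapshot = Dict[str, Set[str]]  # assumed shape: entity id -> set of tags


def diff_tags(previous: Snapshot, current: Snapshot) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Return {entity id: (added tags, removed tags)} for entities that changed."""
    changes = {}
    for entity_id in previous.keys() | current.keys():
        before = previous.get(entity_id, set())
        after = current.get(entity_id, set())
        added, removed = after - before, before - after
        if added or removed:
            changes[entity_id] = (added, removed)
    return changes


previous = {"dataset:history": {"pii"}, "stream:purchases": {"raw"}}
current = {"dataset:history": {"pii", "finance"}, "stream:purchases": set()}
for entity_id, (added, removed) in sorted(diff_tags(previous, current).items()):
    print(entity_id, "added:", sorted(added), "removed:", sorted(removed))
```

Each polling cycle would call `diff_tags` with the last stored snapshot and the fresh one, then apply the result to the CDAP Metadata system. The polling interval trades freshness against load on Navigator.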
ii) Audit Logging: Navigator surfaces audit logs of operations performed in the Hadoop stack, for example, if an HDFS directory was created (when, by whom, and whether it succeeded). This seems to be critical for compliance requirements. Navigator fetches this only for supported source types; I could see HDFS and HBase (maybe they have it for Hive too). It would be good to surface CDAP audit access logs, both for creation/deletion/start/stop of CDAP entities and for dataset access (read/write access, explore queries, etc.). I could not find any API to push this data to Navigator. It might require committing code to Navigator to enable it to pull data from CDAP, or Navigator would have to expose an API to push this info. Note that data access to HBase from YARN is already captured by Navigator (see attached screenshot). We need to figure out whether higher-level CDAP program access should also be captured, that is, CDAP Program -> CDAP Dataset access (which I think is still audit info that will be useful).
- Navigator requirements: Figure out how Navigator pulls in the audit info for other sources and find out what would be the best way to do it for CDAP.
- CDAP Platform requirements: Currently we have audit logging for REST calls. Should we figure out a way to publish these events? Will that capture all events? (For example, dataset access by programs will not be captured, which implies lineage info needs to be published somehow.)
iii) Technical Metadata for CDAP entities: Adding technical metadata, such as when a program was started/stopped, will allow searching for entities in Navigator Metadata search based on start/stop time, or finding CDAP entities created in the last 30 days, etc. This needs changes on the Navigator side: it might involve code changes to allow fully custom entities, or to add a CDAP source type and entity types specifically.
- Navigator requirements: Allow setting business metadata for custom types. (Or, if CDAP source/entity types are natively supported, this won’t be a problem.) There is some overlap between auditing and metadata info (for example, when an entity was created). It probably makes sense for such info to appear in both.
- CDAP Platform requirements: We need to publish events when a program is started/stopped, etc. This overlaps with the audit logging requirements.
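To illustrate the kind of search that technical metadata would enable, here is a small sketch of a "created in the last 30 days" filter. The entity ids and creation timestamps are made up for the example; in practice the filtering would happen inside Navigator's metadata search, not in client code.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def created_within(created: Dict[str, datetime], days: int, now: datetime) -> List[str]:
    """Return ids of entities whose creation time falls in the last `days` days."""
    cutoff = now - timedelta(days=days)
    return sorted(eid for eid, ts in created.items() if cutoff <= ts <= now)


# Hypothetical entities with their creation times.
now = datetime(2016, 1, 31, tzinfo=timezone.utc)
created = {
    "app:PurchaseHistory": datetime(2016, 1, 15, tzinfo=timezone.utc),
    "dataset:history": datetime(2015, 11, 2, tzinfo=timezone.utc),
}
print(created_within(created, days=30, now=now))
```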
iv) Lineage: We can add lineage relations using the current Navigator client with the @MRelation annotation once we have access to this info. We have to check how it will work if the source/destination changes for a particular program (that is, a Service could write to different datasets based on a PathParam, etc.). There is a feature to add this, but it needs extensive testing to make sure all the destinations/sources are shown for a specific program.
- Navigator requirements: Coordinate with the Navigator team to make sure we are heading down the right path, and make sure the rendering of lineage will work for our varied use cases: realtime/batch with potentially different sources/destinations (during a program execution).
- CDAP Platform requirements: We have to publish the data access info (overlaps with audit logging). It might also be useful to publish lineage info for non-CDAP sources/sinks in the case of Hydrator pipelines. This would be useful since the user can then visualize which RDBMS they are reading from, where the data ends up, etc. This was part of one of the demos that Informatica had a year back.
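Because a single program can touch different datasets from run to run, the lineage shown for it should be the union of everything observed. A minimal sketch of that aggregation follows; the `(program, dataset, access)` record shape is hypothetical, and turning the result into Navigator relations (for example via @MRelation) is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Access = Tuple[str, str, str]  # assumed shape: (program id, dataset id, "READ" or "WRITE")


def aggregate_lineage(accesses: Iterable[Access]) -> Dict[str, Dict[str, Set[str]]]:
    """Union the sources (reads) and destinations (writes) seen for each program."""
    lineage = defaultdict(lambda: {"sources": set(), "destinations": set()})
    for program, dataset, access in accesses:
        key = "sources" if access == "READ" else "destinations"
        lineage[program][key].add(dataset)
    return dict(lineage)


accesses = [
    ("service:Catalog", "dataset:products", "READ"),
    ("service:Catalog", "dataset:orders_us", "WRITE"),
    ("service:Catalog", "dataset:orders_eu", "WRITE"),  # different PathParam, different destination
]
print(aggregate_lineage(accesses))
```

A test for the lineage feature could compare this aggregated view with what Navigator renders for the same program.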
v) Analytics: This is a new feature/tab in Cloudera Navigator. There was not much info available about it. We have to talk to the Navigator team to find out more and ascertain whether it will be useful to us.