

Overview

This document captures the design of enhancements to data discovery in 4.0. Its main goal is to serve the Listing Center Home Page of CDAP 4.0.

Checklist

  • User stories documented (Bhooshan)
  • User stories reviewed (Nitin)
  • User stories reviewed (Todd)
  • Requirements documented (Bhooshan)
  • Requirements Reviewed (Nitin/Todd)
  • Design Documented (Bhooshan)
  • Design Reviewed (Andreas/Terence/Poorna)
  • Implementation
  • Documentation

Requirements

The main requirements influencing these enhancements are:

  1. Support configurable sorting for search results. Preferably, both sortBy and sortOrder should be supported. In addition, it would be nice to support multiple combinations of sortBy and sortOrder.
  2. Support pagination for search results. The API should accept offset (defines the start position in the search results) and limit (defines the number of results to show at a time) parameters.
  3. Search queries should be able to filter results by one or more entity types.
  4. Metadata for every search result should include (**needs confirmation**):
    1. name
    2. description
    3. creation time
    4. version
    5. entity type
    6. owner
    7. status (composed of statistics, current state, etc. of the entity)
  5. Potential requirement: Ability to annotate (if not filter) an entity by scope (SYSTEM vs USER)

User Stories

  1. As a CDAP user, I should be able to search all entities (artifacts, applications, programs, datasets, streams, views), sorted by name and/or creation time.
  2. As a CDAP user, I should be able to paginate search results by specifying a page size. In addition, I should be able to specify the offset from where to return search results.
  3. As a CDAP user, I should be able to filter search results by a given entity type

Design

Alternatives

The CDAP search backend is currently implemented using an IndexedTable. Adding sorting and pagination to this implementation would be difficult and could introduce performance bottlenecks, because a query may require multiple HBase scans. In addition, a separate index would have to be stored for each sortBy and sortOrder combination. An alternative is to fetch all results for the search query and sort them in memory. In a big data scenario, however, the result set may be too large for this to be viable.
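
To make the cost of the in-memory alternative concrete, here is a minimal sketch in Python. The function name and the sample records are hypothetical and are not part of CDAP. The point it illustrates is that every match has to be fetched and sorted before a single page can be returned.

```python
from operator import itemgetter


def search_in_memory(results, sort_by="name", descending=False, offset=0, limit=10):
    """Sort all matching results in memory, then slice out one page.

    Hypothetical helper for illustration only; not a CDAP API.
    """
    # The whole result set is materialized and sorted, even when the caller
    # only asks for a handful of rows. This is the scaling problem.
    ordered = sorted(results, key=itemgetter(sort_by), reverse=descending)
    return ordered[offset:offset + limit]


# Made-up sample matches for a search query.
matches = [
    {"name": "history", "created_time": 3},
    {"name": "PurchaseHistory", "created_time": 1},
    {"name": "purchases", "created_time": 2},
]

# Second result when ordering by creation time, ascending.
page = search_in_memory(matches, sort_by="created_time", offset=1, limit=1)
```

The work done is proportional to the total number of matches rather than to the page size, which is why this approach does not hold up for large result sets.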

The eventual goal of CDAP is to move from the current IndexedTable-backed search to an external search engine. The major motivations are to facilitate richer search queries and full-text search. An initial investigation of the alternatives is at External Search and Indexing Engine Investigation. Comparisons of the two most viable alternatives, Apache Solr and Elasticsearch, can be found at these links:

[1] http://solr-vs-elasticsearch.com/

[2] https://thinkbiganalytics.com/solr-vs-elastic-search/

Most research indicates feature parity between the two options, although Elasticsearch seems to have better REST API and JSON support. However, since Apache Solr is more favored in Hadoop-land (it is supported by more distributions, is the only search engine that Cloudera supports, and has support in Slider to run on YARN), it makes more sense as the first candidate for a search backend. The search backend can, however, be made pluggable (as an extension loaded in its own classloader through an SPI), so that it could be swapped out for Elasticsearch in the future if users wish.
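
The pluggable-backend idea can be sketched as a small interface with interchangeable implementations. The sketch below is in Python for brevity, whereas the real extension point would be a Java SPI. The names SearchBackend, InMemoryBackend, BACKENDS and load_backend are all hypothetical, and the registry lookup merely stands in for classloader-based discovery.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical extension point; not an existing CDAP interface."""

    @abstractmethod
    def index(self, entity_id, metadata):
        """Add or replace the indexed metadata for one entity."""

    @abstractmethod
    def search(self, query):
        """Return the ids of entities whose metadata matches the query."""


class InMemoryBackend(SearchBackend):
    """Toy implementation: case-insensitive substring match over metadata values."""

    def __init__(self):
        self._docs = {}

    def index(self, entity_id, metadata):
        self._docs[entity_id] = metadata

    def search(self, query):
        needle = query.lower()
        return sorted(
            entity_id
            for entity_id, metadata in self._docs.items()
            if any(needle in str(value).lower() for value in metadata.values())
        )


# Stand-in for SPI discovery: a backend is selected by a configured name.
BACKENDS = {"in-memory": InMemoryBackend}


def load_backend(name):
    return BACKENDS[name]()


backend = load_backend("in-memory")
backend.index("app:PurchaseHistory", {"tags": "Purchase PurchaseHistory"})
backend.index("dataset:history", {"tags": "history explore batch"})
hits = backend.search("explore")
```

Callers depend only on the interface, so a Solr-backed or Elasticsearch-backed implementation could be registered under another name without changing them.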

4.0 Requirements in Apache Solr

Sorting

Sorting (including multiple sort orderings) is supported in Apache Solr using the sort parameter.

Pagination

Pagination is supported as a combination of the start and rows parameters.

Filtering

Filtering is supported using the fq parameter.
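
As a rough illustration of how the three requirements map onto Solr request parameters, the sketch below builds a query string using sort, start, rows and fq. The function name and the entity_type field are assumptions, since the index schema is still to be decided.

```python
from urllib.parse import parse_qs, urlencode


def solr_select_params(query, sort=None, offset=0, limit=10, entity_types=None):
    """Build a URL-encoded Solr query string.

    Hypothetical helper; `entity_type` is an assumed field name.
    """
    params = [("q", query), ("start", offset), ("rows", limit)]
    if sort:
        # Solr accepts several orderings as "field order" pairs joined by commas.
        params.append(("sort", ",".join(f"{field} {order}" for field, order in sort)))
    if entity_types:
        # One filter query restricting results to any of the given types.
        params.append(("fq", "entity_type:(" + " OR ".join(entity_types) + ")"))
    return urlencode(params)


query_string = solr_select_params(
    "purchase",
    sort=[("name", "asc"), ("created_time", "desc")],
    offset=50,
    limit=2,
    entity_types=["application", "dataset"],
)
decoded = parse_qs(query_string)
```

Decoding the string again shows the values Solr would receive: the offset arrives as start, the page size as rows, and the entity type filter as a single fq clause.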

Running Apache Solr

Distributed Mode

Solr can either be run as a separate Twill Runnable, using logic like https://github.com/lucidworks/yarn-proto/blob/master/src/main/java/org/apache/solr/cloud/yarn/SolrMaster.java, or be housed inside the DatasetOpExecutorTwillRunnable. This decision depends on some prototyping. Solr will be configured to use HDFS for persistence.

Standalone Mode

Solr supports a standalone mode, which starts up a separate Solr process. However, we prefer to use EmbeddedSolrServer in the same process as standalone CDAP.

InMemory Mode

CDAP will use EmbeddedSolrServer in in-memory mode.

Data Flow


As in 3.5, there would be a call to update the index every time the metadata of an entity is updated. Unlike in 3.5, though, this call would be an HTTP call to the Search Service (running Solr in 4.0).

Note: Since this call is now an HTTP call:

  1. Should it be asynchronous?
  2. It will happen outside of the transaction that updates the Metadata Dataset.
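
If the answer to the first question is yes, one possible shape is a queue drained by a background worker. This is a sketch of one option, not a decided design. The dictionaries stand in for the Metadata Dataset and the Search Service, and all names are hypothetical.

```python
import queue
import threading

search_index = {}          # stand-in for the Search Service (Solr)
pending = queue.Queue()    # index updates waiting to be sent


def index_worker():
    """Drain queued updates; a real worker would issue the HTTP call here."""
    while True:
        update = pending.get()
        if update is None:            # sentinel: stop the worker
            pending.task_done()
            break
        entity_id, metadata = update
        search_index[entity_id] = metadata
        pending.task_done()


def update_metadata(store, entity_id, metadata):
    """Write metadata first, then queue the index update outside that write."""
    store[entity_id] = metadata            # stands in for the transactional write
    pending.put((entity_id, metadata))     # returns without waiting for the index


worker = threading.Thread(target=index_worker, daemon=True)
worker.start()

metadata_store = {}
update_metadata(metadata_store, "app:PurchaseHistory", {"tags": ["Purchase"]})

pending.join()       # wait until the queued update has been applied
pending.put(None)    # ask the worker to stop
worker.join()
```

Because the index update is applied after the metadata write and outside it, the two stores can briefly disagree, and an update can be lost if the process dies before the queue is drained.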

Index Sync

Since the persistence stores for metadata and the search index will be different, we will need a utility to keep them in sync. This could be a service/thread that runs periodically (preferred), or a tool that is invoked manually.
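
At its core, a sync pass has to work out which entities are missing or stale in the index and which indexed entries no longer exist in the metadata store. A minimal sketch follows, assuming both sides can be read as id-to-metadata mappings. The function name and the sample data are hypothetical.

```python
def plan_sync(metadata_store, search_index):
    """Return (to_upsert, to_delete) needed to make the index match the store."""
    # Entities that are missing from the index or whose indexed copy is stale.
    to_upsert = {
        entity_id: metadata
        for entity_id, metadata in metadata_store.items()
        if search_index.get(entity_id) != metadata
    }
    # Indexed entities that no longer exist in the metadata store.
    to_delete = sorted(
        entity_id for entity_id in search_index if entity_id not in metadata_store
    )
    return to_upsert, to_delete


store = {
    "app:PurchaseHistory": {"tags": ["Purchase"]},
    "dataset:history": {"tags": ["history", "batch"]},
}
index = {
    "dataset:history": {"tags": ["history"]},   # stale copy
    "stream:old": {"tags": []},                 # deleted from the store
}
to_upsert, to_delete = plan_sync(store, index)
```

Running the same pass against an empty index yields every entity as an upsert, which is the behavior the Upgrade Tool would need.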

Upgrade

There should be a way to upgrade existing indexes to be stored in the new Search backend. The index sync tool should be developed in a way that it can be run via the Upgrade Tool to update existing metadata in the new search backend.

Indexes (TBD)

  1. What is the schema of data to be indexed in the new search backend?

REST API changes

The following changes would be made in the metadata search RESTful API:

  1. A sort parameter that specifies the sort query. It contains a comma-separated list of sort fields and sort orders, e.g. sort=name%20asc,created_time%20desc
  2. An offset parameter that specifies the offset into the search results. Defaults to 0.
  3. A size parameter that specifies the number of results to return, starting at the offset. Defaults to Integer.MAX_VALUE.
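
The parameter handling described above could look roughly like the following sketch. The function names are hypothetical, and treating a missing sort order as ascending is an assumption that the design does not state.

```python
from urllib.parse import parse_qs

INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE, the default for size


def parse_sort(sort_param):
    """Turn "name asc,created_time desc" into [("name", "asc"), ("created_time", "desc")]."""
    clauses = []
    for clause in sort_param.split(","):
        parts = clause.split()
        if not parts or len(parts) > 2:
            raise ValueError(f"malformed sort clause: {clause!r}")
        field = parts[0]
        # Assumption: a missing order means ascending.
        order = parts[1].lower() if len(parts) == 2 else "asc"
        if order not in ("asc", "desc"):
            raise ValueError(f"unknown sort order: {order!r}")
        clauses.append((field, order))
    return clauses


def parse_search_params(query_string):
    """Extract sort, offset and size from a raw query string, applying defaults."""
    raw = parse_qs(query_string)
    sort = parse_sort(raw["sort"][0]) if "sort" in raw else []
    offset = int(raw.get("offset", ["0"])[0])
    size = int(raw.get("size", [str(INT_MAX)])[0])
    if offset < 0 or size < 0:
        raise ValueError("offset and size must be non-negative")
    return {"sort": sort, "offset": offset, "size": size}


params = parse_search_params("sort=name%20asc,created_time%20desc&offset=50&size=2")
defaults = parse_search_params("")
```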

In addition to echoing the input parameters above, the response would contain two fields:

  1. results - Contains a set of search results matching the search query
  2. total - specifies the total number of matched entities. This can be used to calculate the number of pages.
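
A client can derive the number of pages from total with a ceiling division. The helper names below are hypothetical.

```python
def page_count(total, size):
    """Number of pages needed to show `total` results, `size` per page."""
    if size <= 0:
        raise ValueError("size must be positive")
    return (total + size - 1) // size   # ceiling division without floats


def page_offsets(total, size):
    """Offsets to request, one per page."""
    return list(range(0, total, size))


pages = page_count(142, 10)
offsets = page_offsets(142, 50)
```

With total = 142 and size = 10 there are 15 pages, and the last page (offset 140) holds the remaining 2 results.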

TODO: Given the format of the entityId object in the search response, figure out if sorting can be applied on the entity name.

$ curl "http://localhost:11015/v3/namespaces/default/metadata/search?sort=name%20asc,created_time%20desc&offset=50&size=2"
{
  "sort": "name asc,created_time desc",
  "offset": 50,
  "size": 2,
  "total": 142,
  "results": [
    {
      "entityId":{
         "id":{
            "applicationId":"PurchaseHistory",
            "namespace":{
               "id":"default"
            }
         },
         "type":"application"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "Flow:PurchaseFlow":"PurchaseFlow",
               "MapReduce:PurchaseHistoryBuilder":"PurchaseHistoryBuilder"
            },
            "tags":[
               "Purchase",
               "PurchaseHistory"
            ]
         }
      }
    },
    {
      "entityId":{
         "id":{
            "instanceId":"history",
            "namespace":{
               "id":"default"
            }
         },
         "type":"datasetinstance"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "type":"co.cask.cdap.examples.purchase.PurchaseHistoryStore"
            },
            "tags":[
               "history",
               "explore",
               "batch"
            ]
         }
      }
    }
  ]
}

Status of an Entity

Along with showing the metadata of an entity (name, description, tags, properties, etc.), one of the requirements for the home page is to show a brief 'status' for every entity, which is a summary of statistics and metrics. For each entity type, the status should surface:

Artifact: # apps, # extensions, # plugins

Application: Total # programs, # Running, # Stopped

Program

Dataset: Read Rate, Write Rate, # apps using it

Stream: Read Rate, Write Rate, # apps connected to it, # stream views created

Stream View: Read Rate, Write Rate, # apps connected to it

This information will not be surfaced from the metadata system. The UI will potentially have to make separate calls to:

  1. Metrics APIs for getting Read Rate and Write Rate
  2. Usage Registry for apps using datasets, streams and stream views
  3. App Fabric APIs for getting the other information from App Fabric

For 2 and 3, an alternative would be to provide a UI-only (undocumented) batch endpoint.

Dataset Types in Metadata System

Currently, the Metadata System only supports artifacts, applications, programs, datasets, streams and stream views as entities. Is support for dataset types and modules necessary for 4.0?


12 Comments

  1. Are we bringing in Slider to run Solr?

    1. No. That was misleading. I only mentioned Slider to illustrate that Solr has been proven to integrate and run on YARN. We will run Solr via a Twill Runnable - either dedicated or shared.

  2. Do we rely on the Solr from the distro for indexing or do we start our own? How does it compare to ElasticSearch operationally?

    1. One important operational factor is which one is easier for us to understand fully and support, because in the end it will be on us to support our customers.

      1. With respect to operation, Elasticsearch is more 'independent'. In fact, the most commonly used model for Elasticsearch is for it to run on a set of Elasticsearch nodes that are managed independently by Elasticsearch. Solr integrates more naturally with YARN and also has wider support amongst distros.

        For support, Solr seems to have a more diverse community behind it, so I expect that help may be more easily available.

        Some more details on comparison are at http://solr-vs-elasticsearch.com/ and https://www.datanami.com/2015/01/22/solr-elasticsearch-question/

    2. I wrote this design with the intention of starting our own. However, nothing should stop us from re-using an existing Solr installation from a distro. If that's the case though, ElasticSearch is not supported in CDH today.

      1. What do you mean by not supported? Does it mean it can't be started in YARN? Also, how is the multi-tenant support in both engines?

        1. Cloudera Search is built on top of Solr. Solr is integrated into their entire product, including Cloudera Manager, Sentry, etc. There are some articles on Elasticsearch certification on CDH, but it doesn't seem to be a first class citizen. 

          Also, yes, there isn't support for starting it on YARN. This has nothing to do with CDH though, Elasticsearch claims that their integration with YARN is not production-ready - https://www.elastic.co/guide/en/elasticsearch/hadoop/current/es-yarn.html

  3. I strongly suggest that the metadata update be decoupled through a messaging system.

    1. I will discuss this with you tomorrow. Can the messaging system being developed in CDAP in 4.0 be used for this? 

  4. It would be better for the response to include the request parameters (e.g. offset, page size, ...) for easy rendering.