

Goals:

  1. Improve Metadata Search: This requires a redesign of how we store metadata. The proposed design is described below.
    • Make search for tags work for all the tags in the list
    • Support tokenized search where user can search with any word from the value
  2. Schema Search:
    • CDAP Schema for Datasets, Streams and Views should be stored as metadata and be searchable by fieldname, or by fieldname together with fieldtype (only for primitive fieldtypes)
  3. Search filtering based on entity type.

Checklist

  • User stories documented (Rohit/Poorna)
  • User stories reviewed (Nitin)
  • Design documented (Rohit/Poorna)
  • Design reviewed (Andreas)
  • Feature merged (Rohit)
  • Examples and guides (Rohit)
  • Integration tests (Rohit) 
  • Documentation for feature (Rohit)
  • Blog post 

User Stories: 

  1. Key Value Metadata Search
    1. User should be able to search with key-value or its prefix
    2. User should be able to search with key and individual word in value or its prefix
    3. User should be able to search with just value or its prefix
    4. User should be able to search with individual words in the value 
  2. Tag Metadata Search
    1. User should be able to search with tags key and a tag value or its prefix
    2. User should be able to search with just a tag value or its prefix.
  3. Schema Search:
    1. User should be able to search with fieldname or its prefix
    2. User should be able to search with fieldname or its prefix scoped just to schema 
    3. User should be able to search with fieldname and fieldtype (only for primitive types)
  4. Search Filtering:
    1. User should be able to filter searches to a particular entity type, for example app, program or dataset
  5. Partial Searching:
    1. User should be able to see results for the individual words in the search query.

Design

Search Query Examples:

  1. User stores a key-value metadata with key = "Codename" and value = "Alpha Tango Charlie" for an entity
    1. User can retrieve this entity with the following queries:
      • key-value
        1. Codename: Alpha Tango Charlie
        2. Codename: Alpha Tang*
      • key with part of value
        1. Codename: Alpha
        2. Codename: Tango
        3. Codename: Charlie
        4. Codename: Alp*
      • value
        1. Alpha Tango Charlie
        2. Alpha*
        3. Alpha Tan*
          Note:
          1. We have decided not to support queries that contain only part of a multi-word value, for example "Tango Charlie". You can search for the whole value, for a prefix of it, or for a single word (we plan to tokenize on whitespace).
      • Individual word in value
        1. Alpha
        2. Tango
        3. Charlie
        4. Alph*
        5. Tan*
        6. Ch*
    2. Not supported:
      1. key* i.e. Codename*
  2. User tags an entity with the following tags "Tag1, Tag22"
    • User can retrieve this entity with the following queries:
      • tag key and a tag value:
        1. tags: Tag1
        2. tags: Tag*
      • a tag value
        1. tag22 
        2. tag2*
  3. A dataset has the following schema: 

    Nested Schema
    {
      "EmpName": "String",
      "EmpContact": {
        "EmpTel": "Integer",
        "EmpAddr": "String"
      }
    }

    User can retrieve this dataset entity with the following queries:

    • fieldname:
      1. EmpName
      2. EmpContact
      3. EmpTel 
      4. EmpAddr
      5. Emp*
    • fieldname scoped to schema:
      1. schema: EmpName
      2. schema: EmpContact
      3. schema: EmpTel
      4. schema: EmpAddr 
      5. schema: Emp*
    • fieldname with fieldtype (only for primitive types)
      1. EmpName:String (only for java primitive types)

    Note:
    1. We don't plan to support schema searches with complex fieldtypes. If a query is not scoped with schema, it will by default search schema fields in addition to the normal key-value metadata and tags.
      Open questions:
      • What if an entity has multiple schemas (ex: a transform which has an input and an output schema)?
        • We will index both schemas (after discussion with Nitin)
      • How will a user search for a fieldName across input and output schemas?
        • We do not support searches limited to the input schema, the output schema, or any single schema (after discussion with Nitin)
  4. Search Filtering:
    1. User wants to search only for 'dataset'
      1. dataset: Codename: Alpha
      2. dataset: tags: Tag1
      3. dataset: schema: EmpName
        Note: if no entity type is specified, we will return all matched entities.
  5. Partial Searching:
    1. User searches for "California USA": we split every search query on whitespace and also search for every single word (OR)
      Search result will contain:
      1. All entities tagged with  "California USA" followed by
      2. All entities tagged with "California" followed by
      3. All entities tagged with "USA"
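The result ordering for partial searching can be sketched as follows. This is a minimal illustration in Python, not the CDAP implementation: `search_exact` is a hypothetical stand-in for the index lookup, the in-memory `INDEX` is invented sample data, and lower-casing the search term is an assumption.

```python
from typing import Callable, Iterable, List


def partial_search(query: str, lookup: Callable[[str], Iterable[str]]) -> List[str]:
    """Return entity ids matching the whole query first, then each single word."""
    words = query.split()  # tokenize on whitespace
    # The whole query comes first; single words are added only for multi-word queries.
    terms = [" ".join(words)] + (words if len(words) > 1 else [])
    results: List[str] = []
    seen = set()
    for term in terms:
        for entity_id in lookup(term):
            if entity_id not in seen:  # keep only the first (best) position of an entity
                seen.add(entity_id)
                results.append(entity_id)
    return results


# Hypothetical index for illustration: search term -> matching entity ids.
INDEX = {
    "california usa": ["dataset:d1"],
    "california": ["dataset:d1", "stream:s1"],
    "usa": ["app:a1", "dataset:d1"],
}


def search_exact(term: str) -> List[str]:
    """Hypothetical exact lookup; lower-casing the term is an assumption."""
    return INDEX.get(term.lower(), [])
```

Calling `partial_search("California USA", search_exact)` on this sample data lists the entity that matches the whole query before the entities that match only "California" or only "USA".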

Storage:

We are going to use the IndexedTable, which we are currently using too. In the new storage design we will have two kinds of rows:

  1. Value Row: This row is keyed by the entity id and the metadata key, and stores the value in the value column
  2. Index Row: This row is keyed by the entity id and the key (as above) with the index appended. The index is also stored in the index column, which is the column used for indexing.

 

Metadata Storage Format:

Key Column | Value Column
<VRPrefix><Entity-Id><Key> | Value
<VRPrefix><Entity-Id><Tags> | Tag1, Tag2, Tag3....
<VRPrefix><Entity-Id><Schema> | {Some Schema}

Index Storage Format:

Key Column | Index Column
<IRPrefix><Entity-Id><Key><Index> | Index
<IRPrefix><Entity-Id><Tags><Index> | Index
<IRPrefix><Entity-Id><Schema><Index> | Index

 

The table below shows how we plan to store the key-value, tags and schema examples discussed above. The index column contains every entry that a search query can match.

Key: Entity with key | Value Column: Value of Metadata (Not Indexed) | Index Column: Indexed value (Indexed)
<VRPrefix><Entity-Id><Codename> | Alpha Tango Charlie | (empty)
<VRPrefix><Entity-Id><Tags> | Tag1, Tag22 | (empty)
<VRPrefix><Entity-Id><Schema> | {EmpName: String, EmpContact: {EmpTel: Integer, EmpAddr: String}} | (empty)
<IRPrefix><Entity-Id><Codename><Codename: Alpha Tango Charlie> | (empty) | Codename: Alpha Tango Charlie
<IRPrefix><Entity-Id><Codename><Codename: Alpha> | (empty) | Codename: Alpha
<IRPrefix><Entity-Id><Codename><Codename: Tango> | (empty) | Codename: Tango
<IRPrefix><Entity-Id><Codename><Codename: Charlie> | (empty) | Codename: Charlie
<IRPrefix><Entity-Id><Codename><Alpha Tango Charlie> | (empty) | Alpha Tango Charlie
<IRPrefix><Entity-Id><Codename><Alpha> | (empty) | Alpha
<IRPrefix><Entity-Id><Codename><Tango> | (empty) | Tango
<IRPrefix><Entity-Id><Codename><Charlie> | (empty) | Charlie
<IRPrefix><Entity-Id><tags><tags: Tag1> | (empty) | tags: Tag1
<IRPrefix><Entity-Id><tags><tags: Tag22> | (empty) | tags: Tag22
<IRPrefix><Entity-Id><tags><Tag1> | (empty) | Tag1
<IRPrefix><Entity-Id><tags><Tag22> | (empty) | Tag22
<IRPrefix><Entity-Id><schema><schema: EmpName> | (empty) | schema: EmpName
<IRPrefix><Entity-Id><schema><schema: EmpContact> | (empty) | schema: EmpContact
<IRPrefix><Entity-Id><schema><schema: EmpTel> | (empty) | schema: EmpTel
<IRPrefix><Entity-Id><schema><schema: EmpAddr> | (empty) | schema: EmpAddr
<IRPrefix><Entity-Id><schema><EmpName> | (empty) | EmpName
<IRPrefix><Entity-Id><schema><EmpContact> | (empty) | EmpContact
<IRPrefix><Entity-Id><schema><EmpTel> | (empty) | EmpTel
<IRPrefix><Entity-Id><schema><EmpAddr> | (empty) | EmpAddr

We will use the IndexedTable as before, but the keys of rows that store values will now be prefixed with a special VRPrefix (ValueRowPrefix), and the value will be stored in the value column. The indexes will be stored in the same table, with keys prefixed with IRPrefix (IndexRowPrefix). For these rows the value column will be empty, and the index column will hold the index value that is indexed for search.
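As an illustration of how the index entries in the table above could be derived, here is a minimal Python sketch. It is not the CDAP implementation: the function names, the ": " separator, and the representation of the schema as a nested dict are assumptions made for this example. It also does not cover fieldname-with-fieldtype entries such as "EmpName:String".

```python
from typing import Dict, List, Union

SEPARATOR = ": "  # assumed separator between key and value inside an index entry

Schema = Dict[str, Union[str, "Schema"]]


def _keyed_then_bare(key: str, values: List[str]) -> List[str]:
    """Entries scoped with the key first, then the bare values, without duplicates."""
    keyed = [f"{key}{SEPARATOR}{value}" for value in values]
    return list(dict.fromkeys(keyed + values))  # dict.fromkeys keeps insertion order


def key_value_index_entries(key: str, value: str) -> List[str]:
    """Index the whole value plus each whitespace-separated word."""
    return _keyed_then_bare(key, [value] + value.split())


def tag_index_entries(tags: List[str]) -> List[str]:
    """Index every tag, both scoped with 'tags' and on its own."""
    return _keyed_then_bare("tags", tags)


def schema_field_names(schema: Schema) -> List[str]:
    """Collect field names depth-first, including fields of nested records."""
    names: List[str] = []
    for name, field_type in schema.items():
        names.append(name)
        if isinstance(field_type, dict):
            names.extend(schema_field_names(field_type))
    return names


def schema_index_entries(schema: Schema) -> List[str]:
    """Index every field name, both scoped with 'schema' and on its own."""
    return _keyed_then_bare("schema", schema_field_names(schema))
```

For the "Codename" example, `key_value_index_entries("Codename", "Alpha Tango Charlie")` yields the eight index values listed in the table. Each of them would then be written as one index row.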

Another possibility was to store the real key-value in a separate table and the indexes in the IndexedTable. This would avoid the empty column values in a row, but it would lead to 6 tables in total (3 each for system and business), hence we have decided against it.

Search Filtering: We will perform post filtering if the query is limited to an entity type. 
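A minimal Python sketch of post filtering is shown below. It assumes that a query such as "dataset: Codename: Alpha" carries the entity type as its first segment and that search results are (entity type, entity id) pairs. The set of recognized type names and both function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed set of entity type names that may prefix a query.
ENTITY_TYPES = {"app", "artifact", "program", "dataset", "stream", "view"}

Result = Tuple[str, str]  # (entity type, entity id)


def split_entity_filter(query: str) -> Tuple[Optional[str], str]:
    """Split off a leading entity type, e.g. 'dataset: tags: Tag1'."""
    head, sep, rest = query.partition(":")
    entity_type = head.strip().lower()
    if sep and entity_type in ENTITY_TYPES:
        return entity_type, rest.strip()
    return None, query.strip()  # no entity type: the query is unrestricted


def post_filter(results: List[Result], entity_type: Optional[str]) -> List[Result]:
    """Keep only results of the requested entity type, after the search has run."""
    if entity_type is None:
        return list(results)
    return [result for result in results if result[0] == entity_type]
```

With this approach the search itself runs on the remaining query text, and the entity type is applied only to the returned results. A metadata key that happens to be named like an entity type would be ambiguous under this parsing.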

In addition to above goals we also plan to do the following:

  1. Metadata Search Results:

    • CDAP-4274 - Metadata search should returns the metadata of matching entities (Open)
    • Also return some other relevant info. Please see details below.

    Search Result 

    Metadata search will return entities with the following details, depending on the type of the entity. The search results will be ordered by entity creation time, descending.

    Search details by entity type:

    • Application
      • Type
      • Name
      • Matched Metadata (Snippet) with all system metadata
      • App Description
      • Entity creation time
    • Program
      • Type
      • Name
      • Matched Metadata (Snippet) with all system metadata
      • App it belongs to
      • Entity creation time
    • Artifact
      • Type
      • Name
      • Matched Metadata (Snippet) with all system metadata
      • Entity creation time
    • Dataset
      • Type
      • Name
      • Matched Metadata (Snippet) with all system metadata
      • Entity creation time
    • Stream
      • Name
      • Type
      • Matched Metadata (Snippet) with all system metadata
      • Entity creation time
    • View
      • Name
      • Type
      • Matched Metadata (Snippet) with all system metadata
      • Stream Name
      • Entity creation time

    Design Decision: 

      • In the search result for an entity we will return the matched metadata together with all the system metadata for that entity.

    Open Question: 

      • Please suggest other details that we could add to the search results for the different entity types.
  2. Emit more metadata from system entities:

Here is a list of the System Metadata which we are planning to emit from different entities. If you have any suggestions about what other info could be useful as system metadata, please comment below.

Artifacts

    • Artifact name
    • Version

Applications

    • Application name
    • ArtifactId
    • Plugins
      • Plugin Type
      • Plugin Name
    • Schedule
    • Programs

Programs

    • Program name
    • Type: Flow, MapReduce etc
      • Workflow
      • Nodes under this workflow
    • Mode: Batch, Realtime

Datasets

    • Dataset name
    • Schema
    • RecordScannable/BatchWritable/RecordWritable/BatchReadable
    • Type: KVTable, FileSet etc
    • ttl

Streams

    • Stream name
    • Schema
    • ttl

Views

    • View name
    • Schema

Open Questions:

    • Please suggest other things which we can add to different system metadata entries
    • Nitin Motgi: Can we call "business metadata" "user metadata", and also name the table which stores it userMetadata rather than business, to keep it consistent with other things like metrics etc.?

 

Additional Requirement and Notes:

  1. Treat a query that consists of just * as invalid
  2. Support pagination of search results in the backend
  3. Use entity creation time for ordering of search results
  4. Support searches with stemming (workflow/workflows): Porter Stemming
  5. Support and (&) operation: Example search query - app:appname & program
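The first three requirements can be sketched together in Python. This is an assumption-laden illustration rather than the backend design: `SearchResult`, `validate_query`, `page_of_results`, and the `offset`/`limit` parameters are hypothetical names, and creation time is assumed to be a numeric timestamp.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SearchResult:
    entity_id: str
    creation_time: int  # assumed: numeric timestamp, larger means newer


def validate_query(query: str) -> str:
    """Reject a query that consists of only the wildcard."""
    cleaned = query.strip()
    if cleaned == "*":
        raise ValueError("a query of just '*' is not allowed")
    return cleaned


def page_of_results(results: List[SearchResult], offset: int, limit: int) -> List[SearchResult]:
    """Order results by creation time, newest first, and return one page."""
    if offset < 0 or limit < 0:
        raise ValueError("offset and limit must be non-negative")
    ordered = sorted(results, key=lambda result: result.creation_time, reverse=True)
    return ordered[offset:offset + limit]
```

Sorting in memory before slicing is the simplest option for a sketch; a real backend would more likely push ordering and paging down to the storage scan.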

 


24 Comments

  1. Rohit Sinha: From a product perspective, Business Metadata is what makes sense. Business Metadata is always user-generated metadata. The preference from a product perspective would be Business, but how the implementation names it doesn't matter so much, right?

    1. Yes. So for implementation we will go ahead and call it user metadata then. 

      Poorna Chandra: FYI. 

  2. Program, Dataset, Stream and View should also return the application they are used in, along with the description and creation date if they have them.

    1. For programs we are going to give the application they belong to, but the usage of datasets and the other entities is dynamic, and in our meeting today we decided that we will not show usage.

  3. Questions:

    i) How do we return the app the Program belongs to? Are we storing that as one of the keys of the Program? If so, will it also contain another key for namespace?

    ii) Also, is Namespace not an 'entity type' ? I don't see that in the table. 

    1. i) Yes, we store it as one of the program's properties. Currently, we do not plan to add a key for namespace.

      ii) Currently, the only entities that can have metadata are artifacts, apps, programs, datasets, streams and views. A namespace is not thought of as an entity to which you can add metadata yet. Haven't come across such a use case. 

      A namespace is thought of more like a container for these entities. The only reference to namespace in a metadata context is that search results are restricted to a namespace. 

      1. I think it makes a lot of sense to have metadata for namespaces. Since namespaces provide a "sandbox", I might want to find all namespaces that are tagged with a particular project name, for example, or just all namespaces that belong to a specific team. That can be done with metadata search.

  4. What if an entity has multiple schema (ex: transform which has input and output schema)

    Since datasets and streams are the only entities that have a schema, is this a valid scenario? If a transform has an input and an output schema, the entity that will be searched for will either be the input dataset or the output dataset, right? The transform itself cannot be an entity that has a schema?

    1. Or can transforms also have a schema? I may have missed that discussion.

      1. Transforms might not be the right example, but it's possible to have entities with multiple schemas.

        For now we will not differentiate between the schemas and search cannot be scoped to one type of schema. 

        1. Wouldn't it be scoped by the fieldname? If I have, say, two fields inputSchema and outputSchema, I should be able to search outputSchema:customerId

  5. part of value

    Does that mean the same as "individual word" below?

  6. only for java primitive types

    What does that mean? Integer is primitive. List<Integer> is not? And what would be an example of a non-java type? A custom struct?

  7. dataset: Codename: Alpha

    this is ambiguous for searches that do not specify the field name. 

    "dataset: employee" can mean either:

    • entities of type dataset that have the word employee in any metadata field
    • entities of any type that have a field named dataset that contains the word employee

     

  8. Separate every search query on white space and search for every single word (or)

    why or? doesn't that impact the precision of your query? If I search for "sensor data", then I only want to get results that have both sensor and data in them 

    1. Makes sense.

      I think we would like to do this only if the main search query does not return any results. If the "sensor data" search result is empty, then we will search with "sensor" and "data" separately and return that result.

      Would this be a good approach?

      1. yes that makes sense

  9. 6 tables on total (3 for system and business each)

    Why do business and system metadata go into separate tables? Can you use the same table and use a different prefix?

  10. Entity creation time

    Is that also searchable? If so, it only makes sense as a search over a date range.

  11. The search results will be order descending on basis of entity creation time.

    But above, in the example "California USA" you gave a different ordering

  12. Porter Stemming

    Is that a committed feature? Do you have Porter stemming rules for languages other than English?

    1. No, it's not a committed feature. It's just a note from the meeting and a requirement that we want to support stemming. I think with a proper search it will come out of the box.

  13. Support and (&) operation: Example search query - app:appname & program

     

    Other search engines typically use + syntax. 

    +user +stories means both must occur (and)

    user stories means at least one must occur

    user +stories means at least one must occur but the results that contain both come first