
Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction

The CDAP 4.0 UI is designed to provide operational insights into both CDAP services and the service providers that CDAP depends on, such as YARN, HBase, and HDFS. The CDAP platform will need to expose additional APIs to surface this information.

Goals

The operational APIs should surface the information needed for the Management Screen.

These designs translate into the following requirements:

  • CDAP Uptime
    • P1: Should indicate how long the CDAP Master process has been running (granularity TBD: hours or days).
    • P2: In an HA environment, it would be nice to indicate the time of the last master failover.
  • CDAP System Services
    • P1: Should indicate the current number of instances.
    • P1: Should have a way to scale services.
    • P1: Should show service logs
    • P2: Node name where container started
    • P2: Container name
    • P2: master.services YARN application name
  • Middle Drawer:
    • CDAP:
      • P1* (Stretch goal - only possible if there's a straightforward approach): # of masters, routers, kafka-servers, auth-servers 
      • P1: Router requests - # 200s, 404s, 500s
      • P1: # namespaces, artifacts, apps, programs, datasets, streams, views
      • P1: Transaction snapshot summary (invalid, in-progress, committing, committed)
      • P1: Logs/Metrics service lags
      • P2: Last GC pause time
    • HDFS:
      • P1: Space metrics: total, free, used
      • P1: Nodes: total, healthy, decommissioned, decommissionInProgress
      • P1: Blocks: missing, corrupt, under-replicated
    • YARN:
      • P1: Nodes: total, new, running, unhealthy, decommissioned, lost, rebooted
      • P1: Apps: total, submitted, accepted, running, failed, killed, new,  new_saving
      • P1: Memory: total, used, free
      • P1: Virtual Cores: total, used, free
      • P1: Queues: total, stopped, running, max_capacity, current_capacity
    • HBase
      • P1: Nodes: total_regionservers, live_regionservers, dead_regionservers, masters
      • P1: No. of namespaces, tables
      • P2: Last major compaction (time + info)
    • Zookeeper: Most of these are from the output of echo mntr | nc localhost 2181
      • P2: Num of alive connections
      • P2: Num of znodes
      • P2: Num of watches
      • P2: Num of ephemeral nodes
      • P2: Data size
      • P2: Open file descriptor count
      • P2: Max file descriptor count
    • Kafka
    • Sentry
      • P2: # of roles
      • P2: # of privileges
      • P2: memory: total, used, available
      • P2: requests per second
      • any more?
    • KMS
      • TBD: Having a hard time hitting the JMX endpoint for KMS
  • Component Overview
    • P1: YARN, HDFS, HBase
    • P1: For each component: version, url, logs_url
    • P2: Zookeeper, Kafka, Hive
    • P2: Sentry, KMS
    • P2: Distribution info
    • P2: Plus button - to store custom components and version, url, logs_url for each.

User Stories

  1. As a CDAP admin, I would like a single place to perform health checks and monitoring for CDAP system services as well as the service providers that CDAP depends upon.
  2. As a CDAP admin, I would like to have insights into the health of all CDAP system services, including master, log saver, explore container, metrics processor, metrics, streams, transaction server, and dataset executor.
  3. As a CDAP admin, I would like to know information about my CDAP setup, including the version of CDAP.
  4. As a CDAP admin, I would like to know the uptime of CDAP, including, optionally, the time since the last failover in an HA scenario.
  5. As a CDAP admin, I would like to know the versions and (optionally) links to the web UI and logs, if available, of the underlying infrastructure components.
  6. As a CDAP admin, I would like to have operational insights, including stats such as request rate, node status, available compute, and storage capacity, for the underlying infrastructure components that CDAP relies upon. These insights should help me understand the health of these components and aid root cause analysis if CDAP fails or performs poorly.

Design

Data Sources

Versions

  • CDAP - co.cask.cdap.common.utils.ProjectInfo
  • HBase - co.cask.cdap.data2.util.hbase.HBaseVersion
  • YARN - org.apache.hadoop.yarn.util.YarnVersionInfo
  • HDFS - org.apache.hadoop.util.VersionInfo
  • Zookeeper - No client API available. Will have to build a utility around echo stat | nc localhost 2181
  • Hive - org.apache.hive.common.util.HiveVersionInfo
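
Since Zookeeper offers no client API for the version, the utility mentioned above could send the four-letter stat command over a raw socket and parse the first line of the reply. A minimal sketch, assuming the standard "Zookeeper version: ..." reply format; the class and method names are illustrative:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical utility that determines the ZooKeeper version by sending the
// four-letter "stat" command and parsing the first line of the response,
// which looks like "Zookeeper version: 3.4.6-1569965, built on ...".
class ZooKeeperVersionUtil {

  static String fetchVersion(String host, int port) throws IOException {
    try (Socket socket = new Socket(host, port)) {
      OutputStream out = socket.getOutputStream();
      out.write("stat".getBytes(StandardCharsets.US_ASCII));
      out.flush();
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
      return parseVersion(reader.readLine());
    }
  }

  // Extracts "3.4.6" from "Zookeeper version: 3.4.6-1569965, built on ...";
  // returns null if the line does not match the expected format.
  static String parseVersion(String statLine) {
    String prefix = "Zookeeper version: ";
    if (statLine == null || !statLine.startsWith(prefix)) {
      return null;
    }
    String rest = statLine.substring(prefix.length());
    int dash = rest.indexOf('-');
    int comma = rest.indexOf(',');
    int end = dash >= 0 ? dash : (comma >= 0 ? comma : rest.length());
    return rest.substring(0, end);
  }
}
```

The parsing is kept separate from the socket I/O so it can be exercised without a live Zookeeper.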

URL

  • CDAP - $(dashboard.bind.address) + $(dashboard.bind.port)
  • YARN - $(yarn.resourcemanager.webapp.address)
  • HDFS -  $(dfs.namenode.http-address)
  • HBase - hbaseAdmin.getClusterStatus().getMaster().toString()

HDFS

DistributedFileSystem - For HDFS stats

YARN

YarnClient - for YARN stats and info
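
The per-state node counts required for the YARN stats (total, new, running, unhealthy, decommissioned, lost, rebooted) reduce to grouping and counting node states. A minimal stdlib sketch of that grouping, with plain strings standing in for the values a YarnClient node report would return; class and method names are illustrative:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the grouping behind the "nodes" section of the YARN stats
// response. In the real extension the states would come from YarnClient
// node reports; plain strings are used here for illustration.
class YarnNodeStats {

  private static final List<String> STATES = Arrays.asList(
      "new", "running", "unhealthy", "decommissioned", "lost", "rebooted");

  static Map<String, Integer> countByState(List<String> nodeStates) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("total", nodeStates.size());
    // Pre-populate all known states with zero so the response shape is stable.
    for (String state : STATES) {
      counts.put(state, 0);
    }
    for (String state : nodeStates) {
      counts.merge(state.toLowerCase(), 1, Integer::sum);
    }
    return counts;
  }
}
```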

HBase

HBaseAdmin - for HBase stats and info

Kafka

JMX

Reference: https://github.com/linkedin/kafka-monitor

Zookeeper

Option 1: Four letter commands - mntr. Drawbacks: mntr was introduced in Zookeeper 3.4.0; users may be running older versions of Zookeeper.

Option 2: Zookeeper also exposes JMX - https://zookeeper.apache.org/doc/trunk/zookeeperJMX.html

HiveServer2

TBD

Sentry

JMX

The following is available by enabling the sentry web service (ref: http://www.cloudera.com/documentation/enterprise/latest/topics/sg_sentry_metrics.html) and querying for metrics (API: http://[sentry-service-host]:51000/metrics?pretty=true).
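
Since the metrics endpoint above serves JSON over plain HTTP, fetching it needs only the JDK. A minimal sketch; the class and method names are illustrative, and the timeouts are arbitrary:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative fetcher for an HTTP metrics endpoint such as
// http://[sentry-service-host]:51000/metrics?pretty=true
class MetricsHttpFetcher {

  static String fetch(String urlString) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
    conn.setConnectTimeout(5000);
    conn.setReadTimeout(5000);
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      StringBuilder response = new StringBuilder();
      String line;
      while ((line = reader.readLine()) != null) {
        response.append(line).append('\n');
      }
      return response.toString();
    } finally {
      conn.disconnect();
    }
  }
}
```

The returned string would then be parsed as JSON (e.g. with Gson, which CDAP already uses) to pick out the role_count, privilege_count, and memory gauges shown below.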

Sentry JMX output
{
  "version" : "3.0.0",
  "gauges" : {
    "buffers.direct.capacity" : {
      "value" : 57344
    },
    "buffers.direct.count" : {
      "value" : 5
    },
    "buffers.direct.used" : {
      "value" : 57344
    },
    "buffers.mapped.capacity" : {
      "value" : 0
    },
    "buffers.mapped.count" : {
      "value" : 0
    },
    "buffers.mapped.used" : {
      "value" : 0
    },
    "gc.PS-MarkSweep.count" : {
      "value" : 0
    },
    "gc.PS-MarkSweep.time" : {
      "value" : 0
    },
    "gc.PS-Scavenge.count" : {
      "value" : 2
    },
    "gc.PS-Scavenge.time" : {
      "value" : 26
    },
    "memory.heap.committed" : {
      "value" : 1029701632
    },
    "memory.heap.init" : {
      "value" : 1073741824
    },
    "memory.heap.max" : {
      "value" : 1029701632
    },
    "memory.heap.usage" : {
      "value" : 0.17999917863585554
    },
    "memory.heap.used" : {
      "value" : 185345448
    },
    "memory.non-heap.committed" : {
      "value" : 31391744
    },
    "memory.non-heap.init" : {
      "value" : 24576000
    },
    "memory.non-heap.max" : {
      "value" : 136314880
    },
    "memory.non-heap.usage" : {
      "value" : 0.2187954829289363
    },
    "memory.non-heap.used" : {
      "value" : 29825080
    },
    "memory.pools.Code-Cache.usage" : {
      "value" : 0.029324849446614582
    },
    "memory.pools.PS-Eden-Space.usage" : {
      "value" : 0.6523454156767787
    },
    "memory.pools.PS-Old-Gen.usage" : {
      "value" : 1.1440740671897877E-4
    },
    "memory.pools.PS-Perm-Gen.usage" : {
      "value" : 0.32970512204053926
    },
    "memory.pools.PS-Survivor-Space.usage" : {
      "value" : 0.22010480095358456
    },
    "memory.total.committed" : {
      "value" : 1061093376
    },
    "memory.total.init" : {
      "value" : 1098317824
    },
    "memory.total.max" : {
      "value" : 1166016512
    },
    "memory.total.used" : {
      "value" : 215170528
    },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.group_count" : {
      "value" : 3
    },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.privilege_count" : {
      "value" : 0
    },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.role_count" : {
      "value" : 132
    },
    "threads.blocked.count" : {
      "value" : 1
    },
    "threads.count" : {
      "value" : 38
    },
    "threads.daemon.count" : {
      "value" : 27
    },
    "threads.deadlocks" : {
      "value" : [ ]
    },
    "threads.new.count" : {
      "value" : 0
    },
    "threads.runnable.count" : {
      "value" : 6
    },
    "threads.terminated.count" : {
      "value" : 0
    },
    "threads.timed_waiting.count" : {
      "value" : 8
    },
    "threads.waiting.count" : {
      "value" : 23
    }
  },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.create-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-privilege" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-privilege" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-authorizable" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-for-provider" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-roles-by-group" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.rename-privilege" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-privilege" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}

KMS

KMS also exposes JMX via the endpoint http://host:16000/kms/jmx

Implementation

Operational Stats Extensions

The service provider stats fetchers will be implemented as extensions. Each such extension will be installed by CDAP in the master/ext/operations directory as jar files. Each jar file in this directory will be scanned for implementations of the OperationalStats interface defined below. 

/**
 * Interface for all operational stats emitted using the operational stats extension framework.
 *
 * To emit stats using this framework, create JMX {@link MXBean} interfaces, then have the implementations of those
 * interfaces implement this interface as well. At runtime, all implementations of this interface will be registered
 * with the {@link MBeanServer} with the <i>name</i> property determined by {@link #getServiceName()} and the
 * <i>type</i> property determined by {@link #getStatType()}.
 */
public interface OperationalStats {
  /**
   * Returns the service name for which this operational stat is emitted. Service names are case-insensitive, and will
   * be converted to lower case.
   */
  String getServiceName();

  /**
   * Returns the type of the stat. Stat types are case-insensitive, and will be converted to lower case.
   */
  String getStatType();

  /**
   * Collects the stats that are reported by this object.
   */
  void collect() throws IOException;
}
  • There will be one implementation of this interface for every service provider and stat type.
  • For example, HDFSStorage can be an implementation that provides stats of type "storage" for the service provider HDFS. 
  • A single jar file may contain multiple implementations of this interface. 
  • These classes will be loaded in a separate classloader, but there will not be classloader isolation, so extensions will have classes from the CDAP classloader available. 
  • CDAP will provide a core extension installed at master/ext/operations/core/cdap-operations-extensions-core.jar which will contain stats for some standard service providers. Additional services can be configured by implementing OperationalStats for the service, and placing the jar file under master/ext/operations/

Collecting and reporting stats

For collecting and reporting OperationalStats, the JMX API will be used. Hence, in addition to implementing the OperationalStats interface so it can be recognized as an operational extension, each implementation should also define and implement a Java MXBean interface.

After loading an operational stats extension, the MXBean it implements will be registered with the MBeanServer, with the name property set to the value returned by getServiceName() and the type property set to the value returned by getStatType(). These properties can then be used to create an ObjectName to retrieve the stats over JMX.
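
To make the contract concrete, here is a minimal sketch of a hypothetical extension and its registration, using only the JDK. The HDFSStorage MXBean, its sample values, and the JMX domain string are assumptions; the OperationalStats interface from above is repeated so the sketch is self-contained:

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// The OperationalStats interface from above, repeated so the sketch compiles.
interface OperationalStats {
  String getServiceName();
  String getStatType();
  void collect() throws IOException;
}

// Hypothetical MXBean interface; the attribute names are illustrative.
interface HDFSStorageMXBean {
  long getTotal();
  long getUsed();
}

// Implements both the MXBean and the OperationalStats contract.
class HDFSStorage implements HDFSStorageMXBean, OperationalStats {
  private volatile long total;
  private volatile long used;

  @Override public String getServiceName() { return "hdfs"; }
  @Override public String getStatType() { return "storage"; }

  // In a real extension this would query DistributedFileSystem;
  // sample values stand in here.
  @Override public void collect() throws IOException {
    total = 100L;
    used = 40L;
  }

  // Accessors simply return the value cached by the last collect().
  @Override public long getTotal() { return total; }
  @Override public long getUsed() { return used; }
}

class OperationalStatsRegistrar {
  // Registers the stats object with the platform MBeanServer and reads
  // one attribute back via JMX; the domain string is an assumption.
  static long registerAndReadTotal(HDFSStorage stats) throws Exception {
    stats.collect();
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    ObjectName name = new ObjectName("co.cask.cdap.operations:name="
        + stats.getServiceName() + ",type=" + stats.getStatType());
    server.registerMBean(stats, name);
    try {
      return (Long) server.getAttribute(name, "Total");
    } finally {
      server.unregisterMBean(name);
    }
  }
}
```

Because HDFSStorageMXBean follows the *MXBean naming convention, the platform MBeanServer accepts the object as a compliant MXBean even though it also implements OperationalStats.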

TODO: CDAP Master Uptime?

Caching

The collect method of every operational stats extension will be called at a configurable time interval, and is expected to refresh its stats. A call to an accessor method in the MXBean will simply return the current value cached inside the class.
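
The periodic refresh could be driven by a ScheduledExecutorService. A minimal sketch, where a plain Runnable stands in for the extension's collect() call and the interval would come from configuration; the class name is an assumption:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the configurable refresh loop; intervalSeconds would come
// from a CDAP configuration property in a real implementation.
class StatsRefresher {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // The collector wraps a call to the extension's collect() method;
  // it runs once immediately, then at the configured interval.
  void start(Runnable collector, long intervalSeconds) {
    scheduler.scheduleAtFixedRate(collector, 0, intervalSeconds, TimeUnit.SECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```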

API changes

New REST APIs

The following REST APIs will be exposed from app fabric.

Path: /v3/system/serviceproviders
Method: GET
Description: Lists all the available service providers and (optionally) minimal info about each (version, url and logs_url)
Response Codes:

200 - On success

500 - Any internal errors

List Service Providers Response
{
  "hdfs": {
    "version": "2.7.0",
    "url": "http://localhost:50070",
    "logs": "http://localhost:50070/logs/"
  },
  "yarn": {
    "version": "2.7.0",
    "url": "http://localhost:8088",
    "logs": "http://localhost:8088/logs/"
  },
  "hbase": {
    "version": "1.0.0",
    "url": "http://localhost:60010",
    "logs": "http://localhost:60010/logs/"
  },
  "hive": {
    "version": "1.2"
  },
  "zookeeper": {
    "version": "3.4.2"
  },
  "kafka": {
    "version": "2.10"
  }
}
Path: /v3/system/serviceproviders/{service-provider-name}/stats
Method: GET
Description: Returns stats for the specified service provider
Response Codes:

200 OK - Stats for the specified service provider were successfully fetched

503 Service Unavailable - Could not contact the service provider for status

404 Not Found - Service provider not found (not in the list returned by the list service providers API)

500 - Any other internal errors

CDAP Stats Response
{
    "services": {
      "masters": 2,
      "kafka-servers": 2,
      "routers": 1,
      "auth-servers": 1
    },
    "entities": {
      "namespaces": 10,
      "apps": 46,
      "artifacts": 23,
      "datasets": 68,
      "streams": 34,
      "programs": 78
    }
}
HDFS Stats Response
{
    "storage": {
      "total": 3452759234,
      "used": 34525543,
      "available": 3443555345
    },
    "nodes": {
      "total": 40,
      "healthy": 36,
      "decommissioned": 3,
      "decommissionInProgress": 1
    },
    "blocks": {
      "missing": 33,
      "corrupt": 3,
      "underreplicated": 5
    }
}
YARN Stats Response
{
    "nodes": {
      "total": 35,
      "new": 0,
      "running": 30,
      "unhealthy": 1,
      "decommissioned": 2,
      "lost": 1,
      "rebooted": 1
    },
    "apps": {
      "total": 30,
      "submitted": 2,
      "accepted": 4,
      "running": 20,
      "failed": 1,
      "killed": 3,
      "new": 0,
      "new_saving": 0
    },
    "memory": {
      "total": 8192,
      "used": 7168,
      "available": 1024
    },
    "virtualCores": {
      "total": 36,
      "used": 12,
      "available": 24
    },
    "queues": {
      "total": 10,
      "stopped": 2,
      "running": 8,
      "maxCapacity": 32,
      "currentCapacity": 21
    }
}
HBase Stats Response
{
    "nodes": {
      "totalRegionServers": 37,
      "liveRegionServers": 34,
      "deadRegionServers": 3,
      "masters": 3
    },
    "entities": {
      "tables": 56,
      "namespaces": 43
    }
}

TODO: Add responses for Kafka, Zookeeper, Sentry, KMS

CLI Impact or Changes

New CLI commands will have to be added to expose the two new APIs.

List Service Providers

list service providers

Get Service Provider Stats

get stats for service provider <service-provider>

UI Impact or Changes

The Management screen in the CDAP 4.0 UI will have to be implemented using the APIs exposed by this design, in addition to the existing APIs for getting system service status and logs.

Security Impact

Currently, CDAP does not enforce authorization for the system services APIs (CDAP-6917). The APIs in this design should enforce the same authorization policies as will be implemented for the system services APIs. Ideally, only users with ADMIN privileges on the CDAP instance should be able to execute these APIs successfully.

Impact on Infrastructure Outages

Test Scenarios

Test ID: T1
Description: Positive test for the list API
Expected Results: Should return all the configured service providers

Test ID: T2
Description: Positive test for stats of each service provider
Expected Results: Should return the appropriate details for each service provider

Test ID: T3
Description: Stop a configured service provider and hit the API to get its stats
Expected Results: Should return 503 with a proper error message

Test ID: T4
Description: Hit the API to get stats of a non-existent service provider
Expected Results: Should return 404 with a proper error message

Releases

Release 4.0.0

  • Ground work for collecting stats from infrastructure components.
  • Focus on HDFS, YARN, HBase

Release 4.1.0

  • More components such as Hive, Kafka, Zookeeper, Sentry, KMS  (in that order).

Related Work

Future work

  • TBD

 


35 Comments

  1. I don't see any mentioning of security related resources (authorization extension, sentry health, kms health etc)

    1. Correct, I hadn't added it because the design didn't have it. But perhaps we should keep the backend design more generic. I'll update the requirements.

  2. How to indicate error / problem in the service(s) in the REST call response? E.g app fabric is not able to contact hbase at the time the rest endpoint is hit, but all other services responded normally.

    1. Would it be better to make the APIs per service? Like /v3/system/serviceproviders/{service-provider}/[info, stats]? In that case, the APIs can return 503 if CDAP is unable to contact the selected service-provider?

      Edwin Elia Ajai Narayan how would that impact the UI?

      1. Per service is fine, but there is the question of getting this list of services. Is it going to be hard coded in the UI or do we want to add another endpoint to obtain the list of service providers?

        1. How about if the /info API in the current design changes to /v3/system/serviceproviders (no info), and returns all the services (along with versions, url and logs_url). The UI can then make a call to each element of the response of that API as /v3/system/serviceproviders/{service-provider}/stats.

          1. That sounds good

            1. Terence Yim is that better? I'll update the doc if so.

  3. The CDAP logs URL only has appfabric (master) logs there?

    1. The info API probably does not need to expose CDAP details at all, since the logs for all CDAP system services will be in the top row in the UX mock - with the status (green/yellow/red). We already have APIs for logs of all CDAP services, probably don't need to show them again in the info API. I'll update

  4. For the caching, it can't be just based on time. E.g when a program is launched in YARN, I will expect the YARN stats gets updated as well (even if I need to refresh the page)

    1. Have added some more details on invalidation for cache. Please review

  5. Auth Server, Kafka, and Master all have some form of registration in ZooKeeper. Does Router do the same? How will Router metrics like 200's, 404's, etc be collected? Will Router expose JMX, or will it use another transport?

    1. We have metrics reporter hooks for all REST API Handlers in CDAP. Currently, the request metrics will be based off of them - so the transport will be CDAP's metrics system (kafka etc)

  6. Just a note that the zookeeper mntr command was introduced in 3.4.0, not 3.5.0.  All major distros that CDAP supports have 3.4.0+

  7. Could you expand on how the system services will look? Right now it only shows the traffic light of status. How would it, for example, show "P1: Should indicate the current number of instances." or logs and so on?

    1. APIs already exist for showing the current instances, as well as scaling up CDAP system services. Same for logs of system services. There will be no updates to that in the backend. This functionality exists in the CDAP UI as well since ~3.0. New designs for it have not been finalized. They should be done soon.

      1. Ok, I was looking for the design and wondering how would that panel change to include numerical info and links to logs etc.

  8. How would you calculate "P1: Logs/Metrics service lags"?

    1. I need to find that out, but I think there is a way to know the current backlog in Kafka for each of these services. We need to somehow surface this backlog.

  9. Last GC pause time is for the master?

    1. Yes. Not sure it will make it in the first cut, but was suggested as one of the requirements.

  10. Typo in "Space metrics: yotal, free, used" and in the next line.

  11. For sentry in middle drawer, would we have any filter on the number of roles and privileges that the user can see or are they only for the logged in user?

    1. This is aggregated info, intended for use by cluster admins. There will be no filtering. There may be an auth filter for the APIs itself which would prevent users from visiting this page itself, but once you can access this page, there is no filter. 

  12. For the OperationalStatsFetcher interface, shall we also expose the configuration (like hbase conf, yarn conf, etc, can be just a Map<String, String>)?

  13. The OperationalStatsFetcher interface is not a fetcher (a fetcher is something that fetches stuff), but rather a provider.

    1. In the current design, the implementations of OperationalStatsFetcher (e.g. HDFSStatsFetcher) fetch the individual stats, which is why I called it fetcher. e.g.:

      HDFSStatsFetcher snippet
      @Override
      public StorageStats getStorageStats() throws IOException {
        FsStatus status = dfs.getStatus();
        long capacity = status.getCapacity();
        long used = status.getUsed();
        long remaining = status.getRemaining();
        long missingBlocks = dfs.getMissingBlocksCount();
        long corruptBlocks = dfs.getCorruptBlocksCount();
        long underReplicatedBlocks = dfs.getUnderReplicatedBlocksCount();
        return new StorageStats(capacity, used, remaining, missingBlocks, corruptBlocks, underReplicatedBlocks);
      }

      Can change it though.

      1. The contract should be from the CDAP platform perspective, in which an individual class "provides" stats. CDAP doesn't care how it gets them (whether it fetches, subscribes, or whatever).

        1. I see, sounds good. Will make the change.

  14. If we need to add a new type of stat, then we need to add a new method to the interface? Adding a new method would break compatibility, meaning all classes need to be recompiled. I would suggest making it an abstract class with all methods throwing UnsupportedOperationException. Also, shall we consider a more flexible approach, e.g. a method like <T extends OperationalStat> T getOperationalStat(Class<T> statType); ?

    1. I got the first part - we have an abstract class where every method throws an UnsupportedOperationException. Subclasses override only those methods that they support. We can keep adding methods to the abstract class as new requirements come up?

      The second approach will be as below, right?

      class HDFSStatsProvider implements OperationalStatsProvider {
        @Override
        public <T extends OperationalStat> T getOperationalStat(Class<T> statType) {
          if (NodeStats.class.equals(statType)) {
            return statType.cast(getNodeStats());
          }
          if (StorageStats.class.equals(statType)) {
            return statType.cast(getStorageStats());
          }
          if (...) {
          }
          throw new UnsupportedOperationException();
        }

        private HDFSNodeStats getNodeStats() {

        }

        private HDFSStorageStats getStorageStats() {

        }

        ...
      }

      If so, the problem I see with the second approach is that the users of this API (e.g. the REST handler for /v3/serviceproviders/[service-provider]/stats) will have to know all the possible values for statType. As opposed to this, in the first approach, it will have to simply invoke every single method on the abstract class OperationalStatsProvider. What do you think?

      1. Whether it's method / class, the caller (i.e. the REST handler you mentioned) needs to know the method / class in order to get the stat instance correctly. However, using Class would be slightly more flexible if we want to produce a map from stat type to stats.

        BTW, have you considered an alternate way similar to the JMX MBean convention?