Overview


Data related to Cask is scattered across different sources. The goal of this application is to collect and aggregate that data, provide unified access to it, and generate useful statistics.

Sources

  • Salesforce
  • Web Beacons
  • Social media analytics (YouTube, LinkedIn, Twitter)
  • Meltwater
  • Pro Ranking (SEO)
  • GitHub webhooks
  • AWS S3 access logs

Motivation

To generate and display aggregates and trends in one central location, with a front end that helps the marketing team track them.

Requirements

  • The system automatically fetches the latest data from the respective APIs and keeps the historical data

  • The system should notify the stakeholders in case of failures

  • The system should be extensible to add more sources

  • Retrieval is optimized and should not incur any additional cost: data that has already been retrieved should not be pulled again.

  • Data should be processed without any data loss.

  • The statistics should be aggregated at different time intervals:

    • Hourly

    • Daily

    • Weekly

    • Monthly

    • Every 3 months

    • Every 1 year

  • System should be able to process backlogged data and catch up in case of major outages.

  • System should have the ability to visualize metrics in the form of Dashboard Widgets: Line, Bar, Pie, Scatter, etc.

  • System should have the ability to configure notifications based on constraints specified for metrics:

    • External API call failure

    • High or low mark reached

    • Weekly or daily digest

  • The system is highly available and the reports are available 24x7x365

  • The system should render charts as well as provide raw data to feed into external applications like Tableau.
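The multi-interval aggregation requirement amounts to truncating each event timestamp to the start of its bucket and summing per bucket. A minimal Python sketch, where the resolution names and the (timestamp, value) event shape are illustrative:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bucket_start(ts: datetime, resolution: str) -> datetime:
    """Truncates a timestamp to the start of its aggregation bucket."""
    if resolution == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if resolution == "daily":
        return day
    if resolution == "weekly":
        return day - timedelta(days=day.weekday())  # back to Monday
    if resolution == "monthly":
        return day.replace(day=1)
    if resolution == "quarterly":  # every 3 months
        return day.replace(month=day.month - (day.month - 1) % 3, day=1)
    if resolution == "yearly":
        return day.replace(month=1, day=1)
    raise ValueError(f"unknown resolution: {resolution}")

def aggregate(events, resolution):
    """Sums (timestamp, value) events into per-bucket totals."""
    totals = defaultdict(int)
    for ts, value in events:
        totals[bucket_start(ts, resolution)] += value
    return dict(totals)
```

The same truncation rule serves all six granularities, so adding a resolution is a one-line change.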

Assumptions

  • All the sources have developer APIs that support retrieval of data

  • Information generated does not need different access for different roles

Infrastructure

  • 2 node backend cluster for availability and replication (trying to keep the replication factor low to save costs)
  • S3 bucket to regularly backup data
  • Lean single-node cluster for frontend (could be deployed on one of the backend nodes as well)

Design


  • Partitioning of TimePartitionedFileSets

    Each data source will be stored in its own TPFS instance.

  • Naming convention per source:

    • Source TPFS: "SourceNameTPFS"

    • Format: Parquet record with fields - ts, attributes

    • Cube name: "SourceNameCube"

  • Example

    • GitHub: "GithubTPFS"

      • Format: Parquet record with fields - ts, repo, stars, forks, watchers, pulls

      • Cube name: "GithubCube"
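The per-source record and partition convention above can be pictured with a small sketch. The field layout mirrors the "GithubTPFS" Parquet record; the path scheme is only illustrative of hourly time partitioning, since the real layout is managed by CDAP's TimePartitionedFileSet:

```python
from datetime import datetime

def github_record(ts: datetime, repo, stars, forks, watchers, pulls):
    """Shapes one GitHub sample into the Parquet field layout above."""
    return {"ts": ts.isoformat(), "repo": repo, "stars": stars,
            "forks": forks, "watchers": watchers, "pulls": pulls}

def partition_path(source: str, ts: datetime) -> str:
    """Illustrative hourly partition, keyed by the record timestamp."""
    return f"{source}TPFS/{ts:%Y-%m-%d/%H}"
```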

API

External APIs to be used

API                   | API Provider   | Metrics gathered
Force                 | Salesforce.com | Raw Leads, MQLs, Sales Opportunities
YouTube Reporting API | YouTube.com    | Views, Subscribers
LinkedIn API          | LinkedIn.com   | Followers
Twitter4j             | Open Source    | Followers
AWS API               | Amazon.com     | S3 product download logs
GitHub Webhooks       | GitHub         | GitHub statistics
Pro Ranking API       | Pro Ranking    | Website ranking

 

API Calls

  • Use a Workflow custom action to run periodic RESTful calls to the APIs
  • A Spark job can read the data from the filesystem and update the cube
  • To allow different scheduling for different calls, each call will have its own workflow
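The no-duplicate-pull requirement from above can be met by persisting a per-source watermark between workflow runs. A minimal sketch, where `fetch_since` stands in for the real API client and the watermark map would live in a dataset rather than in memory:

```python
class SourcePoller:
    """Tracks the last-seen timestamp per source so each record is pulled once."""

    def __init__(self, fetch_since):
        self.fetch_since = fetch_since  # callable: (source, since) -> [(ts, payload)]
        self.watermarks = {}            # persisted to a dataset in the real system

    def poll(self, source):
        since = self.watermarks.get(source, 0)
        records = self.fetch_since(source, since)
        if records:
            # Advance the watermark so the next run only asks for newer data.
            self.watermarks[source] = max(ts for ts, _ in records)
        return records
```

Because each call carries its own workflow, each source also keeps its own independent watermark.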

REST EndPoints

GET /pipeline/{time period}
  Returns the data related to marketing and sales leads; time period ∈ {week, month}
  Response:
  {
    start: 06-06-2016,
    end: 06-07-2016,
    rawleads: 180,
    mql: 60,
    inquiries: 100,
    opp: 20
  }

GET /awareness/webtraffic/{time period}
  Returns the traffic-related information for the website and blog
  Response:
  {
    start: 06-06-2016,
    end: 06-07-2016,
    sessions: 200,
    newVisitors: 68,
    returningVisitors: 80,
    blogViewers: 100
  }

GET /awareness/socialmedia/subscribers
  Returns the subscribers on various social media sites
  Response:
  {
    youtube: { views: 23, subscribers: 2900 },
    linkedin: 68,
    twitter: 80
  }

GET /awareness/seo
  Returns the share-of-voice numbers
  Response:
  {
    cask: 25,
    informatica: 25,
    talend: 25,
    snaplogic: 25
  }

GET /adoption/downloads
  Returns the number of downloads for CDAP
  Response:
  { downloads: [ { version: 3.5, dl: 2000 } ] }
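As a sketch of how the `/pipeline` payload could be assembled server-side; the stage names and the (lead_id, stage) record shape are assumptions for illustration, not Salesforce's actual schema:

```python
def pipeline_response(start: str, end: str, leads):
    """Builds the GET /pipeline/{time period} payload from (lead_id, stage) records."""
    stages = [stage for _, stage in leads]
    return {
        "start": start,
        "end": end,
        "rawleads": len(stages),          # every lead counts as a raw lead
        "mql": stages.count("mql"),
        "inquiries": stages.count("inquiry"),
        "opp": stages.count("opportunity"),
    }
```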

 


UI

  • UI could be deployed on a thin Coopr node
  • Probable stack for the UI is jQuery embedded in a Bootstrap dashboard
  • Chart.js and C3.js would be used to render charts
  • UI should allow refining all metrics to different time granularities (hourly, daily, weekly, monthly, every three months, every year)

  • Visualize metrics in the form of a dashboard (widgets - line, bar, pie, scatter, ...)

  • Dashboard and backend should support overlaying week-over-week, month-over-month, or year-over-year for any metric

  • Backend should allow raw querying of data through SQL commands
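The week-over-week / month-over-month overlay reduces to pairing each point of a series with the value one period earlier. A minimal sketch, assuming the series arrives as (time, value) pairs (in practice it would come from a Cube query):

```python
def overlay(series, period):
    """Pairs each (t, value) point with the value one period earlier, or None."""
    by_time = dict(series)
    return [(t, v, by_time.get(t - period)) for t, v in series]
```

The dashboard can then draw the current and shifted values as two lines on the same axes.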


Reports

  • Email and text notifications can be sent using SendGrid or the Amazon SNS service

  • Users can subscribe or unsubscribe using a front end backed by APIs
  • Generate daily and weekly digest reports and email them to stakeholders

  • Export data to PDF/Excel, available for download in the UI
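A digest could be assembled by rendering each subscribed metric's latest aggregate into one message. A sketch; the metric names are placeholders, and the actual send step would go through SendGrid or SNS as noted above:

```python
def build_digest(period: str, metrics: dict) -> str:
    """Formats a plain-text digest from {metric: value} aggregates."""
    lines = [f"Caskalytics {period} digest"]
    lines += [f"  {name}: {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines)
```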


Alerts

  • Allow users to specify threshold values for metrics that will trigger alerts by email

    • High-mark / low-mark reached alerts to users via email and SMS (tentative)

    • API call failure alerts to admins and developers
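The high-mark / low-mark check reduces to comparing the latest aggregate against the per-metric bounds a user configured. A minimal sketch, where `notify` is a placeholder for the email/SMS sender:

```python
def check_thresholds(value, low, high, notify):
    """Fires an alert when a metric crosses its configured bounds."""
    if value > high:
        notify(f"high-mark reached: {value} > {high}")
    elif value < low:
        notify(f"low-mark reached: {value} < {low}")
```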


4 Comments

  1. Some other additional data sources to be considered:

    • Maven
    • Docker hub
    • Marketo 
    • NPM Stats

  2. For the UI, let's define that a little more. Where is the UI going to live? What is the tech stack being used? 

    Will Caskalytics include endpoints for external services to push data to? The two cases I can think of are GitHub Webhooks and a Web Beacon (because I designed them (smile)). In the past, we spoke about running a basic webserver somewhere that keeps logs, then having a shipper on each of those nodes to send those logs into the cluster. Is that what we want to do, or is there a better way?

    Where should this final project live? Some Coopr cluster somewhere? How big should the cluster be?

    In general, we (and our users) use a lot more Parquet than Avro; it might be a better choice for storage.

    What specific APIs do you imagine we will need for this? Might help shape the design.

    Will the cluster data be backed up anywhere (S3)?

    What are you planning to use for sending emails? How will users subscribe or unsubscribe?

     

  3. We might not want to provide API endpoints for subscription, as we might need access control to the information

  4. Abhinav Bansal Could you take a crack at breaking this down in our JIRA space? (CDAP-6874)

    The idea would be to create a hierarchy of issues ending up with tasks that should take no more than a day to complete. Feel free to work with Todd Greenstein if you need some guidance.