Integration Guide

Introduction

This document provides the guidance to start a successful integration with Kubit’s Self-Service Analytics. It covers the contexts and details steps required to share your data with Kubit and also how to get your metrics configured properly. It also includes the best practices in data model design and implementation as a reference.

With secure cloud data warehouse sharing, you can share your data with Kubit in real-time without the hassle of copying them, or introducing complicated batch jobs. While having full control over your data and maintaining a Single Source of Truth, all the computation cost involved with Self-Service Analytics will be covered by Kubit. This is the essence of No-Code Analytics offering from Kubit.

Requirements

You already have all your analytics related data stored in a Cloud Data Warehouse like Snowflake, BigQuery, RedShift, Azure or Vertica.

Your data engineers will configure secure data share in your cloud data warehouse to grant Kubit read-only access to your data.
If live analytics is in the PoC scope, please make sure that your data is up to date all the time.

You have detailed documentation about your data model and event definitions (eg a Data Dictionary).

If full documentation is not yet available, you will need to invest more time to pass along the knowledge and answering questions.
This process often requires various members from your team to be involved in. Don’t worry, Kubit has a built-in Data Dictionary feature to help you to collaboratively build this knowledge base dynamically on the fly.

There are some basic understanding and agreement of the metrics (KPIs) and dimensions that your team wants.

The PoC project will cover a limited number of metrics and dimensions. Though all the tools and features are enabled for you to build your own. That’s the essence of Self-Service Analytics.

Keep a constant communication on any changes on data model, data dictionary and metrics definition with Kubit team. Some changes are trivial to incorporate if planned ahead of time. Some may require invalidation and updates to existing analyses.
Provide feedback and share your thoughts with Kubit.

Process

Inventory and get your data ready in your data warehouse.
Discuss the scope of the PoC with Kubit: KPIs, metrics, dimensions and which data is needed.
Share access to your data with Kubit through secure data share.
Kubit defines the KPIs and dimensions.
Provide credentials to Kubit to provision the users to access the PoC.
Play with Self-Service Analytics and provide feedback.

About Your Data

Your data should already be stored in data warehouse tables for analytics purposes. They should also be updated frequently to stay live. Here are the key considerations for the PoC and some best practices.

Data Model

Data model is sometimes presented as an ER (Entity Relationship) diagram. It captures the relationship between all the entities in your database.

It is never a good idea to just dump your transactional data (i.e. database tables used for serving your users) into the data warehouse for analytics purposes. A star schema is the most recommended pattern for storing analytics data.

Fact tables: Events, Transactions (eg Purchases, Subscriptions)
Dimension tables: Users, Campaigns, Attributions

Below is a design sample of the proposed data model. If your current data model is different, don’t worry, there are many creative ways to address that by collaborating with your data engineers.

The fact tables store information about when something happens (events), some inline properties (eg timestamp, country, app_id, app_version), and some dimension keys which can be used to lookup for the dimension values in dimension tables (eg user_id to lookup user’s email in users table, campaign_id to lookup for a campaign’s start and end dates).

In most scenarios, it is recommended to use inline properties on the events as much as possible, i.e. capture most dimensions’ values as properties and store them on every event. It simplifies the query and also improves analytics performance dramatically. Modern analytical data warehouses like Snowflake have built-in optimizations to store duplicated column values very efficiently. If you are interested in this topic, read more about Columnar Database.

How to implement ETL (Extract, Transform and Load) to populate the star schema from your transactional data is out of the scope for this guide. Though feel free to contact Kubit support to get additional information or advice on this topic.

User Identification

Every user, regardless of registered or anonymous (guest), requires a unique identifier for the analytics to work. Counting unique users and following a user’s behavior (action events) are the most common analytical measures.

Ideally there should just be one unified id field (User ID) to identify any user. But in reality, it could be much more complicated because the relationship between a person, different apps, their devices and credentials can be super complex. Especially with considerations of privacy, fraud, hacking, device and credential sharing. It is a business decision for your organization to choose the best strategy.

Conceptually, there are several IDs:

Device ID: IDFV or Android ID (deprecated). Used to identify a physical device though this concept is going away because of privacy concerns. In most places, people use Advertiser ID for Device ID.
Advertiser ID: IDFA or AAID (Android Advertising ID). Usually used to match with attribution or advertising data from third-party vendors. Keep in mind these IDs may be empty or have a special value when ad tracking is disabled.
- Since iOS 14, ad tracking is now disabled by default.
User ID: always generate a User ID as soon as the app is launched so every guest/anonymous user can be identified too.

Account ID: some developers introduce this to identify a registered user across different apps and/or devices. In most cases it brings more pain and confusions. It is recommended to only use it in special cases like analyzing how users use multiple apps on different platforms (iOS, Android and Web).

Credential (username, email): due to privacy concerns, it is a bad idea to store user credentials in any analytics data.

Generally speaking, the pair of User ID and Advertiser ID can satisfy most of the product analytics use cases.

Events

Events are generated when a certain action takes place. For example: Login, Sing, Listen, Page View, Button Click. It is the most critical data structure for analytics.

Data Dictionary

It is strongly recommended to have a Data Dictionary (as simple as a Google Sheet) in place to capture the definition of every event:

Event name: a key (unique short name) and a business name (easy to read and understand).
Trigger condition: when this event is triggered.
Properties: the content and meaning of each property stored in this event.
Owner: who to talk to if a change is needed.
Version history: what has changed to this event over time. This information is often critical to communicate.

Instrumentation

Events need to be instrumented in the code on both mobile (in the app) and server (backend) sides because some events are triggered by user actions, some by backend processes (eg subscription renewal, push notifications sent).

If you are starting from scratch, there is no need to reinvent the wheel. It is recommended to use a CDP (Customer Data Platform) like Segment, mParticle, RudderStack or Snowplow to give you a mature SDK and the maximum flexibility to control where the data goes.

Technology aside, the key of a successful instrument project is communication. Keep in mind that the engineers who are injecting the events into code flow don’t usually know how these events are used in analytics. Providing this context information is often critical to avoid bugs and surprises down the line.

Properties

There are two kinds of properties for the events: Common Properties and Event Properties. A property can be categorical (enumeration of values) or numerical (unbounded continuous values).

Common Properties

These are the properties that every event should have. Usually they are used in filters or breakdowns as dimension attributes. Here are some examples:

Timestamp: when an event happens. Sometimes it is necessary to have both Client Timestamp (client clock when action is taken) and Server Timestamp (server clock when the event is processed) since the mobile device clock may not be trusted.

Date Week, Month, Quarter, Year: It is strongly recommended to also store these parts of the timestamp separately since analytics is often based on these time buckets.
Fiscal Quarter, Fiscal Year: only if your organization has the fiscal concept and wants to use them in analytics.

Device ID: often this is Advertiser ID
User ID: the user identifier
App: the id or name of the app
App Version: the version of the app
Country: country where the app is installed (usually from the device)
Locale: device locale (language + country)
Gender: user’s gender information
Birth year: used to compute user age at query time
Device model: the model of the device, eg iPhone 11 Pro, Pixel 4
OS Version: version of the OS on the device, eg iOS 13.6.7
A/B Test: test id and group assignment of an A/B test
Install Date: when the app was installed, used to compute Install Days and group users by how long they have installed the app. It is also critical for Dn/Wn metrics.

Event Properties

Each event can have its own set of properties depending on the context. It is recommended to store them in a set of named generic properties instead of introducing ad-hoc properties on the fly. This way the database table schema can be kept clean and efficient. These generic properties can be explained and managed using Data Dictionary. Here are some examples:

Context: capture the context information of the event. For example, it can be the flow where the user is in (eg Registration, Purchase), or the feature section (eg Songbook, Checkout).
Target: the target of the event. For example: the song id for Sing event; the social channel (eg Facebook, Instagram) for Share event; the SKU for a subscription event.
Value: the value associated with the event. For example, the price for a Subscription or Purchase event.
K1, K2, …Kn: more named properties slotted for different events.

Timezone

Any timezone conversion during query time will kill the performance. For simplicity and efficiency, all the data related to time should be stored as the Timestamp or Date data type, which indicates a certain point in time and has no timezone concept.

The complex issue of timezone arises when any date or time is presented in text format, as in every analytics report with dates (MM/DD/YYYY). It is recommended for the organization to decide on using one certain timezone across the board for all analytics purposes. Usually the selected timezone is where the headquarter is based since it is much easier to talk about local dates in business settings.

Share Your Data

By allowing Kubit to securely access your cloud data warehouse directly, no actual data is copied or transferred between accounts. It doesn’t require any engineering work to create ETL or batch jobs, neither does it take up any extra storage. With this approach, Kubit can deliver instant product insight from the real-time live data based on a Single Source of Truth.

All the access goes through secure, encrypted communication channels. In the meantime, your engineering team maintains full control of the data share by limiting the permissions and visibility.

Kubit IP Ranges

54.189.107.64/32
54.244.229.193/32
52.33.71.162/32
44.228.101.176/32
52.12.153.59/32
54.218.229.214/32
44.232.171.116/32
35.174.57.108/32

Snowflake

Snowflake Secure Data Share allows your data engineers to share a database, schema, table or view to Kubit with read-only permission and real-time live data. A view can be used to join multiple tables across different databases and add constraints to expose which columns. You are the Provider while Kubit is the Consumer of the share. It just takes four simple steps in SQL:

Create the view: or select which databases, tables and/or schema
Create a share: give it a name
Grant permissions: grant usage permissions
Share it with Kubit's Snowflake account depending on your Snowflake region (please confirm with Kubit support first)
- US West Coast: nk15162
- US East Coast: kubit
- EU: LP16358

Here is an example script (sharing to nk15162 on US West Coast):

Snowflake also offers a web interface to review and manage data shares. For details on this topic, please review Snowflake’s documentation here.

It is strongly recommended to use the proper Clustering Keys in your database tables to improve query performance. Typically at least the event date column and event name should be part of the clustering key since they are part of the filters in most queries.

Benefits

With Secure Data Share, no actual data is copied or transferred between accounts. All sharing is accomplished through Snowflake’s unique services layer and metadata store. This is an important concept because it means that shared data does not take up any extra storage for either the Provider or Consumer. Since the same underlying data is being shared, the data is guaranteed to be live and without delay.

More importantly, the computing environment is completely separated. The Consumer (Kubit) must create and use their own Virtual Data Warehouse and pay for it themselves. There is absolutely no impact on the Provider’s cost, performance or SLA for their Snowflake services.

Additionally, using Snowflake’s built-in roles and permissions capabilities, access to data share can be controlled and governed using the access controls already in place for your Snowflake account and the data therein. Access can be restricted and monitored the same exact way as your own data.

BigQuery

Google’s BigQuery allows you to share a dataset to Kubit securely through Service Account. Because BigQuery separates computing from storage, Kubit will cover the cost of querying and won’t impact your SLA.

Share a Dataset

Kubit will provide you a Kubit Service Account Email, typically in the format of <YOUR_ORG>@kubit-oauth.iam.gserviceaccount.com
- You can also create a Service Account and share the JSON key file with Kubit. Please make sure to add ops@kubit.co to "Service account users role" in step 3 "Grant user access to this service account (Optional)".
In the BigQuery console, select a dataset from Resources, then click SHARE DATASET near the right side of the window.

In the Share dataset panel, in the Dataset permissions tab, enter the previously created Kubit Service Account Email to the Add members field. Then select the role BigQuery Data Viewer, click Add, then click DONE.

If you need to share multiple datasets from different GCP projects, you can just use the same Kubit Service Account Email when sharing every dataset.

Create Service Account

Enable BigQuery API

If you haven’t done so yet, please go to Google Cloud API & Services - Dashboard. Click “+ ENABLE APIS AND SERVICES”

Search for “BigQuery API”

Select it and click ENABLE

Create Service Account

Open the credentials page in Google Cloud Platform API Manager (if needed, select your target project from the dropdown list on the top).

Click “CREATE CREDENTIALS” and choose “Service account”

Enter “Kubit” as the name for the service account. Then click CREATE.
- Note: "Service account email" shown on the screen (eg kubit-###@<GCP-PROJECT_NAME>.iam.gserviceaccount.com) is "Kubit Service Account Email"

Enter ops@kubit.co as “Service account users role”. Click DONE.

Select the newly created Service account. Click Keys - ADD KEY - Create new key.

Select JSON in the popup window. Click CREATE

A JSON private key file will be downloaded to your computer. Please send this file to Kubit securely. Click CLOSE.

Redshift

Kubit can access your data in Amazon Redshift cluster directly with restricted access, or through S3 storage sharing.

Configure your Redshift to be publicly accessible (with access control)
The associated Security Group should whitelist port 5439 to the Kubit IP ranges.

Grant Access

Create a new user in sharing database. Please, consider password requirements while you’re generating the Kubit use password:

CREATE USER kubit WITH PASSWORD '<password>';

Grant the Kubit user permissions to access all of your data:

GRANT SELECT on ALL TABLES IN SCHEMA public TO kubit;

Or you can limit the access to specific tables or views:

GRANT USAGE ON SCHEMA public TO kubit;
GRANT SELECT ON TABLE public.table_1 TO kubit;
GRANT SELECT ON TABLE public.table_2 TO kubit;

Presto

Kubit can access your Presto cluster with restricted access, or through S3 storage sharing.

Configure your Presto cluster to be publicly accessible (with access control).
The firewall associated with the Presto cluster should whitelist port 8080(or other custom port) to Kubit IP ranges

Grant Access

Create a principal, named kubit in your Presto's authentication provider (e.g. LDAP, Kerberos, etc.).
Add the following rule to your System Access Control file to grant read-only access to certain catalogs.
- <data_catalog> is the name of the catalog you would like to share with Kubit. You can share multiple catalogs by listing them all, delimited by |. For example: (data_1|data_2|data_3).

S3 Storage Sharing

If your raw data is stored in S3 buckets, instead of open direct access to your data warehouse, you can choose to share the S3 storage to Kubit.

Kubit will provide the ID of a role used for accessing bucket data directly (without the need to assume your roles). All you need to do is to add the following statements to the S3 bucket policy (sample). For more details, please consult Kubit support team.

Directory Structure

For parquet files, Snappy compression is recommended
A Hive directory structure (directories organized by data indexes and partitions) is recommended to get best performance. There are numerous libraries in each programming language to output this structure.
- Directory pattern: s3://<YOUR_BUCKET>/.../<TABLE_NAME>/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/
- Usually we suggest the following partition columns (in order): event_date (or event_timestamp), event_name. In such case, the directory would look like this: s3://<YOUR_BUCKET>/.../<TABLE_NAME>/event_date=<DATE_VALUE>/event_name=<EVENT_NAME_VALUE>/1.parquet

Define KPIs

This section provides guidance on how to convey your organization’s definition on metrics (KPIs) and dimensions for the PoC.

Metrics

Please provide a list of top metrics for your product analytics needs. Most of these metrics can be presented in simple math formulas with measure functions, along with some filter conditions. For example:

Engagement: DAU, MAU - unique users in a day or month
Retention: D2, W2 Retention - % of users who retained (came back) in day 2 or week 2
Activity: Likes/DAU, Shares/DAU, Listens/DAU - count an action event divided by DAU.

It is always recommended to use DAU or MAU as denominator to measure rate instead of absolute volume

Monetization: ARPU, Attach Rate - avg(revenue)/DAU, % of users who subscribed
Funnel: Registration, Purchase - unique users taking action at each steps of the funnel during a certain time period
Performance: Download Failure Rate - count(failure) / total(start, fail, canel, end)

Dimensions

Please provide the definitions of all the dimensions needed in the analytics. Most of these dimensions can come from the event's properties. Some dimensions require joining with dimension tables during query time. Here are the information needed:

Name: display name of the dimension.
For inline dimensions:

Name of the event property

For join dimensions:

Dimension table name
Dimension value column: the column on the dimension table to be used for dimension values
Join key: event property and dimension table column for join
Join condition: inner (only matching rows returns) or outer (for rows without match, use NULL)
Special join conditions: first touch, last touch, time window

Cohorts

Cohort is a group of users matching certain criteria. You can define these cohorts to be used for breakdowns in any Formula, Prediction and Braze Integration using very similar definitions like a query with measures, filters and conditions. For example:

New Users: count(first Launch) = 1 in last 7 days
Frequent Singers: count(Sing) > 5 in last 1 day
Whales: sum(Revenue on Purchases) > 100 in last 30 days

Security

For a SaaS solution, security is always on top of the priority list at Kubit.

Data Security

Through Secure Data Sharing, Kubit doesn’t store or change your data. You have full control over what data is shared and the permissions.
You can avoid exposing the tables or columns which are related to user privacy (like email, username, phone number etc) to Kubit. Analytics mostly just work with aggregated data instead of every user’s private information.
All network communication goes through the HTTPS encrypted channel, including when Kubit talks to your data warehouse and the analytics web interface in the browser.
Kubit’s cloud infrastructure takes all the necessary steps to enforce data security and isolation, including VPC (Virtual Private Cloud) and encrypted network traffic between all components.
Strict access control and processes are enforced within Kubit organizations to limit and monitor employees who have access to customer data.

User Management

Kubit utilizes Auth0, the leading Enterprise Authentication and Authorization Platform to handle all user security and management.

If your organization is using enterprise user identity services like Google G-Suite, Active Directory or LDAP already, Single Sign On can be enabled with Kubit through Auth0. There are no extra credentials for users to remember and your IT department has full control over who has access to Kubit.

To start simple, you can also just provide several users’ email addresses to get them provisioned on Kubit directly. They can login to Kubit with email and password. Also MFA (Multi-Factor Authentication) can be enabled for these users too to add more security.

Support

Support is provided during business hours (Pacific Time) through in-app chat. Your team members can always email support@kubit.ai, going through dedicated Slack support channel or contact the Account Manager for direct support needs.

Documentation

Introduction

Requirements

Process

About Your Data

Data Model

User Identification

Events

Data Dictionary

Instrumentation

Properties

Common Properties

Event Properties

Timezone

Share Your Data

Kubit IP Ranges

Snowflake

Benefits

BigQuery

Share a Dataset

Create Service Account

Enable BigQuery API

Create Service Account

Redshift

Grant Access

Presto

Grant Access

S3 Storage Sharing

Directory Structure

Define KPIs

Metrics

Dimensions

Cohorts

Security

Data Security

User Management

Support