Feature Tables
Causal creates a table for each feature you define in your FDL file. The table has the same name as your feature. Each row corresponds to a set of impressions of the feature that have the same argument values. Causal does this to compress data and to make it easy for data scientists to train machine learning models. See memoization.
These tables contain:
- ds: (partition key) The date (UTC) of the session expiry in which this impression occurs.
- hh: (partition key) The hour (UTC) of the session expiry in which this impression occurs.
- session_id: Causal's internal session identifier. See Session Keys.
- impressions: A list of unique impression_ids and timestamps (UTC), one entry per impression. An impression_id can be an identifier your team passes in to enable linkage to tables in your warehouse.
- first_time: Timestamp (UTC) when we first saw this impression.
- impression_count: The number of times this impression was rendered. See memoization.
- Feature Arguments: These are the arguments you've specified for the feature in your FDL, which are passed in by the front end.
- Plugin values: These are values calculated by any plugins you have enabled for the feature.
- Output values: These are values returned to the front end by Causal.
- Events: These are arrays of the feature's events that happened downstream of that impression.
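To make the row layout concrete, here is a minimal Python sketch of how raw impressions that share a session and the same argument values might collapse into a single row. It is an illustration only, not Causal's implementation: the input record shape, the `args` field, and the `collapse_impressions` helper are all assumptions made for this example.

```python
def collapse_impressions(raw_impressions):
    """Group raw impressions by (session_id, argument values) into rows.

    Hypothetical input: dicts with session_id, impression_id, timestamp
    (an ISO 8601 UTC string) and args (a dict of hashable argument values).
    """
    rows = {}
    # Process in time order so the first impression seen sets first_time.
    for imp in sorted(raw_impressions, key=lambda i: i["timestamp"]):
        # Impressions with identical arguments in one session share a row.
        key = (imp["session_id"], tuple(sorted(imp["args"].items())))
        row = rows.setdefault(key, {
            "session_id": imp["session_id"],
            "args": dict(imp["args"]),
            "first_time": imp["timestamp"],
            "impressions": [],
            "impression_count": 0,
        })
        row["impressions"].append(
            {"impression_id": imp["impression_id"], "timestamp": imp["timestamp"]}
        )
        row["impression_count"] += 1
    return list(rows.values())


# Made-up sample data: two impressions of product p1, one of product p2.
raw = [
    {"session_id": "s1", "impression_id": "a",
     "timestamp": "2024-01-01T10:00:00Z", "args": {"product": "p1"}},
    {"session_id": "s1", "impression_id": "b",
     "timestamp": "2024-01-01T10:05:00Z", "args": {"product": "p1"}},
    {"session_id": "s1", "impression_id": "c",
     "timestamp": "2024-01-01T10:06:00Z", "args": {"product": "p2"}},
]
rows = collapse_impressions(raw)
```

In this sketch the three raw impressions become two rows: one for `p1` with an `impression_count` of 2 and one for `p2` with a count of 1.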
Feature Best Practices
Causal was designed to store a lot of data, so you shouldn't be stingy when defining your features. Properly defined features should make it unnecessary for data engineers to dive into web log data or do a zillion joins in order to recover information about what the user was shown or has done.
For example, let's say that you are working on making recommendations for your users. You show them a set of products, and want to record which products a user clicked on.
Most systems will deal with recording product clicks just fine. However, with Causal you can also record all the products that were shown. That way, your data contains the set of products that got a click and the set of products that didn't. The latter is almost as important as the former when trying to make a data-driven decision.
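As a sketch of why the shown-but-not-clicked set matters, the Python below computes a per-product click-through rate from rows shaped like a feature table. The column names `product_ids` (a feature argument) and `click_events` (an event array whose entries carry a `product_id`) are hypothetical, chosen for this example rather than taken from a real FDL file.

```python
from collections import Counter


def click_through_rates(rows):
    """Return clicks divided by times shown, for every product that was shown."""
    shown = Counter()
    clicked = Counter()
    for row in rows:
        # One row may stand for several impressions, so weight by the count.
        for product_id in row["product_ids"]:
            shown[product_id] += row["impression_count"]
        for event in row["click_events"]:
            clicked[event["product_id"]] += 1
    # Products never clicked still appear, with a rate of 0.0.
    return {p: clicked[p] / shown[p] for p in shown}


# Made-up sample rows: p1 is shown twice and clicked once.
sample_rows = [
    {"product_ids": ["p1", "p2", "p3"], "impression_count": 1,
     "click_events": [{"product_id": "p1"}]},
    {"product_ids": ["p1", "p4"], "impression_count": 1,
     "click_events": []},
]
rates = click_through_rates(sample_rows)
```

With click data alone, `p2`, `p3`, and `p4` would not appear at all, and the rate for `p1` could not be computed because the number of times it was shown would be unknown.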