Data Format and Guarantees

While most data collection solutions store your data in append-only data stores, Causal takes a different approach. The data stored in your data warehouse is guaranteed to follow certain rules, chosen to optimize the on-disk layout for read-time performance and ease of use.

ORC Tables

Causal stores data in ORC files, organized into hourly partitions according to the Hive standard. ORC is a column-oriented file format, so a query reads only the columns it needs, which speeds up queries that touch a subset of the data. Column values also compress very efficiently, because values of the same type sit next to each other on disk.

ORC is an extremely popular data format, and almost any data warehouse solution can query ORC files as if they were native tables. When run with the --ddl option, the Causal compiler automatically generates the DDL required to mount these tables in your data warehouse.

Sessions and Partitions

Data from a Causal session (including feature and metric data) is guaranteed to be written into a specific partition. This is the partition corresponding to the time that a session expires. This makes joins between tables more efficient.
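As a rough illustration of the rule above, the sketch below maps a session's expiry time to hourly partition keys and a Hive-style directory path. The key names ds and hh come from this page; the value formats (YYYY-MM-DD and a zero-padded hour), the use of UTC, the table name, and both function names are assumptions made for the example, not Causal's documented behavior.

```python
from datetime import datetime, timezone


def partition_for_session(expires_at: datetime) -> dict:
    """Return hourly partition keys for a session expiring at `expires_at`.

    Assumption: partitions are keyed by UTC date (ds) and UTC hour (hh).
    """
    utc = expires_at.astimezone(timezone.utc)
    return {"ds": utc.strftime("%Y-%m-%d"), "hh": utc.strftime("%H")}


def partition_path(table: str, keys: dict) -> str:
    """Build a Hive-style key=value directory path for the partition."""
    return f"{table}/ds={keys['ds']}/hh={keys['hh']}"


# A session that started late on April 30 but expired just after midnight
# lands entirely in the partition of its expiry time.
session_expiry = datetime(2023, 5, 1, 0, 5, tzinfo=timezone.utc)
keys = partition_for_session(session_expiry)
path = partition_path("session", keys)  # hypothetical table name
print(path)
```

Because the partition depends only on the expiry time, every row written for that session carries the same ds and hh values.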

For example, if you'd like to find all the impressions of a specific type in a given session, you can use the partition keys ds and hh in your join. This avoids the problem of having to include several partitions in a join query because you aren't sure whether the data you are looking for falls on the other side of a partition boundary. It greatly reduces the work needed to compute the query, because the join only needs to consider one partition at a time.
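The effect of joining on the partition keys can be sketched with plain Python lists standing in for two tables. The table contents, column names other than ds and hh, and the function name are hypothetical; in a real warehouse the same idea would be expressed by adding ds and hh to the join condition.

```python
from collections import defaultdict

# Hypothetical rows; only ds and hh are names taken from this page.
sessions = [
    {"ds": "2023-05-01", "hh": "00", "session_id": "s1", "country": "US"},
    {"ds": "2023-05-01", "hh": "01", "session_id": "s2", "country": "DE"},
]
impressions = [
    {"ds": "2023-05-01", "hh": "00", "session_id": "s1", "feature": "banner"},
    {"ds": "2023-05-01", "hh": "00", "session_id": "s1", "feature": "rating_box"},
    {"ds": "2023-05-01", "hh": "01", "session_id": "s2", "feature": "banner"},
]


def join_within_partition(sessions, impressions, ds, hh):
    """Join sessions to impressions using rows from a single partition only."""

    def in_partition(row):
        return row["ds"] == ds and row["hh"] == hh

    # Index the partition's impressions by session for a hash join.
    by_session = defaultdict(list)
    for imp in filter(in_partition, impressions):
        by_session[imp["session_id"]].append(imp)

    return [
        (s["session_id"], imp["feature"])
        for s in filter(in_partition, sessions)
        for imp in by_session[s["session_id"]]
    ]


result = join_within_partition(sessions, impressions, "2023-05-01", "00")
print(result)
```

Since all of a session's rows share one partition, the join never has to scan the neighboring hour to be complete.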

Event Storage

Events are stored in the same row as the impression in which they occurred. Session events are stored in the session's row. This means that you can see the downstream effects of an impression in the same row as the rest of the feature data. This is both easier and orders of magnitude faster than having separate event tables. A data scientist or analyst doesn't need to build an ETL job to join this data up, which would be error-prone and would incur a time-consuming join every time they run a query.

Since events are stored in the same row as the impression, we also cut down on data size because there is no need to duplicate impression data in the event tables. It's already local on the disk.
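To make the benefit concrete, here is a minimal sketch of computing a click-through rate when events live in the impression's own row. The ImpressionRow class, its field names, and the event names are hypothetical stand-ins, not the actual table schema.

```python
from dataclasses import dataclass, field


@dataclass
class ImpressionRow:
    """Hypothetical impression row with its events stored alongside it."""
    session_id: str
    feature: str
    events: list = field(default_factory=list)


rows = [
    ImpressionRow("s1", "banner", events=["click"]),
    ImpressionRow("s1", "banner"),
    ImpressionRow("s2", "banner", events=["click", "purchase"]),
    ImpressionRow("s2", "banner"),
]


def click_through_rate(rows, feature):
    """Fraction of impressions of `feature` that recorded a click event."""
    shown = [r for r in rows if r.feature == feature]
    clicked = [r for r in shown if "click" in r.events]
    return len(clicked) / len(shown) if shown else 0.0


ctr = click_through_rate(rows, "banner")
print(ctr)
```

The calculation is a single pass over one table; no separate event table has to be joined back to the impressions.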

Order Guarantees

Causal guarantees that impressions occurring in the same session appear one after another, in chronological order, in the table. We do this to improve compression for columns that record point-in-time state. Such columns are common when data scientists training machine learning models want to avoid time shifting.

For example, let's say that you have a Redis database that holds some information about a user that you'd like to include in a feature. To train an algorithm that works accurately in this situation, you need the value stored in Redis at the time the impression occurred. However, Redis does not have any history capability. It only stores the current value.

The solution is to store the Redis values inside the feature. That way, you can see what was stored at the time the feature was rendered and accurately train your algorithm. You get the time-shifting guarantees of a feature store without having to move your data into yet another system, with all the complexity and challenges that brings.
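The idea can be sketched as follows, with a plain dictionary standing in for Redis so the example runs on its own. The key format, the loyalty_tier field, and the record_impression function are all hypothetical; this is not Causal's API, only an illustration of snapshotting a value at impression time.

```python
# Stand-in for Redis: holds only the current value, with no history.
current_state = {"user:42:loyalty_tier": "silver"}


def record_impression(user_id, store):
    """Copy the current external value into the impression record itself."""
    return {
        "user_id": user_id,
        "loyalty_tier": store[f"user:{user_id}:loyalty_tier"],
    }


first = record_impression(42, current_state)

# The external store is later overwritten; the old value is gone from it.
current_state["user:42:loyalty_tier"] = "gold"
second = record_impression(42, current_state)

print(first["loyalty_tier"], second["loyalty_tier"])
```

Each impression keeps the value that was current when it was rendered, so training data built from these records reflects what the system actually saw at that moment.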

It's very efficient to store state values like this in Causal because of our order guarantee. If the value doesn't change much from impression to impression, the column-based compression algorithms used by the underlying storage format will perform exceptionally well.
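A simplified way to see why ordering matters is run-length encoding, used here as a stand-in for the more sophisticated encodings a columnar format applies. The sample column values are made up, and the comparison only illustrates the principle; it does not reproduce ORC's actual encoder.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Two sessions whose state value stays constant across four impressions each.
ordered = ["silver"] * 4 + ["gold"] * 4   # rows grouped by session
interleaved = ["silver", "gold"] * 4      # rows from both sessions mixed

ordered_runs = run_length_encode(ordered)
interleaved_runs = run_length_encode(interleaved)

print(len(ordered_runs), len(interleaved_runs))
```

With session rows kept together, the eight values collapse to two runs; when the same rows are interleaved, every value starts a new run and nothing is saved.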