Updating the Warehouse
The impression servers upload data to the S3 bucket you set up in Backend Setup. They collect user data into sessions and upload completed sessions hourly (more frequently under high load).
These raw log files (in Avro format) are saved to the incoming-logs folder in your bucket, in hourly partition directories.
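You can confirm that sessions are arriving by listing the folder with the AWS CLI. For example (YOUR_BUCKET is a placeholder for the bucket you created, and the exact partition path layout may differ from this sketch):
## replace YOUR_BUCKET with the bucket from Backend Setup
$ aws s3 ls --recursive s3://YOUR_BUCKET/incoming-logs/ | head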
If you don't see anything there, it's possible that your session hasn't expired yet on the impression server. You can interrupt the server with ctrl+c, which will shut down the server and force an upload of all active sessions.
A background Causal ETL process takes the Avro data and transforms it into a set of ORC files, as discussed in the previous sections.
To make these tables available in your data warehouse, you can use causalc to generate warehouse SQL statements that update the catalog. The sections below describe the various ways you can do that.
AWS Glue
Causal supports AWS Glue natively and can automatically update your Glue catalog when FDL changes or data is appended. To set this up, follow the directions here.
Hive
The Causal compiler can generate the DDL statements to mount the Causal tables into a Hive metastore. To do so, pass the --hive command-line option to the compiler.
## replace TOKEN and ENV_ID appropriately
$ npx causalc --token TOKEN --environment ENV_ID --force-requery --hive causal.sql
The --force-requery option forces the compiler to fetch the most recent metadata from the tools and the warehouse. Normally, the compiler calls out to the cloud only if the cached data on disk is more than a day old.
The output of this command, stored in causal.sql, is the set of DDL statements needed to add the Causal tables to your data warehouse; in this case it generates code for Hive.[1] Simply run the statements to add the Causal tables.
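For example, if your metastore is reachable through a HiveServer2 endpoint, you could execute the generated file with beeline (a sketch; HIVE_HOST and the target database are assumptions about your setup):
## replace HIVE_HOST with your HiveServer2 endpoint
$ beeline -u "jdbc:hive2://HIVE_HOST:10000/default" -f causal.sql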
Redshift Spectrum
Redshift Spectrum supports querying ORC and other tables stored in S3 from a Redshift cluster. Spectrum lets you access an external schema defined in Glue: you can use the technique illustrated in the previous section to load your data into Glue, and then mount that data in Redshift as described here.
Alternatively, the Causal compiler can generate Redshift DDL statements to mount the data into your Redshift cluster.
Not all Causal views can be mounted with Redshift DDL statements. In particular, event tables are not currently supported, because Redshift views over external nested tables do not allow the required UNION queries. If you need these views, you can use Athena to load them into Glue and then mount them in Redshift.
First create an external schema pointing to AWS Glue using this procedure; a sketch of that step is shown below.
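The following sketch creates the external schema from psql; the Glue database name, IAM role ARN, and connection string are placeholders for your own values:
## replace EXTERNAL_SCHEMA, the Glue database name, the IAM role ARN,
## and REDSHIFT_URL with your own values
$ psql "$REDSHIFT_URL" <<'SQL'
CREATE EXTERNAL SCHEMA IF NOT EXISTS EXTERNAL_SCHEMA
FROM DATA CATALOG
DATABASE 'causal_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
SQL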
Then you can generate the Redshift DDL to mount the tables as follows, where EXTERNAL_SCHEMA is the schema you just created:
## replace TOKEN, ENV_ID and EXTERNAL_SCHEMA appropriately
$ npx causalc --token TOKEN --environment ENV_ID --force-requery --redshift causal.sql --redshift-schema EXTERNAL_SCHEMA
To mount the tables, execute the statements stored in causal.sql using psql or some other SQL client.
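For example, with psql (the connection parameters below are placeholders for your cluster's endpoint, database, and user):
## replace CLUSTER_ENDPOINT, DB_NAME and USER with your cluster's values
$ psql -h CLUSTER_ENDPOINT -p 5439 -d DB_NAME -U USER -f causal.sql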
[1] Email us at support@causallabs.io to inquire about SQL generation for your specific data warehouse.