Redshift Best Practices
- Smaller node types load data faster
- Best practices for data loads (a COPY sketch follows this list):
  - One file in S3 per slice (slices are the parallel processing units within each Redshift node)
  - Compress the files (e.g., with gzip)
  - File size: 1 MB to 1 GB compressed
  - COPY from S3 is the fastest load path
  - COPY from EMR HDFS may be faster still, but most people store their data in S3 rather than HDFS
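
A minimal COPY sketch along these lines; the table name, bucket path, and IAM role are hypothetical:

```sql
-- Hypothetical names throughout (sales table, bucket, IAM role).
COPY sales
FROM 's3://my-bucket/sales/part_'   -- key prefix; aim for one file per slice
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP                                -- input files are gzip-compressed
DELIMITER '|';
```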
- The first column of the SORTKEY should not be compressed
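
One way to express this, on a hypothetical sales table: declare the leading SORTKEY column with ENCODE RAW (no compression) while compressing the other columns:

```sql
-- Hypothetical schema; the point is ENCODE RAW on the leading sort key column.
CREATE TABLE sales (
  sale_date   DATE          ENCODE RAW,   -- first SORTKEY column: uncompressed
  customer_id INT           ENCODE az64,
  amount      DECIMAL(10,2) ENCODE az64
)
DISTKEY (customer_id)
SORTKEY (sale_date, customer_id);
```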
- Workflows: move data from a staging table to the production table (a sketch follows this sub-list)
  - Wrap the entire workflow in ONE transaction
  - COMMITs are very expensive in Redshift
  - Disable automatic statistics collection on staging tables
  - Make sure the distribution keys match between the staging and production tables, so the insert stays local to each slice
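
A sketch of such a workflow, with hypothetical sales / sales_staging tables that share a DISTKEY; STATUPDATE OFF on COPY skips statistics collection for the staging load:

```sql
BEGIN;  -- one transaction around the whole workflow

-- Load the (compressed) staging table without collecting statistics
COPY sales_staging
FROM 's3://my-bucket/sales/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
STATUPDATE OFF;

-- Upsert: remove rows being replaced, then insert the new batch.
-- Matching distribution keys keep this work local to each slice.
DELETE FROM sales
USING sales_staging
WHERE sales.sale_id = sales_staging.sale_id;

INSERT INTO sales SELECT * FROM sales_staging;

-- Empty the staging table with DELETE, not TRUNCATE:
-- in Redshift, TRUNCATE implicitly COMMITs and would break the single transaction.
DELETE FROM sales_staging;

COMMIT;  -- the single, expensive commit
```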
- Compress your staging tables
- Run ANALYZE after VACUUM to refresh the query planner's statistics (see the sketch below)
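
For example, after a large batch of deletes or updates on the hypothetical sales table:

```sql
VACUUM FULL sales;   -- reclaim space from deleted rows and re-sort
ANALYZE sales;       -- refresh planner statistics on the re-sorted table
```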