AWS/GCP Data Lake

Built a data lake on both GCP and AWS.

AWS

  1. Set up connections to the data sources in AWS Glue.
  2. Built crawlers to populate the AWS Glue Data Catalog with tables (see the crawler sketch after this list).
  3. Created Apache Spark AWS Glue jobs to extract the cataloged tables and store them as raw Avro data in an S3 bucket (see the job sketch after this list).
  4. The raw data is then cleansed and refined and written to an S3 bucket as Apache Parquet files.
  5. Parquet is used because it is a compressed, columnar format optimized for analytical queries.
  6. Used Amazon Athena to aggregate the refined data into custom views.
  7. In Athena, CTAS (CREATE TABLE AS SELECT) statements wrote the aggregated data into new partitions in the S3 bucket (see the CTAS sketch after this list).
  8. In this process, a few tables were aggregated to create new views.
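
A minimal sketch of what step 2 could look like with boto3; the crawler name, IAM role, Data Catalog database, connection name, and JDBC path are hypothetical placeholders, not the project's actual resources.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder names: crawler, role, database, and connection are illustrative only.
glue.create_crawler(
    Name="datalake-jdbc-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="datalake_raw",  # Data Catalog database the crawler populates
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "source-db-connection",  # Glue connection from step 1
                "Path": "sourcedb/%",                      # crawl every table in the schema
            }
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

glue.start_crawler(Name="datalake-jdbc-crawler")
```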
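
The Glue job in steps 3–5 could look roughly like the PySpark sketch below. The database, table, column, and bucket names (`datalake_raw`, `orders`, `my-datalake-raw`, `my-datalake-refined`, `order_id`, `order_date`) are assumptions for illustration, and the cleansing rules stand in for whatever refinement the real job applies.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that the crawler registered in the Data Catalog (placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="orders"
)

# Land the untouched extract as Avro in the raw zone.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-raw/orders/"},
    format="avro",
)

# Cleanse and refine with plain Spark, then store as Parquet in the refined zone.
df = dyf.toDF()
refined = (
    df.dropDuplicates()
      .na.drop(subset=["order_id"])                       # drop rows missing the key
      .withColumn("order_date", F.to_date("order_date"))  # normalize types
)
refined.write.mode("overwrite").parquet("s3://my-datalake-refined/orders/")

job.commit()
```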
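
Steps 6–8 could be expressed as a CTAS query submitted through boto3, roughly as below; the databases, tables, columns, and S3 locations are placeholders, and the aggregation itself is only an example of the kind of views that were built.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Illustrative CTAS: aggregates a refined table into a new Parquet table,
# partitioned by year, under its own S3 prefix (all names are placeholders).
ctas = """
CREATE TABLE datalake_analytics.orders_by_customer
WITH (
    format = 'PARQUET',
    external_location = 's3://my-datalake-analytics/orders_by_customer/',
    partitioned_by = ARRAY['order_year']
) AS
SELECT customer_id,
       SUM(total_amount) AS total_spent,
       COUNT(*)          AS order_count,
       year(order_date)  AS order_year
FROM datalake_refined.orders
GROUP BY customer_id, year(order_date)
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
```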

GCP

  1. Created a bucket on Cloud Storage and uploaded the raw data as CSV files.
  2. Created the dataset and tables in BigQuery from the uploaded data.
  3. Created partitioned tables from the originals to provide new views of the dataset (see the sketch after this list).
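
A sketch of the three GCP steps using the google-cloud-bigquery client; the project, dataset, table, column, and bucket names are placeholders, and the partitioned aggregate stands in for whichever views were actually created.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder dataset, table, and bucket names.
dataset_id = "my_project.datalake"
client.create_dataset(dataset_id, exists_ok=True)

# Load the raw CSV from Cloud Storage into a BigQuery table,
# letting BigQuery infer the schema.
load_job = client.load_table_from_uri(
    "gs://my-datalake-bucket/raw/orders.csv",
    f"{dataset_id}.orders",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to finish

# Create a date-partitioned table from the raw one to serve a new view of the data.
query = f"""
CREATE OR REPLACE TABLE `{dataset_id}.orders_by_day`
PARTITION BY order_date AS
SELECT order_date,
       customer_id,
       SUM(total_amount) AS total_spent
FROM `{dataset_id}.orders`
GROUP BY order_date, customer_id
"""
client.query(query).result()
```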