Big Data | Analytics | AWS | Redshift

A Beginner’s Guide to Apache Iceberg with Amazon S3 Tables

Learn Apache Iceberg and Amazon S3 Tables with this beginner's guide. Create real Iceberg tables, load data, query, update, and evolve schemas using SageMaker Unified Studio and Athena Spark. Discover ACID guarantees, time travel capabilities, and schema evolution without file rewrites. Perfect for analytics workloads requiring transactional reliability, point-in-time accuracy, and multi-engine support on S3. From the AWS Builder Center blog: https://builder.aws.com/content/3

David McAmis

Mar 29 | 1 min read

Analyzing Insurance Churn with SageMaker Data Agent, Amazon S3 Tables and Faker

Learn how to use Amazon SageMaker Data Agent, Amazon S3 Tables, and the Python Faker library to generate realistic synthetic insurance data for churn analysis, without exposing live policyholder records. This post walks through generating synthetic tables from scratch and creating synthetic twins from existing tables, then shows how to use plain-language prompts to analyze churn patterns across products, demographics, payment behavior, and customer interactions. From the AWS

David McAmis

Mar 22 | 1 min read

Federated Healthcare Analytics Across Snowflake and Amazon S3 Tables Using Iceberg

Learn how to run federated healthcare analytics across Snowflake and Amazon S3 Tables using Apache Iceberg — no ETL pipelines or data movement required. This guide demonstrates how to join governed patient demographics in Snowflake with ICU admissions and clinical measurements stored in S3 Tables, enabling mortality risk scoring, readmission prediction, and ventilator utilization analysis while keeping sensitive data in place and governance controls intact. From the AWS Build

David McAmis

Mar 14 | 1 min read

Building a Student 360 View with SageMaker Data Agent, Databricks Unity Catalog and S3 Tables -- Without Moving a Single Row of Data

Universities can build a Student 360 view that combines student information system data in Databricks Unity Catalog with learning management data in Amazon S3 Tables, without moving or duplicating data. Using catalog federation and Amazon SageMaker Unified Studio, institutions can query across federated datasets in natural language with Data Agent, enabling early intervention, improved retention, and personalized learning at scale. From the AWS Builder Center blog:...

David McAmis

Mar 11 | 1 min read

Building Scalable Synthetic Data Pipelines with Amazon S3 Tables, Apache Iceberg and Faker

Are you keen to try out Amazon S3 Tables? Learn how to generate millions of highly realistic synthetic financial transactions using Python's Faker library and AWS Glue, then store them in S3 Tables with Apache Iceberg for testing and development without exposing sensitive production data. This hands-on tutorial demonstrates partition optimization and reproducible data generation at scale. From the AWS Builder Center blog: https://builder.aws.com/content/3A5pA0YR3Ee4qM1Wuv59dz
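As a rough illustration of the "reproducible data generation" idea mentioned above, here is a minimal sketch that uses only Python's standard library as a stand-in for Faker and AWS Glue. The function name, field names, merchant categories, and the choice of `event_date` as a partition column are all assumptions made for illustration; they are not taken from the tutorial itself.

```python
import random
from datetime import datetime, timedelta

# Hypothetical merchant categories, used only for this sketch.
MERCHANTS = ["grocer", "fuel", "pharmacy", "online_retail"]


def generate_transactions(n, seed=42):
    """Return n synthetic transaction rows; the same seed gives the same rows."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    start = datetime(2025, 1, 1)
    rows = []
    for i in range(n):
        # Pick a timestamp somewhere in a 90-day window after the start date.
        ts = start + timedelta(minutes=rng.randrange(0, 60 * 24 * 90))
        rows.append({
            "transaction_id": f"txn-{i:08d}",
            "account_id": f"acct-{rng.randrange(1, 1001):04d}",
            "merchant_category": rng.choice(MERCHANTS),
            "amount": round(rng.uniform(1.0, 500.0), 2),
            "event_time": ts.isoformat(),
            # Assumed partition column: one value per calendar day.
            "event_date": ts.date().isoformat(),
        })
    return rows
```

Calling `generate_transactions(1000, seed=7)` twice yields identical rows, which is the property that makes test datasets repeatable. A date-derived column such as `event_date` is one common candidate for partitioning, though the right choice depends on the query patterns.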

David McAmis

Feb 28 | 1 min read

A Beginner’s Guide to Apache Parquet

Apache Parquet is a widely used file format in modern data analytics and data engineering. It is especially common in data lakes and data lakehouse architectures, where performance, scalability, and efficient storage are critical. This guide explains what Parquet is, where it came from, how it is structured internally, and how to query it effectively. From the AWS Builder Center Blog: https://builder.aws.com/content/38xMNi5KpMwMMVNvBaHPjEOlv8X/a-beginners-guide-to-apache-p

David McAmis

Jan 30 | 1 min read

Inside AWS Glue: Understanding the Spark Engine and Using the Spark UI for Troubleshooting

Learn how AWS Glue uses the Spark engine to create scalable, performant data pipelines, and how to monitor background processes using the Spark UI. From the AWS Builder Center blog: https://builder.aws.com/content/38Rp4bJuE5lSsX89iee3gjDeZkO/inside-aws-glue-understanding-the-spark-engine-and-using-the-spark-ui-for-troubleshooting

David McAmis

Jan 19 | 1 min read

A Beginner’s Guide to Orchestrating AWS Glue Jobs with Amazon Managed Workflows for Apache Airflow (MWAA)

If you’re building data pipelines on AWS, you’ve probably used AWS Glue to run ETL (Extract, Transform, Load) jobs. Glue automates data movement and transformation so you can focus on insights rather than infrastructure. But what if you need to schedule those jobs, run them in sequence, or trigger them based on the completion of other tasks? That’s where Amazon Managed Workflows for Apache Airflow (MWAA) comes in. This blog will provide everything you need to set up your firs

David McAmis

Jan 19 | 1 min read

Connecting to Salesforce Data Using AWS Glue

Integrating Salesforce data into your AWS analytics ecosystem is an essential step in building a comprehensive view of your customers, sales, and operations. With the growing number of options available in AWS for ingesting and transforming external data, it’s important to understand which approach best suits your needs—especially when comparing traditional Glue ETL jobs with newer Zero-ETL features. From the AWS Builder blog: https://builder.aws.com/content/2zqZrESSbXDWzl2ft

David McAmis

Nov 7, 2025 | 1 min read

Troubleshooting AWS Glue Jobs

AWS Glue is a powerful serverless ETL service designed to simplify data integration tasks at scale. However, like any data engineering tool, Glue jobs can—and will—fail due to a variety of issues: configuration errors, data mismatches, IAM permission problems, or underlying infrastructure limits. This post is a practical guide to troubleshooting AWS Glue jobs. From the AWS Builder Blog: https://builder.aws.com/content/2y4nDmkmTBfknTTWR5wwtfrbECQ/troubleshooting-aws-glue-jobs

David McAmis

Nov 7, 2025 | 1 min read

Ingest Excel Files into a Data Lake Using AWS Glue

As organizations modernize their data infrastructure, ingesting legacy Excel files into cloud-based data lakes is becoming increasingly important. Whether you’re dealing with departmental spreadsheets or externally sourced data, AWS Glue provides a serverless, low-code approach for transforming and loading Excel files from Amazon S3 into your data lake. From the AWS Builder blog: https://builder.aws.com/content/2y1jJ2tU85XtCpO5CkHsX1EfQNW/how-to-ingest-excel-files-into-a-dat

David McAmis

Jun 5, 2025 | 1 min read