Integrating AWS Glue Jobs with S3 Notifications and EventBridge

Andrew Smith

In my last post I created a standalone AWS Glue job for loading and transforming data from S3 into Redshift.

Here, I’m expanding that scenario by integrating the job with S3 notifications, Lambda functions, EventBridge rules and SNS messaging. Specifically, this will entail:

  • adding an S3 notification to the bucket where the CSV files for loading are dropped, which calls a Lambda function that starts the Glue job
  • adding an EventBridge rule that triggers if a job fails, and which calls a Lambda function to format and publish a message to an SNS topic
  • adding an S3 notification to the bucket where error files from the job are written, which calls a Lambda function to format and publish a message to an SNS topic

This is illustrated as follows:

[Diagram: the Glue job integrated with S3 notifications, Lambda functions, an EventBridge rule and an SNS topic]

Triggering the Glue job from S3

The Glue job from my last post needs to run whenever a new CSV file is dropped into the Products folder of an S3 bucket. So I added a notification to that bucket for “All object create events”, for the prefix “Products/” and which calls a Lambda function with the following script:

[Screenshot: Lambda function script that starts the Glue job]
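As a rough sketch of that kind of function (the job name and argument keys below are placeholders, not necessarily those used in the actual script):

```python
import boto3
from urllib.parse import unquote_plus

glue = boto3.client("glue")

GLUE_JOB_NAME = "ProductsLoadJob"  # placeholder job name

def lambda_handler(event, context):
    # Pull the bucket and key of the newly created object from the S3 notification
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])

    # Start the Glue job, passing the source location as job arguments so the
    # ETL script itself stays environment-agnostic
    response = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={
            "--source_bucket": bucket,
            "--source_key": key,
        },
    )
    print(f"Started Glue job run {response['JobRunId']}")
```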

This and the other script in this post are available in my GitHub repository.

It uses the Python 3.8 runtime and calls the Glue API’s start_job_run() function via the AWS boto3 API. The Glue job from my last post had source and destination data hard-coded at the top of the script; I’ve changed this so that the data is received as parameters from the start_job_run() call shown above. This allows the Glue ETL script to be used unchanged in Production and Dev/Test environments, with each environment simply passing different source (S3) and destination (Redshift) details via the parameters in the above script.
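For reference, a Glue script can resolve job arguments like these with getResolvedOptions; the parameter names below are the same placeholders as in the sketch above:

```python
import sys
from awsglue.utils import getResolvedOptions

# Resolve the job arguments passed via start_job_run()'s Arguments map
args = getResolvedOptions(sys.argv, ["source_bucket", "source_key"])

source_path = f"s3://{args['source_bucket']}/{args['source_key']}"
```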

The function above needs an IAM service role associated with it. I created a new role that was based upon the AWS managed AWSLambdaBasicExecutionRole policy and then added a further policy to allow the function to start the Glue job:

[Screenshot: IAM policy allowing the Lambda function to start the Glue job]
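As an illustration, a policy along these lines could be attached to the role inline (the role and policy names here are placeholders):

```python
import json
import boto3

iam = boto3.client("iam")

# Inline policy permitting the function to start Glue job runs. As noted
# below, glue:StartJobRun does not support a job-name resource, so the
# resource must be "*".
iam.put_role_policy(
    RoleName="lambda-start-glue-job-role",
    PolicyName="StartGlueJobRun",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "glue:StartJobRun",
            "Resource": "*",
        }],
    }),
)
```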

Note that as per the Glue API Permissions documentation it is not possible to specify a particular job name resource for the glue:StartJobRun action.

Adding the function as the target of an S3 notification also automatically adds a resource-based policy to the function that allows S3 to invoke the function (and which is restricted to the particular S3 bucket in question).

The function is called asynchronously by S3, so a Lambda destination could be added to save contextual error information in the case of any execution failures that aren’t resolved by retries.
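For example, an on-failure destination could be configured along these lines (the function name and queue ARN are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Configure retries and an on-failure destination (an SQS queue here) for
# asynchronous invocations of the function
lambda_client.put_function_event_invoke_config(
    FunctionName="start-glue-job",
    MaximumRetryAttempts=2,
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:eu-west-2:123456789012:glue-start-failures"
        }
    },
)
```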

Adding an EventBridge rule

I’ve set up an EventBridge rule that fires on failures of the Glue job. The event pattern is the same as would be used for a CloudWatch Events rule:

[Screenshot: EventBridge event pattern matching Glue job failures]
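For illustration, a rule with this kind of pattern could be created as follows (the rule name and job name are placeholders, and TIMEOUT or STOPPED states could be added to the list if those should alert too):

```python
import json
import boto3

events = boto3.client("events")

# Match failed runs of the Glue job via the "Glue Job State Change" event
events.put_rule(
    Name="glue-job-failure",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": ["ProductsLoadJob"],
            "state": ["FAILED"],
        },
    }),
)
```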

The target of this rule is a Lambda function with this script:

[Screenshot: Lambda function script that formats events and publishes to the SNS topic]
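A minimal sketch of such a dual-source handler (the topic ARN and message wording below are placeholder assumptions):

```python
import boto3

sns = boto3.client("sns")

TOPIC_ARN = "arn:aws:sns:eu-west-2:123456789012:etl-failures"  # placeholder

def lambda_handler(event, context):
    # S3 notification events carry a "Records" list; EventBridge events don't,
    # so that key distinguishes the two sources
    if "Records" in event:
        s3_info = event["Records"][0]["s3"]
        subject = "ETL error file written"
        message = (
            f"An error file was written to bucket {s3_info['bucket']['name']} "
            f"at key {s3_info['object']['key']}"
        )
    else:
        detail = event["detail"]
        subject = "Glue job failed"
        message = (
            f"Glue job {detail['jobName']} (run {detail['jobRunId']}) failed: "
            f"{detail.get('message', 'no further details')}"
        )

    # Publish the formatted message so topic subscribers receive it by email
    sns.publish(TopicArn=TOPIC_ARN, Subject=subject, Message=message)
```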

This uses the Python 3.8 runtime and is written to process events from two sources: EventBridge and S3 (the latter is used in the next section below). In each case, key items are extracted from the event and formatted into a message that’s published to an SNS topic, so that subscribers can receive these failure notification messages by email.

The event format for Glue events is described in the AWS documentation here, and the event format for S3 notification events is described here.

The function above needs an IAM service role associated with it. I created a new role that was based upon the AWS managed AWSLambdaBasicExecutionRole policy and then added a further policy to allow the function to publish to the desired SNS topic:

[Screenshot: IAM policy allowing the Lambda function to publish to the SNS topic]
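As with the Glue policy above, this could be attached as an inline role policy; note that, unlike glue:StartJobRun, sns:Publish supports resource-level permissions, so it can be scoped to the one topic (names below are placeholders):

```python
import json
import boto3

iam = boto3.client("iam")

# Inline policy permitting publishes to the specific failure topic only
iam.put_role_policy(
    RoleName="lambda-publish-sns-role",
    PolicyName="PublishToFailureTopic",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sns:Publish",
            "Resource": "arn:aws:sns:eu-west-2:123456789012:etl-failures",
        }],
    }),
)
```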

Adding the function as the target of both EventBridge and S3 events also automatically adds a resource-based policy that allows EventBridge and S3 to invoke the function (and which is restricted to the particular EventBridge rule and S3 bucket in question).

The function is called asynchronously by EventBridge and S3, so a Lambda destination could be added to save contextual error information in the case of any execution failures that aren’t resolved by retries.

Adding an S3 notification for ETL error files

The ETL script in my last post used a DynamicFrame to store source rows that failed a particular date conversion. This frame was written to a file in the “Errors” folder of the S3 bucket being used for the ETL.

Lastly, I’ve added a notification to the bucket so that any time a file gets written there, a message gets published to an SNS topic that subscribers can receive by email. The notification is created for “All object create events”, for the prefix “Errors/”, and calls the Lambda function shown in the previous section. That function extracts the bucket name and file name from the S3 notification and formats them into a message that’s published to the required SNS topic.
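For reference, both bucket notifications (Products/ for the loader, Errors/ for the alerting function) could be expressed in a single configuration like this; the bucket name and function ARNs are placeholders:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-etl-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "start-glue-job-on-product-upload",
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-2:123456789012:function:start-glue-job",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "Products/"}]}},
            },
            {
                "Id": "notify-on-error-file",
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-2:123456789012:function:publish-etl-failure",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "Errors/"}]}},
            },
        ]
    },
)
```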