Trying out: Run an AWS Glue job inside a VPC
Learn how to access resources in your VPC from an AWS Glue job.
Key points
The key points for running an AWS Glue job in a VPC are as follows.
- Configure the VPC settings under “Database -> Connection” in the console menu.
- Associate the job with a connection.
- Add a self-referencing inbound rule to the security group for the ENI.
- Create the S3 VPC endpoint as a gateway type.
Note that there is no network setting on the job creation screen, and that you need to define a connection even if the job does not connect to a database over JDBC.
Trying it out
This time we will run the Glue job within the private subnet of the VPC without NAT.
1. Create a VPC endpoint for S3
Glue jobs require S3 access to get scripts and manage execution results.
Create a gateway type VPC endpoint for S3 so that resources in the VPC can reach S3.
There are two types of VPC endpoint for S3:
- Gateway type
- Interface type
At the time of writing, interface endpoints for S3 could not use private DNS names and had to be addressed through VPC endpoint-specific DNS names. Perhaps because of this limitation, creating only an interface endpoint was not enough to reach S3 from a Glue job in the VPC. Create a gateway endpoint to keep the configuration simple.
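As a rough sketch of this step outside the console, the gateway endpoint could be created with boto3 along the following lines. The helper names, the VPC ID, the route table ID, and the region are placeholders invented for illustration; only the EC2 `create_vpc_endpoint` call and its parameters come from the AWS API.

```python
def s3_gateway_endpoint_params(vpc_id, route_table_ids, region):
    """Build the arguments for EC2 create_vpc_endpoint (gateway type, S3)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # A gateway endpoint is attached to route tables, not to subnets.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, route_table_ids, region):
    """Call the EC2 API. Needs boto3 and AWS credentials (hypothetical helper)."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    params = s3_gateway_endpoint_params(vpc_id, route_table_ids, region)
    return ec2.create_vpc_endpoint(**params)
```

The route table passed here should be the one associated with the private subnet where the Glue job will run, since that is where the route to S3 gets added.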
2. Create a security group for the job execution ENI
Create a security group for the network interface (ENI) that is used when running Glue jobs within your VPC.
Add a self-referencing inbound rule to this security group, that is, a rule that allows TCP on any port when the traffic originates from the security group itself.
Type | Protocol | Port range | Source
All TCP | TCP | 0-65535 | Your own security group ID
This rule is needed so that the Glue components running behind these ENIs can communicate with each other.
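The rule in the table above could be expressed with boto3 roughly as follows. This is a minimal sketch: the helper names and the security group ID are made up, and the only real API used is EC2 `authorize_security_group_ingress`.

```python
def self_reference_ingress_params(security_group_id):
    """Build arguments for authorize_security_group_ingress.

    The rule allows all TCP ports, with the security group itself as source.
    """
    return {
        "GroupId": security_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 0,
                "ToPort": 65535,
                # Source is the same group, which makes the rule self-referencing.
                "UserIdGroupPairs": [{"GroupId": security_group_id}],
            }
        ],
    }


def add_self_reference_rule(security_group_id):
    """Call the EC2 API. Needs boto3 and AWS credentials (hypothetical helper)."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2")
    params = self_reference_ingress_params(security_group_id)
    return ec2.authorize_security_group_ingress(**params)
```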
3. Add database connection settings
Under “Database -> Connection”, set the following for Glue job execution:
- VPC
- Subnet
- (Optional) Database connection information for JDBC
Connection properties
If you do not use JDBC, specify “Network” for “Connection type”.
When connecting to a database with JDBC from a Spark script, you can define the connection information.
If you want to access the datastores in the same AWS account, you can select a database such as Redshift for Connection Type and specify the JDBC connection information in a wizard format.
You can also select JDBC as the Connection Type and enter it directly.
In the Spark script, pass the name of this connection as the catalog_connection argument.
# glueContext and dropnullfields are assumed to be defined earlier in the job script.
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = dropnullfields,
    catalog_connection = "DATABASE_CONNECTION_NAME",
    connection_options = {"dbtable": "table_name", "database": "db_name"},
    transformation_ctx = "...")
Connection access
In “Connection Access”, specify the VPC and subnet on which the Glue job will be executed.
For “Security Group”, specify the security group you created earlier.
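For reference, a “Network” type connection with these settings might look like this when created through boto3 instead of the console. This is a sketch under assumptions: the helper names and all IDs are placeholders, and the empty ConnectionProperties map is assumed to be acceptable for a connection that carries no JDBC information. The API takes a subnet rather than a VPC ID, so the VPC is implied by the subnet.

```python
def network_connection_input(name, subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput for Glue create_connection (NETWORK type)."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # No JDBC URL or credentials for a network-only connection (assumption).
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(connection_input):
    """Call the Glue API. Needs boto3 and AWS credentials (hypothetical helper)."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    return glue.create_connection(ConnectionInput=connection_input)
```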
4. Check connection
After creating “Connection”, click the “Test Connection” button to check the communication of the created connection.
If the message “Successfully connected to the instance” is displayed, it is successful.
If an error occurs, review the settings with the help of the following article.
Resolve a failed test connection in AWS Glue
5. Create a job
Create a Glue job and link the “Connection” created earlier on the “Connection” screen.
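Since the job creation screen has no network setting, the link to the connection is what places the job in the VPC. A minimal boto3 sketch of the same association follows. The job name, role ARN, script location, and helper names are hypothetical; `create_job` and its `Connections` parameter are part of the Glue API.

```python
def glue_job_params(job_name, role_arn, script_location, connection_name):
    """Build arguments for Glue create_job with a connection attached."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        # The connection supplies the subnet and security group for the job's ENI.
        "Connections": {"Connections": [connection_name]},
    }


def create_glue_job(job_name, role_arn, script_location, connection_name):
    """Call the Glue API. Needs boto3 and AWS credentials (hypothetical helper)."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    params = glue_job_params(job_name, role_arn, script_location, connection_name)
    return glue.create_job(**params)
```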
6. Job execution
If the job is executed and the “execution status” becomes “Succeeded”, it is successful.
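If you would rather check the status from a script than from the console, a polling loop along these lines is one option. It is a sketch, not the only way to do it: the helper names, the polling interval, and the set of states treated as terminal are choices made here for illustration, while `start_job_run` and `get_job_run` are Glue API calls.

```python
import time

# Job run states treated as final in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job_run(get_state, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state, then return it."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job run did not finish within the polling budget")


def run_glue_job(job_name):
    """Start a job run and wait for it. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the polling helper works without boto3

    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]

    def get_state():
        run = glue.get_job_run(JobName=job_name, RunId=run_id)
        return run["JobRun"]["JobRunState"]

    return wait_for_job_run(get_state)
```

Passing the state lookup in as a function keeps the loop easy to try out without touching AWS.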
7. Check the ENI of the Glue job
You can see the ENI for your Glue job from EC2 → Network Interface in the AWS console.
The ENI for Glue jobs has a description field of “Attached to Glue using role: arn:aws:iam::<AWS account ID>:role/GlueJobRole”.
Make sure this ENI is in the same VPC / subnet / security group as specified in “Database -> Connection”.
The ENI is deleted after it has been unused for a certain period of time.
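The same check can be scripted. The sketch below filters network interfaces by the description prefix mentioned above and compares each one against the connection settings. The helper names and IDs are hypothetical, and matching on the “Attached to Glue” prefix assumes the description format shown in this article.

```python
# Filter for EC2 describe_network_interfaces; "*" is a wildcard in filter values.
GLUE_ENI_FILTER = [{"Name": "description", "Values": ["Attached to Glue*"]}]


def eni_matches_connection(eni, vpc_id, subnet_id, security_group_id):
    """Return True if the ENI sits in the expected VPC, subnet, and security group."""
    group_ids = {group["GroupId"] for group in eni.get("Groups", [])}
    return (
        eni.get("VpcId") == vpc_id
        and eni.get("SubnetId") == subnet_id
        and security_group_id in group_ids
    )


def find_glue_enis():
    """Call the EC2 API. Needs boto3 and AWS credentials (hypothetical helper)."""
    import boto3  # imported lazily so the check above works without boto3

    ec2 = boto3.client("ec2")
    response = ec2.describe_network_interfaces(Filters=GLUE_ENI_FILTER)
    return response["NetworkInterfaces"]
```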
If you need internet communication
If the Glue job running in the VPC requires internet communication, the configuration is as follows.
- The ENI for a Glue job is not assigned a public IP, so provide a NAT
- Route internet-bound traffic to the NAT
- You don’t need an S3 VPC endpoint because you can access S3 via the Internet (use various VPC endpoints if you want to access AWS resources without going through the Internet)
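As an illustration of the routing part, the default route of the private subnet's route table could be pointed at a NAT gateway like this. This is a sketch assuming a NAT gateway (rather than a NAT instance) already exists in a public subnet; the helper names and IDs are placeholders, and `create_route` is the EC2 API call.

```python
def nat_default_route_params(route_table_id, nat_gateway_id):
    """Build arguments for EC2 create_route: send internet-bound traffic to the NAT."""
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": "0.0.0.0/0",  # all IPv4 traffic without a more specific route
        "NatGatewayId": nat_gateway_id,
    }


def add_nat_default_route(route_table_id, nat_gateway_id):
    """Call the EC2 API. Needs boto3 and AWS credentials (hypothetical helper)."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2")
    return ec2.create_route(**nat_default_route_params(route_table_id, nat_gateway_id))
```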