Trying out: Run an AWS Glue job inside a VPC

Learn how to access resources in your VPC from an AWS Glue job.

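Key points

The key points for running an AWS Glue job in a VPC are as follows.

  • Configure the VPC settings under “Database → Connection” in the console menu.
  • Associate the job with the connection.
  • Add a self-referencing inbound rule to the security group used by the ENI.
  • Create the S3 VPC endpoint as a gateway type.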
  • Create S3 VPC endpoint as gateway type.
In particular, note that:

  • There is no network setting on the job creation screen.
  • You need to define a connection even if the job does not connect to a database over JDBC.

Trying it out

This time we will run the Glue job in a private subnet of a VPC without NAT.

1. Create a VPC endpoint for S3

Glue jobs need S3 access to fetch the job script and to store execution results.

Create a gateway-type VPC endpoint for S3 so that resources in the VPC can reach S3.

There are two types of S3 endpoint:

  • Gateway type
  • Interface type

At the time of writing, interface endpoints for S3 cannot use private DNS names and must be addressed through endpoint-specific DNS names. Perhaps because of this limitation, a Glue job in the VPC cannot reach S3 if you only create an interface endpoint. Create a gateway endpoint to keep the configuration simple.
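
The same endpoint can also be created from a script instead of the console. The following is a minimal sketch using boto3, not a definitive implementation: the function names, the VPC ID, the route table ID, and the region are placeholders assumed for illustration. The helper that builds the request is kept separate so it can be inspected without calling AWS.

```python
def s3_gateway_endpoint_params(vpc_id, route_table_ids, region):
    """Build the arguments for the EC2 CreateVpcEndpoint call (gateway type, S3)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # A gateway endpoint works by adding a route to these route tables,
        # so include the route table of the subnet the Glue job will run in.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, route_table_ids, region):
    """Create the endpoint. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_params(vpc_id, route_table_ids, region)
    )
```

For example, `create_s3_gateway_endpoint("vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"], "ap-northeast-1")` would create the endpoint, assuming those (hypothetical) resources exist in your account.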

2. Create a security group for the job execution ENI

Create a security group for the network interface (ENI) that Glue uses when it runs jobs within your VPC.

Add a self-referencing inbound rule to this security group, that is, a rule that allows TCP on all ports when the traffic originates from the security group itself.

Type      Protocol   Port range   Source
All TCP   TCP        0-65535      ID of this security group itself

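Security groups are stateful, so this rule is not about return traffic. It is needed so that the Glue components running behind these ENIs, which all share the security group, can communicate with each other.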
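
The self-referencing rule can be sketched with boto3 as follows. This is an illustration under assumptions: the function names and the security group ID are hypothetical, and the group is assumed to exist already. The rule names the group itself as the traffic source through `UserIdGroupPairs`.

```python
def self_referencing_rule(group_id):
    """Build an inbound rule allowing all TCP ports from the group itself."""
    return {
        "GroupId": group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 0,
                "ToPort": 65535,
                # The source is the same security group, not a CIDR range.
                "UserIdGroupPairs": [{"GroupId": group_id}],
            }
        ],
    }


def add_self_referencing_rule(group_id):
    """Apply the rule. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2")
    return ec2.authorize_security_group_ingress(**self_referencing_rule(group_id))
```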

3. Add connection settings

Under “Database → Connection”, set the following for Glue job execution:

  • VPC
  • Subnet
  • (Optional) Database connection information for JDBC

Connection properties

If you do not use JDBC, specify “Network” for “Connection type”.

When connecting to a database over JDBC from a Spark script, you can define the connection information here.

If you want to access a datastore in the same AWS account, you can select a database such as Redshift for “Connection type” and enter the JDBC connection information in a wizard.

You can also select JDBC as the connection type and enter the information directly.

In the Spark script, pass the name of this connection as the **catalog_connection** argument.

# dropnullfields is a DynamicFrame produced earlier in the job script
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dropnullfields,
    catalog_connection="DATABASE_CONNECTION_NAME",  # name of the Glue connection
    connection_options={"dbtable": "table_name", "database": "db_name"},
    transformation_ctx="...",
)

Connection access

In “Connection Access”, specify the VPC and subnet on which the Glue job will be executed.

For “Security Group”, specify the security group you created earlier.
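
For reference, a “Network” type connection with these settings could also be defined through the Glue API. The sketch below uses boto3 and rests on assumptions: the function names, connection name, subnet ID, security group ID, and availability zone are all placeholders, and `ConnectionProperties` is left empty on the assumption that a Network connection needs no JDBC details.

```python
def network_connection_input(name, subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput for a NETWORK-type Glue connection."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # Assumed empty: no JDBC URL or credentials for a Network connection.
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(name, subnet_id, security_group_ids, availability_zone):
    """Create the connection. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    return glue.create_connection(
        ConnectionInput=network_connection_input(
            name, subnet_id, security_group_ids, availability_zone
        )
    )
```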

4. Check connection

After creating the connection, click the “Test Connection” button to check that it can communicate.

If the message “Successfully connected to the instance” is displayed, the test succeeded.

If an error occurs, review the settings with the help of the following knowledge article.

Resolve a failed test connection in AWS Glue

5. Create a job

Create a Glue job and, on the “Connection” screen, attach the connection created earlier.
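
Attaching the connection is what places the job in the VPC. As a rough sketch of the equivalent API call with boto3, the job definition lists the connection name under `Connections`. The function names, job name, role ARN, script location, and connection name are hypothetical placeholders.

```python
def vpc_job_definition(job_name, role_arn, script_s3_path, connection_name):
    """Build the arguments for Glue CreateJob with a connection attached."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        # The job inherits VPC, subnet, and security group from this connection.
        "Connections": {"Connections": [connection_name]},
    }


def create_vpc_job(job_name, role_arn, script_s3_path, connection_name):
    """Create the job. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    return glue.create_job(
        **vpc_job_definition(job_name, role_arn, script_s3_path, connection_name)
    )
```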

6. Run the job

Run the job. If the “execution status” becomes “Succeeded”, it worked.
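
The same check can be scripted by starting a run and polling its state. The sketch below takes the Glue client as a parameter (for example `boto3.client("glue")`); the function name, job name, and polling interval are assumptions made for illustration.

```python
import time

# Job run states after which the state no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name, poll_seconds=30):
    """Start a job run and poll until it reaches a terminal state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```

With real credentials, `run_job_and_wait(boto3.client("glue"), "vpc-job")` would return `"SUCCEEDED"` for a successful run, assuming a job of that name exists.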

7. Check the ENI of the Glue job

You can see the ENI for your Glue job under EC2 → Network Interfaces in the AWS console.

The ENI for a Glue job has a description of the form “Attached to Glue using role: arn:aws:iam::<AWS account ID>:role/GlueJobRole”.

Make sure this ENI is in the same VPC, subnet, and security group as specified in Database → Connection.

The ENI is deleted after it has gone unused for a certain period of time.
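
This check can also be done from a script by listing the ENIs in the VPC and keeping those whose description starts with “Attached to Glue”. The following is a sketch with boto3: the function names and the VPC ID are hypothetical, and pagination of the results is ignored for brevity.

```python
def glue_enis(network_interfaces):
    """Keep only ENIs created by Glue and summarise where they are placed."""
    return [
        {
            "id": eni["NetworkInterfaceId"],
            "vpc": eni["VpcId"],
            "subnet": eni["SubnetId"],
            "security_groups": [g["GroupId"] for g in eni.get("Groups", [])],
        }
        for eni in network_interfaces
        if eni.get("Description", "").startswith("Attached to Glue")
    ]


def list_glue_enis(vpc_id):
    """Fetch ENIs in the VPC and filter them. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the filter above works without boto3

    ec2 = boto3.client("ec2")
    response = ec2.describe_network_interfaces(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    return glue_enis(response["NetworkInterfaces"])
```

Compare the returned VPC, subnet, and security group IDs with the values set on the connection.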

If you need internet communication

If the Glue job running in the VPC needs internet access, configure it as follows.

  • The ENI for Glue jobs is not assigned a public IP address, so provide a NAT.
  • Route internet-bound traffic to the NAT.
  • You do not need an S3 VPC endpoint, because S3 is reachable over the internet (use the relevant VPC endpoints if you want to reach AWS resources without going through the internet).
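
Routing internet-bound traffic to the NAT comes down to a default route in the private subnet's route table. A minimal sketch with boto3, assuming a NAT gateway already exists; the function names, route table ID, and NAT gateway ID are hypothetical placeholders.

```python
def nat_default_route(route_table_id, nat_gateway_id):
    """Build the arguments for an EC2 CreateRoute call that sends 0.0.0.0/0 to NAT."""
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": "0.0.0.0/0",  # all internet-bound IPv4 traffic
        "NatGatewayId": nat_gateway_id,
    }


def add_nat_default_route(route_table_id, nat_gateway_id):
    """Create the route. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2")
    return ec2.create_route(**nat_default_route(route_table_id, nat_gateway_id))
```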
