
AWS Glue

You can trigger an AWS Glue job based on an event. You can use the following mechanisms to trigger a Glue job:

  • CloudWatch Events (now Amazon EventBridge): You can create an event rule that starts a Glue job when a specific event occurs. For example, a rule could start a Glue job when a new file is created in an S3 bucket.
  • Step Functions: You can use a Step Functions state machine to trigger a Glue job. For example, you could create a state machine that triggers a Glue job when a specific event occurs in another AWS service.
  • Manually: You can manually trigger a Glue job by clicking the Run button in the AWS Glue console.

To trigger a Glue job based on an event, you will need to create a trigger in the AWS Glue console. When you create a trigger, you will need to specify the event that you want to trigger the job on, as well as the job that you want to run.

Here are the steps to create an event-based trigger for a Glue job:

  1. Go to the AWS Glue console.
  2. Click Triggers.
  3. Click Create trigger.
  4. In the Event type section, select the event that you want to trigger the job on.
  5. In the Job section, select the job that you want to run.
  6. In the Schedule section, specify the schedule for the trigger.
  7. Click Create.

Once you have created the trigger, the Glue job will be triggered whenever the specified event occurs.
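The console steps above can also be done programmatically. The sketch below builds the request payload you would pass to the Glue CreateTrigger API (for example via boto3's glue.create_trigger); the trigger and job names are hypothetical, and a SCHEDULED trigger is shown since Glue also supports ON_DEMAND, CONDITIONAL, and EVENT types:

```python
import json

# Hypothetical names; a request payload for the Glue CreateTrigger API.
trigger_request = {
    "Name": "nightly-etl-trigger",
    "Type": "SCHEDULED",
    "Schedule": "cron(0 2 * * ? *)",         # every day at 02:00 UTC
    "Actions": [{"JobName": "my-glue-job"}],  # the job this trigger starts
    "StartOnCreation": True,
}

print(json.dumps(trigger_request, indent=2))
```

With boto3, this payload would be passed as keyword arguments: boto3.client("glue").create_trigger(**trigger_request).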

 

spark.driver.memory

The spark.driver.memory property in AWS Glue specifies the amount of memory that is allocated to the Spark driver process. The driver process is responsible for managing the execution of the Spark job, and it also stores the state of the job.

The Spark default for spark.driver.memory is 1g. In AWS Glue, the driver memory is normally determined by the worker type you choose, but you may need more driver memory if your job collects large results to the driver or performs a lot of complex computations.

To increase the value of spark.driver.memory, you can pass it through the --conf job argument when you start your Glue job (note that AWS recommends overriding --conf only when necessary, because Glue sets these values internally). For example, the following command would request 2 GB of memory for the Spark driver process:

aws glue start-job-run --job-name my-glue-job --arguments '{"--conf": "spark.driver.memory=2g"}'

You can also set spark.driver.memory in the AWS Glue console. To do this, go to the Jobs page and select your job. Then, open the Job details tab (Script libraries and job parameters in the older console) and add a job parameter with the key --conf and the value spark.driver.memory=2g.
 

Once you have increased the value of spark.driver.memory, your Glue job will be able to use more memory and it may perform better. However, it is important to note that increasing the amount of memory allocated to the Spark driver process will also increase the cost of your Glue job.

Here are some tips for setting the spark.driver.memory property:

  • Start with the default value and increase it only as needed.
  • Consider the size of your data and the complexity of your computations when setting the value of spark.driver.memory.
  • Monitor the memory usage of your Glue job to ensure that it is not using too much memory.

spark.sql.shuffle.partitions


The spark.sql.shuffle.partitions property in AWS Glue specifies the number of partitions that are used for shuffling data during Spark SQL operations. The default value for spark.sql.shuffle.partitions is 200. However, you may need to increase or decrease this value depending on the size of your data and the complexity of your queries.

Increasing the number of shuffle partitions can improve the performance of your queries by distributing the data more evenly across the Spark workers. However, it can also increase the memory and disk usage of your Glue job.

Decreasing the number of shuffle partitions can improve the performance of your queries by reducing the amount of data that needs to be shuffled. However, it can also reduce the parallelism of your queries, which can lead to slower performance.

The best way to determine the optimal value for spark.sql.shuffle.partitions is to experiment with different values and see how it affects the performance of your queries.

Here are some tips for setting the spark.sql.shuffle.partitions property:

  • Start with the default value of 200 and increase it as needed.
  • Consider the size of your data and the complexity of your queries when setting the value of spark.sql.shuffle.partitions.
  • Monitor the performance of your Glue job to ensure that it is not using too much memory or disk space.
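One common rule of thumb (not an official Spark or Glue formula) is to size shuffle partitions so each holds roughly 100–200 MB of shuffled data, without going below the default of 200. A small helper sketching that heuristic:

```python
import math

def suggested_shuffle_partitions(shuffle_bytes, target_bytes=128 * 1024 * 1024):
    """Rough heuristic: one partition per ~128 MB of shuffled data,
    never going below Spark's default of 200 partitions."""
    return max(200, math.ceil(shuffle_bytes / target_bytes))

# A 100 GB shuffle suggests 800 partitions; tiny shuffles keep the default.
print(suggested_shuffle_partitions(100 * 1024**3))  # 800
print(suggested_shuffle_partitions(10 * 1024**2))   # 200
```

Treat the result as a starting point for experimentation, not a fixed answer.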

spark.sql.adaptive.enabled

 

The spark.sql.adaptive.enabled property in AWS Glue specifies whether or not adaptive query execution is enabled. Adaptive query execution is a feature of Spark SQL that dynamically optimizes the execution plan of a query based on runtime statistics.

In Spark 3.2 and later (used by AWS Glue 4.0), the default value for spark.sql.adaptive.enabled is true. On earlier runtimes, including the Spark 3.1 runtime of AWS Glue 3.0, it defaults to false and must be enabled explicitly.

If you disable adaptive query execution, Spark will use a static execution plan for your queries. This may result in slower performance for some queries.

Here are some of the benefits of enabling adaptive query execution:

  • Improved performance: Adaptive query execution can improve the performance of your queries by dynamically optimizing the execution plan based on runtime statistics.
  • Reduced resource usage: Adaptive query execution can reduce the resource usage of your Glue jobs by dynamically adjusting the number of partitions and other parameters.
  • Increased scalability: Adaptive query execution can increase the scalability of your Glue jobs by dynamically adjusting the execution plan to handle larger datasets.

If you are not sure whether or not to enable adaptive query execution, you can experiment with different settings and see how it affects the performance of your queries.

Here are some tips for setting the spark.sql.adaptive.enabled property:

  • Keep adaptive query execution enabled unless you observe a regression, and enable it explicitly on Spark versions where it is off by default.
  • Consider the size of your data and the complexity of your queries when setting the value of spark.sql.adaptive.enabled.
  • Monitor the performance of your Glue jobs to ensure that they are not using too much memory or disk space.
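To enable adaptive query execution explicitly on a runtime where it is off by default, pass spark.sql.adaptive.enabled through the --conf job parameter. A Glue job accepts only a single --conf argument, so multiple settings are commonly chained inside its value (a widely used workaround; the job name here is hypothetical):

```python
import json

# Chain several Spark settings inside the single "--conf" argument value.
conf_value = "spark.sql.adaptive.enabled=true --conf spark.sql.shuffle.partitions=400"

start_run_request = {
    "JobName": "my-glue-job",  # hypothetical job name
    "Arguments": {"--conf": conf_value},
}

print(json.dumps(start_run_request))
```

The same Arguments map works with the aws glue start-job-run CLI command or with boto3's start_job_run.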

 

spark.sql.join.preferSortMergeJoin

The spark.sql.join.preferSortMergeJoin property tells Spark SQL to prefer a sort-merge join over a shuffled hash join when both strategies are possible. The default value is true. Sort-merge join is more robust for large inputs because it can spill to disk, while a shuffled hash join can be faster when one side of the join is small enough to fit in memory.

 

spark.executor.memory

The spark.executor.memory property in AWS Glue specifies the amount of memory that is allocated to each executor. The executors are the worker processes that run on the worker nodes and execute the Spark tasks.

The Spark default for spark.executor.memory is 1g, but in AWS Glue the executor memory is normally determined by the worker type you choose. You may need more executor memory if your job is processing large amounts of data or performing a lot of complex computations.

To increase the value of spark.executor.memory, you can pass it through the --conf job argument when you start your Glue job. For example, the following command would request 2 GB of memory for each executor:

aws glue start-job-run --job-name my-glue-job --arguments '{"--conf": "spark.executor.memory=2g"}'

You can also set spark.executor.memory in the AWS Glue console. To do this, go to the Jobs page and select your job. Then, open the Job details tab (Script libraries and job parameters in the older console) and add a job parameter with the key --conf and the value spark.executor.memory=2g.


Delete data from a specific table in Athena using AWS Glue

 

Athena external tables are read-only views over data in Amazon S3, so there is no DELETE FROM transform in the awsglue library. To remove rows, a Glue job reads the table, filters out the rows to be deleted, and overwrites the underlying data. The script below is a sketch of that pattern; the filter condition and the output path are placeholders:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

def main():

    # Resolve the database, table, and output path from the job arguments.
    args = getResolvedOptions(sys.argv, ["database_name", "table_name", "output_path"])

    # Create a Glue context.
    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the table from the Glue Data Catalog.
    df = glue_context.create_dynamic_frame.from_catalog(
        database=args["database_name"],
        table_name=args["table_name"],
    ).toDF()

    # Keep only the rows that do NOT match the delete condition.
    remaining = df.filter("NOT (status = 'inactive')")  # placeholder condition

    # Overwrite the table's S3 location with the remaining rows.
    remaining.write.mode("overwrite").parquet(args["output_path"])

if __name__ == "__main__":
    main()

spark.driver.maxResultSize

 

The spark.driver.maxResultSize property in AWS Glue specifies the maximum total size of serialized results that the driver will accept for a single Spark action such as collect. If an action exceeds this limit, Spark aborts the job with an error, so the property often comes up when tuning Spark applications.

The default value for spark.driver.maxResultSize is 1g (1024 MB). You may need to increase it if your job collects large results back to the driver.

To increase the value of spark.driver.maxResultSize, you can pass it through the --conf job argument when you start your Glue job. For example, the following command would raise the limit to 2 GB:

aws glue start-job-run --job-name my-glue-job --arguments '{"--conf": "spark.driver.maxResultSize=2g"}'

You can also set spark.driver.maxResultSize in the AWS Glue console. To do this, go to the Jobs page and select your job. Then, open the Job details tab (Script libraries and job parameters in the older console) and add a job parameter with the key --conf and the value spark.driver.maxResultSize=2g.
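Spark size strings such as 512m or 2g use binary units (1g = 1024 MB). The helper below is only an illustration of that arithmetic, not a Spark API:

```python
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def size_to_bytes(size):
    """Convert a Spark-style size string like '2g' or '512m' to bytes."""
    size = size.strip().lower()
    if size[-1] in _UNITS:
        return int(size[:-1]) * _UNITS[size[-1]]
    return int(size)  # a bare number is already in bytes

print(size_to_bytes("2g"))    # 2147483648
print(size_to_bytes("512m"))  # 536870912
```

This makes it easy to compare a setting like 2g against the actual size of the results your job collects.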
 

enable-s3-parquet-optimized-committer

The --enable-s3-parquet-optimized-committer job parameter in AWS Glue enables the EMRFS S3-optimized committer for writing Parquet data into Amazon S3. This committer uses Amazon S3 multipart uploads instead of renaming files, which avoids slow and non-atomic rename operations on S3. In newer Glue versions the committer is enabled by default.

To enable the committer explicitly, set a job parameter with the key --enable-s3-parquet-optimized-committer and the value true.
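Like other Glue job parameters, this one is passed as a key/value pair; for example, in the Arguments map of a StartJobRun request (the job name here is hypothetical):

```python
import json

# Hypothetical job name; Arguments map for the Glue StartJobRun API.
start_run_request = {
    "JobName": "my-glue-job",
    "Arguments": {"--enable-s3-parquet-optimized-committer": "true"},
}

print(json.dumps(start_run_request))
```

The same key/value pair can be added on the Job details page of the console as a job parameter.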

 

A Glue script that checks the accuracy of the data in a dummy table using PySpark:

 

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

def main():

    # Create a Glue context.
    glue_context = GlueContext(SparkContext.getOrCreate())

    # Resolve the database and table names from the job arguments.
    args = getResolvedOptions(sys.argv, ["database_name", "table_name"])

    # Read the dummy table from the Glue Data Catalog.
    df = glue_context.create_dynamic_frame.from_catalog(
        database=args["database_name"],
        table_name=args["table_name"],
    ).toDF()

    # Accuracy: the fraction of rows where column1 equals column2.
    accuracy = df.filter(df["column1"] == df["column2"]).count() / df.count()

    # Print the accuracy of the data.
    print("The accuracy of the data is:", accuracy)

if __name__ == "__main__":
    main()
 
This script reads the data from the dummy table and calculates its accuracy: the fraction of rows in the table where the values in the column1 and column2 columns are equal. The script then prints that value.
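The accuracy calculation itself is just the fraction of matching rows. A plain-Python version of the same logic, using made-up sample rows in place of the table:

```python
# Made-up sample rows standing in for the dummy table.
rows = [
    {"column1": "a", "column2": "a"},
    {"column1": "b", "column2": "x"},
    {"column1": "c", "column2": "c"},
    {"column1": "d", "column2": "d"},
]

# Fraction of rows where column1 equals column2.
matching = sum(1 for r in rows if r["column1"] == r["column2"])
accuracy = matching / len(rows)

print(accuracy)  # 0.75
```

Three of the four sample rows match, so the script reports an accuracy of 0.75.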