You can trigger an AWS Glue job based on an event. The following mechanisms can be used to trigger a Glue job:
- CloudWatch Events: You can create a CloudWatch Events (now Amazon EventBridge) rule that triggers a Glue job when a specific event occurs. For example, you could trigger a Glue job when a new file is created in an S3 bucket (one way to wire this up is sketched after the steps below).
- Step Functions: You can use a Step Functions state machine to trigger a Glue job. For example, you could create a state machine that triggers a Glue job when a specific event occurs in another AWS service.
- Manually: You can manually trigger a Glue job by clicking the Run button in the AWS Glue console.
To trigger a Glue job based on an event, you will need to create a trigger in the AWS Glue console. When you create a trigger, you will need to specify the event that you want to trigger the job on, as well as the job that you want to run.
Here are the steps on how to trigger a Glue job based on an event:
- Go to the AWS Glue console.
- Click Triggers.
- Click Create trigger.
- In the Event type section, select the event that you want to trigger the job on.
- In the Job section, select the job that you want to run.
- If you chose a scheduled trigger, specify the schedule in the Schedule section; event-based and on-demand triggers do not need one.
- Click Create.
Once you have created the trigger, the Glue job will be triggered whenever the specified event occurs.
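As one concrete way to wire up the S3 example above, you can have the bucket send ObjectCreated event notifications to a Lambda function that starts the Glue job. The sketch below assumes such a notification is already configured; the job name my-glue-job and the --input_path argument are placeholders:
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Pull the bucket and key of the newly created object out of the S3 event.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Start the Glue job and hand it the new object's location.
    glue.start_job_run(
        JobName="my-glue-job",
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )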
spark.driver.memory in AWS Glue
The spark.driver.memory property in AWS Glue specifies the amount of memory that is allocated to the Spark driver process. The driver process is responsible for managing the execution of the Spark job, and it also stores the state of the job.
In open-source Spark the default value for spark.driver.memory is 1g; in AWS Glue the driver memory is determined by the worker type you choose. You may need to increase this value if your job is processing large amounts of data or collecting large results back to the driver.
To increase the value of spark.driver.memory, you can pass the --conf job parameter when you start your Glue job. For example, the following command would allocate 1 GB of memory to the Spark driver process:
aws glue start-job-run --job-name my-glue-job --arguments '{"--conf":"spark.driver.memory=1g"}'
You can also set this in the AWS Glue console: open your job, go to the job parameters section, and add a parameter with the key --conf and the value spark.driver.memory=1g.
Once you have increased the value of spark.driver.memory, your Glue job will be able to use more memory on the driver and it may perform better. However, keep in mind that moving to a larger worker type to get more driver memory will also increase the cost of your Glue job.
Here are some tips for setting the spark.driver.memory property:
- Start with the default for your worker type and increase it only if you see driver memory errors.
- Consider the size of your data and the complexity of your computations when setting the value of spark.driver.memory.
- Monitor the memory usage of your Glue job to ensure that it is not using too much memory.
spark.sql.shuffle.partitions
The spark.sql.shuffle.partitions property in AWS Glue specifies the number of partitions that are used for shuffling data during Spark SQL operations. The default value for spark.sql.shuffle.partitions is 200. However, you may need to increase or decrease this value depending on the size of your data and the complexity of your queries.
Increasing the number of shuffle partitions can improve the performance of your queries by spreading the shuffled data across more, smaller tasks, which reduces the memory pressure on each task. However, too many partitions adds task-scheduling overhead and can produce many small output files.
Decreasing the number of shuffle partitions reduces that overhead and produces fewer, larger files. However, it can also reduce the parallelism of your queries, which can lead to slower performance or memory pressure on individual tasks.
The best way to determine the optimal value for spark.sql.shuffle.partitions is to experiment with different values and see how it affects the performance of your queries.
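For example, you can override the value from inside your Glue script before running your queries; a minimal sketch (the value 400 is only an illustration to experiment with):
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create the Glue context and get the underlying Spark session.
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Use more shuffle partitions for a wide join or aggregation over a large dataset;
# 400 is only an illustrative starting point to experiment with.
spark.conf.set("spark.sql.shuffle.partitions", "400")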
Here are some tips for setting the spark.sql.shuffle.partitions property:
- Start with the default value of 200 and adjust it up or down as needed.
- Consider the size of your data and the complexity of your queries when setting the value of spark.sql.shuffle.partitions.
- Monitor the performance of your Glue job to ensure that it is not using too much memory or disk space.
spark.sql.adaptive.enabled
The spark.sql.adaptive.enabled property in AWS Glue specifies whether or not adaptive query execution is enabled. Adaptive query execution is a feature of Spark SQL that dynamically optimizes the execution plan of a query based on runtime statistics.
The default value for spark.sql.adaptive.enabled is true in Spark 3.2 and later, so adaptive query execution is enabled by default in current versions of AWS Glue.
If you disable adaptive query execution, Spark will use a static execution plan for your queries. This may result in slower performance for some queries.
Here are some of the benefits of enabling adaptive query execution:
- Improved performance: Adaptive query execution can improve the performance of your queries by dynamically optimizing the execution plan based on runtime statistics.
- Reduced resource usage: Adaptive query execution can reduce the resource usage of your Glue jobs by dynamically adjusting the number of partitions and other parameters.
- Increased scalability: Adaptive query execution can increase the scalability of your Glue jobs by dynamically adjusting the execution plan to handle larger datasets.
If you are not sure whether or not to enable adaptive query execution, you can experiment with different settings and see how it affects the performance of your queries.
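For example, you can turn adaptive query execution off for a single run by passing the setting through the --conf job parameter; the job name below is a placeholder:
aws glue start-job-run --job-name my-glue-job --arguments '{"--conf":"spark.sql.adaptive.enabled=false"}'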
Here are some tips for setting the spark.sql.adaptive.enabled property:
- Start with the default value of true and only disable it if you can show that a static plan performs better for your workload.
- Consider the size of your data and the complexity of your queries when setting the value of spark.sql.adaptive.enabled.
- Monitor the performance of your Glue jobs to ensure that they are not using too much memory or disk space.
spark.sql.join.preferSortMergeJoin
The spark.sql.join.preferSortMergeJoin property tells Spark SQL to prefer a sort-merge join over a shuffled hash join when both strategies are possible. The default value is true. Sort-merge joins are generally more robust for large inputs, while a shuffled hash join can be faster when one side of the join is small enough to hash.
Separately from the Spark join strategy, you can also scale a Glue job by adding worker nodes. Once you have increased the number of worker nodes, your Glue job will be able to use more resources and it may perform better. However, it is important to note that increasing the number of worker nodes will also increase the cost of your Glue job.
spark.executor.memory
The spark.executor.memory property specifies the amount of memory allocated to each Spark executor. In open-source Spark the default is 1g; in AWS Glue the executor memory is determined by the worker type you choose. You may need to increase this value if your job is processing large amounts of data or performing a lot of complex computations.
To increase the value of spark.executor.memory, you can pass the --conf job parameter when you start your Glue job. For example, the following command would allocate 2 GB of memory to each executor:
aws glue start-job-run --job-name my-glue-job --arguments '{"--conf":"spark.executor.memory=2g"}'
You can also set this in the AWS Glue console: open your job, go to the job parameters section, and add a parameter with the key --conf and the value spark.executor.memory=2g.
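You can do the same thing programmatically; a minimal boto3 sketch, again with a placeholder job name:
import boto3

glue = boto3.client("glue")

# Start the job with a larger executor memory setting for this run only.
glue.start_job_run(
    JobName="my-glue-job",
    Arguments={"--conf": "spark.executor.memory=2g"},
)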
Delete data from a specific table in Athena using AWS Glue
AWS Glue does not provide a DeleteFromTable transform, and Athena tables backed by plain S3 files do not support in-place DELETE, so a common pattern is to read the table, filter out the rows you want to remove, and rewrite the data. The script below is a sketch of that pattern; the database name, job arguments, and output path are placeholders you would replace with your own:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

def main():
    # Resolve the job arguments: the table to delete from, the filter
    # condition describing the rows to remove, and where to write the result.
    args = getResolvedOptions(sys.argv, ["table_name", "condition", "output_path"])

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the table from the Glue Data Catalog (the same catalog Athena queries).
    df = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",
        table_name=args["table_name"],
    ).toDF()

    # Keep only the rows that do NOT match the delete condition,
    # for example condition = "year < 2020".
    remaining = df.filter(f"NOT ({args['condition']})")

    # Write the remaining rows to the output path. Writing to a new location and
    # then repointing or replacing the table data is safer than overwriting the
    # path you are still reading from.
    remaining.write.mode("overwrite").parquet(args["output_path"])

if __name__ == "__main__":
    main()
spark.driver.maxResultSize
The spark.driver.maxResultSize property in AWS Glue specifies the upper limit on the total size, in bytes, of serialized results that can be returned to the driver by a Spark action such as collect(). Tuning this property can also play a part in the performance tuning of a Spark application.
The default value for spark.driver.maxResultSize is 1024 MB. However, you may need to increase this value if your job is processing large amounts of data or if it is performing a lot of complex computations.
To increase the value of spark.driver.maxResultSize, you can pass the --conf job parameter when you start your Glue job. For example, the following command would raise the limit to 2 GB:
aws glue start-job-run --job-name my-glue-job --arguments '{"--conf":"spark.driver.maxResultSize=2g"}'
You can also set this in the AWS Glue console: open your job, go to the job parameters section, and add a parameter with the key --conf and the value spark.driver.maxResultSize=2g.
enable-s3-parquet-optimized-committer
The --enable-s3-parquet-optimized-committer job parameter in AWS Glue enables the EMRFS S3-optimized committer for writing Parquet data into Amazon S3. This committer uses Amazon S3 multipart uploads instead of renaming files, which avoids slow copy-and-delete renames and usually reduces the number of HEAD/LIST requests significantly.
To enable it, add a job parameter with the following key and value:
Key: --enable-s3-parquet-optimized-committer
Value: true
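If you create or update jobs programmatically, you can set it as a default argument so that every run picks it up; a minimal boto3 sketch in which the job name, role, and script location are placeholders:
import boto3

glue = boto3.client("glue")

# Create a Spark ETL job with the committer enabled by default for every run.
glue.create_job(
    Name="my-parquet-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my_script.py",
    },
    DefaultArguments={"--enable-s3-parquet-optimized-committer": "true"},
)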
Glue script to check the accuracy of our data using PySpark for a dummy table:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

def main():
    # Resolve the job arguments: the catalog database and the dummy table to check.
    args = getResolvedOptions(sys.argv, ["database_name", "table_name"])

    # Create a Glue context.
    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the dummy table from the Glue Data Catalog.
    df = glue_context.create_dynamic_frame.from_catalog(
        database=args["database_name"],
        table_name=args["table_name"],
    ).toDF()

    # Calculate the accuracy of the data: the fraction of rows
    # where column1 and column2 hold the same value.
    accuracy = df.filter(df["column1"] == df["column2"]).count() / df.count()

    # Print the accuracy of the data.
    print("The accuracy of the data is:", accuracy)

if __name__ == "__main__":
    main()
This script reads the data from the dummy table and calculates its accuracy, defined here as the fraction of rows in which the column1 and column2 values are equal. The script then prints the result.