Amazon S3 Select

Learn via video courses
Topics Covered

Overview

Amazon S3 Select is an add-on feature available in the Amazon S3 service for querying single objects using standard SQL. Amazon S3, or Simple Storage Solution, is the storage solution provided by AWS where we can store data, commonly referred to as objects, in the S3 bucket. Customers can implement Amazon S3 Select using code with SQL expressions in their application. An Amazon S3 select is mainly used to quickly retrieve an object from an Amazon S3 bucket.

What is S3 Select?

Amazon S3 Select is a feature introduced by Amazon in the year 2018.

Amazon S3 Select is used to retrieve the subset of data from an object residing in S3 based on the SQL expression that we have given. Customers should have s3 permission before querying the object using Amazon S3 Select.

For example

Consider if we have a 1 GB size object and customers want to query the data for a particular row in an object.

Before Amazon S3 Select

  • Customers need to download the object from S3 and export it to some application and do the query. Depending upon the size and file, the result will return.
  • Optionally, they can use Amazon Athena, but it will take some time to query the data and return the result.

Before Amazon S3 Select

After Amazon S3 Select

  • Customers can query the objects with predefined values such as row and column as parameters in the S3 select command in the code.
  • It will scan the object or data, filter it quickly, and return the expected result fastly.

After Amazon S3 Select

Filtering and Retrieving Data Using Amazon S3 Select

  • The data returned by the Amazon S3 console is limited to 40 MB only. So if the return data size is supposed to be more than 40 MB, use the AWS SDK or AWS CLI.
  • Amazon S3 Select supports CSV, JSON, and Apache Parquet as input formats.
  • Amazon S3 Select supports only CSV or JSON as output formats. It does not support Parquet output.
  • The maximum length of the SQL expression supported by Amazon S3 select is 256 KB.
  • We should use HTTPS and an encryption key in the request if the objects are encrypted with a customer-provided encryption key (SSE-C).
  • If any issue occurs while running or performing the query, Amazon S3 Select returns an error code
  • Customers can scan a subset of the object by specifying bytes and a query range with the ScanRange parameter.
  • Amazon S3 select uses an input-serialization method for how the object should be queried and an output-serialization method for how the object should return the results.

Features of S3 Select

General Availability

  • S3 Select is generally available in all the AWS regions. All AWS customers can use the S3 Select service. Depending on the scanned and returned data, the cost will vary.

Serverless

  • S3 Select is a serverless service, meaning that customers don’t have to provide any resources to run the query.
  • Customers can run the query on the S3 console. AWS will handle all the backend work and return the results.

Supported Format

  • S3 Select supports three formats. They are CSV, JSON, and Parquet.

    • Input-Serialization Setting

      Format: CSV, JSON, Apache Parquet.

      Delimiter-Comma, Tab, and custom.

      Compression-None, Gzip, Bzip2.

    • Output-Serialization Setting

      Format—JSON or CSV.

      Delimiter - CSV or Tab or custom.

Integration of AWS Services

  • Customers can run the S3 Select command over a code in a lambda function and invoke the lambda. The lambda will return the s3 select output as a result.

  • S3 Select also works with Amazon EMR Service.

  • S3 Select helps customers reduce the query time in a data-intensive application.

Effective Performance

  • Customers can accelerate their application performance and reduce their costs by querying an object using Amazon S3 select.

Data Types

  • The Amazon S3 select treats all the input data as a string unless the CAST function is defined.

CAST: It converts one data type to another.

Supported Data Type

NoS3 Select Supported Data TypesParquet Supported Data Types
1boolDATE
2int, integerINT(8), INT(16), INT(32), INT(64)
3stringSTRING
4floatLIST,ENUM
5decimal, numericDECIMAL
6TimestampTIMESTAMP

Examples

We have two options to perform an S3 Select query on S3 objects. They are the AWS SDK and the Rest APIs.

Rest API’s

If any of their applications are required, customers can send REST requests directly to S3.

AWS SDKs

AWS provides AWS SDK support for four languages: Java, Javascript, Ruby, and Python.

Illustration

Perform S3 Select Query Using the AWS SDK With the Help of AWS Lambda

In this example, we are using the AWS SDK for Python in Lambda.

Requirement

CSV file uploaded to an S3 bucket AWS account with Lambda, IAM, and S3 access.

Step 1: Create a lambda function with appropriate permissions.

Go to the AWS Lambda console and select functions in the left navigation pane, then click Create a Function, which is showing on the right side.

Perform S3 Select Query

Step 2: Enter the below-mentioned details

Option 1: Author From Scratch Name: s3-select-demo Runtime: Python 3.9 Architecture: X86_64 Permission: Create a new role from AWS policy templates

Step -2

Role Name: S3-bucket-permission

Policy Name: Amazon S3 object read-only permissions

Step -2

Select and enter the above value and select the Create function button.

Step 3: Enter the following code and run it.

Click the deploy button and run it after the lambda function is deployed.

Step-3

Go to the executive result tab, and we can see that the required output is shown below.

Step-3

Troubleshooting

Check the following items if an error occurs.

  1. S3 Get object permission in the lambda role permission

  2. S3 Bucket policy granting get object permission.

  3. Correct the bucket name and key name of the object mentioned in the lambda code.

  4. Check whether the IAM user has had lambda invoke permission or not.

  5. Check whether the SQL expression in the lambda code is valid or not.

Using S3 Select to Perform a Query

We can perform the query using the console, AWS SDK, or REST APIs.

Illustration 

Amazon S3 Select using AWS Console 

Requirement: CSV file uploaded to an S3 bucket.

Go to the AWS S3 console and upload a CSV file.

Amazon S3 Select using AWS Console

Once the file is successfully uploaded, click the query with the S3 select option mentioned in the action button.

Amazon S3 Select using AWS Console

Select the below options as mentioned in the below image.

Input Setting

Format: CSV

Delimiter-Comma

Compression-None

Output Setting

Format: CSV

Delimiter-CVV

Amazon S3 Select using AWS Console

Enter the below query

Amazon S3 Select using AWS Console

The above query will return the result as shown in the below image.

Amazon S3 Select using AWS Console

Furthermore, AWS provides some predefined queries to support basic use cases as well.

Amazon S3 Select using AWS Console

Cost Of S3 Select

Amazon S3 Select the cost for scanning the data across the standard storage class costs $0.0025. The cost for the data returned by the Amazon S3 Select will vary.

The below price is based on the AWS Mumbai region.

Amazon S3 Select = $0.0004 per 1000 SELECT commands

S3 ClassData scanned (price per GB)Data returned (price per GB)
S3 Standard$0.0025$0.0009
S3 Intelligent - Tiering$0.0025$0.0009
S3 Standard - Infrequent Access$0.0025$0.01
S3 One Zone - Infrequent Access$0.0025$0.0

An example scenario of Amazon S3 Select's cost

Assume that the customer has 1 GB of storage in S3 Standard. The object size is also 1 GB. After querying the data using S3 select, the data scanned and returned by S3 select is also 1GB. Calculate the total cost incurred by S3 Select.

The calculation is as follows:

S3 Select command= $0.0004 per 1000 commands Data scanned by S3 Select is $0.0025 per GB. Data returned by S3 Select= $0.0009 per GB

Total = 0.0004+0.0004 + 0.0025 + 0.0009=>0.0009 => 0.0038

Benefits and Limitations of S3 Select

  • Amazon S3 Select helps customers filter the data and perform a scan of the data in the storage layer where the data lives.
  • Amazon S3 Select provides the solution to the querying job quicker and more cost-effectively.
  • Amazon S3 Select plays an important role in accelerating performance in Big Data.
  • Amazon S3 Select helps customers and developers access the data in a way that is much more optimized for analytics and data lakes environments.
  • Customers can store data in S3 as tables and scan across those tables using Amazon S3 select. It will return the relevant results.
  • The larger the size of the object, the higher the time for the query.
  • It is not suitable for complex analysis and joins expression.
  • Amazon S3 Select is serverless and highly available, so customers don’t have to worry about managing the infrastructure for querying in S3.
  • AWS suggests that customers go with S3 Select instead of the S3 Get Action API for better results and application performance.

Amazon S3 Select vs Athena

NoS3 SelectAthena
1Query one object at a timeQuery multiple objects or an entire bucket at a time
2Limited to SELECT ClauseSupports multiple clauses such as DDL and DML.
3It does not support Joins and GroupingsIt supports Joins and Grouping
4It provides Optimized performance for larger dataset Objects in a single query time.It is used in various domains such as data analytics and Data pipelines.
5It provides better results for querying a single large object than Athena.It is considered one of the big data tools. It typically supports storage capacities ranging from gigabytes to petabytes.
6Only the S3 data source is supported.It supports various data sources.
7It is a serverless service, which means there is no requirement to manage any infrastructure.It is also serverless. To perform big data tasks, it uses a presto engine under the hood.

Amazon S3 Select vs Presto

NoS3 SelectPresto
1It is a proprietary service provided by AWS It is an Open Source Distributed Query Engine governed by Linux Foundation and some member companies
2It supports data sources as Amazon S3 onlyIt supports nearly 26 open-source connectors
3It has limited SQL dialectIt has a comprehensive SQL Dialect
4It supports CSV, JSON, and Apache Parquet format only.It supports many formats such as SequenceFile, RCFile, ORC, Delimited, CSV JSON, Parquet, and Avro.
5It works for a single large object.It usually works with large-scale data applications and data domains such as Big Data and Data Analytics
6The S3 select command can be used to query AWS S3 and AWS Outpost S3.It supports running on-premises or in the cloud (AWS, Azure, GCP, or OCI)
7The pricing depends on the scanning and returning size of the object.There is no cost for downloading and using the engine.

Conclusion

  1. The Amazon S3 Select feature is simple to use, faster in performance, and cheaper when compared to doing the query manually.
  2. Amazon S3 Select SQL expressions are highly optimized. As a result, it will take much less time than any other tool, such as Amazon Athena.
  3. Customers must have S3 Object permission to query with the S3 Select.
  4. Customers can utilize S3 Select using the AWS SDK for Java, AWS SDK for Python, AWS Console, Rest APIs, and AWS CLI.
  5. Amazon S3 Select is widely implemented in the domains of data analytics, data warehouse, data streaming, data pipelines, and Big Data.