Accessing files in S3 using Pre-signed URLs

    Overview

    A user can access a file in Amazon S3 only if at least one of the following is true:

    1. The user owns the file.
    2. The file has been made public.
    3. The file has been shared with that IAM user.

    However, using an AWS SDK, a user with access to the file can generate a pre-signed URL, which allows anyone who holds the URL to access or download the file until the URL expires. This approach is ideal for software applications or processes that need only brief access to the file to consume its contents.

    Warning

    Please be cautious when sharing a pre-signed URL.

    By default, a pre-signed URL is valid for 3600 seconds (one hour). Use a much shorter duration where possible, sized to the time your process or user actually needs to consume the file.

    Read "Sharing an object with a presigned URL" for more information.



    Generate a pre-signed URL

    For this tutorial, a Python Script component is used in conjunction with the supported Boto3 package to generate a pre-signed URL. The Python Script component already has access to the AWS credentials that have been assigned to this Matillion ETL instance. Boto3 will use these credentials to generate the pre-signed URL for a resource/file in S3.

    The Python script below generates a URL (_uri) and assigns it to the variable s3_uri, which can then be used in the Orchestration Job to access the file. Initialize the variables bucket, file_key, and uri_duration as appropriate.

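    import boto3

    bucket = 'mtln-public-data'     # name of the S3 bucket
    file_key = 'Samples/books.xml'  # object key, including any folder path
    uri_duration = 10               # expiry duration in seconds (default is 3600)

    # Boto3 uses the AWS credentials assigned to this Matillion ETL instance.
    s3Client = boto3.client('s3')

    # Sign a GET request for the object; the URL stops working after uri_duration seconds.
    _uri = s3Client.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': file_key},
        ExpiresIn=uri_duration
    )

    # Store the URL in the s3_uri variable so later components can use it.
    context.updateVariable('s3_uri', _uri)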
    
    


    Please Note

    The generated URL contains enough information to give anyone who holds it access to the file. By default, the URL is valid for 3600 seconds; however, the Python script above limits the URL's validity to just 10 seconds via the uri_duration variable. Edit this value if necessary.
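
A small guard around uri_duration can catch an accidentally long lifetime before the value reaches ExpiresIn. The sketch below is illustrative only: the helper name checked_duration and the 60-second ceiling MAX_DURATION are assumptions made for this example, not part of Boto3 or Matillion ETL.

```python
# Hypothetical helper (not part of Boto3 or Matillion ETL): validate the expiry
# before passing it to generate_presigned_url(..., ExpiresIn=...).

MAX_DURATION = 60  # assumed policy ceiling in seconds; choose one that suits your process


def checked_duration(seconds, maximum=MAX_DURATION):
    """Return seconds as an int if it lies in the range 1..maximum, else raise ValueError."""
    seconds = int(seconds)
    if seconds <= 0:
        raise ValueError("expiry duration must be a positive number of seconds")
    if seconds > maximum:
        raise ValueError(
            "expiry duration %d exceeds the allowed maximum of %d" % (seconds, maximum)
        )
    return seconds


uri_duration = checked_duration(10)
```

With this in place, leaving the duration at the 3600-second default would raise an error rather than silently producing a URL that stays valid for an hour.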



    Example

    Why not try this yourself?

    1. On this page, under Attachments, there is a sample JSON file and RSD file that can be used with the API Query component. Click on these attachments to download them.

    2. Upload the JSON file into an S3 bucket that can be accessed from Matillion ETL.

    3. In Matillion ETL, click Project → Manage API Profiles → Manage Query Profiles. Create a new profile and paste the RSD definition.

    4. Create an Orchestration Job.

    5. Create a job variable. Set the name as s3_uri. Set the data type as Text.

    6. Add the Python Script component to the Matillion ETL canvas, as in the image above. Modify the Python script from above to point at the S3 bucket and file location relevant to your usage.

    7. Add an API Query component to the canvas. Modify the properties to point at the RSD profile created in step 3.

    8. Set the Connection Options property in API Query as shown in the image above to pass the variable s3_uri as a parameter.



    Sample output

    File in S3:

    s3://mtln-public-data/Samples/airports.json

    Pre-signed URL:

    https://mtln-public-data.s3.amazonaws.com/Samples/airports.json?AWSAccessKeyId=AKIAJSVY7VZTAUN42OMQ&Expires=1499951483&Signature=uBh3ozU8Z4pI%2B8BM3CcE29xqH%2FY%3D
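
To see what the sample URL carries, its parts can be pulled apart with Python's standard library. This is a minimal sketch that assumes the query-string layout shown above (AWSAccessKeyId, Expires, Signature) and a virtual-hosted bucket name; URLs signed with Signature Version 4 use different parameter names, so the function describe_presigned_url, a hypothetical name chosen for this example, would need adjusting for them.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs

SAMPLE_URL = (
    "https://mtln-public-data.s3.amazonaws.com/Samples/airports.json"
    "?AWSAccessKeyId=AKIAJSVY7VZTAUN42OMQ&Expires=1499951483"
    "&Signature=uBh3ozU8Z4pI%2B8BM3CcE29xqH%2FY%3D"
)


def describe_presigned_url(url):
    """Split a pre-signed URL (query-string layout shown above) into its parts."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # percent-encoded values are decoded here
    return {
        # Virtual-hosted style: the bucket name is the host prefix before ".s3".
        "bucket": parts.hostname.split(".s3", 1)[0],
        "key": parts.path.lstrip("/"),
        "access_key_id": query["AWSAccessKeyId"][0],
        # Expires is a Unix timestamp (seconds since the epoch, UTC).
        "expires_at": datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc),
        "signature": query["Signature"][0],
    }


info = describe_presigned_url(SAMPLE_URL)
print(info["bucket"], info["key"])
print("expires at", info["expires_at"].isoformat())
print("already expired:", info["expires_at"] < datetime.now(timezone.utc))
```

Everything needed to fetch the file travels in the URL itself, which is why the expiry time is the main control over how long a shared link remains usable.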
    
    Attachments