S3 Event Type (Beta)

The S3 event type provides the ability to query data directly from AWS S3 buckets. Use your AWS API credentials and specify the S3 bucket to pull data through the event type.

📘

Contact LogicHub to enable the configuration.


To fetch data from AWS S3, you need to create a connection and an event type. Let's look at how to create a connection and an event type for S3.

Create a Connection for S3

  1. Go to My Library > Connections.
  2. Click New.
  3. Select s3 for AWS as the Connection Type.
    • Enter the following credentials in the add connection form:
      • Access Key: Enter a valid AWS access key.
      • Secret: Enter a valid AWS secret key.
      • URL: Provide a valid URL: s3.amazonaws.com or s3-{region}.amazonaws.com. To set a default region for the S3 connection, use s3-{region}.amazonaws.com. We recommend using a region-specific URL for better download speed.
  4. Click Save.
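
For example, a connection that defaults to the us-west-2 region might use values like the following. This is only an illustrative sketch; the access key shown is a placeholder, not a real credential.

Access Key: AKIAXXXXXXXXXXXXXXXX
Secret: <your AWS secret key>
URL: s3-us-west-2.amazonaws.com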

Create an S3 Event Type

  1. Go to My Library > Event Types.
  2. Click New.
  3. Enter the following details in the event type form:
    • Name: Enter a name to identify the event type. The name can consist of alphanumeric characters and underscores ( _ ). The first character can't be a number.
    • Source: Click Query to base the event type on a query to an external connection.
    • Connection: Choose a connection from the drop-down.
  4. In the query field, enter a query in JSON format with the mandatory keys bucket, fileFormat, and query, and then click Save. The following is a sample query.
{
  "bucket": "lhub-cloudtrail-logs",
  "fileFormat": "json",
  "query": "select * from {{table}} limit 10",
  "keyPattern": "AWSLogs/001234567890/CloudTrail/ap-northeast-1/{{yyyy}}/{{MM}}/{{dd}}/",
  "timestampColumn": "Records_eventTime",
  "timestampPattern": "yyyy-MM-dd'T'HH:mm:ss'Z'"
}

The following is the complete list of JSON keys and their descriptions that can be used in the query to specify the data to pull.


bucket (Mandatory)
[String]

S3 bucket name to fetch the data from.

fileFormat (Mandatory)
[String]

Acceptable file formats are csv, csv.gz, json, json.gz, txt, log, txt.gz, log.gz, and parquet.

  • You can read any of these formats as txt. Formats other than those listed are not accepted.

query (Mandatory)
[String]

Spark SQL query to run on the data you are fetching; {{table}} is an internal placeholder that must be used (see example query above).

keyPattern
[String]

The prefix of objects you want to download from the S3 bucket.

For example, if the key pattern is AWSLogs/001234567890/CloudTrail/ap-northeast-1/{{yyyy}}/{{MM}}/{{dd}}/ and, in the playbook, the start time is '2020-08-20 00:00:00' and the end time is '2020-08-21 05:00:00', data from the following partitions will be pulled:

AWSLogs/001234567890/CloudTrail/ap-northeast-1/2020/08/20
AWSLogs/001234567890/CloudTrail/ap-northeast-1/2020/08/21

That is, two days of data will be downloaded. If timestampColumn is provided, the data will then be filtered to the range '2020-08-20 00:00:00' to '2020-08-21 05:00:00'.

Here, /{{yyyy}}/{{MM}}/{{dd}}/ is the format of the date-time based partition information. The complete set of placeholders you can include is:

  • yyyy: four-digit year
  • MM: two-digit month (01=January)
  • dd: two-digit day of the month (01 through 31)
  • HH: two-digit hour (00 through 23; am/pm is not allowed). The maximum granularity supported is an hour.

Note:

  • You don't need to specify an actual date in the key pattern; only the format of your prefix is required. The actual time information is pulled from the playbook time period picker or the stream batch length.
  • If you don't provide a key pattern, everything inside the bucket will be downloaded, which is not recommended.

timestampColumn
[String]

Timestamp column to infer lhub_ts from.

  • If timestampColumn is not provided, then the lhub_ts column will remain blank.
  • By default, the "yyyy-MM-dd HH:mm:ss" pattern is used to parse timestampColumn.
  • Rows not matching the default pattern will be omitted.

timestampPattern
[String]

The pattern to parse timestamp from the timestamp column.

  • Rows not matching the specified pattern will be omitted and will not appear in the results.

region
[String]

AWS region

  • Used to specify bucket region.
  • If you specify a region in the connection URL (s3-{region}.amazonaws.com), a region specified in the query JSON overrides it.
  • If the region specified in the connection URL does not match the bucket's region, an error is thrown.

multiLineJson
[Boolean]

If the specified file is JSON, by default every new line is considered a complete JSON object. In some cases, a single JSON object may span multiple lines. In such cases, set this key to true to parse multiline JSON files.

header
[Boolean]

By default, header is false, so the first line of the CSV file is not treated as a header row. You can change this using the header key in the query JSON.

columnNames
[List[String]]

By default, the first line of a CSV file is not treated as a header row. Instead, you can provide columnNames to be used as the column headers.

  • If columnNames is specified, it takes priority and supersedes the header key.

📘

The maximum size that can be pulled from S3 at once is 50 GB.
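
For reference, a query that combines several of the optional keys described above might look like the following sketch. The bucket name, key pattern, region, and timestamp column here are hypothetical, not values from your environment.

{
  "bucket": "example-csv-logs",
  "fileFormat": "csv.gz",
  "query": "select * from {{table}}",
  "keyPattern": "logs/{{yyyy}}/{{MM}}/{{dd}}/{{HH}}/",
  "region": "us-west-2",
  "header": true,
  "timestampColumn": "eventTime",
  "timestampPattern": "yyyy-MM-dd HH:mm:ss"
}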

Example of an S3 Event Type to Pull CloudTrail Logs

Let's assume the following bucket directory structure:

  • Bucket name: test_bucket
  • Directory structure: Logs/CloudTrail/2020/08/20/
  • Two files exist in the directory: file1.json.gz and file2.json.gz
  • The query to fetch just file1.json.gz and file2.json.gz is:
{"bucket" : "test_bucket",
 "fileFormat" : "json.gz", 
"query": "select * from {{table}}", 
"keyPattern" : "Logs/CloudTrail/2020/08/20/"}
  • Now, let's say you want to fetch data for specific days only.
{"bucket" : "test_bucket",
 "fileFormat" : "json.gz", 
"query": "select * from {{table}}", 
"keyPattern" : "Logs/CloudTrail/{{yyyy}}/{{MM}}/{{dd}}/"}
  • The above query will pull data only for the start and end time specified through the playbook or stream batch length.
    • Let's say the start time is 2020-08-12 12:00:00 and the end time is 2020-08-14 01:00:00.
    • This will download all data from three subfolders: Logs/CloudTrail/2020/08/12/, Logs/CloudTrail/2020/08/13/, and Logs/CloudTrail/2020/08/14/.
    • If you need more granular filtering over the downloaded data, specify the correct timestamp column and timestamp pattern; even though data is downloaded for three key prefixes, only data within the specified time range is returned in the result (see the sketch after this list).
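
For example, adding a timestamp column and pattern to the query above might look like the following sketch. The column name Records_eventTime is taken from the earlier CloudTrail sample and is an assumption for this bucket.

{
  "bucket": "test_bucket",
  "fileFormat": "json.gz",
  "query": "select * from {{table}}",
  "keyPattern": "Logs/CloudTrail/{{yyyy}}/{{MM}}/{{dd}}/",
  "timestampColumn": "Records_eventTime",
  "timestampPattern": "yyyy-MM-dd'T'HH:mm:ss'Z'"
}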

Handling and Options for Different File Formats

  • JSON type

    • json and json.gz are handled as the JSON type

    • By default, every new line in a JSON file is considered a complete JSON object, which is a standard logging practice. If the file looks like the following:

      {"key1":"value1"}
      {"key2":"value2"}

      value1 and value2 will be in two columns, key1 and key2 respectively.

    • If the JSON looks like {"Results":[{"key1":"value1"},{"key2":{"key3":"value3","key4":"value4"}}]} by default, it will be flattened into columns:
      Results_key1, Results_key2_key3, Results_key2_key4

    • Any line carrying a malformed JSON will be skipped.

    • Handling multiline JSON: Multiline JSON can only be read if it is a complete JSON as a whole, that is, a large nested multiline JSON object or a list of multiline JSON objects. To read such files, pass multiLineJson as true in the query, for example:

{"bucket" : "test_bucket",
 "fileFormat" : "json.gz", 
"query": "select * from {{table}}", 
"keyPattern" : "Logs/CloudTrail/{{yyyy}}/{{MM}}/{{dd}}/",
"multiLineJson": true}
  • CSV type
    • csv and csv.gz are handled as the CSV type
    • By default, the first line is not considered a header row for the CSV file. Let's take the following CSV as an example:
value1,value2
a,b
c,d

By default, it will be read as:

_c0     _c1
value1  value2
a       b
c       d

If your first line is a header, specify "header": true in the query JSON. The CSV will then be read as:

value1  value2
a       b
c       d

If you want to replace the column names, specify "columnNames": ["col1","col2"]. The CSV will then be read as:

col1    col2
value1  value2
a       b
c       d

📘

Specifying fewer or more column names than the number of columns in the CSV will throw an error.
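
Putting this together, a query for a CSV file with a header row might look like the following sketch. The bucket name and key pattern are placeholders; to rename the columns instead, replace "header": true with "columnNames": ["col1","col2"].

{
  "bucket": "test_bucket",
  "fileFormat": "csv",
  "query": "select * from {{table}}",
  "keyPattern": "Logs/csv/{{yyyy}}/{{MM}}/{{dd}}/",
  "header": true
}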

  • txt type

    • log, log.gz, txt, and txt.gz are handled as the txt type
    • The txt format splits data on every new line.
    • A timestamp column cannot be extracted for this type, so no filtering beyond folder partitioning is available.
  • parquet type

    • Standard Spark DataFrame format.
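
A minimal query for parquet files might look like the following sketch; the bucket name and key pattern are hypothetical.

{
  "bucket": "test_bucket",
  "fileFormat": "parquet",
  "query": "select * from {{table}}",
  "keyPattern": "Logs/parquet/{{yyyy}}/{{MM}}/{{dd}}/"
}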
