S3 Event Type (Beta)

The S3 event type lets you query data directly from AWS S3 buckets. You provide your AWS API credentials and specify the S3 bucket, and the data is pulled through the event type.

📘

Contact LogicHub to enable the configuration.


To fetch data from AWS S3, you need to create a connection and an event type. The following sections describe how to create both for S3.

Create a Connection for S3

  1. Go to My Library > Connections.
  2. Click New.
  3. Select the Connection Type as s3 for AWS.
    • Enter the following credentials in the add connection form:
      • Access Key: Enter a valid AWS access key.
      • Secret: Enter a valid AWS secret key.
      • URL: Provide a valid URL. Use s3.amazonaws.com or s3-{region}.amazonaws.com. To set a default region for the S3 connection, use s3-{region}.amazonaws.com. We recommend a region-specific URL for better download speed.
  4. Click Save.
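
As a quick illustration of the two URL forms accepted in the connection form, the sketch below builds the value for the URL field from an optional region. The function name `s3_connection_url` is hypothetical and is not part of the product; only the two URL shapes come from the steps above.

```python
from typing import Optional


def s3_connection_url(region: Optional[str] = None) -> str:
    """Return the value to enter in the connection's URL field.

    Hypothetical helper: with no region it returns the global endpoint,
    otherwise the region-specific form s3-{region}.amazonaws.com.
    """
    if region is None:
        return "s3.amazonaws.com"
    region = region.strip().lower()
    if not region:
        raise ValueError("region must not be empty")
    return f"s3-{region}.amazonaws.com"


print(s3_connection_url())                  # s3.amazonaws.com
print(s3_connection_url("ap-northeast-1"))  # s3-ap-northeast-1.amazonaws.com
```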

Create an S3 Event Type

  1. Go to My Library > Event Types.
  2. Click New.
  3. Enter the following details in the event type form:
    • Name: Enter a name to identify the event type. The name can consist of alphanumeric characters and underscores ( _ ). The first character can't be a number.
    • Source: Click Query to base the event type on a query to an external connection.
    • Connection: Choose an S3 connection from the drop-down. The S3 query fields then become available.
      The table below is the complete list of JSON keys and their descriptions. Use these keys to specify which data to retrieve.
    • Query: Enter the query as in the example below and click Save.

JSON Key

Description

Required

AWS Region

Choose the region from the drop-down.

  • Used to specify bucket region.
  • If the connection URL already sets a region (s3-{region}.amazonaws.com), a region specified in the query JSON overrides it.
  • If the connection URL specifies a region and your bucket is in a different region, the query fails with an error.

Optional

Bucket
[String]

Enter the name of the S3 bucket that holds the data you want to query.

Required

Key Pattern

Enter the key pattern of the files you want to query, that is, the prefix of the objects you want to download from the S3 bucket. Refer to the examples below for sample key patterns.

For example, suppose the key pattern is AWSLogs/001234567890/CloudTrail/ap-northeast-1/{{yyyy}}/{{MM}}/{{dd}}/ and, in the playbook, the start time is '2020-08-20 00:00:00' and the end time is '2020-08-21 05:00:00'. Data is pulled from the following partitions:

AWSLogs/001234567890/CloudTrail/ap-northeast-1/2020/08/20
AWSLogs/001234567890/CloudTrail/ap-northeast-1/2020/08/21

Two days of data are downloaded. If timestampColumn is provided, the result is then filtered to rows between '2020-08-20 00:00:00' and '2020-08-21 05:00:00'.

Here, /{{yyyy}}/{{MM}}/{{dd}}/ is the format of the date-time based partition information. The complete set of tokens you can include is:

  • yyyy: four-digit year
  • MM: two-digit month (01=January)
  • dd: two-digit day of the month (01 through 31)
  • HH: two-digit hour (00 through 23; am/pm is not allowed). The finest granularity supported is an hour.

Note:

  • You do not need to put an actual date in the key pattern; specify only the format of your prefix. The actual time range is taken from the playbook time period picker or the stream batch length.
  • If you omit the key pattern, everything in the bucket is downloaded, which is not recommended.

Optional

File Format
[String]

Choose the format of the files you are querying. The accepted file formats are csv, csv.gz, json, json.gz, txt, log, txt.gz, log.gz, and parquet.

  • You can read any file as txt. Formats other than those listed are not accepted.

Note:
When you select the file format as csv or csv.gz, the column header checkbox and the time stamp fields appear.

When you select the file format as json or json.gz, the multiline checkbox appears.

Timestamp columns are supported for parquet format.

Timestamp columns are not supported for the file formats txt, log, txt.gz, and log.gz.

Required

Query
[String]

Spark SQL query to run on the fetched data. The internal placeholder {{table}} must be used as the table name.

Example: "select * from {{table}} limit 10"

Required

Time Stamp Column
[String]

Timestamp column to infer lhub_ts from.

  • If timestampColumn is not provided, then the lhub_ts column will remain blank.
  • By default, the pattern "yyyy-MM-dd HH:mm:ss" is used to parse timestampColumn.
  • Rows not matching the default pattern will be omitted.

Optional

Time Stamp Pattern
[String]

The pattern used to parse timestamps from the timestamp column.

  • Rows that do not match the specified pattern are omitted from the result.

Optional

MultiLine JSON
[Boolean]

If the specified file is JSON, every line is treated as a complete JSON object by default. In some files, a single line is not a complete JSON object. For such files, set this key to parse multiline JSON.

Optional

Header
[Boolean]

By default, header is false, so the first line of a CSV file is treated as data rather than as column headers. To change this, set the header key in the query JSON.

Optional

Column Names
[List[String]]

By default, the first line of a CSV file is not treated as a header. Instead, you can provide columnNames here to be used as the column headers.

  • If columnNames is specified, it takes priority and supersedes the header key.

Optional

📘

The maximum size that can be pulled from S3 at once is 50 GB.
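
To make the key list above more concrete, here is a minimal sketch that assembles a query JSON from the documented keys and applies the basic rules from the table (required keys, accepted file formats, the {{table}} placeholder). The helper name `build_s3_query` and its checks are assumptions made for illustration; they mirror the table and are not the product's own validation.

```python
import json

# File formats listed in the table above.
ALLOWED_FORMATS = {
    "csv", "csv.gz", "json", "json.gz",
    "txt", "txt.gz", "log", "log.gz", "parquet",
}


def build_s3_query(bucket, file_format, query, key_pattern=None, **options):
    """Assemble a query JSON string from the documented keys.

    Hypothetical helper; the checks mirror the table, not the product's
    own validation.
    """
    if not bucket:
        raise ValueError("bucket is required")
    if file_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported fileFormat: {file_format}")
    if "{{table}}" not in query:
        raise ValueError("query must reference the {{table}} placeholder")

    payload = {"bucket": bucket, "fileFormat": file_format, "query": query}
    if key_pattern is not None:
        payload["keyPattern"] = key_pattern
    # Optional keys such as header, columnNames, multiLineJson, or
    # timestampColumn are passed through unchanged.
    payload.update(options)
    return json.dumps(payload, indent=2)


print(build_s3_query(
    "test_bucket",
    "json.gz",
    "select * from {{table}} limit 10",
    key_pattern="Logs/CloudTrail/{{yyyy}}/{{MM}}/{{dd}}/",
))
```

The printed string can be pasted into the Query field. Leaving out `key_pattern` is allowed by the sketch, but as noted in the table it would download the whole bucket.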

Example of an S3 Event Type to Pull CloudTrail Logs

Assume the following bucket layout:

  • Bucket name: test_bucket
  • Directory structure: Logs/CloudTrail/2020/08/20/
  • Files in the directory: file1.json.gz, file2.json.gz
  • The query to fetch just file1.json.gz and file2.json.gz is:
{"bucket": "test_bucket",
 "fileFormat": "json.gz",
 "query": "select * from {{table}}",
 "keyPattern": "Logs/CloudTrail/2020/08/20/"}
  • Now, let's say you want to fetch data for specific days only.
{"bucket": "test_bucket",
 "fileFormat": "json.gz",
 "query": "select * from {{table}}",
 "keyPattern": "Logs/CloudTrail/{{yyyy}}/{{MM}}/{{dd}}/"}
  • The query above pulls data only for the start and end times specified through the playbook or the stream batch length.
    • Let's say the start time is 2020-08-12 12:00:00 and the end time is 2020-08-14 01:00:00.
    • This will download all data from 3 subfolders, that is: Logs/CloudTrail/2020/08/12/, Logs/CloudTrail/2020/08/13/, Logs/CloudTrail/2020/08/14/.
    • If you need more granular filtering of the downloaded data, specify the correct timestamp column and timestamp pattern. Data is still downloaded for all three key prefixes, but only rows within the specified time range are returned in the result.
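
The prefix expansion and the optional timestamp filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the function names are hypothetical, the key-pattern tokens are mapped to Python strftime codes, the Python format "%Y-%m-%d %H:%M:%S" stands in for the default "yyyy-MM-dd HH:mm:ss" pattern, and both ends of the time range are treated as inclusive.

```python
from datetime import datetime, timedelta

# Key-pattern tokens mapped to strftime codes (assumed equivalents).
TOKENS = {"{{yyyy}}": "%Y", "{{MM}}": "%m", "{{dd}}": "%d", "{{HH}}": "%H"}


def expand_key_pattern(pattern, start, end):
    """List the key prefixes a pattern covers between start and end."""
    # Step by hour only if the pattern has an hour token, else by day.
    hourly = "{{HH}}" in pattern
    step = timedelta(hours=1) if hourly else timedelta(days=1)
    if hourly:
        current = start.replace(minute=0, second=0, microsecond=0)
    else:
        current = start.replace(hour=0, minute=0, second=0, microsecond=0)

    prefixes = []
    while current <= end:
        prefix = pattern
        for token, code in TOKENS.items():
            prefix = prefix.replace(token, current.strftime(code))
        if prefix not in prefixes:
            prefixes.append(prefix)
        current += step
    return prefixes


def filter_rows(rows, column, start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Keep rows whose timestamp parses and falls within [start, end]."""
    kept = []
    for row in rows:
        try:
            ts = datetime.strptime(row[column], fmt)
        except (KeyError, TypeError, ValueError):
            continue  # rows that do not match the pattern are omitted
        if start <= ts <= end:
            kept.append(row)
    return kept


start = datetime(2020, 8, 12, 12, 0, 0)
end = datetime(2020, 8, 14, 1, 0, 0)
for prefix in expand_key_pattern("Logs/CloudTrail/{{yyyy}}/{{MM}}/{{dd}}/", start, end):
    print(prefix)
# Logs/CloudTrail/2020/08/12/
# Logs/CloudTrail/2020/08/13/
# Logs/CloudTrail/2020/08/14/
```

All three daily prefixes are listed even though the range covers only part of the first and last day, which is why a timestamp column is needed to trim the result to the exact range.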

Handling and Options for Different File Formats

  • JSON type

    • json and json.gz are handled as the JSON type.

    • By default, every line in a JSON file is assumed to be a complete JSON object, which is a standard logging practice. If the file looks like the following:

      {"key1":"value1"}
      {"key2":"value2"}

      value1 and value2 will be in two columns, key1 and key2 respectively.

    • If the JSON looks like {"Results":[{"key1":"value1"},{"key2":{"key3":"value3","key4":"value4"}}]}, by default it is flattened into the columns:
      Results_key1, Results_key2_key3, Results_key2_key4

    • Any line containing malformed JSON is skipped.

    • Handling multiline JSON: Multiline JSON can be read only if the file as a whole is a complete JSON document, that is, one large nested multiline JSON object or a list of multiline JSON objects. To read such files, pass multiLineJson as true in the query, for example:

{"bucket": "test_bucket",
 "fileFormat": "json.gz",
 "query": "select * from {{table}}",
 "keyPattern": "Logs/CloudTrail/{{yyyy}}/{{MM}}/{{dd}}/",
 "multiLineJson": true}
  • CSV type
    • csv and csv.gz are handled as the CSV type.
    • By default, the first line of a CSV file is not treated as a header. Consider the following CSV example:
value1,value2
a,b
c,d

By default, it will be read as:

_c0     _c1
value1  value2
a       b
c       d

If your first line is a header, then specify "header":true in query JSON.

value1  value2
a       b
c       d

If you want to replace column names, then specify "columnNames": ["col1","col2"]

col1    col2
value1  value2
a       b
c       d

📘

Specifying fewer or more column names than the file has columns throws an error.
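
The three CSV behaviors above (default generated names, "header": true, and "columnNames") can be reproduced with a small sketch built on Python's standard csv module. The function `read_csv` is hypothetical and only mimics the documented results. In particular, keeping the first line as data whenever columnNames is given is an assumption based on the col1/col2 example above.

```python
import csv
import io


def read_csv(text, header=False, column_names=None):
    """Return (columns, rows) following the header/columnNames rules above."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return [], []
    width = len(rows[0])
    if column_names is not None:
        # columnNames takes priority over header; the count must match.
        if len(column_names) != width:
            raise ValueError(
                f"expected {width} column names, got {len(column_names)}"
            )
        return list(column_names), rows
    if header:
        return rows[0], rows[1:]
    # Default: generated names, first line kept as data.
    return [f"_c{i}" for i in range(width)], rows


sample = "value1,value2\na,b\nc,d\n"
print(read_csv(sample))
# (['_c0', '_c1'], [['value1', 'value2'], ['a', 'b'], ['c', 'd']])
print(read_csv(sample, header=True))
# (['value1', 'value2'], [['a', 'b'], ['c', 'd']])
print(read_csv(sample, column_names=["col1", "col2"]))
# (['col1', 'col2'], [['value1', 'value2'], ['a', 'b'], ['c', 'd']])
```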

  • txt type

    • log, log.gz, txt, and txt.gz are handled as the txt type.
    • The txt format splits data on every new line.
    • A timestamp column cannot be extracted from this format, so filtering is limited to folder partitioning through the key pattern.
  • parquet type

    • Standard Spark DataFrame format.
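
As a rough illustration of the JSON type rules described earlier in this section (one JSON object per line by default, malformed lines skipped, nested keys flattened into underscore-joined column names, and whole-file parsing for multiline JSON), consider the sketch below. The function names are hypothetical, and merging list elements under the parent name without an index is an assumption chosen to reproduce the Results_key1, Results_key2_key3, Results_key2_key4 example; the product's actual flattening may differ in other cases.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested JSON into one dict keyed by underscore-joined paths.

    List elements are merged under the parent's name without an index,
    which reproduces the column names in the example from this section.
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}_{key}" if prefix else key
            flat.update(flatten(child, name))
    elif isinstance(value, list):
        for child in value:
            flat.update(flatten(child, prefix))
    else:
        flat[prefix] = value
    return flat


def read_json(text, multi_line=False):
    """Parse text as one JSON document or as one JSON object per line."""
    if multi_line:
        document = json.loads(text)
        records = document if isinstance(document, list) else [document]
        return [flatten(record) for record in records]
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            rows.append(flatten(json.loads(line)))
        except json.JSONDecodeError:
            continue  # malformed lines are skipped
    return rows


lines = '{"key1":"value1"}\n{"key2":"value2"}\n{broken'
print(read_json(lines))
# [{'key1': 'value1'}, {'key2': 'value2'}]

nested = '{"Results":[{"key1":"value1"},{"key2":{"key3":"value3","key4":"value4"}}]}'
print(list(read_json(nested)[0]))
# ['Results_key1', 'Results_key2_key3', 'Results_key2_key4']
```

With `multi_line=True`, the same function reads a pretty-printed file that spans several lines, which the line-by-line mode would otherwise skip as malformed.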
