Directory Data Source
The directory data source type creates a data source from the contents of one or more files. It respects the time range of the playbook or batch and returns only the rows whose timestamp falls within that range.
The following are required to use the directory data source type:
- A connection pointing to an arbitrary directory on the LogicHub system. For example, file:///opt/docker/data/service/sources/dir.
- An event type that uses that connection with a JSON specification that describes the files that constitute the data source.
- A step in the playbook that uses the event type.
In the event type, the JSON query specification works together with the directory connection. The specification can name the timestamp column and give the timestamp format pattern. Both are optional, as the schema under Query Specification shows.
Query Specification
The JSON specification of the directory data source type has the following schema.
{
timestampColumn: Option[String] (optional name of the timestamp column)
timestampPattern: Option[String] (optional Java date/time format string)
additionalFiles: Array[String] (array of files constituting the data source)
extractAsRaw: Boolean (applicable ONLY to json file types - extract as json events with a message time field)
}
For details on the pattern syntax, see the Java date/time format documentation. If the timestamp column or pattern is omitted, the directory data source uses the same timestamp inference algorithm that file data source types use. The inference also recognizes numeric epoch values.
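As a rough illustration of the behavior described above, the following Python sketch parses a timestamp either with an explicit pattern or as a numeric epoch value, and then keeps only the rows inside a time range. It is not LogicHub's implementation. The function names are hypothetical, the pattern uses Python strptime syntax rather than Java syntax, and several choices are assumptions: timestamps without a zone are treated as UTC, large numbers are treated as epoch milliseconds, and the range is half-open.

```python
from datetime import datetime, timezone


def parse_timestamp(value, strptime_pattern=None):
    """Return a timezone-aware UTC datetime for a raw timestamp value.

    Hypothetical helper: with a pattern, the value is parsed as text;
    without one, it is assumed to be a numeric epoch value.
    """
    text = str(value).strip()
    if strptime_pattern is not None:
        # Assumption: timestamps without zone information are UTC.
        parsed = datetime.strptime(text, strptime_pattern)
        return parsed.replace(tzinfo=timezone.utc)
    number = float(text)
    # Assumption: values this large are epoch milliseconds, not seconds.
    if number > 1e11:
        number /= 1000.0
    return datetime.fromtimestamp(number, tz=timezone.utc)


def filter_rows(rows, column, start, end, strptime_pattern=None):
    """Keep rows whose timestamp falls in the half-open range [start, end)."""
    return [
        row for row in rows
        if start <= parse_timestamp(row[column], strptime_pattern) < end
    ]
```

The Java pattern "MM/dd/yy HH:mm:ss" used elsewhere on this page corresponds to "%m/%d/%y %H:%M:%S" in this sketch.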
Each element of additionalFiles can be any of the following:
- a local file, such as file:///opt/docker/data/service/somefile.csv.gz
- a local directory. All supported files in the directory are read (the operation is not recursive)
- a remote file, such as https://s3-us-west-2.amazonaws.com/lh-public/lsof_100.csv.gz
Example of a query specification that uses additionalFiles:
{
"timestampColumn": "_time",
"timestampPattern": "MM/dd/yy HH:mm:ss",
"additionalFiles": [
"https://s3-us-west-2.amazonaws.com/lh-public/lsof_100.csv.gz",
"https://www.dropbox.com/s/sfrshzzahb3xcx9/Attack2Data_125258.csv.gz"
]
}
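If the specification is generated by a script rather than typed by hand, it could be assembled along the following lines. This is a minimal sketch: build_query_spec is a hypothetical helper, not part of LogicHub, and it assumes that optional keys can simply be left out of the JSON when they are not set.

```python
import json


def build_query_spec(additional_files=None, timestamp_column=None,
                     timestamp_pattern=None, extract_as_raw=False):
    """Return a directory data source query specification as a JSON string.

    Hypothetical helper; the key names follow the schema on this page.
    """
    spec = {}
    if timestamp_column is not None:
        spec["timestampColumn"] = timestamp_column
    if timestamp_pattern is not None:
        spec["timestampPattern"] = timestamp_pattern
    if additional_files:
        spec["additionalFiles"] = list(additional_files)
    if extract_as_raw:
        # Per the schema, this flag applies only to JSON files.
        spec["extractAsRaw"] = True
    return json.dumps(spec, indent=2)


print(build_query_spec(
    additional_files=[
        "https://s3-us-west-2.amazonaws.com/lh-public/lsof_100.csv.gz",
    ],
    timestamp_column="_time",
    timestamp_pattern="MM/dd/yy HH:mm:ss",
))
```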
Valid File Formats for Directory Data Source
CSV Files in the Directory
CSVs are read as tables with the first row as the header that defines the columns.
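To check how a file will be interpreted before placing it in the directory, the header-row convention can be reproduced locally. The sketch below is an approximation using Python's standard csv module, not the product's reader. read_csv_rows is a hypothetical name, and UTF-8 encoding and detecting gzip by the ".gz" suffix are assumptions.

```python
import csv
import gzip


def read_csv_rows(path):
    """Read a plain or gzip-compressed CSV file into a list of dicts.

    The first row is treated as the header that names the columns.
    """
    # Assumption: gzip-compressed files are recognized by their suffix.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```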
JSON Files in the Directory
Depending on the value of extractAsRaw in the Directory Source query specification, JSON files can be in either of the following formats.
If extractAsRaw is true:
Query Specification
{
"extractAsRaw": true
}
Expected File Format
A sequence of one or more JSON objects, each with an optional timestamp field. The objects can have any fields and structure.
Sample File: test.json
{
"_time": "1569346383",
"_raw": "lorem ipsum 1",
"_sourceCategory": "syslog"
}{
"_time": "1569347383",
"_raw": "lorem ipsum 2",
"_sourceCategory": "syslog"
}
Columns:
_raw and timestampColumn, where timestampColumn is the column set in the query specification or, if none is set, one of the known timestamp column keys (_messagetime, _messagetimems, _time, _timestamp) that exists in the JSON object.
Rows:
One row for each JSON object.
_raw = JSON blob
timestampColumn = the value extracted from the specified column. If no column is specified, the value of one of the known timestamp keys. If none of those exists, the current UTC time, stored as _messagetime.
For the sample above, the parsed rows would be:
_time _raw
1569346383 {"_time": "1569346383","_raw": "lorem ipsum 1","_sourceCategory": "syslog"}
1569347383 {"_time": "1569347383","_raw": "lorem ipsum 2","_sourceCategory": "syslog"}
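The row-building rules above can be approximated with the following Python sketch, which reads a sequence of concatenated JSON objects and emits one row per object. It is an illustration under stated assumptions, not the product's parser. The function names are hypothetical, the known keys are tried in the order listed on this page, a specified column that is missing from an object is handled like a missing timestamp, and the fallback time is written as epoch seconds.

```python
import json
from datetime import datetime, timezone

KNOWN_TIMESTAMP_KEYS = ("_messagetime", "_messagetimems", "_time", "_timestamp")


def iter_json_objects(text):
    """Yield each JSON object from a string of back-to-back objects."""
    decoder = json.JSONDecoder()
    pos, length = 0, len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def to_raw_rows(text, timestamp_column=None):
    """Build one {timestamp column, _raw} row per JSON object."""
    rows = []
    for obj in iter_json_objects(text):
        keys = (timestamp_column,) if timestamp_column else KNOWN_TIMESTAMP_KEYS
        column = next((key for key in keys if key in obj), None)
        if column is None:
            # Assumption: the fallback is the current UTC time in epoch seconds.
            column = "_messagetime"
            value = str(int(datetime.now(timezone.utc).timestamp()))
        else:
            value = obj[column]
        rows.append({column: value, "_raw": json.dumps(obj)})
    return rows
```

Calling to_raw_rows on the contents of the sample file yields two rows, each with a _time value and the original object serialized in _raw.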
If extractAsRaw is false (default):
Query Specification
{
"extractAsRaw": false
}
Expected File Format
A single JSON object with a fields array and a messages array, which specify the columns and the column values, respectively. Refer to the following sample for details.
Sample File: test.json
{
"fields": [
{
"name": "_time",
"fieldType": "string",
"keyField": false
},
{
"name": "_raw",
"fieldType": "string",
"keyField": false
},
{
"name": "_sourceCategory",
"fieldType": "string",
"keyField": false
}
],
"messages": [
{
"map": {
"_time": "1569346383",
"_raw": "lorem ipsum 1",
"_sourceCategory": "syslog"
}
},
{
"map": {
"_time": "1569347383",
"_raw": "lorem ipsum 2",
"_sourceCategory": "syslog"
}
}
]
}
Columns:
As specified in the fields array. For the sample above, the parsed columns would be _time, _raw, and _sourceCategory.
Rows:
As specified in the messages JSON array. Each array element is parsed as one row. For the sample above, the parsed rows would be:
_time _raw _sourceCategory
1569346383 lorem ipsum 1 syslog
1569347383 lorem ipsum 2 syslog
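The mapping from the fields and messages arrays to columns and rows can be sketched in Python as follows. This is a minimal illustration rather than the product's parser. table_from_fields_messages is a hypothetical name, and returning None for a column that is absent from a message's map is an assumption.

```python
import json


def table_from_fields_messages(text):
    """Return (columns, rows) for a JSON document with fields and messages."""
    doc = json.loads(text)
    # Column names come from the "name" of each entry in "fields".
    columns = [field["name"] for field in doc["fields"]]
    # Each element of "messages" becomes one row, read from its "map".
    rows = [
        [message["map"].get(name) for name in columns]
        for message in doc["messages"]
    ]
    return columns, rows
```

For the sample file, this returns the columns _time, _raw, and _sourceCategory and the two rows listed above.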