Components

Processors

AttributeRollingWindow

Track a Rolling Window based on evaluating an Expression Language expression on each FlowFile and add that value to the processor’s state. Each FlowFile will be emitted with the count of FlowFiles and total aggregate value of values processed in the current time window.
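
As a hedged illustration (attribute names are those listed under Writes Attributes below; the tracked expression and values are hypothetical), tracking ${fileSize} over a five-minute window might yield attributes along these lines:

{
  "rolling_window_value": "1848",
  "rolling_window_count": "3",
  "rolling_window_mean": "616.0"
}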

Tags: Attribute Expression Language, state, data science, rolling, window

Properties

Value to track

The expression on which to evaluate each FlowFile. The result of the expression will be added to the rolling window value.

Time window

The time window on which to calculate the rolling window.

Sub-window length

When set, values will be batched into sub-windows of the set length. This allows much larger total windows to be used, but sacrifices some precision. If this is not set (or is 0), each value is stored in state with the timestamp of when it was received and is removed once the length of time stated in Time window elapses. If this is set, values are batched together every X amount of time (where X is the time period set for this property) and removed all at once.

Relationships

  • success: All FlowFiles that are successfully processed are routed here

  • failure: When a FlowFile fails for a reason other than failing to set state it is routed here.

  • set state fail: When state fails to save when processing a FlowFile, the FlowFile is routed here.

Writes Attributes

  • rolling_window_value: The rolling window value (sum of all the values stored).

  • rolling_window_count: The count of the number of FlowFiles seen in the rolling window.

  • rolling_window_mean: The mean of the FlowFiles seen in the rolling window.

  • rolling_window_variance: The variance of the FlowFiles seen in the rolling window.

  • rolling_window_stddev: The standard deviation (positive square root of the variance) of the FlowFiles seen in the rolling window.

Stateful

Scope: Local

Store the values backing the rolling window. This includes storing the individual values and their time-stamps or the batches of values and their counts.

Input Requirement

This component requires an incoming relationship.

AttributesToCSV

Generates a CSV representation of the input FlowFile Attributes. The resulting CSV can be written to either a newly generated attribute named 'CSVData' or written to the FlowFile as content. If an attribute value contains a comma, newline or double quote, the value will be enclosed in double quotes. Any double quote characters in the attribute value are escaped with another double quote.

Tags: csv, attributes, flowfile

Properties

Attribute List

Comma separated list of attributes to be included in the resulting CSV. If this value is left empty then all existing Attributes will be included. This list of attributes is case sensitive and supports attribute names that contain commas. If an attribute specified in the list is not found it will be emitted to the resulting CSV with an empty string or null depending on the 'Null Value' property. If a core attribute is specified in this list and the 'Include Core Attributes' property is false, the core attribute will be included. The attribute list ALWAYS wins.

Attributes Regular Expression

Regular expression that will be evaluated against the flow file attributes to select the matching attributes. This property can be used in combination with the attributes list property. The final output will contain a combination of matches found in the ATTRIBUTE_LIST and ATTRIBUTE_REGEX.

Destination

Controls whether the CSV value is written as a new FlowFile attribute 'CSVData' or written in the FlowFile content.

Include Core Attributes

Determines if the FlowFile org.apache.nifi.flowfile.attributes.CoreAttributes, which are contained in every FlowFile, should be included in the final CSV value generated. Core attributes will be added to the end of the CSVData and CSVSchema strings. The Attribute List property overrides this setting.

Null Value

If true, a non-existing or empty attribute will be 'null' in the resulting CSV. If false, an empty string will be placed in the CSV.

Include Schema

If true, the schema (attribute names) will also be converted to a CSV string, which will either be applied to a new attribute named 'CSVSchema' or applied as the first row in the content, depending on the DESTINATION property setting.
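
As a minimal, hypothetical sketch of the resulting attributes: given two attributes, filename with value data.txt and note with value a,b, with Include Schema set to true and Destination set to the attribute option, the processor would add attributes along these lines (the note value contains a comma, so it is enclosed in double quotes per the escaping rules described above):

{
  "CSVSchema": "filename,note",
  "CSVData": "data.txt,\"a,b\""
}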

Relationships

  • success: Successfully converted attributes to CSV

  • failure: Failed to convert attributes to CSV

Writes Attributes

  • CSVSchema: CSV representation of the Schema

  • CSVData: CSV representation of Attributes

Input Requirement

This component requires an incoming relationship.

AttributesToJSON

Generates a JSON representation of the input FlowFile Attributes. The resulting JSON can be written to either a new Attribute 'JSONAttributes' or written to the FlowFile as content. Attributes which contain nested JSON objects can either be handled as JSON or as escaped JSON depending on the strategy chosen.
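
As a hedged example (attribute names and values are hypothetical), with Destination set to the attribute option the processor adds a 'JSONAttributes' attribute containing a flat JSON object that maps attribute names to their values, along these lines:

{
  "filename": "invoice-123.txt",
  "department": "accounts",
  "uuid": "0d4f9b2e-6c1a-4f7e-9d2b-3a5c8e1f0a6b"
}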

Tags: json, attributes, flowfile

Properties

Attributes List

Comma separated list of attributes to be included in the resulting JSON. If this value is left empty then all existing Attributes will be included. This list of attributes is case sensitive. If an attribute specified in the list is not found it will be emitted to the resulting JSON with an empty string or NULL value.

Attributes Regular Expression

Regular expression that will be evaluated against the flow file attributes to select the matching attributes. This property can be used in combination with the attributes list property.

Destination

Controls whether the JSON value is written as a new FlowFile attribute 'JSONAttributes' or written in the FlowFile content. Writing to FlowFile content will overwrite any existing FlowFile content.

Include Core Attributes

Determines if the FlowFile org.apache.nifi.flowfile.attributes.CoreAttributes which are contained in every FlowFile should be included in the final JSON value generated.

Null Value

If true, a non-existing selected attribute will be NULL in the resulting JSON. If false, an empty string will be placed in the JSON.

JSON Handling Strategy

Strategy to use for handling attributes which contain nested JSON.

Pretty Print

Apply pretty print formatting to the output.

Relationships

  • success: Successfully converted attributes to JSON

  • failure: Failed to convert attributes to JSON

Writes Attributes

  • JSONAttributes: JSON representation of Attributes

Input Requirement

This component requires an incoming relationship.

CalculateParquetOffsets

The processor generates N flow files from the input, and adds attributes with the offsets required to read the group of rows in the FlowFile’s content. Can be used to increase the overall efficiency of processing extremely large Parquet files.
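
A hedged illustration (record counts are hypothetical; attribute names are those listed under Writes Attributes below): with Records Per Split set to 10000 and an input covering 25000 records, three FlowFiles would be emitted with attributes along these lines:

[
  {"record.offset": "0", "record.count": "10000"},
  {"record.offset": "10000", "record.count": "10000"},
  {"record.offset": "20000", "record.count": "5000"}
]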

Tags: parquet, split, partition, break apart, efficient processing, load balance, cluster

Properties

Records Per Split

Specifies how many records should be covered in each FlowFile

Zero Content Output

Whether or not to copy the content of the input FlowFile.

Relationships

  • success: FlowFiles, with special attributes that represent a chunk of the input file.

Reads Attributes

  • record.offset: Gets the index of first record in the input.

  • record.count: Gets the number of records in the input.

  • parquet.file.range.startOffset: Gets the start offset of the selected row group in the parquet file.

  • parquet.file.range.endOffset: Gets the end offset of the selected row group in the parquet file.

Writes Attributes

  • record.offset: Sets the index of first record of the parquet file.

  • record.count: Sets the number of records in the parquet file.

Input Requirement

This component requires an incoming relationship.

CalculateParquetRowGroupOffsets

The processor generates one FlowFile from each Row Group of the input, and adds attributes with the offsets required to read the group of rows in the FlowFile’s content. Can be used to increase the overall efficiency of processing extremely large Parquet files.

Tags: parquet, split, partition, break apart, efficient processing, load balance, cluster

Properties

Zero Content Output

Whether or not to copy the content of the input FlowFile.

Relationships

  • success: FlowFiles, with special attributes that represent a chunk of the input file.

Writes Attributes

  • parquet.file.range.startOffset: Sets the start offset of the selected row group in the parquet file.

  • parquet.file.range.endOffset: Sets the end offset of the selected row group in the parquet file.

  • record.count: Sets the count of records in the selected row group.

Input Requirement

This component requires an incoming relationship.

CalculateRecordStats

Counts the number of Records in a record set, optionally counting the number of elements per category, where the categories are defined by user-defined properties.

Tags: record, stats, metrics

Properties

Record Reader

A record reader to use for reading the records.

record-stats-limit

Limit the number of individual stats that are returned for each record path to the top N results.

Dynamic Properties

The name of the category. For example, sport

Specifies a category that should be counted. For example, if the property name is 'sport' and the value is '/sport', the processor will count how many records have a value of 'soccer' for the /sport field, how many have a value of 'baseball' for the /sport field, and so on. These counts will be added as attributes named recordStats.sport.soccer and recordStats.sport.baseball.

Relationships

  • success: All FlowFiles that are successfully processed, are routed to this Relationship.

  • failure: If a FlowFile cannot be processed for any reason, it is routed to this Relationship.

Writes Attributes

  • record.count: A count of the records in the record set in the FlowFile.

  • recordStats.<User Defined Property Name>.count: A count of the records that contain a value for the user defined property.

  • recordStats.<User Defined Property Name>.<value>.count: Each value discovered for the user defined property will have its own count attribute. Total number of top N value counts to be added is defined by the limit configuration.

Input Requirement

This component requires an incoming relationship.

Additional Details

This processor takes in a record set and provides both the overall record count and counts for categories defined as dynamic properties that map a property name to a record path. Record path counts are provided at two levels:

  • The overall count of all records that successfully evaluated a record path.

  • A breakdown of counts of unique values that matched the record path operation.

Consider the following record structure:

{
  "sport": "Soccer",
  "name": "John Smith"
}

A valid mapping here would be sport ⇒ /sport.

For a record set of five entries like that, with three instances of 'Soccer' and two instances of 'Football', it would set the following attributes:

  • record_count: 5

  • sport: 5

  • sport.Soccer: 3

  • sport.Football: 2

CaptureChangeMySQL

Retrieves Change Data Capture (CDC) events from a MySQL database. CDC Events include INSERT, UPDATE, DELETE operations. Events are output as either a group of a specified number of events (the default is 1 so each event becomes its own flow file) or grouped as a full transaction (BEGIN to COMMIT). All events are ordered by the time at which the operation occurred. NOTE: If the processor is stopped before the specified number of events have been written to a flow file, the partial flow file will be output in order to maintain the consistency of the event stream.

Tags: sql, jdbc, cdc, mysql, transaction, event

Properties

MySQL Nodes

A list of hostname (and optional port) entries corresponding to nodes in a MySQL cluster. The entries should be comma separated, using a colon if the port is to be specified, such as host1:port,host2:port. For example mysql.myhost.com:3306. The port need not be specified; when omitted, the default MySQL port value of 3306 will be used. This processor will attempt to connect to the hosts in the list in order. If one node goes down and failover is enabled for the cluster, then the processor will connect to the active node (assuming its node entry is specified in this property).

MySQL Driver Class Name

The class name of the MySQL database driver class

MySQL Driver Location(s)

Comma-separated list of files/folders and/or URLs containing the MySQL driver JAR and its dependencies (if any). For example '/var/tmp/mysql-connector-java-5.1.38-bin.jar'

Username

Username to access the MySQL cluster

Password

Password to access the MySQL cluster

Event Processing Strategy

Specifies the strategy to use when writing events to FlowFile(s), such as 'Max Events Per FlowFile'

Events Per FlowFile

Specifies how many events should be written to a single FlowFile. If the processor is stopped before the specified number of events has been written, the events will still be written as a FlowFile before stopping.

Server ID

The client connecting to the MySQL replication group is actually a simplified replica (server), and the Server ID value must be unique across the whole replication group (i.e. different from any other Server ID being used by any primary or replica). Thus, each instance of CaptureChangeMySQL must have a Server ID unique across the replication group. If the Server ID is not specified, it defaults to 65535.

Database/Schema Name Pattern

A regular expression (regex) for matching databases (or schemas, depending on your RDBMS' terminology) against the list of CDC events. The regex must match the database name as it is stored in the RDBMS. If the property is not set, the database name will not be used to filter the CDC events. NOTE: DDL events, even if they affect different databases, are associated with the database used by the session to execute the DDL. This means if a connection is made to one database, but the DDL is issued against another, then the connected database will be the one matched against the specified pattern.

Table Name Pattern

A regular expression (regex) for matching CDC events affecting matching tables. The regex must match the table name as it is stored in the database. If the property is not set, no events will be filtered based on table name.

Max Wait Time

The maximum amount of time allowed for a connection to be established, zero means there is effectively no limit.

Distributed Map Cache Client - unused

This is a legacy property that is no longer used to store table information; the processor now handles the table information (column names, types, etc.).

Retrieve All Records

Specifies whether to get all available CDC events, regardless of the current binlog filename and/or position. If binlog filename and position values are present in the processor’s State, this property’s value is ignored. This allows for 4 different configurations: 1) If binlog data is available in processor State, that is used to determine the start location and the value of Retrieve All Records is ignored. 2) If no binlog data is in processor State, then Retrieve All Records set to true means start at the beginning of the binlog history. 3) If no binlog data is in processor State and Initial Binlog Filename/Position are not set, then Retrieve All Records set to false means start at the end of the binlog history. 4) If no binlog data is in processor State and Initial Binlog Filename/Position are set, then Retrieve All Records set to false means start at the specified initial binlog file/position. To reset the behavior, clear the processor state (refer to the State Management section of the processor’s documentation).

Include Begin/Commit Events

Specifies whether to emit events corresponding to a BEGIN or COMMIT event in the binary log. Set to true if the BEGIN/COMMIT events are necessary in the downstream flow, otherwise set to false, which suppresses generation of these events and can increase flow performance.

Include DDL Events

Specifies whether to emit events corresponding to Data Definition Language (DDL) events, such as ALTER TABLE and TRUNCATE TABLE, in the binary log. Set to true if the DDL events are desired/necessary in the downstream flow, otherwise set to false, which suppresses generation of these events and can increase flow performance.

Initial Sequence ID

Specifies an initial sequence identifier to use if this processor’s State does not have a current sequence identifier. If a sequence identifier is present in the processor’s State, this property is ignored. Sequence identifiers are monotonically increasing integers that record the order of flow files generated by the processor. They can be used with the EnforceOrder processor to guarantee ordered delivery of CDC events.

Initial Binlog Filename

Specifies an initial binlog filename to use if this processor’s State does not have a current binlog filename. If a filename is present in the processor’s State or "Use GTID" property is set to false, this property is ignored. This can be used along with Initial Binlog Position to "skip ahead" if previous events are not desired. Note that NiFi Expression Language is supported, but this property is evaluated when the processor is configured, so FlowFile attributes may not be used. Expression Language is supported to enable the use of the environment properties.

Initial Binlog Position

Specifies an initial offset into a binlog (specified by Initial Binlog Filename) to use if this processor’s State does not have a current binlog filename. If a filename is present in the processor’s State or "Use GTID" property is false, this property is ignored. This can be used along with Initial Binlog Filename to "skip ahead" if previous events are not desired. Note that NiFi Expression Language is supported, but this property is evaluated when the processor is configured, so FlowFile attributes may not be used. Expression Language is supported to enable the use of the environment properties.

Use Binlog GTID

Specifies whether to use Global Transaction ID (GTID) for binlog tracking. If set to true, the processor’s state of binlog file name and position is ignored. The main benefit of using GTID is to have much more reliable failover than using binlog filename/position.

Initial Binlog GTID

Specifies an initial GTID to use if this processor’s State does not have a current GTID. If a GTID is present in the processor’s State or "Use GTID" property is set to false, this property is ignored. This can be used to "skip ahead" if previous events are not desired. Note that NiFi Expression Language is supported, but this property is evaluated when the processor is configured, so FlowFile attributes may not be used. Expression Language is supported to enable the use of the environment properties.

SSL Mode

SSL Mode used when an SSL Context Service is configured, supporting certificate verification options.

SSL Context Service

SSL Context Service supporting encrypted socket communication

Relationships

  • success: Successfully created FlowFile from SQL query result set.

Writes Attributes

  • cdc.sequence.id: A sequence identifier (i.e. strictly increasing integer value) specifying the order of the CDC event flow file relative to the other event flow file(s).

  • cdc.event.type: A string indicating the type of CDC event that occurred, including (but not limited to) 'begin', 'insert', 'update', 'delete', 'ddl' and 'commit'.

  • mime.type: The processor outputs flow file content in JSON format, and sets the mime.type attribute to application/json
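
A hedged sketch of the attributes written to a single event FlowFile (attribute names as listed above; values hypothetical):

{
  "cdc.sequence.id": "42",
  "cdc.event.type": "insert",
  "mime.type": "application/json"
}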

Stateful

Scope: Cluster

Information such as a 'pointer' to the current CDC event in the database is stored by this processor, such that it can continue from the same location if restarted.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component does not allow an incoming relationship.

ChunkDocument

Chunks incoming documents that are formatted as JSON Lines into chunks that are appropriately sized for creating Text Embeddings. The input is expected to be in "json-lines" format, with each line having a 'text' and a 'metadata' element. Each line will then be split into one or more lines in the output.
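
A minimal sketch of the expected JSON Lines input, with hypothetical text and metadata values (each line is an independent JSON object carrying a 'text' and a 'metadata' element):

{"text": "Apache NiFi supports scalable directed graphs of data routing.", "metadata": {"source": "intro.pdf", "page": "1"}}
{"text": "It also supports data transformation and system mediation logic.", "metadata": {"source": "intro.pdf", "page": "2"}}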

Use Cases

Create chunks of text from a single larger chunk.

Notes: The input for this use case is expected to be a FlowFile whose content is a JSON Lines document, with each line having a 'text' and a 'metadata' element.

Keywords: embedding, vector, text, rag, retrieval augmented generation

  1. Set "Input Format" to "Plain Text"

  2. Set "Element Strategy" to "Single Document"

Multi-Processor Use Cases

Chunk plaintext data in order to prepare it for storage in a vector store. The output is in "json-lines" format, containing the chunked data as text, as well as metadata pertaining to the chunk.

Notes: The input for this use case is expected to be a FlowFile whose content is a plaintext document.

Keywords: embedding, vector, text, rag, retrieval augmented generation

ParseDocument:

  1. ParseDocument

ChunkDocument:

  1. ChunkDocument

Parse and chunk the textual contents of a PDF document in order to prepare it for storage in a vector store. The output is in "json-lines" format, containing the chunked data as text, as well as metadata pertaining to the chunk.

Notes: The input for this use case is expected to be a FlowFile whose content is a PDF document.

Keywords: pdf, embedding, vector, text, rag, retrieval augmented generation

ParseDocument:

  1. ParseDocument

ChunkDocument:

  1. ChunkDocument

Tags: text, split, chunk, langchain, embeddings, vector, machine learning, ML, artificial intelligence, ai, document

Properties

Chunking Strategy

Specifies which splitter should be used to split the text

Separator

Specifies the character sequence to use for splitting apart the text. If using a Chunking Strategy of Recursively Split by Character, it is a comma-separated list of character sequences. Meta-characters \n, \r and \t are automatically un-escaped.

Separator Format

Specifies how to interpret the value of the <Separator> property

Chunk Size

The maximum size of a chunk that should be returned

Chunk Overlap

The number of characters that should be overlapped between each chunk of text

Keep Separator

Whether or not to keep the text separator in each chunk of data

Strip Whitespace

Whether or not to strip the whitespace at the beginning and end of each chunk

Language

The language to use for the Code’s syntax

CompressContent

Compresses or decompresses the contents of FlowFiles using a user-specified compression algorithm and updates the mime.type attribute as appropriate. A common idiom is to precede CompressContent with IdentifyMimeType and configure Mode='decompress' AND Compression Format='use mime.type attribute'. When used in this manner, the MIME type is automatically detected and the data is decompressed, if necessary. If decompression is unnecessary, the data is passed through to the 'success' relationship. This processor operates in a very memory efficient way so very large objects well beyond the heap size are generally fine to process.

Use Cases

Compress the contents of a FlowFile

Input Requirement: This component allows an incoming relationship.

  1. "Mode" = "compress"

  2. "Compression Format" should be set to whichever compression algorithm should be used.

Decompress the contents of a FlowFile

Input Requirement: This component allows an incoming relationship.

  1. "Mode" = "decompress"

  2. "Compression Format" should be set to whichever compression algorithm was used to compress the data previously.

Multi-Processor Use Cases

Check whether or not a FlowFile is compressed and if so, decompress it.

Notes: If IdentifyMimeType determines that the content is not compressed, CompressContent will pass the FlowFile along to the 'success' relationship without attempting to decompress it.

Keywords: auto, detect, mime type, compress, decompress, gzip, bzip2

CompressContent:

  1. "Mode" = "decompress"

  2. "Compression Format" = "use mime.type attribute" .

IdentifyMimeType:

  1. Default property values are sufficient.

  2. Connect the 'success' relationship to CompressContent.

Tags: content, compress, decompress, gzip, bzip2, lzma, xz-lzma2, snappy, snappy-hadoop, snappy framed, lz4-framed, deflate, zstd, brotli

Properties

Mode

Indicates whether the processor should compress content or decompress content. Must be either 'compress' or 'decompress'

Compression Format

The compression format to use. Valid values are: GZIP, Deflate, ZSTD, BZIP2, XZ-LZMA2, LZMA, Brotli, Snappy, Snappy Hadoop, Snappy Framed, and LZ4-Framed

Compression Level

The compression level to use; this is valid only when using gzip, deflate or xz-lzma2 compression. A lower value results in faster processing but less compression; a value of 0 indicates no compression (that is, simple archiving) for gzip, or minimal compression for xz-lzma2. Higher levels can mean much larger memory usage, such as the case with levels 7-9 for xz-lzma/2, so be careful relative to heap size.

Update Filename

If true, will remove the filename extension when decompressing data (only if the extension indicates the appropriate compression format) and add the appropriate extension when compressing data

Relationships

  • success: FlowFiles will be transferred to the success relationship after successfully being compressed or decompressed

  • failure: FlowFiles will be transferred to the failure relationship if they fail to compress/decompress

Reads Attributes

  • mime.type: If the Compression Format is set to use mime.type attribute, this attribute is used to determine the compression type. Otherwise, this attribute is ignored.

Writes Attributes

  • mime.type: If the Mode property is set to compress, the appropriate MIME Type is set. If the Mode property is set to decompress and the file is successfully decompressed, this attribute is removed, as the MIME Type is no longer known.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • CPU: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

ConnectWebSocket

Acts as a WebSocket client endpoint to interact with a remote WebSocket server. As the WebSocket client configured with this processor receives messages from the remote WebSocket server, FlowFiles are transferred to downstream relationships according to the received message types. If a new FlowFile is passed to the processor, the previous sessions will be closed and any data being sent will be aborted.

Tags: subscribe, WebSocket, consume, listen

Properties

WebSocket Client ControllerService

A WebSocket CLIENT Controller Service which can connect to a WebSocket server.

WebSocket Client Id

The client ID to identify WebSocket session. It should be unique within the WebSocket Client Controller Service. Otherwise, it throws WebSocketConfigurationException when it gets started.

Relationships

  • success: FlowFile holding connection configuration attributes (like URL or HTTP headers) in case of successful connection

  • failure: FlowFile holding connection configuration attributes (like URL or HTTP headers) in case of connection failure

  • binary message: The WebSocket binary message output

  • connected: The WebSocket session is established

  • disconnected: The WebSocket session is disconnected

  • text message: The WebSocket text message output

Writes Attributes

  • websocket.controller.service.id: WebSocket Controller Service id.

  • websocket.session.id: Established WebSocket session id.

  • websocket.endpoint.id: WebSocket endpoint id.

  • websocket.local.address: WebSocket client address.

  • websocket.remote.address: WebSocket server address.

  • websocket.message.type: TEXT or BINARY.

Input Requirement

This component allows an incoming relationship.

Additional Details

Summary

This processor acts as a WebSocket client endpoint to interact with a remote WebSocket server. It is capable of receiving messages from a websocket server and it transfers them to downstream relationships according to the received message types.

The processor may have an incoming relationship, in which case FlowFile attributes are passed down to its WebSocket Client Service. This can be used to fine-tune the connection configuration (URL and headers, for example). For example, a “dynamic.url = currentValue” FlowFile attribute can be referenced in the WebSocket Client Service with the ${dynamic.url} expression.

You can define custom websocket headers in the incoming flowfile as additional attributes. The attribute key shall start with “header.” and continue with the header key. For example: “header.Authorization”. The attribute value will be the corresponding header value. If a new flowfile is passed to the processor, the previous sessions will be closed, and any data being sent will be aborted.

  1. header.Authorization | Basic base64UserNamePassWord

  2. header.Content-Type | application, audio, example

For multiple header values, provide a comma-separated list.
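
A hedged sketch of the incoming FlowFile attributes that could drive the connection configuration (names follow the conventions above; the URL and header values are hypothetical):

{
  "dynamic.url": "wss://example.com/socket",
  "header.Authorization": "Basic dXNlcjpwYXNz",
  "header.Content-Type": "application/json"
}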

ConsumeAMQP

Consumes AMQP Messages from an AMQP Broker using the AMQP 0.9.1 protocol. Each message that is received from the AMQP Broker will be emitted as its own FlowFile to the 'success' relationship.

Tags: amqp, rabbit, get, message, receive, consume

Properties

Queue

The name of the existing AMQP Queue from which messages will be consumed. Usually pre-defined by AMQP administrator.

Auto-Acknowledge Messages

If false (Non-Auto-Acknowledge), the messages will be acknowledged by the processor after transferring the FlowFiles to success and committing the NiFi session. Non-Auto-Acknowledge mode provides 'at-least-once' delivery semantics. If true (Auto-Acknowledge), messages that are delivered to the AMQP Client will be auto-acknowledged by the AMQP Broker just after sending them out. This generally will provide better throughput but will also result in messages being lost upon restart/crash of the AMQP Broker, NiFi or the processor. Auto-Acknowledge mode provides 'at-most-once' delivery semantics and it is recommended only if losing messages is acceptable.

Batch Size

The maximum number of messages that should be processed in a single session. Once this many messages have been received (or once no more messages are readily available), the messages received will be transferred to the 'success' relationship and the messages will be acknowledged to the AMQP Broker. Setting this value to a larger number could result in better performance, particularly for very small messages, but can also result in more messages being duplicated upon sudden restart of NiFi.

Prefetch Count

The maximum number of unacknowledged messages for the consumer. If the consumer has this number of unacknowledged messages, the AMQP broker will no longer send new messages until the consumer acknowledges some of the messages already delivered to it. Allowed values: 0 to 65535; 0 means no limit.

Header Output Format

Defines how to output headers from the received message

Header Key Prefix

Text to be prefixed to header keys as they are added to the FlowFile attributes. The processor will append '.' to the value of this property.

Header Separator

The character that is used to separate key-value for header in String. The value must be only one character.

Remove Curly Braces

If true, curly braces in the header will be automatically removed.

Brokers

A comma-separated list of known AMQP Brokers in the format <host>:<port> (e.g., localhost:5672). If this is set, Host Name and Port are ignored. Only include hosts from the same AMQP cluster.

Host Name

Network address of AMQP broker (e.g., localhost). If Brokers is set, then this property is ignored.

Port

Numeric value identifying Port of AMQP broker (e.g., 5671). If Brokers is set, then this property is ignored.

Virtual Host

Virtual Host name which segregates AMQP system for enhanced security.

User Name

User Name used for authentication and authorization.

Password

Password used for authentication and authorization.

AMQP Version

AMQP Version. Currently only supports AMQP v0.9.1.

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Use Client Certificate Authentication

Authenticate using the SSL certificate rather than user name/password.

Relationships

  • success: All FlowFiles that are received from the AMQP queue are routed to this relationship

Writes Attributes

  • amqp$appId: The App ID field from the AMQP Message

  • amqp$contentEncoding: The Content Encoding reported by the AMQP Message

  • amqp$contentType: The Content Type reported by the AMQP Message

  • amqp$headers: The headers present on the AMQP Message. Added only if processor is configured to output this attribute.

  • <Header Key Prefix>.<attribute>: Each message header will be inserted with this attribute name, if processor is configured to output headers as attribute

  • amqp$deliveryMode: The numeric indicator for the Message’s Delivery Mode

  • amqp$priority: The Message priority

  • amqp$correlationId: The Message’s Correlation ID

  • amqp$replyTo: The value of the Message’s Reply-To field

  • amqp$expiration: The Message Expiration

  • amqp$messageId: The unique ID of the Message

  • amqp$timestamp: The timestamp of the Message, as the number of milliseconds since epoch

  • amqp$type: The type of message

  • amqp$userId: The ID of the user

  • amqp$clusterId: The ID of the AMQP Cluster

  • amqp$routingKey: The routingKey of the AMQP Message

  • amqp$exchange: The exchange from which AMQP Message was received

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Summary

This processor consumes messages from an AMQP messaging queue and converts them to FlowFiles to be routed to the next component in the flow. At the time of writing this document, the supported AMQP protocol version is v0.9.1.

The component is based on the RabbitMQ Client API. The RabbitMQ guides and tutorials may also help you to brush up on some of the AMQP basics.

This processor constructs a FlowFile by extracting information from the consumed AMQP message (both body and attributes). Once a message is consumed, a FlowFile is constructed: the message body is written to the FlowFile content, and its com.rabbitmq.client.AMQP.BasicProperties are transferred into the FlowFile as attributes. AMQP attribute names are given the amqp$ prefix.

AMQP Properties

The following is the list of available standard AMQP properties which may come with the message: (“amqp$contentType”, “amqp$contentEncoding”, “amqp$headers”, “amqp$deliveryMode”, “amqp$priority”, “amqp$correlationId”, “amqp$replyTo”, “amqp$expiration”, “amqp$messageId”, “amqp$timestamp”, “amqp$type”, “amqp$userId”, “amqp$appId”, “amqp$clusterId”, “amqp$routingKey”)
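
A hedged example of how a few of these might appear as FlowFile attributes (names taken from the list above; values hypothetical):

{
  "amqp$contentType": "application/json",
  "amqp$deliveryMode": "2",
  "amqp$routingKey": "orders.created",
  "amqp$exchange": "orders"
}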

Configuration Details

At the time of writing this document, it only defines the essential configuration properties which are suitable for most cases. Other properties will be defined later as this component progresses. Configuring ConsumeAMQP:

  1. Queue - [REQUIRED] the name of AMQP queue the messages will be retrieved from. Usually provided by administrator (e.g., ‘amq.direct’)

  2. Host Name - [REQUIRED] the name of the host where AMQP broker is running. Usually provided by administrator (e.g., ‘myhost.com’). Defaults to ‘localhost’.

  3. Port - [REQUIRED] the port number where AMQP broker is running. Usually provided by the administrator (e.g., ‘2453’). Defaults to ‘5672’.

  4. User Name - [REQUIRED] user name to connect to AMQP broker. Usually provided by the administrator (e.g., ‘me’). Defaults to ‘guest’.

  5. Password - [REQUIRED] password to use with user name to connect to AMQP broker. Usually provided by the administrator. Defaults to ‘guest’.

  6. Use Certificate Authentication - [OPTIONAL] Use the SSL certificate common name for authentication rather than user name/password. This can only be used in conjunction with SSL. Defaults to ‘false’.

  7. Virtual Host - [OPTIONAL] Virtual Host name which segregates AMQP system for enhanced security. Please refer to this blog for more details on Virtual Host.

ConsumeAzureEventHub

Receives messages from Microsoft Azure Event Hubs with checkpointing to ensure consistent event processing. Checkpoint tracking avoids consuming a message multiple times and enables reliable resumption of processing in the event of intermittent network failures. Checkpoint tracking requires external storage and provides the preferred approach to consuming messages from Azure Event Hubs. In clustered environment, ConsumeAzureEventHub processor instances form a consumer group and the messages are distributed among the cluster nodes (each message is processed on one cluster node only).

Tags: azure, microsoft, cloud, eventhub, events, streaming, streams

Properties

Event Hub Namespace

The namespace that the Azure Event Hubs is assigned to. This is generally equal to <Event Hub Names>-ns.

Event Hub Name

The name of the event hub to pull messages from.

Service Bus Endpoint

To support namespaces not in the default windows.net domain.

Transport Type

Advanced Message Queuing Protocol Transport Type for communication with Azure Event Hubs

Shared Access Policy Name

The name of the shared access policy. This policy must have Listen claims.

Shared Access Policy Key

The key of the shared access policy. Either the primary or the secondary key can be used.

Use Azure Managed Identity

Choose whether or not to use the managed identity of Azure VM/VMSS

Consumer Group

The name of the consumer group to use.

Record Reader

The Record Reader to use for reading received messages. The event hub name can be referred by Expression Language '${eventhub.name}' to access a schema.

Record Writer

The Record Writer to use for serializing Records to an output FlowFile. The event hub name can be referred by Expression Language '${eventhub.name}' to access a schema. If not specified, each message will create a FlowFile.

Initial Offset

Specify where to start receiving messages if offset is not yet stored in the checkpoint store.

Prefetch Count

Batch Size

The number of messages to process within a NiFi session. This parameter affects throughput and consistency. NiFi commits its session and Event Hubs checkpoints after processing this number of messages. If the NiFi session is committed but fails to create an Event Hubs checkpoint, then it is possible that the same messages will be received again. The higher the number, the higher the throughput, but possibly the less consistent the processing.

Message Receive Timeout

The amount of time this consumer should wait to receive the Batch Size before returning.

Checkpoint Strategy

Specifies which strategy to use for storing and retrieving partition ownership and checkpoint information for each partition.

Storage Account Name

Name of the Azure Storage account to store event hub consumer group state.

Storage Account Key

The Azure Storage account key to store event hub consumer group state.

Storage SAS Token

The Azure Storage SAS token to store Event Hub consumer group state. Always starts with a ? character.

Storage Container Name

Name of the Azure Storage container to store the event hub consumer group state. If not specified, event hub name is used.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles received from Event Hub.

Writes Attributes

  • eventhub.enqueued.timestamp: The time (in milliseconds since epoch, UTC) at which the message was enqueued in the event hub

  • eventhub.offset: The offset into the partition at which the message was stored

  • eventhub.sequence: The sequence number associated with the message

  • eventhub.name: The name of the event hub from which the message was pulled

  • eventhub.partition: The name of the partition from which the message was pulled

  • eventhub.property.*: The application properties of this message. For example, an application property named 'application' would be written as 'eventhub.property.application'.

Stateful

Scope: Local, Cluster

Local state is used to store the client id. Cluster state is used to store partition ownership and checkpoint information when component state is configured as the checkpointing strategy.

Input Requirement

This component does not allow an incoming relationship.

ConsumeElasticsearch

A processor that repeatedly runs a paginated query against a field using a Range query to consume new Documents from an Elasticsearch index/query. The processor will retrieve multiple pages of results until either no more results are available or the Pagination Keep Alive expiration is reached, after which the Range query will automatically update the field constraint based on the last retrieved Document value.

Tags: elasticsearch, elasticsearch5, elasticsearch6, elasticsearch7, elasticsearch8, query, scroll, page, search, json

Properties

Range Query Field

Field to be tracked as part of an Elasticsearch Range query using a "gt" bound match. This field must exist within the Elasticsearch document for it to be retrieved.

Sort Order

The order in which to sort the "Range Query Field". A "sort" clause for the "Range Query Field" field will be prepended to any provided "Sort" clauses. If a "sort" clause already exists for the "Range Query Field" field, it will not be updated.

Initial Value

The initial value to use for the query if the processor has not run previously. If the processor has run previously and stored a value in its state, this property will be ignored. If no value is provided, and the processor has not previously run, no Range query bounds will be used, i.e. all documents will be retrieved in the specified "Sort Order".

Initial Value Date Format

If the "Range Query Field" is a Date field, convert the "Initial Value" to a date with this format. If not specified, Elasticsearch will use the date format provided by the "Range Query Field"'s mapping. For valid syntax, see https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html

Initial Value Date Time Zone

If the "Range Query Field" is a Date field, convert the "Initial Value" to UTC with this time zone. Valid values are ISO 8601 UTC offsets, such as "+01:00" or "-08:00", and IANA time zone IDs, such as "Europe/London".

Additional Filters

One or more query filters in JSON syntax, not Lucene syntax. Ex: [{"match":{"somefield":"somevalue"}}, {"match":{"anotherfield":"anothervalue"}}]. These filters will be used as part of a Bool query’s filter.

Size

The maximum number of documents to retrieve in the query. If the query is paginated, this "size" applies to each page of the query, not the "size" of the entire result set.

Sort

Sort results by one or more fields, in JSON syntax. Ex: [{"price" : {"order" : "asc", "mode" : "avg"}}, {"post_date" : {"format": "strict_date_optional_time_nanos"}}]

Aggregations

One or more query aggregations (or "aggs"), in JSON syntax. Ex: {"items": {"terms": {"field": "product", "size": 10}}}

Fields

Fields of indexed documents to be retrieved, in JSON syntax. Ex: ["user.id", "http.response.*", {"field": "@timestamp", "format": "epoch_millis"}]

Script Fields

Fields to be created using script evaluation at query runtime, in JSON syntax. Ex: {"test1": {"script": {"lang": "painless", "source": "doc['price'].value * 2"}}, "test2": {"script": {"lang": "painless", "source": "doc['price'].value * params.factor", "params": {"factor": 2.0}}}}

Query Attribute

If set, the executed query will be set on each result flowfile in the specified attribute.

Index

The name of the index to use.

Type

The type of this document (used by Elasticsearch for indexing and searching).

Max JSON Field String Length

The maximum allowed length of a string value when parsing a JSON document or attribute.

Client Service

An Elasticsearch client service to use for running queries.

Search Results Split

Output a flowfile containing all hits or one flowfile for each individual hit or one flowfile containing all hits from all paged responses.

Search Results Format

Format of Hits output.

Aggregation Results Split

Output a flowfile containing all aggregations or one flowfile for each individual aggregation.

Aggregation Results Format

Format of Aggregation output.

Output No Hits

Output a "hits" flowfile even if no hits found for query. If true, an empty "hits" flowfile will be output even if "aggregations" are output.

Pagination Type

Pagination method to use. Not all types are available for all Elasticsearch versions, check the Elasticsearch docs to confirm which are applicable and recommended for your service.

Pagination Keep Alive

Pagination "keep_alive" period. Period Elasticsearch will keep the scroll/pit cursor alive in between requests (this is not the time expected for all pages to be returned, but the maximum allowed time for requests between page retrievals).

Dynamic Properties

The name of a URL query parameter to add

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the query request body. For SCROLL type queries, these parameters are only used in the initial (first page) query as the Elasticsearch Scroll API does not support the same query parameters for subsequent pages of data.

Relationships

  • aggregations: Aggregations are routed to this relationship.

  • hits: Search hits are routed to this relationship.

Writes Attributes

  • mime.type: application/json

  • page.number: The number of the page (request), starting from 1, in which the results were returned that are in the output flowfile

  • hit.count: The number of hits that are in the output flowfile

  • elasticsearch.query.error: The error message provided by Elasticsearch if there is an error querying the index.

Stateful

Scope: Cluster

The pagination state (scrollId, searchAfter, pitId, hitCount, pageCount, pageExpirationTimestamp, trackingRangeValue) is retained in between invocations of this processor until the Scroll/PiT has expired (when the current time is later than the last query execution plus the Pagination Keep Alive interval).

Input Requirement

This component does not allow an incoming relationship.

System Resource Considerations

  • MEMORY: Care should be taken on the size of each page because each response from Elasticsearch will be loaded into memory all at once and converted into the resulting flowfiles.

Additional Details

This processor is intended for use with the Elasticsearch JSON DSL and Elasticsearch 5.X and newer. It is designed to be able to create a JSON query using input properties and execute it against an Elasticsearch cluster in a paginated manner. Like all processors in the “restapi” bundle, it uses the official Elastic client APIs, so it supports leader detection.

The query is paginated in Elasticsearch using one of the available methods - “Scroll” or “Search After” (optionally with a “Point in Time” for Elasticsearch 7.10+ with XPack enabled). The number of results per page can be controlled using the size property.

Results will be sorted on the field that is to be tracked, with the sort order set as a property.

Search results and aggregation results can be split up into multiple flowfiles. Aggregation results will only be split at the top level because nested aggregations lose their context (and thus lose their value) if separated from their parent aggregation. Additionally, the results from all pages can be combined into a single flowfile (but the processor will only load each page of data into memory at any one time).

The following is an example query that would be created for tracking a “@timestamp” field:

{
  "query": {
    "size": 10000,
    "sort": {
      "@timestamp": "desc"
    },
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gt": "2023-09-01"
            }
          }
        }
      ]
    }
  }
}

Additional “filter” entries can be added as a JSON string in the query filter property, for example:

[
  {
    "term": {
      "department": "accounts"
    }
  },
  {
    "term": {
      "title.keyword": "Annual Report"
    }
  }
]
Query Pagination Across Processor Executions

This processor runs on a schedule in order to execute the same query repeatedly. Once a paginated query has been initiated within Elasticsearch, this processor will continue to retrieve results for that same query until no further results are available. After that point, a new paginated query will be initiated using the same Query JSON, but with the “range” filter query’s “gt” field set to the last value obtained from previous results.

If the results are “Combined” from this processor, then the paginated query will run continually within a single invocation until no more results are available (then the processor will start a new paginated query upon its next invocation). If the results are “Split” or “Per Page”, then each invocation of this processor will retrieve the next page of results until either there are no more results or the paginated query expires within Elasticsearch.

Resetting Queries / Clearing Processor State

Cluster State is used to track the progress of a paginated query within this processor. If there is need to restart the query completely or change the processor configuration after a paginated query has already been started, be sure to “Clear State” of the processor once it has been stopped and before restarting.

Note that clearing the processor’s state will lose details of where the processor was up to with tracking documents retrieved. Update the “Initial Value” with the appropriate value before restarting the processor to continue with where it was up to.

Duplicate Results

This processor does not attempt to de-duplicate results between queries, for example if the same query runs twice and (some or all of) the results are identical, the output will contain these same results for both invocations. This might happen if the NiFi Primary Node changes while a page of data is being retrieved, or if the processor state is cleared, then the processor is restarted.

ConsumeGCPubSub

Consumes messages from the configured Google Cloud PubSub subscription. If the 'Batch Size' is set, the configured number of messages will be pulled in a single request, else only one message will be pulled.

Tags: google, google-cloud, gcp, message, pubsub, consume

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Project ID

Google Cloud Project ID

Subscription

Name of the Google Cloud Pub/Sub Subscription

Batch Size Threshold

Indicates the number of messages the cloud service should bundle together in a batch. If not set and left empty, only one message will be used in a batch

API Endpoint

Override the gRPC endpoint in the form of [host:port]

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to this relationship after a successful Google Cloud Pub/Sub operation.

Writes Attributes

  • gcp.pubsub.ackId: Acknowledgement Id of the consumed Google Cloud PubSub message

  • gcp.pubsub.messageSize: Serialized size of the consumed Google Cloud PubSub message

  • gcp.pubsub.attributesCount: Number of attributes the consumed PubSub message has, if any

  • gcp.pubsub.publishTime: Timestamp value when the message was published

  • Dynamic Attributes: Other than the listed attributes, this processor may write zero or more attributes, if the original Google Cloud Publisher client added any attributes to the message while sending

Input Requirement

This component does not allow an incoming relationship.

ConsumeIMAP

Consumes messages from Email Server using IMAP protocol. The raw bytes of each received email message are written as the contents of the FlowFile.

Tags: Email, Imap, Get, Ingest, Ingress, Message, Consume

Properties

Host Name

Network address of Email server (e.g., pop.gmail.com, imap.gmail.com, etc.)

Port

Numeric value identifying Port of Email server (e.g., 993)

Authorization Mode

How to authorize sending email on the user’s behalf.

OAuth2 Access Token Provider

OAuth2 service that can provide access tokens.

User Name

User Name used for authentication and authorization with Email server.

Password

Password used for authentication and authorization with Email server.

Folder

Email folder to retrieve messages from (e.g., INBOX)

Fetch Size

Specify the maximum number of Messages to fetch per call to Email Server.

Delete Messages

Specify whether mail messages should be deleted after retrieval.

Connection timeout

The amount of time to wait to connect to Email server

Mark Messages as Read

Specify if messages should be marked as read after retrieval.

Use SSL

Specifies if IMAP connection must be obtained via SSL encrypted connection (i.e., IMAPS)

Relationships

  • success: All messages that are successfully received from the Email server and converted to FlowFiles are routed to this relationship

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Description:

This Processor consumes email messages via IMAP protocol and sends the content of an email message as content of the Flow File. Content of the incoming email message is written as raw bytes to the content of the outgoing Flow File.

Different email providers may require additional Java Mail properties which could be provided as dynamic properties. For example, below is a sample configuration for GMail:

Processor’s static properties:

  • Host Name - imap.gmail.com

  • Port - 993

  • User Name - [your user name]

  • Password - [your password]

  • Folder - INBOX

Processor’s dynamic properties:

  • mail.imap.socketFactory.class - javax.net.ssl.SSLSocketFactory

  • mail.imap.socketFactory.fallback - false

  • mail.store.protocol - imaps

Another useful property is mail.debug which allows Java Mail API to print protocol messages to the console helping you to both understand what’s going on and debug issues.

For the full list of available Java Mail properties, please refer to the Java Mail API documentation.

ConsumeJMS

Consumes JMS Message of type BytesMessage, TextMessage, ObjectMessage, MapMessage or StreamMessage transforming its content to a FlowFile and transitioning it to 'success' relationship. JMS attributes such as headers and properties will be copied as FlowFile attributes. MapMessages will be transformed into JSONs and then into byte arrays. The other types will have their raw contents as byte array transferred into the flowfile.

Tags: jms, get, message, receive, consume

Properties

Connection Factory Service

The Controller Service that is used to obtain Connection Factory. Alternatively, the 'JNDI *' or the 'JMS *' properties can also be used to configure the Connection Factory.

Destination Name

The name of the JMS Destination. Usually provided by the administrator (e.g., 'topic://myTopic' or 'myTopic').

Destination Type

The type of the JMS Destination. Could be one of 'QUEUE' or 'TOPIC'. Usually provided by the administrator. Defaults to 'QUEUE'

Message Selector

The JMS Message Selector to filter the messages that the processor will receive

User Name

User Name used for authentication and authorization.

Password

Password used for authentication and authorization.

Connection Client ID

The client id to be set on the connection, if set. For a durable non-shared consumer this is mandatory; for all others it is optional, and typically with shared consumers it is undesirable to set it. Please see the JMS spec for further details.

Character Set

The name of the character set to use to construct or interpret TextMessages

Acknowledgement Mode

The JMS Acknowledgement Mode. Using Auto Acknowledge can cause messages to be lost on restart of NiFi but may provide better performance than Client Acknowledge.

Durable Subscription

If the destination is a Topic and this property is set, the consumer will be made durable. @see https://jakarta.ee/specifications/platform/9/apidocs/jakarta/jms/session#createDurableConsumer-jakarta.jms.Topic-java.lang.String-

Shared Subscription

If the destination is a Topic, makes the consumer shared. See https://jakarta.ee/specifications/platform/9/apidocs/jakarta/jms/session#createSharedConsumer-jakarta.jms.Topic-java.lang.String-

Subscription Name

The name of the subscription to use if destination is Topic and is shared or durable.

Timeout

How long to wait to consume a message from the remote broker before giving up.

Maximum Batch Size

The maximum number of messages to publish or consume in each invocation of the processor.

Error Queue Name

The name of a JMS Queue where, if set, unprocessed messages will be routed. Usually provided by the administrator (e.g., 'queue://myErrorQueue' or 'myErrorQueue'). Only applicable if 'Destination Type' is set to 'QUEUE'.

Record Reader

The Record Reader to use for parsing received JMS Messages into Records.

Record Writer

The Record Writer to use for serializing Records before writing them to a FlowFile.

Output Strategy

The format used to output the JMS message into a FlowFile record.

JNDI Initial Context Factory Class

The fully qualified class name of the JNDI Initial Context Factory Class (java.naming.factory.initial).

JNDI Provider URL

The URL of the JNDI Provider to use as the value for java.naming.provider.url. See additional details documentation for allowed URL schemes.

JNDI Name of the Connection Factory

The name of the JNDI Object to lookup for the Connection Factory.

JNDI / JMS Client Libraries

Specifies jar files and/or directories to add to the ClassPath in order to load the JNDI / JMS client libraries. This should be a comma-separated list of files, directories, and/or URLs. If a directory is given, any files in that directory will be included, but subdirectories will not be included (i.e., it is not recursive).

JNDI Principal

The Principal to use when authenticating with JNDI (java.naming.security.principal).

JNDI Credentials

The Credentials to use when authenticating with JNDI (java.naming.security.credentials).

JMS Connection Factory Implementation Class

The fully qualified name of the JMS ConnectionFactory implementation class (e.g., org.apache.activemq.ActiveMQConnectionFactory).

JMS Client Libraries

Path to the directory with additional resources (e.g., JARs, configuration files, etc.) to be added to the classpath (defined as a comma-separated list of values). Such resources typically represent target JMS client libraries for the ConnectionFactory implementation.

JMS Broker URI

URI pointing to the network location of the JMS Message broker. Example for ActiveMQ: 'tcp://myhost:61616'. Examples for IBM MQ: 'myhost(1414)' and 'myhost01(1414),myhost02(1414)'.

JMS SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Dynamic Properties

The name of a Connection Factory configuration property.

Additional configuration property for the Connection Factory. It can be used when the Connection Factory is being configured via the 'JNDI *' or the 'JMS *' properties of the processor. For more information, see the Additional Details page.

Relationships

  • success: All FlowFiles that are received from the JMS Destination are routed to this relationship

  • parse.failure: If a message cannot be parsed using the configured Record Reader, the contents of the message will be routed to this Relationship as its own individual FlowFile.

Writes Attributes

  • jms_deliveryMode: The JMSDeliveryMode from the message header.

  • jms_expiration: The JMSExpiration from the message header.

  • jms_priority: The JMSPriority from the message header.

  • jms_redelivered: The JMSRedelivered from the message header.

  • jms_timestamp: The JMSTimestamp from the message header.

  • jms_correlationId: The JMSCorrelationID from the message header.

  • jms_messageId: The JMSMessageID from the message header.

  • jms_type: The JMSType from the message header.

  • jms_replyTo: The JMSReplyTo from the message header.

  • jms_destination: The JMSDestination from the message header.

  • jms.messagetype: The JMS message type; can be TextMessage, BytesMessage, ObjectMessage, MapMessage or StreamMessage.

  • other attributes: Each message property is written to an attribute.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Summary

This processor consumes messages from a JMS-compliant messaging system and converts them to FlowFiles to be routed to the next component in the flow.

This processor does two things: it constructs a FlowFile by extracting information from the consumed JMS message, including the body and the standard JMS Headers and Properties. The message body is written to the FlowFile content, while the standard JMS Headers and Properties are set as FlowFile attributes.
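
As a rough illustration of that mapping, and not of the processor's actual implementation, the sketch below consumes a single message with the plain JMS API and reads the standard headers that ConsumeJMS writes as FlowFile attributes. The ActiveMQ connection factory, broker URL, credentials and queue name are assumptions made for the example:

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class JmsConsumeSketch {
    public static void main(String[] args) throws Exception {
        // Assumed broker URI, credentials and destination, analogous to the processor properties
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://myhost:61616");
        Connection connection = factory.createConnection("user", "password");
        connection.start();

        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue("myQueue"));

        Message message = consumer.receive(10_000); // wait up to the configured Timeout
        if (message != null) {
            if (message instanceof TextMessage) {
                // body -> FlowFile content
                System.out.println(((TextMessage) message).getText());
            }
            // standard headers -> FlowFile attributes such as jms_messageId and jms_timestamp
            System.out.println(message.getJMSMessageID() + " @ " + message.getJMSTimestamp());
            message.acknowledge(); // Client Acknowledge mode
        }

        connection.close();
    }
}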

Configuration Details

At the time of writing, this document only covers the essential configuration properties, which are suitable for most cases. Other properties will be documented as this component evolves. Configuring ConsumeJMS:

  1. User Name - [OPTIONAL] User Name used for authentication and authorization when this processor obtains javax.jms.Connection from the pre-configured javax.jms.ConnectionFactory (see below).

  2. Password - [OPTIONAL] Password used in conjunction with User Name.

  3. Destination Name - [REQUIRED] the name of the javax.jms.Destination. Usually provided by the administrator (e.g., ‘topic://myTopic’).

  4. Destination Type - [REQUIRED] the type of the javax.jms.Destination. Could be one of ‘QUEUE’ or ‘TOPIC’. Usually provided by the administrator. Defaults to ‘QUEUE’.

Connection Factory Configuration

There are multiple ways to configure the Connection Factory for the processor:

  • Connection Factory Service property - link to a pre-configured controller service (JndiJmsConnectionFactoryProvider or JMSConnectionFactoryProvider)

  • JNDI * properties - processor level configuration, the properties are the same as the properties of JndiJmsConnectionFactoryProvider controller service, the dynamic properties can also be used in this case

  • JMS * properties - processor level configuration, the properties are the same as the properties of JMSConnectionFactoryProvider controller service, the dynamic properties can also be used in this case

The preferred way is to use the Connection Factory Service property and a pre-configured controller service. It is also the most convenient method, because it is enough to configure the controller service once, and then it can be used in multiple processors.

However, some JMS client libraries may not work with the controller services due to incompatible Java ClassLoader handling between the 3rd party JMS client library and NiFi. Should you encounter java.lang.ClassCastException errors when using the controller services, please try to configure the Connection Factory via the ‘JNDI *’ or the ‘JMS *’ properties and the dynamic properties of the processor. For more details on these properties, see the documentation of the corresponding controller service (JndiJmsConnectionFactoryProvider for ‘JNDI *’ and JMSConnectionFactoryProvider for ‘JMS *’).
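
For reference, the ‘JNDI *’ properties correspond roughly to the standalone JNDI lookup sketched below. The context factory class, provider URL, credentials and object name are placeholder values and depend entirely on your JMS provider:

import java.util.Properties;
import javax.jms.ConnectionFactory;
import javax.naming.Context;
import javax.naming.InitialContext;

public class JndiLookupSketch {
    public static void main(String[] args) throws Exception {
        Properties env = new Properties();
        // JNDI Initial Context Factory Class (java.naming.factory.initial) - placeholder value
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.apache.activemq.jndi.ActiveMQInitialContextFactory");
        // JNDI Provider URL (java.naming.provider.url) - placeholder value
        env.put(Context.PROVIDER_URL, "tcp://myhost:61616");
        // JNDI Principal / JNDI Credentials, if the JNDI provider requires authentication
        env.put(Context.SECURITY_PRINCIPAL, "user");
        env.put(Context.SECURITY_CREDENTIALS, "password");

        Context context = new InitialContext(env);
        // JNDI Name of the Connection Factory - placeholder value
        ConnectionFactory connectionFactory = (ConnectionFactory) context.lookup("ConnectionFactory");
        System.out.println("Looked up: " + connectionFactory);
        context.close();
    }
}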

ConsumeKafka

Consumes messages from Apache Kafka using the Kafka Consumer API. The complementary NiFi processor for sending messages is PublishKafka. The Processor supports consumption of Kafka messages, optionally interpreted as NiFi records. Please note that, at this time (in read record mode), the Processor assumes that all records that are retrieved from a given partition have the same schema. For this mode, if any of the Kafka messages are pulled but cannot be parsed or written with the configured Record Reader or Record Writer, the contents of the message will be written to a separate FlowFile, and that FlowFile will be transferred to the 'parse.failure' relationship. Otherwise, each FlowFile is sent to the 'success' relationship and may contain many individual messages within the single FlowFile. A 'record.count' attribute is added to indicate how many messages are contained in the FlowFile. No two Kafka messages will be placed into the same FlowFile if they have different schemas, or if they have different values for a message header that is included by the <Headers to Add as Attributes> property.

Tags: Kafka, Get, Record, csv, avro, json, Ingest, Ingress, Topic, PubSub, Consume

Properties

Kafka Connection Service

Provides connections to the Kafka Broker for consuming Kafka Records

Group ID

Kafka Consumer Group Identifier corresponding to Kafka group.id property

Topic Format

Specifies whether the Topics provided are a comma separated list of names or a single regular expression

Topics

The name or pattern of the Kafka Topics from which the Processor consumes Kafka Records. More than one can be supplied if comma separated.

Auto Offset Reset

Automatic offset configuration applied when no previous consumer offset found corresponding to Kafka auto.offset.reset property

Commit Offsets

Specifies whether this Processor should commit the offsets to Kafka after receiving messages. Typically, this value should be set to true so that messages that are received are not duplicated. However, in certain scenarios, we may want to avoid committing the offsets so that the data can be processed and later acknowledged by PublishKafka in order to provide Exactly Once semantics. The sketch at the end of this Properties list shows how offset commits map onto the plain Kafka Consumer API.

Max Uncommitted Time

Specifies the maximum amount of time allowed to pass before offsets must be committed. This value impacts how often offsets will be committed. Committing offsets less often increases throughput but also increases the window of potential data duplication in the event of a rebalance or JVM restart between commits. This value is also related to maximum poll records and the use of a message demarcator. When using a message demarcator, far more messages can remain uncommitted than when one is not used, because there is much less to keep track of in memory.

Header Name Pattern

Regular Expression Pattern applied to Kafka Record Header Names for selecting Header Values to be written as FlowFile attributes

Header Encoding

Character encoding applied when reading Kafka Record Header values and writing FlowFile attributes

Processing Strategy

Strategy for processing Kafka Records and writing serialized output to FlowFiles

Record Reader

The Record Reader to use for incoming Kafka messages

Record Writer

The Record Writer to use in order to serialize the outgoing FlowFiles

Output Strategy

The format used to output the Kafka Record into a FlowFile Record.

Key Attribute Encoding

Encoding for value of configured FlowFile attribute containing Kafka Record Key.

Key Format

Specifies how to represent the Kafka Record Key in the output FlowFile

Key Record Reader

The Record Reader to use for parsing the Kafka Record Key into a Record

Message Demarcator

Since KafkaConsumer receives messages in batches, this Processor has an option to output FlowFiles that contain all Kafka messages in a single batch for a given topic and partition. This property allows you to provide a string (interpreted as UTF-8) to use for demarcating the individual Kafka messages. This is an optional property; if not provided, each Kafka message received will result in a single FlowFile each time the Processor is triggered. To enter a special character such as 'new line', use CTRL+Enter or Shift+Enter, depending on the OS.

Separate By Key

When this property is enabled, two messages will only be added to the same FlowFile if both of the Kafka Messages have identical keys.
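
The sketch below, referenced from the Commit Offsets property above, illustrates how Group ID, Auto Offset Reset, Topics and explicit offset commits map onto the plain Apache Kafka Consumer API. It is only an illustration under an assumed broker address and topic name, not the processor's internal implementation:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaConsumeSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "my-consumer-group");       // Group ID property
        props.put("auto.offset.reset", "earliest");       // Auto Offset Reset property
        props.put("enable.auto.commit", "false");         // offsets are committed explicitly below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));      // Topics property (assumed topic name)
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<byte[], byte[]> record : records) {
                // topic, partition and offset become the kafka.* FlowFile attributes
                System.out.printf("%s-%d@%d%n", record.topic(), record.partition(), record.offset());
            }
            consumer.commitSync(); // analogous to Commit Offsets = true
        }
    }
}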

Relationships

  • success: FlowFiles containing one or more serialized Kafka Records

Writes Attributes

  • record.count: The number of records received

  • mime.type: The MIME Type that is provided by the configured Record Writer

  • kafka.count: The number of messages written if more than one

  • kafka.key: The key of message if present and if single message. How the key is encoded depends on the value of the 'Key Attribute Encoding' property.

  • kafka.offset: The offset of the message in the partition of the topic.

  • kafka.timestamp: The timestamp of the message in the partition of the topic.

  • kafka.partition: The partition of the topic the message or message bundle is from

  • kafka.topic: The topic the message or message bundle is from

  • kafka.tombstone: Set to true if the consumed message is a tombstone message

Input Requirement

This component does not allow an incoming relationship.

See Also

ConsumeKinesisStream

Reads data from the specified AWS Kinesis stream and outputs a FlowFile for every processed Record (raw) or a FlowFile for a batch of processed records if a Record Reader and Record Writer are configured. At-least-once delivery of all Kinesis Records within the Stream while the processor is running. AWS Kinesis Client Library can take several seconds to initialise before starting to fetch data. Uses DynamoDB for check pointing and CloudWatch (optional) for metrics. Ensure that the credentials provided have access to DynamoDB and CloudWatch (optional) along with Kinesis.

Tags: amazon, aws, kinesis, consume, stream

Properties

Amazon Kinesis Stream Name

The name of Kinesis Stream

Application Name

The Kinesis stream reader application name.

Record Reader

The Record Reader to use for reading received messages. The Kinesis Stream name can be referred to by Expression Language '${kinesis.name}' to access a schema. If Record Reader/Writer are not specified, each Kinesis Record will create a FlowFile.

Record Writer

The Record Writer to use for serializing Records to an output FlowFile. The Kinesis Stream name can be referred to by Expression Language '${kinesis.name}' to access a schema. If Record Reader/Writer are not specified, each Kinesis Record will create a FlowFile.

Region

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

DynamoDB Override

DynamoDB override to use non-AWS deployments

Initial Stream Position

Initial position to read Kinesis streams.

Stream Position Timestamp

Timestamp position in stream from which to start reading Kinesis Records. Required if Initial Stream Position is AT_TIMESTAMP. Uses the Timestamp Format to parse the value into a Date.

Timestamp Format

Format to use for parsing the Stream Position Timestamp into a Date and converting the Kinesis Record’s Approximate Arrival Timestamp into a FlowFile attribute.

Failover Timeout

Kinesis Client Library failover timeout

Graceful Shutdown Timeout

Kinesis Client Library graceful shutdown timeout

Checkpoint Interval

Interval between Kinesis checkpoints

Retry Count

Number of times to retry a Kinesis operation (process record, checkpoint, shutdown)

Retry Wait

Interval between Kinesis operation retries (process record, checkpoint, shutdown)

Report Metrics to CloudWatch

Whether to report Kinesis usage metrics to CloudWatch.

Communications Timeout

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Dynamic Properties

Kinesis Client Library (KCL) Configuration property name

Override default KCL Configuration ConfigsBuilder properties with required values. Supports setting of values directly on the ConfigsBuilder, such as 'namespace', as well as properties on nested builders. For example, to set configsBuilder.retrievalConfig().maxListShardsRetryAttempts(value), name the property as 'retrievalConfig.maxListShardsRetryAttempts'. Only supports setting of simple property values, e.g. String, int, long and boolean. Does not allow override of KCL Configuration settings handled by non-dynamic processor properties.

Relationships

  • success: FlowFiles are routed to success relationship

Writes Attributes

  • aws.kinesis.partition.key: Partition key of the (last) Kinesis Record read from the Shard

  • aws.kinesis.shard.id: Shard ID from which the Kinesis Record was read

  • aws.kinesis.sequence.number: The unique identifier of the (last) Kinesis Record within its Shard

  • aws.kinesis.approximate.arrival.timestamp: Approximate arrival timestamp of the (last) Kinesis Record read from the stream

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer (if configured)

  • record.count: Number of records written to the FlowFiles by the Record Writer (if configured)

  • record.error.message: This attribute provides on failure the error message encountered by the Record Reader or Record Writer (if configured)

Input Requirement

This component does not allow an incoming relationship.

System Resource Considerations

  • CPU: Kinesis Client Library is used to create a Worker thread for consumption of Kinesis Records. The Worker is initialised and started when this Processor has been triggered. It runs continually, spawning Kinesis Record Processors as required to fetch Kinesis Records. The Worker Thread (and any child Record Processor threads) are not controlled by the normal NiFi scheduler as part of the Concurrent Thread pool and are not released until this processor is stopped.

  • NETWORK: Kinesis Client Library will continually poll for new Records, requesting up to a maximum number of Records/bytes per call. This can result in sustained network usage.

See Also

Additional Details

Streaming Versus Batch Processing

ConsumeKinesisStream retrieves all Kinesis Records that it encounters in the configured Kinesis Stream. There are two common, broadly defined use cases.

Per-Message Use Case

By default, the Processor will create a separate FlowFile for each Kinesis Record (message) in the Stream and add attributes for shard id, sequence number, etc.

Per-Batch Use Case

Another common use case is the desire to process all Kinesis Records retrieved from the Stream in a batch as a single FlowFile.

The ConsumeKinesisStream Processor can optionally be configured with a Record Reader and Record Writer. When a Record Reader and Record Writer are configured, a single FlowFile will be created that will contain a Record for each Record within the batch of Kinesis Records (messages), instead of a separate FlowFile per Kinesis Record.

The FlowFiles emitted in this mode will include the standard record.* attributes along with the same Kinesis Shard ID, Sequence Number and Approximate Arrival Timestamp; but the values will relate to the last Kinesis Record that was processed in the batch of messages constituting the content of the FlowFile.

ConsumeMQTT

Subscribes to a topic and receives messages from an MQTT broker

Tags: subscribe, MQTT, IOT, consume, listen

Properties

Broker URI

The URI(s) to use to connect to the MQTT broker (e.g., tcp://localhost:1883). The 'tcp', 'ssl', 'ws' and 'wss' schemes are supported. In order to use 'ssl', the SSL Context Service property must be set. When a comma-separated URI list is set (e.g., tcp://localhost:1883,tcp://localhost:1884), the processor will use a round-robin algorithm to connect to the brokers on connection failure.

MQTT Specification Version

The MQTT specification version when connecting with the broker. See the allowable value descriptions for more details.

Username

Username to use when connecting to the broker

Password

Password to use when connecting to the broker

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Session state

Whether to start a fresh session or resume previous flows. See the allowable value descriptions for more details.

Session Expiry Interval

After this interval the broker will expire the client and clear the session state.

Client ID

MQTT client ID to use. If not set, a UUID will be generated.

Group ID

MQTT consumer group ID to use. If the group ID is not set, the client will connect as an individual consumer.

Topic Filter

The MQTT topic filter to designate the topics to subscribe to.

Quality of Service (QoS)

The Quality of Service (QoS) to receive the message with. Accepts values '0', '1' or '2'; '0' for 'at most once', '1' for 'at least once', '2' for 'exactly once'.

Record Reader

The Record Reader to use for parsing received MQTT Messages into Records.

Record Writer

The Record Writer to use for serializing Records before writing them to a FlowFile.

Add attributes as fields

If this property is set to true, the following fields will be added to each record: _topic, _qos, _isDuplicate, _isRetained.

Message Demarcator

With this property, you have the option to output FlowFiles that contain multiple messages. This property allows you to provide a string (interpreted as UTF-8) to use for demarcating the individual messages. This is an optional property; if not provided, and if no Record Reader/Writer is defined, each message received will result in a single FlowFile. To enter a special character such as 'new line', use CTRL+Enter or Shift+Enter, depending on the OS.

Connection Timeout (seconds)

Maximum time interval the client will wait for the network connection to the MQTT server to be established. The default timeout is 30 seconds. A value of 0 disables timeout processing meaning the client will wait until the network connection is made successfully or fails.

Keep Alive Interval (seconds)

Defines the maximum time interval between messages sent or received. It enables the client to detect if the server is no longer available, without having to wait for the TCP/IP timeout. The client will ensure that at least one message travels across the network within each keep alive period. In the absence of a data-related message during the time period, the client sends a very small "ping" message, which the server will acknowledge. A value of 0 disables keepalive processing in the client.

Last Will Message

The message to send as the client’s Last Will.

Last Will Topic

The topic to send the client’s Last Will to.

Last Will Retain

Whether to retain the client’s Last Will.

Last Will QoS Level

QoS level to be used when publishing the Last Will Message.

Max Queue Size

MQTT messages are always sent to subscribers on a topic regardless of how frequently the processor is scheduled to run. If the 'Run Schedule' is significantly behind the rate at which messages are arriving at this processor, a backlog can build up in the processor's internal queue. This property specifies the maximum number of messages this processor will hold in memory at one time in the internal queue. This data would be lost in the case of a NiFi restart.

Relationships

  • Message: The MQTT message output

  • parse.failure: If a message cannot be parsed using the configured Record Reader, the contents of the message will be routed to this Relationship as its own individual FlowFile.

Writes Attributes

  • record.count: The number of records received

  • mqtt.broker: MQTT broker that was the message source

  • mqtt.topic: MQTT topic on which message was received

  • mqtt.qos: The quality of service for this message.

  • mqtt.isDuplicate: Whether or not this message might be a duplicate of one which has already been received.

  • mqtt.isRetained: Whether or not this message was from a current publisher, or was "retained" by the server as the last message published on the topic.

Input Requirement

This component does not allow an incoming relationship.

System Resource Considerations

  • MEMORY: The 'Max Queue Size' specifies the maximum number of messages that can be held in memory by a single instance of this processor. A high value for this property could represent a lot of data being stored in memory.

See Also

Additional Details

The MQTT messages are always being sent to subscribers on a topic regardless of how frequently the processor is scheduled to run. If the ‘Run Schedule’ is significantly behind the rate at which the messages are arriving to this processor, then a back-up can occur in the internal queue of this processor. Each time the processor is scheduled, the messages in the internal queue will be written to FlowFiles. In case the internal queue is full, the MQTT client will try for up to 1 second to add the message into the internal queue. If the internal queue is still full after this time, an exception saying that ‘The subscriber queue is full’ would be thrown, the message would be dropped and the client would be disconnected. In case the QoS property is set to 0, the message would be lost. In case the QoS property is set to 1 or 2, the message will be received after the client reconnects.
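
For illustration, a standalone subscription along the lines described above might look like the sketch below, which uses the Eclipse Paho Java client. The broker URI, topic filter and timings are assumptions made for the example and do not describe the processor's internal client handling:

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;

public class MqttSubscribeSketch {
    public static void main(String[] args) throws Exception {
        // Assumed Broker URI and generated Client ID, analogous to the processor properties
        MqttClient client = new MqttClient("tcp://localhost:1883", MqttClient.generateClientId());

        MqttConnectOptions options = new MqttConnectOptions();
        options.setCleanSession(true);    // roughly a "start fresh" session state
        options.setConnectionTimeout(30); // Connection Timeout (seconds)
        options.setKeepAliveInterval(60); // Keep Alive Interval (seconds)
        client.connect(options);

        // Topic Filter and QoS; the callback fires for every matching message, which is why
        // ConsumeMQTT buffers incoming messages in an internal queue between scheduled runs
        client.subscribe("nifi/topic/#", 1, (topic, message) ->
                System.out.println(topic + ": " + new String(message.getPayload())));

        Thread.sleep(10_000); // receive for a while, then shut down
        client.disconnect();
        client.close();
    }
}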

ConsumePOP3

Consumes messages from Email Server using POP3 protocol. The raw-bytes of each received email message are written as contents of the FlowFile

Tags: Email, POP3, Get, Ingest, Ingress, Message, Consume

Properties

Host Name

Network address of Email server (e.g., pop.gmail.com, imap.gmail.com)

Port

Numeric value identifying Port of Email server (e.g., 993)

Authorization Mode

How to authorize sending email on the user’s behalf.

OAuth2 Access Token Provider

OAuth2 service that can provide access tokens.

User Name

User Name used for authentication and authorization with Email server.

Password

Password used for authentication and authorization with Email server.

Folder

Email folder to retrieve messages from (e.g., INBOX)

Fetch Size

Specify the maximum number of Messages to fetch per call to Email Server.

Delete Messages

Specify whether mail messages should be deleted after retrieval.

Connection timeout

The amount of time to wait to connect to Email server

Relationships

  • success: All messages that are successfully received from the Email server and converted to FlowFiles are routed to this relationship

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Description:

This Processor consumes email messages via the POP3 protocol and sends the content of each email message as the content of a FlowFile. The content of the incoming email message is written as raw bytes to the content of the outgoing FlowFile.

Since different servers may require different Java Mail properties, such properties can be provided via dynamic properties. For example, below is a sample configuration for GMail:

Processor’s static properties:

  • Host Name - pop.gmail.com

  • Port - 995

  • User Name - [your user name]

  • Password - [your password]

  • Folder - INBOX

Processor’s dynamic properties:

  • mail.pop3.socketFactory.class - javax.net.ssl.SSLSocketFactory

  • mail.pop3.socketFactory.fallback - false

Another useful property is mail.debug, which allows the Java Mail API to print protocol messages to the console, helping you both to understand what is going on and to debug issues.

For the full list of available Java Mail properties, please refer to the Java Mail API documentation.

ConsumeSlack

Retrieves messages from one or more configured Slack channels. The messages are written out in JSON format. See Usage / Additional Details for more information about how to configure this Processor and enable it to retrieve messages from Slack.

Tags: slack, conversation, conversation.history, social media, team, text, unstructured

Properties

Channels

A comma-separated list of Slack Channels to Retrieve Messages From. Each element in the list may be either a Channel ID, such as C0L9VCD47, or (for public channels only) the name of a channel, prefixed with a # sign, such as #general. If any channel name is provided instead of an ID, the Access Token provided must be granted the channels:read scope in order to resolve the Channel ID. See the Processor’s Additional Details for information on how to find a Channel ID.

Access Token

OAuth Access Token used for authenticating/authorizing the Slack request sent by NiFi. This may be either a User Token or a Bot Token. It must be granted the channels:history, groups:history, im:history, or mpim:history scope, depending on the type of conversation being used.

Reply Monitor Window

After consuming all messages in a given channel, this Processor will periodically poll all "threaded messages", aka Replies, whose timestamp is between now and this amount of time in the past in order to check for any new replies. Setting this value to a larger value may result in additional resource use and may result in Rate Limiting. However, if a user replies to an old thread that was started outside of this window, the reply may not be captured.

Reply Monitor Frequency

After consuming all messages in a given channel, this Processor will periodically poll all "threaded messages", aka Replies, whose timestamp falls between now and the amount of time specified by the <Reply Monitor Window> property. This property determines how frequently those messages are polled. Setting the value to a shorter duration may result in replies to messages being captured more quickly, providing a lower latency. However, it will also result in additional resource use and could trigger Rate Limiting to occur.

Batch Size

The maximum number of messages to retrieve in a single request to Slack. The entire response will be parsed into memory, so it is important that this be kept in mind when setting this value.

Resolve Usernames

Specifies whether or not User IDs should be resolved to usernames. By default, Slack Messages provide the ID of the user that sends a message, such as U0123456789, but not the username, such as NiFiUser. The username may be resolved, but it may require additional calls to the Slack API and requires that the Token used be granted the users:read scope. If set to true, usernames will be resolved with a best-effort policy: if a username cannot be obtained, it will be skipped over. Also, note that when a username is obtained, the Message’s <username> field is populated, and the <text> field is updated such that any mention will be output such as "Hi @user" instead of "Hi <@U1234567>".

Include Message Blocks

Specifies whether or not the output JSON should include the value of the 'blocks' field for each Slack Message. This field includes information such as individual parts of a message that are formatted using rich text. This may be useful, for instance, for parsing. However, it often accounts for a significant portion of the data and as such may be set to null when it is not useful to you.

Include Null Fields

Specifies whether or not fields that have null values should be included in the output JSON. If true, any field in a Slack Message that has a null value will be included in the JSON with a value of null. If false, the key is omitted from the output JSON entirely. Omitting null values results in smaller messages that are generally more efficient to process, but including the values may provide a better understanding of the format, especially for schema inference.

Relationships

  • success: Slack messages that are successfully received will be routed to this relationship

Writes Attributes

  • slack.channel.id: The ID of the Slack Channel from which the messages were retrieved

  • slack.message.count: The number of slack messages that are included in the FlowFile

  • mime.type: Set to application/json, as the output will always be in JSON format

Stateful

Scope: Cluster

Maintains a mapping of Slack Channel IDs to the timestamp of the last message that was retrieved for that channel. This allows the processor to only retrieve messages that have been posted since the last time the processor was run. This state is stored in the cluster so that if the Primary Node changes, the new node will pick up where the previous node left off.

Input Requirement

This component does not allow an incoming relationship.

See Also

Additional Details

Description:

ConsumeSlack allows for receiving messages from Slack using Slack’s conversations.history API. This allows for consuming message events for a given conversation, such as a Channel. The Processor periodically polls Slack in order to obtain the latest messages. Unfortunately, the Slack API does not provide a mechanism for easily identifying new replies to messages (i.e., new threaded messages), without scanning through the original “parent” messages as well. As a result, the Processor will periodically poll messages within a channel in order to find any new replies. By default, this occurs every 5 minutes, but this can be configured by changing the value of the “Reply Monitor Frequency” property. Additionally, for long-lived channels, polling all messages would be very expensive. As a result, the Processor only polls messages newer than 7 days (by default) for new replies. This can be configured by setting the value of the “Reply Monitor Window” property.
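
For reference, the conversations.history call described above can also be made as a plain HTTPS request; the sketch below does so with the JDK HTTP client, where the token, channel ID and oldest timestamp are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SlackHistorySketch {
    public static void main(String[] args) throws Exception {
        String token = "xoxb-your-bot-token"; // placeholder Bot User OAuth Token
        String channelId = "C0L9VCD47";       // Channel ID (example format from above)
        String oldest = "0";                  // only messages newer than this timestamp

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://slack.com/api/conversations.history"
                        + "?channel=" + channelId + "&oldest=" + oldest + "&limit=100"))
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON body contains a "messages" array; ConsumeSlack writes this JSON as FlowFile content
        System.out.println(response.body());
    }
}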

Slack Setup

In order to use this Processor, a Slack App must be created and installed in your Slack workspace. An OAuth User or Bot Token must be created for the App, and the token must have the channels:history, groups:history, im:history, or mpim:history User Token Scope. Which scope is necessary depends on the type of conversation to consume from. Please see Slack’s documentation for the latest information on how to create an Application and install it into your workspace.

Depending on the Processor’s configuration, you may also require additional Scopes. For example, the Channels to consume from may be listed either as a Channel ID or (for public Channels) a Channel Name. However, if a name, such as #general is used, the token must be provided the channels:read scope in order to determine the Channel ID for you. Additionally, if the “Resolve Usernames” property is set to true, the token must have the users:read scope in order to resolve the User ID to a Username.

Rather than requiring the channels:read Scope, you may alternatively supply only Channel IDs for the “Channel” property. To determine the ID of a Channel, navigate to the desired Channel in Slack. Click the name of the Channel at the top of the screen. This provides a popup that provides information about the Channel. Scroll to the bottom of the popup, and you will be shown the Channel ID with the ability to click a button to Copy the ID to your clipboard.

At the time of this writing, the following steps may be used to create a Slack App with the necessary channels:history scope. However, these instructions are subject to change at any time, so it is best to read through Slack’s Quickstart Guide.

  • Create a Slack App. From the Slack API site’s app management page, click the “Create New App” button and choose “From scratch.” Give your App a name and choose the workspace that you want to use for developing the app.

  • Creating your app will take you to the configuration page for your application. For example, https://api.slack.com/apps/<APP_IDENTIFIER>. From here, click on “OAuth & Permissions” in the left-hand menu. Scroll down to the “Scopes” section and click the “Add an OAuth Scope” button under ‘Bot Token Scopes’. Choose the channels:history scope.

  • Scroll back to the top, and under the “OAuth Tokens for Your Workspace” section, click the “Install to Workspace” button. This will prompt you to allow the application to be added to your workspace, if you have the appropriate permissions. Otherwise, it will generate a notification for a Workspace Owner to approve the installation. Additionally, it will generate a “Bot User OAuth Token”.

  • Copy the value of the “Bot User OAuth Token.” This will be used as the value for the ConsumeSlack Processor’s Access Token property.

  • The Bot must then be enabled for each Channel that you would like to consume messages from. In order to do that, in the Slack application, go to the Channel that you would like to consume from and press /. Choose the Add apps to this channel option, and add the Application that you created as a Bot to the channel.

  • Alternatively, instead of creating an OAuth Scope of channels:history under “Bot Token Scopes”, you may choose to create an OAuth Scope of channels:history under the “User Token Scopes” section. This will allow the token to be used on your behalf in any channel that you have access to, such as all public channels, without the need to explicitly add a Bot to the channel.

ConsumeTwitter

Streams tweets from Twitter’s streaming API v2. The stream provides a sample stream or a search stream based on previously uploaded rules. This processor also provides a pass through for certain fields of the tweet to be returned as part of the response. See https://developer.twitter.com/en/docs/twitter-api/data-dictionary/introduction for more information regarding the Tweet object model.
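
As a rough illustration of the underlying API rather than of this processor's implementation, the sketch below opens the v2 sample stream over HTTPS with a Bearer Token. The token is a placeholder, and a rule-based search stream would use the /2/tweets/search/stream path instead:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TwitterSampleStreamSketch {
    public static void main(String[] args) throws Exception {
        String bearerToken = "YOUR_BEARER_TOKEN"; // placeholder Bearer Token

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.twitter.com/2/tweets/sample/stream?tweet.fields=created_at,lang"))
                .header("Authorization", "Bearer " + bearerToken)
                .GET()
                .build();

        HttpResponse<InputStream> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofInputStream());

        // Each line of the long-lived response body is one Tweet object in JSON
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(response.body()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}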

Tags: twitter, tweets, social media, status, json

Properties

Stream Endpoint

The source from which the processor will consume Tweets.

Base Path

The base path that the processor will use for making HTTP requests. The default value should be sufficient for most use cases.

Bearer Token

The Bearer Token provided by Twitter.

Queue Size

Maximum size of internal queue for streamed messages

Batch Size

The maximum number of Tweets to be written to a single FlowFile. Fewer Tweets will be written based on the number available in the queue at the time of processor invocation.

Backoff Attempts

The number of reconnection attempts the processor will make in the event of a disconnection of the stream for any reason, before throwing an exception. To start a stream after this exception occurs and the connection is fixed, please stop and restart the processor. If the value of this property is 0, then backoff will never occur and the processor will always need to be restarted if the stream fails.

Backoff Time

The duration to back off before requesting a new stream if the current one fails for any reason. Will increase by a factor of 2 every time a restart fails.

Maximum Backoff Time

The maximum duration to back off before attempting to start a new stream. It is recommended that this value be much higher than the 'Backoff Time' property.

Connect Timeout

The maximum time in which the client should establish a connection with the Twitter API before a timeout. Setting the value to 0 disables connection timeouts.

Read Timeout

The maximum time of inactivity between receiving tweets from Twitter through the API before a timeout. Setting the value to 0 disables read timeouts.

Backfill Minutes

The number of minutes (up to 5 minutes) of streaming data to be requested after a disconnect. Only available for projects with academic research access. See https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/recovery-and-redundancy-features

Tweet Fields

A comma-separated list of tweet fields to be returned as part of the tweet. Refer to https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet for proper usage. Possible field values include: attachments, author_id, context_annotations, conversation_id, created_at, entities, geo, id, in_reply_to_user_id, lang, non_public_metrics, organic_metrics, possibly_sensitive, promoted_metrics, public_metrics, referenced_tweets, reply_settings, source, text, withheld

User Fields

A comma-separated list of user fields to be returned as part of the tweet. Refer to https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user for proper usage. Possible field values include: created_at, description, entities, id, location, name, pinned_tweet_id, profile_image_url, protected, public_metrics, url, username, verified, withheld

Media Fields

A comma-separated list of media fields to be returned as part of the tweet. Refer to https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/media for proper usage. Possible field values include: alt_text, duration_ms, height, media_key, non_public_metrics, organic_metrics, preview_image_url, promoted_metrics, public_metrics, type, url, width

Poll Fields

A comma-separated list of poll fields to be returned as part of the tweet. Refer to https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/poll for proper usage. Possible field values include: duration_minutes, end_datetime, id, options, voting_status

Place Fields

A comma-separated list of place fields to be returned as part of the tweet. Refer to https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/place for proper usage. Possible field values include: contained_within, country, country_code, full_name, geo, id, name, place_type

Expansions

A comma-separated list of expansions for objects in the returned tweet. See https://developer.twitter.com/en/docs/twitter-api/expansions for proper usage. Possible field values include: author_id, referenced_tweets.id, referenced_tweets.id.author_id, entities.mentions.username, attachments.poll_ids, attachments.media_keys, in_reply_to_user_id, geo.place_id

Relationships

  • success: FlowFiles containing an array of one or more Tweets

Writes Attributes

  • mime.type: The MIME Type set to application/json

  • tweets: The number of Tweets in the FlowFile

Input Requirement

This component does not allow an incoming relationship.

ConsumeWindowsEventLog

Registers a Windows Event Log Subscribe Callback to receive FlowFiles from Events on Windows. These can be filtered via channel and XPath.

Tags: ingest, event, windows

Properties

Channel

The Windows Event Log Channel to listen to.

XPath Query

XPath Query to filter events. (See https://msdn.microsoft.com/en-us/library/windows/desktop/dd996910(v=vs.85).aspx for examples.)

Maximum Buffer Size

The individual Event Log XMLs are rendered to a buffer. This specifies the maximum size in bytes that the buffer will be allowed to grow to. (Limiting the maximum size of an individual Event XML.)

Maximum queue size

Events are received asynchronously and must be output as FlowFiles when the processor is triggered. This specifies the maximum number of events to queue for transformation into FlowFiles.

Inactive duration to reconnect

If no new event logs are processed for the specified time period, this processor will try reconnecting to recover from a state where any further messages cannot be consumed. Such a situation can happen if the Windows Event Log service is restarted, or ERROR_EVT_QUERY_RESULT_STALE (15011) is returned. Setting no duration, e.g., '0 ms', disables auto-reconnection.

Relationships

  • success: Relationship for successfully consumed events.

Writes Attributes

  • mime.type: Will set a MIME type value of application/xml.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Description:

This processor is used to listen to Windows Event Log events. It has a success output that will contain an XML representation of the event.

Output XML Example:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
        <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}"
                  EventSourceName="Service Control Manager"/>
        <EventID Qualifiers="16384">7036</EventID>
        <Version>0</Version>
        <Level>4</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8080000000000000</Keywords>
        <TimeCreated SystemTime="2016-06-10T22:28:53.905233700Z"/>
        <EventRecordID>34153</EventRecordID>
        <Correlation/>
        <Execution ProcessID="684" ThreadID="3504"/>
        <Channel>System</Channel>
        <Computer>WIN-O05CNUCF16M.hdf.local</Computer>
        <Security/>
    </System>
    <EventData>
        <Data Name="param1">Smart Card Device Enumeration Service</Data>
        <Data Name="param2">running</Data>
        <Binary>5300630044006500760069006300650045006E0075006D002F0034000000</Binary>
    </EventData>
</Event>
Permissions:

Your Windows User must have permissions to read the given Event Log. This can be achieved through the following steps (Windows 2008 and newer):

  1. Open a command prompt as your user. Enter the command: wmic useraccount get name,sid

  2. Note the SID of the user or group you’d like to allow to read a given channel

  3. Open a command prompt as Administrator. Enter the command: wevtutil gl CHANNEL_NAME

  4. Take the channelAccess Attribute starting with O:BAG, copy it into a text editor, and add (A;;0x1;;;YOUR_SID_FROM_BEFORE) to the end

  5. Take that text and run the following command in your admin prompt (see below for example): wevtutil sl CHANNEL_NAME /ca:TEXT_FROM_PREVIOUS_STEP

The following command is the exact one I used to add read access to the Security log for my user. (You can see all the possible channels with: wevtutil el):

wevtutil sl Security /ca:O:BAG:SYD:(A;;0xf0005;;;SY)(A;;0x5;;;BA)(A;;0x1;;;S-1-5-32-573)(A;;0x1;;;S-1-5-21-3589080292-3448680409-2446571098-1001)

These steps were adapted from this guide.

ControlRate

Controls the rate at which data is transferred to follow-on processors. If you configure a very small Time Duration, then the accuracy of the throttle gets worse. You can improve this accuracy by decreasing the Yield Duration, at the expense of more Tasks given to the processor.

Use Cases

Limit the rate at which data is sent to a downstream system with little to no bursts

Keywords: throttle, limit, slow down, data rate

Input Requirement: This component allows an incoming relationship.

  1. Set the "Rate Control Criteria" to data rate.

  2. Set the "Time Duration" property to 1 sec.

  3. Configure the "Maximum Rate" property to specify how much data should be allowed through each second.

  4. For example, to allow through 8 MB per second, set "Maximum Rate" to 8 MB.

Limit the rate at which FlowFiles are sent to a downstream system with little to no bursts

Keywords: throttle, limit, slow down, flowfile rate

Input Requirement: This component allows an incoming relationship.

  1. Set the "Rate Control Criteria" to flowfile count.

  2. Set the "Time Duration" property to 1 sec.

  3. Configure the "Maximum Rate" property to specify how many FlowFiles should be allowed through each second.

  4. For example, to allow through 100 FlowFiles per second, set "Maximum Rate" to 100.

Reject requests that exceed a specific rate with little to no bursts

Keywords: throttle, limit, slow down, request rate

Input Requirement: This component allows an incoming relationship.

  1. Set the "Rate Control Criteria" to flowfile count.

  2. Set the "Time Duration" property to 1 sec.

  3. Set the "Rate Exceeded Strategy" property to Route to 'rate exceeded'.

  4. Configure the "Maximum Rate" property to specify how many requests should be allowed through each second.

  5. For example, to allow through 100 requests per second, set "Maximum Rate" to 100.

  6. If more than 100 requests come in during any one second, the additional requests will be routed to rate exceeded instead of success.

Reject requests that exceed a specific rate, allowing for bursts

Keywords: throttle, limit, slow down, request rate

Input Requirement: This component allows an incoming relationship.

  1. Set the "Rate Control Criteria" to flowfile count.

  2. Set the "Time Duration" property to 1 min.

  3. Set the "Rate Exceeded Strategy" property to Route to 'rate exceeded'.

  4. Configure the "Maximum Rate" property to specify how many requests should be allowed through each minute.

  5. For example, to allow through 100 requests per second, set "Maximum Rate" to 6000.

  6. This will allow through 6,000 FlowFiles per minute, which averages to 100 FlowFiles per second. However, those 6,000 FlowFiles may come all within the first couple of seconds, or they may come in over a period of 60 seconds. As a result, this gives us an average rate of 100 FlowFiles per second but allows for bursts of data.

  7. If more than 6,000 requests come in during any one minute, the additional requests will be routed to rate exceeded instead of success.

Tags: rate control, throttle, rate, throughput

Properties

Rate Control Criteria

Indicates the criteria that is used to control the throughput rate. Changing this value resets the rate counters.

Time Duration

The amount of time to which the Maximum Rate pertains. Changing this value resets the rate counters.

Maximum Rate

The maximum rate at which data should pass through this processor. The format of this property is expected to be a positive integer, or a Data Size (such as '1 MB') if Rate Control Criteria is set to 'data rate'.

Maximum Data Rate

The maximum rate at which data should pass through this processor. The format of this property is expected to be a Data Size (such as '1 MB') representing bytes per Time Duration.

Maximum FlowFile Rate

The maximum rate at which FlowFiles should pass through this processor. The format of this property is expected to be a positive integer representing FlowFiles count per Time Duration

Rate Exceeded Strategy

Specifies how to handle an incoming FlowFile when the maximum data rate has been exceeded.

Rate Controlled Attribute

The name of an attribute whose values build toward the rate limit if Rate Control Criteria is set to 'attribute value'. The value of the attribute referenced by this property must be a positive long, or the FlowFile will be routed to failure. This value is ignored if Rate Control Criteria is not set to 'attribute value'. Changing this value resets the rate counters.

Grouping Attribute

By default, a single "throttle" is used for all FlowFiles. If this value is specified, a separate throttle is used for each value specified by the attribute with this name. Changing this value resets the rate counters.

Relationships

  • success: FlowFiles are transferred to this relationship under normal conditions

  • failure: FlowFiles will be routed to this relationship if they are missing a necessary Rate Controlled Attribute or the attribute is not in the expected format

Input Requirement

This component requires an incoming relationship.

Additional Details

This processor throttles throughput of FlowFiles based on a configured rate. The rate can be specified as either a direct data rate (bytes per time period), or by counting FlowFiles or a specific attribute value. In all cases, the time period for measurement is specified in the Time Duration property.

The processor operates in one of four available modes. The mode is determined by the Rate Control Criteria property. The available modes and their behavior are described below.

Data Rate

The FlowFile content size is accumulated for all FlowFiles passing through this processor. FlowFiles are throttled to ensure a maximum overall data rate (bytes per time period) is not exceeded. The Maximum Rate property specifies the maximum bytes allowed per Time Duration.

FlowFile Count

FlowFiles are counted regardless of content size. No more than the specified number of FlowFiles pass through this processor in the given Time Duration. The Maximum Rate property specifies the maximum number of FlowFiles allowed per Time Duration.

Attribute Value

The value of an attribute is accumulated to determine overall rate. The Rate Controlled Attribute property specifies the attribute whose value will be accumulated. The value of the specified attribute is expected to be an integer. This mode is independent of overall FlowFile size and count.

Data Rate or FlowFile Count

This mode provides a combination of Data Rate and FlowFile Count. Both rates are accumulated and FlowFiles are throttled if either rate is exceeded. Both Maximum Data Rate and Maximum FlowFile Rate properties must be specified to determine content size and FlowFile count per Time Duration.

If the Grouping Attribute property is specified, all rates are accumulated separately for unique values of the specified attribute. For example, assume Grouping Attribute property is specified and its value is “city”. All FlowFiles containing a “city” attribute with value “Albuquerque” will have an accumulated rate calculated. A separate rate will be calculated for all FlowFiles containing a “city” attribute with a value “Boston”. In other words, separate rate calculations will be accumulated for all unique values of the Grouping Attribute.

ConvertAvroToParquet

Converts Avro records into the Parquet file format. The incoming FlowFile should be a valid Avro file. If an incoming FlowFile does not contain any records, an empty Parquet file is the output. NOTE: Many Avro datatypes (e.g., collections, primitives, and unions of primitives) can be converted to Parquet, but unions of collections and other complex datatypes may not be.

Tags: avro, parquet, convert

Properties

Compression Type

The type of compression for the file being written.

Row Group Size

The row group size used by the Parquet writer. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Page Size

The page size used by the Parquet writer. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Dictionary Page Size

The dictionary page size used by the Parquet writer. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Max Padding Size

The maximum amount of padding that will be used to align row groups with blocks in the underlying filesystem. If the underlying filesystem is not a block filesystem like HDFS, this has no effect. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Enable Dictionary Encoding

Specifies whether dictionary encoding should be enabled for the Parquet writer

Enable Validation

Specifies whether validation should be enabled for the Parquet writer

Writer Version

Specifies the version used by Parquet writer

Relationships

  • success: Parquet file that was converted successfully from Avro

  • failure: Avro content that could not be processed

Writes Attributes

  • filename: Sets the filename to the existing filename with the extension replaced by, or appended with, .parquet

  • record.count: Sets the number of records in the parquet file.

Input Requirement

This component requires an incoming relationship.

ConvertCharacterSet

Converts a FlowFile’s content from one character set to another

Tags: text, convert, characterset, character set

Properties

Input Character Set

The name of the CharacterSet to expect for Input

Output Character Set

The name of the CharacterSet to convert to

Relationships

  • success:

Input Requirement

This component requires an incoming relationship.

ConvertEdiToXml

Tags: virtimo, edi, xml, x12, edifact

Properties

XML Output Format

Use either the StAEDI format or Virtimo’s own XML format. With the Virtimo format, errors can also be skipped or only marked in the XML.

Verbose Output

Put more information in output - min occur, max occur, title and description (if available).

Throw Error

If an error occurs, do not write it to the XML, but throw an error directly.

Use Control Schema

EDI Schema that allows for user-specified validation rules.

Control Schema Body

Your own EDI Schema provided directly in this property.

Control Schema File

Your own EDI Schema loaded from an external file.

Validate Control Code Values

When set to false, enumerated code values of control structure elements will be ignored. Element size and type validation will still occur. Default value: true.

Validate Control Structure

When set to false, control structures, segments, elements, and codes will not be validated unless a user-provided control schema has been set. When set to true AND no user-provided control schema has been set, the reader will attempt to find a known control schema for the detected EDI dialect and version to be used for control structure validation. Default value: true.

Ignore Extraneous Characters

When set to true, non-graphical, control characters will be ignored in the EDI input stream. This includes characters ranging from 0x00 through 0x1F and 0x7F. Default value: false.

Nest Hierarchical Loops

When set to true, hierarchical loops will be nested in the EDI input stream. The nesting structure is determined by the linkage specified by the EDI data itself using pointers given in the EDI schema for a loop. For example, the hierarchical information given by the X12 HL segment. Default value: true.

Trim Discriminator Values

When set to true, discriminator values from the EDI input will be trimmed (leading and trailing whitespace removed) prior to testing whether the value matches the enumerated values for a loop or segment defined in an implementation schema. Default value: false.

Pretty Print

Apply pretty-print formatting to the XML output.

Relationships

  • success: Successfully converted content.

  • failure: Failed to convert content.

Input Requirement

This component allows an incoming relationship.

ConvertRecord

Converts records from one data format to another using configured Record Reader and Record Writer Controller Services. The Reader and Writer must be configured with "matching" schemas. By this, we mean the schemas must have the same field names. The types of the fields do not have to be the same if a field value can be coerced from one type to another. For instance, if the input schema has a field named "balance" of type double, the output schema can have a field named "balance" with a type of string, double, or float. If any field is present in the input that is not present in the output, the field will be left out of the output. If any field is specified in the output schema but is not present in the input data/schema, then the field will not be present in the output or will have a null value, depending on the writer.
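
As an illustration of this name-based matching and type coercion, here is a minimal Python sketch (the field names, schemas, and coercion rules are simplified examples, not the Record API itself):

# Simplified sketch of name-based field matching with type coercion.
input_record = {"name": "Alice", "balance": 42.5, "internal_flag": True}

# Output schema: field name -> target type ('balance' double -> string is allowed).
output_schema = {"name": str, "balance": str, "last_login": str}

output_record = {}
for field, target_type in output_schema.items():
    if field in input_record:
        output_record[field] = target_type(input_record[field])  # coerce by name
    else:
        output_record[field] = None  # missing in input: null (writer-dependent)

# 'internal_flag' is dropped because it is not in the output schema.
print(output_record)  # {'name': 'Alice', 'balance': '42.5', 'last_login': None}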

Use Cases

Convert data from one record-oriented format to another

Input Requirement: This component allows an incoming relationship.

  1. The Record Reader should be configured according to the incoming data format.

  2. The Record Writer should be configured according to the desired output format.

Tags: convert, record, generic, schema, json, csv, avro, log, logs, freeform, text

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Include Zero Record FlowFiles

When converting an incoming FlowFile, if the conversion results in no data, this property specifies whether or not a FlowFile will be sent to the corresponding relationship

Relationships

  • success: FlowFiles that are successfully transformed will be routed to this relationship

  • failure: If a FlowFile cannot be transformed from the configured input format to the configured output format, the unchanged FlowFile will be routed to this relationship

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records in the FlowFile

  • record.error.message: This attribute provides on failure the error message encountered by the Reader or Writer.

Input Requirement

This component requires an incoming relationship.

CopyAzureBlobStorage_v12

Copies a blob in Azure Blob Storage from one account/container to another. The processor uses Azure Blob Storage client library v12.

Tags: azure, microsoft, cloud, storage, blob

Properties

Source Storage Credentials

Credentials Service used to obtain Azure Blob Storage Credentials to read Source Blob information

Source Container Name

Name of the Azure storage container containing the source blob to be copied

Source Blob Name

The full name of the source blob

Destination Storage Credentials

Controller Service used to obtain Azure Blob Storage Credentials.

Destination Container Name

Name of the destination Azure storage container; defaults to the Source Container Name when not specified

Destination Blob Name

The full name of the destination blob; defaults to the Source Blob Name when not specified

Conflict Resolution Strategy

Specifies whether an existing blob will have its contents replaced upon conflict.

Create Container

Specifies whether to check if the container exists and to automatically create it if it does not. Permission to list containers is required. If false, this check is not made, but the Put operation will fail if the container does not exist.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Relationships

  • success: All successfully processed FlowFiles are routed to this relationship

  • failure: Unsuccessful operations will be transferred to the failure relationship.

Writes Attributes

  • azure.container: The name of the Azure Blob Storage container

  • azure.blobname: The name of the blob on Azure Blob Storage

  • azure.primaryUri: Primary location of the blob

  • azure.etag: ETag of the blob

  • azure.blobtype: Type of the blob (either BlockBlob, PageBlob or AppendBlob)

  • mime.type: MIME Type of the content

  • lang: Language code for the content

  • azure.timestamp: Timestamp of the blob

  • azure.length: Length of the blob

  • azure.error.code: Error code reported during blob operation

  • azure.ignored: When Conflict Resolution Strategy is 'ignore', this property will be true/false depending on whether the blob was ignored.

Input Requirement

This component requires an incoming relationship.

CopyS3Object

Copies a file from one bucket and key to another in AWS S3

Tags: Amazon, S3, AWS, Archive, Copy

Properties

Source Bucket

The bucket that contains the file to be copied.

Source Key

The source key in the source bucket

Destination Bucket

The bucket that will receive the copy.

Destination Key

The target key in the target bucket

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

The AWS Region to connect to.

Communications Timeout

The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

FullControl User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have Full Control for an object

Read Permission User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have Read Access for an object

Write Permission User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have Write Access for an object

Read ACL User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have permissions to read the Access Control List for an object

Write ACL User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have permissions to change the Access Control List for an object

Canned ACL

Amazon Canned ACL for an object, one of: BucketOwnerFullControl, BucketOwnerRead, LogDeliveryWrite, AuthenticatedRead, PublicReadWrite, PublicRead, Private; will be ignored if any other ACL/permission/owner property is specified

Owner

The Amazon ID to use for the object’s owner

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Signer Override

The AWS S3 library uses Signature Version 4 by default but this property allows you to specify the Version 2 signer to support older S3-compatible services or even to plug in your own custom signer implementation.

Custom Signer Class Name

Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.

Custom Signer Module Location

Comma-separated list of paths to files and/or directories which contain the custom signer’s JAR file and its dependencies (if any).

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to this Relationship after they have been successfully processed.

  • failure: If the Processor is unable to process a given FlowFile, it will be routed to this Relationship.

Input Requirement

This component requires an incoming relationship.

CountText

Counts various metrics on incoming text. The requested results will be recorded as attributes. The resulting flowfile will not have its content modified.

Tags: count, text, line, word, character

Properties

Count Lines

If enabled, will count the number of lines present in the incoming text.

Count Non-Empty Lines

If enabled, will count the number of lines that contain a non-whitespace character present in the incoming text.

Count Words

If enabled, will count the number of words (alphanumeric character groups bounded by whitespace) present in the incoming text. Common logical delimiters [_-.] do not bound a word unless 'Split Words on Symbols' is true.

Count Characters

If enabled, will count the number of characters (including whitespace and symbols, but not including newlines and carriage returns) present in the incoming text.

Split Words on Symbols

If enabled, the word count will identify strings separated by common logical delimiters [ _ - . ] as independent words (ex. split-words-on-symbols = 4 words).
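
For example, the word-count behavior described above can be sketched in Python as follows (a rough approximation of the rules, not the processor's implementation):

import re

def count_words(text, split_on_symbols=False):
    # Words are alphanumeric groups bounded by whitespace; optionally treat the
    # common logical delimiters [_-.] as additional word boundaries.
    tokens = re.split(r"[\s_.\-]+", text) if split_on_symbols else text.split()
    return len([token for token in tokens if token])

print(count_words("split-words-on-symbols"))                         # 1
print(count_words("split-words-on-symbols", split_on_symbols=True))  # 4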

Character Encoding

Specifies a character encoding to use.

Call Immediate Adjustment

If true, the counter will be updated immediately, without regard to whether the ProcessSession is committed or rolled back; otherwise, the counter will be incremented only if and when the ProcessSession is committed.

Relationships

  • success: The flowfile contains the original content with one or more attributes added containing the respective counts

  • failure: If the flowfile text cannot be counted for some reason, the original file will be routed to this destination and nothing will be routed elsewhere

Writes Attributes

  • text.line.count: The number of lines of text present in the FlowFile content

  • text.line.nonempty.count: The number of lines of text (with at least one non-whitespace character) present in the original FlowFile

  • text.word.count: The number of words present in the original FlowFile

  • text.character.count: The number of characters (given the specified character encoding) present in the original FlowFile

Input Requirement

This component requires an incoming relationship.

See Also

CreateHadoopSequenceFile

Creates Hadoop Sequence Files from incoming flow files

Tags: hadoop, sequence file, create, sequencefile

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Compression type

Type of compression to use when creating Sequence File

Compression codec

Relationships

  • success: Generated Sequence Files are sent to this relationship

  • failure: Incoming files that failed to generate a Sequence File are sent to this relationship

Input Requirement

This component requires an incoming relationship.

See Also

Additional Details

Description

This processor is used to create a Hadoop Sequence File, which essentially is a file of key/value pairs. The key will be a file name and the value will be the flow file content. The processor will take either a merged (a.k.a. packaged) flow file or a singular flow file. Historically, this processor handled the merging by type and size or time prior to creating a SequenceFile output; it no longer does this. If creating a SequenceFile that contains multiple files of the same type is desired, precede this processor with a RouteOnAttribute processor to segregate files of the same type and follow that with a MergeContent processor to bundle up files. If the type of files is not important, just use the MergeContent processor. When using the MergeContent processor, the following Merge Formats are supported by this processor:

  • TAR

  • ZIP

  • FlowFileStream v3

The created SequenceFile is named the same as the incoming FlowFile with the suffix ‘.sf’. For incoming FlowFiles that are bundled, the keys in the SequenceFile are the individual file names, the values are the contents of each file.
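
For a bundled FlowFile, the key/value pairs can be pictured as the archive's member names mapped to their contents. A minimal Python sketch of extracting those pairs from a TAR bundle (illustrative only; it does not write an actual SequenceFile):

import io
import tarfile

def sequence_file_entries(tar_bytes: bytes):
    # Yield (key, value) pairs as they would appear in the SequenceFile:
    # the key is the member file name and the value is that member's content.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as archive:
        for member in archive.getmembers():
            if member.isfile():
                yield member.name, archive.extractfile(member).read()

# Build a small in-memory TAR bundle to demonstrate.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    payload = b"hello"
    info = tarfile.TarInfo(name="a.txt")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

print(list(sequence_file_entries(buffer.getvalue())))  # [('a.txt', b'hello')]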

The value portion of a key/value pair is loaded into memory. While there is a max size limit of 2GB, this could cause memory issues if there are too many concurrent tasks and the flow file sizes are large.
Using Compression

The value of the Compression codec property determines the compression library the processor uses to compress content. Third party libraries are used for compression. These third party libraries can be Java libraries or native libraries. In case of native libraries, the path of the parent folder needs to be in an environment variable called LD_LIBRARY_PATH so that NiFi can find the libraries.

Example: using Snappy compression with native library on CentOS
  1. Snappy compression needs to be installed on the server running NiFi:
    sudo yum install snappy

  2. Suppose that the server running NiFi has the native compression libraries in /opt/lib/hadoop/lib/native . (Native libraries have file extensions like .so, .dll, .lib, etc. depending on the platform.)
    We need to make sure that the files can be executed by the NiFi process’ user. For this purpose we can make a copy of these files to e.g. /opt/nativelibs and change their owner. If NiFi is executed by nifi user in the nifi group, then:
    chown nifi:nifi /opt/nativelibs
    chown nifi:nifi /opt/nativelibs/*

  3. The LD_LIBRARY_PATH needs to be set to contain the path to the folder /opt/nativelibs.

  4. NiFi needs to be restarted.

  5. Compression codec property can be set to SNAPPY and a Compression type can be selected.

  6. The processor can be started.

CryptographicHashContent

Calculates a cryptographic hash value for the flowfile content using the given algorithm and writes it to an output attribute. Please refer to https://csrc.nist.gov/Projects/Hash-Functions/NIST-Policy-on-Hash-Functions for help to decide which algorithm to use.

Tags: content, hash, sha, blake2, md5, cryptography

Properties

Fail if the content is empty

Route to failure if the content is empty. While hashing an empty value is valid, some flows may want to detect empty input.

Hash Algorithm

The hash algorithm to use. Note that not all of the algorithms available are recommended for use (some are provided for legacy compatibility). There are many things to consider when picking an algorithm; it is recommended to use the most secure algorithm possible.

Relationships

  • success: Used for flowfiles that have a hash value added

  • failure: Used for flowfiles that have no content if the 'fail on empty' setting is enabled

Writes Attributes

  • content_<algorithm>: This processor adds an attribute whose value is the result of hashing the flowfile content. The name of this attribute is specified by the value of the algorithm, e.g. 'content_SHA-256'.
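
For reference, a minimal Python sketch of the hash-and-name convention described above (the content value is a placeholder; the attribute naming simply appends the algorithm label):

import hashlib

algorithm = "SHA-256"
content = b"example flowfile content"   # placeholder content

# Hash the content and store the hex digest under 'content_<algorithm>'.
digest = hashlib.new(algorithm.replace("-", "").lower(), content).hexdigest()
attributes = {f"content_{algorithm}": digest}
print(attributes)   # {'content_SHA-256': '...64 hex characters...'}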

Input Requirement

This component requires an incoming relationship.

DebugFlow

The DebugFlow processor aids testing and debugging the FlowFile framework by allowing various responses to be explicitly triggered in response to the receipt of a FlowFile or a timer event without a FlowFile if using timer or cron based scheduling. It can force responses needed to exercise or test various failure modes that can occur when a processor runs.

Tags: test, debug, processor, utility, flow, FlowFile

Properties

FlowFile Success Iterations

Number of FlowFiles to forward to success relationship.

FlowFile Failure Iterations

Number of FlowFiles to forward to failure relationship.

FlowFile Rollback Iterations

Number of FlowFiles to roll back (without penalty).

FlowFile Rollback Yield Iterations

Number of FlowFiles to roll back and yield.

FlowFile Rollback Penalty Iterations

Number of FlowFiles to roll back with penalty.

FlowFile Exception Iterations

Number of FlowFiles to throw exception.

FlowFile Exception Class

Exception class to be thrown (must extend java.lang.RuntimeException).

No FlowFile Skip Iterations

Number of times to skip onTrigger if no FlowFile.

No FlowFile Exception Iterations

Number of times to throw NPE exception if no FlowFile.

No FlowFile Yield Iterations

Number of times to yield if no FlowFile.

No FlowFile Exception Class

Exception class to be thrown if no FlowFile (must extend java.lang.RuntimeException).

Write Iterations

Number of times to write to the FlowFile

Content Size

The number of bytes to write each time that the FlowFile is written to

@OnScheduled Pause Time

Specifies how long the processor should sleep in the @OnScheduled method, so that the processor can be forced to take a long time to start up

Fail When @OnScheduled called

Specifies whether or not the Processor should throw an Exception when the methods annotated with @OnScheduled are called

@OnUnscheduled Pause Time

Specifies how long the processor should sleep in the @OnUnscheduled method, so that the processor can be forced to take a long time to respond when user clicks stop

Fail When @OnUnscheduled called

Specifies whether or not the Processor should throw an Exception when the methods annotated with @OnUnscheduled are called

@OnStopped Pause Time

Specifies how long the processor should sleep in the @OnStopped method, so that the processor can be forced to take a long time to shutdown

Fail When @OnStopped called

Specifies whether or not the Processor should throw an Exception when the methods annotated with @OnStopped are called

OnTrigger Pause Time

Specifies how long the processor should sleep in the onTrigger() method, so that the processor can be forced to take a long time to perform its task

CustomValidate Pause Time

Specifies how long the processor should sleep in the customValidate() method

Ignore Interrupts When Paused

If the Processor’s thread(s) are sleeping (due to one of the "Pause Time" properties above), and the thread is interrupted, this indicates whether the Processor should ignore the interrupt and continue sleeping or if it should allow itself to be interrupted.

Relationships

  • success: FlowFiles processed successfully.

  • failure: FlowFiles that failed to process.

Input Requirement

This component allows an incoming relationship.

Additional Details

When triggered, the processor loops through the appropriate response list. A response is produced the configured number of times for each pass through its response list, as long as the processor is running.
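
A rough Python sketch of that looping behavior (the response names and iteration counts are arbitrary examples, not the processor's internal code):

from itertools import cycle

# Each (response, iterations) pair is produced the configured number of times
# per pass through the list, repeating while the processor keeps running.
response_list = [("success", 2), ("failure", 1), ("rollback", 0), ("exception", 1)]

def responses():
    for response, iterations in cycle(response_list):
        for _ in range(iterations):
            yield response

generator = responses()
print([next(generator) for _ in range(8)])
# ['success', 'success', 'failure', 'exception', 'success', 'success', 'failure', 'exception']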

Triggered by a FlowFile, the processor can produce the following responses.

  1. transfer FlowFile to success relationship.

  2. transfer FlowFile to failure relationship.

  3. rollback the FlowFile without penalty.

  4. rollback the FlowFile and yield the context.

  5. rollback the FlowFile with penalty.

  6. throw an exception.

Triggered without a FlowFile, the processor can produce the following responses.

  1. do nothing and return.

  2. throw an exception.

  3. yield the context.

DecryptContentAge

Decrypt content using the age-encryption.org/v1 specification. Detects binary or ASCII armored content encoding using the initial file header bytes. The age standard uses ChaCha20-Poly1305 for authenticated encryption of the payload. The age-keygen command supports generating X25519 key pairs for encryption and decryption operations.
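
The encoding detection can be pictured as a check of the leading bytes against the age v1 format markers; a minimal Python sketch of that idea (how the processor performs the check internally is not specified here):

# Per the age v1 format, the binary encoding begins with the textual header
# 'age-encryption.org/v1', while ASCII armor uses a PEM-style label.
ARMOR_PREFIX = b"-----BEGIN AGE ENCRYPTED FILE-----"
BINARY_PREFIX = b"age-encryption.org/v1"

def detect_age_encoding(leading_bytes: bytes) -> str:
    if leading_bytes.startswith(ARMOR_PREFIX):
        return "ascii-armored"
    if leading_bytes.startswith(BINARY_PREFIX):
        return "binary"
    return "unknown"

print(detect_age_encoding(b"age-encryption.org/v1\n-> X25519 ..."))  # binary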

Tags: age, age-encryption.org, encryption, ChaCha20-Poly1305, X25519

Properties

Private Key Source

The source of information that determines the loading strategy for X25519 Private Key Identities

Private Key Identities

One or more X25519 Private Key Identities, separated with newlines, encoded according to the age specification, starting with AGE-SECRET-KEY-1

Private Key Identity Resources

One or more files or URLs containing X25519 Private Key Identities, separated with newlines, encoded according to the age specification, starting with AGE-SECRET-KEY-1

Relationships

  • success: Decryption Completed

  • failure: Decryption Failed

Input Requirement

This component requires an incoming relationship.

DecryptContentPGP

Decrypt contents of OpenPGP messages. Using the Packaged Decryption Strategy preserves OpenPGP encoding to support subsequent signature verification.

Tags: PGP, GPG, OpenPGP, Encryption, RFC 4880

Properties

Decryption Strategy

Strategy for writing files to success after decryption

Passphrase

Passphrase used for decrypting data encrypted with Password-Based Encryption

Private Key Service

PGP Private Key Service for decrypting data encrypted with Public Key Encryption

Relationships

  • success: Decryption Succeeded

  • failure: Decryption Failed

Writes Attributes

  • pgp.literal.data.filename: Filename from decrypted Literal Data

  • pgp.literal.data.modified: Modified Date from decrypted Literal Data

  • pgp.symmetric.key.algorithm.block.cipher: Symmetric-Key Algorithm Block Cipher

  • pgp.symmetric.key.algorithm.id: Symmetric-Key Algorithm Identifier

Input Requirement

This component requires an incoming relationship.

DeduplicateRecord

This processor de-duplicates individual records within a record set. It can operate on a per-file basis using an in-memory hashset or bloom filter. When configured with a distributed map cache, it de-duplicates records across multiple files.

Tags: text, record, update, change, replace, modify, distinct, unique, filter, hash, dupe, duplicate, dedupe

Properties

Deduplication Strategy

The strategy to use for detecting and routing duplicate records. The option for detecting duplicates across a single FlowFile operates in-memory, whereas detection spanning multiple FlowFiles utilises a distributed map cache.

Distributed Map Cache client

This property is required when the deduplication strategy is set to 'multiple files.' For each record, the map cache will atomically check whether the cache key exists and, if not, set it.

Cache Identifier

An optional expression language field that overrides the record’s computed cache key. This field has an additional attribute available: ${record.hash.value}, which contains the cache key derived from dynamic properties (if set) or record fields.

Cache the Entry Identifier

For each record, check whether the cache identifier exists in the distributed map cache. If it doesn’t exist and this property is true, put the identifier to the cache.

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Include Zero Record FlowFiles

If a FlowFile sent to either the duplicate or non-duplicate relationships contains no records, a value of false in this property causes the FlowFile to be dropped. Otherwise, the empty FlowFile is emitted.

Record Hashing Algorithm

The algorithm used to hash the cache key.

Filter Type

The filter used to determine whether a record has been seen before based on the matching RecordPath criteria. If hash set is selected, a Java HashSet object will be used to deduplicate all encountered records. If the bloom filter option is selected, a bloom filter will be used. The bloom filter option is less memory intensive, but has a chance of having false positives.

Filter Capacity Hint

An estimation of the total number of unique records to be processed. The more accurate this estimate is, the fewer false positives the BloomFilter will produce.

Bloom Filter Certainty

The desired false positive probability when using the BloomFilter type. Using a value of .05 for example, guarantees a five-percent probability that the result is a false positive. The closer to 1 this value is set, the more precise the result at the expense of more storage space utilization.
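
For intuition about the capacity/probability trade-off, here is the standard Bloom filter sizing math in Python (a general formula, not a description of the processor's internal filter configuration):

import math

def bloom_filter_sizing(expected_records, false_positive_probability):
    # Standard Bloom filter sizing:
    #   bits           m = -n * ln(p) / (ln 2)^2
    #   hash functions k = (m / n) * ln 2
    n, p = expected_records, false_positive_probability
    bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    hashes = max(1, round((bits / n) * math.log(2)))
    return bits, hashes

print(bloom_filter_sizing(1_000_000, 0.05))  # about 6.2 million bits and 4 hash functions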

Dynamic Properties

Name of the property.

A record’s cache key is generated by combining the name of each dynamic property with its evaluated record value (as specified by the corresponding RecordPath).

Relationships

  • failure: If unable to communicate with the cache, the FlowFile will be penalized and routed to this relationship

  • duplicate: Records detected as duplicates are routed to this relationship.

  • non-duplicate: Records not found in the cache are routed to this relationship.

  • original: The original input FlowFile is sent to this relationship unless a fatal error occurs.

Writes Attributes

  • record.count: Number of records written to the destination FlowFile.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The HashSet filter type will grow memory space proportionate to the number of unique records processed. The BloomFilter type will use constant memory regardless of the number of records processed.

  • CPU: If a more advanced hash algorithm is chosen, the amount of time required to hash any particular record could increase substantially.

DeleteAzureBlobStorage_v12

Deletes the specified blob from Azure Blob Storage. The processor uses Azure Blob Storage client library v12.

Tags: azure, microsoft, cloud, storage, blob

Properties

Storage Credentials

Controller Service used to obtain Azure Blob Storage Credentials.

Container Name

Name of the Azure storage container. In case of PutAzureBlobStorage processor, container can be created if it does not exist.

Blob Name

The full name of the blob

Delete Snapshots Option

Specifies the snapshot deletion options to be used when deleting a blob.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Relationships

  • success: All successfully processed FlowFiles are routed to this relationship

  • failure: Unsuccessful operations will be transferred to the failure relationship.

Input Requirement

This component requires an incoming relationship.

DeleteAzureDataLakeStorage

Deletes the provided file from Azure Data Lake Storage

Tags: azure, microsoft, cloud, storage, adlsgen2, datalake

Properties

ADLS Credentials

Controller Service used to obtain Azure Credentials.

Filesystem Name

Name of the Azure Storage File System (also called Container). It is assumed to be already existing.

Filesystem Object Type

The type of the file system object to be deleted. It can be either a folder or a file.

Directory Name

Name of the Azure Storage Directory. The Directory Name cannot contain a leading '/'. The root directory can be designated by the empty string value. In case of the PutAzureDataLakeStorage processor, the directory will be created if not already existing.

File Name

The filename

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Relationships

  • success: Files that have been successfully written to Azure storage are transferred to this relationship

  • failure: Files that could not be written to Azure storage for some reason are transferred to this relationship

Input Requirement

This component requires an incoming relationship.

DeleteBPCProcessLog

Deletes data from a Virtimo Business Process Center (BPC) log service.

Tags: virtimo, bpc

Properties

BPC Controller

Controller used to define the connection to the BPC. The API-Key used by the controller requires 'LOG_SERVICE_DELETE_DATA'-Rights.

BPC Logger

Select the logger from available ones.

BPC Logger ID

The ID of the logger (i.e. the Component ID of the Log Service in BPC).

Delete all entries

WARNING: All entries from the chosen logger will be deleted.

Parent ID

The ID of the log entry to delete (in most cases the 'PROCESSID'-value). Multiple IDs can be set separated by comma.

Child ID

The ID of the child log entry to delete (in most cases the 'CHILDID'-value). If not set, deletes all child entries along with the parent log entry. Multiple IDs can be set separated by comma. Must not be set if multiple Parent IDs are set.

Relationships

  • success: If the request was successfully processed by the BPC, the FlowFile is routed here.

  • failure: If there was an error, the FlowFile is routed here.

Writes Attributes

  • bpc.status.code: The response code from BPC.

  • bpc.error.response: The error description from BPC.

Input Requirement

This component requires an incoming relationship.

DeleteByQueryElasticsearch

Delete from an Elasticsearch index using a query. The query can be loaded from a flowfile body or from the Query parameter.

Tags: elastic, elasticsearch, elasticsearch5, elasticsearch6, elasticsearch7, elasticsearch8, delete, query

Properties

Query Definition Style

How the JSON Query will be defined for use by the processor.

Query

A query in JSON syntax, not Lucene syntax. Ex: {"query":{"match":{"somefield":"somevalue"}}}. If this parameter is not set, the query will be read from the flowfile content. If the query (property and flowfile content) is empty, a default empty JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Query Clause

A "query" clause in JSON syntax, not Lucene syntax. Ex: {"match":{"somefield":"somevalue"}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Query Attribute

If set, the executed query will be set on each result flowfile in the specified attribute.

Index

The name of the index to use.

Type

The type of this document (used by Elasticsearch for indexing and searching).

Max JSON Field String Length

The maximum allowed length of a string value when parsing a JSON document or attribute.

Client Service

An Elasticsearch client service to use for running queries.

Dynamic Properties

The name of a URL query parameter to add

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the query request body

Relationships

  • success: If the "by query" operation succeeds, and a flowfile was read, it will be sent to this relationship.

  • failure: If the "by query" operation fails, and a flowfile was read, it will be sent to this relationship.

  • retry: All flowfiles that fail due to server/cluster availability go to this relationship.

Writes Attributes

  • elasticsearch.delete.took: The amount of time that it took to complete the delete operation in ms.

  • elasticsearch.delete.error: The error message provided by Elasticsearch if there is an error running the delete.

Input Requirement

This component allows an incoming relationship.

Additional Details

This processor executes a delete operation against one or more indices using the _delete_by_query handler. The query should be a valid Elasticsearch JSON DSL query (Lucene syntax is not supported). An example query:

{
  "query": {
    "match": {
      "username.keyword": "john.smith"
    }
  }
}

To delete all the contents of an index, this could be used:

{
  "query": {
    "match_all": {}
  }
}
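
Equivalently, the same operation can be issued directly against the _delete_by_query handler; a minimal Python sketch (host, index, and query values are placeholders):

import json
import urllib.request

# POST the JSON DSL query to the _delete_by_query handler of the target index.
query = {"query": {"match": {"username.keyword": "john.smith"}}}
request = urllib.request.Request(
    "http://localhost:9200/my-index/_delete_by_query",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)
    print(result["took"], result["deleted"])  # elapsed time (ms) and documents removed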

DeleteDynamoDB

Deletes a document from DynamoDB based on hash and range key. The key can be string or number. The request requires all the primary keys for the operation (hash or hash and range key)

Tags: Amazon, DynamoDB, AWS, Delete, Remove

Properties

Table Name

The DynamoDB table name

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Hash Key Name

The hash key name of the item

Range Key Name

The range key name of the item

Hash Key Value

The hash key value of the item

Range Key Value

Hash Key Value Type

The hash key value type of the item

Range Key Value Type

The range key value type of the item

Batch items for each request (between 1 and 50)

The items to be retrieved in one batch

Communications Timeout

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • unprocessed: FlowFiles are routed to unprocessed relationship when DynamoDB is not able to process all the items in the request. Typical reasons are insufficient table throughput capacity and exceeding the maximum bytes per request. Unprocessed FlowFiles can be retried with a new request.

Reads Attributes

  • dynamodb.item.hash.key.value: The item’s hash key value

  • dynamodb.item.range.key.value: The item’s range key value

Writes Attributes

  • dynamodb.key.error.unprocessed: DynamoDB unprocessed keys

  • dynmodb.range.key.value.error: DynamoDB range key error

  • dynamodb.key.error.not.found: DynamoDB key not found

  • dynamodb.error.exception.message: DynamoDB exception message

  • dynamodb.error.code: DynamoDB error code

  • dynamodb.error.message: DynamoDB error message

  • dynamodb.error.service: DynamoDB error service

  • dynamodb.error.retryable: DynamoDB error is retryable

  • dynamodb.error.request.id: DynamoDB error request id

  • dynamodb.error.status.code: DynamoDB status code

Input Requirement

This component requires an incoming relationship.

DeleteFile

Deletes a file from the filesystem.

Use Cases

Delete source file only after its processing completed

Input Requirement: This component allows an incoming relationship.

  1. Retrieve a file from the filesystem, e.g. using 'ListFile' and 'FetchFile'.

  2. Process the file using any combination of processors.

  3. Store the resulting file to a destination, e.g. using 'PutSFTP'.

  4. Using 'DeleteFile', delete the file from the filesystem only after the result has been stored.

Tags: file, remove, delete, local, files, filesystem

Properties

Directory Path

The path to the directory the file to delete is located in.

Filename

The name of the file to delete.

Relationships

  • success: All FlowFiles, for which an existing file has been deleted, are routed to this relationship

  • failure: All FlowFiles, for which an existing file could not be deleted, are routed to this relationship

  • not found: All FlowFiles, for which the file to delete did not exist, are routed to this relationship

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

DeleteGCSObject

Deletes objects from a Google Cloud Bucket. If attempting to delete a file that does not exist, FlowFile is routed to success.

Tags: google cloud, gcs, google, storage, delete

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Project ID

Google Cloud Project ID

Bucket

Bucket of the object.

Key

Name of the object.

Generation

The generation of the object to be deleted. If null, will use latest version of the object.

Number of retries

How many retry attempts should be made before routing to the failure relationship.

Storage API URL

Overrides the default storage URL. Configuring an alternative Storage API URL also overrides the HTTP Host header on requests as described in the Google documentation for Private Service Connections.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to this relationship after a successful Google Cloud Storage operation.

  • failure: FlowFiles are routed to this relationship if the Google Cloud Storage operation fails.

Input Requirement

This component requires an incoming relationship.

DeleteGridFS

Deletes a file from GridFS using a file name or a query.

Tags: gridfs, delete, mongodb

Properties

Client Service

The MongoDB client service to use for database connections.

Mongo Database Name

The name of the database to use

Bucket Name

The GridFS bucket where the files will be stored. If left blank, it will use the default value 'fs' that the MongoDB client driver uses.

File Name

The name of the file in the bucket that is the target of this processor. GridFS file names do not include path information because GridFS does not sort files into folders within a bucket.

Query

A valid MongoDB query to use to find and delete one or more files from GridFS.

Query Output Attribute

If set, the query will be written to a specified attribute on the output flowfiles.

Relationships

  • success: When the operation succeeds, the flowfile is sent to this relationship.

  • failure: When there is a failure processing the flowfile, it goes to this relationship.

Input Requirement

This component requires an incoming relationship.

Additional Details

Description:

This processor deletes one or more files from GridFS. The query to execute can be either provided in the query configuration parameter or generated from the value pulled from the filename configuration parameter. Upon successful execution, it will append the query that was executed as an attribute on the flowfile that was processed.
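
As an illustration of the same operation performed directly with the MongoDB driver, here is a minimal Python sketch using pymongo/gridfs (connection string, database, bucket, and query values are placeholders):

from pymongo import MongoClient
import gridfs

client = MongoClient("mongodb://localhost:27017")              # placeholder connection
bucket = gridfs.GridFSBucket(client["mydb"], bucket_name="fs")

# Find every GridFS file matching the query and delete each one by its _id.
for grid_file in bucket.find({"filename": "old-report.csv"}):
    bucket.delete(grid_file._id)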

DeleteHDFS

Deletes one or more files or directories from HDFS. The path can be provided as an attribute from an incoming FlowFile, or a statically set path that is periodically removed. If this processor has an incoming connection, it will ignore running on a periodic basis and instead rely on incoming FlowFiles to trigger a delete. Note that you may use a wildcard character to match multiple files or directories. If there are no incoming connections, no FlowFiles will be transferred to any output relationships. If there is an incoming FlowFile then, provided there are no detected failures, it will be transferred to success; otherwise it will be sent to failure. If knowledge of globbed files deleted is necessary, use ListHDFS first to produce a specific list of files to delete.

Tags: hadoop, HCFS, HDFS, delete, remove, filesystem

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Path

The HDFS file or directory to delete. A wildcard expression may be used to only delete certain files

Recursive

Remove contents of a non-empty directory recursively

Relationships

  • success: When an incoming flowfile is used then if there are no errors invoking delete the flowfile will route here.

  • failure: When an incoming flowfile is used and there is a failure while deleting then the flowfile will route here.

Writes Attributes

  • hdfs.filename: HDFS file to be deleted. If multiple files are deleted, then only the last filename is set.

  • hdfs.path: HDFS Path specified in the delete request. If multiple paths are deleted, then only the last path is set.

  • hadoop.file.url: The hadoop url for the file to be deleted.

  • hdfs.error.message: HDFS error message related to the hdfs.error.code

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component allows an incoming relationship.

See Also

DeleteMongo

Executes a delete query against a MongoDB collection. The query is provided in the body of the flowfile and the user can select whether it will delete one or many documents that match it.

Tags: delete, mongo, mongodb

Properties

Client Service

If configured, this property will use the assigned client service for connection pooling.

Mongo Database Name

The name of the database to use

Mongo Collection Name

The name of the collection to use

Delete Mode

Choose between deleting one document by query or many documents by query.

Fail When Nothing Is Deleted

Determines whether to send the flowfile to the success or failure relationship if nothing is successfully deleted.

Relationships

  • success: All FlowFiles that are written to MongoDB are routed to this relationship

  • failure: All FlowFiles that cannot be written to MongoDB are routed to this relationship

Reads Attributes

  • mongodb.delete.mode: Configurable parameter for controlling delete mode on a per-flowfile basis. The process must be configured to use this option. Acceptable values are 'one' and 'many.'

Input Requirement

This component requires an incoming relationship.

Additional Details

Description:

This processor deletes from Mongo using a user-provided query that is provided in the body of a flowfile. It must be a valid JSON document. The user has the option of deleting a single document or all documents that match the criteria. That behavior can be configured using the related configuration property. In addition, the processor can be configured to regard a failure to delete any documents as an error event, which would send the flowfile with the query to the failure relationship.

Example Query
{
  "username": "john.smith",
  "recipient": "jane.doe"
}

DeleteS3Object

Deletes a file from an Amazon S3 Bucket. If attempting to delete a file that does not exist, FlowFile is routed to success.

Tags: Amazon, S3, AWS, Archive, Delete

Properties

Bucket

The S3 Bucket to interact with

Object Key

The S3 Object Key to use. This is analogous to a filename for traditional file systems.

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

The AWS Region to connect to.

Communications Timeout

The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

Version

The Version of the Object to delete

FullControl User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have Full Control for an object

Read Permission User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have Read Access for an object

Write Permission User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have Write Access for an object

Read ACL User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have permissions to read the Access Control List for an object

Write ACL User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have permissions to change the Access Control List for an object

Owner

The Amazon ID to use for the object’s owner

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Signer Override

The AWS S3 library uses Signature Version 4 by default but this property allows you to specify the Version 2 signer to support older S3-compatible services or even to plug in your own custom signer implementation.

Custom Signer Class Name

Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.

Custom Signer Module Location

Comma-separated list of paths to files and/or directories which contain the custom signer’s JAR file and its dependencies (if any).

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to this Relationship after they have been successfully processed.

  • failure: If the Processor is unable to process a given FlowFile, it will be routed to this Relationship.

Writes Attributes

  • s3.exception: The class name of the exception thrown during processor execution

  • s3.additionalDetails: The S3 supplied detail from the failed operation

  • s3.statusCode: The HTTP error code (if available) from the failed operation

  • s3.errorCode: The S3 moniker of the failed operation

  • s3.errorMessage: The S3 exception message from the failed operation

Input Requirement

This component requires an incoming relationship.

DeleteSFTP

Deletes a file residing on an SFTP server.

Use Cases

Delete source file only after its processing completed

Input Requirement: This component allows an incoming relationship.

  1. Retrieve a file residing on an SFTP server, e.g. using 'ListSFTP' and 'FetchSFTP'.

  2. Process the file using any combination of processors.

  3. Store the resulting file to a destination, e.g. using 'PutFile'.

  4. Using 'DeleteSFTP', delete the file residing on an SFTP server only after the result has been stored.

Tags: remote, remove, delete, sftp

Properties

Directory Path

The path to the directory the file to delete is located in.

Filename

The name of the file to delete.

Hostname

The fully qualified hostname or IP address of the remote system

Port

The port that the remote system is listening on for file transfers

Username

Username

Password

Password for the user account

Private Key Path

The fully qualified path to the Private Key file

Private Key Passphrase

Password for the private key

Strict Host Key Checking

Indicates whether or not strict enforcement of hosts keys should be applied

Host Key File

If supplied, the given file will be used as the Host Key; otherwise, if the 'Strict Host Key Checking' property is set to true, the 'known_hosts' and 'known_hosts2' files from the ~/.ssh directory are used; if neither applies, no host key file will be used

Batch Size

The maximum number of FlowFiles to send in a single connection

Connection Timeout

Amount of time to wait before timing out while creating a connection

Data Timeout

When transferring a file between the local and remote system, this value specifies how long is allowed to elapse without any data being transferred between systems

Send Keep Alive On Timeout

Send a Keep Alive message every 5 seconds up to 5 times for an overall timeout of 25 seconds.

Use Compression

Indicates whether or not ZLIB compression should be used when transferring files

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN, SOCKS + AuthN

Ciphers Allowed

A comma-separated list of Ciphers allowed for SFTP connections. Leave unset to allow all. Available options are: 3des-cbc, 3des-ctr, aes128-cbc, aes128-ctr, aes128-gcm@openssh.com, aes192-cbc, aes192-ctr, aes256-cbc, aes256-ctr, aes256-gcm@openssh.com, arcfour, arcfour128, arcfour256, blowfish-cbc, blowfish-ctr, cast128-cbc, cast128-ctr, chacha20-poly1305@openssh.com, idea-cbc, idea-ctr, serpent128-cbc, serpent128-ctr, serpent192-cbc, serpent192-ctr, serpent256-cbc, serpent256-ctr, twofish-cbc, twofish128-cbc, twofish128-ctr, twofish192-cbc, twofish192-ctr, twofish256-cbc, twofish256-ctr

Key Algorithms Allowed

A comma-separated list of Key Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: ecdsa-sha2-nistp256, ecdsa-sha2-nistp256-cert-v01@openssh.com, ecdsa-sha2-nistp384, ecdsa-sha2-nistp384-cert-v01@openssh.com, ecdsa-sha2-nistp521, ecdsa-sha2-nistp521-cert-v01@openssh.com, rsa-sha2-256, rsa-sha2-512, ssh-dss, ssh-dss-cert-v01@openssh.com, ssh-ed25519, ssh-ed25519-cert-v01@openssh.com, ssh-rsa, ssh-rsa-cert-v01@openssh.com

Key Exchange Algorithms Allowed

A comma-separated list of Key Exchange Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: curve25519-sha256, curve25519-sha256@libssh.org, diffie-hellman-group-exchange-sha1, diffie-hellman-group-exchange-sha256, diffie-hellman-group1-sha1, diffie-hellman-group14-sha1, diffie-hellman-group14-sha256, diffie-hellman-group14-sha256@ssh.com, diffie-hellman-group15-sha256, diffie-hellman-group15-sha256@ssh.com, diffie-hellman-group15-sha384@ssh.com, diffie-hellman-group15-sha512, diffie-hellman-group16-sha256, diffie-hellman-group16-sha384@ssh.com, diffie-hellman-group16-sha512, diffie-hellman-group16-sha512@ssh.com, diffie-hellman-group17-sha512, diffie-hellman-group18-sha512, diffie-hellman-group18-sha512@ssh.com, ecdh-sha2-nistp256, ecdh-sha2-nistp384, ecdh-sha2-nistp521, ext-info-c

Message Authentication Codes Allowed

A comma-separated list of Message Authentication Codes allowed for SFTP connections. Leave unset to allow all. Available options are: hmac-md5, hmac-md5-96, hmac-md5-96-etm@openssh.com, hmac-md5-etm@openssh.com, hmac-ripemd160, hmac-ripemd160-96, hmac-ripemd160-etm@openssh.com, hmac-ripemd160@openssh.com, hmac-sha1, hmac-sha1-96, hmac-sha1-96@openssh.com, hmac-sha1-etm@openssh.com, hmac-sha2-256, hmac-sha2-256-etm@openssh.com, hmac-sha2-512, hmac-sha2-512-etm@openssh.com

Relationships

  • success: All FlowFiles, for which an existing file has been deleted, are routed to this relationship

  • failure: All FlowFiles, for which an existing file could not be deleted, are routed to this relationship

  • not found: All FlowFiles, for which the file to delete did not exist, are routed to this relationship

Input Requirement

This component requires an incoming relationship.

DeleteSQS

Deletes a message from an Amazon Simple Queuing Service Queue

Tags: Amazon, AWS, SQS, Queue, Delete

Properties

Queue URL

The URL of the queue to delete from

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Receipt Handle

The identifier that specifies the receipt of the message

Communications Timeout

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

Input Requirement

This component requires an incoming relationship.

See Also

DetectDuplicate

Caches a value, computed from FlowFile attributes, for each incoming FlowFile and determines if the cached value has already been seen. If so, routes the FlowFile to 'duplicate' with an attribute named 'original.identifier' that specifies the original FlowFile’s "description", which is specified in the <FlowFile Description> property. If the FlowFile is not determined to be a duplicate, the Processor routes the FlowFile to 'non-duplicate'

Tags: hash, dupe, duplicate, dedupe

Properties

Cache Entry Identifier

A FlowFile attribute, or the results of an Attribute Expression Language statement, which will be evaluated against a FlowFile in order to determine the value used to identify duplicates; it is this value that is cached

FlowFile Description

When a FlowFile is added to the cache, this value is stored along with it so that if a duplicate is found, this description of the original FlowFile will be added to the duplicate’s "original.flowfile.description" attribute

Age Off Duration

Time interval to age off cached FlowFiles

Distributed Cache Service

The Controller Service that is used to cache unique identifiers, used to determine duplicates

Cache The Entry Identifier

When true, this causes the processor to check for duplicates and cache the Entry Identifier. When false, the processor only checks for duplicates and does not cache the Entry Identifier, requiring another processor to add identifiers to the distributed cache.

Relationships

  • failure: If unable to communicate with the cache, the FlowFile will be penalized and routed to this relationship

  • duplicate: If a FlowFile has been detected to be a duplicate, it will be routed to this relationship

  • non-duplicate: If a FlowFile’s Cache Entry Identifier was not found in the cache, it will be routed to this relationship

Writes Attributes

  • original.flowfile.description: All FlowFiles routed to the duplicate relationship will have an attribute added named original.flowfile.description. The value of this attribute is determined by the attributes of the original copy of the data and by the FlowFile Description property.

Input Requirement

This component requires an incoming relationship.

DistributeLoad

Distributes FlowFiles to downstream processors based on a Distribution Strategy. If using the Round Robin strategy, the default is to assign each destination a weighting of 1 (evenly distributed). However, optional properties can be added to change this; adding a property with the name '5' and value '10' means that the relationship with name '5' will receive 10 FlowFiles in each iteration instead of 1.

Tags: distribute, load balance, route, round robin, weighted

Properties

Number of Relationships

Determines the number of Relationships to which the load should be distributed

Distribution Strategy

Determines how the load will be distributed. Relationship weight is in numeric order where '1' has the greatest weight.

Dynamic Properties

The relationship name (positive number)

Adding a property with the name '5' and value '10' means that the relationship with name '5' will receive 10 FlowFiles in each iteration instead of 1.
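
A rough Python sketch of that weighted round-robin behavior (relationship names and weights follow the example above; this is an approximation, not the processor's scheduling code):

from itertools import cycle

# Relationship '5' is weighted 10, the others keep the default weight of 1.
weights = {"1": 1, "2": 1, "3": 1, "4": 1, "5": 10}

# One iteration of the cycle repeats each relationship name by its weight.
schedule = [name for name, weight in sorted(weights.items()) for _ in range(weight)]
destinations = cycle(schedule)

for index in range(14):
    print(f"flowfile-{index} -> relationship {next(destinations)}")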

Relationships

  • 1: Where to route flowfiles for this relationship index

Dynamic Relationships

  • A number 1..<Number Of Relationships>: FlowFiles are sent to this relationship per the <Distribution Strategy>

Writes Attributes

  • distribute.load.relationship: The name of the specific relationship the FlowFile has been routed through

Input Requirement

This component requires an incoming relationship.

DuplicateFlowFile

Intended for load testing, this processor will create the configured number of copies of each incoming FlowFile. The original FlowFile as well as all generated copies are sent to the 'success' relationship. In addition, each FlowFile gets an attribute 'copy.index' set to the copy number, where the original FlowFile gets a value of zero, and all copies receive incremented integer values.

Tags: test, load, duplicate

Properties

Number of Copies

Specifies how many copies of each incoming FlowFile will be made

Relationships

  • success: The original FlowFile and all copies will be sent to this relationship

Writes Attributes

  • copy.index: A zero-based incrementing integer value based on which copy the FlowFile is.

Input Requirement

This component requires an incoming relationship.

EncodeContent

Encode or decode the contents of a FlowFile using Base64, Base32, or hex encoding schemes

Tags: encode, decode, base64, base32, hex

Properties

Mode

Specifies whether the content should be encoded or decoded.

Encoding

Specifies the type of encoding used.

Line Output Mode

Controls the line formatting for encoded content based on selected property values.

Encoded Line Length

Each line of encoded data will contain up to the configured number of characters, rounded down to the nearest multiple of 4.
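
For example, the rounding and wrapping behavior for Base64 output can be sketched in Python as follows (an approximation of the described behavior, not the processor's encoder):

import base64
import textwrap

def encode_with_line_length(data: bytes, line_length: int) -> str:
    # Wrap Base64 output so each line holds at most the configured number of
    # characters, rounded down to the nearest multiple of 4.
    usable_length = (line_length // 4) * 4
    encoded = base64.b64encode(data).decode("ascii")
    return "\n".join(textwrap.wrap(encoded, usable_length))

print(encode_with_line_length(b"some example flowfile content", line_length=18))
# A line length of 18 rounds down to 16, so each line holds up to 16 characters.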

Relationships

  • success: Any FlowFile that is successfully encoded or decoded will be routed to success

  • failure: Any FlowFile that cannot be encoded or decoded will be routed to failure

Input Requirement

This component requires an incoming relationship.

EncryptContentAge

Encrypt content using the age-encryption.org/v1 specification. Supports binary or ASCII armored content encoding using configurable properties. The age standard uses ChaCha20-Poly1305 for authenticated encryption of the payload. The age-keygen command supports generating X25519 key pairs for encryption and decryption operations.

Tags: age, age-encryption.org, encryption, ChaCha20-Poly1305, X25519

Properties

File Encoding

Output encoding for encrypted files. Binary encoding provides optimal processing performance.

Public Key Source

Source of information determines the loading strategy for X25519 Public Key Recipients

Public Key Recipients

One or more X25519 Public Key Recipients, separated with newlines, encoded according to the age specification, starting with age1

Public Key Recipient Resources

One or more files or URLs containing X25519 Public Key Recipients, separated with newlines, encoded according to the age specification, starting with age1

Relationships

  • success: Encryption Completed

  • failure: Encryption Failed

Input Requirement

This component requires an incoming relationship.

EncryptContentPGP

Encrypt contents using OpenPGP. The processor reads input and detects OpenPGP messages to avoid unnecessary additional wrapping in Literal Data packets.

Tags: PGP, GPG, OpenPGP, Encryption, RFC 4880

Properties

Symmetric-Key Algorithm

Symmetric-Key Algorithm for encryption

File Encoding

File Encoding for encryption

Passphrase

Passphrase used for encrypting data with Password-Based Encryption

Public Key Service

PGP Public Key Service for encrypting data with Public Key Encryption

Public Key Search

PGP Public Key Search will be used to match against the User ID or Key ID when formatted as uppercase hexadecimal string of 16 characters

Relationships

  • success: Encryption Succeeded

  • failure: Encryption Failed

Writes Attributes

  • pgp.symmetric.key.algorithm: Symmetric-Key Algorithm

  • pgp.symmetric.key.algorithm.block.cipher: Symmetric-Key Algorithm Block Cipher

  • pgp.symmetric.key.algorithm.key.size: Symmetric-Key Algorithm Key Size

  • pgp.symmetric.key.algorithm.id: Symmetric-Key Algorithm Identifier

  • pgp.file.encoding: File Encoding

  • pgp.compression.algorithm: Compression Algorithm

  • pgp.compression.algorithm.id: Compression Algorithm Identifier

Input Requirement

This component requires an incoming relationship.

EnforceOrder

Enforces expected ordering of FlowFiles that belong to the same data group within a single node. Although PriorityAttributePrioritizer can be used on a connection to ensure that flow files going through that connection are in priority order, depending on error-handling, branching, and other flow designs, it is possible for FlowFiles to get out-of-order. EnforceOrder can be used to enforce original ordering for those FlowFiles. [IMPORTANT] For EnforceOrder to take effect, FirstInFirstOutPrioritizer should be used on EVERY downstream connection UNTIL the order of FlowFiles is physically fixed by an operation such as MergeContent or by being stored to the final destination.

Tags: sort, order

Properties

Group Identifier

EnforceOrder is capable of multiple ordering groups. 'Group Identifier' is used to determine which group a FlowFile belongs to. This property will be evaluated with each incoming FlowFile. If the evaluated result is empty, the FlowFile will be routed to failure.

Order Attribute

The name of a FlowFile attribute whose value will be used to enforce the order of FlowFiles within a group. If a FlowFile does not have this attribute, or its value is not an integer, the FlowFile will be routed to failure.

Initial Order

When the first FlowFile of a group arrives, the initial target order will be computed and stored in the managed state. After that, the target order will be tracked by EnforceOrder and stored in the state management store. If Expression Language is used but the evaluated result is not an integer, the FlowFile will be routed to failure, and the initial order will remain unknown until subsequent FlowFiles provide a valid initial order.

Maximum Order

If specified, any FlowFile with a larger order will be routed to failure. This property is computed only once for a given group. After a maximum order is computed, it will be persisted in the state management store and used for other FlowFiles belonging to the same group. If Expression Language is used but the evaluated result is not an integer, the FlowFile will be routed to failure, and the maximum order will remain unknown until subsequent FlowFiles provide a valid maximum order.

Batch Count

The maximum number of FlowFiles that EnforceOrder can process in a single execution.

Wait Timeout

Indicates the duration after which waiting FlowFiles will be routed to the 'overtook' relationship.

Inactive Timeout

Indicates the duration after which state for an inactive group will be cleared from managed state. A group is determined to be inactive if no new incoming FlowFile has been seen for that group for the specified duration. Inactive Timeout must be longer than Wait Timeout. If a FlowFile arrives late, after its group has already been cleared, it will be treated as a brand new group, but it will never match the expected order since the preceding FlowFiles are already gone. The FlowFile will eventually time out while waiting and be routed to 'overtook'. To avoid this, group state should be kept long enough; however, a shorter duration is helpful when the same group identifier needs to be reused.

Relationships

  • success: A FlowFile with a matching order number will be routed to this relationship.

  • failure: A FlowFile that does not have the required attributes, or for which those attributes cannot be computed, will be routed to this relationship

  • overtook: A FlowFile that waited for preceding FlowFiles longer than Wait Timeout and overtook those FlowFiles, will be routed to this relationship.

  • skipped: A FlowFile with an order earlier than the current target order, meaning it arrived too late and was skipped, will be routed to this relationship.

  • wait: A FlowFile with a non-matching order will be routed to this relationship

Writes Attributes

  • EnforceOrder.startedAt: All FlowFiles going through this processor will have this attribute. This value is used to determine wait timeout.

  • EnforceOrder.result: All FlowFiles going through this processor will have this attribute denoting which relationship it was routed to.

  • EnforceOrder.detail: FlowFiles routed to 'failure' or 'skipped' relationship will have this attribute describing details.

  • EnforceOrder.expectedOrder: FlowFiles routed to 'wait' or 'skipped' relationship will have this attribute denoting expected order when the FlowFile was processed.

Stateful

Scope: Local

EnforceOrder uses the following states per ordering group: '<groupId>.target' is the order number that is being waited on to arrive next. When a FlowFile with a matching order arrives, or a FlowFile overtakes the FlowFile being waited for because of wait timeout, the target order will be updated to (FlowFile.order + 1). '<groupId>.max' is the maximum order number for a group. '<groupId>.updatedAt' is the timestamp of the last time the order of a group was updated. These managed states will be removed automatically once a group is determined to be inactive; see 'Inactive Timeout' for details.

Input Requirement

This component requires an incoming relationship.

EvaluateJsonPath

Evaluates one or more JsonPath expressions against the content of a FlowFile. The results of those expressions are assigned to FlowFile Attributes or are written to the content of the FlowFile itself, depending on configuration of the Processor. JsonPaths are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed (if the Destination is flowfile-attribute; otherwise, the property name is ignored). The value of the property must be a valid JsonPath expression. A Return Type of 'auto-detect' will make a determination based on the configured destination. When 'Destination' is set to 'flowfile-attribute', a return type of 'scalar' will be used. When 'Destination' is set to 'flowfile-content', a return type of 'JSON' will be used. If the JsonPath evaluates to a JSON array or JSON object and the Return Type is set to 'scalar', the FlowFile will be unmodified and will be routed to failure. A Return Type of JSON can return scalar values if the provided JsonPath evaluates to the specified value, and the FlowFile will be routed as a match. If Destination is 'flowfile-content' and the JsonPath does not evaluate to a defined path, the FlowFile will be routed to 'unmatched' without having its contents modified. If Destination is 'flowfile-attribute' and the expression matches nothing, attributes will be created with empty strings as the value unless 'Path Not Found Behavior' is set to 'skip', and the FlowFile will always be routed to 'matched'.

Tags: JSON, evaluate, JsonPath

Properties

Destination

Indicates whether the results of the JsonPath evaluation are written to the FlowFile content or a FlowFile attribute; if using attribute, must specify the Attribute Name property. If set to flowfile-content, only one JsonPath may be specified, and the property name is ignored.

Return Type

Indicates the desired return type of the JSON Path expressions. Selecting 'auto-detect' will set the return type to 'json' for a Destination of 'flowfile-content', and 'scalar' for a Destination of 'flowfile-attribute'.

Path Not Found Behavior

Indicates how to handle missing JSON path expressions when destination is set to 'flowfile-attribute'. Selecting 'warn' will generate a warning when a JSON path expression is not found. Selecting 'skip' will omit attributes for any unmatched JSON path expressions.

Null Value Representation

Indicates the desired representation of JSON Path expressions resulting in a null value.

Max String Length

The maximum allowed length of a string value when parsing the JSON document

Dynamic Properties

A FlowFile attribute (if <Destination> is set to 'flowfile-attribute')

If <Destination>='flowfile-attribute' then that FlowFile attribute will be set to any JSON objects that match the JsonPath. If <Destination>='flowfile-content' then the FlowFile content will be updated to any JSON objects that match the JsonPath.

Relationships

  • failure: FlowFiles are routed to this relationship when the JsonPath cannot be evaluated against the content of the FlowFile; for instance, if the FlowFile is not valid JSON

  • matched: FlowFiles are routed to this relationship when the JsonPath is successfully evaluated and the FlowFile is modified as a result

  • unmatched: FlowFiles are routed to this relationship when the JsonPath does not match the content of the FlowFile and the Destination is set to flowfile-content

Input Requirement

This component requires an incoming relationship.

Additional Details

Note: The underlying JsonPath library loads the entirety of the streamed content into memory and performs result evaluations in memory. Accordingly, it is important to consider the anticipated profile of content being evaluated by this processor and the hardware supporting it, especially when working against large JSON documents.

Additional Notes

It’s a common pattern to make JSON from attributes in NiFi. Many of these attributes have periods in their names, for example record.count. To reference them safely, you must use bracket notation, which puts the entire key in quotes inside brackets. This also applies to JSON keys that contain whitespace:

$.["record.count"]

$.["record count"]
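
As an illustrative sketch (the content, property name, and attribute values here are hypothetical, not taken from the processor documentation), consider a FlowFile whose JSON content was built from such attributes:

{
  "record.count": 3,
  "record count": 3,
  "source": "invoices"
}
json

With Destination set to 'flowfile-attribute', a dynamic property named count with the value $.["record.count"] would add an attribute count set to 3, whereas the dotted form $.record.count would look for a nested "record" object and would not match the flattened key.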

EvaluateXPath

Evaluates one or more XPaths against the content of a FlowFile. The results of those XPaths are assigned to FlowFile Attributes or are written to the content of the FlowFile itself, depending on configuration of the Processor. XPaths are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed (if the Destination is flowfile-attribute; otherwise, the property name is ignored). The value of the property must be a valid XPath expression. If the XPath evaluates to more than one node and the Return Type is set to 'nodeset' (either directly, or via 'auto-detect' with a Destination of 'flowfile-content'), the FlowFile will be unmodified and will be routed to failure. If the XPath does not evaluate to a Node, the FlowFile will be routed to 'unmatched' without having its contents modified. If Destination is flowfile-attribute and the expression matches nothing, attributes will be created with empty strings as the value, and the FlowFile will always be routed to 'matched'

Tags: XML, evaluate, XPath

Properties

Destination

Indicates whether the results of the XPath evaluation are written to the FlowFile content or a FlowFile attribute; if using attribute, must specify the Attribute Name property. If set to flowfile-content, only one XPath may be specified, and the property name is ignored.

Return Type

Indicates the desired return type of the XPath expressions. Selecting 'auto-detect' will set the return type to 'nodeset' for a Destination of 'flowfile-content', and 'string' for a Destination of 'flowfile-attribute'.

Allow DTD

Allow embedded Document Type Declaration in XML. This feature should be disabled to avoid XML entity expansion vulnerabilities.

Dynamic Properties

A FlowFile attribute (if <Destination> is set to 'flowfile-attribute')

If <Destination>='flowfile-attribute' then the FlowFile attribute is set to the result of the XPath Expression. If <Destination>='flowfile-content' then the FlowFile content is set to the result of the XPath Expression.

Relationships

  • failure: FlowFiles are routed to this relationship when the XPath cannot be evaluated against the content of the FlowFile; for instance, if the FlowFile is not valid XML, or if the Return Type is 'nodeset' and the XPath evaluates to multiple nodes

  • matched: FlowFiles are routed to this relationship when the XPath is successfully evaluated and the FlowFile is modified as a result

  • unmatched: FlowFiles are routed to this relationship when the XPath does not match the content of the FlowFile and the Destination is set to flowfile-content

Writes Attributes

  • user-defined: This processor adds user-defined attributes if the <Destination> property is set to flowfile-attribute.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: Processing requires reading the entire FlowFile into memory

EvaluateXQuery

Evaluates one or more XQueries against the content of a FlowFile. The results of those XQueries are assigned to FlowFile Attributes or are written to the content of the FlowFile itself, depending on configuration of the Processor. XQueries are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed (if the Destination is 'flowfile-attribute'; otherwise, the property name is ignored). The value of the property must be a valid XQuery. If the XQuery returns more than one result, new attributes or FlowFiles (for Destinations of 'flowfile-attribute' or 'flowfile-content' respectively) will be created for each result (attributes will have a '.n' one-up number appended to the specified attribute name). If any provided XQuery returns a result, the FlowFile(s) will be routed to 'matched'. If no provided XQuery returns a result, the FlowFile will be routed to 'unmatched'. If the Destination is 'flowfile-attribute' and the XQueries match nothing, no attributes will be applied to the FlowFile.

Tags: XML, evaluate, XPath, XQuery

Properties

Destination

Indicates whether the results of the XQuery evaluation are written to the FlowFile content or a FlowFile attribute. If set to <flowfile-content>, only one XQuery may be specified and the property name is ignored. If set to <flowfile-attribute> and the XQuery returns more than one result, multiple attributes will be added to the FlowFile, each named with a '.n' one-up number appended to the specified attribute name

Output: Method

Identifies the overall method that should be used for outputting a result tree.

Output: Omit XML Declaration

Specifies whether the processor should output an XML declaration when transforming a result tree.

Output: Indent

Specifies whether the processor may add additional whitespace when outputting a result tree.

Allow DTD

Allow embedded Document Type Declaration in XML. This feature should be disabled to avoid XML entity expansion vulnerabilities.

Dynamic Properties

A FlowFile attribute (if <Destination> is set to 'flowfile-attribute')

If <Destination>='flowfile-attribute' then the FlowFile attribute is set to the result of the XQuery. If <Destination>='flowfile-content' then the FlowFile content is set to the result of the XQuery.

Relationships

  • failure: FlowFiles are routed to this relationship when the XQuery cannot be evaluated against the content of the FlowFile.

  • matched: FlowFiles are routed to this relationship when the XQuery is successfully evaluated and the FlowFile is modified as a result

  • unmatched: FlowFiles are routed to this relationship when the XQuery does not match the content of the FlowFile and the Destination is set to flowfile-content

Writes Attributes

  • user-defined: This processor adds user-defined attributes if the <Destination> property is set to flowfile-attribute.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: Processing requires reading the entire FlowFile into memory

Additional Details

Examples:

This processor produces one attribute or FlowFile per XQueryResult. If only one attribute or FlowFile is desired, the following examples demonstrate how this can be achieved using the XQuery language. The examples below reference the following sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="foo.xsl"?>
<ns:fruitbasket xmlns:ns="http://namespace/1">
    <fruit taste="crisp">           <!-- Apples are my favorite-->
        <name>apple</name>
        <color>red</color>
    </fruit>
    <fruit>
        <name>apple</name>
        <color>green</color>
    </fruit>
    <fruit>
        <name>banana</name>
        <color>yellow</color>
    </fruit>
    <fruit taste="sweet">
        <name>orange</name>
        <color>orange</color>
    </fruit>
    <fruit>
        <name>blueberry</name>
        <color>blue</color>
    </fruit>
    <fruit taste="tart">
        <name>raspberry</name>
        <color>red</color>
    </fruit>
    <fruit>
        <name>none</name>
        <color/>
    </fruit>
</ns:fruitbasket>
xml
  • XQuery to return all “fruit” nodes individually (7 Results):

    • //fruit

  • XQuery to return only the first “fruit” node (1 Result):

    • //fruit[1]

  • XQuery to return only the last “fruit” node (1 Result):

    • //fruit[count(//fruit)]

  • XQuery to return all “fruit” nodes, wrapped in a “basket” tag (1 Result):

    • <basket>{//fruit}</basket>

  • XQuery to return all “fruit” names individually (7 Results):

    • //fruit/text()

  • XQuery to return only the first “fruit” name (1 Result):

    • //fruit[1]/text()

  • XQuery to return only the last “fruit” name (1 Result):

    • //fruit[count(//fruit)]/text()

  • XQuery to return all “fruit” names as a comma separated list (1 Result):

    • string-join((for $x in //fruit return $x/name/text()), ', ')

  • XQuery to return all “fruit” colors and names as a comma separated list (1 Result):

    • string-join((for $y in (for $x in //fruit return string-join(($x/color/text(), $x/name/text()), ' ')) return $y), ', ')

  • XQuery to return all “fruit” colors and names as a new line separated list (1 Result):

    • string-join((for $y in (for $x in //fruit return string-join(($x/color/text(), $x/name/text()), ' ')) return $y), '\n')

ExecuteGraphQuery

This processor is designed to execute queries in either the Cypher query language or the Tinkerpop Gremlin DSL. It delegates most of the logic to a configured client service that handles the interaction with the remote data source. All of the output is written out as JSON data.

Tags: cypher, neo4j, graph, network, insert, update, delete, put, get, node, relationship, connection, executor, gremlin, tinkerpop

Properties

Client Service

The graph client service for connecting to the graph database.

Graph Query

Specifies the graph query. If it is left blank, the processor will attempt to get the query from the FlowFile body.

Relationships

  • success: Successful FlowFiles are routed to this relationship

  • failure: Failed FlowFiles are routed to this relationship

  • original: If there is an input flowfile, the original input flowfile will be written to this relationship if the operation succeeds.

Writes Attributes

  • graph.error.message: GraphDB error message

  • graph.labels.added: Number of labels added

  • graph.nodes.created: Number of nodes created

  • graph.nodes.deleted: Number of nodes deleted

  • graph.properties.set: Number of properties set

  • graph.relations.created: Number of relationships created

  • graph.relations.deleted: Number of relationships deleted

  • graph.rows.returned: Number of rows returned

  • query.took: The amount of time in milliseconds that the query took to execute.

Input Requirement

This component allows an incoming relationship.

Additional Details

Description:

This processor is designed to work with Gremlin and Cypher queries. The query is specified in the configuration parameter labeled Query, and parameters can be configured using dynamic properties on the processor. All Gremlin and Cypher CRUD operations are supported by this processor. It will stream the entire result set into a single flowfile as a JSON array.

ExecuteGraphQueryRecord

This uses FlowFile records as input to perform graph mutations. Each record is associated with an individual query/mutation, and a FlowFile will be output for each successful operation. Failed records will be sent as a single FlowFile to the failure relationship.

Tags: graph, gremlin, cypher

Properties

Client Service

The graph client service for connecting to a graph database.

Record Reader

The record reader to use with this processor.

Failed Record Writer

The record writer to use for writing failed records.

Graph Record Script

Script to perform the business logic on graph, using flow file attributes and custom properties as variable-value pairs in its logic.

Dynamic Properties

A dynamic property to be used as a parameter in the graph script

Uses a record path to set a variable as a parameter in the graph script

Relationships

  • failure: Flow files that fail to interact with the graph server.

  • original: Original flow files that successfully interacted with the graph server.

  • response: The response object from the graph server.

Writes Attributes

  • graph.operations.took: The amount of time it took to execute all of the graph operations.

  • record.count: The number of records unsuccessfully processed (written on FlowFiles routed to the 'failure' relationship).

Input Requirement

This component requires an incoming relationship.

ExecuteGroovyScript

Experimental Extended Groovy script processor. The script is responsible for handling the incoming flow file (transfer to SUCCESS or remove, e.g.) as well as any flow files created by the script. If the handling is incomplete or incorrect, the session will be rolled back.

Tags: script, groovy, groovyx

Properties

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Failure strategy

What to do with unhandled exceptions. If you want to manage exceptions in code, keep the default value, 'rollback'. If 'transfer to failure' is selected and an unhandled exception occurs, all flowFiles received from incoming queues in this session will be transferred to the failure relationship with the additional attributes ERROR_MESSAGE and ERROR_STACKTRACE set. If 'rollback' is selected and an unhandled exception occurs, all flowFiles received from incoming queues will be penalized and returned. If the processor has no incoming connections, this parameter has no effect.

Additional classpath

Classpath list separated by semicolon or comma. You can use masks such as *, *.jar in the file name.

Dynamic Properties

A script engine property to update

Updates a script engine property specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value. Use CTL. to access any controller services, SQL. to access any DBCPServices, RecordReader. to access RecordReaderFactory instances, or RecordWriter. to access any RecordSetWriterFactory instances.

Relationships

  • success: FlowFiles that were successfully processed

  • failure: FlowFiles that failed to be processed

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component allows an incoming relationship.

See Also

Additional Details

Summary

This is the grooviest groovy script :)

Script Bindings:
variable type description

session

org.apache.nifi.processor.ProcessSession

the session that is used to get, change, and transfer input files

context

org.apache.nifi.processor.ProcessContext

the context (rarely useful)

log

org.apache.nifi.logging.ComponentLog

the logger for this processor instance

REL_SUCCESS

org.apache.nifi.processor.Relationship

the success relationship

REL_FAILURE

org.apache.nifi.processor.Relationship

the failure relationship

CTL

java.util.HashMap<String, ControllerService> (see https://github.com/apache/nifi/blob/main/nifi-api/src/main/java/org/apache/nifi/controller/ControllerService.java)

Map populated with controller services defined with CTL.* processor properties. The CTL.-prefixed properties can be linked to a controller service and provide access to that service from a script without additional code.

SQL

java.util.HashMap<String, groovy.sql.Sql>

Map populated with groovy.sql.Sql objects connected to the corresponding databases defined with SQL.* processor properties. The SQL.-prefixed properties can be linked only to a DBCPService.

RecordReader

java.util.HashMap<String, RecordReaderFactory> (see https://github.com/apache/nifi/blob/main/nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-service-api/src/main/java/org/apache/nifi/serialization/RecordReaderFactory.java)

Map populated with controller services defined with RecordReader.* processor properties. The RecordReader. prefixed properties are to be linked to RecordReaderFactory controller service instances.

RecordWriter

java.util.HashMap<String, RecordSetWriterFactory> (see https://github.com/apache/nifi/blob/main/nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-service-api/src/main/java/org/apache/nifi/serialization/RecordSetWriterFactory.java)

Map populated with controller services defined with RecordWriter.* processor properties. The RecordWriter. prefixed properties are to be linked to RecordSetWriterFactory controller service instances.

Dynamic processor properties

org.apache.nifi.components.PropertyDescriptor

All processor properties whose names do not start with CTL. or SQL. are bound to script variables

SQL map details

Example: if you define a property 'SQL.mydb' and link it to any DBCPService, then you can access it from code: SQL.mydb.rows('select * from mytable')

The processor automatically obtains a connection from the DBCP service before executing the script and tries to handle the transaction:
database transactions are automatically rolled back on script exception and committed on success.
Alternatively, you can manage the transaction manually (see the sketch below).
NOTE: The script must not close the connection.
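
A minimal sketch of manual transaction handling, assuming a property 'SQL.mydb' linked to a DBCPConnectionPool (the table and column names here are hypothetical):

// commit and rollback are handled explicitly by groovy.sql.Sql for this block
SQL.mydb.withTransaction {
    SQL.mydb.executeUpdate("update mytable set processed = 1 where id = ?", [42])
    SQL.mydb.executeUpdate("insert into audit_log (message) values (?)", ['processed id 42'])
}
groovy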

SessionFile - flow file extension

The SessionFile (org.apache.nifi.processors.groovyx.flow.SessionFile) is the actual object returned by the session in the Extended Groovy processor.
This flow file is a container that references the session and the real flow file.
This allows simplified syntax for working with file attributes and content:

set new attribute value

flowFile.ATTRIBUTE_NAME = ATTRIBUTE_VALUE
flowFile.'mime.type' = 'text/xml'
flowFile.putAttribute("ATTRIBUTE_NAME", ATTRIBUTE_VALUE)
// the same as
flowFile = session.putAttribute(flowFile, "ATTRIBUTE_NAME", ATTRIBUTE_VALUE)
groovy

remove attribute

flowFile.ATTRIBUTE_NAME = null
// equals to
flowFile = session.removeAttribute(flowFile, "ATTRIBUTE_NAME")
groovy

get attribute value

String a = flowFile.ATTRIBUTE_NAME
groovy

write content

flowFile.write("UTF-8", "THE CharSequence to write into flow file replacing current content")
flowFile.write("UTF-8") { writer ->
    // do something with java.io.Writer...
}
flowFile.write { outStream ->
    // do something with output stream...
}
flowFile.write { inStream, outStream ->
    // do something with input and output streams...
}
groovy

get content

InputStream i = flowFile.read()
def json = new groovy.json.JsonSlurper().parse(flowFile.read())
String text = flowFile.read().getText("UTF-8")
groovy

transfer flow file to success relation

REL_SUCCESS << flowFile
flowFile.transfer(REL_SUCCESS)
// the same as:
session.transfer(flowFile, REL_SUCCESS)
groovy

work with dbcp

import groovy.sql.Sql

// define property named `SQL.db` connected to a DBCPConnectionPool controller service
// for this case it's an H2 database example

// read value from the database with prepared statement
// and assign into flowfile attribute `db.yesterday`
def daysAdd = -1
def row = SQL.db.firstRow("select dateadd('DAY', ${daysAdd}, sysdate) as DB_DATE from dual")
flowFile.'db.yesterday' = row.DB_DATE

// to work with BLOBs and CLOBs in the database
// use parameter casting using groovy.sql.Sql.BLOB(Stream) and groovy.sql.Sql.CLOB(Reader)

// write content of the flow file into database blob
flowFile.read { rawIn ->
    def parms = [
            p_id  : flowFile.ID as Long, // get flow file attribute named 'ID'
            p_data: Sql.BLOB(rawIn),   // use input stream as BLOB sql parameter
    ]
    SQL.db.executeUpdate(parms, "update mytable set data = :p_data where id = :p_id")
}
groovy

Handling processor start & stop

In the extended groovy processor you can catch start, stop, and unscheduled events by providing the corresponding static methods:

import org.apache.nifi.processor.ProcessContext
import java.util.concurrent.atomic.AtomicLong

class Const {
    static Date startTime = null;
    static AtomicLong triggerCount = null;
}

static onStart(ProcessContext context) {
    Const.startTime = new Date()
    Const.triggerCount = new AtomicLong(0)
    println "onStart $context ${Const.startTime}"
}

static onStop(ProcessContext context) {
    def alive = (System.currentTimeMillis() - Const.startTime.getTime()) / 1000
    println "onStop $context executed ${Const.triggerCount} times during ${alive} seconds"
}

static onUnscheduled(ProcessContext context) {
    def alive = (System.currentTimeMillis() - Const.startTime.getTime()) / 1000
    println "onUnscheduled $context executed ${Const.triggerCount} times during ${alive} seconds"
}

flowFile.'trigger.count' = Const.triggerCount.incrementAndGet()
REL_SUCCESS << flowFile
groovy

ExecuteProcess

Runs an operating system command specified by the user and writes the output of that command to a FlowFile. If the command is expected to be long-running, the Processor can output the partial data on a specified interval. When this option is used, the output is expected to be in textual format, as it typically does not make sense to split binary data on arbitrary time-based intervals.

Tags: command, process, source, external, invoke, script

Properties

Command

Specifies the command to be executed; if just the name of an executable is provided, it must be in the user’s environment PATH.

Command Arguments

The arguments to supply to the executable delimited by white space. White space can be escaped by enclosing it in double-quotes.

Batch Duration

If the process is expected to be long-running and produce textual output, a batch duration can be specified so that the output will be captured for this amount of time and a FlowFile will then be sent out with the results and a new FlowFile will be started, rather than waiting for the process to finish before sending out the results

Redirect Error Stream

If true, any error stream output of the process will be redirected to the output stream. This is particularly helpful for processes which write extensively to the error stream or for troubleshooting.

Working Directory

The directory to use as the current working directory when executing the command

Argument Delimiter

Delimiter to use to separate arguments for a command [default: space]. Must be a single character.

Output MIME Type

Specifies the value to set for the "mime.type" attribute. This property is ignored if 'Batch Duration' is set.

Dynamic Properties

An environment variable name

These environment variables are passed to the process spawned by this Processor

Relationships

  • success: All created FlowFiles are routed to this relationship

Writes Attributes

  • command: Executed command

  • command.arguments: Arguments of the command

  • mime.type: Sets the MIME type of the output if the 'Output MIME Type' property is set and 'Batch Duration' is not set

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component does not allow an incoming relationship.

ExecuteScript

Experimental - Executes a script given the flow file and a process session. The script is responsible for handling the incoming flow file (transfer to SUCCESS or remove, e.g.) as well as any flow files created by the script. If the handling is incomplete or incorrect, the session will be rolled back. Experimental: Impact of sustained usage not yet verified.

Tags: script, execute, groovy, clojure

Properties

Script Engine

No Script Engines found

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Dynamic Properties

Script Engine Binding property

Updates a script engine property specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value

Relationships

  • success: FlowFiles that were successfully processed

  • failure: FlowFiles that failed to be processed

Stateful

Scope: Local, Cluster

Scripts can store and retrieve state using the State Management APIs. Consult the State Manager section of the Developer’s Guide for more details.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component allows an incoming relationship.

Additional Details

Description

The ExecuteScript Processor provides the ability to use a scripting language in order to leverage the NiFi API to perform tasks such as the following:

  • Read content and/or attributes from an incoming FlowFile

  • Create a new FlowFile (with or without a parent)

  • Write content and/or attributes to an outgoing FlowFile

  • Interact with the ProcessSession to transfer FlowFiles to relationships

  • Read/write to the State Manager to keep track of variables across executions of the processor

Notes:

  • ExecuteScript uses the JSR-223 Script Engine API to evaluate scripts, so the use of idiomatic language structure is sometimes limited. For example, in the case of Groovy, there is a separate ExecuteGroovyScript processor that allows you to do many more idiomatic Groovy tasks. For example, it’s easier to interact with Controller Services via ExecuteGroovyScript vs. ExecuteScript (see the ExecuteGroovyScript documentation for more details)

Variable Bindings

The Processor expects a user defined script that is evaluated when the processor is triggered. The following variables are available to the scripts:

Variable Name Description Variable Class

session

This is a reference to the ProcessSession assigned to the processor. The session allows you to perform operations on FlowFiles such as create(), putAttribute(), and transfer(), as well as read() and write()

ProcessSession

context

This is a reference to the ProcessContext for the processor. It can be used to retrieve processor properties, relationships, Controller Services, and the State Manager.

ProcessContext

log

This is a reference to the ComponentLog for the processor. Use it to log messages to NiFi, such as log.info('Hello world!')

ComponentLog

REL_SUCCESS

This is a reference to the “success” relationship defined for the processor. It could also be inherited by referencing the static member of the parent class (ExecuteScript), but this is a convenience variable. It also saves having to use the fully-qualified name for the relationship.

Relationship

REL_FAILURE

This is a reference to the “failure” relationship defined for the processor. As with REL_SUCCESS, it could also be inherited by referencing the static member of the parent class (ExecuteScript), but this is a convenience variable. It also saves having to use the fully-qualified name for the relationship.

Relationship

Dynamic Properties

Any dynamic (user-defined) properties defined in ExecuteScript are passed to the script engine as variables set to the PropertyValue object corresponding to the dynamic property. This allows you to get the String value of the property, but also to evaluate the property with respect to NiFi Expression Language, cast the value as an appropriate data type (e.g., Boolean), etc. Because the dynamic property name becomes the variable name for the script, you must be aware of the variable naming properties for the chosen script engine. For example, Groovy does not allow periods (.) in variable names, so an error will occur if “my.property” was a dynamic property name. Interaction with these variables is done via the NiFi Java API. The ‘Dynamic Properties’ section below will discuss the relevant API calls as they are introduced.

PropertyValue
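
As a minimal sketch (assuming a dynamic property named myProperty has been added to the processor; the attribute name written below is illustrative), the bound PropertyValue can be evaluated and read like this in Groovy:

flowFile = session.get()
if (!flowFile) return
// evaluate Expression Language against the FlowFile, then read the resulting String
def resolved = myProperty.evaluateAttributeExpressions(flowFile).getValue()
// typed access is also available, e.g. myProperty.asInteger() or myProperty.asBoolean()
flowFile = session.putAttribute(flowFile, 'my.resolved.value', resolved ?: '')
session.transfer(flowFile, REL_SUCCESS)
groovy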

Example Scripts

Get an incoming FlowFile from the session

Use Case: You have incoming connection(s) to ExecuteScript and want to retrieve one FlowFile from the queue(s) for processing.

Approach: Use the get() method from the session object. This method returns the next highest-priority FlowFile to process. If there is no FlowFile to process, the method will return null. NOTE: It is possible to have null returned even if there is a steady flow of FlowFiles into the processor. This can happen if there are multiple concurrent tasks for the processor, and the other task(s) have already retrieved the FlowFiles. If the script requires a FlowFile to continue processing, then it should immediately return if null is returned from session.get()

Groovy

flowFile = session.get()
if (!flowFile) return
groovy

Get multiple incoming FlowFiles from the session:

Use Case: You have incoming connection(s) to ExecuteScript and want to retrieve multiple FlowFiles from the queue(s) for processing.

Approach: Use the get(maxResults) method from the session object. This method returns up to maxResults FlowFiles from the work queue. If no FlowFiles are available, an empty list is returned (the method does not return null). NOTE: If multiple incoming queues are present, the behavior is unspecified in terms of whether all queues or only a single queue will be polled in a single call. Having said that, the observed behavior (for both NiFi 1.1.0+ and before) is described here.

Examples:

Groovy

flowFileList = session.get(100)
if (!flowFileList.isEmpty()) {
    flowFileList.each { flowFile ->
// Process each FlowFile here
    }
}
groovy

Create a new FlowFile

Use Case: You want to generate a new FlowFile to send to the next processor.

Approach: Use the create() method from the session object. This method returns a new FlowFile object, which you can perform further processing on

Examples:

Groovy

flowFile = session.create()
// Additional processing here
groovy

Create a new FlowFile from a parent FlowFile

Use Case: You want to generate new FlowFile(s) based on an incoming FlowFile.

Approach: Use the create(parentFlowFile) method from the session object. This method takes a parent FlowFile reference and returns a new child FlowFile object. The newly created FlowFile will inherit all the parent’s attributes except for the UUID. This method will automatically generate a Provenance FORK event or a Provenance JOIN event, depending on whether other FlowFiles are generated from the same parent before the ProcessSession is committed.

Examples:

Groovy

flowFile = session.get()
if (!flowFile) return
newFlowFile = session.create(flowFile)
// Additional processing here
groovy

Add an attribute to a FlowFile

Use Case: You have a FlowFile to which you’d like to add a custom attribute.

Approach: Use the putAttribute(flowFile, attributeKey, attributeValue) method from the session object. This method updates the given FlowFile’s attributes with the given key/value pair. NOTE: The “uuid” attribute is fixed for a FlowFile and cannot be modified; if the key is named “uuid”, it will be ignored.

Also this is a good point to mention that FlowFile objects are immutable; this means that if you update a FlowFile’s attributes (or otherwise alter it) via the API, you will get a new reference to the new version of the FlowFile. This is very important when it comes to transferring FlowFiles to relationships. You must keep a reference to the latest version of a FlowFile, and you must transfer or remove the latest version of all FlowFiles retrieved from or created by the session, otherwise you will get an error when executing. Most often, the variable used to store a FlowFile reference will be overwritten with the latest version returned from a method that alters the FlowFile (intermediate FlowFile references will be automatically discarded). In these examples you will see this technique of reusing a FlowFile reference when adding attributes. Note that the current reference to the FlowFile is passed into the putAttribute() method. The resulting FlowFile has an attribute named ‘myAttr’ with a value of ‘myValue’. Also note that the method takes a String for the value; if you have an Object you will have to serialize it to a String. Finally, please note that if you are adding multiple attributes, it is better to create a Map and use putAllAttributes() instead (see next recipe for details).

Examples:

Groovy

flowFile = session.get()
if (!flowFile) return
flowFile = session.putAttribute(flowFile, 'myAttr', 'myValue')
groovy

Add multiple attributes to a FlowFile

Use Case: You have a FlowFile to which you’d like to add custom attributes.

Approach: Use the putAllAttributes(flowFile, attributeMap) method from the session object. This method updates the given FlowFile’s attributes with the key/value pairs from the given Map. NOTE: The “uuid” attribute is fixed for a FlowFile and cannot be modified; if the key is named “uuid”, it will be ignored.

Examples:

Groovy

attrMap = ['myAttr1': '1', 'myAttr2': Integer.toString(2)]
flowFile = session.get()
if (!flowFile) return
flowFile = session.putAllAttributes(flowFile, attrMap)
groovy

Get an attribute from a FlowFile

Use Case: You have a FlowFile from which you’d like to inspect an attribute.

Approach: Use the getAttribute(attributeKey) method from the FlowFile object. This method returns the String value for the given attributeKey, or null if the attributeKey is not found. The examples show the retrieval of the value for the “filename” attribute.

Examples:

Groovy

flowFile = session.get()
if (!flowFile) return
myAttr = flowFile.getAttribute('filename')
groovy

Get all attributes from a FlowFile

Use Case: You have a FlowFile from which you’d like to retrieve its attributes.

Approach: Use the getAttributes() method from the FlowFile object. This method returns a Map with String keys and String values, representing the key/value pairs of attributes for the FlowFile. The examples show an iteration over the Map of all attributes for a FlowFile.

Examples:

Groovy

flowFile = session.get()
if (!flowFile) return
flowFile.getAttributes().each { key, value ->
// Do something with the key/value pair
}
groovy

Transfer a FlowFile to a relationship

Use Case: After processing a FlowFile (new or incoming), you want to transfer the FlowFile to a relationship (“success” or “failure”). In this simple case let us assume there is a variable called “errorOccurred” that indicates the relationship to which the FlowFile should be transferred. Additional error handling techniques will be discussed in part 2 of this series.

Approach: Use the transfer(flowFile, relationship) method from the session object. From the documentation: this method transfers the given FlowFile to the appropriate destination processor work queue(s) based on the given relationship. If the relationship leads to more than one destination the state of the FlowFile is replicated such that each destination receives an exact copy of the FlowFile though each will have its own unique identity.

ExecuteScript will perform a session.commit() at the end of each execution to ensure the operations have been committed. You do not need to (and should not) perform a session.commit() within the script.

Examples:

Groovy

flowFile = session.get()
if (!flowFile) return
// Processing occurs here
if (errorOccurred) {
    session.transfer(flowFile, REL_FAILURE)
} else {
    session.transfer(flowFile, REL_SUCCESS)
}
groovy

Send a message to the log at a specified logging level

Use Case: You want to report some event that has occurred during processing to the logging framework.

Approach: Use the log variable with the warn(), trace(), debug(), info(), or error() methods. These methods can take a single String, or a String followed by an array of Objects, or a String followed by an array of Objects followed by a Throwable. The first one is used for simple messages. The second is used when you have some dynamic objects/values that you want to log. To refer to these in the message string use “{}” in the message. These are evaluated against the Object array in order of appearance, so if the message reads “Found these things: {} {} {}” and the Object array is [‘Hello’, 1, true], then the logged message will be “Found these things: Hello 1 true”. The third form of these logging methods also takes a Throwable parameter, and is useful when an exception is caught and you want to log it.

Examples:

Groovy

log.info('Found these things: {} {} {}', ['Hello', 1, true] as Object[])
groovy

Read the contents of an incoming FlowFile using a callback

Use Case: You have incoming connection(s) to ExecuteScript and want to retrieve the contents of a FlowFile from the queue(s) for processing.

Approach: Use the read(flowFile, inputStreamCallback) method from the session object. An InputStreamCallback object is needed to pass into the read() method. Note that because InputStreamCallback is an object, the contents are only visible to that object by default. If you need to use the data outside the read() method, use a more globally-scoped variable. The examples will store the full contents of the incoming FlowFile into a String (using Apache Commons’ IOUtils class). NOTE: For large FlowFiles, this is not the best technique; rather you should read in only as much data as you need, and process that as appropriate. For something like SplitText, you could read in a line at a time and process it within the InputStreamCallback, or use the session.read(flowFile) approach mentioned earlier to get an InputStream reference to use outside the callback.

Examples:

Groovy

import org.apache.commons.io.IOUtils
import java.nio.charset.StandardCharsets

flowFile = session.get()
if (!flowFile) return
def text = ''
// Cast a closure with an inputStream parameter to InputStreamCallback
session.read(flowFile, { inputStream ->
    text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
// Do something with text here
} as InputStreamCallback)
groovy

Write content to an outgoing FlowFile using a callback

Use Case: You want to generate content for an outgoing FlowFile.

Approach: Use the write(flowFile, outputStreamCallback) method from the session object. An OutputStreamCallback object is needed to pass into the write() method. Note that because OutputStreamCallback is an object, the contents are only visible to that object by default. If you need to use the data outside the write() method, use a more globally-scoped variable. The examples will write a sample String to a FlowFile.

Examples:

Groovy

import org.apache.commons.io.IOUtils
import java.nio.charset.StandardCharsets

flowFile = session.get()
if (!flowFile) return
def text = 'Hello world!'
// Cast a closure with an outputStream parameter to OutputStreamCallback
flowFile = session.write(flowFile, { outputStream ->
    outputStream.write(text.getBytes(StandardCharsets.UTF_8))
} as OutputStreamCallback)
groovy

Overwrite an incoming FlowFile with updated content using a callback

Use Case: You want to reuse the incoming FlowFile but want to modify its content for the outgoing FlowFile.

Approach: Use the write(flowFile, streamCallback) method from the session object. A StreamCallback object is needed to pass into the write() method. StreamCallback provides both an InputStream (from the incoming FlowFile) and an OutputStream (for the next version of that FlowFile), so you can use the InputStream to get the current contents of the FlowFile, then modify them and write them back out to the FlowFile. This overwrites the contents of the FlowFile, so for append you’d have to handle that by appending to the read-in contents, or use a different approach (with session.append() rather than session.write()). Note that because StreamCallback is an object, the contents are only visible to that object by default. If you need to use the data outside the write() method, use a more globally-scoped variable. The examples will reverse the contents of the incoming flowFile (assumed to be a String) and write out the reversed string to a new version of the FlowFile.

Examples:

Groovy

import org.apache.commons.io.IOUtils
import java.nio.charset.StandardCharsets

flowFile = session.get()
if (!flowFile) return
def text = 'Hello world!'
// Cast a closure with an inputStream and outputStream parameter to StreamCallback
flowFile = session.write(flowFile, { inputStream, outputStream ->
    text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
    outputStream.write(text.reverse().getBytes(StandardCharsets.UTF_8))
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)
groovy

Handle errors during script processing

Use Case: An error occurs in the script (either by data validation or a thrown exception), and you want the script to handle it gracefully.

Approach: For exceptions, use the exception-handling mechanism for the scripting language (often they are try/catch block(s)). For data validation, you can use a similar approach, but define a boolean variable like “valid” and an if/else clause rather than a try/catch clause. ExecuteScript defines “success” and “failure” relationships; often your processing will transfer “good” FlowFiles to success and “bad” FlowFiles to failure (logging an error in the latter case).

Examples:

Groovy

flowFile = session.get()
if (!flowFile) return
try {
// Something that might throw an exception here

// Last operation is transfer to success (failures handled in the catch block)
    session.transfer(flowFile, REL_SUCCESS)
} catch (e) {
    log.error('Something went wrong', e)
    session.transfer(flowFile, REL_FAILURE)
}
groovy

ExecuteSQL

Executes the provided SQL select query. The query result will be converted to Avro format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query, and the query may use ? placeholders for parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format. The FlowFile attribute 'executesql.row.count' indicates how many rows were selected.

Tags: sql, select, jdbc, query, database

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain connection to database

SQL Pre-Query

A semicolon-delimited list of queries executed before the main SQL query is executed. For example, set session properties before main query. It’s possible to include semicolons in the statements themselves by escaping them with a backslash ('\;'). Results/outputs from these queries will be suppressed if there are no errors.

SQL select query

The SQL select query to execute. The query can be empty, a constant value, or built from attributes using Expression Language. If this property is specified, it will be used regardless of the content of incoming flowfiles. If this property is empty, the content of the incoming flow file is expected to contain a valid SQL select query, to be issued by the processor to the database. Note that Expression Language is not evaluated for flow file contents.

SQL Post-Query

A semicolon-delimited list of queries executed after the main SQL query is executed, for example to set session properties after the main query. It’s possible to include semicolons in the statements themselves by escaping them with a backslash ('\;'). Results/outputs from these queries will be suppressed if there are no errors.

Max Wait Time

The maximum amount of time allowed for a running SQL select query; zero means there is no limit. A max time of less than 1 second is treated as zero.

Normalize Table/Column Names

Whether to change non-Avro-compatible characters in column names to Avro-compatible characters. For example, colons and periods will be changed to underscores in order to build a valid Avro record.

Use Avro Logical Types

Whether to use Avro Logical Types for DECIMAL/NUMBER, DATE, TIME and TIMESTAMP columns. If disabled, written as string. If enabled, Logical types are used and written as its underlying type, specifically, DECIMAL/NUMBER as logical 'decimal': written as bytes with additional precision and scale meta data, DATE as logical 'date-millis': written as int denoting days since Unix epoch (1970-01-01), TIME as logical 'time-millis': written as int denoting milliseconds since Unix epoch, and TIMESTAMP as logical 'timestamp-millis': written as long denoting milliseconds since Unix epoch. If a reader of written Avro records also knows these logical types, then these values can be deserialized with more context depending on reader implementation.

Compression Format

Compression type to use when writing Avro files. Default is None.

Default Decimal Precision

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'precision' denoting the number of available digits is required. Generally, precision is defined by the column data type definition or the database engine's default. However, undefined precision (0) can be returned from some database engines. 'Default Decimal Precision' is used when writing those undefined-precision numbers.

Default Decimal Scale

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'scale' denoting the number of available decimal digits is required. Generally, scale is defined by the column data type definition or the database engine's default. However, when undefined precision (0) is returned, scale can also be uncertain with some database engines. 'Default Decimal Scale' is used when writing those undefined numbers. If a value has more decimals than the specified scale, then the value will be rounded up, e.g. 1.53 becomes 2 with scale 0, and 1.5 with scale 1.

Max Rows Per Flow File

The maximum number of result rows that will be included in a single FlowFile. This will allow you to break up very large result sets into multiple FlowFiles. If the value specified is zero, then all rows are returned in a single FlowFile.

Output Batch Size

The number of output FlowFiles to queue before committing the process session. When set to zero, the session will be committed when all result set rows have been processed and the output FlowFiles are ready for transfer to the downstream relationship. For large result sets, this can cause a large burst of FlowFiles to be transferred at the end of processor execution. If this property is set, then once the specified number of FlowFiles are ready for transfer, the session will be committed, thus releasing the FlowFiles to the downstream relationship. NOTE: The fragment.count attribute will not be set on FlowFiles when this property is set.

Fetch Size

The number of result rows to be fetched from the result set at a time. This is a hint to the database driver and may not be honored and/or exact. If the value specified is zero, then the hint is ignored.

Set Auto Commit

Enables or disables the auto-commit functionality of the DB connection. The default value is 'true'. The default works with most JDBC drivers, and this setting has little impact in most cases since this processor is used to read data. However, for some JDBC drivers, such as the PostgreSQL driver, auto-commit must be disabled in order to limit the number of result rows fetched at a time; when auto-commit is enabled, the PostgreSQL driver loads the whole result set into memory at once, which can lead to high memory usage when executing queries that fetch large data sets. More details on this behaviour of the PostgreSQL driver can be found at https://jdbc.postgresql.org//documentation/head/query.html.
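
The snippet below is a minimal plain-JDBC sketch (illustration only, not NiFi source code) of the pattern this property relies on; the connection URL, credentials and table name are placeholders. In the processor, the same two knobs are exposed as the 'Set Auto Commit' and 'Fetch Size' properties.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class CursorFetchSketch {
    public static void main(String[] args) throws SQLException {
        // Placeholder URL and credentials for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/exampledb", "user", "secret")) {
            conn.setAutoCommit(false);       // PostgreSQL streams results only when auto-commit is off
            try (Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(1000);     // corresponds to the processor's 'Fetch Size' hint
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
                    while (rs.next()) {
                        // rows arrive in batches of roughly 1000 instead of the whole result set at once
                    }
                }
            }
            conn.commit();                   // end the transaction that the cursor lived in
        }
    }
}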

Dynamic Properties

sql.args.N.type

Incoming FlowFiles are expected to be parametrized SQL statements. The type of each Parameter is specified as an integer that represents the JDBC Type of the parameter. The following types are accepted: [LONGNVARCHAR: -16], [BIT: -7], [BOOLEAN: 16], [TINYINT: -6], [BIGINT: -5], [LONGVARBINARY: -4], [VARBINARY: -3], [BINARY: -2], [LONGVARCHAR: -1], [CHAR: 1], [NUMERIC: 2], [DECIMAL: 3], [INTEGER: 4], [SMALLINT: 5] [FLOAT: 6], [REAL: 7], [DOUBLE: 8], [VARCHAR: 12], [DATE: 91], [TIME: 92], [TIMESTAMP: 93], [VARCHAR: 12], [CLOB: 2005], [NCLOB: 2011]

sql.args.N.value

Incoming FlowFiles are expected to be parametrized SQL statements. The value of the Parameters are specified as sql.args.1.value, sql.args.2.value, sql.args.3.value, and so on. The type of the sql.args.1.value Parameter is specified by the sql.args.1.type attribute.

sql.args.N.format

This attribute is always optional, but the default options may not always work for your data. Incoming FlowFiles are expected to be parametrized SQL statements. In some cases a format option needs to be specified; currently this is only applicable to binary data types, dates, times and timestamps. Binary data types (defaults to 'ascii') - ascii: each string character in your attribute value represents a single byte; this is the format provided by Avro Processors. base64: the string is a Base64-encoded string that can be decoded to bytes. hex: the string is hex encoded with all letters in upper case and no '0x' at the beginning. Dates/Times/Timestamps - Date, Time and Timestamp formats all support either a custom format or a named format ('yyyy-MM-dd', 'ISO_OFFSET_DATE_TIME') as specified by java.time.format.DateTimeFormatter. If no format is specified, a long value is expected to be a Unix epoch value (milliseconds since 1970-01-01), while a string value is expected in 'yyyy-MM-dd' format for Date, 'HH:mm:ss.SSS' for Time (some database engines, e.g. Derby or MySQL, do not support milliseconds and will truncate them), and 'yyyy-MM-dd HH:mm:ss.SSS' for Timestamp.
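
For example (hypothetical query and values), an incoming FlowFile whose content is the parametrized query SELECT * FROM orders WHERE status = ? AND created_at > ? might carry the following attributes:

sql.args.1.type = 12
sql.args.1.value = SHIPPED
sql.args.2.type = 93
sql.args.2.value = 2023-05-03 14:05:32.000

Here 12 is the JDBC type code for VARCHAR and 93 the code for TIMESTAMP; because the timestamp value uses the default 'yyyy-MM-dd HH:mm:ss.SSS' layout, no sql.args.2.format attribute is needed.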

Relationships

  • success: Successfully created FlowFile from SQL query result set.

  • failure: SQL query execution failed. Incoming FlowFile will be penalized and routed to this relationship

Reads Attributes

  • sql.args.N.type: Incoming FlowFiles are expected to be parametrized SQL statements. The type of each Parameter is specified as an integer that represents the JDBC Type of the parameter. The following types are accepted: [LONGNVARCHAR: -16], [BIT: -7], [BOOLEAN: 16], [TINYINT: -6], [BIGINT: -5], [LONGVARBINARY: -4], [VARBINARY: -3], [BINARY: -2], [LONGVARCHAR: -1], [CHAR: 1], [NUMERIC: 2], [DECIMAL: 3], [INTEGER: 4], [SMALLINT: 5] [FLOAT: 6], [REAL: 7], [DOUBLE: 8], [VARCHAR: 12], [DATE: 91], [TIME: 92], [TIMESTAMP: 93], [VARCHAR: 12], [CLOB: 2005], [NCLOB: 2011]

  • sql.args.N.value: Incoming FlowFiles are expected to be parametrized SQL statements. The value of the Parameters are specified as sql.args.1.value, sql.args.2.value, sql.args.3.value, and so on. The type of the sql.args.1.value Parameter is specified by the sql.args.1.type attribute.

  • sql.args.N.format: This attribute is always optional, but the default options may not always work for your data. Incoming FlowFiles are expected to be parametrized SQL statements. In some cases a format option needs to be specified; currently this is only applicable to binary data types, dates, times and timestamps. Binary data types (defaults to 'ascii') - ascii: each string character in your attribute value represents a single byte; this is the format provided by Avro Processors. base64: the string is a Base64-encoded string that can be decoded to bytes. hex: the string is hex encoded with all letters in upper case and no '0x' at the beginning. Dates/Times/Timestamps - Date, Time and Timestamp formats all support either a custom format or a named format ('yyyy-MM-dd', 'ISO_OFFSET_DATE_TIME') as specified by java.time.format.DateTimeFormatter. If no format is specified, a long value is expected to be a Unix epoch value (milliseconds since 1970-01-01), while a string value is expected in 'yyyy-MM-dd' format for Date, 'HH:mm:ss.SSS' for Time (some database engines, e.g. Derby or MySQL, do not support milliseconds and will truncate them), and 'yyyy-MM-dd HH:mm:ss.SSS' for Timestamp.

Writes Attributes

  • executesql.row.count: Contains the number of rows returned by the query. If 'Max Rows Per Flow File' is set, then this number will reflect the number of rows in the Flow File instead of the entire result set.

  • executesql.query.duration: Combined duration of the query execution time and fetch time in milliseconds. If 'Max Rows Per Flow File' is set, then this number will reflect only the fetch time for the rows in the Flow File instead of the entire result set.

  • executesql.query.executiontime: Duration of the query execution time in milliseconds. This number will reflect the query execution time regardless of the 'Max Rows Per Flow File' setting.

  • executesql.query.fetchtime: Duration of the result set fetch time in milliseconds. If 'Max Rows Per Flow File' is set, then this number will reflect only the fetch time for the rows in the Flow File instead of the entire result set.

  • executesql.resultset.index: Assuming multiple result sets are returned, the zero based index of this result set.

  • executesql.error.message: If processing an incoming flow file causes an Exception, the Flow File is routed to failure and this attribute is set to the exception message.

  • fragment.identifier: If 'Max Rows Per Flow File' is set then all FlowFiles from the same query result set will have the same value for the fragment.identifier attribute. This can then be used to correlate the results.

  • fragment.count: If 'Max Rows Per Flow File' is set then this is the total number of FlowFiles produced by a single ResultSet. This can be used in conjunction with the fragment.identifier attribute in order to know how many FlowFiles belonged to the same incoming ResultSet. If Output Batch Size is set, then this attribute will not be populated.

  • fragment.index: If 'Max Rows Per Flow File' is set then the position of this FlowFile in the list of outgoing FlowFiles that were all derived from the same result set FlowFile. This can be used in conjunction with the fragment.identifier attribute to know which FlowFiles originated from the same query result set and in what order FlowFiles were produced (see the worked example after this list).

  • input.flowfile.uuid: If the processor has an incoming connection, outgoing FlowFiles will have this attribute set to the value of the input FlowFile’s UUID. If there is no incoming connection, the attribute will not be added.
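
As a worked illustration of the fragment attributes above (hypothetical numbers): if a query returns 2,500 rows and 'Max Rows Per Flow File' is set to 1000, three FlowFiles are produced, all sharing the same fragment.identifier, with fragment.index values 0, 1 and 2 and executesql.row.count values 1000, 1000 and 500. Provided 'Output Batch Size' is not set, each of the three also carries fragment.count = 3.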

Input Requirement

This component allows an incoming relationship.

ExecuteSQLRecord

Executes provided SQL select query. Query result will be converted to the format specified by a Record Writer. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query, and the query may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format. FlowFile attribute 'executesql.row.count' indicates how many rows were selected.

Tags: sql, select, jdbc, query, database, record

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain a connection to the database.

SQL Pre-Query

A semicolon-delimited list of queries executed before the main SQL query is executed. For example, setting session properties before the main query. It’s possible to include semicolons in the statements themselves by escaping them with a backslash ('\;'). Results/outputs from these queries will be suppressed if there are no errors.

SQL select query

The SQL select query to execute. The query can be empty, a constant value, or built from attributes using Expression Language. If this property is specified, it will be used regardless of the content of incoming flowfiles. If this property is empty, the content of the incoming flow file is expected to contain a valid SQL select query, to be issued by the processor to the database. Note that Expression Language is not evaluated for flow file contents.

SQL Post-Query

A semicolon-delimited list of queries executed after the main SQL query is executed. For example, setting session properties after the main query. It’s possible to include semicolons in the statements themselves by escaping them with a backslash ('\;'). Results/outputs from these queries will be suppressed if there are no errors.

Max Wait Time

The maximum amount of time allowed for a running SQL select query; zero means there is no limit. A maximum time of less than 1 second is treated as zero.

Record Writer

Specifies the Controller Service to use for writing results to a FlowFile. The Record Writer may use Inherit Schema to emulate the inferred schema behavior, i.e. an explicit schema need not be defined in the writer, and will be supplied by the same logic used to infer the schema from the column types.

Normalize Table/Column Names

Whether to change characters in column names. For example, colons and periods will be changed to underscores.

Use Avro Logical Types

Whether to use Avro Logical Types for DECIMAL/NUMBER, DATE, TIME and TIMESTAMP columns. If disabled, these values are written as strings. If enabled, Logical Types are used and the values are written as their underlying types: DECIMAL/NUMBER as logical 'decimal', written as bytes with additional precision and scale metadata; DATE as logical 'date', written as an int denoting days since the Unix epoch (1970-01-01); TIME as logical 'time-millis', written as an int denoting milliseconds since midnight; and TIMESTAMP as logical 'timestamp-millis', written as a long denoting milliseconds since the Unix epoch. If a reader of the written Avro records also understands these logical types, the values can be deserialized with more context, depending on the reader implementation.

Default Decimal Precision

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'precision' denoting the number of available digits is required. Generally, precision is defined by the column data type definition or the database engine's default. However, an undefined precision (0) can be returned from some database engines. 'Default Decimal Precision' is used when writing those undefined-precision numbers.

Default Decimal Scale

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'scale' denoting the number of available decimal digits is required. Generally, scale is defined by the column data type definition or the database engine's default. However, when an undefined precision (0) is returned, the scale can also be uncertain with some database engines. 'Default Decimal Scale' is used when writing those undefined numbers. If a value has more decimal places than the specified scale, it will be rounded; e.g. 1.53 becomes 2 with scale 0, and 1.5 with scale 1.

Max Rows Per Flow File

The maximum number of result rows that will be included in a single FlowFile. This will allow you to break up very large result sets into multiple FlowFiles. If the value specified is zero, then all rows are returned in a single FlowFile.

Output Batch Size

The number of output FlowFiles to queue before committing the process session. When set to zero, the session will be committed when all result set rows have been processed and the output FlowFiles are ready for transfer to the downstream relationship. For large result sets, this can cause a large burst of FlowFiles to be transferred at the end of processor execution. If this property is set, then once the specified number of FlowFiles are ready for transfer, the session will be committed, thus releasing the FlowFiles to the downstream relationship. NOTE: The fragment.count attribute will not be set on FlowFiles when this property is set.

Fetch Size

The number of result rows to be fetched from the result set at a time. This is a hint to the database driver and may not be honored and/or exact. If the value specified is zero, then the hint is ignored.

Set Auto Commit

Enables or disables the auto-commit functionality of the DB connection. The default value is 'true'. The default works with most JDBC drivers, and this setting has little impact in most cases since this processor is used to read data. However, for some JDBC drivers, such as the PostgreSQL driver, auto-commit must be disabled in order to limit the number of result rows fetched at a time; when auto-commit is enabled, the PostgreSQL driver loads the whole result set into memory at once, which can lead to high memory usage when executing queries that fetch large data sets. More details on this behaviour of the PostgreSQL driver can be found at https://jdbc.postgresql.org//documentation/head/query.html.

Dynamic Properties

sql.args.N.type

Incoming FlowFiles are expected to be parametrized SQL statements. The type of each Parameter is specified as an integer that represents the JDBC Type of the parameter. The following types are accepted: [LONGNVARCHAR: -16], [BIT: -7], [BOOLEAN: 16], [TINYINT: -6], [BIGINT: -5], [LONGVARBINARY: -4], [VARBINARY: -3], [BINARY: -2], [LONGVARCHAR: -1], [CHAR: 1], [NUMERIC: 2], [DECIMAL: 3], [INTEGER: 4], [SMALLINT: 5] [FLOAT: 6], [REAL: 7], [DOUBLE: 8], [VARCHAR: 12], [DATE: 91], [TIME: 92], [TIMESTAMP: 93], [VARCHAR: 12], [CLOB: 2005], [NCLOB: 2011]

sql.args.N.value

Incoming FlowFiles are expected to be parametrized SQL statements. The value of the Parameters are specified as sql.args.1.value, sql.args.2.value, sql.args.3.value, and so on. The type of the sql.args.1.value Parameter is specified by the sql.args.1.type attribute.

sql.args.N.format

This attribute is always optional, but the default options may not always work for your data. Incoming FlowFiles are expected to be parametrized SQL statements. In some cases a format option needs to be specified; currently this is only applicable to binary data types, dates, times and timestamps. Binary data types (defaults to 'ascii') - ascii: each string character in your attribute value represents a single byte; this is the format provided by Avro Processors. base64: the string is a Base64-encoded string that can be decoded to bytes. hex: the string is hex encoded with all letters in upper case and no '0x' at the beginning. Dates/Times/Timestamps - Date, Time and Timestamp formats all support either a custom format or a named format ('yyyy-MM-dd', 'ISO_OFFSET_DATE_TIME') as specified by java.time.format.DateTimeFormatter. If no format is specified, a long value is expected to be a Unix epoch value (milliseconds since 1970-01-01), while a string value is expected in 'yyyy-MM-dd' format for Date, 'HH:mm:ss.SSS' for Time (some database engines, e.g. Derby or MySQL, do not support milliseconds and will truncate them), and 'yyyy-MM-dd HH:mm:ss.SSS' for Timestamp.

Relationships

  • success: Successfully created FlowFile from SQL query result set.

  • failure: SQL query execution failed. Incoming FlowFile will be penalized and routed to this relationship

Reads Attributes

  • sql.args.N.type: Incoming FlowFiles are expected to be parametrized SQL statements. The type of each Parameter is specified as an integer that represents the JDBC Type of the parameter. The following types are accepted: [LONGNVARCHAR: -16], [BIT: -7], [BOOLEAN: 16], [TINYINT: -6], [BIGINT: -5], [LONGVARBINARY: -4], [VARBINARY: -3], [BINARY: -2], [LONGVARCHAR: -1], [CHAR: 1], [NUMERIC: 2], [DECIMAL: 3], [INTEGER: 4], [SMALLINT: 5] [FLOAT: 6], [REAL: 7], [DOUBLE: 8], [VARCHAR: 12], [DATE: 91], [TIME: 92], [TIMESTAMP: 93], [VARCHAR: 12], [CLOB: 2005], [NCLOB: 2011]

  • sql.args.N.value: Incoming FlowFiles are expected to be parametrized SQL statements. The value of the Parameters are specified as sql.args.1.value, sql.args.2.value, sql.args.3.value, and so on. The type of the sql.args.1.value Parameter is specified by the sql.args.1.type attribute.

  • sql.args.N.format: This attribute is always optional, but the default options may not always work for your data. Incoming FlowFiles are expected to be parametrized SQL statements. In some cases a format option needs to be specified; currently this is only applicable to binary data types, dates, times and timestamps. Binary data types (defaults to 'ascii') - ascii: each string character in your attribute value represents a single byte; this is the format provided by Avro Processors. base64: the string is a Base64-encoded string that can be decoded to bytes. hex: the string is hex encoded with all letters in upper case and no '0x' at the beginning. Dates/Times/Timestamps - Date, Time and Timestamp formats all support either a custom format or a named format ('yyyy-MM-dd', 'ISO_OFFSET_DATE_TIME') as specified by java.time.format.DateTimeFormatter. If no format is specified, a long value is expected to be a Unix epoch value (milliseconds since 1970-01-01), while a string value is expected in 'yyyy-MM-dd' format for Date, 'HH:mm:ss.SSS' for Time (some database engines, e.g. Derby or MySQL, do not support milliseconds and will truncate them), and 'yyyy-MM-dd HH:mm:ss.SSS' for Timestamp.

Writes Attributes

  • executesql.row.count: Contains the number of rows returned in the select query

  • executesql.query.duration: Combined duration of the query execution time and fetch time in milliseconds

  • executesql.query.executiontime: Duration of the query execution time in milliseconds

  • executesql.query.fetchtime: Duration of the result set fetch time in milliseconds

  • executesql.resultset.index: Assuming multiple result sets are returned, the zero based index of this result set.

  • executesql.error.message: If processing an incoming flow file causes an Exception, the Flow File is routed to failure and this attribute is set to the exception message.

  • fragment.identifier: If 'Max Rows Per Flow File' is set then all FlowFiles from the same query result set will have the same value for the fragment.identifier attribute. This can then be used to correlate the results.

  • fragment.count: If 'Max Rows Per Flow File' is set then this is the total number of FlowFiles produced by a single ResultSet. This can be used in conjunction with the fragment.identifier attribute in order to know how many FlowFiles belonged to the same incoming ResultSet. If Output Batch Size is set, then this attribute will not be populated.

  • fragment.index: If 'Max Rows Per Flow File' is set then the position of this FlowFile in the list of outgoing FlowFiles that were all derived from the same result set FlowFile. This can be used in conjunction with the fragment.identifier attribute to know which FlowFiles originated from the same query result set and in what order FlowFiles were produced

  • input.flowfile.uuid: If the processor has an incoming connection, outgoing FlowFiles will have this attribute set to the value of the input FlowFile’s UUID. If there is no incoming connection, the attribute will not be added.

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer.

  • record.count: The number of records output by the Record Writer.

Input Requirement

This component allows an incoming relationship.

ExecuteStreamCommand

The ExecuteStreamCommand processor provides a flexible way to integrate external commands and scripts into NiFi data flows. ExecuteStreamCommand can pass the incoming FlowFile’s content to the command that it executes, similarly to how piping works.

Tags: command execution, command, stream, execute

Properties

Working Directory

The directory to use as the current working directory when executing the command

Command Path

Specifies the command to be executed; if just the name of an executable is provided, it must be in the user’s environment PATH.

Command Arguments Strategy

Strategy for configuring arguments to be supplied to the command.

Command Arguments

The arguments to supply to the executable delimited by the ';' character.

Argument Delimiter

Delimiter to use to separate arguments for a command [default: ;]. Must be a single character

Ignore STDIN

If true, the contents of the incoming flowfile will not be passed to the executing command

Output Destination Attribute

If set, the output of the stream command will be put into an attribute of the original FlowFile instead of a separate FlowFile. There will no longer be a relationship for 'output stream' or 'nonzero status'. The value of this property will be the key for the output attribute.

Max Attribute Length

If routing the output of the stream command to an attribute, the number of characters put into the attribute value will be at most this amount. This is important because attributes are held in memory and large attributes will quickly cause out-of-memory issues. If the output is longer than this value, it will be truncated to fit. Consider making this smaller if able.

Output MIME Type

Specifies the value to set for the "mime.type" attribute. This property is ignored if 'Output Destination Attribute' is set.

Dynamic Properties

An environment variable name

These environment variables are passed to the process spawned by this Processor

command.argument.<commandIndex>

These arguments are supplied to the process spawned by this Processor when using the Command Arguments Strategy : Dynamic Property Arguments. <commandIndex> is a number and it will determine the order.

Relationships

  • nonzero status: The destination path for the flow file created from the command’s output, if the returned status code is non-zero. All flow files routed to this relationship will be penalized.

  • original: The original FlowFile will be routed. It will have new attributes detailing the result of the script execution.

  • output stream: The destination path for the flow file created from the command’s output, if the returned status code is zero.

Writes Attributes

  • execution.command: The name of the command executed

  • execution.command.args: The semi-colon delimited list of arguments. Sensitive properties will be masked

  • execution.status: The exit status code returned from executing the command

  • execution.error: Any error messages returned from executing the command

  • mime.type: Sets the MIME type of the output if the 'Output MIME Type' property is set and 'Output Destination Attribute' is not set

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

Additional Details

Description

The ExecuteStreamCommand processor provides a flexible way to integrate external commands and scripts into NiFi data flows. ExecuteStreamCommand can pass the incoming FlowFile’s content to the command that it executes, similarly to how piping works.

Configuration options
Working Directory

If not specified, NiFi root will be the default working directory.

Configuring command arguments

The ExecuteStreamCommand processor provides two ways to specify command arguments: using Dynamic Properties and the Command Arguments field.

Command Arguments field

This is the default. If there are multiple arguments, they need to be separated by the character specified in the Argument Delimiter field. When needed, '-' and '--' can be provided, but in these cases the Argument Delimiter should be a different character.

Consider that we want to list all files in a directory which is different from the working directory:

Command Path: ls
Command Arguments: -lah;/path/to/dir

NOTE: the command should be on $PATH or it should be in the working directory, otherwise path also should be specified.

Dynamic Properties

Arguments can be specified with Dynamic Properties. Dynamic Properties whose names follow the pattern 'command.argument.<commandIndex>' will be appended to the command in ascending order of <commandIndex>.

The above example with dynamic properties would look like this:

Property Name Property Value

command.argument.0

-lah

command.argument.1

/path/to/dir

Configuring environment variables

In addition to specifying command arguments using the Command Arguments field or Dynamic Properties, users can also pass environment variables to the ExecuteStreamCommand processor. Environment variables are a set of key-value pairs that can be accessed by processes running on the system. ExecuteStreamCommand treats every Dynamic Property whose name does not match the 'command.argument.<commandIndex>' pattern as an environment variable.

Consider that we want to execute a Maven command with the processor. If there are multiple Java versions installed on the system, you can specify which version will be used by setting the JAVA_HOME environment variable. The output FlowFile will look like this if we run the mvn command with the --version argument:

Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
Maven home: /path/to/maven/home
Java version: 11.0.18, vendor: Eclipse Adoptium, runtime: /path/to/default/java/home
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "13.1", arch: "x86_64", family: "mac"

Property Name Property Value

JAVA_HOME

path/to/another/java/home

After specifying the JAVA_HOME property, notice that Maven is using a different runtime:

Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
Maven home: /path/to/maven/home
Java version: 11.0.18, vendor: Eclipse Adoptium, runtime: /path/to/another/java/home
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "13.1", arch: "x86_64", family: "mac"

Streaming input to the command

ExecuteStreamCommand passes the incoming FlowFile’s content to the command that it executes, similarly to how piping works. It is possible to disable this behavior with the Ignore STDIN property. In the above examples we didn’t use the incoming FlowFile’s content, so in those cases we could leverage this property for an additional performance gain.

To utilize the streaming capability, consider that we want to use grep on the FlowFile. Let’s presume that we need to list all POST requests from an Apache HTTPD log:

127.0.0.1 - - [03/May/2023:13:54:26 +0000] "GET /example-page HTTP/1.1" 200 4825 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
127.0.0.1 - - [03/May/2023:14:05:32 +0000] "POST /submit-form HTTP/1.1" 302 0 "http://localhost/example-page" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
127.0.0.1 - - [03/May/2023:14:10:48 +0000] "GET /image.jpg HTTP/1.1" 200 35785 "http://localhost/example-page" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
127.0.0.1 - - [03/May/2023:14:20:15 +0000] "GET /example-page HTTP/1.1" 200 4825 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
127.0.0.1 - - [03/May/2023:14:30:42 +0000] "GET /example-page HTTP/1.1" 200 4825 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"

Processor configuration:

Working Directory: (not set)
Command Path: grep
Command Arguments Strategy: Command Arguments Property
Command Arguments: POST
Argument Delimiter: ;
Ignore STDIN: false
Output Destination Attribute: (not set)
Max Attribute Length: 256

With this configuration, the FlowFile emitted on the “output stream” relationship should be:

127.0.0.1 - - [03/May/2023:14:05:32 +0000] "POST /submit-form HTTP/1.1" 302 0 "http://localhost/example-page" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"

ExitMetro

Retrieves FlowFiles from the "Metro Line", i.e. any queue connected to a PutMetro processor using the same Metro Line Controller as this processor. If no FlowFiles are found on the Metro Line, the processor yields to avoid overloading the system (see "Yield Duration" in Settings).

Tags: virtimo, metro

Properties

Metro Controller

The processor uses this controller’s Metro Line to connect with PutMetro processors.

Relationships

  • success: The FlowFile was successfully transferred via metro.

Input Requirement

This component does not allow an incoming relationship.

ExtractAvroMetadata

Extracts metadata from the header of an Avro datafile.

Tags: avro, schema, metadata

Properties

Fingerprint Algorithm

The algorithm used to generate the schema fingerprint. Available choices are based on the Avro recommended practices for fingerprint generation.

Metadata Keys

A comma-separated list of keys indicating key/value pairs to extract from the Avro file header. The key 'avro.schema' can be used to extract the full schema in JSON format, and 'avro.codec' can be used to extract the codec name if one exists.
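
For example (hypothetical value), setting Metadata Keys to 'avro.codec' for a datafile that was written with the deflate codec would typically add an attribute named avro.codec with the value deflate.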

Count Items

If true the number of items in the datafile will be counted and stored in a FlowFile attribute 'item.count'. The counting is done by reading blocks and getting the number of items for each block, thus avoiding de-serializing. The items being counted will be the top-level items in the datafile. For example, with a schema of type record the items will be the records, and for a schema of type Array the items will be the arrays (not the number of entries in each array).

Relationships

  • success: A FlowFile is routed to this relationship after metadata has been extracted.

  • failure: A FlowFile is routed to this relationship if it cannot be parsed as Avro or metadata cannot be extracted for any reason

Writes Attributes

  • schema.type: The type of the schema (i.e. record, enum, etc.).

  • schema.name: Contains the name when the type is a record, enum or fixed, otherwise contains the name of the primitive type.

  • schema.fingerprint: The result of the Fingerprint Algorithm as a Hex string.

  • item.count: The total number of items in the datafile, only written if Count Items is set to true.

Input Requirement

This component requires an incoming relationship.

ExtractDocumentText

Extract text contents from supported binary document formats using Apache Tika

Tags: extract, document, text

Properties

Relationships

  • failure: Content extraction failed

  • extracted: Success for extracted text FlowFiles

  • original: Success for original input FlowFiles

Input Requirement

This component allows an incoming relationship.

ExtractEmailAttachments

Extract attachments from a mime formatted email file, splitting them into individual flowfiles.

Tags: split, email

Properties

Relationships

  • failure: FlowFiles that could not be parsed

  • attachments: Each individual attachment will be routed to the attachments relationship

  • original: The original file

Writes Attributes

  • filename: The filename of the attachment

  • email.attachment.parent.filename: The filename of the parent FlowFile

  • email.attachment.parent.uuid: The UUID of the original FlowFile.

  • mime.type: The mime type of the attachment.

Input Requirement

This component requires an incoming relationship.

ExtractEmailHeaders

Using the flowfile content as the source of data, extracts headers from an RFC-compliant email file, adding the relevant attributes to the flowfile. This processor does not perform extensive RFC validation but still requires a bare minimum of compliance with RFC 2822.

Tags: split, email

Properties

Additional Header List

COLON-separated list of additional headers to be extracted from the flowfile content. NOTE: the header key is case insensitive and will be matched as lower-case. Values will respect email contents.

Email Address Parsing

If "strict", strict address format parsing rules are applied to mailbox and mailbox list fields, such as "to" and "from" headers, and FlowFiles with poorly formed addresses will be routed to the failure relationship, similar to messages that fail RFC compliant format validation. If "non-strict", the processor will extract the contents of mailbox list headers as comma-separated values without attempting to parse each value as well-formed Internet mailbox addresses. This is optional and defaults to Strict Address Parsing

Relationships

  • success: Extraction was successful

  • failure: Flowfiles that could not be parsed as a RFC-2822 compliant message

Writes Attributes

  • email.headers.bcc.*: Each individual BCC recipient (if available)

  • email.headers.cc.*: Each individual CC recipient (if available)

  • email.headers.from.*: Each individual mailbox contained in the From of the Email (array as per RFC-2822)

  • email.headers.message-id: The value of the Message-ID header (if available)

  • email.headers.received_date: The Received-Date of the message (if available)

  • email.headers.sent_date: Date the message was sent

  • email.headers.subject: Subject of the message (if available)

  • email.headers.to.*: Each individual TO recipient (if available)

  • email.attachment_count: Number of attachments of the message

Input Requirement

This component requires an incoming relationship.

ExtractGrok

Evaluates one or more Grok Expressions against the content of a FlowFile, adding the results as attributes or replacing the content of the FlowFile with a JSON notation of the matched content

Tags: grok, log, text, parse, delimit, extract

Properties

Grok Expression

Grok expression. If other Grok expressions are referenced in this expression, they must be provided in the Grok Patterns property (if set) or exist in the default Grok patterns.
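
For example (an illustrative expression built from patterns in the default Grok pattern set), the Grok Expression %{IP:client} %{WORD:method} %{URIPATHPARAM:request} evaluated against the content '55.3.244.1 GET /index.html' would, when writing to attributes, add grok.client = 55.3.244.1, grok.method = GET and grok.request = /index.html.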

Grok Patterns

Custom Grok pattern definitions. These definitions will be loaded after the default Grok patterns. The Grok Parser will use the default Grok patterns when this property is not configured.

Destination

Controls whether the Grok output is written to new flowfile attributes or to the flowfile content. When writing to attributes, each Grok identifier that is matched in the flowfile will be added as an attribute, prefixed with "grok.". Writing to flowfile content will overwrite any existing flowfile content.

Character Set

The Character Set in which the file is encoded

Maximum Buffer Size

Specifies the maximum amount of data to buffer (per file) in order to apply the Grok expressions. Files larger than the specified maximum will not be fully evaluated.

Named captures only

Only store named captures from grok

Keep Empty Captures

If true, then empty capture values will be included in the returned capture map.

Relationships

  • matched: FlowFiles are routed to this relationship when the Grok Expression is successfully evaluated and the FlowFile is modified as a result

  • unmatched: FlowFiles are routed to this relationship when no provided Grok Expression matches the content of the FlowFile

Writes Attributes

  • grok.XXX: When operating in flowfile-attribute mode, each Grok identifier that is matched in the flowfile will be added as an attribute, prefixed with "grok." For example, if the grok identifier "timestamp" is matched, then the value will be added to an attribute named "grok.timestamp"

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

ExtractHL7Attributes

Extracts information from an HL7 (Health Level 7) formatted FlowFile and adds the information as FlowFile Attributes. The attributes are named as <Segment Name> <dot> <Field Index>. If the segment is repeating, the naming will be <Segment Name> <underscore> <Segment Index> <dot> <Field Index>. For example, we may have an attribute named "MSH.12" with a value of "2.1" and an attribute named "OBX_11.3" with a value of "93000^CPT4".

Tags: HL7, health level 7, healthcare, extract, attributes

Properties

Character Encoding

The Character Encoding that is used to encode the HL7 data

Use Segment Names

Whether or not to use HL7 segment names in attributes

Parse Segment Fields

Whether or not to parse HL7 segment fields into attributes

Skip Validation

Whether or not to validate HL7 message values

HL7 Input Version

The HL7 version to use for parsing and validation

Relationships

  • success: A FlowFile is routed to this relationship if it is properly parsed as HL7 and its attributes extracted

  • failure: A FlowFile is routed to this relationship if it cannot be mapped to FlowFile Attributes. This would happen if the FlowFile does not contain valid HL7 data

Input Requirement

This component requires an incoming relationship.

ExtractImageMetadata

Extract the image metadata from flowfiles containing images. This processor relies on the metadata-extractor library (https://github.com/drewnoakes/metadata-extractor). It extracts a long list of metadata types, including but not limited to EXIF, IPTC, XMP and Photoshop fields. For the full list, visit the library’s website. NOTE: The library being used loads the images into memory, so extremely large images may cause problems.

Tags: Exif, Exchangeable, image, file, format, JPG, GIF, PNG, BMP, metadata, IPTC, XMP

Properties

Max Number of Attributes

Specify the max number of attributes to add to the flowfile. There is no guarantee in what order the tags will be processed. By default it will process all of them.

Relationships

  • success: Any FlowFile that successfully has image metadata extracted will be routed to success

  • failure: Any FlowFile that fails to have image metadata extracted will be routed to failure

Writes Attributes

  • <directory name>.<tag name>: The extracted image metadata will be inserted with the attribute name "<directory name>.<tag name>".

Input Requirement

This component requires an incoming relationship.

ExtractMediaMetadata

Extract the content metadata from flowfiles containing audio, video, image, and other file types. This processor relies on the Apache Tika project for file format detection and parsing. It extracts a long list of metadata types for media files, including audio, video, and print media formats. NOTE: the attribute names and content extracted may vary across upgrades because parsing is performed by the external Tika tools, which in turn depend on other projects for metadata extraction. For more details and the list of supported file types, visit the library’s website at http://tika.apache.org/.

Tags: media, file, format, metadata, audio, video, image, document, pdf

Properties

Max Number of Attributes

Specify the max number of attributes to add to the flowfile. There is no guarantee in what order the tags will be processed. By default it will process all of them.

Max Attribute Length

Specifies the maximum length of a single attribute value. When a metadata item has multiple values, they will be merged until this length is reached and then ", …" will be appended as an indicator that additional values were dropped. If a single value is longer than this, it will be truncated and "(truncated)" appended to indicate that truncation occurred.

Metadata Key Filter

A regular expression identifying which metadata keys received from the parser should be added to the flowfile attributes. If left blank, all metadata keys parsed will be added to the flowfile attributes.

Metadata Key Prefix

Text to be prefixed to metadata keys as they are added to the flowfile attributes. It is recommended to end with a separator character like '.' or '-'; this is not automatically added by the processor.
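
For example (hypothetical key), with Metadata Key Prefix set to 'media.', a Tika metadata key such as Content-Type would be added as the attribute media.Content-Type.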

Relationships

  • success: Any FlowFile that successfully has media metadata extracted will be routed to success

  • failure: Any FlowFile that fails to have media metadata extracted will be routed to failure

Writes Attributes

  • <Metadata Key Prefix><attribute>: The extracted content metadata will be inserted with the attribute name "<Metadata Key Prefix><attribute>", or "<attribute>" if "Metadata Key Prefix" is not provided.

Input Requirement

This component requires an incoming relationship.

ExtractRecordSchema

Extracts the record schema from the FlowFile using the supplied Record Reader and writes it to the avro.schema attribute.

Tags: record, generic, schema, json, csv, avro, freeform, text, xml

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Schema Cache Size

Specifies the number of schemas to cache. This value should reflect the expected number of different schemas that may be in the incoming FlowFiles. This ensures more efficient retrieval of the schemas and thus improves processor performance.

Relationships

  • success: FlowFiles whose record schemas are successfully extracted will be routed to this relationship

  • failure: If a FlowFile’s record schema cannot be extracted from the configured input format, the FlowFile will be routed to this relationship

Writes Attributes

  • record.error.message: This attribute provides on failure the error message encountered by the Reader.

  • avro.schema: This attribute provides the schema extracted from the input FlowFile using the provided RecordReader.

Input Requirement

This component requires an incoming relationship.

ExtractText

Evaluates one or more Regular Expressions against the content of a FlowFile. The results of those Regular Expressions are assigned to FlowFile Attributes. Regular Expressions are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed. The attributes are generated differently depending on whether named capture groups are enabled. If named capture groups are not enabled: the first capture group, if any is found, will be placed into that attribute name. In addition, all capture groups, including the matching string sequence itself, will be provided at that attribute name with an index value appended, with the exception of a capturing group that is optional and does not match. For example, given the attribute name "regex" and expression "abc(def)?(g)", we would add an attribute "regex.1" with a value of "def" if "def" matched. If "def" did not match, no attribute named "regex.1" would be added, but an attribute named "regex.2" with a value of "g" would be added regardless. If named capture groups are enabled: each named capture group, if found, will be placed into an attribute named using the property name and the provided group name. If enabled, the matching string sequence itself will also be placed into the attribute name. If multiple matches are enabled, an index will be applied after the first set of matches. The exception is a capturing group that is optional and does not match. For example, given the attribute name "regex" and expression "abc(?<NAMED>def)?(?<NAMED-TWO>g)", we would add an attribute "regex.NAMED" with the value of "def" if "def" matched, and an attribute "regex.NAMED-TWO" with the value of "g" if "g" matched, regardless. The value of the property must be a valid Regular Expression with one or more capturing groups. If named capture groups are enabled, all capture groups must be named; if they are not, the processor configuration will fail validation. If the Regular Expression matches more than once, only the first match will be used unless the property enabling repeating capture groups is set to true. If any provided Regular Expression matches, the FlowFile(s) will be routed to 'matched'. If no provided Regular Expression matches, the FlowFile will be routed to 'unmatched' and no attributes will be applied to the FlowFile.

Tags: evaluate, extract, Text, Regular Expression, regex

Properties

Character Set

The Character Set in which the file is encoded

Maximum Buffer Size

Specifies the maximum amount of data to buffer (per FlowFile) in order to apply the regular expressions. FlowFiles larger than the specified maximum will not be fully evaluated.

Maximum Capture Group Length

Specifies the maximum number of characters a given capture group value can have. Any characters beyond the max will be truncated.

Enable Canonical Equivalence

Indicates that two characters match only when their full canonical decompositions match.

Enable Case-insensitive Matching

Indicates that two characters match even if they are in a different case. Can also be specified via the embedded flag (?i).

Permit Whitespace and Comments in Pattern

In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line. Can also be specified via the embedded flag (?x).

Enable DOTALL Mode

Indicates that the expression '.' should match any character, including a line terminator. Can also be specified via the embedded flag (?s).

Enable Literal Parsing of the Pattern

Indicates that Metacharacters and escape characters should be given no special meaning.

Enable Multiline Mode

Indicates that '^' and '$' should match just after and just before a line terminator or end of sequence, instead of only the beginning or end of the entire input. Can also be specified via the embedded flag (?m).

Enable Unicode-aware Case Folding

When used with 'Enable Case-insensitive Matching', matches in a manner consistent with the Unicode Standard. Can also be specified via the embedded flag (?u).

Enable Unicode Predefined Character Classes

Specifies conformance with the Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties. Can also be specified via the embedded flag (?U).

Enable Unix Lines Mode

Indicates that only the '\n' line terminator is recognized in the behavior of '.', '^', and '$'. Can also be specified via the embedded flag (?d).

Include Capture Group 0

Indicates that Capture Group 0 should be included as an attribute. Capture Group 0 represents the entirety of the regular expression match, is typically not used, and could have considerable length.

Enable repeating capture group

If set to true, every string matching the capture groups will be extracted. Otherwise, if the Regular Expression matches more than once, only the first match will be extracted.

Enable named group support

If set to true, when named groups are present in the regular expression, the name of the group will be used in the attribute name as opposed to the group index. All capturing groups must be named; if the number of groups (not including capture group 0) does not equal the number of named groups, validation will fail.

Dynamic Properties

A FlowFile attribute

The first capture group, if any is found, will be placed into that attribute name. In addition, all capture groups, including the matching string sequence itself, will also be provided at that attribute name with an index value appended.

Relationships

  • matched: FlowFiles are routed to this relationship when the Regular Expression is successfully evaluated and the FlowFile is modified as a result

  • unmatched: FlowFiles are routed to this relationship when no provided Regular Expression matches the content of the FlowFile

Input Requirement

This component requires an incoming relationship.

Additional Details

Usage Information

The Extract Text processor provides different results based on whether named capture groups are enabled.

Example

Here is a like-for-like example that illustrates this.

Data

    `foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n`
Without named capture groups

Configuration

Property Name Property Value

regex.result1

(?s)(.*)

regex.result2

(?s).*(bar1).*

regex.result3

(?s).*?(bar\d).*

regex.result4

(?s).*?(?:bar\d).*?(bar\d).*?(bar3).*

regex.result5

(?s).*(bar\d).*

regex.result6

(?s)^(.*)$

regex.result7

(?s)(XXX)

Results

Attribute Name Attribute Value

regex.result1

foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n

regex.result2

bar1

regex.result3

bar1

regex.result4

bar2

regex.result4.0

foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n

regex.result4.1

bar2

regex.result4.2

bar3

regex.result5

bar3

regex.result6

foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n

regex.result7

With named capture groups

Configuration

Property Name Property Value

Enable named group support

True

regex.result1

(?s)(?<ALL>.*)

regex.result2

(?s).*(?<BAR1>bar1).*

regex.result3

(?s).*?(?<BAR1>bar\d).*

regex.result4

(?s).*?(?:bar\d).*?(?<BAR2>bar\d).*?(?<BAR3>bar3).*

regex.result5

(?s).*(?<BAR3>bar\d).*

regex.result6

(?s)^(?<ALL>.*)$

regex.result7

(?s)(?<MISS>XXX)

Results

Attribute Name Attribute Value

regex.result1

foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n

regex.result2.BAR1

bar1

regex.result3.BAR1

bar1

regex.result4.BAR2

bar2

regex.result4.BAR2

bar2

regex.result4.BAR3

bar3

regex.result5.BAR3

bar3

regex.result6.ALL

foo\r\nbar1\r\nbar2\r\nbar3\r\nhello\r\nworld\r\n

regex.result7.MISS

FetchAzureBlobStorage_v12

Retrieves the specified blob from Azure Blob Storage and writes its content to the content of the FlowFile. The processor uses Azure Blob Storage client library v12.

Multi-Processor Use Cases

Retrieve all files in an Azure Blob Storage container

Keywords: azure, blob, storage, state, retrieve, fetch, all, stream

ListAzureBlobStorage_v12:

  1. The "Container Name" property should be set to the name of the Blob Storage Container that files reside in. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{AZURE_CONTAINER}. .

  2. The "Storage Credentials" property should specify an instance of the AzureStorageCredentialsService_v12 in order to provide credentials for accessing the storage container. .

  3. The 'success' Relationship of this Processor is then connected to FetchAzureBlobStorage_v12.

FetchAzureBlobStorage_v12:

  1. "Container Name" = "${azure.container}"

  2. "Blob Name" = "${azure.blobname}" .

  3. The "Storage Credentials" property should specify an instance of the AzureStorageCredentialsService_v12 in order to provide credentials for accessing the storage container. .

Tags: azure, microsoft, cloud, storage, blob

Properties

Storage Credentials

Controller Service used to obtain Azure Blob Storage Credentials.

Container Name

Name of the Azure storage container. In the case of the PutAzureBlobStorage processor, the container can be created if it does not exist.

Blob Name

The full name of the blob

Range Start

The byte position at which to start reading from the blob. An empty value or a value of zero will start reading at the beginning of the blob.

Range Length

The number of bytes to download from the blob, starting from the Range Start. An empty value or a value that extends beyond the end of the blob will read to the end of the blob.
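
For example (arbitrary numbers), a Range Start of 1024 with a Range Length of 2048 downloads bytes 1024 through 3071 of the blob, while a Range Start of 1024 with no Range Length downloads everything from byte 1024 to the end.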

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In the case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Client-Side Encryption Key Type

Specifies the key type to use for client-side encryption.

Client-Side Encryption Key ID

Specifies the ID of the key to use for client-side encryption.

Client-Side Encryption Local Key

When using local client-side encryption, this is the raw key, encoded in hexadecimal

Relationships

  • success: All successfully processed FlowFiles are routed to this relationship

  • failure: Unsuccessful operations will be transferred to the failure relationship.

Writes Attributes

  • azure.container: The name of the Azure Blob Storage container

  • azure.blobname: The name of the blob on Azure Blob Storage

  • azure.primaryUri: Primary location of the blob

  • azure.etag: ETag of the blob

  • azure.blobtype: Type of the blob (either BlockBlob, PageBlob or AppendBlob)

  • mime.type: MIME Type of the content

  • lang: Language code for the content

  • azure.timestamp: Timestamp of the blob

  • azure.length: Length of the blob

Input Requirement

This component requires an incoming relationship.

FetchAzureDataLakeStorage

Fetch the specified file from Azure Data Lake Storage

Multi-Processor Use Cases

Retrieve all files in an Azure DataLake Storage directory

Keywords: azure, datalake, adls, state, retrieve, fetch, all, stream

ListAzureDataLakeStorage:

  1. The "Filesystem Name" property should be set to the name of the Azure Filesystem (also known as a Container) that files reside in. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{AZURE_FILESYSTEM}.

  2. Configure the "Directory Name" property to specify the name of the directory in the file system. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{AZURE_DIRECTORY}. .

  3. The "ADLS Credentials" property should specify an instance of the ADLSCredentialsService in order to provide credentials for accessing the filesystem. .

  4. The 'success' Relationship of this Processor is then connected to FetchAzureDataLakeStorage.

FetchAzureDataLakeStorage:

  1. "Filesystem Name" = "${azure.filesystem}"

  2. "Directory Name" = "${azure.directory}"

  3. "File Name" = "${azure.filename}" .

  4. The "ADLS Credentials" property should specify an instance of the ADLSCredentialsService in order to provide credentials for accessing the filesystem. .

Tags: azure, microsoft, cloud, storage, adlsgen2, datalake

Properties

ADLS Credentials

Controller Service used to obtain Azure Credentials.

Filesystem Name

Name of the Azure Storage File System (also called Container). It is assumed to be already existing.

Directory Name

Name of the Azure Storage Directory. The Directory Name cannot contain a leading '/'. The root directory can be designated by the empty string value. In case of the PutAzureDataLakeStorage processor, the directory will be created if not already existing.

File Name

The filename

Range Start

The byte position at which to start reading from the object. An empty value or a value of zero will start reading at the beginning of the object.

Range Length

The number of bytes to download from the object, starting from the Range Start. An empty value or a value that extends beyond the end of the object will read to the end of the object.

Number of Retries

The number of automatic retries to perform if the download fails.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In the case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Relationships

  • success: Files that have been successfully fetched from Azure storage are transferred to this relationship

  • failure: Files that could not be fetched from Azure storage for some reason are transferred to this relationship

Writes Attributes

  • azure.datalake.storage.statusCode: The HTTP error code (if available) from the failed operation

  • azure.datalake.storage.errorCode: The Azure Data Lake Storage moniker of the failed operation

  • azure.datalake.storage.errorMessage: The Azure Data Lake Storage error message from the failed operation

Input Requirement

This component requires an incoming relationship.

FetchBoxFile

Fetches files from a Box Folder. Designed to be used in tandem with ListBoxFile.

Tags: box, storage, fetch

Properties

Box Client Service

Controller Service used to obtain a Box API connection.

File ID

The ID of the File to fetch

Relationships

  • success: A FlowFile will be routed here for each successfully fetched File.

  • failure: A FlowFile will be routed here for each File for which fetch was attempted but failed.

Reads Attributes

  • box.id: The id of the file

Writes Attributes

  • box.id: The id of the file

  • filename: The name of the file

  • path: The folder path where the file is located

  • box.size: The size of the file

  • box.timestamp: The last modified time of the file

  • error.code: The error code returned by Box

  • error.message: The error message returned by Box

Input Requirement

This component requires an incoming relationship.

Additional Details

Fetch Box files in NiFi
  1. Find File ID
    Usually FetchBoxFile is used with ListBoxFile and ‘box.id’ is set.

    In case ‘box.id’ is not available, you can find the ID of the file in the following way:

    • Click on the file.

    • The URL in the browser will include the File ID. For example, if the URL were https://app.box.com/file/1012106094023?s=ldiqjwuor2vwdxeeap2rtcz66dql89h3, the File ID would be 1012106094023 (see the sketch after this list).

  2. Set File ID in ‘File ID’ property
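
A minimal sketch of extracting the File ID from such a URL programmatically; the regular expression and the sample URL below are illustrative only, not part of the processor.

    # Extract a Box File ID from a shared-file URL such as
    # https://app.box.com/file/1012106094023?s=...
    import re

    def box_file_id(url):
        match = re.search(r"/file/(\d+)", url)
        return match.group(1) if match else None

    print(box_file_id(
        "https://app.box.com/file/1012106094023?s=ldiqjwuor2vwdxeeap2rtcz66dql89h3"
    ))  # -> 1012106094023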

FetchBPCProcessLog

Fetches data from a Virtimo Business Process Center (BPC) log service. The data is written to the incoming FlowFile.

Tags: virtimo, bpc

Properties

BPC Controller

Controller used to define the connection to the BPC. The API-Key used by the controller requires 'LOG_SERVICE_READ_DATA'-Rights.

BPC Logger

Select the logger from available ones.

BPC Logger ID

The ID of the logger (i.e. the Component ID of the Log Service in BPC).

Parent ID

The ID of the log entry to fetch (in most cases the 'PROCESSID'-value). If not set, all entries from the selected logger are fetched. Multiple IDs can be set separated by comma.

Child ID

The ID of the child log entry to fetch (in most cases the 'CHILDID'-value). If not set, all child log entries along with the parent log entry are fetched. Multiple IDs can be set separated by comma. Must not be set if multiple Parent IDs are set.

Relationships

  • success: If the request was successfully processed by the BPC, the FlowFile is routed here.

  • failure: If there was an error, the FlowFile is routed here.

Writes Attributes

  • bpc.status.code: The response code from BPC.

Input Requirement

This component requires an incoming relationship.

FetchDistributedMapCache

Computes cache key(s) from FlowFile attributes, for each incoming FlowFile, and fetches the value(s) from the Distributed Map Cache associated with each key. If configured without a destination attribute, the incoming FlowFile’s content is replaced with the binary data received from the Distributed Map Cache. If there is no value stored under that key then the FlowFile will be routed to 'not-found'. Note that the processor will always attempt to read the entire cached value into memory before placing it in its destination. This can be problematic if the cached value is very large.
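
The multi-identifier lookup and its attribute-naming convention (described under 'Put Cache Value In Attribute' and 'Writes Attributes' below) can be illustrated with a small sketch. A plain Python dictionary stands in for the Distributed Map Cache service; the attribute names and cached values are hypothetical.

    # Illustration of the multi-identifier lookup and attribute-naming convention.
    # A plain dict stands in for the Distributed Map Cache controller service.
    cache = {"1234": "value cached under the id", "alice": "value cached under the name"}

    flowfile_attributes = {"id": "1234", "name": "alice"}
    cache_entry_identifiers = ["id", "name"]   # "Cache Entry Identifier" = "id,name"
    destination_attribute = "fetched"          # "Put Cache Value In Attribute"

    for identifier in cache_entry_identifiers:
        key = flowfile_attributes[identifier]   # evaluated against the FlowFile
        value = cache.get(key)                  # a miss would route to 'not-found'
        # With multiple identifiers, each result lands in "<destination>.<identifier>".
        flowfile_attributes[f"{destination_attribute}.{identifier}"] = value

    print(flowfile_attributes)   # now contains 'fetched.id' and 'fetched.name'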

Tags: map, cache, fetch, distributed

Properties

Cache Entry Identifier

A comma-delimited list of FlowFile attributes, or the results of Attribute Expression Language statements, which will be evaluated against a FlowFile in order to determine the value(s) used to identify duplicates; it is these values that are cached. NOTE: Only a single Cache Entry Identifier is allowed unless Put Cache Value In Attribute is specified. Multiple cache lookups are only supported when the destination is a set of attributes (see the documentation for 'Put Cache Value In Attribute' for more details, including the naming convention).

Distributed Cache Service

The Controller Service that is used to get the cached values.

Put Cache Value In Attribute

If set, the cache value received will be put into an attribute of the FlowFile instead of the content of the FlowFile. The attribute key to put to is determined by evaluating the value of this property. If multiple Cache Entry Identifiers are selected, multiple attributes will be written, using the evaluated value of this property, appended by a period (.) and the name of the cache entry identifier.

Max Length To Put In Attribute

If routing the cache value to an attribute of the FlowFile (by setting the "Put Cache Value in attribute" property), the number of characters put to the attribute value will be at most this amount. This is important because attributes are held in memory and large attributes will quickly cause out of memory issues. If the cached value is longer than this limit, it will be truncated to fit. Consider making this smaller if able.

Character Set

The Character Set in which the cached value is encoded. This will only be used when routing to an attribute.

Relationships

  • success: If the cache was successfully communicated with it will be routed to this relationship

  • failure: If unable to communicate with the cache or if the cache entry is evaluated to be blank, the FlowFile will be penalized and routed to this relationship

  • not-found: If a FlowFile’s Cache Entry Identifier was not found in the cache, it will be routed to this relationship

Writes Attributes

  • user-defined: If the 'Put Cache Value In Attribute' property is set then whatever it is set to will become the attribute key and the value would be whatever the response was from the Distributed Map Cache. If multiple cache entry identifiers are selected, multiple attributes will be written, using the evaluated value of this property, appended by a period (.) and the name of the cache entry identifier. For example, if the Cache Entry Identifier property is set to 'id,name', and the user-defined property is named 'fetched', then two attributes will be written, fetched.id and fetched.name, containing their respective values.

Input Requirement

This component requires an incoming relationship.

FetchDropbox

Fetches files from Dropbox. Designed to be used in tandem with ListDropbox.

Tags: dropbox, storage, fetch

Properties

Dropbox Credential Service

Controller Service used to obtain Dropbox credentials (App Key, App Secret, Access Token, Refresh Token). See controller service’s Additional Details for more information.

File

The Dropbox identifier or path of the Dropbox file to fetch. The 'File' should match the regular expression pattern /.*|id:.* (i.e. either a path starting with '/' or an identifier starting with 'id:'). When ListDropbox is used for input, either '${dropbox.id}' (identifying files by Dropbox id) or '${path}/${filename}' (identifying files by path) can be used as the 'File' value.

Proxy Configuration Service

Relationships

  • success: A FlowFile will be routed here for each successfully fetched File.

  • failure: A FlowFile will be routed here for each File for which fetch was attempted but failed.

Writes Attributes

  • error.message: The error message returned by Dropbox

  • dropbox.id: The Dropbox identifier of the file

  • path: The folder path where the file is located

  • filename: The name of the file

  • dropbox.size: The size of the file

  • dropbox.timestamp: The server modified time of the file

  • dropbox.revision: Revision of the file

Input Requirement

This component requires an incoming relationship.

FetchFile

Reads the contents of a file from disk and streams it into the contents of an incoming FlowFile. Once this is done, the file is optionally moved elsewhere or deleted to help keep the file system organized.
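
A minimal sketch of the fetch-then-complete behaviour on a local filesystem; the paths and the chosen Completion Strategy below are placeholders, and this is not how the processor itself is implemented.

    # Sketch of "fetch file, then apply a completion strategy" on a local filesystem.
    import os
    import shutil

    file_to_fetch = "/data/incoming/report.csv"     # placeholder path
    completion_strategy = "Move File"                # or "Delete File" / "None"
    move_destination_directory = "/data/archive"     # used only for "Move File"

    with open(file_to_fetch, "rb") as source:
        content = source.read()                      # becomes the FlowFile content

    if completion_strategy == "Move File":
        os.makedirs(move_destination_directory, exist_ok=True)  # created if missing
        shutil.move(file_to_fetch, move_destination_directory)
    elif completion_strategy == "Delete File":
        os.remove(file_to_fetch)
    # "None" leaves the original file in place.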

Multi-Processor Use Cases

Ingest all files from a directory into NiFi

Keywords: local, files, filesystem, ingest, ingress, get, source, input, fetch

ListFile:

  1. Configure the "Input Directory" property to point to the directory that you want to ingest files from.

  2. Set the "Input Directory Location" property to "Local"

  3. Optionally, set "Minimum File Age" to a small value such as "1 min" to avoid ingesting files that are still being written to.

  4. Connect the 'success' Relationship to the FetchFile processor.

FetchFile:

  1. Set the "File to Fetch" property to ${absolute.path}/${filename}

  2. Set the "Completion Strategy" property to None.

Ingest specific files from a directory into NiFi, filtering on filename

Keywords: local, files, filesystem, ingest, ingress, get, source, input, fetch, filter

ListFile:

  1. Configure the "Input Directory" property to point to the directory that you want to ingest files from.

  2. Set the "Input Directory Location" property to "Local"

  3. Set the "File Filter" property to a Regular Expression that matches the filename (without path) of the files that you want to ingest. For example, to ingest all .jpg files, set the value to .*\.jpg

  4. Optionally, set "Minimum File Age" to a small value such as "1 min" to avoid ingesting files that are still being written to.

  5. Connect the 'success' Relationship to the FetchFile processor.

FetchFile:

  1. Set the "File to Fetch" property to ${absolute.path}/${filename}

  2. Set the "Completion Strategy" property to None.

Tags: local, files, filesystem, ingest, ingress, get, source, input, fetch

Properties

File to Fetch

The fully-qualified filename of the file to fetch from the file system

Completion Strategy

Specifies what to do with the original file on the file system once it has been pulled into NiFi

Move Destination Directory

The directory to move the original file to once it has been fetched from the file system. This property is ignored unless the Completion Strategy is set to "Move File". If the directory does not exist, it will be created.

Move Conflict Strategy

If Completion Strategy is set to Move File and a file already exists in the destination directory with the same name, this property specifies how that naming conflict should be resolved

Log level when file not found

Log level to use in case the file does not exist when the processor is triggered

Log level when permission denied

Log level to use in case the user running NiFi does not have sufficient permissions to read the file

Relationships

  • success: Any FlowFile that is successfully fetched from the file system will be transferred to this Relationship.

  • failure: Any FlowFile that could not be fetched from the file system for any reason other than insufficient permissions or the file not existing will be transferred to this Relationship.

  • not.found: Any FlowFile that could not be fetched from the file system because the file could not be found will be transferred to this Relationship.

  • permission.denied: Any FlowFile that could not be fetched from the file system due to the user running NiFi not having sufficient permissions will be transferred to this Relationship.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

FetchFTP

Fetches the content of a file from a remote FTP server and overwrites the contents of an incoming FlowFile with the content of the remote file.
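
As a point of reference, the same fetch-and-complete pattern can be reproduced with Python's standard ftplib; the hostname, credentials, and paths below are placeholders, and the sketch is not the processor's implementation.

    # Sketch: fetch a remote file over FTP, then apply a "Move File" completion step.
    import io
    from ftplib import FTP

    host, port = "ftp.example.com", 21                   # placeholders
    remote_file = "/inbound/data.csv"
    move_destination = "/processed/data.csv"

    ftp = FTP()
    ftp.connect(host, port, timeout=30)                   # "Connection Timeout"
    ftp.login(user="nifi", passwd="secret")               # placeholders
    buffer = io.BytesIO()
    ftp.retrbinary(f"RETR {remote_file}", buffer.write)   # content -> FlowFile content

    ftp.rename(remote_file, move_destination)             # "Completion Strategy" = Move File
    # ftp.delete(remote_file) would correspond to a Delete File strategy instead.
    ftp.quit()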

Multi-Processor Use Cases

Retrieve all files in a directory of an FTP Server

Keywords: ftp, file, transform, state, retrieve, fetch, all, stream

ListFTP:

  1. The "Hostname" property should be set to the fully qualified hostname of the FTP Server. It’s a good idea to parameterize this property by setting it to something like #{FTP_SERVER}.

  2. The "Remote Path" property must be set to the directory on the FTP Server where the files reside. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{FTP_REMOTE_PATH}.

  3. Configure the "Username" property to the appropriate username for logging into the FTP Server. It’s usually a good idea to parameterize this property by setting it to something like #{FTP_USERNAME}.

  4. Configure the "Password" property to the appropriate password for the provided username. It’s usually a good idea to parameterize this property by setting it to something like #{FTP_PASSWORD}.

  5. The 'success' Relationship of this Processor is then connected to FetchFTP.

FetchFTP:

  1. "Hostname" = "${ftp.remote.host}"

  2. "Remote File" = "${path}/${filename}"

  3. "Username" = "${ftp.listing.user}"

  4. "Password" = "#{FTP_PASSWORD}"

Tags: ftp, get, retrieve, files, fetch, remote, ingest, source, input

Properties

Hostname

The fully-qualified hostname or IP address of the host to fetch the data from

Port

The port to connect to on the remote host to fetch the data from

Username

Username

Password

Password for the user account

Remote File

The fully qualified filename on the remote system

Completion Strategy

Specifies what to do with the original file on the server once it has been pulled into NiFi. If the Completion Strategy fails, a warning will be logged but the data will still be transferred.

Move Destination Directory

The directory on the remote server to move the original file to once it has been ingested into NiFi. This property is ignored unless the Completion Strategy is set to 'Move File'. The specified directory must already exist on the remote system if 'Create Directory' is disabled, or the rename will fail.

Create Directory

Used when 'Completion Strategy' is 'Move File'. Specifies whether or not the remote directory should be created if it does not exist.

Connection Timeout

Amount of time to wait before timing out while creating a connection

Data Timeout

When transferring a file between the local and remote system, this value specifies how long is allowed to elapse without any data being transferred between systems

Use Compression

Indicates whether or not ZLIB compression should be used when transferring files

Connection Mode

The FTP Connection Mode

Transfer Mode

The FTP Transfer Mode

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN, SOCKS + AuthN

Internal Buffer Size

Set the internal buffer size for buffered data streams

Log level when file not found

Log level to use in case the file does not exist when the processor is triggered

Use UTF-8 Encoding

Tells the client to use UTF-8 encoding when processing files and filenames. If set to true, the server must also support UTF-8 encoding.

Relationships

  • success: All FlowFiles that are received are routed to success

  • comms.failure: Any FlowFile that could not be fetched from the remote server due to a communications failure will be transferred to this Relationship.

  • not.found: Any FlowFile for which we receive a 'Not Found' message from the remote server will be transferred to this Relationship.

  • permission.denied: Any FlowFile that could not be fetched from the remote server due to insufficient permissions will be transferred to this Relationship.

Writes Attributes

  • ftp.remote.host: The hostname or IP address from which the file was pulled

  • ftp.remote.port: The port that was used to communicate with the remote FTP server

  • ftp.remote.filename: The name of the remote file that was pulled

  • filename: The filename is updated to point to the filename of the remote file

  • path: If the Remote File contains a directory name, that directory name will be added to the FlowFile using the 'path' attribute

  • fetch.failure.reason: The name of the failure relationship applied when routing to any failure relationship

Input Requirement

This component requires an incoming relationship.

FetchGCSObject

Fetches a file from a Google Cloud Bucket. Designed to be used in tandem with ListGCSBucket.
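
For reference, an equivalent fetch outside NiFi can be sketched with the google-cloud-storage client; the project, bucket, and object names are placeholders, and the sketch is not the processor's implementation.

    # Sketch: fetch one object from a GCS bucket with the google-cloud-storage client.
    from google.cloud import storage

    client = storage.Client(project="my-project")     # "Project ID" (placeholder)
    bucket = client.bucket("my-source-bucket")        # "Bucket"
    blob = bucket.blob("path/to/object.csv")          # "Key"

    # download_as_bytes also accepts start/end byte offsets, loosely analogous
    # to the Range Start / Range Length properties.
    data = blob.download_as_bytes()
    print(f"fetched {len(data)} bytes")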

Multi-Processor Use Cases

Retrieve all files in a Google Compute Storage (GCS) bucket

Keywords: gcp, gcs, google cloud, google compute storage, state, retrieve, fetch, all, stream

ListGCSBucket:

  1. The "Bucket" property should be set to the name of the GCS bucket that files reside in. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{GCS_SOURCE_BUCKET}.

  2. Configure the "Project ID" property to reflect the ID of your Google Compute Cloud Project.

  3. The "GCP Credentials Provider Service" property should specify an instance of the GCPCredentialsService in order to provide credentials for accessing the bucket.

  4. The 'success' Relationship of this Processor is then connected to FetchGCSObject.

FetchGCSObject:

  1. "Bucket" = "${gcs.bucket}"

  2. "Name" = "${filename}"

  3. The "GCP Credentials Provider Service" property should specify an instance of the GCPCredentialsService in order to provide credentials for accessing the bucket.

Tags: google cloud, google, storage, gcs, fetch

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Project ID

Google Cloud Project ID

Bucket

Bucket of the object.

Key

Name of the object.

Object Generation

The generation of the Object to download. If not set, the latest generation will be downloaded.

Server Side Encryption Key

The AES256 key (encoded in base64) with which the object has been encrypted.

Range Start

The byte position at which to start reading from the object. An empty value or a value of zero will start reading at the beginning of the object.

Range Length

The number of bytes to download from the object, starting from the Range Start. An empty value or a value that extends beyond the end of the object will read to the end of the object.

Number of retries

How many retry attempts should be made before routing to the failure relationship.

Storage API URL

Overrides the default storage URL. Configuring an alternative Storage API URL also overrides the HTTP Host header on requests as described in the Google documentation for Private Service Connections.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to this relationship after a successful Google Cloud Storage operation.

  • failure: FlowFiles are routed to this relationship if the Google Cloud Storage operation fails.

Writes Attributes

  • filename: The name of the file, parsed if possible from the Content-Disposition response header

  • gcs.bucket: Bucket of the object.

  • gcs.key: Name of the object.

  • gcs.size: Size of the object.

  • gcs.cache.control: Data cache control of the object.

  • gcs.component.count: The number of components which make up the object.

  • gcs.content.disposition: The data content disposition of the object.

  • gcs.content.encoding: The content encoding of the object.

  • gcs.content.language: The content language of the object.

  • mime.type: The MIME/Content-Type of the object

  • gcs.crc32c: The CRC32C checksum of the object’s data, encoded in base64 in big-endian order.

  • gcs.create.time: The creation time of the object (milliseconds)

  • gcs.update.time: The last modification time of the object (milliseconds)

  • gcs.encryption.algorithm: The algorithm used to encrypt the object.

  • gcs.encryption.sha256: The SHA256 hash of the key used to encrypt the object

  • gcs.etag: The HTTP 1.1 Entity tag for the object.

  • gcs.generated.id: The service-generated ID for the object

  • gcs.generation: The data generation of the object.

  • gcs.md5: The MD5 hash of the object’s data encoded in base64.

  • gcs.media.link: The media download link to the object.

  • gcs.metageneration: The metageneration of the object.

  • gcs.owner: The owner (uploader) of the object.

  • gcs.owner.type: The ACL entity type of the uploader of the object.

  • gcs.acl.owner: A comma-delimited list of ACL entities that have owner access to the object. Entities will be either email addresses, domains, or project IDs.

  • gcs.acl.writer: A comma-delimited list of ACL entities that have write access to the object. Entities will be either email addresses, domains, or project IDs.

  • gcs.acl.reader: A comma-delimited list of ACL entities that have read access to the object. Entities will be either email addresses, domains, or project IDs.

  • gcs.uri: The URI of the object as a string.

Input Requirement

This component requires an incoming relationship.

FetchGoogleDrive

Fetches files from a Google Drive Folder. Designed to be used in tandem with ListGoogleDrive. Please see Additional Details to set up access to Google Drive.

Multi-Processor Use Cases

Retrieve all files in a Google Drive folder

Keywords: google, drive, google cloud, state, retrieve, fetch, all, stream

FetchGoogleDrive:

  1. "File ID" = "${drive.id}"

  2. The "GCP Credentials Provider Service" property should specify an instance of the GCPCredentialsService in order to provide credentials for accessing the file.

ListGoogleDrive:

  1. The "Folder ID" property should be set to the ID of the Google Drive folder that files reside in. See processor documentation / additional details for more information on how to determine a Google Drive folder’s ID.

  2. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{GOOGLE_DRIVE_FOLDER_ID}.

  3. The "GCP Credentials Provider Service" property should specify an instance of the GCPCredentialsService in order to provide credentials for accessing the folder.

  4. The 'success' Relationship of this Processor is then connected to FetchGoogleDrive.

Tags: google, drive, storage, fetch

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

File ID

The Drive ID of the File to fetch. Please see Additional Details for information on how to obtain the Drive ID.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Google Doc Export Type

Google Documents cannot be downloaded directly from Google Drive but instead must be exported to a specified MIME Type. In the event that the incoming FlowFile’s MIME Type indicates that the file is a Google Document, this property specifies the MIME Type to export the document to.

Google Spreadsheet Export Type

Google Spreadsheets cannot be downloaded directly from Google Drive but instead must be exported to a specified MIME Type. In the event that the incoming FlowFile’s MIME Type indicates that the file is a Google Spreadsheet, this property specifies the MIME Type to export the spreadsheet to.

Google Presentation Export Type

Google Presentations cannot be downloaded directly from Google Drive but instead must be exported to a specified MIME Type. In the event that the incoming FlowFile’s MIME Type indicates that the file is a Google Presentation, this property specifies the MIME Type to export the presentation to.

Google Drawing Export Type

Google Drawings cannot be downloaded directly from Google Drive but instead must be exported to a specified MIME Type. In the event that the incoming FlowFile’s MIME Type indicates that the file is a Google Drawing, this property specifies the MIME Type to export the drawing to.

Relationships

  • success: A FlowFile will be routed here for each successfully fetched File.

  • failure: A FlowFile will be routed here for each File for which fetch was attempted but failed.

Reads Attributes

  • drive.id: The id of the file

Writes Attributes

  • drive.id: The id of the file

  • filename: The name of the file

  • mime.type: The MIME type of the file

  • drive.size: The size of the file

  • drive.timestamp: The last modified time or created time (whichever is greater) of the file. The reason for this is that the original modified date of a file is preserved when it is uploaded to Google Drive, while 'created time' reflects when the upload occurred. However, uploaded files can still be modified later.

  • error.code: The error code returned by Google Drive

  • error.message: The error message returned by Google Drive

Input Requirement

This component requires an incoming relationship.

Additional Details

Accessing Google Drive from NiFi

This processor uses Google Cloud credentials for authentication to access Google Drive. The following steps are required to prepare the Google Cloud and Google Drive accounts for the processors:

  1. Enable Google Drive API in Google Cloud

  2. Grant access to Google Drive folder

    • In Google Cloud Console navigate to IAM & Admin → Service Accounts.

    • Take a note of the email of the service account you are going to use.

    • Navigate to the folder to be listed in Google Drive.

    • Right-click on the Folder → Share.

    • Enter the service account email.

  3. Find File ID
    Usually FetchGoogleDrive is used with ListGoogleDrive and ‘drive.id’ is set.
    In case ‘drive.id’ is not available, you can find the Drive ID of the file in the following way:

    • Right-click on the file and select “Get Link”.

    • In the pop-up window click on “Copy Link”.

    • You can obtain the file ID from the URL copied to clipboard. For example, if the URL were https://drive.google.com/file/d/16ALV9KIU_KKeNG557zyctqy2Fmzyqtq/view?usp=share_link,
      the File ID would be 16ALV9KIU_KKeNG557zyctqy2Fmzyqtq

  4. Set File ID in ‘File ID’ property

FetchGridFS

Retrieves one or more files from a GridFS bucket by file name or by a user-defined query.

Tags: fetch, gridfs, mongo

Properties

Client Service

The MongoDB client service to use for database connections.

Mongo Database Name

The name of the database to use

Bucket Name

The GridFS bucket where the files will be stored. If left blank, it will use the default value 'fs' that the MongoDB client driver uses.

File Name

The name of the file in the bucket that is the target of this processor.

Query

A valid MongoDB query to use to fetch one or more files from GridFS.

Query Output Attribute

If set, the query will be written to a specified attribute on the output flowfiles.

Operation Mode

This option controls when results are made available to downstream processors. If Stream Query Results is enabled, provenance will not be tracked relative to the input flowfile if an input flowfile is received and starts the query. In Stream Query Results mode errors will be handled by sending a new flowfile with the original content and attributes of the input flowfile to the failure relationship. Streaming should only be used if there is reliable connectivity between MongoDB and NiFi.

Relationships

  • success: When the operation succeeds, the flowfile is sent to this relationship.

  • failure: When there is a failure processing the flowfile, it goes to this relationship.

  • original: The original input flowfile goes to this relationship if the query does not cause an error

Writes Attributes

  • gridfs.file.metadata: The custom metadata stored with a file is attached to this property if it exists.

Input Requirement

This component requires an incoming relationship.

Additional Details

Description:

This processor retrieves one or more files from GridFS. The query can be provided in one of three ways:

  • Query configuration parameter.

  • Built for you by configuring the filename parameter. (Note: this is just a filename, Mongo queries cannot be embedded in the field).

  • Retrieving the query from the flowfile contents.

The processor can also be configured to either commit only once at the end of a fetch operation or after each file that is retrieved. Committing after each file is generally only necessary when retrieving a lot of data from GridFS, as measured in total data size rather than file count, to ensure that the disks NiFi is using are not overloaded.
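
For reference, both the file-name lookup and the query-based lookup can be reproduced with the pymongo/gridfs driver; the connection string, database name, bucket name, and filter below are placeholders, and the sketch is not the processor's implementation.

    # Sketch: fetch files from GridFS by file name or by an arbitrary query.
    import gridfs
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # placeholder connection
    db = client["mydb"]                                  # "Mongo Database Name"
    fs = gridfs.GridFS(db, collection="fs")              # "Bucket Name" (default 'fs')

    # Either a simple file-name lookup or a full MongoDB query document.
    query = {"filename": "report.csv"}                   # e.g. {"metadata.team": "ops"}

    for grid_out in fs.find(query):
        content = grid_out.read()                        # the file's content
        print(grid_out.filename, len(content), grid_out.metadata)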

FetchHDFS

Retrieves a file from HDFS. The content of the incoming FlowFile is replaced by the content of the file in HDFS. The file in HDFS is left intact without any changes being made to it.

Tags: hadoop, hcfs, hdfs, get, ingest, fetch, source

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

HDFS Filename

The name of the HDFS file to retrieve

Compression codec

Relationships

  • success: FlowFiles will be routed to this relationship once they have been updated with the content of the HDFS file

  • failure: FlowFiles will be routed to this relationship if the content of the HDFS file cannot be retrieved and trying again will likely not be helpful. This would occur, for instance, if the file is not found or if there is a permissions issue

  • comms.failure: FlowFiles will be routed to this relationship if the content of the HDFS file cannot be retrieved due to a communications failure. This generally indicates that the Fetch should be tried again.

Writes Attributes

  • hdfs.failure.reason: When a FlowFile is routed to 'failure', this attribute is added indicating why the file could not be fetched from HDFS

  • hadoop.file.url: The hadoop url for the file is stored in this attribute.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

FetchParquet

Reads from a given Parquet file and writes records to the content of the flow file using the selected record writer. The original Parquet file will remain unchanged, and the content of the flow file will be replaced with records of the selected type. This processor can be used with ListHDFS or ListFile to obtain a listing of files to fetch.
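
The read-Parquet-and-emit-records behaviour can be illustrated with a small pyarrow sketch; the file path is a placeholder, and the JSON-lines output merely stands in for whatever Record Writer is configured.

    # Sketch: read a Parquet file and re-emit its rows as records (JSON lines here).
    import json
    import pyarrow.parquet as pq

    table = pq.read_table("/data/example.parquet")   # placeholder path ("Filename")

    # Each row becomes one record; a configured Record Writer would control the format.
    for record in table.to_pylist():
        print(json.dumps(record, default=str))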

Tags: parquet, hadoop, HDFS, get, ingest, fetch, source, record

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Filename

The name of the file to retrieve

Record Writer

The service for writing records to the FlowFile content

Relationships

  • success: FlowFiles will be routed to this relationship once they have been updated with the content of the file

  • failure: FlowFiles will be routed to this relationship if the content of the file cannot be retrieved and trying again will likely not be helpful. This would occur, for instance, if the file is not found or if there is a permissions issue

  • retry: FlowFiles will be routed to this relationship if the content of the file cannot be retrieved, but might be able to be in the future if tried again. This generally indicates that the Fetch should be tried again.

Reads Attributes

  • record.offset: Gets the index of first record in the input.

  • record.count: Gets the number of records in the input.

Writes Attributes

  • fetch.failure.reason: When a FlowFile is routed to 'failure', this attribute is added indicating why the file could not be fetched from the given filesystem.

  • record.count: The number of records in the resulting flow file

  • hadoop.file.url: The hadoop url for the file is stored in this attribute.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

See Also

FetchS3Object

Retrieves the contents of an S3 Object and writes it to the content of a FlowFile
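
A minimal boto3 sketch of an equivalent object fetch, including the optional byte range; the region, bucket, key, and range values are placeholders, and the sketch is not the processor's implementation.

    # Sketch: fetch an S3 object (optionally a byte range) with boto3.
    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")       # "Region" (placeholder)

    response = s3.get_object(
        Bucket="my-source-bucket",                          # "Bucket"
        Key="path/to/file.zip",                             # "Object Key"
        Range="bytes=0-1048575",                            # Range Start / Range Length
    )
    content = response["Body"].read()
    print(len(content), response.get("ETag"), response.get("ContentType"))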

Use Cases

Fetch a specific file from S3

Input Requirement: This component allows an incoming relationship.

  1. The "Bucket" property should be set to the name of the S3 bucket that contains the file. Typically this is defined as an attribute on an incoming FlowFile, so this property is set to ${s3.bucket}.

  2. The "Object Key" property denotes the fully qualified filename of the file to fetch. Typically, the FlowFile’s filename attribute is used, so this property is set to ${filename}.

  3. The "Region" property must be set to denote the S3 region that the Bucket resides in. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{S3_REGION}.

  4. The "AWS Credentials Provider service" property should specify an instance of the AWSCredentialsProviderControllerService in order to provide credentials for accessing the file.

Multi-Processor Use Cases

Retrieve all files in an S3 bucket

Keywords: s3, state, retrieve, fetch, all, stream

ListS3:

  1. The "Bucket" property should be set to the name of the S3 bucket that files reside in. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{S3_SOURCE_BUCKET}.

  2. The "Region" property must be set to denote the S3 region that the Bucket resides in. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{S3_SOURCE_REGION}.

  3. The "AWS Credentials Provider service" property should specify an instance of the AWSCredentialsProviderControllerService in order to provide credentials for accessing the bucket.

  4. The 'success' Relationship of this Processor is then connected to FetchS3Object.

FetchS3Object:

  1. "Bucket" = "${s3.bucket}"

  2. "Object Key" = "${filename}"

  3. The "AWS Credentials Provider service" property should specify an instance of the AWSCredentialsProviderControllerService in order to provide credentials for accessing the bucket.

  4. The "Region" property must be set to the same value as the "Region" property of the ListS3 Processor.

Retrieve only files from S3 that meet some specified criteria

Keywords: s3, state, retrieve, filter, select, fetch, criteria

ListS3:

  1. The "Bucket" property should be set to the name of the S3 bucket that files reside in. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{S3_SOURCE_BUCKET}.

  2. The "Region" property must be set to denote the S3 region that the Bucket resides in. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{S3_SOURCE_REGION}.

  3. The "AWS Credentials Provider service" property should specify an instance of the AWSCredentialsProviderControllerService in order to provide credentials for accessing the bucket.

  4. The 'success' Relationship of this Processor is then connected to RouteOnAttribute.

RouteOnAttribute:

  1. If you would like to "OR" together all of the conditions (i.e., the file should be retrieved if any of the conditions are met), set "Routing Strategy" to "Route to 'matched' if any matches".

  2. If you would like to "AND" together all of the conditions (i.e., the file should only be retrieved if all of the conditions are met), set "Routing Strategy" to "Route to 'matched' if all match".

  3. For each condition that you would like to filter on, add a new property. The name of the property should describe the condition. The value of the property should be an Expression Language expression that returns true if the file meets the condition or false if the file does not meet the condition.

  4. Some attributes that you may consider filtering on are:

  5. - filename (the name of the file)

  6. - s3.length (the number of bytes in the file)

  7. - s3.tag.<tag name> (the value of the s3 tag with the name tag name)

  8. - s3.user.metadata.<key name> (the value of the user metadata with the key named key name)

  9. For example, to fetch only files that are at least 1 MB and have a filename ending in .zip we would set the following properties:

  10. - "Routing Strategy" = "Route to 'matched' if all match"

  11. - "At least 1 MB" = "${s3.length:ge(1000000)}"

  12. - "Ends in .zip" = "${filename:endsWith('.zip')}"

  13. Auto-terminate the unmatched Relationship.

  14. Connect the matched Relationship to the FetchS3Object processor.

FetchS3Object:

  1. "Bucket" = "${s3.bucket}"

  2. "Object Key" = "${filename}"

  3. The "AWS Credentials Provider service" property should specify an instance of the AWSCredentialsProviderControllerService in order to provide credentials for accessing the bucket.

  4. The "Region" property must be set to the same value as the "Region" property of the ListS3 Processor.

Retrieve new files as they arrive in an S3 bucket

Notes: This method of retrieving files from S3 is more efficient and more cost-effective than using ListS3. It is the pattern recommended by AWS. However, it does require that the S3 bucket be configured to place notifications on an SQS queue when new files arrive. For more information, see https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html

GetSQS:

  1. The "Queue URL" must be set to the appropriate URL for the SQS queue. It is recommended that this property be parameterized, using a value such as #{SQS_QUEUE_URL}.

  2. The "Region" property must be set to denote the SQS region that the queue resides in. It’s a good idea to parameterize this property by setting it to something like #{SQS_REGION}.

  3. The "AWS Credentials Provider service" property should specify an instance of the AWSCredentialsProviderControllerService in order to provide credentials for accessing the queue.

  4. The 'success' relationship is connected to EvaluateJsonPath.

EvaluateJsonPath:

  1. "Destination" = "flowfile-attribute"

  2. "s3.bucket" = "$.Records[0].s3.bucket.name"

  3. "filename" = "$.Records[0].s3.object.key"

  4. The 'success' relationship is connected to FetchS3Object.

FetchS3Object:

  1. "Bucket" = "${s3.bucket}"

  2. "Object Key" = "${filename}"

  3. The "Region" property must be set to the same value as the "Region" property of the GetSQS Processor.

  4. The "AWS Credentials Provider service" property should specify an instance of the AWSCredentialsProviderControllerService in order to provide credentials for accessing the bucket.

Tags: Amazon, S3, AWS, Get, Fetch

Properties

Bucket

The S3 Bucket to interact with

Object Key

The S3 Object Key to use. This is analogous to a filename for traditional file systems.

Region

The AWS Region to connect to.

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Communications Timeout

The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

Version

The Version of the Object to download

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Signer Override

The AWS S3 library uses Signature Version 4 by default but this property allows you to specify the Version 2 signer to support older S3-compatible services or even to plug in your own custom signer implementation.

Custom Signer Class Name

Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.

Custom Signer Module Location

Comma-separated list of paths to files and/or directories which contain the custom signer’s JAR file and its dependencies (if any).

Encryption Service

Specifies the Encryption Service Controller used to configure requests. PutS3Object: For backward compatibility, this value is ignored when 'Server Side Encryption' is set. FetchS3Object: Only needs to be configured in case of Server-side Customer Key, Client-side KMS and Client-side Customer Key encryptions.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Requester Pays

If true, indicates that the requester consents to pay any charges associated with retrieving objects from the S3 bucket. This sets the 'x-amz-request-payer' header to 'requester'.

Range Start

The byte position at which to start reading from the object. An empty value or a value of zero will start reading at the beginning of the object.

Range Length

The number of bytes to download from the object, starting from the Range Start. An empty value or a value that extends beyond the end of the object will read to the end of the object.

Relationships

  • success: FlowFiles are routed to this Relationship after they have been successfully processed.

  • failure: If the Processor is unable to process a given FlowFile, it will be routed to this Relationship.

Writes Attributes

  • s3.url: The URL that can be used to access the S3 object

  • s3.bucket: The name of the S3 bucket

  • path: The path of the file

  • absolute.path: The path of the file

  • filename: The name of the file

  • hash.value: The MD5 sum of the file

  • hash.algorithm: MD5

  • mime.type: If S3 provides the content type/MIME type, this attribute will hold that value

  • s3.etag: The ETag that can be used to see if the file has changed

  • s3.exception: The class name of the exception thrown during processor execution

  • s3.additionalDetails: The S3 supplied detail from the failed operation

  • s3.statusCode: The HTTP error code (if available) from the failed operation

  • s3.errorCode: The S3 moniker of the failed operation

  • s3.errorMessage: The S3 exception message from the failed operation

  • s3.expirationTime: If the file has an expiration date, this attribute will be set, containing the milliseconds since epoch in UTC time

  • s3.expirationTimeRuleId: The ID of the rule that dictates this object’s expiration time

  • s3.sseAlgorithm: The server side encryption algorithm of the object

  • s3.version: The version of the S3 object

  • s3.encryptionStrategy: The name of the encryption strategy that was used to store the S3 object (if it is encrypted)

Input Requirement

This component requires an incoming relationship.

FetchSFTP

Fetches the content of a file from a remote SFTP server and overwrites the contents of an incoming FlowFile with the content of the remote file.
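
For reference, an equivalent SFTP fetch with a 'Move File' completion step can be sketched with the paramiko library; host, credentials, and paths are placeholders, host-key checking is relaxed purely for brevity, and the sketch is not the processor's implementation.

    # Sketch: fetch a remote file over SFTP, then move the original ("Move File").
    import paramiko

    host, port = "sftp.example.com", 22                  # placeholders
    remote_file = "/inbound/data.csv"
    move_destination = "/processed/data.csv"

    ssh = paramiko.SSHClient()
    # Relaxed host-key handling, roughly "Strict Host Key Checking" = false.
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, port=port, username="nifi", password="secret", timeout=30)

    sftp = ssh.open_sftp()
    remote = sftp.open(remote_file, "rb")
    content = remote.read()                               # becomes the FlowFile content
    remote.close()

    sftp.rename(remote_file, move_destination)            # sftp.remove(...) for Delete File
    sftp.close()
    ssh.close()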

Multi-Processor Use Cases

Retrieve all files in a directory of an SFTP Server

Keywords: sftp, secure, file, transform, state, retrieve, fetch, all, stream

ListSFTP:

  1. The "Hostname" property should be set to the fully qualified hostname of the SFTP Server. It’s a good idea to parameterize this property by setting it to something like #{SFTP_SERVER}.

  2. The "Remote Path" property must be set to the directory on the SFTP Server where the files reside. If the flow being built is to be reused elsewhere, it’s a good idea to parameterize this property by setting it to something like #{SFTP_REMOTE_PATH}.

  3. Configure the "Username" property to the appropriate username for logging into the SFTP Server. It’s usually a good idea to parameterize this property by setting it to something like #{SFTP_USERNAME}.

  4. Configure the "Password" property to the appropriate password for the provided username. It’s usually a good idea to parameterize this property by setting it to something like #{SFTP_PASSWORD}.

  5. The 'success' Relationship of this Processor is then connected to FetchSFTP.

FetchSFTP:

  1. "Hostname" = "${sftp.remote.host}"

  2. "Remote File" = "${path}/${filename}"

  3. "Username" = "${sftp.listing.user}"

  4. "Password" = "#{SFTP_PASSWORD}"

Tags: sftp, get, retrieve, files, fetch, remote, ingest, source, input

Properties

Hostname

The fully-qualified hostname or IP address of the host to fetch the data from

Port

The port to connect to on the remote host to fetch the data from

Username

Username

Password

Password for the user account

Private Key Path

The fully qualified path to the Private Key file

Private Key Passphrase

Password for the private key

Remote File

The fully qualified filename on the remote system

Completion Strategy

Specifies what to do with the original file on the server once it has been pulled into NiFi. If the Completion Strategy fails, a warning will be logged but the data will still be transferred.

Move Destination Directory

The directory on the remote server to move the original file to once it has been ingested into NiFi. This property is ignored unless the Completion Strategy is set to 'Move File'. The specified directory must already exist on the remote system if 'Create Directory' is disabled, or the rename will fail.

Create Directory

Used when 'Completion Strategy' is 'Move File'. Specifies whether or not the remote directory should be created if it does not exist.

Disable Directory Listing

Controls how 'Move Destination Directory' is created when 'Completion Strategy' is 'Move File' and 'Create Directory' is enabled. If set to 'true', no directory listing is performed prior to creating missing directories. By default, this processor executes a directory listing command to check for the target directory's existence before creating missing directories. However, there are situations in which you might need to disable the directory listing, such as the following: directory listing might fail with some permission setups (e.g. chmod 100) on a directory; also, if another SFTP client creates the directory after this processor performs a listing but before this processor's directory creation request is finished, an error is returned because the directory already exists.

Connection Timeout

Amount of time to wait before timing out while creating a connection

Data Timeout

When transferring a file between the local and remote system, this value specifies how long is allowed to elapse without any data being transferred between systems

Send Keep Alive On Timeout

Send a Keep Alive message every 5 seconds up to 5 times for an overall timeout of 25 seconds.

Host Key File

If supplied, the given file will be used as the Host Key. Otherwise, if the 'Strict Host Key Checking' property is set to true, the 'known_hosts' and 'known_hosts2' files from the ~/.ssh directory are used; if it is not, no host key file will be used.

Strict Host Key Checking

Indicates whether or not strict enforcement of hosts keys should be applied

Use Compression

Indicates whether or not ZLIB compression should be used when transferring files

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN, SOCKS + AuthN

Log level when file not found

Log level to use in case the file does not exist when the processor is triggered

Ciphers Allowed

A comma-separated list of Ciphers allowed for SFTP connections. Leave unset to allow all. Available options are: 3des-cbc, 3des-ctr, aes128-cbc, aes128-ctr, aes128-gcm@openssh.com, aes192-cbc, aes192-ctr, aes256-cbc, aes256-ctr, aes256-gcm@openssh.com, arcfour, arcfour128, arcfour256, blowfish-cbc, blowfish-ctr, cast128-cbc, cast128-ctr, chacha20-poly1305@openssh.com, idea-cbc, idea-ctr, serpent128-cbc, serpent128-ctr, serpent192-cbc, serpent192-ctr, serpent256-cbc, serpent256-ctr, twofish-cbc, twofish128-cbc, twofish128-ctr, twofish192-cbc, twofish192-ctr, twofish256-cbc, twofish256-ctr

Key Algorithms Allowed

A comma-separated list of Key Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: ecdsa-sha2-nistp256, ecdsa-sha2-nistp256-cert-v01@openssh.com, ecdsa-sha2-nistp384, ecdsa-sha2-nistp384-cert-v01@openssh.com, ecdsa-sha2-nistp521, ecdsa-sha2-nistp521-cert-v01@openssh.com, rsa-sha2-256, rsa-sha2-512, ssh-dss, ssh-dss-cert-v01@openssh.com, ssh-ed25519, ssh-ed25519-cert-v01@openssh.com, ssh-rsa, ssh-rsa-cert-v01@openssh.com

Key Exchange Algorithms Allowed

A comma-separated list of Key Exchange Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: curve25519-sha256, curve25519-sha256@libssh.org, diffie-hellman-group-exchange-sha1, diffie-hellman-group-exchange-sha256, diffie-hellman-group1-sha1, diffie-hellman-group14-sha1, diffie-hellman-group14-sha256, diffie-hellman-group14-sha256@ssh.com, diffie-hellman-group15-sha256, diffie-hellman-group15-sha256@ssh.com, diffie-hellman-group15-sha384@ssh.com, diffie-hellman-group15-sha512, diffie-hellman-group16-sha256, diffie-hellman-group16-sha384@ssh.com, diffie-hellman-group16-sha512, diffie-hellman-group16-sha512@ssh.com, diffie-hellman-group17-sha512, diffie-hellman-group18-sha512, diffie-hellman-group18-sha512@ssh.com, ecdh-sha2-nistp256, ecdh-sha2-nistp384, ecdh-sha2-nistp521, ext-info-c

Message Authentication Codes Allowed

A comma-separated list of Message Authentication Codes allowed for SFTP connections. Leave unset to allow all. Available options are: hmac-md5, hmac-md5-96, hmac-md5-96-etm@openssh.com, hmac-md5-etm@openssh.com, hmac-ripemd160, hmac-ripemd160-96, hmac-ripemd160-etm@openssh.com, hmac-ripemd160@openssh.com, hmac-sha1, hmac-sha1-96, hmac-sha1-96@openssh.com, hmac-sha1-etm@openssh.com, hmac-sha2-256, hmac-sha2-256-etm@openssh.com, hmac-sha2-512, hmac-sha2-512-etm@openssh.com

Relationships

  • success: All FlowFiles that are received are routed to success

  • comms.failure: Any FlowFile that could not be fetched from the remote server due to a communications failure will be transferred to this Relationship.

  • not.found: Any FlowFile for which we receive a 'Not Found' message from the remote server will be transferred to this Relationship.

  • permission.denied: Any FlowFile that could not be fetched from the remote server due to insufficient permissions will be transferred to this Relationship.

Writes Attributes

  • sftp.remote.host: The hostname or IP address from which the file was pulled

  • sftp.remote.port: The port that was used to communicate with the remote SFTP server

  • sftp.remote.filename: The name of the remote file that was pulled

  • filename: The filename is updated to point to the filename of the remote file

  • path: If the Remote File contains a directory name, that directory name will be added to the FlowFile using the 'path' attribute

  • fetch.failure.reason: The name of the failure relationship applied when routing to any failure relationship

Input Requirement

This component requires an incoming relationship.

FetchSmb

Fetches files from a SMB Share. Designed to be used in tandem with ListSmb.

Tags: samba, smb, cifs, files, fetch

Properties

SMB Client Provider Service

Specifies the SMB client provider to use for creating SMB connections.

Remote File

The full path of the file to be retrieved from the remote server. Expression language is supported.

Completion Strategy

Specifies what to do with the original file on the server once it has been processed. If the Completion Strategy fails, a warning will be logged but the data will still be transferred.

Destination Directory

The directory on the remote server to move the original file to once it has been processed.

Create Destination Directory

Specifies whether or not the remote directory should be created if it does not exist.

Relationships

  • success: A FlowFile will be routed here for each successfully fetched file.

  • failure: A FlowFile will be routed here when its content could not be fetched.

Writes Attributes

  • error.code: The error code returned by SMB when the fetch of a file fails.

  • error.message: The error message returned by SMB when the fetch of a file fails.

Input Requirement

This component requires an incoming relationship.

FilterAttribute

Filters the attributes of a FlowFile by retaining specified attributes and removing the rest or by removing specified attributes and retaining the rest.
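
The two filter modes can be illustrated with a small sketch over a plain attribute dictionary; the attribute names and pattern are hypothetical, and whether the real processor anchors the regular expression exactly as shown is not implied here.

    # Sketch of the Retain/Remove filtering modes over a FlowFile attribute map.
    import re

    attributes = {"filename": "a.csv", "my-property": "x", "a-prefix.id": "42", "tmp": "y"}

    def filter_attributes(attrs, mode, pattern):
        """Keep ('Retain') or drop ('Remove') the attributes whose names match pattern."""
        regex = re.compile(pattern)
        matched = {name for name in attrs if regex.fullmatch(name)}
        keep = matched if mode == "Retain" else set(attrs) - matched
        return {name: value for name, value in attrs.items() if name in keep}

    print(filter_attributes(attributes, "Retain", r"my-property|a-prefix[.].*"))
    print(filter_attributes(attributes, "Remove", r"tmp"))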

Use Cases

Retain all FlowFile attributes matching a regular expression

Input Requirement: This component allows an incoming relationship.

  1. Set "Filter Mode" to "Retain".

  2. Set "Attribute Matching Strategy" to "Use regular expression".

  3. Specify the "Filtered Attributes Pattern", e.g. "my-property|a-prefix[.].*".

Remove only a specified set of FlowFile attributes

Input Requirement: This component allows an incoming relationship.

  1. Set "Filter Mode" to "Remove".

  2. Set "Attribute Matching Strategy" to "Enumerate attributes".

  3. Specify the set of "Filtered Attributes" using the delimiter comma ',', e.g. "my-property,other,filename".

Tags: attributes, modification, filter, retain, remove, delete, regex, regular expression, Attribute Expression Language

Properties

Filter Mode

Specifies the strategy to apply on filtered attributes. Either 'Remove' or 'Retain' only the matching attributes.

Attribute Matching Strategy

Specifies the strategy to filter attributes by.

Filtered Attributes

A set of attribute names to filter from FlowFiles. Each attribute name is separated by the comma delimiter ','.

Filtered Attributes Pattern

A regular expression to match names of attributes to filter from FlowFiles.

Relationships

  • success: All successful FlowFiles are routed to this relationship

Input Requirement

This component requires an incoming relationship.

FlattenJson

Provides the user with the ability to take a nested JSON document and flatten it into a simple key/value pair document. The keys are combined at each level with a user-defined separator that defaults to '.'. This Processor can also unflatten a previously flattened JSON document. It supports four flatten modes: normal, keep-arrays, dot notation for MongoDB queries, and keep-primitive-arrays. The default flatten mode is 'keep-arrays'.
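
A rough sketch of the flattening idea follows; it indexes array elements (closest to the 'normal' mode rather than the default 'keep-arrays' mode), uses an arbitrary sample document, and is not the processor's implementation.

    # Rough illustration of flattening nested JSON into dotted key/value pairs.
    import json

    def flatten(value, parent_key="", separator="."):
        items = {}
        if isinstance(value, dict):
            for key, child in value.items():
                new_key = f"{parent_key}{separator}{key}" if parent_key else key
                items.update(flatten(child, new_key, separator))
        elif isinstance(value, list):
            # 'normal' mode indexes array elements; 'keep-arrays' would leave them intact.
            for index, child in enumerate(value):
                items.update(flatten(child, f"{parent_key}[{index}]", separator))
        else:
            items[parent_key] = value
        return items

    doc = {"user": {"name": "alice", "tags": ["a", "b"]}, "active": True}
    print(json.dumps(flatten(doc), indent=2))
    # {"user.name": "alice", "user.tags[0]": "a", "user.tags[1]": "b", "active": true}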

Tags: json, flatten, unflatten

Properties

Separator

The separator character used for joining keys. Must be a JSON-legal character.

Flatten Mode

Specifies how json should be flattened/unflattened

Ignore Reserved Characters

If true, reserved characters in keys will be ignored

Return Type

Specifies whether the desired output is the flattened or the unflattened JSON

Character Set

The Character Set in which the file is encoded

Pretty Print JSON

Specifies whether or not the resulting JSON should be pretty-printed

Relationships

  • success: Successfully flattened/unflattened files go to this relationship.

  • failure: Files that cannot be flattened/unflattened go to this relationship.

Input Requirement

This component requires an incoming relationship.

ForkEnrichment

Used in conjunction with the JoinEnrichment processor, this processor is responsible for adding the attributes that are necessary for the JoinEnrichment processor to perform its function. Each incoming FlowFile will be cloned. The original FlowFile will have appropriate attributes added and then be transferred to the 'original' relationship. The clone will have appropriate attributes added and then be routed to the 'enrichment' relationship. See the documentation for the JoinEnrichment processor (and especially its Additional Details) for more information on how these Processors work together and how to perform enrichment tasks in NiFi by using these Processors.

Tags: fork, join, enrich, record

Properties

Relationships

  • enrichment: A clone of the incoming FlowFile will be routed to this relationship, after adding appropriate attributes.

  • original: The incoming FlowFile will be routed to this relationship, after adding appropriate attributes.

Writes Attributes

  • enrichment.group.id: The Group ID to use in order to correlate the 'original' FlowFile with the 'enrichment' FlowFile.

  • enrichment.role: The role to use for enrichment. This will either be ORIGINAL or ENRICHMENT.

Input Requirement

This component requires an incoming relationship.

See Also

Additional Details

Introduction

The ForkEnrichment processor is designed to be used in conjunction with the JoinEnrichment Processor. Used together, they provide a powerful mechanism for transforming data into a separate request payload for gathering enrichment data, gathering that enrichment data, optionally transforming the enrichment data, and finally joining together the original payload with the enrichment data.

Typical Dataflow

A ForkEnrichment processor is responsible for taking in a FlowFile and producing two copies of it: one routed to the “original” relationship and the other to the “enrichment” relationship. Each copy will have its own set of attributes added to it.

The “original” FlowFile is routed to the JoinEnrichment processor, while the “enrichment” FlowFile is routed in a different direction. Each of these FlowFiles will have an attribute named “enrichment.group.id” with the same value. The JoinEnrichment processor then uses this information to correlate the two FlowFiles. The “enrichment.role” attribute will also be added to each FlowFile but with a different value. The FlowFile routed to “original” will have an enrichment.role of ORIGINAL while the FlowFile routed to “enrichment” will have an enrichment.role of ENRICHMENT.
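
A minimal sketch of the attributes involved (the group ID shown is a hypothetical placeholder; the processor generates the actual value):

{
  "original FlowFile": {
    "enrichment.group.id": "example-group-id",
    "enrichment.role": "ORIGINAL"
  },
  "enrichment FlowFile": {
    "enrichment.group.id": "example-group-id",
    "enrichment.role": "ENRICHMENT"
  }
}
json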

The Processors that make up the “enrichment” path will vary from use case to use case. We use the JoltTransformJSON processor in order to transform the original payload into the payload expected by our web service. We then use the InvokeHTTP processor in order to gather enrichment data that is relevant to our use case. Other common processors to use in this path include QueryRecord, UpdateRecord, ReplaceText, JoltTransformRecord, and ScriptedTransformRecord. It is also a common use case to transform the response from the web service that is invoked via InvokeHTTP using one or more of these processors.

After the enrichment data has been gathered, it does us little good unless we are able to somehow combine our enrichment data back with our original payload. To achieve this, we use the JoinEnrichment processor. It is responsible for combining records from both the “original” FlowFile and the “enrichment” FlowFile.

The JoinEnrichment Processor is configured with a separate RecordReader for the “original” FlowFile and for the “enrichment” FlowFile. This means that the original data and the enrichment data can have entirely different schemas and can even be in different data formats. For example, our original payload may be CSV data, while our enrichment data is a JSON payload. Because we make use of RecordReaders, this is entirely okay. The Processor also requires a RecordWriter to use for writing out the enriched payload (i.e., the payload that contains the join of both the “original” and the “enrichment” data).

For details on how to join the original payload with the enrichment data, see the Additional Details of the JoinEnrichment Processor documentation.

ForkRecord

This processor allows the user to fork a record into multiple records. The user must specify at least one Record Path, as a dynamic property, pointing to a field of type ARRAY containing RECORD objects. The processor accepts two modes: 'split' and 'extract'. In both modes, there is one record generated per element contained in the designated array. In the 'split' mode, each generated record will preserve the same schema as given in the input but the array will contain only one element. In the 'extract' mode, the element of the array must be of record type and will be the generated record. Additionally, in the 'extract' mode, it is possible to specify if each generated record should contain all the fields of the parent records from the root level to the extracted record. This assumes that the fields to add in the record are defined in the schema of the Record Writer controller service. See examples in the additional details documentation of this processor.

Tags: fork, record, content, array, stream, event

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Mode

Specifies the forking mode of the processor

Include Parent Fields

This parameter is only valid with the 'extract' mode. If set to true, all the fields from the root level to the given array will be added as fields of each element of the array to fork.

Dynamic Properties

Record Path property

A Record Path value, pointing to a field of type ARRAY containing RECORD objects

Relationships

  • failure: In case a FlowFile generates an error during the fork operation, it will be routed to this relationship

  • fork: The FlowFiles containing the forked records will be routed to this relationship

  • original: The original FlowFiles will be routed to this relationship

Writes Attributes

  • record.count: The generated FlowFile will have a 'record.count' attribute indicating the number of records that were written to the FlowFile.

  • mime.type: The MIME Type indicated by the Record Writer

  • <Attributes from Record Writer>: Any Attribute that the configured Record Writer returns will be added to the FlowFile.

Input Requirement

This component requires an incoming relationship.

Additional Details

ForkRecord allows the user to fork a record into multiple records. To do that, the user must specify one or more RecordPaths (as dynamic properties of the processor) pointing to a field of type ARRAY containing RECORD elements.

The processor accepts two modes:

  • Split mode - in this mode, the generated records will have the same schema as the input. For every element in the array, one record will be generated and the array will only contain this element.

  • Extract mode - in this mode, the generated records will be the elements contained in the array. In addition, it is possible to add to each record all the fields of the parent records from the root level down to the record element being forked. However, this requires that the fields to add are defined in the schema of the Record Writer controller service.

Examples
EXTRACT mode

To better understand how this Processor works, we will lay out a few examples. For the sake of these examples, let’s assume that our input data is JSON formatted and looks like this:

[
  {
    "id": 1,
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "accounts": [
      {
        "id": 42,
        "balance": 4750.89
      },
      {
        "id": 43,
        "balance": 48212.38
      }
    ]
  },
  {
    "id": 2,
    "name": "Jane Doe",
    "address": "345 My Street",
    "city": "Her City",
    "state": "NY",
    "zipCode": "22222",
    "country": "USA",
    "accounts": [
      {
        "id": 45,
        "balance": 6578.45
      },
      {
        "id": 46,
        "balance": 34567.21
      }
    ]
  }
]
json

Example 1 - Extracting without parent fields

For this case, we want to create one record per account and we don’t care about the other fields. We’ll add a dynamic property “path” set to /accounts. The resulting flow file will contain 4 records and will look like (assuming the Record Writer schema is correctly set):

[
  {
    "id": 42,
    "balance": 4750.89
  },
  {
    "id": 43,
    "balance": 48212.38
  },
  {
    "id": 45,
    "balance": 6578.45
  },
  {
    "id": 46,
    "balance": 34567.21
  }
]
json

Example 2 - Extracting with parent fields

Now, if we set the "Include Parent Fields" property to true, the parent fields will be recursively included in the output records, assuming the Record Writer schema allows it. In case multiple fields have the same name (as is the case for id in this example), the child field takes priority over all the parent fields sharing the same name. Here, the id of the accounts array elements is the one kept in the forked records. The resulting flow file will contain 4 records and will look like:

[
  {
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "id": 42,
    "balance": 4750.89
  },
  {
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "id": 43,
    "balance": 48212.38
  },
  {
    "name": "Jane Doe",
    "address": "345 My Street",
    "city": "Her City",
    "state": "NY",
    "zipCode": "22222",
    "country": "USA",
    "id": 45,
    "balance": 6578.45
  },
  {
    "name": "Jane Doe",
    "address": "345 My Street",
    "city": "Her City",
    "state": "NY",
    "zipCode": "22222",
    "country": "USA",
    "id": 46,
    "balance": 34567.21
  }
]
json

Example 3 - Multi-nested arrays

Now let’s say that the input record contains multi-nested arrays like the below example:

[
  {
    "id": 1,
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "accounts": [
      {
        "id": 42,
        "balance": 4750.89,
        "transactions": [
          {
            "id": 5,
            "amount": 150.31
          },
          {
            "id": 6,
            "amount": -15.31
          }
        ]
      },
      {
        "id": 43,
        "balance": 48212.38,
        "transactions": [
          {
            "id": 7,
            "amount": 36.78
          },
          {
            "id": 8,
            "amount": -21.34
          }
        ]
      }
    ]
  }
]
json


If we want to have one record per transaction for each account, then the Record Path should be set to /accounts[*]/transactions. If we have the following schema for our Record Reader:

{
  "type": "record",
  "name": "bank",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "address",
      "type": "string"
    },
    {
      "name": "city",
      "type": "string"
    },
    {
      "name": "state",
      "type": "string"
    },
    {
      "name": "zipCode",
      "type": "string"
    },
    {
      "name": "country",
      "type": "string"
    },
    {
      "name": "accounts",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "accounts",
          "fields": [
            {
              "name": "id",
              "type": "int"
            },
            {
              "name": "balance",
              "type": "double"
            },
            {
              "name": "transactions",
              "type": {
                "type": "array",
                "items": {
                  "type": "record",
                  "name": "transactions",
                  "fields": [
                    {
                      "name": "id",
                      "type": "int"
                    },
                    {
                      "name": "amount",
                      "type": "double"
                    }
                  ]
                }
              }
            }
          ]
        }
      }
    }
  ]
}
json

And if we have the following schema for our Record Writer:

{
  "type": "record",
  "name": "bank",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "address",
      "type": "string"
    },
    {
      "name": "city",
      "type": "string"
    },
    {
      "name": "state",
      "type": "string"
    },
    {
      "name": "zipCode",
      "type": "string"
    },
    {
      "name": "country",
      "type": "string"
    },
    {
      "name": "amount",
      "type": "double"
    },
    {
      "name": "balance",
      "type": "double"
    }
  ]
}
json

Then, if we include the parent fields, we’ll have 4 records as below:

[
  {
    "id": 5,
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "amount": 150.31,
    "balance": 4750.89
  },
  {
    "id": 6,
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "amount": -15.31,
    "balance": 4750.89
  },
  {
    "id": 7,
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "amount": 36.78,
    "balance": 48212.38
  },
  {
    "id": 8,
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "amount": -21.34,
    "balance": 48212.38
  }
]
json
SPLIT mode

Example

Assuming we have the below data and we added a property “path” set to /accounts:

[
  {
    "id": 1,
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "accounts": [
      {
        "id": 42,
        "balance": 4750.89
      },
      {
        "id": 43,
        "balance": 48212.38
      }
    ]
  },
  {
    "id": 2,
    "name": "Jane Doe",
    "address": "345 My Street",
    "city": "Her City",
    "state": "NY",
    "zipCode": "22222",
    "country": "USA",
    "accounts": [
      {
        "id": 45,
        "balance": 6578.45
      },
      {
        "id": 46,
        "balance": 34567.21
      }
    ]
  }
]
json

Then we’ll get 4 records as below:

[
  {
    "id": 1,
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "accounts": [
      {
        "id": 42,
        "balance": 4750.89
      }
    ]
  },
  {
    "id": 1,
    "name": "John Doe",
    "address": "123 My Street",
    "city": "My City",
    "state": "MS",
    "zipCode": "11111",
    "country": "USA",
    "accounts": [
      {
        "id": 43,
        "balance": 48212.38
      }
    ]
  },
  {
    "id": 2,
    "name": "Jane Doe",
    "address": "345 My Street",
    "city": "Her City",
    "state": "NY",
    "zipCode": "22222",
    "country": "USA",
    "accounts": [
      {
        "id": 45,
        "balance": 6578.45
      }
    ]
  },
  {
    "id": 2,
    "name": "Jane Doe",
    "address": "345 My Street",
    "city": "Her City",
    "state": "NY",
    "zipCode": "22222",
    "country": "USA",
    "accounts": [
      {
        "id": 46,
        "balance": 34567.21
      }
    ]
  }
]
json

GenerateFlowFile

This processor creates FlowFiles with random data or custom content. GenerateFlowFile is useful for load testing, configuration, and simulation. Also see DuplicateFlowFile for additional load testing.

Tags: test, random, generate, load

Properties

File Size

The size of the file that will be used

Batch Size

The number of FlowFiles to be transferred in each invocation

Data Format

Specifies whether the data should be Text or Binary

Unique FlowFiles

If true, each FlowFile that is generated will be unique. If false, a random value will be generated once and all FlowFiles will get the same content; this offers much higher throughput

Custom Text

If Data Format is text and if Unique FlowFiles is false, then this custom text will be used as content of the generated FlowFiles and the File Size will be ignored. Finally, if Expression Language is used, evaluation will be performed only once per batch of generated FlowFiles

Character Set

Specifies the character set to use when writing the bytes of Custom Text to a flow file.

Mime Type

Specifies the value to set for the "mime.type" attribute.

Dynamic Properties

Generated FlowFile attribute name

Specifies an attribute on generated FlowFiles defined by the Dynamic Property’s key and value. If Expression Language is used, evaluation will be performed only once per batch of generated FlowFiles.

Relationships

  • success:

Writes Attributes

  • mime.type: Sets the MIME type of the output if the 'Mime Type' property is set

Input Requirement

This component does not allow an incoming relationship.

Additional Details

This processor can be configured to generate variable-sized FlowFiles. The File Size property accepts both a literal value, e.g. “1 KB”, and Expression Language statements. In order to create FlowFiles of variable sizes, the Expression Language function random() can be used. For example, ${random():mod(101)} will generate values between 0 and 100, inclusive. A data size label, e.g. B, KB, MB, etc., must be included in the Expression Language statement since the File Size property holds a data size value. The table below shows some examples.

File Size Expression Language Statement       | File Sizes Generated (values are inclusive)
${random():mod(101)}b                         | 0 - 100 bytes
${random():mod(101)}mb                        | 0 - 100 MB
${random():mod(101):plus(20)} B               | 20 - 120 bytes
${random():mod(71):plus(30):append("KB")}     | 30 - 100 KB

See the Expression Language Guide for more details on the random() function.

GenerateRecord

This processor creates FlowFiles with records having random values for the specified fields. GenerateRecord is useful for testing, configuration, and simulation. It uses either user-defined properties to define a record schema or a provided schema, and generates the specified number of records using random data for the fields in the schema.

Tags: test, random, generate, fake

Properties

Record Writer

Specifies the Controller Service to use for writing out the records

Number of Records

Specifies how many records will be generated for each outgoing FlowFile.

Nullable Fields

Whether the generated fields will be nullable. Note that this property is ignored if Schema Text is set. Also it only affects the schema of the generated data, not whether any values will be null. If this property is true, see 'Null Value Percentage' to set the probability that any generated field will be null.

Null Value Percentage

The percent probability (0-100%) that a generated value for any nullable field will be null. Set this property to zero to have no null values, or 100 to have all null values.

Schema Text

The text of an Avro-formatted Schema used to generate record data. If this property is set, any user-defined properties are ignored.

Dynamic Properties

Field name in generated record

Custom properties define the generated record schema using configured field names and value data types in absence of the Schema Text property

Relationships

  • success: FlowFiles that are successfully created will be routed to this relationship

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records in the FlowFile

Input Requirement

This component does not allow an incoming relationship.
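
As a rough illustration, the Schema Text property could be set to an Avro schema such as the following (the field names here are hypothetical); the processor would then generate records with random values for these fields. Alternatively, a similar shape could be described through user-defined properties whose names are the field names and whose values are the data types.

{
  "type": "record",
  "name": "person",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "email", "type": "string" }
  ]
}
json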

GenerateTableFetch

Generates SQL select queries that fetch "pages" of rows from a table. The partition size property, along with the table’s row count, determine the size and number of pages and generated FlowFiles. In addition, incremental fetching can be achieved by setting Maximum-Value Columns, which causes the processor to track the columns' maximum values, thus only fetching rows whose columns' values exceed the observed maximums. This processor is intended to be run on the Primary Node only.

This processor can accept incoming connections; the behavior of the processor differs depending on whether incoming connections are provided:

  • If no incoming connection(s) are specified, the processor will generate SQL queries on the specified processor schedule. Expression Language is supported for many fields, but no FlowFile attributes are available; the properties will be evaluated using the Environment/System properties.

  • If incoming connection(s) are specified and no FlowFile is available to a processor task, no work will be performed.

  • If incoming connection(s) are specified and a FlowFile is available to a processor task, the FlowFile’s attributes may be used in Expression Language for such fields as Table Name and others. However, the Max-Value Columns and Columns to Return fields must be empty or refer to columns that are available in each specified table.

Tags: sql, select, jdbc, query, database, fetch, generate

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain a connection to the database.

Database Type

Database Type for generating statements specific to a particular service or vendor. The Generic Type supports most cases but selecting a specific type enables optimal processing or additional features.

Database Dialect Service

Database Dialect Service for generating statements specific to a particular service or vendor.

Table Name

The name of the database table to be queried.

Columns to Return

A comma-separated list of column names to be used in the query. If your database requires special treatment of the names (quoting, e.g.), each name should include such treatment. If no column names are supplied, all columns in the specified table will be returned. NOTE: It is important to use consistent column names for a given table for incremental fetch to work properly.

Maximum-value Columns

A comma-separated list of column names. The processor will keep track of the maximum value for each column that has been returned since the processor started running. Using multiple columns implies an order to the column list, and each column’s values are expected to increase more slowly than the previous columns' values. Thus, using multiple columns implies a hierarchical structure of columns, which is usually used for partitioning tables. This processor can be used to retrieve only those rows that have been added/updated since the last retrieval. Note that some JDBC types such as bit/boolean are not conducive to maintaining maximum value, so columns of these types should not be listed in this property, and will result in error(s) during processing. If no columns are provided, all rows from the table will be considered, which could have a performance impact. NOTE: It is important to use consistent max-value column names for a given table for incremental fetch to work properly.

Max Wait Time

The maximum amount of time allowed for a running SQL select query; zero means there is no limit. A max time of less than 1 second will be treated as zero.

Partition Size

The number of result rows to be fetched by each generated SQL statement. The total number of rows in the table divided by the partition size gives the number of SQL statements (i.e. FlowFiles) generated. A value of zero indicates that a single FlowFile is to be generated whose SQL statement will fetch all rows in the table.

Column for Value Partitioning

The name of a column whose values will be used for partitioning. The default behavior is to use row numbers on the result set for partitioning into 'pages' to be fetched from the database, using an offset/limit strategy. However, for certain databases, it can be more efficient under the right circumstances to use the column values themselves to define the 'pages'. For best performance, this property should only be used when the default queries are not performing well, when there is no maximum-value column (or a single maximum-value column whose type can be coerced to a long integer, i.e. not date or timestamp), and when the column values are evenly distributed and not sparse.

Additional WHERE clause

A custom clause to be added in the WHERE condition when building SQL queries.

Custom ORDER BY Column

The name of a column to be used for ordering the results if Max-Value Columns are not provided and partitioning is enabled. This property is ignored if either Max-Value Columns is set or Partition Size = 0. NOTE: If neither Max-Value Columns nor Custom ORDER BY Column is set, then depending on the database/driver, the processor may report an error and/or the generated SQL may result in missing and/or duplicate rows. This is because without an explicit ordering, fetching each partition is done using an arbitrary ordering.

Output Empty FlowFile on Zero Results

Depending on the specified properties, an execution of this processor may not result in any SQL statements generated. When this property is true, an empty FlowFile will be generated (having the parent of the incoming FlowFile if present) and transferred to the 'success' relationship. When this property is false, no output FlowFiles will be generated.

Dynamic Properties

initial.maxvalue.<max_value_column>

Specifies an initial max value for max value columns. Properties should be added in the format initial.maxvalue.<max_value_column>. This value is only used the first time the table is accessed (when a Maximum Value Column is specified). In the case of incoming connections, the value is only used the first time for each table specified in the FlowFiles.

Relationships

  • success: Successfully created FlowFile from SQL query result set.

  • failure: This relationship is only used when SQL query execution (using an incoming FlowFile) failed. The incoming FlowFile will be penalized and routed to this relationship. If no incoming connection(s) are specified, this relationship is unused.

Writes Attributes

  • generatetablefetch.sql.error: If the processor has incoming connections, and processing an incoming FlowFile causes a SQL Exception, the FlowFile is routed to failure and this attribute is set to the exception message.

  • generatetablefetch.tableName: The name of the database table to be queried.

  • generatetablefetch.columnNames: The comma-separated list of column names used in the query.

  • generatetablefetch.whereClause: Where clause used in the query to get the expected rows.

  • generatetablefetch.maxColumnNames: The comma-separated list of column names used to keep track of data that has been returned since the processor started running.

  • generatetablefetch.limit: The number of result rows to be fetched by the SQL statement.

  • generatetablefetch.offset: Offset to be used to retrieve the corresponding partition.

  • fragment.identifier: All FlowFiles generated from the same query result set will have the same value for the fragment.identifier attribute. This can then be used to correlate the results.

  • fragment.count: This is the total number of FlowFiles produced by a single ResultSet. This can be used in conjunction with the fragment.identifier attribute in order to know how many FlowFiles belonged to the same incoming ResultSet.

  • fragment.index: This is the position of this FlowFile in the list of outgoing FlowFiles that were all generated from the same execution. This can be used in conjunction with the fragment.identifier attribute to know which FlowFiles originated from the same execution and in what order FlowFiles were produced

Stateful

Scope: Cluster

After performing a query on the specified table, the maximum values for the specified column(s) will be retained for use in future executions of the query. This allows the Processor to fetch only those records that have max values greater than the retained values. This can be used for incremental fetching, fetching of newly added rows, etc. To clear the maximum values, clear the state of the processor per the State Management documentation

Input Requirement

This component allows an incoming relationship.

Additional Details

GenerateTableFetch uses its properties and the specified database connection to generate FlowFiles containing SQL statements that can be used to fetch “pages” (aka “partitions”) of data from a table. GenerateTableFetch executes a query to the database to determine the current row count and maximum value, and if Maximum Value Columns are specified, will collect the count of rows whose values for the Maximum Value Columns are larger than those last observed by GenerateTableFetch. This allows for incremental fetching of “new” rows, rather than generating SQL to fetch the entire table each time. If no Maximum Value Columns are set, then the processor will generate SQL to fetch the entire table each time.

In order to generate SQL that will fetch pages/partitions of data, by default GenerateTableFetch will generate SQL that orders the data based on the Maximum Value Columns (if present) and utilize the row numbers of the result set to determine each page. For example if the Maximum Value Column is an integer “id” and the partition size is 10, then the SQL for the first page might be “SELECT * FROM myTable LIMIT 10” and the second page might be “SELECT * FROM myTable OFFSET 10 LIMIT 10”, and so on.

Ordering the data can be an expensive operation depending on the database, the number of rows, etc. Alternatively, it is possible to specify a column whose values will be used to determine the pages, using the Column for Value Partitioning property. If set, GenerateTableFetch will determine the minimum and maximum values for the column, and uses the minimum value as the initial offset. The SQL to fetch a page is then based on this initial offset and the total difference in values (i.e. maximum - minimum) divided by the page size. For example, if the column “id” is used for value partitioning, and the column contains values 100 to 200, then with a page size of 10 the SQL to fetch the first page might be “SELECT * FROM myTable WHERE id >= 100 AND id < 110” and the second page might be “SELECT * FROM myTable WHERE id >= 110 AND id < 120”, and so on.

It is important that the Column for Value Partitioning be set to a column whose type can be coerced to a long integer ( i.e. not date or timestamp), and that the column values are evenly distributed and not sparse, for best performance. As a counterexample to the above, consider a column “id” whose values are 100, 2000, and 30000. If the Partition Size is 100, then the column values are relatively sparse, so the SQL for the “second page” (see above example) will return zero rows, and so will every page until the value in the query becomes “id >= 2000”. Another counterexample is when the values are not uniformly distributed. Consider a column “id” with values 100, 200, 201, 202, … 299. Then the SQL for the first page (see above example) will return one row with value id = 100, and the second page will return 100 rows with values 200 … 299. This can cause inconsistent processing times downstream, as the pages may contain a very different number of rows. For these reasons it is recommended to use a Column for Value Partitioning that is sufficiently dense (not sparse) and fairly evenly distributed.

GeoEnrichIP

Looks up geolocation information for an IP address and adds the geo information to FlowFile attributes. The geo data is provided as a MaxMind database. The attribute that contains the IP address to lookup is provided by the 'IP Address Attribute' property. If the name of the attribute provided is 'X', then the attributes added by enrichment will take the form X.geo.<fieldName>

Tags: geo, enrich, ip, maxmind

Properties

MaxMind Database File

Path to Maxmind IP Enrichment Database File

IP Address Attribute

The name of an attribute whose value is a dotted decimal IP address for which enrichment should occur

Log Level

The Log Level to use when an IP is not found in the database. Accepted values: INFO, DEBUG, WARN, ERROR.

Relationships

  • found: Where to route flow files after successfully enriching attributes with data provided by database

  • not found: Where to route flow files after unsuccessfully enriching attributes because no data was found

Writes Attributes

  • X.geo.lookup.micros: The number of microseconds that the geo lookup took

  • X.geo.city: The city identified for the IP address

  • X.geo.accuracy: The accuracy radius if provided by the database (in Kilometers)

  • X.geo.latitude: The latitude identified for this IP address

  • X.geo.longitude: The longitude identified for this IP address

  • X.geo.subdivision.N: Each subdivision that is identified for this IP address is added with a one-up number appended to the attribute name, starting with 0

  • X.geo.subdivision.isocode.N: The ISO code for the subdivision that is identified by X.geo.subdivision.N

  • X.geo.country: The country identified for this IP address

  • X.geo.country.isocode: The ISO Code for the country identified

  • X.geo.postalcode: The postal code for the country identified

Input Requirement

This component requires an incoming relationship.
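
A rough sketch of the resulting attribute names, assuming the 'IP Address Attribute' property is set to a hypothetical attribute named 'ip' (the values shown are invented):

{
  "ip": "203.0.113.10",
  "ip.geo.city": "My City",
  "ip.geo.country": "United States",
  "ip.geo.country.isocode": "US",
  "ip.geo.latitude": "38.0",
  "ip.geo.longitude": "-97.0"
}
json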

GeoEnrichIPRecord

Looks up geolocation information for an IP address and adds the geo information to FlowFile attributes. The geo data is provided as a MaxMind database. This version uses the NiFi Record API to allow large scale enrichment of record-oriented data sets. Each field provided by the MaxMind database can be directed to a field of the user’s choosing by providing a record path for that field configuration.

Tags: geo, enrich, ip, maxmind, record

Properties

MaxMind Database File

Path to Maxmind IP Enrichment Database File

Record Reader

Record reader service to use for reading the flowfile contents.

Record Writer

Record writer service to use for enriching the flowfile contents.

Separate Enriched From Not Enriched

Separate records that have been enriched from ones that have not. Default behavior is to send everything to the found relationship if even one record is enriched.

IP Address Record Path

The record path to retrieve the IP address for doing the lookup.

City Record Path

Record path for putting the city identified for the IP address

Latitude Record Path

Record path for putting the latitude identified for this IP address

Longitude Record Path

Record path for putting the longitude identified for this IP address

Country Record Path

Record path for putting the country identified for this IP address

Country ISO Code Record Path

Record path for putting the ISO Code for the country identified

Country Postal Code Record Path

Record path for putting the postal code for the country identified

Log Level

The Log Level to use when an IP is not found in the database. Accepted values: INFO, DEBUG, WARN, ERROR.

Relationships

  • found: Where to route flow files after successfully enriching attributes with data provided by database

  • not found: Where to route flow files after unsuccessfully enriching attributes because no data was found

  • original: The original input flowfile goes to this relationship regardless of whether the content was enriched or not.

Input Requirement

This component requires an incoming relationship.

GeohashRecord

A record-based processor that encodes and decodes Geohashes from and to latitude/longitude coordinates.

Tags: geo, geohash, record

Properties

Mode

Specifies whether to encode latitude/longitude to geohash or decode geohash to latitude/longitude

Record Reader

Specifies the record reader service to use for reading incoming data

Record Writer

Specifies the record writer service to use for writing data

Routing Strategy

Specifies how to route flowfiles after encoding or decoding has been performed. SKIP will enrich those records that can be enriched and skip the rest. The SKIP strategy will route a flowfile to failure only if unable to parse the data. Otherwise, it will route the enriched flowfile to success, and the original input to original. SPLIT will separate the records that have been enriched from those that have not and send them to matched, while unenriched records will be sent to unmatched; the original input flowfile will be sent to original. The SPLIT strategy will route a flowfile to failure only if unable to parse the data. REQUIRE will route a flowfile to success only if all of its records are enriched, and the original input will be sent to original. The REQUIRE strategy will route the original input flowfile to failure if any of its records cannot be enriched or parsed

Latitude Record Path

In the ENCODE mode, this property specifies the record path to retrieve the latitude values. Latitude values should be in the range of [-90, 90]; invalid values will be logged at warn level. In the DECODE mode, this property specifies the record path to put the latitude value

Longitude Record Path

In the ENCODE mode, this property specifies the record path to retrieve the longitude values; Longitude values should be in the range of [-180, 180]; invalid values will be logged at warn level. In the DECODE mode, this property specifies the record path to put the longitude value

Geohash Record Path

In the ENCODE mode, this property specifies the record path to put the geohash value; in the DECODE mode, this property specifies the record path to retrieve the geohash value

Geohash Format

In the ENCODE mode, this property specifies the desired format for encoding geohash; in the DECODE mode, this property specifies the format of geohash provided

Geohash Level

The integer precision level (1-12) desired for encoding the geohash

Relationships

  • success: Flowfiles that are successfully encoded or decoded will be routed to success

  • failure: Flowfiles that cannot be encoded or decoded will be routed to failure

  • original: The original input flowfile will be sent to this relationship

Writes Attributes

  • mime.type: The MIME type indicated by the record writer

  • record.count: The number of records in the resulting flow file

Input Requirement

This component requires an incoming relationship.

Additional Details

Overview

A Geohash value corresponds to a specific area with pre-defined granularity and is widely used in identifying, representing and indexing geospatial objects. This GeohashRecord processor provides the ability to encode and decode Geohashes with desired format and precision.

Formats supported
  • BASE32: The most commonly used alphanumeric version. It is compact and more human-readable by discarding some letters (such as “a” and “o”, “i” and “l”) that might cause confusion.

  • BINARY: This format is generated by directly interleaving latitude and longitude binary strings. The even bits in the binary strings correspond to the longitude, while the odd digits correspond to the latitude.

  • LONG: Although this 64-bit number format is not human-readable, it can be calculated very fast and is more efficient.

Precision supported

In ENCODE mode, users specify the desired precision level, which should be an integer number between 1 and 12. A greater level will generate a longer Geohash with higher precision.

In DECODE mode, users are not asked to provide a precision level because this information is contained in the length of Geohash values given.
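
As a rough illustration (hypothetical record paths; the coordinates and geohash shown are approximate), encoding a record in BASE32 format at precision level 11 might look like this:

{
  "input": { "latitude": 57.64911, "longitude": 10.40744 },
  "output": { "latitude": 57.64911, "longitude": 10.40744, "geohash": "u4pruydqqvj" }
}
json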

GetAsanaObject

This processor collects data from Asana

Tags: asana, source, ingest

Properties

Asana Client Service

Specify which controller service to use for accessing Asana.

Distributed Cache Service

Cache service used to store fingerprints of the fetched items. The fingerprints from the last successful query are stored in order to enable incremental loading and change detection.

Object Type

Specify what kind of objects to be collected from Asana

Project Name

Fetch only objects in this project. Case sensitive.

Section Name

Fetch only objects in this section. Case sensitive.

Team

Team name. Case sensitive.

Tag

Fetch only objects having this tag. Case sensitive.

Output Batch Size

The number of items batched together in a single Flow File. If set to 1 (default), then each item is transferred in a separate Flow File and each will have an asana.gid attribute, to help identify the fetched item on the server side, if needed. If the batch size is greater than 1, then the specified number of items are batched together in a single Flow File as a Json array, and the Flow Files won’t have the asana.gid attribute.

Relationships

  • new: Newly collected objects are routed to this relationship.

  • removed: Notifications about deleted objects are routed to this relationship. Flow files will not have any payload. The IDs of the resources that no longer exist are carried by the asana.gid attribute of the generated FlowFiles.

  • updated: Objects that have already been collected earlier, but were updated since, are routed to this relationship.

Writes Attributes

  • asana.gid: Global ID of the object in Asana.

Input Requirement

This component does not allow an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

Additional Details

Description

This processor collects various objects (e.g. tasks, comments, etc.) from Asana via the specified AsanaClientService. When the processor is started for the first time with a given configuration, it collects each of the objects matching the user-specified criteria and emits a FlowFile for each of them on the NEW relationship. Then, it polls Asana at the frequency of the configured Run Schedule and detects changes by comparing the object fingerprints. When there are updates, it emits them through the UPDATED and REMOVED relationships, respectively.

FlowFile contents & attributes

Each emitted FlowFile contains the Json representation of the fetched Asana object. These can be processed further via the respective processors that accept text data in this format. The FlowFiles emitted from the REMOVED relationship have no content, because the actual data is not stored in the processor, and so there is no way to retrieve the deleted content.

Each FlowFile, regardless of which relationship it was emitted from, has an asana.gid attribute set, which contains the ID of the object in Asana. These IDs are globally unique within the Asana instance, regardless of what type of object they were assigned to. In the case of Events, these IDs are generated by the client, because Asana does not keep track of these objects.

Object fingerprints

These are used only for content change detection.

Fingerprints are generally calculated by applying an SHA-512 algorithm on the retrieved object. In the case of immutable objects, like Attachments, these fingerprints are static, so updates (which are impossible anyway) are not detected. In the case of Projects and Tasks, where the last modification time is available, these timestamps are stored as fingerprints.

Batch size

By default, this processor emits each fetched object from Asana in a separate FlowFile. This is usually fine for a workspace with low traffic, which generates data at a low rate. For workspaces with a high volume of traffic, it is advisable to set the batch size to a reasonably high value for better performance. With this value set to something other than the default (1), the processor will emit FlowFiles that have multiple items batched together in a Json array, but in exchange, without the asana.gid attribute set.

Configuring filters, filtering by name

When collecting some objects, like Project Events, Tasks, and Team Members, the processor requires (or allows) defining filters. For example: if you would like to collect Tasks, then you need to define the project from which you would like to collect them.

In these cases, when the filters refer to some parent object, you need to provide its name in the configuration in a case-sensitive manner. Another important note to keep in mind: Asana lets users create multiple objects with the same name. For example, you can create two projects named ‘My project’. But when you need to refer to such a project by its name, it is impossible to figure out which ‘My project’ you intended to refer to, so these situations should be avoided. In such cases, this processor picks the first one returned by Asana when listing them. This is not random, but the ordering is not guaranteed.

GetAwsPollyJobStatus

Retrieves the current status of an AWS Polly job.

Tags: Amazon, AWS, ML, Machine Learning, Polly

Properties

AWS Task ID

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

Communications Timeout

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: Job successfully finished. FlowFile will be routed to this relation.

  • failure: The job failed, the original FlowFile will be routed to this relationship.

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

  • running: The job is currently still being processed

Writes Attributes

  • PollyS3OutputBucket: The bucket name where polly output will be located.

  • filename: Object key of polly output.

  • outputLocation: S3 path-style output location of the result.

Input Requirement

This component allows an incoming relationship.

See Also

Additional Details

GetAwsPollyJobStatus

Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products. Polly’s Text-to-Speech (TTS) service uses advanced deep learning technologies to synthesize natural sounding human speech. With dozens of lifelike voices across a broad set of languages, you can build speech-enabled applications that work in many different countries.

Usage

GetAwsPollyJobStatus Processor is designed to periodically check Polly job status. This processor should be used together with the StartAwsPollyJob Processor. If the job has successfully finished, it will populate the outputLocation attribute of the flow file, where you can find the output of the Polly job. In case of an error, the failure.reason attribute will be populated with the details.

GetAwsTextractJobStatus

Retrieves the current status of an AWS Textract job.

Tags: Amazon, AWS, ML, Machine Learning, Textract

Properties

AWS Task ID

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

Communications Timeout

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Textract Type

Supported values: "Document Analysis", "Document Text Detection", "Expense Analysis"

Relationships

  • success: Job successfully finished. FlowFile will be routed to this relation.

  • failure: The job failed, the original FlowFile will be routed to this relationship.

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

  • running: The job is currently still being processed

  • throttled: Retrieving results failed for some reason, but the issue is likely to resolve on its own, such as Provisioned Throughput Exceeded or a Throttling failure. It is generally expected to retry this relationship.

Input Requirement

This component allows an incoming relationship.

Additional Details

GetAwsTextractJobStatus

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

Usage

GetAwsTextractJobStatus Processor is designed to periodically check Textract job status. This processor should be used together with the StartAwsTextractJob Processor. The FlowFile will contain the serialized Textract response, which contains the result and additional metadata, as documented in the AWS Textract Reference.

GetAwsTranscribeJobStatus

Retrieves the current status of an AWS Transcribe job.

Tags: Amazon, AWS, ML, Machine Learning, Transcribe

Properties

AWS Task ID

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

Communications Timeout

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: Job successfully finished. FlowFile will be routed to this relation.

  • failure: The job failed, the original FlowFile will be routed to this relationship.

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

  • running: The job is currently still being processed

  • throttled: Retrieving results failed for some reason, but the issue is likely to resolve on its own, such as Provisioned Throughput Exceeded or a Throttling failure. It is generally expected to retry this relationship.

Writes Attributes

  • outputLocation: S3 path-style output location of the result.

Input Requirement

This component allows an incoming relationship.

Additional Details

Automatically convert speech to text

Usage

GetAwsTranscribeJobStatus Processor is designed to periodically check Transcribe job status. This processor should be used together with the Transcribe Processor. The FlowFile will contain the serialized Transcribe response, and the processor will populate the path of the output in the outputLocation attribute.

GetAwsTranslateJobStatus

Retrieves the current status of an AWS Translate job.

Tags: Amazon, AWS, ML, Machine Learning, Translate

Properties

AWS Task ID

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

Communications Timeout

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: Job successfully finished. FlowFile will be routed to this relation.

  • failure: The job failed, the original FlowFile will be routed to this relationship.

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

  • running: The job is currently still being processed

  • throttled: Retrieving results failed for some reason, but the issue is likely to resolve on its own, such as Provisioned Throughput Exceeded or a Throttling failure. It is generally expected to retry this relationship.

Writes Attributes

  • outputLocation: S3 path-style output location of the result.

Input Requirement

This component allows an incoming relationship.

Additional Details

GetAwsTranslateJobStatus

Amazon Translate is a neural machine translation service for translating text to and from English across a breadth of supported languages. Powered by deep-learning technologies, Amazon Translate delivers fast, high-quality, and affordable language translation. It provides a managed, continually trained solution, so you can easily translate company and user-authored content or build applications that require support across multiple languages. The machine translation engine has been trained on a wide variety of content across different domains to produce quality translations that serve any industry need.

Usage

GetAwsTranslateJobStatus Processor is designed to periodically check Translate job status. This processor should be used together with the Translate Processor. If the job has successfully finished, it will populate the outputLocation attribute of the flow file, where you can find the output of the Translation job.

GetAzureEventHub

Receives messages from Microsoft Azure Event Hubs without reliable checkpoint tracking. In clustered environment, GetAzureEventHub processor instances work independently and all cluster nodes process all messages (unless running the processor in Primary Only mode). ConsumeAzureEventHub offers the recommended approach to receiving messages from Azure Event Hubs. This processor creates a thread pool for connections to Azure Event Hubs.

Tags: azure, microsoft, cloud, eventhub, events, streaming, streams

Properties

Event Hub Namespace

Namespace of Azure Event Hubs prefixed to Service Bus Endpoint domain

Event Hub Name

Name of Azure Event Hubs source

Service Bus Endpoint

To support namespaces not in the default windows.net domain.

Transport Type

Advanced Message Queuing Protocol Transport Type for communication with Azure Event Hubs

Shared Access Policy Name

The name of the shared access policy. This policy must have Listen claims.

Shared Access Policy Key

The key of the shared access policy. Either the primary or the secondary key can be used.

Use Azure Managed Identity

Choose whether or not to use the managed identity of Azure VM/VMSS

Consumer Group

The name of the consumer group to use when pulling events

Message Enqueue Time

A timestamp (ISO-8601 Instant) formatted as YYYY-MM-DDThh:mm:ss.sssZ (e.g. 2016-01-01T01:01:01.000Z) indicating the enqueue time from which to start reading messages from the Event Hub

Partition Receiver Fetch Size

The number of events that a receiver should fetch from an Event Hubs partition before returning. The default is 100

Partition Receiver Timeout

The amount of time in milliseconds a Partition Receiver should wait to receive the Fetch Size before returning. The default is 60000

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: Any FlowFile that is successfully received from the event hub will be transferred to this Relationship.

Writes Attributes

  • eventhub.enqueued.timestamp: The time (in milliseconds since epoch, UTC) at which the message was enqueued in the event hub

  • eventhub.offset: The offset into the partition at which the message was stored

  • eventhub.sequence: The Azure sequence number associated with the message

  • eventhub.name: The name of the event hub from which the message was pulled

  • eventhub.partition: The name of the event hub partition from which the message was pulled

  • eventhub.property.*: The application properties of this message. IE: 'application' would be 'eventhub.property.application'

Input Requirement

This component does not allow an incoming relationship.

GetAzureQueueStorage_v12

Retrieves the messages from an Azure Queue Storage. The retrieved messages will be deleted from the queue by default. If the requirement is to consume messages without deleting them, set 'Auto Delete Messages' to 'false'. Note: duplicates may be received in situations where a message is retrieved but cannot be deleted from the queue due to unexpected conditions.

Tags: azure, queue, microsoft, storage, dequeue, cloud

Properties

Queue Name

Name of the Azure Storage Queue

Endpoint Suffix

Storage accounts in public Azure always use a common FQDN suffix. Override this endpoint suffix with a different suffix in certain circumstances (like Azure Stack or non-public Azure regions).

Credentials Service

Controller Service used to obtain Azure Storage Credentials.

Auto Delete Messages

Specifies whether the received message is to be automatically deleted from the queue.

Message Batch Size

The number of messages to be retrieved from the queue.

Visibility Timeout

The duration during which the retrieved message should be invisible to other consumers.

Request Timeout

The timeout for read or write requests to Azure Queue Storage. Defaults to 1 second.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Relationships

  • success: All successfully processed FlowFiles are routed to this relationship

Writes Attributes

  • azure.queue.uri: The absolute URI of the configured Azure Queue Storage

  • azure.queue.insertionTime: The time when the message was inserted into the queue storage

  • azure.queue.expirationTime: The time when the message will expire from the queue storage

  • azure.queue.messageId: The ID of the retrieved message

  • azure.queue.popReceipt: The pop receipt of the retrieved message

Input Requirement

This component does not allow an incoming relationship.

GetDynamoDB

Retrieves a document from DynamoDB based on hash and range key. The key can be a string or a number. For any get request all the primary keys are required (hash, or hash and range, depending on the table keys). A JSON Document ('Map') attribute of the DynamoDB item is read into the content of the FlowFile.

Tags: Amazon, DynamoDB, AWS, Get, Fetch

Properties

Table Name

The DynamoDB table name

Region

The AWS Region to connect to.

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Json Document attribute

The Json document to be retrieved from the dynamodb item ('s' type in the schema)

Hash Key Name

The hash key name of the item

Range Key Name

The range key name of the item

Hash Key Value

The hash key value of the item

Range Key Value

The range key value of the item

Hash Key Value Type

The hash key value type of the item

Range Key Value Type

The range key value type of the item

Batch items for each request (between 1 and 50)

The items to be retrieved in one batch

Communications Timeout

The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other DynamoDB-compatible endpoints.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • not found: FlowFiles are routed to the not found relationship if the key is not found in the table

  • unprocessed: FlowFiles are routed to unprocessed relationship when DynamoDB is not able to process all the items in the request. Typical reasons are insufficient table throughput capacity and exceeding the maximum bytes per request. Unprocessed FlowFiles can be retried with a new request.

Reads Attributes

  • dynamodb.item.hash.key.value: The item's hash key value

  • dynamodb.item.range.key.value: The item's range key value

Writes Attributes

  • dynamodb.key.error.unprocessed: DynamoDB unprocessed keys

  • dynmodb.range.key.value.error: DynamoDB range key error

  • dynamodb.key.error.not.found: DynamoDB key not found

  • dynamodb.error.exception.message: DynamoDB exception message

  • dynamodb.error.code: DynamoDB error code

  • dynamodb.error.message: DynamoDB error message

  • dynamodb.error.service: DynamoDB error service

  • dynamodb.error.retryable: DynamoDB error is retryable

  • dynamodb.error.request.id: DynamoDB error request id

  • dynamodb.error.status.code: DynamoDB status code

Input Requirement

This component requires an incoming relationship.

GetElasticsearch

Elasticsearch get processor that uses the official Elastic REST client libraries to fetch a single document from Elasticsearch by _id. Note that the full body of the document will be read into memory before being written to a FlowFile for transfer.

Tags: json, elasticsearch, elasticsearch5, elasticsearch6, elasticsearch7, elasticsearch8, put, index, record

Properties

Document Id

The _id of the document to retrieve.

Index

The name of the index to use.

Type

The type of this document (used by Elasticsearch for indexing and searching).

Destination

Indicates whether the retrieved document is written to the FlowFile content or a FlowFile attribute.

Attribute Name

The name of the FlowFile attribute to use for the retrieved document output.

Client Service

An Elasticsearch client service to use for running queries.

Dynamic Properties

The name of a URL query parameter to add

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing.

Relationships

  • failure: All flowfiles that fail for reasons unrelated to server availability go to this relationship.

  • document: Fetched documents are routed to this relationship.

  • not_found: A FlowFile is routed to this relationship if the specified document does not exist in the Elasticsearch cluster.

  • retry: All flowfiles that fail due to server/cluster availability go to this relationship.

Writes Attributes

  • filename: The filename attribute is set to the document identifier

  • elasticsearch.index: The Elasticsearch index containing the document

  • elasticsearch.type: The Elasticsearch document type

  • elasticsearch.get.error: The error message provided by Elasticsearch if there is an error fetching the document.

Input Requirement

This component allows an incoming relationship.

GetFile

Creates FlowFiles from files in a directory. NiFi will ignore files it doesn’t have at least read permissions for.

Tags: local, files, filesystem, ingest, ingress, get, source, input

Properties

Input Directory

The input directory from which to pull files

File Filter

Only files whose names match the given regular expression will be picked up

Path Filter

When Recurse Subdirectories is true, then only subdirectories whose path matches the given regular expression will be scanned

Batch Size

The maximum number of files to pull in each invocation of the processor

Keep Source File

If true, the file is not deleted after it has been copied to the Content Repository; this causes the file to be picked up continually and is useful for testing purposes. If the original is not kept, NiFi needs write permissions on the directory it is pulling from; otherwise it will ignore the file.

Recurse Subdirectories

Indicates whether or not to pull files from subdirectories

Polling Interval

Indicates how long to wait before performing a directory listing

Ignore Hidden Files

Indicates whether or not hidden files should be ignored

Minimum File Age

The minimum age that a file must be in order to be pulled; any file younger than this amount of time (according to last modification date) will be ignored

Maximum File Age

The maximum age that a file must be in order to be pulled; any file older than this amount of time (according to last modification date) will be ignored

Minimum File Size

The minimum size that a file must be in order to be pulled

Maximum File Size

The maximum size that a file can be in order to be pulled

Relationships

  • success: All files are routed to success

Writes Attributes

  • filename: The filename is set to the name of the file on disk

  • path: The path is set to the relative path of the file’s directory on disk. For example, if the <Input Directory> property is set to /tmp, files picked up from /tmp will have the path attribute set to ./. If the <Recurse Subdirectories> property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to abc/1/2/3

  • file.creationTime: The date and time that the file was created. May not work on all file systems

  • file.lastModifiedTime: The date and time that the file was last modified. May not work on all file systems

  • file.lastAccessTime: The date and time that the file was last accessed. May not work on all file systems

  • file.owner: The owner of the file. May not work on all file systems

  • file.group: The group owner of the file. May not work on all file systems

  • file.permissions: The read/write/execute permissions of the file. May not work on all file systems

  • absolute.path: The full/absolute path from where a file was picked up. The current 'path' attribute is still populated, but may be a relative path

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component does not allow an incoming relationship.

See Also

GetFileResource

This processor creates FlowFiles with the content of the configured File Resource. GetFileResource is useful for load testing, configuration, and simulation.

Tags: test, file, generate, load

Properties

File Resource

Location of the File Resource (Local File or URL). This file will be used as content of the generated FlowFiles.

MIME Type

Specifies the value to set for the [mime.type] attribute.

Dynamic Properties

Generated FlowFile attribute name

Specifies an attribute on generated FlowFiles defined by the Dynamic Property’s key and value.

Relationships

  • success:

Writes Attributes

  • mime.type: Sets the MIME type of the output if the 'MIME Type' property is set

  • Dynamic property key: Value for the corresponding dynamic property, if any is set

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component does not allow an incoming relationship.

GetFTP

Fetches files from an FTP Server and creates FlowFiles from them

Tags: FTP, get, retrieve, files, fetch, remote, ingest, source, input

Properties

Hostname

The fully qualified hostname or IP address of the remote system

Port

The port that the remote system is listening on for file transfers

Username

Username

Password

Password for the user account

Connection Mode

The FTP Connection Mode

Transfer Mode

The FTP Transfer Mode

Remote Path

The path on the remote system from which to pull or push files

File Filter Regex

Provides a Java Regular Expression for filtering Filenames; if a filter is supplied, only files whose names match that Regular Expression will be fetched

Path Filter Regex

When Search Recursively is true, then only subdirectories whose path matches the given Regular Expression will be scanned

Polling Interval

Determines how long to wait between fetching the listing for new files

Search Recursively

If true, will pull files from arbitrarily nested subdirectories; otherwise, will not traverse subdirectories

Follow symlink

If true, will pull files reached through symbolic links, including files in nested symlinked subdirectories; otherwise, symbolic links to files will not be read and symlinked subdirectories will not be traversed

Ignore Dotted Files

If true, files whose names begin with a dot (".") will be ignored

Delete Original

Determines whether or not the file is deleted from the remote system after it has been successfully transferred

Connection Timeout

Amount of time to wait before timing out while creating a connection

Data Timeout

When transferring a file between the local and remote system, this value specifies how long is allowed to elapse without any data being transferred between systems

Max Selects

The maximum number of files to pull in a single connection

Remote Poll Batch Size

The value specifies how many file paths to find in a given directory on the remote system when doing a file listing. This value generally should not need to be modified, but when polling against a remote system with a tremendous number of files it can be critical. Setting this value too high can result in very poor performance, and setting it too low can cause the flow to be slower than normal.

Use Natural Ordering

If true, will pull files in the order in which they are naturally listed; otherwise, the order in which the files will be pulled is not defined

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN, SOCKS + AuthN

Internal Buffer Size

Set the internal buffer size for buffered data streams

Use UTF-8 Encoding

Tells the client to use UTF-8 encoding when processing files and filenames. If set to true, the server must also support UTF-8 encoding.

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • filename: The filename is set to the name of the file on the remote server

  • path: The path is set to the path of the file’s directory on the remote server. For example, if the <Remote Path> property is set to /tmp, files picked up from /tmp will have the path attribute set to /tmp. If the <Search Recursively> property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to /tmp/abc/1/2/3

  • file.lastModifiedTime: The date and time that the source file was last modified

  • file.lastAccessTime: The date and time that the file was last accessed. May not work on all file systems

  • file.owner: The numeric owner id of the source file

  • file.group: The numeric group id of the source file

  • file.permissions: The read/write/execute permissions of the source file

  • absolute.path: The full/absolute path from where a file was picked up. The current 'path' attribute is still populated, but may be a relative path

Input Requirement

This component does not allow an incoming relationship.

See Also

GetGcpVisionAnnotateFilesOperationStatus

Retrieves the current status of a Google Vision operation.

Tags: Google, Cloud, Vision, Machine Learning

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

GCP Operation Key

The unique identifier of the Vision operation.

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

  • running: The job is currently still being processed

Reads Attributes

  • operationKey: A unique identifier of the operation designated by the Vision server.

Input Requirement

This component allows an incoming relationship.

Additional Details

Get Annotate Files Status
Usage

GetGcpVisionAnnotateFilesOperationStatus is designed to periodically check the statuses of file annotation operations. This processor should be used in tandem with the StartGcpVisionAnnotateFilesOperation processor. An outgoing FlowFile contains the raw response returned by the Vision server. This response is in JSON format and contains a Google storage reference where the result is located, as well as additional metadata, as described in the Google Vision API Reference documentation.

GetGcpVisionAnnotateImagesOperationStatus

Retrieves the current status of a Google Vision operation.

Tags: Google, Cloud, Vision, Machine Learning

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

GCP Operation Key

The unique identifier of the Vision operation.

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

  • running: The job is currently still being processed

Reads Attributes

  • operationKey: A unique identifier of the operation designated by the Vision server.

Input Requirement

This component allows an incoming relationship.

Additional Details

Google Cloud Vision - Get Annotate Images Status
Usage

GetGcpVisionAnnotateImagesOperationStatus is designed to periodically check the statuses of image annotation operations. This processor should be used in tandem with the StartGcpVisionAnnotateImagesOperation processor. An outgoing FlowFile contains the raw response returned by the Vision server. This response is in JSON format and contains a Google storage reference where the result is located, as well as additional metadata, as described in the Google Vision API Reference documentation.

GetHDFS

Fetch files from Hadoop Distributed File System (HDFS) into FlowFiles. This Processor will delete the file from HDFS after fetching it.

Tags: hadoop, HCFS, HDFS, get, fetch, ingest, source, filesystem

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Directory

The HDFS directory from which files should be read

Recurse Subdirectories

Indicates whether to pull files from subdirectories of the HDFS directory

Keep Source File

Determines whether the file is kept in HDFS after it has been successfully transferred. If true, the file is not deleted and will be fetched repeatedly; this is intended for testing only.

File Filter Regex

A Java Regular Expression for filtering Filenames; if a filter is supplied then only files whose names match that Regular Expression will be fetched, otherwise all files will be fetched

Filter Match Name Only

If true then File Filter Regex will match on just the filename, otherwise subdirectory names will be included with filename in the regex comparison

Ignore Dotted Files

If true, files whose names begin with a dot (".") will be ignored

Minimum File Age

The minimum age that a file must be in order to be pulled; any file younger than this amount of time (based on last modification date) will be ignored

Maximum File Age

The maximum age that a file must be in order to be pulled; any file older than this amount of time (based on last modification date) will be ignored

Polling Interval

Indicates how long to wait between performing directory listings

Batch Size

The maximum number of files to pull in each iteration, based on run schedule.

IO Buffer Size

Amount of memory to use to buffer file contents during IO. This overrides the Hadoop Configuration

Compression codec

Relationships

  • success: All files retrieved from HDFS are transferred to this relationship

Writes Attributes

  • filename: The name of the file that was read from HDFS.

  • path: The path is set to the relative path of the file’s directory on HDFS. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "./". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "abc/1/2/3".

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component does not allow an incoming relationship.

See Also

GetHDFSEvents

This processor polls the notification events provided by the HdfsAdmin API. Since this uses the HdfsAdmin APIs, it is required to run as an HDFS super user. Currently there are six types of events (append, close, create, metadata, rename, and unlink). Please see the org.apache.hadoop.hdfs.inotify.Event documentation for full explanations of each event. This processor will poll for new events based on a defined duration. For each event received, a new flow file will be created with the expected attributes and the event itself serialized to JSON and written to the flow file’s content. For example, if event.type is APPEND then the content of the flow file will contain JSON describing the append event. If successful, the flow files are sent to the 'success' relationship. Be careful about where the generated flow files are stored: if they are stored in one of the processor’s watch directories, there will be a never-ending flow of events. It is also important to be aware that this processor must consume all events; any filtering must happen within the processor, because the HDFS admin event notification API does not support filtering.

Tags: hadoop, events, inotify, notifications, filesystem

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Poll Duration

The time before the polling method returns with the next batch of events if they exist. It may exceed this amount of time by up to the time required for an RPC to the NameNode.

HDFS Path to Watch

The HDFS path to get event notifications for. This property accepts both expression language and regular expressions. This will be evaluated during the OnScheduled phase.

Ignore Hidden Files

If true and the final component of the path associated with a given event starts with a '.' then that event will not be processed.

Event Types to Filter On

A comma-separated list of event types to process. Valid event types are: append, close, create, metadata, rename, and unlink. Case does not matter.

IOException Retries During Event Polling

According to the HDFS admin API for event polling it is good to retry at least a few times. This number defines how many times the poll will be retried if it throws an IOException.

Relationships

  • success: A flow file with updated information about a specific event will be sent to this relationship.

Writes Attributes

  • mime.type: This is always application/json.

  • hdfs.inotify.event.type: This will specify the specific HDFS notification event type. Currently there are six types of events (append, close, create, metadata, rename, and unlink).

  • hdfs.inotify.event.path: The specific path that the event is tied to.

Stateful

Scope: Cluster

The last used transaction id is stored so that, when the processor is stopped and restarted, event polling can resume from where it left off.

Input Requirement

This component does not allow an incoming relationship.

GetHDFSFileInfo

Retrieves a listing of files and directories from HDFS. This processor creates FlowFiles that represent HDFS files or directories with relevant information. The main purpose of this processor is to provide functionality similar to the HDFS client, i.e. count, du, ls, test, etc. Unlike ListHDFS, this processor is stateless, supports incoming connections and provides information at a directory level.

Tags: hadoop, HCFS, HDFS, get, list, ingest, source, filesystem

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Full path

A directory to start listing from, or a file’s full path.

Recurse Subdirectories

Indicates whether to list files from subdirectories of the HDFS directory

Directory Filter

Regex. Only directories whose names match the given regular expression will be picked up. If not provided, no directory filter is applied (note the performance implications).

File Filter

Regex. Only files whose names match the given regular expression will be picked up. If not provided, no file filter is applied (note the performance implications).

Exclude Files

Regex. Files whose names match the given regular expression will not be picked up. If not provided, no files are excluded (note the performance implications).

Ignore Dotted Directories

If true, directories whose names begin with a dot (".") will be ignored

Ignore Dotted Files

If true, files whose names begin with a dot (".") will be ignored

Group Results

Groups HDFS objects

Batch Size

Number of records to put into an output flowfile when 'Destination' is set to 'Content' and 'Group Results' is set to 'None'

Destination

Sets the destination for the results. When set to 'Content', FlowFile attributes won’t be used for storing results.

Relationships

  • success: All successfully generated FlowFiles are transferred to this relationship

  • failure: All failed attempts to access HDFS will be routed to this relationship

  • not found: If no objects are found, the original FlowFile is transferred to this relationship

  • original: Original FlowFiles are transferred to this relationship

Writes Attributes

  • hdfs.objectName: The name of the file/dir found on HDFS.

  • hdfs.path: The path is set to the absolute path of the object’s parent directory on HDFS. For example, if an object is a directory 'foo', under directory '/bar' then 'hdfs.objectName' will have value 'foo', and 'hdfs.path' will be '/bar'

  • hdfs.type: The type of an object. Possible values: directory, file, link

  • hdfs.owner: The user that owns the object in HDFS

  • hdfs.group: The group that owns the object in HDFS

  • hdfs.lastModified: The timestamp of when the object in HDFS was last modified, as milliseconds since midnight Jan 1, 1970 UTC

  • hdfs.length: In case of files: the number of bytes in the file in HDFS. In case of directories: the storage space consumed by the directory.

  • hdfs.count.files: In case of type='directory', represents the total count of files under this directory. Not populated for other types of HDFS objects.

  • hdfs.count.dirs: In case of type='directory', represents the total count of directories under this directory (including itself). Not populated for other types of HDFS objects.

  • hdfs.replication: The number of HDFS replicas for the file

  • hdfs.permissions: The permissions for the object in HDFS. This is formatted as 3 characters for the owner, 3 for the group, and 3 for other users. For example rw-rw-r--

  • hdfs.status: The status contains a comma separated list of file/dir paths which couldn’t be listed/accessed. Status won’t be set if no errors occurred.

  • hdfs.full.tree: When destination is 'attribute', will be populated with the full tree of the HDFS directory in JSON format. WARNING: if the scan finds thousands or millions of objects, huge attribute values could impact the flow file repository and GC/heap usage. Use the content destination for such cases.

Input Requirement

This component allows an incoming relationship.

GetHDFSSequenceFile

Fetch sequence files from Hadoop Distributed File System (HDFS) into FlowFiles

Tags: hadoop, HCFS, HDFS, get, fetch, ingest, source, sequence file

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Directory

The HDFS directory from which files should be read

Recurse Subdirectories

Indicates whether to pull files from subdirectories of the HDFS directory

Keep Source File

Determines whether the file is kept in HDFS after it has been successfully transferred. If true, the file is not deleted and will be fetched repeatedly; this is intended for testing only.

File Filter Regex

A Java Regular Expression for filtering Filenames; if a filter is supplied then only files whose names match that Regular Expression will be fetched, otherwise all files will be fetched

Filter Match Name Only

If true then File Filter Regex will match on just the filename, otherwise subdirectory names will be included with filename in the regex comparison

Ignore Dotted Files

If true, files whose names begin with a dot (".") will be ignored

Minimum File Age

The minimum age that a file must be in order to be pulled; any file younger than this amount of time (based on last modification date) will be ignored

Maximum File Age

The maximum age that a file must be in order to be pulled; any file older than this amount of time (based on last modification date) will be ignored

Polling Interval

Indicates how long to wait between performing directory listings

Batch Size

The maximum number of files to pull in each iteration, based on run schedule.

IO Buffer Size

Amount of memory to use to buffer file contents during IO. This overrides the Hadoop Configuration

Compression codec

FlowFile Content

Indicate if the content is to be both the key and value of the Sequence File, or just the value.

Relationships

  • success: All files retrieved from HDFS are transferred to this relationship

Writes Attributes

  • filename: The name of the file that was read from HDFS.

  • path: The path is set to the relative path of the file’s directory on HDFS. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "./". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "abc/1/2/3".

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component does not allow an incoming relationship.

See Also

GetHubSpot

Retrieves JSON data from a private HubSpot application. This processor is intended to be run on the Primary Node only.

Tags: hubspot

Properties

Object Type

The HubSpot Object Type requested

Access Token

Access Token to authenticate requests

Result Limit

The maximum number of results to request for each invocation of the Processor

Incremental Loading

The processor can incrementally load the queried objects so that each object is queried exactly once. For each query, the processor queries objects within a time window where the objects were modified between the previous run time and the current time (optionally adjusted by the Incremental Delay property).

Incremental Delay

The ending timestamp of the time window will be adjusted earlier by the amount configured in this property. For example, with a property value of 10 seconds, an ending timestamp of 12:30:45 would be changed to 12:30:35. Set this property to avoid missing objects when the clocks of your local machine and the HubSpot servers are not in sync, and to protect against HubSpot’s mechanism that changes last updated timestamps after object creation.

Incremental Initial Start Time

This property specifies the start time that the processor applies when running the first request. The expected format is a UTC date-time such as '2011-12-03T10:15:30Z'

Web Client Service Provider

Controller service for HTTP client operations

Relationships

  • success: For FlowFiles created as a result of a successful HTTP request.

Writes Attributes

  • mime.type: Sets the MIME type to application/json

Stateful

Scope: Cluster

In case of incremental loading, the start and end timestamps of the last query time window are stored in the state. When the 'Result Limit' property is set, the paging cursor is saved after executing a request. Only the objects after the paging cursor will be retrieved. The maximum number of retrieved objects can be set in the 'Result Limit' property.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Authentication Methods

The processor works with HubSpot private applications. A HubSpot private app must be created (see HubSpot Private App Creation) in order to connect to HubSpot and make requests. Private App Access Tokens are the only authentication method that is currently supported.

Incremental Loading

HubSpot objects can be processed incrementally by NiFi. This means that only objects created or modified after the last run time of the processor are processed. The processor state can be reset in the context menu. The incremental loading is based on the objects' last modified time.

Paging

GetHubSpot supports both paging and incrementality at the same time. If the number of results exceeds the 'Result Limit', the remaining objects will be returned in the next processor run.

Due to the page handling mechanism of the HubSpot API, parallel deletions are not supported. Some objects may be omitted if any object is deleted between fetching two pages.
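
The window arithmetic described above (start where the previous run ended, pull the window end back by the Incremental Delay) can be sketched in a few lines. The snippet below is only an illustration of that logic; the function and variable names are invented for the example and are not part of the processor.

    from datetime import datetime, timedelta, timezone

    def next_window(previous_end, incremental_delay_seconds, initial_start):
        # First run: no stored end timestamp yet, so fall back to the
        # Incremental Initial Start Time property.
        start = previous_end if previous_end is not None else initial_start
        # Pull the window end back by the Incremental Delay to tolerate clock
        # skew and HubSpot's late adjustments of "last modified" timestamps.
        end = datetime.now(timezone.utc) - timedelta(seconds=incremental_delay_seconds)
        return start, end

    # Example: 10 second delay, first run starting from 2011-12-03T10:15:30Z
    start, end = next_window(None, 10,
                             datetime(2011, 12, 3, 10, 15, 30, tzinfo=timezone.utc))
    print(start.isoformat(), end.isoformat())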

GetMetro

Replaces incoming FlowFiles with matching FlowFiles on the "Metro Line", i.e. any queue connected to a PutMetro processor using the same Metro Line Controller as this processor. Incoming FlowFiles and FlowFiles on the Metro Line are matched based on the attribute set in the "Correlation Attribute Name" property. If no matching FlowFile can be found on the Metro Line and "Failure if not found" is unchecked, the incoming FlowFile is put back into the queue and penalized (see "Penalty Duration" in Settings).

Tags: virtimo, metro

Properties

Metro Controller

The processor uses this controller’s Metro Line to connect with PutMetro processors.

Correlation Attribute Name

The name of the attribute used to match and replace incoming FlowFiles with FlowFiles cached in the associated Metro Line.

Failure if not found

If checked, route an incoming FlowFile to the 'failure' relationship if no matching FlowFile is immediately found in the Metro Line. This should only ever be checked in conjunction with retries on the failure relationship, as a rendezvous attempt between PutMetro and GetMetro/MergeMetro may irregularly fail. If unchecked, the incoming FlowFile is put back in the queue and penalized (see 'Penalty Duration' in Settings).

Relationships

  • success: The FlowFile was successfully transferred via metro.

  • failure: The metro did not come.

Reads Attributes

  • <Correlation Attribute Name>: Use this property to specify which attribute should be evaluated for matching FlowFiles.

Input Requirement

This component requires an incoming relationship.

GetMongo

Creates FlowFiles from documents in MongoDB loaded by a user-specified query.

Tags: mongodb, read, get

Properties

Client Service

If configured, this property will use the assigned client service for connection pooling.

Mongo Database Name

The name of the database to use

Mongo Collection Name

The name of the collection to use

JSON Type

By default, MongoDB’s Java driver returns "extended JSON". Some of the features of this variant of JSON may cause problems for other JSON parsers that expect only standard JSON types and conventions. This configuration setting controls whether to use extended JSON or provide a clean view that conforms to standard JSON.

Pretty Print Results JSON

Choose whether or not to pretty print the JSON from the results of the query. Choosing 'True' can greatly increase the space requirements on disk depending on the complexity of the JSON document

Character Set

Specifies the character set of the document data.

Query

The selection criteria to do the lookup. If the field is left blank, it will look for input from an incoming connection from another processor to provide the query as a valid JSON document inside of the FlowFile’s body. If this field is left blank and a timer is enabled instead of an incoming connection, that will result in a full collection fetch using a "{}" query.

Query Output Attribute

If set, the query will be written to a specified attribute on the output flowfiles.

Projection

The fields to be returned from the documents in the result set; must be a valid BSON document

Sort

The fields by which to sort; must be a valid BSON document

Limit

The maximum number of elements to return

Batch Size

The number of elements to be returned from the server in one batch

Results Per FlowFile

How many results to put into a FlowFile at once. The whole body will be treated as a JSON array of results.

Date Format

The date format string to use for formatting Date fields that are returned from Mongo. It is only applied when the JSON output format is set to Standard JSON.

Send Empty Result

If a query executes successfully, but returns no results, send an empty JSON document signifying no result.

Relationships

  • success: All FlowFiles that have the results of a successful query execution go here.

  • failure: All input FlowFiles that are part of a failed query execution go here.

  • original: All input FlowFiles that are part of a successful query execution go here.

Writes Attributes

  • mongo.database.name: The database where the results came from.

  • mongo.collection.name: The collection where the results came from.

Input Requirement

This component allows an incoming relationship.

Additional Details

Description:

This processor runs queries against a MongoDB instance or cluster and writes the results to a flowfile. It allows input, but can run standalone as well.

Specifying the Query

The query can be specified in one of three ways:

  • Query configuration property.

  • Query Attribute configuration property.

  • FlowFile content.

If a value is specified in either of the configuration properties, it will not look in the FlowFile content for a query.

Limiting/Shaping Results

The following options for limiting/shaping results are available:

  • Limit - limit the number of results. This should not be confused with the “batch size” option which is a setting for the underlying MongoDB driver to tell it how many items to retrieve in each poll of the server.

  • Sort - sort the result set. Requires a JSON document like { "someDate": -1 }

  • Projection - control which fields to return. For example, { "_id": 0 } would remove the _id field.

Misc Options

Results Per FlowFile, if set, creates a JSON array out of a batch of results and writes the result to the output. Pretty Print, if enabled, will format the JSON data to be easy for a human to read (e.g. proper indentation of fields).
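
For readers who think in terms of the MongoDB drivers, the following pymongo sketch shows documents equivalent to the Query, Projection, Sort, Limit and Batch Size properties described above. It only illustrates the expected document shapes; the processor itself issues the query through its configured Client Service, not through pymongo, and the connection details below are placeholders.

    from pymongo import MongoClient, DESCENDING

    # Connection details are illustrative; the processor uses its Client Service instead.
    collection = MongoClient("mongodb://localhost:27017")["mydb"]["mycollection"]

    query      = {"status": "active"}        # Query property ({} fetches the whole collection)
    projection = {"_id": 0}                  # Projection property: drop the _id field
    sort_spec  = [("someDate", DESCENDING)]  # Sort property, equivalent to { "someDate": -1 }

    cursor = (collection.find(query, projection)
              .sort(sort_spec)
              .limit(100)       # Limit property
              .batch_size(20))  # Batch Size property: items fetched per driver round trip

    for document in cursor:
        print(document)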

GetMongoRecord

A record-based version of GetMongo that uses the Record writers to write the MongoDB result set.

Tags: mongo, mongodb, get, fetch, record, json

Properties

Client Service

If configured, this property will use the assigned client service for connection pooling.

Record Writer

The record writer to use to write the result sets.

Mongo Database Name

The name of the database to use

Mongo Collection Name

The name of the collection to use

Schema Name

The name of the schema in the configured schema registry to use for the query results.

Query Output Attribute

If set, the query will be written to a specified attribute on the output flowfiles.

Query

The selection criteria to do the lookup. If the field is left blank, it will look for input from an incoming connection from another processor to provide the query as a valid JSON document inside of the FlowFile’s body. If this field is left blank and a timer is enabled instead of an incoming connection, that will result in a full collection fetch using a "{}" query.

Projection

The fields to be returned from the documents in the result set; must be a valid BSON document

Sort

The fields by which to sort; must be a valid BSON document

Limit

The maximum number of elements to return

Batch Size

The number of elements to be returned from the server in one batch

Relationships

  • success: All FlowFiles that have the results of a successful query execution go here.

  • failure: All input FlowFiles that are part of a failed query execution go here.

  • original: All input FlowFiles that are part of a successful query execution go here.

Writes Attributes

  • mongo.database.name: The database where the results came from.

  • mongo.collection.name: The collection where the results came from.

Input Requirement

This component allows an incoming relationship.

Additional Details

Description:

This processor runs queries against a MongoDB instance or cluster and writes the results to a flowfile. It allows input, but can run standalone as well. It is a record-aware version of the GetMongo processor.

Specifying the Query

The query can be specified in one of three ways:

  • Query configuration property.

  • Query Attribute configuration property.

  • FlowFile content.

If a value is specified in either of the configuration properties, it will not look in the FlowFile content for a query.

Limiting/Shaping Results

The following options for limiting/shaping results are available:

  • Limit - limit the number of results. This should not be confused with the “batch size” option which is a setting for the underlying MongoDB driver to tell it how many items to retrieve in each poll of the server.

  • Sort - sort the result set. Requires a JSON document like { "someDate": -1 }

  • Projection - control which fields to return. For example, { "_id": 0 } would remove the _id field.

Misc Options

Results Per FlowFile, if set, creates a JSON array out of a batch of results and writes the result to the output. Pretty Print, if enabled, will format the JSON data to be easy for a human to read (e.g. proper indentation of fields).

GetS3ObjectMetadata

Check for the existence of a file in S3 without attempting to download it. This processor can be used as a router for workflows that need to check on a file in S3 before proceeding with data processing.

Tags: Amazon, S3, AWS, Archive, Exists

Properties

Metadata Target

This determines where the metadata will be written when found.

Metadata Attribute Include Pattern

A regular expression pattern to use for determining which object metadata entries are included as FlowFile attributes. This pattern is only applied to the 'found' relationship and will not be used to filter the error attributes in the 'failure' relationship.

Bucket

The S3 Bucket to interact with

Object Key

The S3 Object Key to use. This is analogous to a filename for traditional file systems.

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

The AWS Region to connect to.

Communications Timeout

The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

FullControl User List

A comma-separated list of Amazon User ID’s or E-mail addresses that specifies who should have Full Control for an object

Read Permission User List

A comma-separated list of Amazon User ID’s or E-mail addresses that specifies who should have Read Access for an object

Read ACL User List

A comma-separated list of Amazon User ID’s or E-mail addresses that specifies who should have permissions to read the Access Control List for an object

Owner

The Amazon ID to use for the object’s owner

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Signer Override

The AWS S3 library uses Signature Version 4 by default but this property allows you to specify the Version 2 signer to support older S3-compatible services or even to plug in your own custom signer implementation.

Custom Signer Class Name

Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.

Custom Signer Module Location

Comma-separated list of paths to files and/or directories which contain the custom signer’s JAR file and its dependencies (if any).

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • failure: If the Processor is unable to process a given FlowFile, it will be routed to this Relationship.

  • found: An object was found in the bucket at the supplied key

  • not found: No object was found in the bucket at the supplied key

Input Requirement

This component requires an incoming relationship.

GetSFTP

Fetches files from an SFTP Server and creates FlowFiles from them

Tags: sftp, get, retrieve, files, fetch, remote, ingest, source, input

Properties

Hostname

The fully qualified hostname or IP address of the remote system

Port

The port that the remote system is listening on for file transfers

Username

Username

Password

Password for the user account

Private Key Path

The fully qualified path to the Private Key file

Private Key Passphrase

Password for the private key

Remote Path

The path on the remote system from which to pull or push files

File Filter Regex

Provides a Java Regular Expression for filtering Filenames; if a filter is supplied, only files whose names match that Regular Expression will be fetched

Path Filter Regex

When Search Recursively is true, then only subdirectories whose path matches the given Regular Expression will be scanned

Polling Interval

Determines how long to wait between fetching the listing for new files

Search Recursively

If true, will pull files from arbitrarily nested subdirectories; otherwise, will not traverse subdirectories

Follow symlink

If true, will pull files reached through symbolic links, including files in nested symlinked subdirectories; otherwise, symbolic links to files will not be read and symlinked subdirectories will not be traversed

Ignore Dotted Files

If true, files whose names begin with a dot (".") will be ignored

Delete Original

Determines whether or not the file is deleted from the remote system after it has been successfully transferred

Connection Timeout

Amount of time to wait before timing out while creating a connection

Data Timeout

When transferring a file between the local and remote system, this value specifies how long is allowed to elapse without any data being transferred between systems

Host Key File

If supplied, the given file will be used as the Host Key; otherwise, if the 'Strict Host Key Checking' property is set to true, the 'known_hosts' and 'known_hosts2' files from the ~/.ssh directory are used; otherwise no host key file will be used

Max Selects

The maximum number of files to pull in a single connection

Remote Poll Batch Size

The value specifies how many file paths to find in a given directory on the remote system when doing a file listing. This value generally should not need to be modified, but when polling against a remote system with a tremendous number of files it can be critical. Setting this value too high can result in very poor performance, and setting it too low can cause the flow to be slower than normal.

Strict Host Key Checking

Indicates whether or not strict enforcement of hosts keys should be applied

Send Keep Alive On Timeout

Send a Keep Alive message every 5 seconds up to 5 times for an overall timeout of 25 seconds.

Use Compression

Indicates whether or not ZLIB compression should be used when transferring files

Use Natural Ordering

If true, will pull files in the order in which they are naturally listed; otherwise, the order in which the files will be pulled is not defined

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN, SOCKS + AuthN

Ciphers Allowed

A comma-separated list of Ciphers allowed for SFTP connections. Leave unset to allow all. Available options are: 3des-cbc, 3des-ctr, aes128-cbc, aes128-ctr, aes128-gcm@openssh.com, aes192-cbc, aes192-ctr, aes256-cbc, aes256-ctr, aes256-gcm@openssh.com, arcfour, arcfour128, arcfour256, blowfish-cbc, blowfish-ctr, cast128-cbc, cast128-ctr, chacha20-poly1305@openssh.com, idea-cbc, idea-ctr, serpent128-cbc, serpent128-ctr, serpent192-cbc, serpent192-ctr, serpent256-cbc, serpent256-ctr, twofish-cbc, twofish128-cbc, twofish128-ctr, twofish192-cbc, twofish192-ctr, twofish256-cbc, twofish256-ctr

Key Algorithms Allowed

A comma-separated list of Key Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: ecdsa-sha2-nistp256, ecdsa-sha2-nistp256-cert-v01@openssh.com, ecdsa-sha2-nistp384, ecdsa-sha2-nistp384-cert-v01@openssh.com, ecdsa-sha2-nistp521, ecdsa-sha2-nistp521-cert-v01@openssh.com, rsa-sha2-256, rsa-sha2-512, ssh-dss, ssh-dss-cert-v01@openssh.com, ssh-ed25519, ssh-ed25519-cert-v01@openssh.com, ssh-rsa, ssh-rsa-cert-v01@openssh.com

Key Exchange Algorithms Allowed

A comma-separated list of Key Exchange Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: curve25519-sha256, curve25519-sha256@libssh.org, diffie-hellman-group-exchange-sha1, diffie-hellman-group-exchange-sha256, diffie-hellman-group1-sha1, diffie-hellman-group14-sha1, diffie-hellman-group14-sha256, diffie-hellman-group14-sha256@ssh.com, diffie-hellman-group15-sha256, diffie-hellman-group15-sha256@ssh.com, diffie-hellman-group15-sha384@ssh.com, diffie-hellman-group15-sha512, diffie-hellman-group16-sha256, diffie-hellman-group16-sha384@ssh.com, diffie-hellman-group16-sha512, diffie-hellman-group16-sha512@ssh.com, diffie-hellman-group17-sha512, diffie-hellman-group18-sha512, diffie-hellman-group18-sha512@ssh.com, ecdh-sha2-nistp256, ecdh-sha2-nistp384, ecdh-sha2-nistp521, ext-info-c

Message Authentication Codes Allowed

A comma-separated list of Message Authentication Codes allowed for SFTP connections. Leave unset to allow all. Available options are: hmac-md5, hmac-md5-96, hmac-md5-96-etm@openssh.com, hmac-md5-etm@openssh.com, hmac-ripemd160, hmac-ripemd160-96, hmac-ripemd160-etm@openssh.com, hmac-ripemd160@openssh.com, hmac-sha1, hmac-sha1-96, hmac-sha1-96@openssh.com, hmac-sha1-etm@openssh.com, hmac-sha2-256, hmac-sha2-256-etm@openssh.com, hmac-sha2-512, hmac-sha2-512-etm@openssh.com

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • filename: The filename is set to the name of the file on the remote server

  • path: The path is set to the path of the file’s directory on the remote server. For example, if the <Remote Path> property is set to /tmp, files picked up from /tmp will have the path attribute set to /tmp. If the <Search Recursively> property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to /tmp/abc/1/2/3

  • file.lastModifiedTime: The date and time that the source file was last modified

  • file.owner: The numeric owner id of the source file

  • file.group: The numeric group id of the source file

  • file.permissions: The read/write/execute permissions of the source file

  • absolute.path: The full/absolute path from where a file was picked up. The current 'path' attribute is still populated, but may be a relative path

Input Requirement

This component does not allow an incoming relationship.

See Also

GetShopify

Retrieves objects from a custom Shopify store. The processor's yield time should be set according to the account's rate limit.

Tags: shopify

Properties

Store Domain

The domain of the Shopify store, e.g. nifistore.myshopify.com

Access Token

Access Token to authenticate requests

API Version

The Shopify REST API version

Object Category

Shopify object category

Customer Category

Customer resource to query

Discount Category

Discount resource to query

Inventory Category

Inventory resource to query

Online Store Category

Online Store resource to query

Order Category

Order resource to query

Product Category

Product resource to query

Sales Channel Category

Sales Channel resource to query

Store Property Category

Store Property resource to query

Result Limit

The maximum number of results to request for each invocation of the Processor

Incremental Loading

The processor can incrementally load the queried objects so that each object is queried exactly once. For each query, the processor queries objects which were created or modified after the previous run time but before the current time.

Incremental Delay

The ending timestamp of the time window will be adjusted earlier by the amount configured in this property. For example, with a property value of 10 seconds, an ending timestamp of 12:30:45 would be changed to 12:30:35. Set this property to avoid missing objects when the clocks of your local machine and the Shopify servers are not in sync.

Incremental Initial Start Time

This property specifies the start time when running the first request. Represents an ISO 8601-encoded date and time string. For example, 3:50 pm on September 7, 2019 in the time zone of UTC (Coordinated Universal Time) is represented as "2019-09-07T15:50:00Z".

Web Client Service Provider

Controller service for HTTP client operations

Relationships

  • success: For FlowFiles created as a result of a successful query.

Writes Attributes

  • mime.type: Sets the MIME type to application/json

Stateful

Scope: Cluster

For a few resources the processor supports incremental loading. The list of the resources with the supported parameters can be found in the additional details.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Setting Up a Custom App

Follow the Shopify tutorial to enable and create private apps, set API Scopes and generate API tokens.

Incremental Loading

Some resources can be processed incrementally by NiFi. This means that only resources created or modified after the last run time of the processor are retrieved. The processor state can be reset in the context menu. The following list shows which date-time fields are used for incremental loading of each resource.

  • Customers

    • Customers: updated_at_min

  • Discounts

    • Price Rules: updated_at_min

  • Inventory

    • Inventory Levels: updated_at_min

  • Online Store

    • Script Tags: updated_at_min

  • Orders

    • Abandoned Checkouts: updated_at_min

    • Draft Orders: updated_at_min

    • Orders: updated_at_min

  • Product

    • Custom Collections: updated_at_min

    • Products: updated_at_min

    • Smart Collections: updated_at_min

  • Sales Channels

    • Product Listings: updated_at_min

  • Store Properties

    • Shipping Zones: updated_at_min
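
As a rough illustration of how the updated_at_min fields listed above drive incremental loading, the sketch below builds a query URL for the Products resource using the time window of a single run. The endpoint path, the updated_at_max parameter and the API version used here are assumptions made for the example; consult the Shopify REST API reference for the exact request format.

    from datetime import datetime, timedelta, timezone
    import urllib.parse

    def products_request_url(store_domain, api_version, window_start, window_end, limit):
        # Builds an incremental query for the "products" resource; the endpoint
        # layout is an assumption for illustration only.
        params = {
            "updated_at_min": window_start.isoformat(timespec="seconds"),
            "updated_at_max": window_end.isoformat(timespec="seconds"),
            "limit": limit,  # corresponds to the Result Limit property
        }
        return (f"https://{store_domain}/admin/api/{api_version}/products.json?"
                + urllib.parse.urlencode(params))

    now = datetime.now(timezone.utc)
    print(products_request_url("nifistore.myshopify.com", "2023-10",
                               now - timedelta(hours=1),     # previous run time
                               now - timedelta(seconds=10),  # current time minus Incremental Delay
                               100))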

GetSmbFile

Reads files from a Samba network location into FlowFiles. Use this processor instead of a CIFS mount if share access control is important. Configure the Hostname, Share and Directory accordingly: \\[Hostname]\[Share]\[path\to\Directory]

Tags: samba, smb, cifs, files, get

Properties

Hostname

The network host from which files should be read.

Share

The network share from which files should be read. This is the "first folder" after the hostname: \\hostname\[share]\dir1\dir2

Directory

The network folder from which files should be read. This is the remaining relative path after the share: \\hostname\share\[dir1\dir2].

Domain

The domain used for authentication. Optional; in most cases the username and password are sufficient.

Username

The username used for authentication. If no username is set then anonymous authentication is attempted.

Password

The password used for authentication. Required if Username is set.

Share Access Strategy

Indicates which types of shared access are granted on the file during the read. None is the most restrictive, but also the safest setting for preventing corruption.

File Filter

Only files whose names match the given regular expression will be picked up

Path Filter

When Recurse Subdirectories is true, then only subdirectories whose path matches the given regular expression will be scanned

Batch Size

The maximum number of files to pull in each iteration

Keep Source File

If true, the file is not deleted after it has been copied to the Content Repository; this causes the file to be picked up continually and is useful for testing purposes. If the original is not kept, NiFi needs write permissions on the directory it is pulling from; otherwise the file will be ignored.

Recurse Subdirectories

Indicates whether or not to pull files from subdirectories

Polling Interval

Indicates how long to wait before performing a directory listing

Ignore Hidden Files

Indicates whether or not hidden files should be ignored

SMB Dialect

The SMB dialect is negotiated between the client and the server by default to the highest common version supported by both ends. In some rare cases, the client-server communication may fail with the automatically negotiated dialect. This property can be used to set the dialect explicitly (e.g. to downgrade to a lower version) when such situations occur.

Use Encryption

Turns on/off encrypted communication between the client and the server. The property’s behavior is SMB dialect dependent: SMB 2.x does not support encryption and the property has no effect. In case of SMB 3.x, it is a hint/request to the server to turn encryption on if the server also supports it.

Enable DFS

Enables accessing Distributed File System (DFS) and following DFS links during SMB operations.

Timeout

Timeout for read and write operations.

Relationships

  • success: All files are routed to success

Writes Attributes

  • filename: The filename is set to the name of the file on the network share

  • path: The path is set to the relative path of the file’s network share name. For example, if the input is set to \\hostname\share\tmp, files picked up from \tmp will have the path attribute set to tmp

  • file.creationTime: The date and time that the file was created. May not work on all file systems

  • file.lastModifiedTime: The date and time that the file was last modified. May not work on all file systems

  • file.lastAccessTime: The date and time that the file was last accessed. May not work on all file systems

  • absolute.path: The full path from where a file was picked up. This includes the hostname and the share name

Input Requirement

This component does not allow an incoming relationship.

GetSNMP

Retrieves information from SNMP Agent with SNMP Get request and outputs a FlowFile with information in attributes and without any content

Tags: snmp, get, oid, walk

Properties

SNMP Agent Hostname

Hostname or network address of the SNMP Agent.

SNMP Agent Port

Port of the SNMP Agent.

SNMP Version

Three significant versions of SNMP have been developed and deployed. SNMPv1 is the original version of the protocol. More recent versions, SNMPv2c and SNMPv3, feature improvements in performance, flexibility and security.

SNMP Community

SNMPv1 and SNMPv2 use communities to establish trust between managers and agents. Most agents support three community names, one each for read-only, read-write and trap. These three community strings control different types of activities. The read-only community applies to get requests. The read-write community string applies to set requests. The trap community string applies to receipt of traps.

SNMP Security Level

SNMP version 3 provides extra security with the User Based Security Model (USM). The three levels of security are: 1. Communication without authentication and encryption (NoAuthNoPriv). 2. Communication with authentication and without encryption (AuthNoPriv). 3. Communication with authentication and encryption (AuthPriv).

SNMP Security Name

User name used for SNMP v3 Authentication.

SNMP Authentication Protocol

Hash based authentication protocol for secure authentication.

SNMP Authentication Passphrase

Passphrase used for SNMP authentication protocol.

SNMP Privacy Protocol

Privacy allows for encryption of SNMP v3 messages to ensure confidentiality of data.

SNMP Privacy Passphrase

Passphrase used for SNMP privacy protocol.

Number of Retries

Set the number of retries when requesting the SNMP Agent.

Timeout (ms)

Set the timeout in ms when requesting the SNMP Agent.

OID

Each OID (object identifier) identifies a variable that can be read or set via SNMP. This value is not taken into account when an input flowfile is present and will be omitted. It can be set to an empty string when the OIDs are provided through the flowfile.

Textual OID

The textual form of the numeric OID to request. This property is user defined; it is not processed but simply appended to the outgoing flowfile.

SNMP Strategy

SNMP strategy to use (SNMP Get or SNMP Walk)

Relationships

  • success: All FlowFiles that are received from the SNMP agent are routed to this relationship.

  • failure: All FlowFiles that cannot be received from the SNMP agent are routed to this relationship.

Writes Attributes

  • snmp$<OID>: Response variable binding: OID (e.g. 1.3.6.1.4.1.343) and its value.

  • snmp$errorIndex: Denotes the variable binding in which the error occurred.

  • snmp$errorStatus: The snmp4j error status of the PDU.

  • snmp$errorStatusText: The description of error status.

  • snmp$nonRepeaters: The number of non repeater variable bindings in a GETBULK PDU (currently not supported).

  • snmp$requestID: The request ID associated with the PDU.

  • snmp$type: The snmp4j numeric representation of the type of the PDU.

  • snmp$typeString: The name of the PDU type.

  • snmp$textualOid: This attribute will exist if and only if the strategy is GET and will be equal to the value given in the Textual OID property.

Input Requirement

This component allows an incoming relationship.

Additional Details

Summary

This processor polls an SNMP agent to get information for a given OID or OIDs (Strategy = GET) or for all the subtree associated to a given OID or OIDs (Strategy = WALK). This processor supports SNMPv1, SNMPv2c and SNMPv3. The component is based on SNMP4J.

The processor can compile the SNMP Get PDU from the attributes of an input flowfile (multiple OIDs can be specified) or from a single OID specified in the processor property. In the former case, the processor will only consider the OIDs specified in the flowfile. The processor looks for attributes prefixed with snmp$. If such an attribute is found, the attribute name is split using the $ character. The second element must respect the OID format to be considered a valid OID. The flowfile attribute value can be empty (it will later be filled with the retrieved value and written into the outgoing flowfile). When the processor is triggered, it sends the SNMP request and gets the information associated with the requested OID(s). Once a response is received from the SNMP agent, a FlowFile is constructed. The FlowFile content is empty; all the information is written in the FlowFile attributes. In case of a single GET request, the properties associated with the received PDU are transferred into the FlowFile as attributes. In case of a WALK request, only the OID/value pairs are transferred into the FlowFile as attributes. SNMP attribute names are prefixed with snmp$.

Regarding the attributes representing the OID/value pairs, the attribute name has the following format:

  • snmp$OID$SMI_Syntax_Value

where OID is the request OID, and SMI_Syntax_Value is the integer representing the type of the value associated with the OID. This value is provided to allow the SetSNMP processor to set values with the correct type.
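
As a minimal sketch (not NiFi code), an attribute name of this form can be split back into its OID and SMI syntax parts; the example attribute name below is hypothetical:

# Split an attribute name of the form snmp$OID$SMI_Syntax_Value.
def parse_snmp_attribute_name(name: str):
    prefix, oid, smi_syntax = name.split("$", 2)
    if prefix != "snmp":
        raise ValueError("not an SNMP attribute name")
    return oid, int(smi_syntax)

# Hypothetical attribute written for a walked OID.
print(parse_snmp_attribute_name("snmp$1.3.6.1.2.1.1.1.0$4"))  # ('1.3.6.1.2.1.1.1.0', 4)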

SNMP Properties

In case of a single SNMP Get request, the following is the list of available standard SNMP properties which may come with the PDU: (“snmp$errorIndex”, “snmp$errorStatus”, “snmp$errorStatusText”, “snmp$nonRepeaters”, “snmp$requestID”, “snmp$type”)

GetSnowflakeIngestStatus

Waits until a file in a Snowflake stage is ingested. The stage must be created in the Snowflake account beforehand. This processor is usually connected to an upstream StartSnowflakeIngest processor to make sure that the file is ingested.

Tags: snowflake, snowpipe, ingest, history

Properties

Ingest Manager Provider

Specifies the Controller Service to use for ingesting Snowflake staged files.

Relationships

  • success: For FlowFiles of successful ingestion

  • failure: For FlowFiles of failed ingestion

  • retry: For FlowFiles whose file is still not ingested. These FlowFiles should be routed back to this processor to try again later

Reads Attributes

  • snowflake.staged.file.path: Staged file path

Input Requirement

This component requires an incoming relationship.

Additional Details

Description

The GetSnowflakeIngestStatus processor can be used to get the status of a staged file ingested by a Snowflake pipe. To wait until a staged file is fully ingested (copied into the table) you should connect this processor’s “retry” relationship to itself. The processor requires an upstream connection that provides the path of the staged file to be checked through the “snowflake.staged.file.path” attribute. See StartSnowflakeIngest processor for details about how to properly set up a flow to ingest staged files. NOTE: Snowflake pipes cache the paths of ingested files and never ingest the same file multiple times. This can cause the processor to enter an “infinite loop” with a FlowFile that has the same “snowflake.staged.file.path” attribute as a staged file that has been previously ingested by the pipe. It is recommended that the retry mechanism be configured to avoid these scenarios.

GetSplunk

Retrieves data from Splunk Enterprise.

Tags: get, splunk, logs

Properties

Scheme

The scheme for connecting to Splunk.

Hostname

The IP address or hostname of the Splunk server.

Port

The port of the Splunk server.

Connection Timeout

Max wait time for connection to the Splunk server.

Read Timeout

Max wait time for response from the Splunk server.

Query

The query to execute. Typically beginning with a <search> command followed by a search clause, such as <search source="tcp:7689"> to search for messages received on TCP port 7689.

Time Field Strategy

Indicates whether to search by the time attached to the event, or by the time the event was indexed in Splunk.

Time Range Strategy

Indicates how to apply time ranges to each execution of the query. Selecting a managed option allows the processor to apply a time range from the last execution time to the current execution time. When using <Managed from Beginning>, an earliest time will not be applied on the first execution, and thus all records are searched. When using <Managed from Current>, the earliest time of the first execution will be the initial execution time. When using <Provided>, the time range will come from the Earliest Time and Latest Time properties, or no time range will be applied if these properties are left blank.

Earliest Time

The value to use for the earliest time when querying. Only used with a Time Range Strategy of Provided. See Splunk’s documentation on Search Time Modifiers for guidance in populating this field.

Latest Time

The value to use for the latest time when querying. Only used with a Time Range Strategy of Provided. See Splunk’s documentation on Search Time Modifiers for guidance in populating this field.

Time Zone

The Time Zone to use for formatting dates when performing a search. Only used with Managed time strategies.

Application

The Splunk Application to query.

Owner

The owner to pass to Splunk.

Token

The token to pass to Splunk.

Username

The username to authenticate to Splunk.

Password

The password to authenticate to Splunk.

Security Protocol

The security protocol to use for communicating with Splunk.

Output Mode

The output mode for the results.

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Relationships

  • success: Results retrieved from Splunk are sent out this relationship.

Writes Attributes

  • splunk.query: The query that was performed to produce the FlowFile.

  • splunk.earliest.time: The value of the earliest time that was used when performing the query.

  • splunk.latest.time: The value of the latest time that was used when performing the query.

Stateful

Scope: Cluster

If using one of the managed Time Range Strategies, this processor will store the values of the latest and earliest times from the previous execution so that the next execution of the query can pick up where the last execution left off. The state will be cleared and start over if the query is changed.

Input Requirement

This component does not allow an incoming relationship.

GetSQS

Fetches messages from an Amazon Simple Queuing Service Queue

Tags: Amazon, AWS, SQS, Queue, Get, Fetch, Poll

Properties

Queue URL

The URL of the queue to get messages from

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain an AWS credentials provider

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Auto Delete Messages

Specifies whether the messages should be automatically deleted by the processor once they have been received.

Batch Size

The maximum number of messages to send in a single network request

Communications Timeout

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other compatible endpoints.

Character Set

The Character Set that should be used to encode the textual content of the SQS message

Visibility Timeout

The amount of time after a message is received but not deleted that the message is hidden from other consumers

Receive Message Wait Time

The maximum amount of time to wait on a long polling receive call. Setting this to a value of 1 second or greater will reduce the number of SQS requests and decrease fetch latency at the cost of a constantly active thread.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to success relationship

Writes Attributes

  • hash.value: The MD5 sum of the message

  • hash.algorithm: MD5

  • sqs.message.id: The unique identifier of the SQS message

  • sqs.receipt.handle: The SQS Receipt Handle that is to be used to delete the message from the queue

Input Requirement

This component does not allow an incoming relationship.

See Also

GetWorkdayReport

A processor which can interact with a configurable Workday Report. The processor can forward the content without modification, or you can transform it by providing the specific Record Reader and Record Writer services based on your needs. You can also remove fields by defining schema in the Record Writer. Supported Workday report formats are: csv, simplexml, json

Tags: Workday, report

Properties

Workday Report URL

HTTP remote URL of Workday report including a scheme of http or https, as well as a hostname or IP address with optional port and path elements.

Authorization Type

The type of authorization for retrieving data from Workday resources.

Access Token Provider

Enables managed retrieval of OAuth2 Bearer Token.

Workday Username

The username provided for authentication of Workday requests. Encoded using Base64 for HTTP Basic Authentication as described in RFC 7617.

Workday Password

The password provided for authentication of Workday requests. Encoded using Base64 for HTTP Basic Authentication as described in RFC 7617.

Web Client Service Provider

Web client which is used to communicate with the Workday API.

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema.

Record Writer

The Record Writer to use for serializing Records to an output FlowFile.

Relationships

  • success: Response FlowFiles transferred when receiving HTTP responses with a status code between 200 and 299.

  • failure: Request FlowFiles transferred when receiving socket communication errors.

  • original: Request FlowFiles transferred when receiving HTTP responses with a status code between 200 and 299.

Writes Attributes

  • getworkdayreport.java.exception.class: The Java exception class raised when the processor fails

  • getworkdayreport.java.exception.message: The Java exception message raised when the processor fails

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Source / Record Writer

  • record.count: The number of records in an outgoing FlowFile. This is only populated on the 'success' relationship when the Record Reader and Writer are set.

Input Requirement

This component allows an incoming relationship.

Additional Details

Summary

This processor acts as a client endpoint to interact with the Workday API. It is capable of reading reports from Workday RaaS and transferring the content directly to the output, or you can define the required Record Reader and RecordSet Writer, so you can transform the report to the required format.

Supported report formats
  • csv

  • simplexml

  • json

In case of a JSON source you need to set the following parameters in the JsonTreeReader:

  • Starting Field Strategy: Nested Field

  • Starting Field Name: Report_Entry

It is possible to hide specific columns from the response if you define the Writer schema explicitly in the configuration of the RecordSet Writer.

Example: Remove name2 column from the response

Let’s say we have the following record structure:

RecordSet (
  Record (
    Field "name1" = "value1",
    Field "name2" = 42
  ),
  Record (
    Field "name1" = "value2",
    Field "name2" = 84
  )
)

If you would like to remove the “name2” column from the response, then you need to define the following writer schema:

{
  "name": "test",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    { "name": "name1", "type": "string" }
  ]
}

GetZendesk

Incrementally fetches data from Zendesk API.

Tags: zendesk

Properties

Web Client Service Provider

Controller service for HTTP client operations.

Subdomain Name

Name of the Zendesk subdomain.

User Name

Login user to Zendesk subdomain.

Authentication Type

Type of authentication to Zendesk API.

Authentication Credential

Password or authentication token for Zendesk login user.

Export Method

Method for incremental export.

Resource

The particular Zendesk resource which is meant to be exported.

Query Start Timestamp

Initial timestamp to query Zendesk API from in Unix timestamp seconds format.

Relationships

  • success: For FlowFiles created as a result of a successful HTTP request.

Writes Attributes

  • record.count: The number of records fetched by the processor.

Stateful

Scope: Cluster

Paging cursor for Zendesk API is stored. Cursor is updated after each successful request.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Description

The processor uses the Zendesk Incremental Exports API to initially export a complete list of items from some arbitrary milestone, and then periodically poll the API to incrementally export items that have been added or changed since the previous poll. The processor extracts data from the response and, if the response was not empty, emits a flow file whose content is an array of objects, also placing an attribute on the flow file with the number of records fetched. If the response was empty, no flow file is emitted. The SplitJson processor can be used to split the array of records into distinct flow files where each flow file contains exactly one record, as illustrated below.
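
For illustration only (a hypothetical response body, not the processor's implementation), the split performed by SplitJson amounts to turning one array-valued flow file into one flow file per element:

import json

# Hypothetical content of a GetZendesk result flow file.
flowfile_content = '[{"id": 1, "subject": "First"}, {"id": 2, "subject": "Second"}]'

# One output per record, as SplitJson would produce.
for record in json.loads(flowfile_content):
    print(json.dumps(record))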

Authentication

The Zendesk Incremental Exports API uses basic authentication. Either a password or an authentication token has to be provided. An authentication token can be created in the Zendesk API Settings, so users don't have to expose their passwords, and tokens can be revoked quickly if necessary.

Export methods

The Zendesk Incremental Exports API supports cursor-based and time-based export methods. The cursor-based method is the preferred way and should be used where available. Due to the limitations of time-based export, the result set may contain duplicated records. For more details on export methods, please refer to this guide.

Excluding duplicate items

Because of limitations with time-based pagination, the exported data may contain duplicate items. The processor does not perform deduplication; instead, the DetectDuplicate or DeduplicateRecord processors can be used together with the UpdateAttribute processor to extract the necessary attributes from the flow file content. Please see the following guide for details and the list of attributes to use in the deduplication process.

HandleHttpRequest

Starts an HTTP Server and listens for HTTP Requests. For each request, creates a FlowFile and transfers to 'success'. This Processor is designed to be used in conjunction with the HandleHttpResponse Processor in order to create a Web Service. In case of a multipart request, one FlowFile is generated for each part.

Tags: http, https, request, listen, ingress, web service

Properties

Listening Port

The Port to listen on for incoming HTTP requests

Hostname

The Hostname to bind to. If not specified, will bind to all hosts

SSL Context Service

The SSL Context Service to use in order to secure the server. If specified, the server will accept only HTTPS requests; otherwise, the server will accept only HTTP requests

HTTP Protocols

HTTP Protocols supported for Application Layer Protocol Negotiation with TLS

HTTP Context Map

The HTTP Context Map Controller Service to use for caching the HTTP Request Information

Allowed Paths

A Regular Expression that specifies the valid HTTP Paths that are allowed in the incoming URL Requests. If this value is specified and the path of the HTTP Requests does not match this Regular Expression, the Processor will respond with a 404: NotFound

Default URL Character Set

The character set to use for decoding URL parameters if the HTTP Request does not supply one

Allow GET

Allow HTTP GET Method

Allow POST

Allow HTTP POST Method

Allow PUT

Allow HTTP PUT Method

Allow DELETE

Allow HTTP DELETE Method

Allow HEAD

Allow HTTP HEAD Method

Allow OPTIONS

Allow HTTP OPTIONS Method

Maximum Threads

The maximum number of threads that the embedded HTTP server will use for handling requests.

Request Header Maximum Size

The maximum supported size of HTTP headers in requests sent to this processor

Additional HTTP Methods

A comma-separated list of non-standard HTTP Methods that should be allowed

Client Authentication

Specifies whether or not the Processor should authenticate clients. This value is ignored if the <SSL Context Service> Property is not specified or the SSL Context provided uses only a KeyStore and not a TrustStore.

Container Queue Size

The size of the queue for Http Request Containers

Multipart Request Max Size

The maximum size of the request. Only applies to requests with Content-Type: multipart/form-data; it is used to prevent denial-of-service attacks from filling up the heap or disk space.

Multipart Read Buffer Size

The threshold size at which the contents of an incoming file will be written to disk. Only applies to requests with Content-Type: multipart/form-data; it is used to prevent denial-of-service attacks from filling up the heap or disk space.

Parameters to Attributes List

A comma-separated list of HTTP parameters or form data to output as attributes

Relationships

  • success: All content that is received is routed to the 'success' relationship

Writes Attributes

  • http.context.identifier: An identifier that allows the HandleHttpRequest and HandleHttpResponse to coordinate which FlowFile belongs to which HTTP Request/Response.

  • mime.type: The MIME Type of the data, according to the HTTP Header "Content-Type"

  • http.servlet.path: The part of the request URL that is considered the Servlet Path

  • http.context.path: The part of the request URL that is considered to be the Context Path

  • http.method: The HTTP Method that was used for the request, such as GET or POST

  • http.local.name: IP address/hostname of the server

  • http.server.port: Listening port of the server

  • http.query.string: The query string portion of the Request URL

  • http.remote.host: The hostname of the requestor

  • http.remote.addr: The hostname:port combination of the requestor

  • http.remote.user: The username of the requestor

  • http.protocol: The protocol used to communicate

  • http.request.uri: The full Request URL

  • http.auth.type: The type of HTTP Authorization used

  • http.principal.name: The name of the authenticated user making the request

  • http.query.param.XXX: Each of query parameters in the request will be added as an attribute, prefixed with "http.query.param."

  • http.param.XXX: Form parameters in the request that are configured by "Parameters to Attributes List" will be added as an attribute, prefixed with "http.param.". Putting form parameters of large size is not recommended.

  • http.subject.dn: The Distinguished Name of the requestor. This value will not be populated unless the Processor is configured to use an SSLContext Service

  • http.issuer.dn: The Distinguished Name of the entity that issued the Subject’s certificate. This value will not be populated unless the Processor is configured to use an SSLContext Service

  • http.certificate.sans.N.name: X.509 Client Certificate Subject Alternative Name value from mutual TLS authentication. The attribute name has a zero-based index ordered according to the content of Client Certificate

  • http.certificate.sans.N.nameType: X.509 Client Certificate Subject Alternative Name type from mutual TLS authentication. The attribute name has a zero-based index ordered according to the content of Client Certificate. The attribute value is one of the General Names from RFC 3280 Section 4.1.2.7

  • http.headers.XXX: Each of the HTTP Headers that is received in the request will be added as an attribute, prefixed with "http.headers." For example, if the request contains an HTTP Header named "x-my-header", then the value will be added to an attribute named "http.headers.x-my-header"

  • http.headers.multipart.XXX: Each of the HTTP Headers that is received in the multipart request will be added as an attribute, prefixed with "http.headers.multipart." For example, if the multipart request contains an HTTP Header named "content-disposition", then the value will be added to an attribute named "http.headers.multipart.content-disposition"

  • http.multipart.size: For requests with Content-Type "multipart/form-data", the part’s content size is recorded into this attribute

  • http.multipart.content.type: For requests with Content-Type "multipart/form-data", the part’s content type is recorded into this attribute

  • http.multipart.name: For requests with Content-Type "multipart/form-data", the part’s name is recorded into this attribute

  • http.multipart.filename: For requests with Content-Type "multipart/form-data", when the part contains an uploaded file, the name of the file is recorded into this attribute. Files are stored temporarily at the default temporary-file directory specified in the "java.io.File" Java Docs.

  • http.multipart.fragments.sequence.number: For requests with Content-Type "multipart/form-data", the part’s index is recorded into this attribute. The index starts with 1.

  • http.multipart.fragments.total.number: For requests with Content-Type "multipart/form-data", the count of all parts is recorded into this attribute.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Usage Description

The pairing of this Processor with a HandleHttpResponse Processor provides the ability to use NiFi to visually construct a web server that can carry out any functionality that is available through the existing Processors. For example, one could construct a Web-based front end to an SFTP Server by constructing a flow such as:

HandleHttpRequest → PutSFTP → HandleHttpResponse

The HandleHttpRequest Processor provides several Properties to configure which methods are supported, the paths that are supported, and SSL configuration.

To handle requests with Content-Type: multipart/form-data containing multiple parts, additional attention needs to be paid. Each part generates a FlowFile of its own. To each of these FlowFiles, some special attributes are written:

  • http.context.identifier

  • http.multipart.fragments.sequence.number

  • http.multipart.fragments.total.number

These attributes can be used to implement a gating mechanism for the HandleHttpResponse processor: for a given http.context.identifier (which is unique to the request), wait until FlowFiles for all http.multipart.fragments.sequence.number values, i.e. http.multipart.fragments.total.number FlowFiles in total, have been processed before responding. A minimal sketch of this idea follows.
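
A minimal sketch of that gating idea (plain Python, not NiFi code), grouping parts by http.context.identifier and releasing a group once all http.multipart.fragments.total.number parts have been seen:

from collections import defaultdict

pending = defaultdict(list)  # http.context.identifier -> list of part attribute maps

def on_part_processed(attributes: dict):
    ctx = attributes["http.context.identifier"]
    pending[ctx].append(attributes)
    total = int(attributes["http.multipart.fragments.total.number"])
    if len(pending[ctx]) == total:
        # All parts of this request are done; the response can now be sent.
        return pending.pop(ctx)
    return None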

HandleHttpResponse

Sends an HTTP Response to the Requestor that generated a FlowFile. This Processor is designed to be used in conjunction with the HandleHttpRequest in order to create a web service.

Tags: http, https, response, egress, web service

Properties

HTTP Status Code

The HTTP Status Code to use when responding to the HTTP Request. See Section 10 of RFC 2616 for more information.

HTTP Context Map

The HTTP Context Map Controller Service to use for caching the HTTP Request Information

Attributes to add to the HTTP Response (Regex)

Specifies the Regular Expression that determines the names of FlowFile attributes that should be added to the HTTP response

Dynamic Properties

An HTTP header name

These HTTPHeaders are set in the HTTP Response

Relationships

  • success: FlowFiles will be routed to this Relationship after the response has been successfully sent to the requestor

  • failure: FlowFiles will be routed to this Relationship if the Processor is unable to respond to the requestor. This may happen, for instance, if the connection times out or if NiFi is restarted before responding to the HTTP Request.

Reads Attributes

  • http.context.identifier: The value of this attribute is used to lookup the HTTP Response so that the proper message can be sent back to the requestor. If this attribute is missing, the FlowFile will be routed to 'failure.'

  • http.request.uri: Value of the URI requested by the client. Used for provenance event.

  • http.remote.host: IP address of the client. Used for provenance event.

  • http.local.name: IP address/hostname of the server. Used for provenance event.

  • http.server.port: Listening port of the server. Used for provenance event.

  • http.subject.dn: SSL distinguished name (if any). Used for provenance event.

Input Requirement

This component requires an incoming relationship.

Additional Details

Usage Description:

The pairing of this Processor with a HandleHttpRequest Processor provides the ability to use NiFi to visually construct a web server that can carry out any functionality that is available through the existing Processors. For example, one could construct a Web-based front end to an SFTP Server by constructing a flow such as:

HandleHttpRequest → PutSFTP → HandleHttpResponse

This Processor must be configured with the same service as the corresponding HandleHttpRequest Processor. Otherwise, all FlowFiles will be routed to the ‘failure’ relationship.

HandleSftpRequest

(TECHNOLOGY PREVIEW) Handles SFTP requests made to the specified directory. Requests are converted to FlowFiles and routed to the respective relationship. Only read and write requests are supported. If a Record Reader is set, incoming FlowFiles can be used to keep track of which files are available at the specified directory.

Properties

SFTP Server Controller

The controller that hosts the SFTP Server.

Directory Path

Requests made by the client on the specified directory path or any subdirectory will be routed to this processor.

Record Reader

If set, incoming FlowFiles will be read using the Record Reader to determine which files are shown to clients as available at the given Directory Path. This is optional and not needed to read files from the server. Available files won’t be checked when a read request is received.

Relationships

  • failure: TODO

  • list: Incoming FlowFiles will be routed through this relationship.

  • read: Read requests will be routed to this relationship. The FlowFile’s content should be filled with the requested file and then routed to HandleSftpResponse processor.

  • write: Write requests will be routed to this relationship. The FlowFile’s content will contain the written file.

Writes Attributes

  • sftp.context: This attribute is used by a HandleSftpResponse processor to complete the read requests.

  • sftp.directory: Equal to the specified Directory Path property.

  • filename: The name of the written/read file.

Input Requirement

This component allows an incoming relationship.

HandleSftpResponse

(TECHNOLOGY PREVIEW) Completes a read request initiated by HandleSftpRequest.

Properties

SFTP Server Controller

The controller that hosts the SFTP Server.

SFTP Status Code

The status code to use when responding to the SFTP request. 0 indicates successful completion of the request.

SFTP Error Message

Error message used when responding to a request which doesn’t use 0 as the status code.

Relationships

  • success: FlowFiles will be routed to this Relationship after the response has been successfully sent to the requester.

  • failure: FlowFiles will be routed to this Relationship if the Processor is unable to respond to the requester. This may happen, for instance, if the connection times out or if NiFi is restarted before responding to the request.

Reads Attributes

  • sftp.context: The ID of the read request that will be completed.

Input Requirement

This component allows an incoming relationship.

IdentifyMimeType

Attempts to identify the MIME Type used for a FlowFile. If the MIME Type can be identified, an attribute with the name 'mime.type' is added with the value being the MIME Type. If the MIME Type cannot be determined, the value will be set to 'application/octet-stream'. In addition, the attribute 'mime.extension' will be set if a common file extension for the MIME Type is known. If the MIME Type detected is of type text/*, attempts to identify the charset used and an attribute with the name 'mime.charset' is added with the value being the charset.

Tags: compression, gzip, bzip2, zip, MIME, mime.type, file, identify

Properties

Use Filename In Detection

If true, the filename will be passed to Tika to aid in detection.

Config Strategy

Select the loading strategy for MIME Type configuration to be used.

Config Body

Body of MIME type config file. Only one of Config File or Config Body may be used.

Config File

Path to MIME type config file. Only one of Config File or Config Body may be used.

Relationships

  • success: All FlowFiles are routed to success

Writes Attributes

  • mime.type: This Processor sets the FlowFile’s mime.type attribute to the detected MIME Type. If unable to detect the MIME Type, the attribute’s value will be set to application/octet-stream

  • mime.extension: This Processor sets the FlowFile’s mime.extension attribute to the file extension associated with the detected MIME Type. If there is no correlated extension, the attribute’s value will be empty

  • mime.charset: This Processor sets the FlowFile’s mime.charset attribute to the detected charset. If unable to detect the charset or the detected MIME type is not of type text/*, the attribute will not be set

Input Requirement

This component requires an incoming relationship.

Additional Details

The following is a non-exhaustive list of MIME Types detected by default in NiFi:

  • application/gzip

  • application/bzip2

  • application/flowfile-v3

  • application/flowfile-v1

  • application/xml

  • video/mp4

  • video/x-m4v

  • video/mp4a-latm

  • video/quicktime

  • video/mpeg

  • audio/wav

  • audio/mp3

  • image/bmp

  • image/png

  • image/jpg

  • image/gif

  • image/tif

  • application/vnd.ms-works

  • application/msexcel

  • application/mspowerpoint

  • application/msaccess

  • application/x-ms-wmv

  • application/pdf

  • application/x-rpm

  • application/tar

  • application/x-7z-compressed

  • application/java-archive

  • application/zip

  • application/x-lzh

An example value for the Config Body property that will identify a file whose contents start with “abcd” as MIME Type “custom/abcd” and with extension “.abcd” would look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
    <mime-type type="custom/abcd">
        <magic priority="50">
            <match value="abcd" type="string" offset="0"/>
        </magic>
        <glob pattern="*.abcd"/>
    </mime-type>
</mime-info>

For a more complete list of Tika’s default types (and additional details regarding customization of the value for the Config Body property), please refer to Apache Tika’s documentation

InvokeCloud

Used in an IGUASU Gateway (running on an on-prem system) for sending requests with incoming FlowFiles' content and attributes to IGUASU (running on the Virtimo Cloud). A request’s response message is routed via the success relationship. IGUASU can receive and respond to requests using ListenOnPrem and RespondOnPrem processors. Data is securely sent in either direction using a standing WSS connection initiated by the IGUASU Gateway, meaning the on-prem system does not need to open external ports.

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

Client Controller

Controller for connecting to a Virtimo system and enabling WSS communication.

Choose listener

Choose the target listener from available ones.

Listener Identifier

The identifier of the listener that will be called.

Target unavailable routing

Which relationship to route FlowFiles to if the target listener is unavailable.

Request timeout

The request timeout after which a message will be routed to the failure relationship.

Relationships

  • success: A successful request’s response will be routed to this relationship.

  • failure: An unsuccessful request’s response/error will be routed to this relationship. If the target listener was not available, routing is subject to the "Target unavailable routing" property.

  • original: The request as it was sent will be routed to this relationship.

  • unavailable: Subject to the "Target unavailable routing" property, the FlowFile may be routed here if the target listener was not available.

Reads Attributes

  • hybrid.response: Responses are routed to the 'success', 'failure', 'unavailable' relationships, or put back into the queue depending on the response code (and the 'Target unavailable routing' strategy)

Writes Attributes

  • hybrid.invoker: This processor’s unique identifier. Used to direct responses back to this processor.

  • hybrid.listener: The chosen listener’s identifier. Used to direct the requests to the target listener.

  • hybrid.request: Unique identifier generated for each request. Used to match responses with the corresponding request.

Input Requirement

This component requires an incoming relationship.

InvokeHTTP

An HTTP client processor which can interact with a configurable HTTP Endpoint. The destination URL and HTTP Method are configurable. When the HTTP Method is PUT, POST or PATCH, the FlowFile contents are included as the body of the request and FlowFile attributes are converted to HTTP headers, optionally, based on configuration properties.

Tags: http, https, rest, client

Properties

HTTP Method

HTTP request method (GET, POST, PUT, PATCH, DELETE, HEAD, OPTIONS). Arbitrary methods are also supported. Methods other than POST, PUT and PATCH will be sent without a message body.

HTTP URL

HTTP remote URL including a scheme of http or https, as well as a hostname or IP address with optional port and path elements. Any encoding of the URL must be done by the user.
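
Because the processor does not encode the URL itself, reserved or non-ASCII characters must be percent-encoded beforehand. A minimal sketch using Python's standard library (the URL and path segment shown are hypothetical):

from urllib.parse import quote

path_segment = "report name 2024"  # contains spaces
encoded = quote(path_segment, safe="")
print("https://example.com/api/" + encoded)  # https://example.com/api/report%20name%202024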

HTTP/2 Disabled

Disable negotiation of HTTP/2 protocol. HTTP/2 requires TLS. HTTP/1.1 protocol support is required when HTTP/2 is disabled.

SSL Context Service

SSL Context Service provides trusted certificates and client certificates for TLS communication.

Connection Timeout

Maximum time to wait for initial socket connection to the HTTP URL.

Socket Read Timeout

Maximum time to wait for receiving responses from a socket connection to the HTTP URL.

Socket Write Timeout

Maximum time to wait for write operations while sending requests from a socket connection to the HTTP URL.

Socket Idle Timeout

Maximum time to wait before closing idle connections to the HTTP URL.

Socket Idle Connections

Maximum number of idle connections to the HTTP URL.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS + AuthN. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Request OAuth2 Access Token Provider

Enables managed retrieval of OAuth2 Bearer Token applied to HTTP requests using the Authorization Header.

Request Username

The username provided for authentication of HTTP requests. Encoded using Base64 for HTTP Basic Authentication as described in RFC 7617.

Request Password

The password provided for authentication of HTTP requests. Encoded using Base64 for HTTP Basic Authentication as described in RFC 7617.
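
As a minimal sketch (hypothetical credentials, not the processor's implementation), the RFC 7617 Basic scheme concatenates username and password and Base64-encodes the result:

import base64

username, password = "nifi-user", "secret"  # hypothetical credentials
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
print("Authorization: Basic " + token)  # Authorization: Basic bmlmaS11c2VyOnNlY3JldA==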

Request Digest Authentication Enabled

Enable Digest Authentication on HTTP requests with Username and Password credentials as described in RFC 7616.

Request Failure Penalization Enabled

Enable penalization of request FlowFiles when receiving HTTP response with a status code between 400 and 499.

Request Body Enabled

Enable sending HTTP request body for PATCH, POST, or PUT methods.

Request Multipart Form-Data Name

Enable sending HTTP request body formatted using multipart/form-data and using the form name configured.

Request Multipart Form-Data Filename Enabled

Enable sending the FlowFile filename attribute as the filename parameter in the Content-Disposition Header for multipart/form-data HTTP requests.

Request Chunked Transfer-Encoding Enabled

Enable sending HTTP requests with the Transfer-Encoding Header set to chunked, and disable sending the Content-Length Header. Transfer-Encoding applies to the body in HTTP/1.1 requests as described in RFC 7230 Section 3.3.1

Request Content-Encoding

HTTP Content-Encoding applied to request body during transmission. The receiving server must support the selected encoding to avoid request failures.

Request Content-Type

HTTP Content-Type Header applied when sending an HTTP request body for PATCH, POST, or PUT methods. The Content-Type defaults to application/octet-stream when not configured.

Request Date Header Enabled

Enable sending HTTP Date Header on HTTP requests as described in RFC 7231 Section 7.1.1.2.

Request Header Attributes Pattern

Regular expression that defines which FlowFile attributes to send as HTTP headers in the request. If not defined, no attributes are sent as headers. Dynamic properties will always be sent as headers. The dynamic property name will be the header key and the dynamic property value, interpreted as Expression Language, will be the header value. Attributes and their values are limited to ASCII characters due to the requirement of the HTTP protocol.
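
For illustration only (the pattern and attribute names below are hypothetical, not part of the processor), this is the kind of selection the regular expression performs:

import re

pattern = re.compile(r"x-.*")  # hypothetical Request Header Attributes Pattern
attributes = {"x-trace-id": "abc123", "filename": "data.json", "x-tenant": "acme"}

headers = {name: value for name, value in attributes.items() if pattern.fullmatch(name)}
print(headers)  # {'x-trace-id': 'abc123', 'x-tenant': 'acme'}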

Request User-Agent

HTTP User-Agent Header applied to requests. RFC 7231 Section 5.5.3 describes the recommended formatting.

Response Body Attribute Name

FlowFile attribute name used to write an HTTP response body for FlowFiles transferred to the Original relationship.

Response Body Attribute Size

Maximum size in bytes applied when writing an HTTP response body to a FlowFile attribute. Attributes exceeding the maximum will be truncated.

Response Body Ignored

Disable writing HTTP response FlowFiles to Response relationship

Response Cache Enabled

Enable HTTP response caching described in RFC 7234. Caching responses considers ETag and other headers.

Response Cache Size

Maximum size of HTTP response cache in bytes. Caching responses considers ETag and other headers.

Response Cookie Strategy

Strategy for accepting and persisting HTTP cookies. Accepting cookies enables persistence across multiple requests.

Response Generation Required

Enable generation and transfer of a FlowFile to the Response relationship regardless of HTTP response received.

Response FlowFile Naming Strategy

Determines the strategy used for setting the filename attribute of FlowFiles transferred to the Response relationship.

Response Header Request Attributes Enabled

Enable adding HTTP response headers as attributes to FlowFiles transferred to the Original, Retry or No Retry relationships.

Response Header Request Attributes Prefix

Prefix to HTTP response headers when included as attributes to FlowFiles transferred to the Original, Retry or No Retry relationships. It is recommended to end with a separator character like '.' or '-'.

Response Redirects Enabled

Enable following HTTP redirects sent with HTTP 300 series responses as described in RFC 7231 Section 6.4.

Dynamic Properties

Header Name

Send request header with a key matching the Dynamic Property Key and a value created by evaluating the Attribute Expression Language set in the value of the Dynamic Property.

post:form:<NAME>

When the HTTP Method is POST, dynamic properties with the property name in the form of post:form:<NAME>, where <NAME> will be the form data name, will be used to fill out the multipart form parts. If Request Body Enabled is false, the FlowFile content will not be sent, but any other form data will be.

Relationships

  • Failure: Request FlowFiles transferred when receiving socket communication errors.

  • No Retry: Request FlowFiles transferred when receiving HTTP responses with a status code between 400 and 499.

  • Original: Request FlowFiles transferred when receiving HTTP responses with a status code between 200 and 299.

  • Response: Response FlowFiles transferred when receiving HTTP responses with a status code between 200 and 299.

  • Retry: Request FlowFiles transferred when receiving HTTP responses with a status code between 500 and 599.

Writes Attributes

  • invokehttp.status.code: The status code that is returned

  • invokehttp.status.message: The status message that is returned

  • invokehttp.response.body: In the instance where the status code received is not a success (2xx) then the response body will be put to the 'invokehttp.response.body' attribute of the request FlowFile.

  • invokehttp.request.url: The original request URL

  • invokehttp.request.duration: Duration (in milliseconds) of the HTTP call to the external endpoint

  • invokehttp.response.url: The URL that was ultimately requested after any redirects were followed

  • invokehttp.tx.id: The transaction ID that is returned after reading the response

  • invokehttp.remote.dn: The DN of the remote server

  • invokehttp.java.exception.class: The Java exception class raised when the processor fails

  • invokehttp.java.exception.message: The Java exception message raised when the processor fails

  • user-defined: If the 'Response Body Attribute Name' property is set, then whatever it is set to will become the attribute key and the value will be the body of the HTTP response.

Input Requirement

This component allows an incoming relationship.

InvokeInubit

Used in IGUASU (running on the Virtimo Cloud) for sending requests with incoming FlowFiles' content and attributes to an INUBIT system. A request’s response message is routed via this processor’s success relationship. INUBIT can receive and respond to requests using an IGUASU-Listener-Connector. Data is securely sent in either direction using a standing WSS connection initiated by INUBIT, meaning the INUBIT system does not need to open external ports.

Tags: hybrid, cloud, onprem, virtimo, websocket, inubit

Properties

Server Controller

Controller for letting other Virtimo systems connect and enabling WSS communication.

Choose listener

Choose the target listener from available ones.

Listener Identifier

The identifier of the listener that will be called.

Target unavailable routing

Which relationship to route FlowFiles to if the target listener is unavailable.

Request timeout

The request timeout after which a message will be routed to the failure relationship.

Relationships

  • success: A successful request’s response will be routed to this relationship.

  • failure: An unsuccessful request’s response/error will be routed to this relationship. If the target listener was not available, routing is subject to the "Target unavailable routing" property.

  • original: The request as it was sent will be routed to this relationship.

  • unavailable: Subject to the "Target unavailable routing" property, the FlowFile may be routed here if the target listener was not available.

Reads Attributes

  • hybrid.response: Responses are routed to the 'success', 'failure', 'unavailable' relationships, or put back into the queue depending on the response code (and the 'Target unavailable routing' strategy)

Writes Attributes

  • hybrid.invoker: This processor’s unique identifier. Used to direct responses back to this processor.

  • hybrid.listener: The chosen listener’s identifier. Used to direct the requests to the target listener.

  • hybrid.request: Unique identifier generated for each request. Used to match responses with the corresponding request.

Input Requirement

This component requires an incoming relationship.

See Also

InvokeOnPrem

Used in IGUASU (running on the Virtimo Cloud) for sending requests with incoming FlowFiles' content and attributes to an IGUASU Gateway (running on an on-prem system). A request’s response message is routed via this processor’s success relationship. The IGUASU Gateway can receive and respond to requests using ListenCloud and RespondCloud processors. Data is securely sent in either direction using a standing WSS connection initiated by the IGUASU Gateway, meaning the on-prem system does not need to open external ports.

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

Server Controller

Controller for letting other Virtimo systems connect and enabling WSS communication.

Choose listener

Choose the target listener from available ones.

Listener Identifier

The identifier of the listener that will be called.

Target unavailable routing

Which relationship to route FlowFiles to if the target listener is unavailable.

Request timeout

The request timeout after which a message will be routed to the failure relationship.

Relationships

  • success: A successful request’s response will be routed to this relationship.

  • failure: An unsuccessful request’s response/error will be routed to this relationship. If the target listener was not available, routing is subject to the "Target unavailable routing" property.

  • original: The request as it was sent will be routed to this relationship.

  • unavailable: Subject to the "Target unavailable routing" property, the FlowFile may be routed here if the target listener was not available.

Reads Attributes

  • hybrid.response: Responses are routed to the 'success', 'failure', 'unavailable' relationships, or put back into the queue depending on the response code (and the 'Target unavailable routing' strategy)

Writes Attributes

  • hybrid.invoker: This processor’s unique identifier. Used to direct responses back to this processor.

  • hybrid.listener: The chosen listener’s identifier. Used to direct the requests to the target listener.

  • hybrid.request: Unique identifier generated for each request. Used to match responses with the corresponding request.

Input Requirement

This component requires an incoming relationship.

InvokeScriptedProcessor

Experimental - Invokes a script engine for a Processor defined in the given script. The script must define a valid class that implements the Processor interface, and it must set a variable 'processor' to an instance of the class. Processor methods such as onTrigger() will be delegated to the scripted Processor instance. Also any Relationships or PropertyDescriptors defined by the scripted processor will be added to the configuration dialog. The scripted processor can implement public void setLogger(ComponentLog logger) to get access to the parent logger, as well as public void onScheduled(ProcessContext context) and public void onStopped(ProcessContext context) methods to be invoked when the parent InvokeScriptedProcessor is scheduled or stopped, respectively. NOTE: The script will be loaded when the processor is populated with property values; see the Restrictions section for more security implications. Experimental: Impact of sustained usage not yet verified.

Tags: script, invoke, groovy

Properties

Script Engine

No Script Engines found

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Dynamic Properties

Script Engine Binding property

Updates a script engine property specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value

Stateful

Scope: Local, Cluster

Scripts can store and retrieve state using the State Management APIs. Consult the State Manager section of the Developer’s Guide for more details.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component allows an incoming relationship.

See Also

ISPEnrichIP

Looks up ISP information for an IP address and adds the information to FlowFile attributes. The ISP data is provided as a MaxMind ISP database. (Note that this is NOT the same as the GeoLite database utilized by some geo enrichment tools). The attribute that contains the IP address to lookup is provided by the 'IP Address Attribute' property. If the name of the attribute provided is 'X', then the attributes added by enrichment will take the form X.isp.<fieldName>

Tags: ISP, enrich, ip, maxmind

Properties

MaxMind Database File

Path to Maxmind IP Enrichment Database File

IP Address Attribute

The name of an attribute whose value is a dotted decimal IP address for which enrichment should occur

Log Level

The Log Level to use when an IP is not found in the database. Accepted values: INFO, DEBUG, WARN, ERROR.

Relationships

  • found: Where to route flow files after successfully enriching attributes with data provided by database

  • not found: Where to route flow files after unsuccessfully enriching attributes because no data was found

Writes Attributes

  • X.isp.lookup.micros: The number of microseconds that the geo lookup took

  • X.isp.asn: The Autonomous System Number (ASN) identified for the IP address

  • X.isp.asn.organization: The Organization Associated with the ASN identified

  • X.isp.name: The name of the ISP associated with the IP address provided

  • X.isp.organization: The Organization associated with the IP address provided

Input Requirement

This component requires an incoming relationship.

JoinEnrichment

Joins together Records from two different FlowFiles where one FlowFile, the 'original' contains arbitrary records and the second FlowFile, the 'enrichment' contains additional data that should be used to enrich the first. See Additional Details for more information on how to configure this processor and the different use cases that it aims to accomplish.

Tags: fork, join, enrichment, record, sql, wrap, recordpath, merge, combine, streams

Properties

Original Record Reader

The Record Reader for reading the 'original' FlowFile

Enrichment Record Reader

The Record Reader for reading the 'enrichment' FlowFile

Record Writer

The Record Writer to use for writing the results. If the Record Writer is configured to inherit the schema from the Record, the schema that it will inherit will be the result of merging both the 'original' record schema and the 'enrichment' record schema.

Join Strategy

Specifies how to join the two FlowFiles into a single FlowFile

SQL

The SQL SELECT statement to evaluate. Expression Language may be provided, but doing so may result in poorer performance. Because this Processor is dealing with two FlowFiles at a time, it’s also important to understand how attributes will be referenced. If both FlowFiles have an attribute with the same name but different values, the Expression Language will resolve to the value provided by the 'enrichment' FlowFile.

Default Decimal Precision

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'precision' denoting the number of available digits is required. Generally, precision is defined by the column data type definition or the database engine's default. However, undefined precision (0) can be returned from some database engines. 'Default Decimal Precision' is used when writing those undefined precision numbers.

Default Decimal Scale

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'scale' denoting the number of available decimal digits is required. Generally, scale is defined by the column data type definition or the database engine's default. However, when undefined precision (0) is returned, scale can also be uncertain with some database engines. 'Default Decimal Scale' is used when writing those undefined numbers. If a value has more decimals than the specified scale, then the value will be rounded; e.g. 1.53 becomes 2 with scale 0, and 1.5 with scale 1.

Insertion Record Path

Specifies where in the 'original' Record the 'enrichment' Record’s fields should be inserted. Note that if the RecordPath does not point to any existing field in the original Record, the enrichment will not be inserted.

Maximum number of Bins

Specifies the maximum number of bins that can be held in memory at any one time

Timeout

Specifies the maximum amount of time to wait for the second FlowFile once the first arrives at the processor, after which point the first FlowFile will be routed to the 'timeout' relationship.

Relationships

  • failure: If both the 'original' and 'enrichment' FlowFiles arrive at the processor but there was a failure in joining the records, both of those FlowFiles will be routed to this relationship.

  • joined: The resultant FlowFile with Records joined together from both the original and enrichment FlowFiles will be routed to this relationship

  • original: Both of the incoming FlowFiles ('original' and 'enrichment') will be routed to this Relationship. I.e., this is the 'original' version of both of these FlowFiles.

  • timeout: If one of the incoming FlowFiles (i.e., the 'original' FlowFile or the 'enrichment' FlowFile) arrives to this Processor but the other does not arrive within the configured Timeout period, the FlowFile that did arrive is routed to this relationship.

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records in the FlowFile

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: This Processor will load into heap all FlowFiles that are on its incoming queues. While it loads the FlowFiles themselves, and not their content, the FlowFile attributes can be very memory intensive. Additionally, if the Join Strategy is set to SQL, the SQL engine may require buffering the entire contents of the enrichment FlowFile for each concurrent task. See Processor’s Additional Details for more details and for steps on how to mitigate these concerns.

See Also

Additional Details

Introduction

The JoinEnrichment processor is designed to be used in conjunction with the ForkEnrichment Processor. Used together, they provide a powerful mechanism for transforming data into a separate request payload for gathering enrichment data, gathering that enrichment data, optionally transforming the enrichment data, and finally joining together the original payload with the enrichment data.

Typical Dataflow

A typical dataflow begins with a ForkEnrichment processor, which is responsible for taking in a FlowFile and producing two copies of it: one routed to the “original” relationship and the other to the “enrichment” relationship. Each copy will have its own set of attributes added to it.

The “original” FlowFile is routed directly to the JoinEnrichment processor, while the “enrichment” FlowFile is routed in a different direction. Each of these FlowFiles will have an attribute named “enrichment.group.id” with the same value, and the JoinEnrichment processor uses this information to correlate the two FlowFiles. The “enrichment.role” attribute will also be added to each FlowFile, but with a different value: the FlowFile routed to “original” will have an enrichment.role of ORIGINAL, while the FlowFile routed to “enrichment” will have an enrichment.role of ENRICHMENT.

The Processors that make up the “enrichment” path will vary from use case to use case. In this example, we use the JoltTransformJSON processor in order to transform the original payload into a payload that is expected by our web service. We then use the InvokeHTTP processor in order to gather enrichment data that is relevant to our use case. Other common processors to use in this path include QueryRecord, UpdateRecord, ReplaceText, JoltTransformRecord, and ScriptedTransformRecord. It is also a common use case to transform the response from the web service that is invoked via InvokeHTTP using one or more of these processors.

After the enrichment data has been gathered, it does us little good unless we are able to somehow combine our enrichment data back with our original payload. To achieve this, we use the JoinEnrichment processor. It is responsible for combining records from both the “original” FlowFile and the “enrichment” FlowFile.

The JoinEnrichment Processor is configured with a separate RecordReader for the “original” FlowFile and for the “enrichment” FlowFile. This means that the original data and the enrichment data can have entirely different schemas and can even be in different data formats. For example, our original payload may be CSV data, while our enrichment data is a JSON payload. Because we make use of RecordReaders, this is entirely okay. The Processor also requires a RecordWriter to use for writing out the enriched payload (i.e., the payload that contains the join of both the “original” and the “enrichment” data).

The JoinEnrichment Processor offers different strategies for how to combine the original records with the enrichment data. Each of these is explained here in some detail.

Wrapper

The Wrapper strategy is the default. Each record in the original payload is expected to have a corresponding record in the enrichment payload. The output record will be a record with two fields: original and enrichment. Each of these will contain the nth record from the corresponding FlowFile. For example, if the original FlowFile has the following content:

id, name, age
28021, John Doe, 55
832, Jane Doe, 22
29201, Jake Doe, 23
555, Joseph Doe, 2

And our enrichment FlowFile has the following content:

id, email
28021, john.doe@nifi.apache.org
832, jane.doe@nifi.apache.org
29201, jake.doe@nifi.apache.org

This strategy would produce output that looks like this (assuming a JSON Writer):

[
  {
    "original": {
      "id": 28021,
      "name": "John Doe",
      "age": 55
    },
    "enrichment": {
      "id": 28021,
      "email": "john.doe@nifi.apache.org"
    }
  },
  {
    "original": {
      "id": 832,
      "name": "Jane Doe",
      "age": 22
    },
    "enrichment": {
      "id": 832,
      "email": "jane.doe@nifi.apache.org"
    }
  },
  {
    "original": {
      "id": 29201,
      "name": "Jake Doe",
      "age": 23
    },
    "enrichment": {
      "id": 29201,
      "email": "jake.doe@nifi.apache.org"
    }
  },
  {
    "original": {
      "id": 555,
      "name": "Joseph Doe",
      "age": 2
    },
    "enrichment": null
  }
]

With this strategy, the first record of the original FlowFile is coupled together with the first record of the enrichment FlowFile. The second record of the original FlowFile is coupled together with the second record of the enrichment FlowFile, and so on. If one of the FlowFiles has more records than the other, a null value will be used.

Insert Enrichment Fields

The “Insert Enrichment Fields” strategy inserts all the fields of the “enrichment” record into the original record. The records are correlated by their index in the FlowFile. That is, the first record in the “enrichment” FlowFile is inserted into the first record in the “original” FlowFile. The second record of the “enrichment” FlowFile is inserted into the second record of the “original” FlowFile and so on.

When this strategy is selected, the “Insertion Record Path” property is required. The Record Path is evaluated against the “original” record. Consider, for example, the following content for the “original” FlowFile:

[
  {
    "purchase": {
      "customer": {
        "loyaltyId": 48202,
        "firstName": "John",
        "lastName": "Doe"
      },
      "total": 48.28,
      "items": [
        {
          "itemDescription": "book",
          "price": 24.14,
          "quantity": 2
        }
      ]
    }
  },
  {
    "purchase": {
      "customer": {
        "loyaltyId": 5512,
        "firstName": "Jane",
        "lastName": "Doe"
      },
      "total": 121.44,
      "items": [
        {
          "itemDescription": "book",
          "price": 28.15,
          "quantity": 4
        },
        {
          "itemDescription": "inkpen",
          "price": 4.42,
          "quantity": 2
        }
      ]
    }
  }
]

Joined using the following enrichment content:

[
  {
    "customerDetails": {
      "id": 48202,
      "phone": "555-555-5555",
      "email": "john.doe@nifi.apache.org"
    }
  },
  {
    "customerDetails": {
      "id": 5512,
      "phone": "555-555-5511",
      "email": "jane.doe@nifi.apache.org"
    }
  }
]

Let us then consider that a Record Path is used with a value of “/purchase/customer”. This would yield the following results:

[
  {
    "purchase": {
      "customer": {
        "loyaltyId": 48202,
        "firstName": "John",
        "lastName": "Doe",
        "customerDetails": {
          "id": 48202,
          "phone": "555-555-5555",
          "email": "john.doe@nifi.apache.org"
        }
      },
      "total": 48.28,
      "items": [
        {
          "itemDescription": "book",
          "price": 24.14,
          "quantity": 2
        }
      ]
    }
  },
  {
    "purchase": {
      "customer": {
        "loyaltyId": 5512,
        "firstName": "Jane",
        "lastName": "Doe",
        "customerDetails": {
          "id": 5512,
          "phone": "555-555-5511",
          "email": "jane.doe@nifi.apache.org"
        }
      },
      "total": 121.44,
      "items": [
        {
          "itemDescription": "book",
          "price": 28.15,
          "quantity": 4
        },
        {
          "itemDescription": "inkpen",
          "price": 4.42,
          "quantity": 2
        }
      ]
    }
  }
]

SQL

The SQL strategy provides an important capability that differs from the others, in that it allows for correlating the records in the “original” FlowFile and the records in the “enrichment” FlowFile in ways other than index based. That is, the SQL-based strategy doesn’t necessarily correlate the first record of the original FlowFile with the first record of the enrichment FlowFile. Instead, it allows the records to be correlated using standard SQL JOIN expressions.

A common use case for this is to create a payload to query some web service. The response contains identifiers with additional information for enrichment, but the order of the records in the enrichment may not correspond to the order of the records in the original.

As an example, consider the following original payload, in CSV:

id, name, age
28021, John Doe, 55
832, Jane Doe, 22
29201, Jake Doe, 23
555, Joseph Doe, 2

Additionally, consider the following payload for the enrichment data:

customer_id, customer_email, customer_name, customer_since
555, joseph.doe@nifi.apache.org, Joe Doe, 08/Dec/14
832, jane.doe@nifi.apache.org, Mrs. Doe, 14/Nov/14
28021, john.doe@nifi.apache.org, John Doe, 22/Jan/22

When making use of the SQL strategy, we must provide a SQL SELECT statement to combine both our original data and our enrichment data into a single FlowFile. To do this, we treat our original FlowFile as its own table with the name “original” while we treat the enrichment data as its own table with the name “enrichment”.

Given this, we might combine all the data using a simple query such as:

SELECT o.*, e.*
FROM original o
JOIN enrichment e
ON o.id = e.customer_id

And this would provide the following output:

id, name, age, customer_id, customer_email, customer_name, customer_since
28021, John Doe, 55, 28021, john.doe@nifi.apache.org, John Doe, 22/Jan/22
832, Jane Doe, 22, 832, jane.doe@nifi.apache.org, Mrs. Doe, 14/Nov/14
555, Joseph Doe, 2, 555, joseph.doe@nifi.apache.org, Joe Doe, 08/Dec/14

Note that in this case, the record for Jake Doe was removed because we used a JOIN, rather than an OUTER JOIN. We could instead use a LEFT OUTER JOIN to ensure that we retain all records from the original FlowFile and simply provide null values for any missing records in the enrichment:

SELECT o.*, e.*
FROM original o
LEFT OUTER JOIN enrichment e
ON o.id = e.customer_id

Which would produce the following output:

id, name, age, customer_id, customer_email, customer_name, customer_since
28021, John Doe, 55, 28021, john.doe@nifi.apache.org, John Doe, 22/Jan/22
832, Jane Doe, 22, 832, jane.doe@nifi.apache.org, Mrs. Doe, 14/Nov/14
29201, Jake Doe, 23,,,,
555, Joseph Doe, 2, 555, joseph.doe@nifi.apache.org, Joe Doe, 08/Dec/14

But SQL is far more expressive than this, allowing us to write far more powerful queries. In this case, we probably don’t want both the “id” and “customer_id” fields, or the “name” and “customer_name” fields. Let’s consider, though, that the enrichment provides the customer’s preferred name instead of their legal name. We might want to drop the customer_since column, as it doesn’t make sense for our use case. We might then change our SQL to the following:

SELECT o.id, o.name, e.customer_name AS preferred_name, o.age, e.customer_email AS email
FROM original o
         LEFT OUTER JOIN enrichment e
                         ON o.id = e.customer_id

And this will produce a more convenient output:

id, name, preferred_name, age, email
28021, John Doe, John Doe, 55, john.doe@nifi.apache.org
832, Jane Doe, Mrs. Doe, 22, jane.doe@nifi.apache.org
29201, Jake Doe,, 23,
555, Joseph Doe, Joe Doe, 2, joseph.doe@nifi.apache.org

So we can see tremendous power from the SQL strategy. However, there is a very important consideration that must be taken into account when using the SQL strategy.

WARNING: While the SQL strategy provides great power, it may require significant amounts of heap. Depending on the query, the SQL engine may require buffering the contents of the entire “enrichment” FlowFile in memory, in Java’s heap. Additionally, if the Processor is scheduled with multiple concurrent tasks, each of the tasks may hold the entire contents of the enrichment FlowFile in memory. This can lead to heap exhaustion and cause stability problems or OutOfMemoryErrors to occur.

There are a couple of options that will help to mitigate these concerns.

  1. Split into smaller chunks. It is generally ill-advised to split Record-oriented data into many tiny FlowFiles, as NiFi tends to perform best with larger FlowFiles. The sweet spot for NiFi tends to be around 300 KB to 3 MB in size. So we do not want to break a large FlowFile with 100,000 records into 100,000 FlowFiles each with 1 record. It may be advantageous, though, to break that FlowFile into 100 FlowFiles of 1,000 records each, or 10 FlowFiles of 10,000 records each, before the ForkEnrichment processor. This typically results in a smaller amount of enrichment data so that we don’t need to hold as much in memory.

  2. Before the JoinEnrichment processor, trim the enrichment data to remove any fields that are not desirable. In the example above, we may have used QueryRecord, UpdateRecord, JoltTransformRecord, or updated our schema in order to remove the “customer_since” field from the enrichment dataset. Because we didn’t make use of the field, we could easily remove it before the JoinEnrichment in order to reduce the size of the enrichment FlowFile and thereby reduce the amount of data held in memory.

It is also worth noting that the SQL strategy may result in reordering the records within the FlowFile, so it may be necessary to use an ORDER BY clause, etc. if the ordering is important.
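
For instance, continuing the example above, the query could be extended with an ORDER BY clause to impose a deterministic ordering on the joined records. This is a sketch only; the selected columns and sort key would be adjusted to the data at hand:

SELECT o.id, o.name, o.age, e.customer_email AS email
FROM original o
         LEFT OUTER JOIN enrichment e
                         ON o.id = e.customer_id
ORDER BY o.id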

Additional Memory Considerations

In addition to the warning above about using the SQL Join Strategy, there is another consideration to keep in mind in order to limit the amount of information that this Processor must keep in memory. While the Processor does not store the contents of all FlowFiles in memory, it does hold all FlowFiles’ attributes in memory. As a result, the following points should be kept in mind when using this Processor.

  1. Avoid large attributes. FlowFile attributes should not be used to hold FlowFile content. Attributes are intended to be small, generally on the order of 100-200 characters. If there are any large attributes, it is recommended that they be removed by using the UpdateAttribute Processor before the ForkEnrichment processor.

  2. Avoid large numbers of attributes. While it is important to avoid creating large FlowFile attributes, it is just as important to avoid creating large numbers of attributes. Keeping 30 small attributes on a FlowFile is perfectly fine. Storing 300 attributes, on the other hand, may occupy a significant amount of heap.

  3. Limit backpressure. The JoinEnrichment Processor will pull into its own memory all the incoming FlowFiles. As a result, it will be helpful to avoid providing a huge number of FlowFiles to the Processor at any given time. This can be done by setting the backpressure limits to a smaller value. For example, in our example above, the ForkEnrichment Processor is connected directly to the JoinEnrichment Processor. We may want to limit the backpressure on this connection to 500 or 1,000 instead of the default 10,000. Doing so will limit the number of FlowFiles that are allowed to be loaded into the JoinEnrichment Processor at one time.

More Complex Joining Strategies

This Processor offers several strategies that can be used for correlating data together and joining records from two different FlowFiles into a single FlowFile. However, there are times when users may require more powerful capabilities than what is offered. We might, for example, want to use the information in an enrichment record to determine whether to null out a value in the corresponding original records.

For such use cases, the recommended approach is to make use of the Wrapper strategy or the SQL strategy in order to combine the original and enrichment FlowFiles into a single FlowFile. Then, connect the “joined” relationship of this Processor to the most appropriate processor for further processing the data. For example, consider that we use the Wrapper strategy to produce output that looks like this:

{
  "original": {
    "id": 482028,
    "name": "John Doe",
    "ssn": "555-55-5555",
    "phone": "555-555-5555",
    "email": "john.doe@nifi.apache.org"
  },
  "enrichment": {
    "country": "UK",
    "allowsPII": false
  }
}
json

We might then use the ScriptedTransformRecord processor with a JSON RecordReader and a JSON RecordSetWriter to transform this. Using Groovy, our transformation may look something like this:

import org.apache.nifi.serialization.record.Record

// Unwrap the two records produced by the JoinEnrichment Wrapper strategy
Record original = (Record) record.getValue("original")
Record enrichment = (Record) record.getValue("enrichment")

// If the enrichment data does not explicitly allow PII, null out the sensitive fields
if (Boolean.TRUE != enrichment?.getAsBoolean("allowsPII")) {
    original.setValue("ssn", null)
    original.setValue("phone", null)
    original.setValue("email", null)
}

// Return only the original record, dropping the enrichment wrapper
return original

Which will produce for us the following output:

{
  "id": 482028,
  "name": "John Doe",
  "ssn": null,
  "phone": null,
  "email": null
}

In this way, we have used information from the enrichment record to optionally transform the original record. We then return the original record, dropping the enrichment record altogether. This approach opens up an infinite number of possibilities for transforming the original payload based on the content of the enrichment data that was fetched for it.

JoltTransformJSON

Applies a list of Jolt specifications to the flowfile JSON payload. A new FlowFile is created with transformed content and is routed to the 'success' relationship. If the JSON transform fails, the original FlowFile is routed to the 'failure' relationship.

Tags: json, jolt, transform, shiftr, chainr, defaultr, removr, cardinality, sort

Properties

Jolt Transform

Specifies the Jolt Transformation that should be used with the provided specification.

Jolt Specification

Jolt Specification for transformation of JSON data. The value for this property may be the text of a Jolt specification or the path to a file containing a Jolt specification. 'Jolt Specification' must be set unless the Jolt Sort Transformation is selected, in which case the value is ignored.

Custom Transformation Class Name

Fully Qualified Class Name for Custom Transformation

Custom Module Directory

Comma-separated list of paths to files and/or directories which contain modules containing custom transformations (that are not included on NiFi’s classpath).

Transform Cache Size

Compiling a Jolt Transform can be fairly expensive. Ideally, this will be done only once. However, if the Expression Language is used in the transform, we may need a new Transform for each FlowFile. This value controls how many of those Transforms we cache in memory in order to avoid having to compile the Transform each time.

Pretty Print

Apply pretty print formatting to the output of the Jolt transform

Max String Length

The maximum allowed length of a string value when parsing the JSON document

Relationships

  • success: The FlowFile with transformed content will be routed to this relationship

  • failure: If a FlowFile fails processing for any reason (for example, the FlowFile is not valid JSON), it will be routed to this relationship

Writes Attributes

  • mime.type: Always set to application/json

Input Requirement

This component requires an incoming relationship.

Additional Details

Usage Information

The Jolt utilities that process JSON are not stream-based, so transforming large JSON documents may consume large amounts of memory. Currently, UTF-8 FlowFile content and Jolt specifications are supported. A specification can be defined using Expression Language, where attributes can be referred to on either the left or right hand side within the specification syntax. Custom Jolt Transformations (that implement the Transform interface) are supported. Modules containing custom libraries which do not exist on the current classpath can be included via the Custom Module Directory property. Note: When configuring the processor, if the user selects the Default transformation yet provides a Chain specification, the system does not alert that the specification is invalid and will produce failed FlowFiles. This is a known issue identified within the Jolt library.
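
As a sketch of the Expression Language support, a minimal Chain specification that references a FlowFile attribute on the right hand side might look like the following; the attribute name source.system and the field name ingest_source are purely illustrative:

[
  {
    "operation": "default",
    "spec": {
      "ingest_source": "${source.system}"
    }
  }
]

Here NiFi evaluates ${source.system} against the FlowFile’s attributes before the specification is compiled, so the resulting default operation adds an ingest_source field (when not already present) populated from that attribute. Because the specification then differs per FlowFile, a compiled transform may be needed for each one, which is what the Transform Cache Size property is intended to mitigate.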

JoltTransformRecord

Applies a JOLT specification to each record in the FlowFile payload. A new FlowFile is created with transformed content and is routed to the 'success' relationship. If the transform fails, the original FlowFile is routed to the 'failure' relationship.

Tags: record, jolt, transform, shiftr, chainr, defaultr, removr, cardinality, sort

Properties

Jolt Transform

Specifies the Jolt Transformation that should be used with the provided specification.

Jolt Specification

Jolt Specification for transformation of JSON data. The value for this property may be the text of a Jolt specification or the path to a file containing a Jolt specification. 'Jolt Specification' must be set unless the Jolt Sort Transformation is selected, in which case the value is ignored.

Custom Transformation Class Name

Fully Qualified Class Name for Custom Transformation

Custom Module Directory

Comma-separated list of paths to files and/or directories which contain modules containing custom transformations (that are not included on NiFi’s classpath).

Transform Cache Size

Compiling a Jolt Transform can be fairly expensive. Ideally, this will be done only once. However, if the Expression Language is used in the transform, we may need a new Transform for each FlowFile. This value controls how many of those Transforms we cache in memory in order to avoid having to compile the Transform each time.

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema.

Record Writer

Specifies the Controller Service to use for writing out the records

Relationships

  • success: The FlowFile with transformed content will be routed to this relationship

  • failure: If a FlowFile fails processing for any reason (for example, the FlowFile records cannot be parsed), it will be routed to this relationship

  • original: The original FlowFile that was transformed. If the FlowFile fails processing, nothing will be sent to this relationship

Writes Attributes

  • record.count: The number of records in an outgoing FlowFile

  • mime.type: The MIME Type that the configured Record Writer indicates is appropriate

Input Requirement

This component requires an incoming relationship.

JSLTTransformJSON

Applies a JSLT transformation to the FlowFile JSON payload. A new FlowFile is created with transformed content and is routed to the 'success' relationship. If the JSLT transform fails, the original FlowFile is routed to the 'failure' relationship.

Tags: json, jslt, transform

Properties

JSLT Transformation

JSLT Transformation for transforming JSON data. Any NiFi Expression Language present will be evaluated first to get the final transform to be applied. The JSLT Tutorial provides an overview of supported expressions: https://github.com/schibsted/jslt/blob/master/tutorial.md

Transformation Strategy

Whether to apply the JSLT transformation to the entire FlowFile contents or each JSON object in the root-level array

Pretty Print

Apply pretty-print formatting to the output of the JSLT transform

Transform Cache Size

Compiling a JSLT Transform can be fairly expensive. Ideally, this will be done only once. However, if the Expression Language is used in the transform, we may need a new Transform for each FlowFile. This value controls how many of those Transforms we cache in memory in order to avoid having to compile the Transform each time.

Transform Result Filter

A filter for output JSON results using a JSLT expression. This property supports changing the default filter, which removes JSON objects with null values, empty objects and empty arrays from the output JSON. This JSLT must return true for each JSON object to be included and false for each object to be removed. Using a filter value of "true" disables filtering.

Relationships

  • success: The FlowFile with transformed content will be routed to this relationship

  • failure: If a FlowFile fails processing for any reason (for example, the FlowFile is not valid JSON), it will be routed to this relationship

Writes Attributes

  • mime.type: Always set to application/json

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

JSONataTransformJSON

Applies a JSONata transformation to a FlowFile’s JSON content and routes the transformed FlowFile to the 'success' relationship. To make changes to the FlowFile without changing its content, uncheck the 'Write Output' property. Transformations can be saved in attributes using the function $nfSetAttribute(<name>, <value>) or the 'Write to Attribute' property. If the transformation fails, the original FlowFile is routed to the 'failure' relationship. Click on "Additional Details" to learn about the additional JSONata functions this processor offers.

Tags: virtimo, json, jsonata, transform

Properties

JSONata Transformation

JSONata transformation applied to the JSON content. This processor’s Editor tab offers JSONata syntax highlighting and autocomplete suggestions. Click on "Additional Details" above to learn about the additional JSONata functions this processor offers.

Input Data

The transformation can be executed on the content, on an empty JSON or on the content of an attribute.

Input Attribute

The name of the attribute from which the input data for the transformation is taken

Lookup Service

The lookup service used when calling a $nfLookup(<key>) function.

Write Output

If unchecked, the FlowFile’s JSON content remains unchanged. Disabling this could be useful if you only wish to add attributes with the $nfSetAttribute function or the 'Write to Attribute' property without changing the content.

Pretty Print

Apply pretty-print formatting to the output of the JSONata transform

Write to Attribute

If an attribute name is set here, the complete transformation’s result is written to this attribute.

Relationships

  • success: The FlowFile with transformed content will be routed to this relationship

  • failure: If a FlowFile fails processing for any reason (for example, the FlowFile is not valid JSON), it will be routed to this relationship

Reads Attributes

  • Any: Can read any attribute with the $nfGetAttribute(<name>) function

Writes Attributes

  • mime.type: Always set to application/json

  • Any: Can write to any attribute with the $nfSetAttribute(<name>,<value>) function.

  • <Write to Attribute>: Use this property to write the transformation result in the given attribute.

  • jsonata-error: If the transformation fails, the error message will be written in this attribute.

Input Requirement

This component requires an incoming relationship.

JsonQueryElasticsearch

A processor that allows the user to run a query (with aggregations) written with the Elasticsearch JSON DSL. It does not automatically paginate queries for the user. If an incoming relationship is added to this processor, it will use the flowfile’s content for the query. Care should be taken on the size of the query because the entire response from Elasticsearch will be loaded into memory all at once and converted into the resulting flowfiles.

Tags: elasticsearch, elasticsearch5, elasticsearch6, elasticsearch7, elasticsearch8, query, read, get, json

Properties

Query Definition Style

How the JSON Query will be defined for use by the processor.

Query

A query in JSON syntax, not Lucene syntax. Ex: {"query":{"match":{"somefield":"somevalue"}}}. If this parameter is not set, the query will be read from the flowfile content. If the query (property and flowfile content) is empty, a default empty JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Query Clause

A "query" clause in JSON syntax, not Lucene syntax. Ex: {"match":{"somefield":"somevalue"}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Size

The maximum number of documents to retrieve in the query. If the query is paginated, this "size" applies to each page of the query, not the "size" of the entire result set.

Sort

Sort results by one or more fields, in JSON syntax. Ex: [{"price" : {"order" : "asc", "mode" : "avg"}}, {"post_date" : {"format": "strict_date_optional_time_nanos"}}]

Aggregations

One or more query aggregations (or "aggs"), in JSON syntax. Ex: {"items": {"terms": {"field": "product", "size": 10}}}

Fields

Fields of indexed documents to be retrieved, in JSON syntax. Ex: ["user.id", "http.response.*", {"field": "@timestamp", "format": "epoch_millis"}]

Script Fields

Fields to be created using script evaluation at query runtime, in JSON syntax. Ex: {"test1": {"script": {"lang": "painless", "source": "doc['price'].value * 2"}}, "test2": {"script": {"lang": "painless", "source": "doc['price'].value * params.factor", "params": {"factor": 2.0}}}}

Query Attribute

If set, the executed query will be set on each result flowfile in the specified attribute.

Index

The name of the index to use.

Type

The type of this document (used by Elasticsearch for indexing and searching).

Max JSON Field String Length

The maximum allowed length of a string value when parsing a JSON document or attribute.

Client Service

An Elasticsearch client service to use for running queries.

Search Results Split

Output a flowfile containing all hits or one flowfile for each individual hit.

Search Results Format

Format of Hits output.

Aggregation Results Split

Output a flowfile containing all aggregations or one flowfile for each individual aggregation.

Aggregation Results Format

Format of Aggregation output.

Output No Hits

Output a "hits" flowfile even if no hits found for query. If true, an empty "hits" flowfile will be output even if "aggregations" are output.

Dynamic Properties

The name of a URL query parameter to add

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the query request body

Relationships

  • failure: All flowfiles that fail for reasons unrelated to server availability go to this relationship.

  • aggregations: Aggregations are routed to this relationship.

  • hits: Search hits are routed to this relationship.

  • original: All original flowfiles that don’t cause an error to occur go to this relationship.

Writes Attributes

  • mime.type: application/json

  • aggregation.name: The name of the aggregation whose results are in the output flowfile

  • aggregation.number: The number of the aggregation whose results are in the output flowfile

  • hit.count: The number of hits that are in the output flowfile

  • elasticsearch.query.error: The error message provided by Elasticsearch if there is an error querying the index.

Input Requirement

This component allows an incoming relationship.

Additional Details

This processor is intended for use with the Elasticsearch JSON DSL and Elasticsearch 5.X and newer. It is designed to be able to take a JSON query (e.g. from Kibana) and execute it as-is against an Elasticsearch cluster. Like all processors in the “restapi” bundle, it uses the official Elastic client APIs, so it supports leader detection.

The query JSON to execute can be provided either in the Query configuration property or in the content of the flowfile. If the Query Attribute property is configured, the executed query JSON will be placed in the attribute provided by this property.

Additionally, search results and aggregation results can be split up into multiple flowfiles. Aggregation results will only be split at the top level because nested aggregations lose their context (and thus lose their value) if separated from their parent aggregation. The following is an example query that would be accepted:

{
  "query": {
    "match": {
      "restaurant.keyword": "Local Pizzaz FTW Inc"
    }
  },
  "aggs": {
    "weekly_sales": {
      "date_histogram": {
        "field": "date",
        "interval": "week"
      },
      "aggs": {
        "items": {
          "terms": {
            "field": "product",
            "size": 10
          }
        }
      }
    }
  }
}

ListAzureBlobStorage_v12

Lists blobs in an Azure Blob Storage container. Listing details are attached to an empty FlowFile for use with FetchAzureBlobStorage. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. The processor uses Azure Blob Storage client library v12.

Tags: azure, microsoft, cloud, storage, blob

Properties

Storage Credentials

Controller Service used to obtain Azure Blob Storage Credentials.

Container Name

Name of the Azure storage container. In the case of the PutAzureBlobStorage processor, the container can be created if it does not exist.

Blob Name Prefix

Search prefix for listing

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Listing Strategy

Specify how to determine new/updated entities. See each strategy's description for details.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or in case of a primary node change. The 'Tracking Entities' strategy requires tracking information of all listed entities within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If it tracks listed entities per node, then the optional '::{nodeId}' part is added to manage state separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key will be deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window. For example, if set to '30 minutes', any entity having a timestamp within the last 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity will be removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Minimum File Age

The minimum age that a file must be in order to be pulled; any file younger than this amount of time (according to last modification date) will be ignored

Maximum File Age

The maximum age that a file must be in order to be pulled; any file older than this amount of time (according to last modification date) will be ignored

Minimum File Size

The minimum size that a file must be in order to be pulled

Maximum File Size

The maximum size that a file can be in order to be pulled

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • azure.container: The name of the Azure Blob Storage container

  • azure.blobname: The name of the blob on Azure Blob Storage

  • azure.primaryUri: Primary location of the blob

  • azure.etag: ETag of the blob

  • azure.blobtype: Type of the blob (either BlockBlob, PageBlob or AppendBlob)

  • mime.type: MIME Type of the content

  • lang: Language code for the content

  • azure.timestamp: Timestamp of the blob

  • azure.length: Length of the blob

Stateful

Scope: Cluster

After performing a listing of blobs, the timestamp of the newest blob is stored if 'Tracking Timestamps' Listing Strategy is in use (by default). This allows the Processor to list only blobs that have been added or modified after this date the next time that the Processor is run. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.

Input Requirement

This component does not allow an incoming relationship.

ListAzureDataLakeStorage

Lists directory in an Azure Data Lake Storage Gen 2 filesystem

Tags: azure, microsoft, cloud, storage, adlsgen2, datalake

Properties

ADLS Credentials

Controller Service used to obtain Azure Credentials.

Filesystem Name

Name of the Azure Storage File System (also called Container). It is assumed to already exist.

Directory Name

Name of the Azure Storage Directory. The Directory Name cannot contain a leading '/'. The root directory can be designated by the empty string value. In the case of the PutAzureDataLakeStorage processor, the directory will be created if it does not already exist.

Recurse Subdirectories

Indicates whether to list files from subdirectories of the directory

File Filter

Only files whose names match the given regular expression will be listed

Path Filter

When 'Recurse Subdirectories' is true, then only subdirectories whose paths match the given regular expression will be scanned

Include Temporary Files

Whether to include temporary files when listing the contents of configured directory paths.

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Listing Strategy

Specify how to determine new/updated entities. See each strategy's description for details.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or in case of a primary node change. The 'Tracking Entities' strategy requires tracking information of all listed entities within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If it tracks listed entities per node, then the optional '::{nodeId}' part is added to manage state separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key will be deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window. For example, if set to '30 minutes', any entity having a timestamp within the last 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity will be removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Minimum File Age

The minimum age that a file must be in order to be pulled; any file younger than this amount of time (according to last modification date) will be ignored

Maximum File Age

The maximum age that a file must be in order to be pulled; any file older than this amount of time (according to last modification date) will be ignored

Minimum File Size

The minimum size that a file must be in order to be pulled

Maximum File Size

The maximum size that a file can be in order to be pulled

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • azure.filesystem: The name of the Azure File System

  • azure.filePath: The full path of the Azure File

  • azure.directory: The name of the Azure Directory

  • azure.filename: The name of the Azure File

  • azure.length: The length of the Azure File

  • azure.lastModified: The last modification time of the Azure File

  • azure.etag: The ETag of the Azure File

Stateful

Scope: Cluster

After performing a listing of files, the timestamp of the newest file is stored. This allows the Processor to list only files that have been added or modified after this date the next time that the Processor is run. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.

Input Requirement

This component does not allow an incoming relationship.

ListBoxFile

Lists files in a Box folder. Each listed file may result in one FlowFile, the metadata being written as FlowFile attributes. Or - in case the 'Record Writer' property is set - the entire result is written as records to a single FlowFile. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

Tags: box, storage

Properties

Box Client Service

Controller Service used to obtain a Box API connection.

Folder ID

The ID of the folder from which to pull list of files.

Search Recursively

When 'true', the listing will include files from sub-folders. Otherwise, only files that are within the folder defined by the 'Folder ID' property will be returned.

Minimum File Age

The minimum age a file must be in order to be considered; any files younger than this will be ignored.

Listing Strategy

Specify how to determine new/updated entities. See each strategy's description for details.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or in case of a primary node change. The 'Tracking Entities' strategy requires tracking information of all listed entities within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If it tracks listed entities per node, then the optional '::{nodeId}' part is added to manage state separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key will be deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window. For example, if set to '30 minutes', any entity having a timestamp within the last 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity will be removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • box.id: The id of the file

  • filename: The name of the file

  • path: The folder path where the file is located

  • box.size: The size of the file

  • box.timestamp: The last modified time of the file

Stateful

Scope: Cluster

The processor stores necessary data to be able to keep track of what files have been listed already. What exactly needs to be stored depends on the 'Listing Strategy'.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

List Box folders in NiFi
  1. Find Folder ID

    • Navigate to the folder to be listed in Box and enter it. The URL in your browser will include the ID at the end of the URL. For example, if the URL were https://app.box.com/folder/191632099757, the Folder ID would be 191632099757

  2. Set Folder ID in ‘Folder ID’ property

ListDatabaseTables

Generates a set of flow files, each containing attributes corresponding to metadata about a table from a database connection. Once metadata about a table has been fetched, it will not be fetched again until the Refresh Interval (if set) has elapsed, or until state has been manually cleared.

Multi-Processor Use Cases

Perform a full load of a database, retrieving all rows from all tables, or a specific set of tables.

Keywords: full load, rdbms, jdbc, database

ListDatabaseTables:

  1. Configure the "Database Connection Pooling Service" property to specify a Connection Pool that is applicable for interacting with your database.

  2. Leave the RecordWriter property unset.

  3. Set the "Catalog" property to the name of the database Catalog; leave it empty to include all catalogs.

  4. Set the "Schema Pattern" property to a Java Regular Expression that matches all database Schemas that should be included; leave it empty to include all Schemas.

  5. Set the "Table Name Pattern" property to a Java Regular Expression that matches the names of all tables that should be included; leave it empty to include all Tables. .

  6. Connect the "success" relationship to GenerateTableFetch. .

GenerateTableFetch:

  1. Configure the "Database Connection Pooling Service" property to specify the same Connection Pool that was used in ListDatabaseTables.

  2. Set the "Database Type" property to match the appropriate value for your RDBMS vendor.

  3. Set "Table Name" to ${db.table.fullname}

  4. Leave the RecordWriter property unset.

  5. Connect the "success" relationship to ExecuteSQLRecord. .

ExecuteSQLRecord:

  1. Configure the "Database Connection Pooling Service" property to specify the same Connection Pool that was used in ListDatabaseTables.

  2. Configure the "Record Writer" property to specify a Record Writer that is appropriate for the desired output data type.

  3. Leave the "SQL select query" unset. .

  4. Connect the "success" relationship to the next Processor in the flow. .

Tags: sql, list, jdbc, table, database

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain connection to database

Catalog

The name of a catalog from which to list database tables. The name must match the catalog name as it is stored in the database. If the property is not set, the catalog name will not be used to narrow the search for tables. If the property is set to an empty string, tables without a catalog will be listed.

Schema Pattern

A pattern for matching schemas in the database. Within a pattern, "%" means match any substring of 0 or more characters, and "_" means match any one character. The pattern must match the schema name as it is stored in the database. If the property is not set, the schema name will not be used to narrow the search for tables. If the property is set to an empty string, tables without a schema will be listed.

Table Name Pattern

A pattern for matching tables in the database. Within a pattern, "%" means match any substring of 0 or more characters, and "_" means match any one character. The pattern must match the table name as it is stored in the database. If the property is not set, all tables will be retrieved.

Table Types

A comma-separated list of table types to include. For example, some databases support TABLE and VIEW types. If the property is not set, tables of all types will be returned.

Include Count

Whether to include the table’s row count as a flow file attribute. This affects performance as a database query will be generated for each table in the retrieved list.

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Refresh Interval

The amount of time to elapse before resetting the processor state, thereby causing all current tables to be listed. During this interval, the processor may continue to run, but tables that have already been listed will not be re-listed. However, new/added tables will be listed as the processor runs. A value of zero means the state will never be automatically reset; the user must clear the state manually.

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • db.table.name: Contains the name of a database table from the connection

  • db.table.catalog: Contains the name of the catalog to which the table belongs (may be null)

  • db.table.schema: Contains the name of the schema to which the table belongs (may be null)

  • db.table.fullname: Contains the fully-qualified table name (possibly including catalog, schema, etc.)

  • db.table.type: Contains the type of the database table from the connection. Typical types are "TABLE", "VIEW", "SYSTEM TABLE", "GLOBAL TEMPORARY", "LOCAL TEMPORARY", "ALIAS", "SYNONYM"

  • db.table.remarks: Contains the remarks of the database table from the connection

  • db.table.count: Contains the number of rows in the table

Stateful

Scope: Cluster

After performing a listing of tables, the timestamp of the query is stored. This allows the Processor to not re-list tables the next time that the Processor is run. Specifying the refresh interval in the processor properties will indicate that when the processor detects the interval has elapsed, the state will be reset and tables will be re-listed as a result. This processor is meant to be run on the primary node only.

Input Requirement

This component does not allow an incoming relationship.

ListDropbox

Retrieves a listing of files from Dropbox (shortcuts are ignored). Each listed file may result in one FlowFile, the metadata being written as FlowFile attributes. When the 'Record Writer' property is set, the entire result is written as records to a single FlowFile. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

Tags: dropbox, storage

Properties

Dropbox Credential Service

Controller Service used to obtain Dropbox credentials (App Key, App Secret, Access Token, Refresh Token). See controller service’s Additional Details for more information.

Folder

The Dropbox identifier or path of the folder from which to pull the list of files. 'Folder' should match the following regular expression pattern: /.*|id:.* . Example for a folder identifier: id:odTlUvbpIEAAAAAAAAAGGQ. Example for a folder path: /Team1/Task1.
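As a quick illustration of the pattern above (a sketch only; the processor validates this property itself), the following Python check accepts both example values:

import re

# Illustrative only: the documented 'Folder' pattern accepts either a path
# starting with '/' or a Dropbox identifier starting with 'id:'.
FOLDER_PATTERN = re.compile(r"/.*|id:.*")

for value in ("id:odTlUvbpIEAAAAAAAAAGGQ", "/Team1/Task1", "Team1/Task1"):
    print(value, bool(FOLDER_PATTERN.fullmatch(value)))
# id:odTlUvbpIEAAAAAAAAAGGQ True
# /Team1/Task1 True
# Team1/Task1 False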

Search Recursively

Indicates whether to list files from subfolders of the Dropbox folder.

Minimum File Age

The minimum age a file must be in order to be considered; any files newer than this will be ignored.

Listing Strategy

Specify how to determine new/updated entities. See each strategy’s description for details.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or in case of a primary node change. The 'Tracking Entities' strategy requires tracking information for all entities listed within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses a DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If entities are tracked per node, the optional '::{nodeId}' part is added so that state is managed separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key will be deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.
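For reference, the cache key format described above can be sketched as follows (illustrative only; the processor builds this key internally):

# A minimal sketch of the documented cache key format
def cache_key(processor_id, node_id=None):
    key = f"ListedEntities::{processor_id}"
    if node_id is not None:          # per-node tracking appends '::{nodeId}'
        key += f"::{node_id}"
    return key

print(cache_key("8dda2321-0164-1000-50fa-3042fe7d6a7b"))                 # cluster-wide key
print(cache_key("8dda2321-0164-1000-50fa-3042fe7d6a7b", "nifi-node3"))   # per-node key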

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp falls inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp in the most recent 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity is removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.
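The three conditions above can be summarized in a short sketch (illustrative pseudologic only, not the processor’s actual implementation; the dictionary field names are assumptions):

def is_new_or_updated(listed, cached):
    # 1. the entity does not exist in the already-listed entities
    if cached is None:
        return True
    # 2. the entity has a newer timestamp than the cached entity
    if listed["timestamp"] > cached["timestamp"]:
        return True
    # 3. the entity has a different size than the cached entity
    return listed["size"] != cached["size"]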

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Proxy Configuration Service

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • dropbox.id: The Dropbox identifier of the file

  • path: The folder path where the file is located

  • filename: The name of the file

  • dropbox.size: The size of the file

  • dropbox.timestamp: The server modified time of the file

  • dropbox.revision: Revision of the file

Stateful

Scope: Cluster

The processor stores necessary data to be able to keep track what files have been listed already. What exactly needs to be stored depends on the 'Listing Strategy'.

Input Requirement

This component does not allow an incoming relationship.

ListenBPCFlowStarter

Listens to requests sent from a Virtimo Business Process Center (BPC) and creates a FlowFile for each request with the request’s body in its content.

Tags: virtimo, bpc

Properties

Listener Identifier

The identifier of this listener. Must be unique among listeners using the same BPC Listener Controller.

BPC Listener Controller

This processor listens to requests sent to the BPC Listener Controller whose request-URI matches this processor’s 'Listener Identifier'.

Name

The name of the flow that appears in the connected Virtimo product.

Description

The description of the flow that appears in the connected Virtimo product.

Load the BPC Session

If checked, loads the complete session from the BPC at every call.

BPC Controller

Controller used to define the connection to the BPC. This is used to load the complete session object of the user. This does not use the authentication configured on the controller, but the cookie passed to the ListenBPCFlowStarter.

Relationships

  • success: All content that is received is routed to the 'success' relationship.

Writes Attributes

  • hybrid.context: An identifier later used by a ListenBPCResponder to respond to the message.

Input Requirement

This component does not allow an incoming relationship.

ListenBPCResponder

Responds to/terminates a call received from a ListenBPCFlowStarter processor using incoming FlowFiles. The response’s body is inserted from the FlowFile’s content, while headers can be inserted from the FlowFile’s attributes.

Tags: virtimo, bpc

Properties

BPC Listener Controller

Controller used to define a Virtimo BPC Listener.

HTTP Status Code

The HTTP Status Code to use when responding to the HTTP Request. See Section 10 of RFC 2616 for more information.

Define response headers

Add FlowFile attributes to the HTTP response as headers by using Regular Expression to specify which attributes to add. Each header’s name and value will be equivalent to the matching attributes.

Dynamic Properties

Header name

Adds a header to the response with the given name and value.

Relationships

  • success: FlowFiles will be routed to this Relationship after the response has been successfully sent to the requester.

  • failure: FlowFiles will be routed to this Relationship if the Processor is unable to respond to the requester. This may happen, for instance, if the connection times out or if NiFi is restarted before responding to the request.

Reads Attributes

  • <Define response headers>: Use this property to read attributes and add them to the HTTP response

  • hybrid.context: An identifier written by a ListenBPCFlowStarter for this processor to respond to the correct call.

Input Requirement

This component requires an incoming relationship.

ListenCloud

Used in IGUASU Gateway (running on an on-prem system) for receiving requests sent from IGUASU (running on the Virtimo Cloud). For each request, a FlowFile with the content and attributes of the request is created and routed via the success relationship. Requests from IGUASU are sent using an InvokeOnPrem-processor, while responses are sent back using a RespondCloud-processor. Data is securely sent in either direction using a standing WSS connection initiated by IGUASU Gateway, meaning the on-prem system does not need to open any external ports.

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

Listener Identifier

The identifier of this listener. Must be unique among listeners using the same Client Controller.

Client Controller

Controller for connecting to a Virtimo system and enabling WSS communication.

Name

The name of the flow that appears in the connected Virtimo product.

Description

The description of the flow that appears in the connected Virtimo product.

Relationships

  • success: All content that is received is routed to the 'success' relationship.

Reads Attributes

  • hybrid.listener: Used to direct requests to this listener.

  • hybrid.request: Used to match responses with the corresponding request.

Writes Attributes

  • hybrid.response: Writes '503' if any of this processor’s success connections have reached their back-pressure limit, in which case a response will be sent back immediately without processing the request.

  • hybrid.error: Writes an explanation of any error that occurred while receiving the request.

Input Requirement

This component does not allow an incoming relationship.

ListenFTP

Starts an FTP server that listens on the specified port and transforms incoming files into FlowFiles. The URI of the service will be ftp://{hostname}:{port}. The default port is 2221.

Tags: ingest, FTP, FTPS, listen

Properties

Bind Address

The address the FTP server should be bound to. If not set (or set to 0.0.0.0), the server binds to all available addresses (i.e. all network interfaces of the host machine).

Listening Port

The Port to listen on for incoming connections. On Linux, root privileges are required to use port numbers below 1024.

Username

The name of the user that is allowed to log in to the FTP server. If a username is provided, a password must also be provided. If no username is specified, anonymous connections will be permitted.

Password

If the Username is set, then a password must also be specified. The password provided by the client trying to log in to the FTP server will be checked against this password.

SSL Context Service

Specifies the SSL Context Service that can be used to create secure connections. If an SSL Context Service is selected, then a keystore file must also be specified in the SSL Context Service. Without a keystore file, the processor cannot be started successfully. Specifying a truststore file is optional. If a truststore file is specified, client authentication is required (the client needs to send a certificate to the server). Regardless of the selected TLS protocol, the highest available protocol is used for the connection. For example if NiFi is running on Java 11 and TLSv1.2 is selected in the controller service as the preferred TLS Protocol, TLSv1.3 will be used (regardless of TLSv1.2 being selected) because Java 11 supports TLSv1.3.

Relationships

  • success: Relationship for successfully received files.

Writes Attributes

  • filename: The name of the file received via the FTP/FTPS connection.

  • path: The path pointing to the file’s target directory. E.g. if file.txt is uploaded to /Folder1/SubFolder, then the value of the path attribute will be "/Folder1/SubFolder/" (note that it ends with a separator character).

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Usage Description

By starting the processor, an FTP server is started that listens for incoming connections on the specified port. Each file copied to this FTP server gets converted into a FlowFile and transferred to the next processor via the ListenFTP processor’s ‘success’ relationship.

Before starting the processor, the following properties can be set:

  • Bind Address: if not set, the FTP server binds to all network interfaces of the host machine (this is the default). If set to a valid address, the server is only available on that specific address.

  • Listening Port: the port on which the server listens for incoming connections. Root privileges are required on Linux to be able to use port numbers below 1024.

  • Username and Password: Either both of them need to be set, or none of them. If set, the FTP server only allows users to log in with the username-password pair specified in these properties. If the Username and Password properties are left blank, the FTP server allows anonymous connections, meaning that the client can connect to the FTP server by providing ‘anonymous’ as username, and leaving the password field blank. Setting empty string as the value of these properties is not permitted, and doing so results in the processor becoming invalid.

  • SSL Context Service: a Controller Service can optionally be specified that provides the ability to configure keystore and/or truststore properties. When not specified, the FTP server does not use encryption. By specifying an SSL Context Service, the FTP server started by this processor is set to use Transport Layer Security (TLS) over FTP (FTPS).
    If an SSL Context Service is selected, then a keystore file must also be specified in the SSL Context Service. Without a keystore file, the processor cannot be started successfully.
    Specifying a truststore file is optional. If a truststore file is specified, client authentication is required (the client needs to send a certificate to the server).
    Regardless of the selected TLS protocol, the highest available protocol is used for the connection. For example if NiFi is running on Java 11 and TLSv1.2 is selected in the controller service as the preferred TLS protocol, TLSv1.3 will be used (regardless of TLSv1.2 being selected) because Java 11 supports TLSv1.3.

After starting the processor and connecting to the FTP server, an empty root directory is visible in the client application. Folders can be created in and deleted from the root directory and any of its subdirectories. Files can be uploaded to any directory. Uploaded files do not show in the content list of directories, since files are not actually stored on this FTP server, but converted into FlowFiles and transferred to the next processor via the ‘success’ relationship. It is not possible to download or delete files like on a regular FTP server.
All the folders (including the root directory) are virtual directories, meaning that they only exist in memory and do not get created in the file system of the host machine. Also, these directories are not persisted: by restarting the processor all the directories (except for the root directory) get removed. Uploaded files do not get removed by restarting the processor, since they are not stored on the FTP server, but transferred to the next processor as FlowFiles.
When a file named for example text01.txt is uploaded to the target folder /MyDirectory/MySubdirectory, a FlowFile gets created. The content of the FlowFile is the same as the content of text01.txt, the ‘filename’ attribute of the FlowFile contains the name of the original file (text01.txt) and the ‘path’ attribute of the flowfile contains the path where the file was uploaded (/MyDirectory/MySubdirectory/).
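The upload described above can be reproduced with any FTP client. The following Python sketch uses the standard library ftplib and assumes ListenFTP is running on localhost with the default port 2221 and a configured user 'nifi' with password 'secret' (both credential values are assumptions):

from ftplib import FTP

ftp = FTP()
ftp.connect("localhost", 2221)
ftp.login("nifi", "secret")              # or ftp.login() for anonymous access
ftp.mkd("MyDirectory")                   # virtual directory, held in memory only
ftp.cwd("MyDirectory")
with open("text01.txt", "rb") as f:
    ftp.storbinary("STOR text01.txt", f)   # becomes a FlowFile on 'success'
ftp.quit()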

The list of the FTP commands that are supported by the FTP server is available by starting the processor and issuing the ‘HELP’ command to the server from an FTP client application.

ListenHTTP

Starts an HTTP Server and listens on a given base path to transform incoming requests into FlowFiles. The default URI of the Service will be http://{hostname}:{port}/contentListener. Only HEAD and POST requests are supported. GET, PUT, DELETE, OPTIONS and TRACE will result in an error and the HTTP response status code 405; CONNECT will also result in an error and the HTTP response status code 400. GET is supported on <service_URI>/healthcheck. If the service is available, it returns "200 OK" with the content "OK". The health check functionality can be configured to be accessible via a different port. For details see the documentation of the "Listening Port for health check requests" property. A Record Reader and Record Writer property can be enabled on the processor to process incoming requests as records. Record processing is not allowed for multipart requests and requests in FlowFileV3 format (minifi).
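As a quick illustration (a sketch, assuming ListenHTTP listens on localhost:8081 with the default base path contentListener and that the third-party requests package is available):

import requests

# POST content to the listener; the request body becomes the FlowFile content.
resp = requests.post(
    "http://localhost:8081/contentListener",
    data=b"hello nifi",
    headers={"Content-Type": "text/plain"},
)
print(resp.status_code)   # the configured Return Code

# GET the health check endpoint; returns "200 OK" with the content "OK".
print(requests.get("http://localhost:8081/contentListener/healthcheck").text)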

Use Cases

Unpack FlowFileV3 content received in a POST

Notes: POST requests with "Content-Type: application/flowfile-v3" will have their payload interpreted as FlowFileV3 format and will be automatically unpacked. This will output the original FlowFile(s) from within the FlowFileV3 format and will not require a separate UnpackContent processor.

Keywords: flowfile, flowfilev3, unpack

Input Requirement: This component allows an incoming relationship.

  1. This feature of ListenHTTP is always on; no configuration is required.

  2. The MergeContent and PackageFlowFile processors can generate FlowFileV3-formatted data.

Tags: ingest, http, https, rest, listen

Properties

Base Path

Base path for incoming connections

Listening Port

The Port to listen on for incoming connections

Listening Port for Health Check Requests

The port to listen on for incoming health check requests. If set, it must be different from the Listening Port. Configure this port if the processor is set to use two-way SSL and a load balancer that does not support client authentication for health check requests is used. Only /<base_path>/healthcheck service is available via this port and only GET and HEAD requests are supported. If the processor is set not to use SSL, SSL will not be used on this port, either. If the processor is set to use one-way SSL, one-way SSL will be used on this port. If the processor is set to use two-way SSL, one-way SSL will be used on this port (client authentication not required).

Max Data to Receive per Second

The maximum amount of data to receive per second; this allows the bandwidth to be throttled to a specified data rate; if not specified, the data rate is not throttled

SSL Context Service

SSL Context Service enables support for HTTPS

HTTP Protocols

HTTP Protocols supported for Application Layer Protocol Negotiation with TLS

Client Authentication

Client Authentication policy for TLS connections. Required when an SSL Context Service is configured.

Authorized Subject DN Pattern

A Regular Expression to apply against the Subject’s Distinguished Name of incoming connections. If the Pattern does not match the Subject DN, the processor will respond with a status of HTTP 403 Forbidden.

Authorized Issuer DN Pattern

A Regular Expression to apply against the Issuer’s Distinguished Name of incoming connections. If the Pattern does not match the Issuer DN, the processor will respond with a status of HTTP 403 Forbidden.

Max Unconfirmed Flowfile Time

The maximum amount of time to wait for a FlowFile to be confirmed before it is removed from the cache

HTTP Headers to receive as Attributes (Regex)

Specifies the Regular Expression that determines the names of HTTP Headers that should be passed along as FlowFile attributes

Request Header Maximum Size

The maximum supported size of HTTP headers in requests sent to this processor

Return Code

The HTTP return code returned after every HTTP call

Multipart Request Max Size

The maximum size of the request. Only applies to requests with Content-Type: multipart/form-data, and is used to prevent denial-of-service attacks from filling up the heap or disk space.

Multipart Read Buffer Size

The threshold size at which the contents of an incoming file will be written to disk. Only applies to requests with Content-Type: multipart/form-data. It is used to prevent denial-of-service attacks from filling up the heap or disk space.

Maximum Thread Pool Size

The maximum number of threads to be used by the embedded Jetty server. The value can be set between 8 and 1000. The value of this property affects the performance of the flows and the operating system, therefore the default value should only be changed in justified cases. A value that is less than the default value may be suitable if only a small number of HTTP clients connect to the server. A greater value may be suitable if a large number of HTTP clients are expected to make requests to the server simultaneously.

Record Reader

The Record Reader to use for parsing the incoming FlowFile into Records

Record Writer

The Record Writer to use for serializing Records after they have been transformed

Relationships

  • success: Relationship for successfully received FlowFiles

Input Requirement

This component does not allow an incoming relationship.

ListenInubit

Used in IGUASU for receiving requests sent from an (on-prem) INUBIT system. For each request, a FlowFile with the content and attributes of the request is created and routed via the success relationship. Requests from INUBIT are sent using an IGUASU-Connector, while responses are sent back using a RespondInubit-processor. Data is securely sent in either direction using a standing WSS connection initiated by INUBIT, meaning the INUBIT system does not need to open any external ports.

Tags: hybrid, cloud, onprem, virtimo, websocket, inubit

Properties

Listener Identifier

The identifier of this listener. Must be unique among listeners using the same Server Controller.

Server Controller

Controller that lets other Virtimo systems connect and enables WSS communication.

Name

The name of the flow that appears in the connected Virtimo product.

Description

The description of the flow that appears in the connected Virtimo product.

Relationships

  • success: All content that is received is routed to the 'success' relationship.

Reads Attributes

  • hybrid.listener: Used to direct requests to this listener.

  • hybrid.request: Used to match responses with the corresponding request.

Writes Attributes

  • hybrid.response: Writes '503' if any of this processor’s success connections have reached their back-pressure limit, in which case a response will be sent back immediately without processing the request.

  • hybrid.error: Writes an explanation of any error that occurred while receiving the request.

Input Requirement

This component does not allow an incoming relationship.

ListenOnPrem

Used in IGUASU (running on the Virtimo Cloud) for receiving requests sent from IGUASU Gateway (running on an on-prem system). For each request, a FlowFile with the content and attributes of the request is created and routed via the success relationship. Requests from IGUASU Gateway are sent using an InvokeCloud-processor, while responses are sent back using a RespondOnPrem-processor. Data is securely sent in either direction using a standing WSS connection initiated by IGUASU Gateway, meaning the on-prem system does not need to open any external ports.

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

Listener Identifier

The identifier of this listener. Must be unique among listeners using the same Server Controller.

Server Controller

Controller that lets other Virtimo systems connect and enables WSS communication.

Name

The name of the flow that appears in the connected Virtimo product.

Description

The description of the flow that appears in the connected Virtimo product.

Relationships

  • success: All content that is received is routed to the 'success' relationship.

Reads Attributes

  • hybrid.listener: Used to direct requests to this listener.

  • hybrid.request: Used to match responses with the corresponding request.

Writes Attributes

  • hybrid.response: Writes '503' if any of this processor’s success connections have reached their back-pressure limit, in which case a response will be sent back immediately without processing the request.

  • hybrid.error: Writes an explanation of any error that occurred while receiving the request.

Input Requirement

This component does not allow an incoming relationship.

ListenOTLP

Collect OpenTelemetry messages over HTTP or gRPC. Supports standard Export Service Request messages for logs, metrics, and traces. Implements OpenTelemetry OTLP Specification 1.0.0 with OTLP/gRPC and OTLP/HTTP. Provides protocol detection using the HTTP Content-Type header.
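As a quick illustration of an OTLP/HTTP export (a sketch; the port 4318 and the standard OTLP/HTTP logs path /v1/logs are assumptions based on the OTLP specification, and the payload is trimmed to a single log record):

import requests

# Trimmed OTLP/HTTP JSON logs export following the OpenTelemetry JSON encoding.
payload = {
    "resourceLogs": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "demo-service"}}
        ]},
        "scopeLogs": [{"logRecords": [
            {"body": {"stringValue": "hello from OTLP"}}
        ]}],
    }]
}

resp = requests.post(
    "http://localhost:4318/v1/logs",
    json=payload,
    headers={"Content-Type": "application/json"},  # drives protocol detection
)
print(resp.status_code)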

Tags: OpenTelemetry, OTel, OTLP, telemetry, metrics, traces, logs

Properties

Address

Internet Protocol Address on which to listen for OTLP Export Service Requests. The default value enables listening on all addresses.

Port

TCP port number on which to listen for OTLP Export Service Requests over HTTP and gRPC

SSL Context Service

SSL Context Service enables TLS communication for HTTPS

Client Authentication

Client authentication policy for TLS communication with HTTPS

Worker Threads

Number of threads responsible for decoding and queuing incoming OTLP Export Service Requests

Queue Capacity

Maximum number of OTLP request resource elements that can be received and queued

Batch Size

Maximum number of OTLP request resource elements included in each FlowFile produced

Relationships

  • success: Export Service Requests containing OTLP Telemetry

Writes Attributes

  • mime.type: Content-Type set to application/json

  • resource.type: OpenTelemetry Resource Type: LOGS, METRICS, or TRACES

  • resource.count: Count of resource elements included in messages

Input Requirement

This component does not allow an incoming relationship.

ListenSlack

Retrieves real-time messages or Slack commands from one or more Slack conversations. The messages are written out in JSON format. Note that this Processor should be used to obtain real-time messages and commands from Slack and does not provide a mechanism for obtaining historical messages. The ConsumeSlack Processor should be used for an initial load of messages from a channel. See Usage / Additional Details for more information about how to configure this Processor and enable it to retrieve messages and commands from Slack.

Tags: slack, real-time, event, message, command, listen, receive, social media, team, text, unstructured

Properties

App Token

The Application Token that is registered to your Slack application

Bot Token

The Bot Token that is registered to your Slack application

Event Type to Receive

Specifies the type of Event that the Processor should respond to

Resolve User Details

Specifies whether the Processor should lookup details about the Slack User who sent the received message. If true, the output JSON will contain an additional field named 'userDetails'. The 'user' field will still contain the ID of the user. In order to enable this capability, the Bot Token must be granted the 'users:read' and optionally the 'users.profile:read' Bot Token Scope. If the rate limit is exceeded when retrieving this information, the received message will be rejected and must be re-delivered.

Relationships

  • success: All FlowFiles that are created will be sent to this Relationship.

Writes Attributes

  • mime.type: Set to application/json, as the output will always be in JSON format

Input Requirement

This component does not allow an incoming relationship.

See Also

Additional Details

Description:

ListenSlack allows for receiving real-time messages and commands from Slack using Slack’s Events API. This Processor does not provide any capabilities for retrieving historical messages. However, the ConsumeSlack Processor provides the ability to do so. This Processor is generally used when implementing a bot in NiFi, or when it is okay to lose messages in the case that NiFi or this Processor is stopped for more than 5 minutes.

This Processor may be used to listen for Message Events, App Mention Events (when the bot user is mentioned in a message) or Slack Commands. For example, you may wish to create a Slack App that receives the /nifi command and, when received, performs some task. The Processor does not allow listening for both Message Events and Commands, as the output format is very different for the two, and this would lead to significant confusion. Instead, if there is a desire to consume both Message Events and Commands, two ListenSlack Processors should be used: one for Messages and another for Commands.

Note that unlike the ConsumeSlack Processor, ListenSlack does not require that a Channel name or ID be provided. This is because the Processor listens for Events/Commands from all channels (and “channel-like” conversations) that the application has been added to.

Slack Setup

In order to use this Processor, a Slack App must be created and installed in your Slack workspace. Additionally, the App must have Socket Mode enabled. Please see Slack’s documentation for the latest information on how to create an Application and install it into your workspace.

At the time of this writing, the following steps may be used to create a Slack App with the necessary scopes. However, these instructions are subject to change at any time, so it is best to read through Slack’s Quickstart Guide.

  • Create a Slack App. Click here to get started. From here, click the “Create New App” button and choose “From scratch.” Give your App a name and choose the workspace that you want to use for developing the app.

  • Creating your app will take you to the configuration page for your application. For example, https://api.slack.com/apps/<APP_IDENTIFIER>. From here, click on “Socket Mode” and flip the toggle for “Enable Socket Mode.” Accept the default scope and apply the changes. From here, click on “Event Subscriptions.”

  • Flip the toggle to turn on “Enable Events.” In the “Subscribe to bot events” section, add the following Bot User Events: app_mention, message.channels, message.groups, message.im, message.mpim. Click “Save Changes” at the bottom of the screen.

  • Click on the “OAuth & Permissions” link on the left-hand side. Under the “OAuth Tokens for Your Workspace” section, click the “Install to Workspace” button. This will prompt you to allow the application to be added to your workspace, if you have the appropriate permissions. Otherwise, it will generate a notification for a Workspace Owner to approve the installation. Additionally, it will generate a “Bot User OAuth Token”.

  • The Bot must then be enabled for each Channel that you would like to consume messages from. In order to do that, in the Slack application, go to the Channel that you would like to consume from and press /. Choose the Add apps to this channel option, and add the Application that you created as a Bot to the channel.

  • Additionally, if you would like your Bot to receive commands, navigate to the “Slash Commands” section on the left-hand side. Create a New Command and complete the form. If you have already installed the app in a workspace, you will need to re-install your app at this time, in order for the changes to take effect. You should be prompted to do so with a link at the top of the page. Now, whenever a user is in a channel with your App installed, the user may send a command. For example, if you configured your command to be /nifi then a user can trigger your bot to receive the command by simply typing /nifi followed by some text. If your Processor is running, it will receive the command and output it. Otherwise, the user will receive an error.

Configuring the Tokens

Now that your Slack Application has been created and configured, you will need to provide the ListenSlack Processor with two tokens: the App Token and the Bot token. To get the App Token, go to your Slack Application’s configuration page. On the left-hand side, navigate to “Basic Information.” Scroll down to “App-Level Tokens” and click on the token that you created in the Slack Setup section above. This will provide you with a pop-up showing your App Token. Click the “Copy” button and paste the value into your Processor’s configuration. Then click “Done” to close the popup.

To obtain your Bot Token, again in the Slack Application’s configuration page, navigate to the “OAuth & Permissions” section on the left-hand side. Under the “OAuth Tokens for Your Workspace” section, click the “Copy” button under the “Bot User OAuth Token” and paste this into your NiFi Processor’s configuration.

ListenSyslog

Listens for Syslog messages being sent to a given port over TCP or UDP. Incoming messages are checked against regular expressions for RFC5424 and RFC3164 formatted messages. The format of each message is: (<PRIORITY>)(VERSION )(TIMESTAMP) (HOSTNAME) (BODY) where version is optional. The timestamp can be an RFC5424 timestamp with a format of "yyyy-MM-dd’T’HH:mm:ss.SZ" or "yyyy-MM-dd’T’HH:mm:ss.S+hh:mm", or it can be an RFC3164 timestamp with a format of "MMM d HH:mm:ss". If an incoming message matches one of these patterns, the message will be parsed and the individual pieces will be placed in FlowFile attributes, with the original message in the content of the FlowFile. If an incoming message does not match one of these patterns, it will not be parsed and the syslog.valid attribute will be set to false with the original message in the content of the FlowFile. Valid messages will be transferred on the success relationship, and invalid messages will be transferred on the invalid relationship.
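As a quick illustration of the message format described above (a sketch, assuming ListenSyslog is configured for UDP on localhost:1514):

import socket
from datetime import datetime, timezone

# Build an RFC5424-style message: <PRIORITY>VERSION TIMESTAMP HOSTNAME BODY
now = datetime.now(timezone.utc)
timestamp = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
message = f"<34>1 {timestamp} myhost su - - - 'su root' failed on /dev/pts/8"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(message.encode("utf-8"), ("localhost", 1514))
sock.close()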

Tags: syslog, listen, udp, tcp, logs

Properties

Protocol

The protocol for Syslog communication.

Port

The port for Syslog communication. Note that Expression language is not evaluated per FlowFile.

Local Network Interface

The name of a local network interface to be used to restrict listening to a specific LAN.

Socket Keep Alive

Whether or not to have TCP socket keep alive turned on. Timing details depend on operating system properties.

SSL Context Service

The Controller Service to use in order to obtain an SSL Context. If this property is set, syslog messages will be received over a secure connection.

Client Auth

The client authentication policy to use for the SSL Context. Only used if an SSL Context Service is provided.

Receive Buffer Size

The size of each buffer used to receive Syslog messages. Adjust this value appropriately based on the expected size of the incoming Syslog messages. When UDP is selected each buffer will hold one Syslog message. When TCP is selected messages are read from an incoming connection until the buffer is full, or the connection is closed.

Max Size of Message Queue

The maximum size of the internal queue used to buffer messages being transferred from the underlying channel to the processor. Setting this value higher allows more messages to be buffered in memory during surges of incoming messages, but increases the total memory used by the processor.

Max Size of Socket Buffer

The maximum size of the socket buffer that should be used. This is a suggestion to the Operating System to indicate how big the socket buffer should be. If this value is set too low, the buffer may fill up before the data can be read, and incoming data will be dropped.

Max Number of TCP Connections

The maximum number of concurrent connections to accept Syslog messages in TCP mode.

Max Batch Size

The maximum number of Syslog events to add to a single FlowFile. If multiple events are available, they will be concatenated along with the <Message Delimiter> up to this configured maximum number of messages

Message Delimiter

Specifies the delimiter to place between Syslog messages when multiple messages are bundled together (see <Max Batch Size> property).

Parse Messages

Indicates if the processor should parse the Syslog messages. If set to false, each outgoing FlowFile will only contain the sender, protocol, and port, and no additional attributes.

Character Set

Specifies the character set of the Syslog messages. Note that Expression language is not evaluated per FlowFile.

Relationships

  • success: Syslog messages that match one of the expected formats will be sent out this relationship as a FlowFile per message.

  • invalid: Syslog messages that do not match one of the expected formats will be sent out this relationship as a FlowFile per message.

Writes Attributes

  • syslog.priority: The priority of the Syslog message.

  • syslog.severity: The severity of the Syslog message derived from the priority.

  • syslog.facility: The facility of the Syslog message derived from the priority.

  • syslog.version: The optional version from the Syslog message.

  • syslog.timestamp: The timestamp of the Syslog message.

  • syslog.hostname: The hostname or IP address of the Syslog message.

  • syslog.sender: The hostname of the Syslog server that sent the message.

  • syslog.body: The body of the Syslog message, everything after the hostname.

  • syslog.valid: An indicator of whether this message matched the expected formats. If this value is false, the other attributes will be empty and only the original message will be available in the content.

  • syslog.protocol: The protocol over which the Syslog message was received.

  • syslog.port: The port over which the Syslog message was received.

  • mime.type: The mime.type of the FlowFile which will be text/plain for Syslog messages.

Input Requirement

This component does not allow an incoming relationship.

ListenTCP

Listens for incoming TCP connections and reads data from each connection using a line separator as the message demarcator. The default behavior is for each message to produce a single FlowFile; however, this can be controlled by increasing the Batch Size to a larger value for higher throughput. The Receive Buffer Size must be at least as large as the largest expected message; for example, if a line separator occurs every 100kb, then the Receive Buffer Size must be greater than 100kb. The processor can be configured to use an SSL Context Service to only allow secure connections. When connected clients present certificates for mutual TLS authentication, the Distinguished Names of the client certificate’s issuer and subject are added to the outgoing FlowFiles as attributes. The processor does not perform authorization based on Distinguished Name values, but since these values are attached to the outgoing FlowFiles, authorization can be implemented based on these attributes.
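As a quick illustration (a sketch, assuming ListenTCP is configured to listen on localhost:9999):

import socket

# Each newline-terminated line is one message; by default each message
# produces its own FlowFile.
with socket.create_connection(("localhost", 9999)) as sock:
    for line in ("first message", "second message", "third message"):
        sock.sendall((line + "\n").encode("utf-8"))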

Tags: listen, tcp, tls, ssl

Properties

Local Network Interface

The name of a local network interface to be used to restrict listening to a specific LAN.

Port

The port to listen on for communication.

Receive Buffer Size

The size of each buffer used to receive messages. Adjust this value appropriately based on the expected size of the incoming messages.

Max Size of Message Queue

The maximum size of the internal queue used to buffer messages being transferred from the underlying channel to the processor. Setting this value higher allows more messages to be buffered in memory during surges of incoming messages, but increases the total memory used by the processor during these surges.

Max Size of Socket Buffer

The maximum size of the socket buffer that should be used. This is a suggestion to the Operating System to indicate how big the socket buffer should be. If this value is set too low, the buffer may fill up before the data can be read, and incoming data will be dropped.

Character Set

Specifies the character set of the received data.

Max Number of Worker Threads

The maximum number of worker threads available for servicing TCP connections.

Max Batch Size

The maximum number of messages to add to a single FlowFile. If multiple messages are available, they will be concatenated along with the <Message Delimiter> up to this configured maximum number of messages

Batching Message Delimiter

Specifies the delimiter to place between messages when multiple messages are bundled together (see <Max Batch Size> property).

Idle Connection Timeout

The amount of time a client’s connection will remain open if no data is received. The default of 0 seconds will leave connections open until they are closed by the client.

Pool Receive Buffers

Enable or disable pooling of buffers that the processor uses for handling bytes received on socket connections. The framework allocates buffers as needed during processing.

SSL Context Service

The Controller Service to use in order to obtain an SSL Context. If this property is set, messages will be received over a secure connection.

Client Auth

The client authentication policy to use for the SSL Context. Only used if an SSL Context Service is provided.

Relationships

  • success: Messages received successfully will be sent out this relationship.

Writes Attributes

  • tcp.sender: The sending host of the messages.

  • tcp.port: The port on which the messages were received.

  • client.certificate.issuer.dn: For connections using mutual TLS, the Distinguished Name of the Certificate Authority that issued the client’s certificate is attached to the FlowFile.

  • client.certificate.subject.dn: For connections using mutual TLS, the Distinguished Name of the client certificate’s owner (subject) is attached to the FlowFile.

Input Requirement

This component does not allow an incoming relationship.

ListenTrapSNMP

Receives information from SNMP Agent and outputs a FlowFile with information in attributes and without any content

Tags: snmp, listen, trap

Properties

SNMP Manager Port

The port where the SNMP Manager listens to the incoming traps.

SNMP Version

Three significant versions of SNMP have been developed and deployed. SNMPv1 is the original version of the protocol. More recent versions, SNMPv2c and SNMPv3, feature improvements in performance, flexibility and security.

SNMP Community

SNMPv1 and SNMPv2 use communities to establish trust between managers and agents. Most agents support three community names, one each for read-only, read-write and trap. These three community strings control different types of activities. The read-only community applies to get requests. The read-write community string applies to set requests. The trap community string applies to receipt of traps.

SNMP Security Level

SNMP version 3 provides extra security with the User-based Security Model (USM). The three levels of security are: 1. Communication without authentication and encryption (NoAuthNoPriv). 2. Communication with authentication and without encryption (AuthNoPriv). 3. Communication with authentication and encryption (AuthPriv).

USM Users Source

The ways to provide USM User data

USM Users JSON File Path

The path of the json file containing the user credentials for SNMPv3. Check Usage for more details.

USM Users JSON content

The JSON containing the user credentials for SNMPv3. Check Usage for more details.

SNMP Users Security Names

Security names listed separated by commas in SNMPv3. Check Usage for more details.

Relationships

  • success: All FlowFiles that are received from the SNMP agent are routed to this relationship

  • failure: All FlowFiles that cannot be received from the SNMP agent are routed to this relationship

Writes Attributes

  • snmp$*: Attributes retrieved from the SNMP response. It may include: snmp$errorIndex, snmp$errorStatus, snmp$errorStatusText, snmp$nonRepeaters, snmp$requestID, snmp$type, snmp$variableBindings

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Summary

This processor listens to SNMP traps and creates a flowfile from the trap PDU. The versions SNMPv1, SNMPv2c and SNMPv3 are supported. The component is based on SNMP4J.

SNMPv3 has user-based security. The USM Users Source property allows users to choose between three different ways to provide the USM user database. An example json file containing two users:

[
  {
    "securityName": "user1",
    "authProtocol": "MD5",
    "authPassphrase": "abc12345",
    "privProtocol": "DES",
    "privPassphrase": "abc12345"
  },
  {
    "securityName": "newUser2",
    "authProtocol": "MD5",
    "authPassphrase": "abc12345",
    "privProtocol": "AES256",
    "privPassphrase": "abc12345"
  }
]

ListenUDP

Listens for Datagram Packets on a given port. The default behavior produces a FlowFile per datagram, however for higher throughput the Max Batch Size property may be increased to specify the number of datagrams to batch together in a single FlowFile. This processor can be restricted to listening for datagrams from a specific remote host and port by specifying the Sending Host and Sending Host Port properties, otherwise it will listen for datagrams from all hosts and ports.
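As a quick illustration (a sketch, assuming ListenUDP is configured to listen on localhost:8888):

import socket

# Each datagram produces one FlowFile unless Max Batch Size is increased.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"one datagram, one FlowFile", ("localhost", 8888))
sock.close()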

Tags: ingest, udp, listen, source

Properties

Local Network Interface

The name of a local network interface to be used to restrict listening to a specific LAN.

Port

The port to listen on for communication.

Receive Buffer Size

The size of each buffer used to receive messages. Adjust this value appropriately based on the expected size of the incoming messages.

Max Size of Message Queue

The maximum size of the internal queue used to buffer messages being transferred from the underlying channel to the processor. Setting this value higher allows more messages to be buffered in memory during surges of incoming messages, but increases the total memory used by the processor.

Max Size of Socket Buffer

The maximum size of the socket buffer that should be used. This is a suggestion to the Operating System to indicate how big the socket buffer should be. If this value is set too low, the buffer may fill up before the data can be read, and incoming data will be dropped.

Character Set

Specifies the character set of the received data.

Max Batch Size

The maximum number of messages to add to a single FlowFile. If multiple messages are available, they will be concatenated along with the <Message Delimiter> up to this configured maximum number of messages

Batching Message Delimiter

Specifies the delimiter to place between messages when multiple messages are bundled together (see <Max Batch Size> property).

Sending Host

IP, or name, of a remote host. Only Datagrams from the specified Sending Host Port and this host will be accepted. Improves Performance. May be a system property or an environment variable.

Sending Host Port

Port being used by remote host to send Datagrams. Only Datagrams from the specified Sending Host and this port will be accepted. Improves Performance. May be a system property or an environment variable.

Relationships

  • success: Messages received successfully will be sent out this relationship.

Writes Attributes

  • udp.sender: The sending host of the messages.

  • udp.port: The port on which the messages were received.

Input Requirement

This component does not allow an incoming relationship.

ListenUDPRecord

Listens for Datagram Packets on a given port and reads the content of each datagram using the configured Record Reader. Each record will then be written to a flow file using the configured Record Writer. This processor can be restricted to listening for datagrams from a specific remote host and port by specifying the Sending Host and Sending Host Port properties, otherwise it will listen for datagrams from all hosts and ports.

Tags: ingest, udp, listen, source, record

Properties

Local Network Interface

The name of a local network interface to be used to restrict listening to a specific LAN.

Port

The port to listen on for communication.

Receive Buffer Size

The size of each buffer used to receive messages. Adjust this value appropriately based on the expected size of the incoming messages.

Max Size of Message Queue

The maximum size of the internal queue used to buffer messages being transferred from the underlying channel to the processor. Setting this value higher allows more messages to be buffered in memory during surges of incoming messages, but increases the total memory used by the processor.

Max Size of Socket Buffer

The maximum size of the socket buffer that should be used. This is a suggestion to the Operating System to indicate how big the socket buffer should be. If this value is set too low, the buffer may fill up before the data can be read, and incoming data will be dropped.

Character Set

Specifies the character set of the received data.

Poll Timeout

The amount of time to wait when polling the internal queue for more datagrams. If no datagrams are found after waiting for the configured timeout, then the processor will emit whatever records have been obtained up to that point.

Batch Size

The maximum number of datagrams to write as records to a single FlowFile. The Batch Size will only be reached when data is coming in more frequently than the Poll Timeout.

Record Reader

The Record Reader to use for reading the content of incoming datagrams.

Record Writer

The Record Writer to use in order to serialize the data before writing to a flow file.

Sending Host

IP, or name, of a remote host. Only Datagrams from the specified Sending Host Port and this host will be accepted. Improves Performance. May be a system property or an environment variable.

Sending Host Port

Port being used by remote host to send Datagrams. Only Datagrams from the specified Sending Host and this port will be accepted. Improves Performance. May be a system property or an environment variable.

Relationships

  • success: Messages received successfully will be sent out this relationship.

  • parse.failure: If a datagram cannot be parsed using the configured Record Reader, the contents of the message will be routed to this Relationship as its own individual FlowFile.

Writes Attributes

  • udp.sender: The sending host of the messages.

  • udp.port: The port on which the messages were received.

  • record.count: The number of records written to the flow file.

  • mime.type: The mime-type of the writer used to write the records to the flow file.

Input Requirement

This component does not allow an incoming relationship.

ListenWebSocket

Acts as a WebSocket server endpoint to accept client connections. FlowFiles are transferred to downstream relationships according to the received message types as the WebSocket server configured with this processor receives client requests.
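As a quick illustration (a sketch using the third-party websockets package, assuming the configured WebSocket Server Controller Service listens on localhost:9998 and the Server URL Path is '/example'):

import asyncio
import websockets  # third-party 'websockets' package

async def main():
    # Connect to the endpoint exposed at the processor's Server URL Path.
    async with websockets.connect("ws://localhost:9998/example") as ws:
        await ws.send("hello")       # arrives on the 'text message' relationship
        await ws.send(b"\x00\x01")   # arrives on the 'binary message' relationship

asyncio.run(main())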

Tags: subscribe, WebSocket, consume, listen

Properties

WebSocket Server ControllerService

A WebSocket SERVER Controller Service which can accept WebSocket requests.

Server URL Path

The WebSocket URL Path on which this processor listens. Must start with '/', e.g. '/example'.

Relationships

  • binary message: The WebSocket binary message output

  • connected: The WebSocket session is established

  • disconnected: The WebSocket session is disconnected

  • text message: The WebSocket text message output

Writes Attributes

  • websocket.controller.service.id: WebSocket Controller Service id.

  • websocket.session.id: Established WebSocket session id.

  • websocket.endpoint.id: WebSocket endpoint id.

  • websocket.local.address: WebSocket server address.

  • websocket.remote.address: WebSocket client address.

  • websocket.message.type: TEXT or BINARY.

Input Requirement

This component does not allow an incoming relationship.

ListFile

Retrieves a listing of files from the input directory. For each file listed, creates a FlowFile that represents the file so that it can be fetched in conjunction with FetchFile. This Processor is designed to run on Primary Node only in a cluster when 'Input Directory Location' is set to 'Remote'. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all the data. When 'Input Directory Location' is 'Local', the 'Execution' mode can be anything, and synchronization won’t happen. Unlike GetFile, this Processor does not delete any data from the local filesystem.

Tags: file, get, list, ingest, source, filesystem

Properties

Input Directory

The input directory from which to pull files

Listing Strategy

Specify how to determine new/updated entities. See each strategy’s description for details.

Recurse Subdirectories

Indicates whether to list files from subdirectories of the directory

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Input Directory Location

Specifies where the Input Directory is located. This is used to determine whether state should be stored locally or across the cluster.

File Filter

Only files whose names match the given regular expression will be picked up

Path Filter

When Recurse Subdirectories is true, then only subdirectories whose path matches the given regular expression will be scanned

Include File Attributes

Whether or not to include information such as the file’s Last Modified Time and Owner as FlowFile Attributes. Depending on the File System being used, gathering this information can be expensive, and as a result this property may need to be disabled. This is especially true of remote file shares.

Minimum File Age

The minimum age that a file must be in order to be pulled; any file younger than this amount of time (according to last modification date) will be ignored

Maximum File Age

The maximum age that a file must be in order to be pulled; any file older than this amount of time (according to last modification date) will be ignored

Minimum File Size

The minimum size that a file must be in order to be pulled

Maximum File Size

The maximum size that a file can be in order to be pulled

Ignore Hidden Files

Indicates whether or not hidden files should be ignored

Target System Timestamp Precision

Specify timestamp precision at the target system. Since this processor uses timestamp of entities to decide which should be listed, it is crucial to use the right timestamp precision.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or in case of a primary node change. The 'Tracking Entities' strategy requires tracking information for all entities listed within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses a DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If entities are tracked per node, the optional '::{nodeId}' part is added so that state is managed separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key will be deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp falls inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp in the most recent 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity is removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Entity Tracking Node Identifier

The configured value will be appended to the cache key so that listing state can be tracked per NiFi node rather than cluster wide when tracking state is scoped to LOCAL. Used by 'Tracking Entities' strategy.

Track Performance

Whether or not the Processor should track the performance of disk access operations. If true, all accesses to disk will be recorded, including the file being accessed, the information being obtained, and how long it takes. This is then logged periodically at a DEBUG level. While the amount of data will be capped, this option may still consume a significant amount of heap (controlled by the 'Maximum Number of Files to Track' property), but it can be very useful for troubleshooting purposes if performance is poor or degraded.

Maximum Number of Files to Track

If the 'Track Performance' property is set to 'true', this property indicates the maximum number of files whose performance metrics should be held onto. A smaller value for this property will result in less heap utilization, while a larger value may provide more accurate insights into how the disk access operations are performing

Max Disk Operation Time

The maximum amount of time that any single disk operation is expected to take. If any disk operation takes longer than this amount of time, a warning bulletin will be generated for each operation that exceeds this amount of time.

Max Directory Listing Time

The maximum amount of time that listing any single directory is expected to take. If the listing for the directory specified by the 'Input Directory' property, or the listing of any subdirectory (if 'Recurse' is set to true) takes longer than this amount of time, a warning bulletin will be generated for each directory listing that exceeds this amount of time.

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • filename: The name of the file that was read from filesystem.

  • path: The path is set to the relative path of the file’s directory on filesystem compared to the Input Directory property. For example, if Input Directory is set to /tmp, then files picked up from /tmp will have the path attribute set to "/". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "abc/1/2/3/".

  • absolute.path: The absolute.path is set to the absolute path of the file’s directory on the filesystem. For example, if the Input Directory property is set to /tmp, then files picked up from /tmp will have the absolute.path attribute set to "/tmp/". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the absolute.path attribute will be set to "/tmp/abc/1/2/3/" (see the sketch after this list).

  • file.owner: The user that owns the file in filesystem

  • file.group: The group that owns the file in filesystem

  • file.size: The number of bytes in the file in filesystem

  • file.permissions: The permissions for the file in filesystem. This is formatted as 3 characters for the owner, 3 for the group, and 3 for other users. For example rw-rw-r--

  • file.lastModifiedTime: The timestamp of when the file in filesystem was last modified as 'yyyy-MM-dd’T’HH:mm:ssZ'

  • file.lastAccessTime: The timestamp of when the file in filesystem was last accessed as 'yyyy-MM-dd’T’HH:mm:ssZ'

  • file.creationTime: The timestamp of when the file in filesystem was created as 'yyyy-MM-dd’T’HH:mm:ssZ'
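
The relationship between the path and absolute.path attributes and the Input Directory can be illustrated with a short sketch. This is not the processor’s code; it simply reproduces the documented examples in Python.

# Sketch of how ListFile derives the path and absolute.path attributes.
# Illustrative only; it mirrors the examples above, not NiFi internals.
import os

def listing_attributes(input_directory: str, file_path: str) -> dict:
    directory = os.path.dirname(file_path)
    relative = os.path.relpath(directory, input_directory)
    return {
        "filename": os.path.basename(file_path),
        "path": "/" if relative == "." else relative + "/",   # relative to Input Directory
        "absolute.path": directory + "/",                     # absolute directory of the file
    }

print(listing_attributes("/tmp", "/tmp/data.bin"))
# {'filename': 'data.bin', 'path': '/', 'absolute.path': '/tmp/'}
print(listing_attributes("/tmp", "/tmp/abc/1/2/3/data.bin"))
# {'filename': 'data.bin', 'path': 'abc/1/2/3/', 'absolute.path': '/tmp/abc/1/2/3/'}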

Stateful

Scope: Local, Cluster

After performing a listing of files, the timestamp of the newest file is stored. This allows the Processor to list only files that have been added or modified after this date the next time that the Processor is run. Whether the state is stored with a Local or Cluster scope depends on the value of the <Input Directory Location> property.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

ListFile performs a listing of all files that it encounters in the configured directory. There are two common, broadly defined use cases.

Streaming Use Case

By default, the Processor will create a separate FlowFile for each file in the directory and add attributes for filename, path, etc. A common use case is to connect ListFile to the FetchFile processor. These two processors used in conjunction with one another provide the ability to easily monitor a directory and fetch the contents of any new file as it lands in an efficient streaming fashion.

Batch Use Case

Another common use case is the desire to process all newly arriving files in a given directory, and to then perform some action only when all files have completed their processing. The above approach of streaming the data makes this difficult, because NiFi is inherently a streaming platform in that there is no “job” that has a beginning and an end. Data is simply picked up as it becomes available.

To solve this, the ListFile Processor can optionally be configured with a Record Writer. When a Record Writer is configured, a single FlowFile will be created that will contain a Record for each file in the directory, instead of a separate FlowFile per file. With this pattern, in order to fetch the contents of each file, the records must be split up into individual FlowFiles and then fetched. So how does this help us?

We can still accomplish the desired use case of waiting until all files in the directory have been processed by splitting apart the FlowFile and processing all the data within a Process Group. Configuring the Process Group with a FlowFile Concurrency of “Single FlowFile per Node” means that only one FlowFile will be brought into the Process Group. Once that happens, the FlowFile can be split apart and each part processed. Configuring the Process Group with an Outbound Policy of “Batch Output” means that none of the FlowFiles will leave the Process Group until all have finished processing.

In this flow, we perform a listing of a directory with ListFile. The processor is configured with a Record Writer (in this case a CSV Writer, but any Record Writer can be used) so that only a single FlowFile is generated for the entire listing. That listing is then sent to the “Process Listing” Process Group (shown below). Only after the contents of the entire directory have been processed will data leave the “Process Listing” Process Group. At that point, when all data in the Process Group is ready to leave, each of the processed files will be sent to the “Post-Processing” Process Group. At the same time, the original listing is to be sent to the “Processing Complete Notification” Process Group. In order to accomplish this, the Process Group must be configured with a FlowFile Concurrency of “Single FlowFile per Node” and an Outbound Policy of “Batch Output.”

In the “Process Listing” Process Group, a listing is received via the “Listing” Input Port. This is then sent directly to the “Listing of Processed Data” Output Port so that when all processing completes, the original listing will be sent out also.

Next, the listing is broken apart into an individual FlowFile per record. Because we want to use FetchFile to fetch the data, we need to get the file’s filename and path as FlowFile attributes. This can be done in a few different ways, but the easiest mechanism is to use the PartitionRecord processor. This Processor is configured with a Record Reader that is able to read the data written by ListFile (in this case, a CSV Reader). The Processor is also configured with two additional user-defined properties:

  • absolute.path: /path

  • filename: /filename

As a result, each record that comes into the PartitionRecord processor will be split into an individual FlowFile (because the combination of the “path” and “filename” fields will be unique for each Record), and the “filename” and “path” record fields will become attributes on the FlowFile (using attribute names of “absolute.path” and “filename”). FetchFile’s default configuration references these attributes.
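
Conceptually, this step behaves like the following sketch (plain Python, not NiFi internals; the record values are made up for illustration):

# Conceptual sketch of PartitionRecord turning listing records into per-file
# FlowFile attributes via the two user-defined properties above.
# Not NiFi code; the record values are hypothetical.
records = [
    {"filename": "1.txt", "path": "/data/txt", "directory": False, "size": 100},
    {"filename": "2.txt", "path": "/data/txt", "directory": False, "size": 200},
]

# Each user-defined property maps an attribute name to a RecordPath.
property_to_recordpath = {"absolute.path": "/path", "filename": "/filename"}

flowfiles = [
    {attr: record[path.lstrip("/")] for attr, path in property_to_recordpath.items()}
    for record in records
]
print(flowfiles)
# [{'absolute.path': '/data/txt', 'filename': '1.txt'},
#  {'absolute.path': '/data/txt', 'filename': '2.txt'}]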

Finally, we process the data - in this example, simply by compressing it with GZIP compression - and send the output to the “Processed Data” Output Port. The data will queue up here until all data is ready to leave the Process Group and then will be released.

Record Schema

When the Processor is configured to write the listing using a Record Writer, the Records will be written using the following schema (in Avro format):

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "filename",
      "type": "string"
    },
    {
      "name": "path",
      "type": "string"
    },
    {
      "name": "directory",
      "type": "boolean"
    },
    {
      "name": "size",
      "type": "long"
    },
    {
      "name": "lastModified",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "permissions",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "owner",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "group",
      "type": [
        "null",
        "string"
      ]
    }
  ]
}
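
For illustration, a single record conforming to this schema might look like the following (all values are hypothetical):

# Hypothetical listing record matching the schema above, shown as a Python literal.
example_record = {
    "filename": "1.txt",
    "path": "/data/txt",
    "directory": False,
    "size": 100,
    "lastModified": 1704067200000,   # timestamp-millis
    "permissions": "rw-rw-r--",
    "owner": "nifi",
    "group": "nifi",
}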

ListFTP

Performs a listing of the files residing on an FTP server. For each file that is found on the remote server, a new FlowFile will be created with the filename attribute set to the name of the file on the remote server. This can then be used in conjunction with FetchFTP in order to fetch those files.

Tags: list, ftp, remote, ingest, source, input, files

Properties

Listing Strategy

Specify how to determine new/updated entities. See each strategy’s description for details.

Hostname

The fully qualified hostname or IP address of the remote system

Port

The port to connect to on the remote host to fetch the data from

Username

Username

Password

Password for the user account

Remote Path

The path on the remote system from which to pull or push files

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Search Recursively

If true, will pull files from arbitrarily nested subdirectories; otherwise, will not traverse subdirectories

Follow symlink

If true, symbolic links to files will be pulled and symbolically linked subdirectories will be traversed; otherwise, symbolic links to files will not be read and symbolically linked subdirectories will not be traversed

File Filter Regex

Provides a Java Regular Expression for filtering Filenames; if a filter is supplied, only files whose names match that Regular Expression will be fetched

Path Filter Regex

When Search Recursively is true, then only subdirectories whose path matches the given Regular Expression will be scanned

Ignore Dotted Files

If true, files whose names begin with a dot (".") will be ignored

Remote Poll Batch Size

The value specifies how many file paths to find in a given directory on the remote system when doing a file listing. This value in general should not need to be modified, but when polling against a remote system with a tremendous number of files, this value can be critical. Setting this value too high can result in very poor performance and setting it too low can cause the flow to be slower than normal.

Connection Timeout

Amount of time to wait before timing out while creating a connection

Data Timeout

When transferring a file between the local and remote system, this value specifies how long is allowed to elapse without any data being transferred between systems

Connection Mode

The FTP Connection Mode

Transfer Mode

The FTP Transfer Mode

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN, SOCKS + AuthN

Internal Buffer Size

Set the internal buffer size for buffered data streams

Target System Timestamp Precision

Specify timestamp precision at the target system. Since this processor uses timestamp of entities to decide which should be listed, it is crucial to use the right timestamp precision.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or in case of a primary node change. The 'Tracking Entities' strategy requires tracking information for all entities listed within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses a DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If listed entities are tracked per node, the optional '::{nodeId}' part is added so that state is managed separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key is deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.
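
The documented key and payload format can be illustrated with a short sketch (illustrative only; it mirrors the description above, not NiFi’s serialization code, and the per-entity fields shown are assumptions):

# Illustrative construction of the documented cache key and a Gzipped JSON payload.
# Mirrors the description above; not NiFi's actual implementation, and the
# per-entity fields ("timestamp", "size") are assumptions for the example.
import gzip
import json
from typing import Optional

def cache_key(processor_id: str, node_id: Optional[str] = None) -> str:
    key = f"ListedEntities::{processor_id}"
    if node_id is not None:          # per-node tracking adds the optional '::{nodeId}' part
        key += f"::{node_id}"
    return key

print(cache_key("8dda2321-0164-1000-50fa-3042fe7d6a7b"))
# ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b
print(cache_key("8dda2321-0164-1000-50fa-3042fe7d6a7b", "nifi-node3"))
# ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3

# The listed entities themselves are stored as a Gzipped JSON string.
listed = {"/remote/path/1.txt": {"timestamp": 1704067200000, "size": 100}}
payload = gzip.compress(json.dumps(listed).encode("utf-8"))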

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp falls inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp within the last 30 minutes is a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity is removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Use UTF-8 Encoding

Tells the client to use UTF-8 encoding when processing files and filenames. If set to true, the server must also support UTF-8 encoding.

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • ftp.remote.host: The hostname of the FTP Server

  • ftp.remote.port: The port that was connected to on the FTP Server

  • ftp.listing.user: The username of the user that performed the FTP Listing

  • file.owner: The numeric owner id of the source file

  • file.group: The numeric group id of the source file

  • file.permissions: The read/write/execute permissions of the source file

  • file.size: The number of bytes in the source file

  • file.lastModifiedTime: The timestamp of when the file in the filesystem was last modified as 'yyyy-MM-dd’T’HH:mm:ssZ'

  • filename: The name of the file on the FTP Server

  • path: The fully qualified name of the directory on the FTP Server from which the file was pulled

Stateful

Scope: Cluster

After performing a listing of files, the timestamp of the newest file is stored. This allows the Processor to list only files that have been added or modified after this date the next time that the Processor is run. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node will not duplicate the data that was listed by the previous Primary Node.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

ListFTP performs a listing of all files that it encounters in the configured directory of an FTP server. There are two common, broadly defined use cases.

Streaming Use Case

By default, the Processor will create a separate FlowFile for each file in the directory and add attributes for filename, path, etc. A common use case is to connect ListFTP to the FetchFTP processor. These two processors used in conjunction with one another provide the ability to easily monitor a directory and fetch the contents of any new file as it lands on the FTP server in an efficient streaming fashion.

Batch Use Case

Another common use case is the desire to process all newly arriving files in a given directory, and to then perform some action only when all files have completed their processing. The above approach of streaming the data makes this difficult, because NiFi is inherently a streaming platform in that there is no “job” that has a beginning and an end. Data is simply picked up as it becomes available.

To solve this, the ListFTP Processor can optionally be configured with a Record Writer. When a Record Writer is configured, a single FlowFile will be created that will contain a Record for each file in the directory, instead of a separate FlowFile per file. With this pattern, in order to fetch the contents of each file, the records must be split up into individual FlowFiles and then fetched. So how does this help us?

We can still accomplish the desired use case of waiting until all files in the directory have been processed by splitting apart the FlowFile and processing all the data within a Process Group. Configuring the Process Group with a FlowFile Concurrency of “Single FlowFile per Node” means that only one FlowFile will be brought into the Process Group. Once that happens, the FlowFile can be split apart and each part processed. Configuring the Process Group with an Outbound Policy of “Batch Output” means that none of the FlowFiles will leave the Process Group until all have finished processing.

In this flow, we perform a listing of a directory with ListFTP. The processor is configured with a Record Writer (in this case a CSV Writer, but any Record Writer can be used) so that only a single FlowFile is generated for the entire listing. That listing is then sent to the “Process Listing” Process Group (shown below). Only after the contents of the entire directory have been processed will data leave the “Process Listing” Process Group. At that point, when all data in the Process Group is ready to leave, each of the processed files will be sent to the “Post-Processing” Process Group. At the same time, the original listing is to be sent to the “Processing Complete Notification” Process Group. In order to accomplish this, the Process Group must be configured with a FlowFile Concurrency of “Single FlowFile per Node” and an Outbound Policy of “Batch Output.”

In the “Process Listing” Process Group, a listing is received via the “Listing” Input Port. This is then sent directly to the “Listing of Processed Data” Output Port so that when all processing completes, the original listing will be sent out also.

Next, the listing is broken apart into an individual FlowFile per record. Because we want to use FetchFTP to fetch the data, we need to get the file’s filename and path as FlowFile attributes. This can be done in a few different ways, but the easiest mechanism is to use the PartitionRecord processor. This Processor is configured with a Record Reader that is able to read the data written by ListFTP (in this case, a CSV Reader). The Processor is also configured with two additional user-defined properties:

  • path: /path

  • filename: /filename

As a result, each record that comes into the PartitionRecord processor will be split into an individual FlowFile (because the combination of the “path” and “filename” fields will be unique for each Record), and the “filename” and “path” record fields will become attributes on the FlowFile. FetchFTP is configured to use a value of ${path}/${filename} for the “Remote File” property, making use of these attributes.
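
As a small illustration of how that property value resolves (the attribute values here are hypothetical):

# Illustration of how ${path}/${filename} resolves for the "Remote File" property.
# Attribute values are hypothetical.
attributes = {"path": "/data/txt", "filename": "1.txt"}
remote_file = f"{attributes['path']}/{attributes['filename']}"
print(remote_file)   # /data/txt/1.txt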

Finally, we process the data - in this example, simply by compressing it with GZIP compression - and send the output to the “Processed Data” Output Port. The data will queue up here until all data is ready to leave the Process Group and then will be released.

Record Schema

When the Processor is configured to write the listing using a Record Writer, the Records will be written using the following schema (in Avro format):

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "filename",
      "type": "string"
    },
    {
      "name": "path",
      "type": "string"
    },
    {
      "name": "directory",
      "type": "boolean"
    },
    {
      "name": "size",
      "type": "long"
    },
    {
      "name": "lastModified",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "permissions",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "owner",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "group",
      "type": [
        "null",
        "string"
      ]
    }
  ]
}

ListGCSBucket

Retrieves a listing of objects from a GCS bucket. For each object that is listed, creates a FlowFile that represents the object so that it can be fetched in conjunction with FetchGCSObject. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

Tags: google cloud, google, storage, gcs, list

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Project ID

Google Cloud Project ID

Bucket

Bucket of the object.

Prefix

The prefix used to filter the object list. In most cases, it should end with a forward slash ('/').

Listing Strategy

Specify how to determine new/updated entities. See each strategy’s description for details.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or in case of a primary node change. The 'Tracking Entities' strategy requires tracking information for all entities listed within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses a DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If listed entities are tracked per node, the optional '::{nodeId}' part is added so that state is managed separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key is deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp falls inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp within the last 30 minutes is a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity is removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Use Generations

Specifies whether to use GCS Generations, if applicable. If false, only the latest version of each object will be returned.

Number of retries

How many retry attempts should be made before routing to the failure relationship.

Storage API URL

Overrides the default storage URL. Configuring an alternative Storage API URL also overrides the HTTP Host header on requests as described in the Google documentation for Private Service Connections.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to this relationship after a successful Google Cloud Storage operation.

Writes Attributes

  • filename: The name of the file

  • gcs.bucket: Bucket of the object.

  • gcs.key: Name of the object.

  • gcs.size: Size of the object.

  • gcs.cache.control: Data cache control of the object.

  • gcs.component.count: The number of components which make up the object.

  • gcs.content.disposition: The data content disposition of the object.

  • gcs.content.encoding: The content encoding of the object.

  • gcs.content.language: The content language of the object.

  • mime.type: The MIME/Content-Type of the object

  • gcs.crc32c: The CRC32C checksum of object’s data, encoded in base64 in big-endian order.

  • gcs.create.time: The creation time of the object (milliseconds)

  • gcs.update.time: The last modification time of the object (milliseconds)

  • gcs.encryption.algorithm: The algorithm used to encrypt the object.

  • gcs.encryption.sha256: The SHA256 hash of the key used to encrypt the object

  • gcs.etag: The HTTP 1.1 Entity tag for the object.

  • gcs.generated.id: The service-generated ID for the object

  • gcs.generation: The data generation of the object.

  • gcs.md5: The MD5 hash of the object’s data encoded in base64.

  • gcs.media.link: The media download link to the object.

  • gcs.metageneration: The metageneration of the object.

  • gcs.owner: The owner (uploader) of the object.

  • gcs.owner.type: The ACL entity type of the uploader of the object.

  • gcs.acl.owner: A comma-delimited list of ACL entities that have owner access to the object. Entities will be either email addresses, domains, or project IDs.

  • gcs.acl.writer: A comma-delimited list of ACL entities that have write access to the object. Entities will be either email addresses, domains, or project IDs.

  • gcs.acl.reader: A comma-delimited list of ACL entities that have read access to the object. Entities will be either email addresses, domains, or project IDs.

  • gcs.uri: The URI of the object as a string.

Stateful

Scope: Cluster

After performing a listing of keys, the timestamp of the newest key is stored, along with the keys that share that same timestamp. This allows the Processor to list only keys that have been added or modified after this date the next time that the Processor is run. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Streaming Versus Batch Processing

ListGCSBucket performs a listing of all GCS Objects that it encounters in the configured GCS bucket. There are two common, broadly defined use cases.

Streaming Use Case

By default, the Processor will create a separate FlowFile for each object in the bucket and add attributes for filename, bucket, etc. A common use case is to connect ListGCSBucket to the FetchGCSObject processor. These two processors used in conjunction with one another provide the ability to easily monitor a bucket and fetch the contents of any new object as it lands in GCS in an efficient streaming fashion.

Batch Use Case

Another common use case is the desire to process all newly arriving objects in a given bucket, and to then perform some action only when all objects have completed their processing. The above approach of streaming the data makes this difficult, because NiFi is inherently a streaming platform in that there is no “job” that has a beginning and an end. Data is simply picked up as it becomes available.

To solve this, the ListGCSBucket Processor can optionally be configured with a Record Writer. When a Record Writer is configured, a single FlowFile will be created that will contain a Record for each object in the bucket, instead of a separate FlowFile per object. See the documentation for ListFile for an example of how to build a dataflow that allows for processing all the objects before proceeding with any other step.

One important difference between the data produced by ListFile and ListGCSBucket, though, is the structure of the Records that are emitted. The Records emitted by ListFile have a different schema than those emitted by ListGCSBucket. ListGCSBucket emits records that follow the following schema (in Avro format):

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "bucket",
      "type": "string"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "size",
      "type": [
        "null",
        "long"
      ]
    },
    {
      "name": "cacheControl",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "componentCount",
      "type": [
        "null",
        "int"
      ]
    },
    {
      "name": "contentDisposition",
      "type": [
        "null",
        "long"
      ]
    },
    {
      "name": "contentEncoding",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "contentLanguage",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "crc32c",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "createTime",
      "type": [
        "null",
        {
          "type": "long",
          "logicalType": "timestamp-millis"
        }
      ]
    },
    {
      "name": "updateTime",
      "type": [
        "null",
        {
          "type": "long",
          "logicalType": "timestamp-millis"
        }
      ]
    },
    {
      "name": "encryptionAlgorithm",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "encryptionKeySha256",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "etag",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "generatedId",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "generation",
      "type": [
        "null",
        "long"
      ]
    },
    {
      "name": "md5",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "mediaLink",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "metageneration",
      "type": [
        "null",
        "long"
      ]
    },
    {
      "name": "owner",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "ownerType",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "uri",
      "type": [
        "null",
        "string"
      ]
    }
  ]
}

ListGoogleDrive

Performs a listing of concrete files (shortcuts are ignored) in a Google Drive folder. If the 'Record Writer' property is set, a single Output FlowFile is created, and each file in the listing is written as a single record to the output file. Otherwise, for each file in the listing, an individual FlowFile is created, the metadata being written as FlowFile attributes. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. Please see Additional Details to set up access to Google Drive.

Tags: google, drive, storage

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Folder ID

The ID of the folder from which to pull list of files. Please see Additional Details to set up access to Google Drive and obtain Folder ID. WARNING: Unauthorized access to the folder is treated as if the folder was empty. This results in the processor not creating outgoing FlowFiles. No additional error message is provided.

Search Recursively

When 'true', will include list of files from concrete sub-folders (ignores shortcuts). Otherwise, will return only files that have the defined 'Folder ID' as their parent directly. WARNING: The listing may fail if there are too many sub-folders (500+).

Minimum File Age

The minimum age a file must be in order to be considered; any files younger than this will be ignored.

Listing Strategy

Specify how to determine new/updated entities. See each strategy’s description for details.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or in case of a primary node change. The 'Tracking Entities' strategy requires tracking information for all entities listed within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses a DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If listed entities are tracked per node, the optional '::{nodeId}' part is added so that state is managed separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key is deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp falls inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp within the last 30 minutes is a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity is removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • drive.id: The id of the file

  • filename: The name of the file

  • mime.type: The MIME type of the file

  • drive.size: The size of the file

  • drive.timestamp: The last modified time or created time (whichever is greater) of the file. The reason for this is that the original modified date of a file is preserved when uploaded to Google Drive. 'Created time' takes the time when the upload occurs. However uploaded files can still be modified later.

Stateful

Scope: Cluster

The processor stores necessary data to be able to keep track what files have been listed already. What exactly needs to be stored depends on the 'Listing Strategy'. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Accessing Google Drive from NiFi

This processor uses Google Cloud credentials for authentication to access Google Drive. The following steps are required to prepare the Google Cloud and Google Drive accounts for the processors:

  1. Enable Google Drive API in Google Cloud

  2. Grant access to Google Drive folder

    • In Google Cloud Console navigate to IAM & Admin → Service Accounts.

    • Take a note of the email of the service account you are going to use.

    • Navigate to the folder to be listed in Google Drive.

    • Right-click on the Folder → Share.

    • Enter the service account email.

  3. Find Folder ID

    • Navigate to the folder to be listed in Google Drive and enter it. The URL in your browser will include the ID at the end of the URL. For example, if the URL were https://drive.google.com/drive/folders/1trTraPVCnX5_TNwO8d9P_bz278xWOmGm, the Folder ID would be 1trTraPVCnX5_TNwO8d9P_bz278xWOmGm (see the sketch after this list).

  4. Set Folder ID in ‘Folder ID’ property
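
For reference, a small helper like the following (hypothetical, not part of NiFi) can extract the Folder ID from such a URL:

# Hypothetical helper that extracts the Folder ID from a Google Drive folder URL.
from urllib.parse import urlparse

def folder_id_from_url(url: str) -> str:
    # The ID is the last path segment after /drive/folders/
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]

print(folder_id_from_url(
    "https://drive.google.com/drive/folders/1trTraPVCnX5_TNwO8d9P_bz278xWOmGm"
))
# 1trTraPVCnX5_TNwO8d9P_bz278xWOmGm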

ListHDFS

Retrieves a listing of files from HDFS. For each file that is listed in HDFS, this processor creates a FlowFile that represents the HDFS file to be fetched in conjunction with FetchHDFS. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. Unlike GetHDFS, this Processor does not delete any data from HDFS.

Tags: hadoop, HCFS, HDFS, get, list, ingest, source, filesystem

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Directory

The HDFS directory from which files should be read

Recurse Subdirectories

Indicates whether to list files from subdirectories of the HDFS directory

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile.

File Filter

Only files whose names match the given regular expression will be picked up

File Filter Mode

Determines how the regular expression in File Filter will be used when retrieving listings.

Minimum File Age

The minimum age that a file must be in order to be pulled; any file younger than this amount of time (based on last modification date) will be ignored

Maximum File Age

The maximum age that a file must be in order to be pulled; any file older than this amount of time (based on last modification date) will be ignored. Minimum value is 100ms.

Relationships

  • success: All FlowFiles are transferred to this relationship

Writes Attributes

  • filename: The name of the file that was read from HDFS.

  • path: The path is set to the absolute path of the file’s directory on HDFS. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "/tmp". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "/tmp/abc/1/2/3".

  • hdfs.owner: The user that owns the file in HDFS

  • hdfs.group: The group that owns the file in HDFS

  • hdfs.lastModified: The timestamp of when the file in HDFS was last modified, as milliseconds since midnight Jan 1, 1970 UTC

  • hdfs.length: The number of bytes in the file in HDFS

  • hdfs.replication: The number of HDFS replicas for the file

  • hdfs.permissions: The permissions for the file in HDFS. This is formatted as 3 characters for the owner, 3 for the group, and 3 for other users. For example rw-rw-r--

Stateful

Scope: Cluster

After performing a listing of HDFS files, the latest timestamp of all the files listed is stored. This allows the Processor to list only files that have been added or modified after this date the next time that the Processor is run, without having to store all of the actual filenames/paths which could lead to performance problems. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

ListHDFS Filter Modes

There are three filter modes available for ListHDFS that determine how the regular expression in the File Filter property will be applied to listings in HDFS.

  • Directories and Files: Filtering will be applied to the names of directories and files. If Recurse Subdirectories is set to true, only subdirectories with a matching name will be searched for files that match the regular expression defined in File Filter.

  • Files Only: Filtering will only be applied to the names of files. If Recurse Subdirectories is set to true, the entire subdirectory tree will be searched for files that match the regular expression defined in File Filter.

  • Full Path: Filtering will be applied to the full path of files. If Recurse Subdirectories is set to true, the entire subdirectory tree will be searched for files in which the full path of the file matches the regular expression defined in File Filter. Regarding scheme and authority, if a given file has a full path of hdfs://hdfscluster:8020/data/txt/1.txt, the filter will evaluate the regular expression defined in File Filter against two cases, matching if either is true:

  • the full path including the scheme (hdfs), authority (hdfscluster:8020), and the remaining path components (/data/txt/1.txt)

  • only the path components (/data/txt/1.txt)

Examples:

For the given examples, the following directory structure is used:

data
├── readme.txt
├── bin
│ ├── readme.txt
│ ├── 1.bin
│ ├── 2.bin
│ └── 3.bin
├── csv
│ ├── readme.txt
│ ├── 1.csv
│ ├── 2.csv
│ └── 3.csv
└── txt
  ├── readme.txt
  ├── 1.txt
  ├── 2.txt
  └── 3.txt

Directories and Files

This mode is useful when the listing should match the names of directories and files with the regular expression defined in File Filter. When Recurse Subdirectories is true, this mode allows the user to filter for files in subdirectories with names that match the regular expression defined in File Filter.

ListHDFS configuration:

Property Value

Directory

/data

Recurse Subdirectories

true

File Filter

.*txt.*

Filter Mode

Directories and Files

ListHDFS results:

  • /data/readme.txt

  • /data/txt/readme.txt

  • /data/txt/1.txt

  • /data/txt/2.txt

  • /data/txt/3.txt

Files Only

This mode is useful when the listing should match only the names of files with the regular expression defined in File Filter. Directory names will not be matched against the regular expression defined in File Filter. When Recurse Subdirectories is true, this mode allows the user to filter for files in the entire subdirectory tree of the directory specified in the Directory property.

ListHDFS configuration:

Property Value

Directory

/data

Recurse Subdirectories

true

File Filter

[^\.].*\.txt

Filter Mode

Files Only

ListHDFS results:

  • /data/readme.txt

  • /data/bin/readme.txt

  • /data/csv/readme.txt

  • /data/txt/readme.txt

  • /data/txt/1.txt

  • /data/txt/2.txt

  • /data/txt/3.txt

Full Path

This mode is useful when the listing should match the entire path of a file with the regular expression defined in File Filter. When Recurse Subdirectories is true, this mode allows the user to filter for files in the entire subdirectory tree of the directory specified in the Directory property while allowing filtering based on the full path of each file.

ListHDFS configuration:

Property Value

Directory

/data

Recurse Subdirectories

true

File Filter

(/.*/)*csv/.*

Filter Mode

Full Path

ListHDFS results:

  • /data/csv/readme.txt

  • /data/csv/1.csv

  • /data/csv/2.csv

  • /data/csv/3.csv
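
Putting the three examples together, the following sketch re-checks the documented results against the example tree using plain regular expressions (illustrative only; Python’s re.fullmatch stands in for Java’s matches(), and this is not ListHDFS code):

# Illustrative re-check of the three File Filter Mode examples above against the
# documented example tree. Not ListHDFS code.
import posixpath
import re

TREE = {
    "/data": ["readme.txt"],
    "/data/bin": ["readme.txt", "1.bin", "2.bin", "3.bin"],
    "/data/csv": ["readme.txt", "1.csv", "2.csv", "3.csv"],
    "/data/txt": ["readme.txt", "1.txt", "2.txt", "3.txt"],
}

def directories_and_files(pattern):
    regex = re.compile(pattern)
    results = []
    for directory, files in TREE.items():
        # Subdirectory names must match before their files are considered.
        if directory != "/data" and not regex.fullmatch(posixpath.basename(directory)):
            continue
        results += [posixpath.join(directory, f) for f in files if regex.fullmatch(f)]
    return results

def files_only(pattern):
    regex = re.compile(pattern)
    return [posixpath.join(d, f) for d, files in TREE.items()
            for f in files if regex.fullmatch(f)]

def full_path(pattern, authority="hdfs://hdfscluster:8020"):
    # Matches if either the path-only form or the scheme/authority form matches.
    regex = re.compile(pattern)
    return [posixpath.join(d, f) for d, files in TREE.items() for f in files
            if regex.fullmatch(posixpath.join(d, f))
            or regex.fullmatch(authority + posixpath.join(d, f))]

print(directories_and_files(r".*txt.*"))   # /data/readme.txt plus the files in /data/txt
print(files_only(r"[^\.].*\.txt"))         # every readme.txt plus 1.txt, 2.txt, 3.txt
print(full_path(r"(/.*/)*csv/.*"))         # the four files in /data/csv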

Streaming Versus Batch Processing

ListHDFS performs a listing of all files that it encounters in the configured HDFS directory. There are two common, broadly defined use cases.

Streaming Use Case

By default, the Processor will create a separate FlowFile for each file in the directory and add attributes for filename, path, etc. A common use case is to connect ListHDFS to the FetchHDFS processor. These two processors used in conjunction with one another provide the ability to easily monitor a directory and fetch the contents of any new file as it lands in HDFS in an efficient streaming fashion.

Batch Use Case

Another common use case is the desire to process all newly arriving files in a given directory, and to then perform some action only when all files have completed their processing. The above approach of streaming the data makes this difficult, because NiFi is inherently a streaming platform in that there is no “job” that has a beginning and an end. Data is simply picked up as it becomes available.

To solve this, the ListHDFS Processor can optionally be configured with a Record Writer. When a Record Writer is configured, a single FlowFile will be created that will contain a Record for each file in the directory, instead of a separate FlowFile per file. See the documentation for ListFile for an example of how to build a dataflow that allows for processing all the files before proceeding with any other step.

One important difference between the data produced by ListFile and ListHDFS, though, is the structure of the Records that are emitted. The Records emitted by ListFile have a different schema than those emitted by ListHDFS. ListHDFS emits records that follow the following schema (in Avro format):

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "filename",
      "type": "string"
    },
    {
      "name": "path",
      "type": "string"
    },
    {
      "name": "directory",
      "type": "boolean"
    },
    {
      "name": "size",
      "type": "long"
    },
    {
      "name": "lastModified",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "permissions",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "owner",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "group",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "replication",
      "type": [
        "null",
        "int"
      ]
    },
    {
      "name": "symLink",
      "type": [
        "null",
        "boolean"
      ]
    },
    {
      "name": "encrypted",
      "type": [
        "null",
        "boolean"
      ]
    },
    {
      "name": "erasureCoded",
      "type": [
        "null",
        "boolean"
      ]
    }
  ]
}

ListS3

Retrieves a listing of objects from an S3 bucket. For each object that is listed, creates a FlowFile that represents the object so that it can be fetched in conjunction with FetchS3Object. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

Tags: Amazon, S3, AWS, list

Properties

Bucket

The S3 Bucket to interact with

Region

The AWS Region to connect to.

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Listing Strategy

Specify how to determine new/updated entities. See each strategy’s description for details.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or in case of a primary node change. The 'Tracking Entities' strategy requires tracking information for all entities listed within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses a DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If listed entities are tracked per node, the optional '::{nodeId}' part is added so that state is managed separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key is deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp falls inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp within the last 30 minutes is a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity is removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Minimum Object Age

The minimum age that an S3 object must be in order to be considered; any object younger than this amount of time (according to last modification date) will be ignored

Maximum Object Age

The maximum age that an S3 object can be in order to be considered; any object older than this amount of time (according to last modification date) will be ignored

Listing Batch Size

If not using a Record Writer, this property dictates how many S3 objects should be listed in a single batch. Once this number is reached, the FlowFiles that have been created will be transferred out of the Processor. Setting this value lower may result in lower latency by sending out the FlowFiles before the complete listing has finished. However, it can significantly reduce performance. Larger values may take more memory to store all of the information before sending the FlowFiles out. This property is ignored if using a Record Writer, as one of the main benefits of the Record Writer is being able to emit the entire listing as a single FlowFile.

Write Object Tags

If set to 'True', the tags associated with the S3 object will be written as FlowFile attributes

Write User Metadata

If set to 'True', the user defined metadata associated with the S3 object will be added to FlowFile attributes/records

Communications Timeout

The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Signer Override

The AWS S3 library uses Signature Version 4 by default but this property allows you to specify the Version 2 signer to support older S3-compatible services or even to plug in your own custom signer implementation.

Custom Signer Class Name

Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.

Custom Signer Module Location

Comma-separated list of paths to files and/or directories which contain the custom signer’s JAR file and its dependencies (if any).

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Delimiter

The string used to delimit directories within the bucket. Please consult the AWS documentation for the correct use of this field.

Prefix

The prefix used to filter the object list. Do not begin with a forward slash '/'. In most cases, it should end with a forward slash '/'.

Use Versions

Specifies whether to use S3 versions, if applicable. If false, only the latest version of each object will be returned.

List Type

Specifies whether to use the original List Objects or the newer List Objects Version 2 endpoint.

Requester Pays

If true, indicates that the requester consents to pay any charges associated with listing the S3 bucket. This sets the 'x-amz-request-payer' header to 'requester'. Note that this setting is not applicable when 'Use Versions' is 'true'.

Relationships

  • success: FlowFiles are routed to this Relationship after they have been successfully processed.

Writes Attributes

  • s3.bucket: The name of the S3 bucket

  • s3.region: The region of the S3 bucket

  • filename: The name of the file

  • s3.etag: The ETag that can be used to see if the file has changed

  • s3.isLatest: A boolean indicating if this is the latest version of the object

  • s3.lastModified: The last modified time in milliseconds since epoch in UTC time

  • s3.length: The size of the object in bytes

  • s3.storeClass: The storage class of the object

  • s3.version: The version of the object, if applicable

  • s3.tag._: If 'Write Object Tags' is set to 'True', the tags associated to the S3 object that is being listed will be written as part of the flowfile attributes

  • s3.user.metadata._: If 'Write User Metadata' is set to 'True', the user defined metadata associated to the S3 object that is being listed will be written as part of the flowfile attributes

Stateful

Scope: Cluster

After performing a listing of keys, the timestamp of the newest key is stored, along with the keys that share that same timestamp. This allows the Processor to list only keys that have been added or modified after this date the next time that the Processor is run. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Streaming Versus Batch Processing

ListS3 performs a listing of all S3 Objects that it encounters in the configured S3 bucket. There are two common, broadly defined use cases.

Streaming Use Case

By default, the Processor will create a separate FlowFile for each object in the bucket and add attributes for filename, bucket, etc. A common use case is to connect ListS3 to the FetchS3 processor. These two processors used in conjunction with one another provide the ability to easily monitor a bucket and fetch the contents of any new object as it lands in S3 in an efficient streaming fashion.

Batch Use Case

Another common use case is the desire to process all newly arriving objects in a given bucket, and to then perform some action only when all objects have completed their processing. The above approach of streaming the data makes this difficult, because NiFi is inherently a streaming platform in that there is no “job” that has a beginning and an end. Data is simply picked up as it becomes available.

To solve this, the ListS3 Processor can optionally be configured with a Record Writer. When a Record Writer is configured, a single FlowFile will be created that will contain a Record for each object in the bucket, instead of a separate FlowFile per object. See the documentation for ListFile for an example of how to build a dataflow that allows for processing all the objects before proceeding with any other step.

One important difference between the data produced by ListFile and ListS3, though, is the structure of the Records that are emitted. The Records emitted by ListFile have a different schema than those emitted by ListS3. ListS3 emits records that follow the following schema (in Avro format):

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "key",
      "type": "string"
    },
    {
      "name": "bucket",
      "type": "string"
    },
    {
      "name": "owner",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "etag",
      "type": "string"
    },
    {
      "name": "lastModified",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "size",
      "type": "long"
    },
    {
      "name": "storageClass",
      "type": "string"
    },
    {
      "name": "latest",
      "type": "boolean"
    },
    {
      "name": "versionId",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "tags",
      "type": [
        "null",
        {
          "type": "map",
          "values": "string"
        }
      ]
    },
    {
      "name": "userMetadata",
      "type": [
        "null",
        {
          "type": "map",
          "values": "string"
        }
      ]
    }
  ]
}
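
For illustration, a single ListS3 record conforming to this schema might look like the following (all values are hypothetical; the tags and userMetadata maps are nullable):

# Hypothetical ListS3 listing record matching the schema above, as a Python literal.
example_record = {
    "key": "data/txt/1.txt",
    "bucket": "my-bucket",
    "owner": "bucket-owner",
    "etag": "d41d8cd98f00b204e9800998ecf8427e",
    "lastModified": 1704067200000,   # timestamp-millis
    "size": 100,
    "storageClass": "STANDARD",
    "latest": True,
    "versionId": None,
    "tags": {"project": "nifi"},            # nullable map; hypothetical value
    "userMetadata": {"source": "example"},  # nullable map; hypothetical value
}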

ListSFTP

Performs a listing of the files residing on an SFTP server. For each file that is found on the remote server, a new FlowFile will be created with the filename attribute set to the name of the file on the remote server. This can then be used in conjunction with FetchSFTP in order to fetch those files.

Tags: list, sftp, remote, ingest, source, input, files

Properties

Listing Strategy

Specify how to determine new/updated entities. See each strategy’s description for details.

Hostname

The fully qualified hostname or IP address of the remote system

Port

The port that the remote system is listening on for file transfers

Username

Username

Password

Password for the user account

Private Key Path

The fully qualified path to the Private Key file

Private Key Passphrase

Password for the private key

Remote Path

The path on the remote system from which to pull or push files

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Search Recursively

If true, will pull files from arbitrarily nested subdirectories; otherwise, will not traverse subdirectories

Follow symlink

If true, symbolic links to files will be pulled and symbolically linked subdirectories will be traversed; otherwise, symbolic links to files will not be read and symbolically linked subdirectories will not be traversed

File Filter Regex

Provides a Java Regular Expression for filtering Filenames; if a filter is supplied, only files whose names match that Regular Expression will be fetched

Path Filter Regex

When Search Recursively is true, then only subdirectories whose path matches the given Regular Expression will be scanned

Ignore Dotted Files

If true, files whose names begin with a dot (".") will be ignored

Remote Poll Batch Size

The value specifies how many file paths to find in a given directory on the remote system when doing a file listing. This value in general should not need to be modified but when polling against a remote system with a tremendous number of files this value can be critical. Setting this value too high can result in very poor performance and setting it too low can cause the flow to be slower than normal.

Strict Host Key Checking

Indicates whether or not strict enforcement of host keys should be applied

Host Key File

If supplied, the given file will be used as the Host Key. Otherwise, if the 'Strict Host Key Checking' property is set to true, the 'known_hosts' and 'known_hosts2' files from the ~/.ssh directory will be used; if not, no host key file will be used

Connection Timeout

Amount of time to wait before timing out while creating a connection

Data Timeout

When transferring a file between the local and remote system, this value specifies how long is allowed to elapse without any data being transferred between systems

Send Keep Alive On Timeout

Send a Keep Alive message every 5 seconds up to 5 times for an overall timeout of 25 seconds.

Target System Timestamp Precision

Specify timestamp precision at the target system. Since this processor uses timestamp of entities to decide which should be listed, it is crucial to use the right timestamp precision.

Use Compression

Indicates whether or not ZLIB compression should be used when transferring files

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN, SOCKS + AuthN

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restart or in case of primary node change. The 'Tracking Entities' strategy requires tracking information of all listed entities within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses DistributedMapCache instead of managed state. Cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If it tracks listed entities per node, then the optional '::{nodeId}' part is added to manage state separately. E.g. cluster wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key will be deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp within the most recent 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity will be removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Minimum File Age

The minimum age that a file must be in order to be pulled; any file younger than this amount of time (according to last modification date) will be ignored

Maximum File Age

The maximum age that a file can be in order to be pulled; any file older than this amount of time (according to last modification date) will be ignored

Minimum File Size

The minimum size that a file must be in order to be pulled

Maximum File Size

The maximum size that a file can be in order to be pulled

Ciphers Allowed

A comma-separated list of Ciphers allowed for SFTP connections. Leave unset to allow all. Available options are: 3des-cbc, 3des-ctr, aes128-cbc, aes128-ctr, aes128-gcm@openssh.com, aes192-cbc, aes192-ctr, aes256-cbc, aes256-ctr, aes256-gcm@openssh.com, arcfour, arcfour128, arcfour256, blowfish-cbc, blowfish-ctr, cast128-cbc, cast128-ctr, chacha20-poly1305@openssh.com, idea-cbc, idea-ctr, serpent128-cbc, serpent128-ctr, serpent192-cbc, serpent192-ctr, serpent256-cbc, serpent256-ctr, twofish-cbc, twofish128-cbc, twofish128-ctr, twofish192-cbc, twofish192-ctr, twofish256-cbc, twofish256-ctr

Key Algorithms Allowed

A comma-separated list of Key Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: ecdsa-sha2-nistp256, ecdsa-sha2-nistp256-cert-v01@openssh.com, ecdsa-sha2-nistp384, ecdsa-sha2-nistp384-cert-v01@openssh.com, ecdsa-sha2-nistp521, ecdsa-sha2-nistp521-cert-v01@openssh.com, rsa-sha2-256, rsa-sha2-512, ssh-dss, ssh-dss-cert-v01@openssh.com, ssh-ed25519, ssh-ed25519-cert-v01@openssh.com, ssh-rsa, ssh-rsa-cert-v01@openssh.com

Key Exchange Algorithms Allowed

A comma-separated list of Key Exchange Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: curve25519-sha256, curve25519-sha256@libssh.org, diffie-hellman-group-exchange-sha1, diffie-hellman-group-exchange-sha256, diffie-hellman-group1-sha1, diffie-hellman-group14-sha1, diffie-hellman-group14-sha256, diffie-hellman-group14-sha256@ssh.com, diffie-hellman-group15-sha256, diffie-hellman-group15-sha256@ssh.com, diffie-hellman-group15-sha384@ssh.com, diffie-hellman-group15-sha512, diffie-hellman-group16-sha256, diffie-hellman-group16-sha384@ssh.com, diffie-hellman-group16-sha512, diffie-hellman-group16-sha512@ssh.com, diffie-hellman-group17-sha512, diffie-hellman-group18-sha512, diffie-hellman-group18-sha512@ssh.com, ecdh-sha2-nistp256, ecdh-sha2-nistp384, ecdh-sha2-nistp521, ext-info-c

Message Authentication Codes Allowed

A comma-separated list of Message Authentication Codes allowed for SFTP connections. Leave unset to allow all. Available options are: hmac-md5, hmac-md5-96, hmac-md5-96-etm@openssh.com, hmac-md5-etm@openssh.com, hmac-ripemd160, hmac-ripemd160-96, hmac-ripemd160-etm@openssh.com, hmac-ripemd160@openssh.com, hmac-sha1, hmac-sha1-96, hmac-sha1-96@openssh.com, hmac-sha1-etm@openssh.com, hmac-sha2-256, hmac-sha2-256-etm@openssh.com, hmac-sha2-512, hmac-sha2-512-etm@openssh.com

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • sftp.remote.host: The hostname of the SFTP Server

  • sftp.remote.port: The port that was connected to on the SFTP Server

  • sftp.listing.user: The username of the user that performed the SFTP Listing

  • file.owner: The numeric owner id of the source file

  • file.group: The numeric group id of the source file

  • file.permissions: The read/write/execute permissions of the source file

  • file.size: The number of bytes in the source file

  • file.lastModifiedTime: The timestamp of when the file in the filesystem was last modified as 'yyyy-MM-dd’T’HH:mm:ssZ'

  • filename: The name of the file on the SFTP Server

  • path: The fully qualified name of the directory on the SFTP Server from which the file was pulled

  • mime.type: The MIME Type that is provided by the configured Record Writer

Stateful

Scope: Cluster

After performing a listing of files, the timestamp of the newest file is stored. This allows the Processor to list only files that have been added or modified after this date the next time that the Processor is run. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node will not duplicate the data that was listed by the previous Primary Node.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

ListSFTP performs a listing of all files that it encounters in the configured directory of an SFTP server. There are two common, broadly defined use cases.

Streaming Use Case

By default, the Processor will create a separate FlowFile for each file in the directory and add attributes for filename, path, etc. A common use case is to connect ListSFTP to the FetchSFTP processor. These two processors used in conjunction with one another provide the ability to easily monitor a directory and fetch the contents of any new file as it lands on the SFTP server in an efficient streaming fashion.

Batch Use Case

Another common use case is the desire to process all newly arriving files in a given directory, and to then perform some action only when all files have completed their processing. The above approach of streaming the data makes this difficult, because NiFi is inherently a streaming platform in that there is no “job” that has a beginning and an end. Data is simply picked up as it becomes available.

To solve this, the ListSFTP Processor can optionally be configured with a Record Writer. When a Record Writer is configured, a single FlowFile will be created that will contain a Record for each file in the directory, instead of a separate FlowFile per file. With this pattern, in order to fetch the contents of each file, the records must be split up into individual FlowFiles and then fetched. So how does this help us?

We can still accomplish the desired use case of waiting until all files in the directory have been processed by splitting apart the FlowFile and processing all the data within a Process Group. Configuring the Process Group with a FlowFile Concurrency of “Single FlowFile per Node” means that only one FlowFile will be brought into the Process Group. Once that happens, the FlowFile can be split apart and each part processed. Configuring the Process Group with an Outbound Policy of “Batch Output” means that none of the FlowFiles will leave the Process Group until all have finished processing.

In this flow, we perform a listing of a directory with ListSFTP. The processor is configured with a Record Writer (in this case a CSV Writer, but any Record Writer can be used) so that only a single FlowFile is generated for the entire listing. That listing is then sent to the “Process Listing” Process Group (shown below). Only after the contents of the entire directory have been processed will data leave the “Process Listing” Process Group. At that point, when all data in the Process Group is ready to leave, each of the processed files will be sent to the “Post-Processing” Process Group. At the same time, the original listing is to be sent to the “Processing Complete Notification” Process Group. In order to accomplish this, the Process Group must be configured with a FlowFile Concurrency of “Single FlowFile per Node” and an Outbound Policy of “Batch Output.”

Within the “Process Listing” Process Group, a listing is received via the “Listing” Input Port. This is then sent directly to the “Listing of Processed Data” Output Port so that when all processing completes, the original listing will be sent out as well.

Next, the listing is broken apart into an individual FlowFile per record. Because we want to use FetchSFTP to fetch the data, we need to get the file’s filename and path as FlowFile attributes. This can be done in a few different ways, but the easiest mechanism is to use the PartitionRecord processor. This Processor is configured with a Record Reader that is able to read the data written by ListSFTP (in this case, a CSV Reader). The Processor is also configured with two additional user-defined properties:

  • path: /path

  • filename: /filename

As a result, each record that comes into the PartitionRecord processor will be split into an individual FlowFile (because the combination of the “path” and “filename” fields will be unique for each Record) and the “filename” and “path” record fields will become attributes on the FlowFile. FetchSFTP is configured to use a value of ${path}/${filename} for the “Remote File” property, making use of these attributes.
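
As a rough sketch of how these pieces fit together (the path and filename values below are hypothetical, and only the relevant record fields are shown), a listing record such as the following would be emitted by PartitionRecord as its own FlowFile with “path” and “filename” attributes populated from the record fields, and FetchSFTP’s “Remote File” value of ${path}/${filename} would then resolve to /upload/incoming/data-2024-01-01.csv:

{
  "filename": "data-2024-01-01.csv",
  "path": "/upload/incoming"
}
json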

Finally, we process the data - in this example, simply by compressing it with GZIP compression - and send the output to the “Processed Data” Output Port. The data will queue up here until all data is ready to leave the Process Group and then will be released.

Record Schema

When the Processor is configured to write the listing using a Record Writer, the Records will be written using the following schema (in Avro format):

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "filename",
      "type": "string"
    },
    {
      "name": "path",
      "type": "string"
    },
    {
      "name": "directory",
      "type": "boolean"
    },
    {
      "name": "size",
      "type": "long"
    },
    {
      "name": "lastModified",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "permissions",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "owner",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "group",
      "type": [
        "null",
        "string"
      ]
    }
  ]
}
json
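
As a rough illustration, a single record written by ListSFTP using this schema might look like the following (all values are hypothetical):

{
  "filename": "data-2024-01-01.csv",
  "path": "/upload/incoming",
  "directory": false,
  "size": 10240,
  "lastModified": 1704067200000,
  "permissions": "rw-r--r--",
  "owner": "1000",
  "group": "1000"
}
json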

ListSmb

Lists concrete files shared via SMB protocol. Each listed file may result in one flowfile, the metadata being written as flowfile attributes. Or - in case the 'Record Writer' property is set - the entire result is written as records to a single flowfile. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

Tags: samba, smb, cifs, files, list

Properties

SMB Client Provider Service

Specifies the SMB client provider to use for creating SMB connections.

Listing Strategy

Specify how to determine new/updated entities. See the description of each strategy for details.

Input Directory

The network folder from which to list files. This is the remaining relative path after the share: smb://HOSTNAME:PORT/SHARE/[DIRECTORY]/sub/directories. It is also possible to add subdirectories. The given path on the remote file share must exist. This can be checked using verification. You may mix Windows and Linux-style directory separators.

File Name Suffix Filter

Files ending with the given suffix will be omitted. Can be used to make sure that files that are still uploading are not listed multiple times: give such files a suffix while the upload is in progress and remove the suffix once the upload finishes. This is highly recommended when using the 'Tracking Entities' or 'Tracking Timestamps' listing strategies.

Record Writer

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Minimum File Age

The minimum age that a file must be in order to be listed; any file younger than this amount of time will be ignored.

Maximum File Age

Any file older than the given value will be omitted.

Minimum File Size

Any file smaller than the given value will be omitted.

Maximum File Size

Any file larger than the given value will be omitted.

Target System Timestamp Precision

Specify timestamp precision at the target system. Since this processor uses timestamp of entities to decide which should be listed, it is crucial to use the right timestamp precision.

Entity Tracking State Cache

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restart or in case of primary node change. The 'Tracking Entities' strategy requires tracking information of all listed entities within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses DistributedMapCache instead of managed state. Cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If it tracks listed entities per node, then the optional '::{nodeId}' part is added to manage state separately. E.g. cluster wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key will be deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

Entity Tracking Time Window

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp within the most recent 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity’s timestamp becomes older than the specified time window, that entity will be removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

Entity Tracking Initial Listing Target

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

Relationships

  • success: All FlowFiles that are received are routed to success

Writes Attributes

  • filename: The name of the file that was read from filesystem.

  • shortName: The short name of the file that was read from filesystem.

  • path: The path is set to the relative path of the file’s directory on the remote filesystem compared to the Share root directory. For example, for a given remote location smb://HOSTNAME:PORT/SHARE/DIRECTORY, and a file being listed from smb://HOSTNAME:PORT/SHARE/DIRECTORY/sub/folder/file, the path attribute will be set to "DIRECTORY/sub/folder".

  • serviceLocation: The SMB URL of the share.

  • lastModifiedTime: The timestamp of when the file’s content changed in the filesystem as 'yyyy-MM-dd’T’HH:mm:ss'.

  • creationTime: The timestamp of when the file was created in the filesystem as 'yyyy-MM-dd’T’HH:mm:ss'.

  • lastAccessTime: The timestamp of when the file was accessed in the filesystem as 'yyyy-MM-dd’T’HH:mm:ss'.

  • changeTime: The timestamp of when the file’s attributes were changed in the filesystem as 'yyyy-MM-dd’T’HH:mm:ss'.

  • size: The size of the file in bytes.

  • allocationSize: The number of bytes allocated for the file on the server.

Stateful

Scope: Cluster

After performing a listing of files, the state of the previous listing can be stored in order to list files continuously without duplication.

Input Requirement

This component does not allow an incoming relationship.

LogAttribute

Emits attributes of the FlowFile at the specified log level

Tags: attributes, logging

Properties

Log Level

The Log Level to use when logging the Attributes

Log Payload

If true, the FlowFile’s payload will be logged, in addition to its attributes; otherwise, just the Attributes will be logged.

Attributes to Log

A comma-separated list of Attributes to Log. If not specified, all attributes will be logged unless Attributes to Log by Regular Expression is modified. There’s an AND relationship between the two properties.

Attributes to Log by Regular Expression

A regular expression indicating the Attributes to Log. If not specified, all attributes will be logged unless Attributes to Log is modified. There’s an AND relationship between the two properties.

Attributes to Ignore

A comma-separated list of Attributes to ignore. If not specified, no attributes will be ignored unless Attributes to Ignore by Regular Expression is modified. There’s an OR relationship between the two properties.

Attributes to Ignore by Regular Expression

A regular expression indicating the Attributes to Ignore. If not specified, no attributes will be ignored unless Attributes to Ignore is modified. There’s an OR relationship between the two properties.

Log FlowFile Properties

Specifies whether or not to log FlowFile "properties", such as Entry Date, Lineage Start Date, and content size

Output Format

Specifies the format to use for logging FlowFile attributes

Log prefix

Log prefix appended to the log lines. It helps to distinguish the output of multiple LogAttribute processors.

Character Set

The name of the CharacterSet to use

Relationships

  • success: All FlowFiles are routed to this relationship

Input Requirement

This component requires an incoming relationship.

LogMessage

Emits a log message at the specified log level

Tags: attributes, logging

Properties

Log Level

The Log Level to use when logging the message: [trace, debug, info, warn, error]

Log prefix

Log prefix appended to the log lines. It helps to distinguish the output of multiple LogMessage processors.

Log message

The log message to emit

Relationships

  • success: All FlowFiles are routed to this relationship

Input Requirement

This component requires an incoming relationship.

LookupAttribute

Lookup attributes from a lookup service

Tags: lookup, cache, enrich, join, attributes, Attribute Expression Language

Properties

Lookup Service

The lookup service to use for attribute lookups

Include Empty Values

Include null or blank values for keys that are null or blank

Dynamic Properties

The name of the attribute to add to the FlowFile

Adds a FlowFile attribute specified by the dynamic property’s key with the value found in the lookup service using the dynamic property’s value

Relationships

  • failure: FlowFiles with failing lookups are routed to this relationship

  • matched: FlowFiles with matching lookups are routed to this relationship

  • unmatched: FlowFiles with missing lookups are routed to this relationship

Input Requirement

This component requires an incoming relationship.

LookupRecord

Extracts one or more fields from a Record and looks up a value for those fields in a LookupService. If a result is returned by the LookupService, that result is optionally added to the Record. In this case, the processor functions as an Enrichment processor. Regardless, the Record is then routed to either the 'matched' relationship or 'unmatched' relationship (if the 'Routing Strategy' property is configured to do so), indicating whether or not a result was returned by the LookupService, allowing the processor to also function as a Routing processor. The "coordinates" to use for looking up a value in the Lookup Service are defined by adding a user-defined property. Each property that is added will have an entry added to a Map, where the name of the property becomes the Map Key and the value returned by the RecordPath becomes the value for that key. If multiple values are returned by the RecordPath, then the Record will be routed to the 'unmatched' relationship (or 'success', depending on the 'Routing Strategy' property’s configuration). If one or more fields match the Result RecordPath, all fields that match will be updated. If there is no match in the configured LookupService, then no fields will be updated. I.e., it will not overwrite an existing value in the Record with a null value. Please note, however, that if the results returned by the LookupService are not accounted for in your schema (specifically, the schema that is configured for your Record Writer) then the fields will not be written out to the FlowFile.

Tags: lookup, enrichment, route, record, csv, json, avro, database, db, logs, convert, filter

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Lookup Service

The Lookup Service to use in order to lookup a value in each Record

Root Record Path

A RecordPath that points to a child Record within each of the top-level Records in the FlowFile. If specified, the additional RecordPath properties will be evaluated against this child Record instead of the top-level Record. This allows for performing enrichment against multiple child Records within a single top-level Record.

Routing Strategy

Specifies how to route records after a Lookup has completed

Record Result Contents

When a result is obtained that contains a Record, this property determines whether the Record itself is inserted at the configured path or if the contents of the Record (i.e., the sub-fields) will be inserted at the configured path.

Record Update Strategy

This property defines the strategy to use when updating the record with the value returned by the Lookup Service.

Result RecordPath

A RecordPath that points to the field whose value should be updated with whatever value is returned from the Lookup Service. If not specified, the value that is returned from the Lookup Service will be ignored, except for determining whether the FlowFile should be routed to the 'matched' or 'unmatched' Relationship.

Cache Size

Specifies how many lookup values/records should be cached. Setting this property to zero means no caching will be done and the table will be queried for each lookup value in each record. If the lookup table changes often or the most recent data must be retrieved, do not use the cache.

Dynamic Properties

Value To Lookup

A RecordPath that points to the field whose value will be looked up in the configured Lookup Service

Relationships

  • success: All records will be sent to this Relationship if configured to do so, unless a failure occurs

  • failure: If a FlowFile cannot be enriched, the unchanged FlowFile will be routed to this relationship

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records in the FlowFile

Input Requirement

This component requires an incoming relationship.

Additional Details

LookupRecord makes use of the NiFi RecordPath Domain-Specific Language (DSL) to allow the user to indicate which field(s), depending on the Record Update Strategy, in the Record should be updated. The Record will be updated using the value returned by the provided Lookup Service.

Record Update Strategy - Use Property

In this case, the user should add to the Processor’s configuration as many user-defined properties as are required by the Lookup Service to form the lookup coordinates. The names of the properties should match the names expected by the Lookup Service.

The field evaluated using the path configured in the “Result RecordPath” property will be the field updated with the value returned by the Lookup Service.

Let’s assume a Simple Key Value Lookup Service containing the following key/value pairs:

FR => France
CA => Canada

Let’s assume the following JSON with three records as input:

[
  {
    "country": null,
    "code": "FR"
  },
  {
    "country": null,
    "code": "CA"
  },
  {
    "country": null,
    "code": "JP"
  }
]
json

The processor is configured with “Use Property” as “Record Update Strategy”, the “Result RecordPath” is configured with “/country” and a user-defined property is added with the name “key” (as required by this Lookup Service) and the value “/code”.

When triggered, the processor will look for the value associated with the “/code” path and will use that value as the “key” of the Lookup Service. The value returned by the Lookup Service will be used to update the value corresponding to “/country”. With the above examples, it will produce:

[
  {
    "country": "France",
    "code": "FR"
  },
  {
    "country": "Canada",
    "code": "CA"
  },
  {
    "country": null,
    "code": "JP"
  }
]
json
Record Update Strategy - Replace Existing Values

With this strategy, the “Result RecordPath” property will be ignored and the configured Lookup Service must be a single simple key lookup service. For each user-defined property, the value contained in the field corresponding to the record path will be used as the key in the Lookup Service and will be replaced by the value returned by the Lookup Service. It is possible to configure multiple dynamic properties to update multiple fields in one execution. This strategy only supports replacement of simple types (strings, integers, etc.).

Since this strategy allows in-place replacement, it is possible to use Record Paths for fields contained in arrays.

Let’s assume a Simple Key Value Lookup Service containing the following key/value pairs:

FR => France
CA => Canada
fr => French
en => English

Let’s assume the following JSON with two records as input:

[
  {
    "locales": [
      {
        "region": "FR",
        "language": "fr"
      },
      {
        "region": "US",
        "language": "en"
      }
    ]
  },
  {
    "locales": [
      {
        "region": "CA",
        "language": "fr"
      },
      {
        "region": "JP",
        "language": "ja"
      }
    ]
  }
]
json

The processor is configured with “Replace Existing Values” as the “Record Update Strategy”, and two user-defined properties are added: “region” ⇒ “/locales[*]/region” and “language” ⇒ “/locales[*]/language”.

When triggered, the processor will loop over the user-defined properties. First, it will search for the fields corresponding to “/locales[*]/region”; for each value found in the record, that value will be used as the key in the Lookup Service and will be replaced by the result returned by the Lookup Service. Example: the first region is “FR” and this key is associated with the value “France” in the Lookup Service, so the value “FR” is replaced by “France” in the record. With the above examples, it will produce:

[
  {
    "locales": [
      {
        "region": "France",
        "language": "French"
      },
      {
        "region": "US",
        "language": "English"
      }
    ]
  },
  {
    "locales": [
      {
        "region": "Canada",
        "language": "French"
      },
      {
        "region": "JP",
        "language": "ja"
      }
    ]
  }
]
json

MergeContent

Merges a Group of FlowFiles together based on a user-defined strategy and packages them into a single FlowFile. It is recommended that the Processor be configured with only a single incoming connection, as Group of FlowFiles will not be created from FlowFiles in different connections. This processor updates the mime.type attribute as appropriate. NOTE: this processor should NOT be configured with Cron Driven for the Scheduling Strategy.

Use Cases

Concatenate FlowFiles with textual content together in order to create fewer, larger FlowFiles.

Keywords: concatenate, bundle, aggregate, bin, merge, combine, smash

Input Requirement: This component allows an incoming relationship.

  1. "Merge Strategy" = "Bin Packing Algorithm"

  2. "Merge Format" = "Binary Concatenation"

  3. "Delimiter Strategy" = "Text"

  4. "Demarcator" = "\n" (a newline can be inserted by pressing Shift + Enter)

  5. "Minimum Number of Entries" = "1"

  6. "Maximum Number of Entries" = "500000000"

  7. "Minimum Group Size" = the minimum amount of data to write to an output FlowFile. A reasonable value might be "128 MB"

  8. "Maximum Group Size" = the maximum amount of data to write to an output FlowFile. A reasonable value might be "256 MB"

  9. "Max Bin Age" = the maximum amount of time to wait for incoming data before timing out and transferring the FlowFile along even though it is smaller than the Max Bin Age. A reasonable value might be "5 mins" .

Concatenate FlowFiles with binary content together in order to create fewer, larger FlowFiles.

Notes: Not all binary data can be concatenated together. Whether or not this configuration is valid depends on the type of your data.

Keywords: concatenate, bundle, aggregate, bin, merge, combine, smash

Input Requirement: This component allows an incoming relationship.

  1. "Merge Strategy" = "Bin Packing Algorithm"

  2. "Merge Format" = "Binary Concatenation"

  3. "Delimiter Strategy" = "Text"

  4. "Minimum Number of Entries" = "1"

  5. "Maximum Number of Entries" = "500000000"

  6. "Minimum Group Size" = the minimum amount of data to write to an output FlowFile. A reasonable value might be "128 MB"

  7. "Maximum Group Size" = the maximum amount of data to write to an output FlowFile. A reasonable value might be "256 MB"

  8. "Max Bin Age" = the maximum amount of time to wait for incoming data before timing out and transferring the FlowFile along even though it is smaller than the Max Bin Age. A reasonable value might be "5 mins" .

Reassemble a FlowFile that was previously split apart into smaller FlowFiles by a processor such as SplitText, UnpackContent, SplitRecord, etc.

Keywords: reassemble, repack, merge, recombine

Input Requirement: This component allows an incoming relationship.

  1. "Merge Strategy" = "Defragment"

  2. "Merge Format" = the value of Merge Format depends on the desired output format. If the file was previously zipped together and was split apart by UnpackContent,

  3. a Merge Format of "ZIP" makes sense. If it was previously a .tar file, a Merge Format of "TAR" makes sense. If the data is textual, "Binary Concatenation" can be

  4. used to combine the text into a single document.

  5. "Delimiter Strategy" = "Text"

  6. "Max Bin Age" = the maximum amount of time to wait for incoming data before timing out and transferring the fragments to 'failure'. A reasonable value might be "5 mins" .

  7. For textual data, "Demarcator" should be set to a newline (\n), set by pressing Shift+Enter in the UI. For binary data, "Demarcator" should be left blank. .

Tags: merge, content, correlation, tar, zip, stream, concatenation, archive, flowfile-stream, flowfile-stream-v3

Properties

Merge Strategy

Specifies the algorithm used to merge content. The 'Defragment' algorithm combines fragments that are associated by attributes back into a single cohesive FlowFile. The 'Bin-Packing Algorithm' generates a FlowFile populated by arbitrarily chosen FlowFiles

Merge Format

Determines the format that will be used to merge the content.

Attribute Strategy

Determines which FlowFile attributes should be added to the bundle. If 'Keep All Unique Attributes' is selected, any attribute on any FlowFile that gets bundled will be kept unless its value conflicts with the value from another FlowFile. If 'Keep Only Common Attributes' is selected, only the attributes that exist on all FlowFiles in the bundle, with the same value, will be preserved.

Correlation Attribute Name

If specified, like FlowFiles will be binned together, where 'like FlowFiles' means FlowFiles that have the same value for this Attribute. If not specified, FlowFiles are bundled by the order in which they are pulled from the queue.

Metadata Strategy

For FlowFiles whose input format supports metadata (Avro, e.g.), this property determines which metadata should be added to the bundle. If 'Use First Metadata' is selected, the metadata keys/values from the first FlowFile to be bundled will be used. If 'Keep Only Common Metadata' is selected, only the metadata that exists on all FlowFiles in the bundle, with the same value, will be preserved. If 'Ignore Metadata' is selected, no metadata is transferred to the outgoing bundled FlowFile. If 'Do Not Merge Uncommon Metadata' is selected, any FlowFile whose metadata values do not match those of the first bundled FlowFile will not be merged.

Minimum Number of Entries

The minimum number of files to include in a bundle

Maximum Number of Entries

The maximum number of files to include in a bundle

Minimum Group Size

The minimum size for the bundle

Maximum Group Size

The maximum size for the bundle. If not specified, there is no maximum.

Bin Termination Check

Specifies an Expression Language Expression that is to be evaluated against each FlowFile. If the result of the expression is 'true', the bin that the FlowFile corresponds to will be terminated, even if the bin has not met the minimum number of entries or minimum size. Note that if the FlowFile that triggers the termination of the bin is itself larger than the Maximum Bin Size, it will be placed into its own bin without triggering the termination of any other bin. When using this property, it is recommended to use Prioritizers in the flow’s connections to ensure that the ordering is as desired.

FlowFile Insertion Strategy

If a given FlowFile terminates the bin based on the <Bin Termination Check> property, specifies where the FlowFile should be included in the bin.

Max Bin Age

The maximum age of a Bin that will trigger a Bin to be complete. Expected format is <duration> <time unit> where <duration> is a positive integer and time unit is one of seconds, minutes, hours

Maximum number of Bins

Specifies the maximum number of bins that can be held in memory at any one time

Delimiter Strategy

Determines if Header, Footer, and Demarcator should point to files containing the respective content, or if the values of the properties should be used as the content.

Header

Filename or text specifying the header to use. If not specified, no header is supplied.

Footer

Filename or text specifying the footer to use. If not specified, no footer is supplied.

Demarcator

Filename or text specifying the demarcator to use. If not specified, no demarcator is supplied.

Compression Level

Specifies the compression level to use when using the Zip Merge Format; if not using the Zip Merge Format, this value is ignored

Keep Path

If using the Zip or Tar Merge Format, specifies whether or not the FlowFiles' paths should be included in their entry names.

Tar Modified Time

If using the Tar Merge Format, specifies if the Tar entry should store the modified timestamp either by expression (e.g. ${file.lastModifiedTime}) or by static value, both of which must match the ISO8601 format 'yyyy-MM-dd’T’HH:mm:ssZ'.

Relationships

  • failure: If the bundle cannot be created, all FlowFiles that would have been used to create the bundle will be transferred to failure

  • merged: The FlowFile containing the merged content

  • original: The FlowFiles that were used to create the bundle

Reads Attributes

  • fragment.identifier: Applicable only if the <Merge Strategy> property is set to Defragment. All FlowFiles with the same value for this attribute will be bundled together.

  • fragment.index: Applicable only if the <Merge Strategy> property is set to Defragment. This attribute indicates the order in which the fragments should be assembled. This attribute must be present on all FlowFiles when using the Defragment Merge Strategy and must be a unique (i.e., unique across all FlowFiles that have the same value for the "fragment.identifier" attribute) integer between 0 and the value of the fragment.count attribute. If two or more FlowFiles have the same value for the "fragment.identifier" attribute and the same value for the "fragment.index" attribute, the first FlowFile processed will be accepted and subsequent FlowFiles will not be accepted into the Bin.

  • fragment.count: Applicable only if the <Merge Strategy> property is set to Defragment. This attribute indicates how many FlowFiles should be expected in the given bundle. At least one FlowFile must have this attribute in the bundle. If multiple FlowFiles contain the "fragment.count" attribute in a given bundle, all must have the same value.

  • segment.original.filename: Applicable only if the <Merge Strategy> property is set to Defragment. This attribute must be present on all FlowFiles with the same value for the fragment.identifier attribute. All FlowFiles in the same bundle must have the same value for this attribute. The value of this attribute will be used for the filename of the completed merged FlowFile.

  • tar.permissions: Applicable only if the <Merge Format> property is set to TAR. The value of this attribute must be 3 characters; each character must be in the range 0 to 7 (inclusive) and indicates the file permissions that should be used for the FlowFile’s TAR entry. If this attribute is missing or has an invalid value, the default value of 644 will be used

Writes Attributes

  • filename: When more than 1 file is merged, the filename comes from the segment.original.filename attribute. If that attribute does not exist in the source FlowFiles, then the filename is set to the current system time in nanoseconds. Then a filename extension may be applied: if Merge Format is TAR, then the filename will be appended with .tar, if Merge Format is ZIP, then the filename will be appended with .zip, if Merge Format is FlowFileStream, then the filename will be appended with .pkg

  • merge.count: The number of FlowFiles that were merged into this bundle

  • merge.bin.age: The age of the bin, in milliseconds, when it was merged and output. Effectively this is the greatest amount of time that any FlowFile in this bundle remained waiting in this processor before it was output

  • merge.uuid: UUID of the merged flow file that will be added to the original flow files attributes.

  • merge.reason: This processor allows for several thresholds to be configured for merging FlowFiles. This attribute indicates which of the Thresholds resulted in the FlowFiles being merged. For an explanation of each of the possible values and their meanings, see the Processor’s Usage / documentation and see the 'Additional Details' page.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: While content is not stored in memory, the FlowFiles' attributes are. The configuration of MergeContent (maximum bin size, maximum group size, maximum bin age, max number of entries) will influence how much memory is used. If merging together many small FlowFiles, a two-stage approach may be necessary in order to avoid excessive use of memory.

Additional Details

Introduction

The MergeContent Processor provides the ability to combine many FlowFiles into a single FlowFile. There are many reasons that a dataflow designer may want to do this. For example, it may be helpful to create batches of data before sending to a downstream system, because the downstream system is better optimized for large files than for many tiny files. NiFi itself can also benefit from this, as NiFi operates best on “micro-batches,” where each FlowFile is several kilobytes to several megabytes in size.

The Processor creates several ‘bins’ to put the FlowFiles in. The maximum number of bins to use is set to 5 by default, but this can be changed by updating the value of the <Maximum number of Bins> property. The number of bins is bound in order to avoid running out of Java heap space. Note: while the contents of a FlowFile are stored in the Content Repository and not in the Java heap space, the Processor must hold the FlowFile objects themselves in memory. As a result, these FlowFiles with their attributes can potentially take up a great deal of heap space and cause OutOfMemoryErrors to be thrown. In order to avoid this, if you expect to merge many small FlowFiles together, it is advisable to instead use a MergeContent that merges no more than say 1,000 FlowFiles into a bundle and then use a second MergeContent to merge these small bundles into larger bundles. For example, to merge 1,000,000 FlowFiles together, use a MergeContent that uses a <Maximum Number of Entries> of 1,000 and route the “merged” Relationship to a second MergeContent that also sets the <Maximum Number of Entries> to 1,000. The second MergeContent will then merge 1,000 bundles of 1,000, which in effect produces bundles of 1,000,000.

How FlowFiles are Binned

How the Processor determines which bin to place a FlowFile in depends on a few different configuration options. Firstly, the Merge Strategy is considered. The Merge Strategy can be set to one of two options: “Bin Packing Algorithm” or “Defragment”. When the goal is to simply combine smaller FlowFiles into one larger FlowFile, the Bin Packing Algorithm should be used. This algorithm picks a bin based on whether the FlowFile can fit in the bin according to its size and the <Maximum Group Size> property, and whether the FlowFile is ‘like’ the other FlowFiles in the bin. What it means for two FlowFiles to be ‘like FlowFiles’ is discussed at the end of this section.

The “Defragment” Merge Strategy can be used when FlowFiles need to be explicitly assigned to the same bin. For example, if data is split apart using the UnpackContent Processor, each unpacked FlowFile can be processed independently and later merged back together using this Processor with the Merge Strategy set to Defragment. In order for FlowFiles to be added to the same bin when using this configuration, the FlowFiles must have the same value for the “fragment.identifier” attribute. Each FlowFile with the same identifier must also have a unique value for the “fragment.index” attribute so that the FlowFiles can be ordered correctly. For a given “fragment.identifier”, at least one FlowFile must have the “fragment.count” attribute (which indicates how many FlowFiles belong in the bin). Other FlowFiles with the same identifier must have the same value for the “fragment.count” attribute, or they can omit this attribute. NOTE: while there are valid use cases for breaking apart FlowFiles and later re-merging them, it is an antipattern to take a larger FlowFile, break it into a million tiny FlowFiles, and then re-merge them. Doing so can result in using huge amounts of Java heap and can result in Out Of Memory Errors. Additionally, it adds large amounts of load to the NiFi framework. This can result in increased CPU and disk utilization and often times can be an order of magnitude lower throughput and an order of magnitude higher latency. As an alternative, whenever possible, dataflows should be built to make use of Record-oriented processors, such as QueryRecord, PartitionRecord, UpdateRecord, LookupRecord, PublishKafkaRecord_2_6, etc.
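
To illustrate the Defragment strategy, the sketch below shows the attributes of three FlowFiles (identifiers, indices, and filenames are hypothetical) that would be binned together and reassembled in the order given by “fragment.index”; note that only one of the fragments strictly needs to carry the “fragment.count” attribute:

[
  {
    "fragment.identifier": "a1b2c3d4-0001",
    "fragment.index": "0",
    "fragment.count": "3",
    "segment.original.filename": "large-archive.tar"
  },
  {
    "fragment.identifier": "a1b2c3d4-0001",
    "fragment.index": "1",
    "segment.original.filename": "large-archive.tar"
  },
  {
    "fragment.identifier": "a1b2c3d4-0001",
    "fragment.index": "2",
    "segment.original.filename": "large-archive.tar"
  }
]
json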

In order to be added to the same bin, two FlowFiles must be ‘like FlowFiles.’ In order for two FlowFiles to be like FlowFiles, they must have the same schema, and if the <Correlation Attribute Name> property is set, they must have the same value for the specified attribute. For example, if the <Correlation Attribute Name> is set to “filename”, then two FlowFiles must have the same value for the “filename” attribute in order to be binned together. If more than one attribute is needed in order to correlate two FlowFiles, it is recommended to use an UpdateAttribute processor before the MergeContent processor and combine the attributes. For example, if the goal is to bin together two FlowFiles only if they have the same value for the “abc” attribute and the “xyz” attribute, then we could accomplish this by using UpdateAttribute and adding a property with name “correlation.attribute” and a value of “abc=${abc},xyz=${xyz}” and then setting MergeContent’s <Correlation Attribute Name> property to “correlation.attribute”.
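
For example, with hypothetical attribute values of “blue” for “abc” and “42” for “xyz”, the UpdateAttribute step described above would leave the FlowFile with the attributes sketched below; MergeContent would then bin on the combined “correlation.attribute” value, so only FlowFiles sharing both original values end up in the same bin:

{
  "abc": "blue",
  "xyz": "42",
  "correlation.attribute": "abc=blue,xyz=42"
}
json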

When a Bin is Merged

Above, we discussed how a bin is chosen for a given FlowFile. Once a bin has been created and FlowFiles added to it, we must have some way to determine when a bin is “full” so that we can bin those FlowFiles together into a “merged” FlowFile.

If the <Merge Strategy> property is set to “Bin Packing Algorithm”, then the following rules will be evaluated.

MergeContent exposes several different thresholds that can be used to create bins that are of an ideal size. For example, the user can specify the minimum number of FlowFiles that must be packaged together before merging will be performed. The minimum number of bytes can also be configured. Additionally, a maximum number of FlowFiles and bytes may be specified.

There are two other conditions that will result in the contents of a Bin being merged together. The Max Bin Age property specifies the maximum amount of time that FlowFiles can be binned together before the bin is merged. This property should almost always be set, as it provides a means to set a timeout on a bin, so that even if data stops flowing to the Processor for a while (due to a problem with an upstream system, a source processor being stopped, etc.) the FlowFiles won’t remain stuck in the MergeContent processor indefinitely. Additionally, the processor exposes a property for the maximum number of Bins that should be used. For some use cases, this won’t matter much. However, if the Correlation Attribute property is set, this can be important. When an incoming FlowFile is to be placed in a Bin, the processor must find an appropriate Bin to place the FlowFile into, or else create a new one. If a Bin must be created, and the number of Bins that exist is greater than or equal to the value of the <Maximum number of Bins> property, then the oldest Bin will be merged together to make room for the new one.

If the <Merge Strategy> property is set to “Defragment”, then a bin is full only when the number of FlowFiles in the bin is equal to the number specified by the “fragment.count” attribute of one of the FlowFiles in the bin. All FlowFiles that have this attribute must have the same value for this attribute, or else they will be routed to the “failure” relationship. It is not necessary that all FlowFiles have this value, but at least one FlowFile in the bin must have this value or the bin will never be complete. If all the necessary FlowFiles are not binned together by the point at which the bin times out (as specified by the <Max Bin Age> property), then the FlowFiles will all be routed to the ‘failure’ relationship instead of being merged together.

Finally, a bin can be merged if the <Bin Termination Check> property is configured and a FlowFile is received that satisfies the specified condition. The condition is specified as an Expression Language expression. If any FlowFile results in the expression returning a value of true, then the bin will be merged, regardless of how much data is in the bin or how old the bin is. The incoming FlowFile that triggers the bin to be merged can either be added as the last entry in the bin, as the first entry in a new bin, or output as its own bin, depending on the value of the <FlowFile Insertion Strategy> property.

A bin of FlowFiles, then, is merged when any one of the following conditions is met:

  • The bin has reached the maximum number of bytes, as configured by the <Maximum Group Size> property.

  • The bin has reached the maximum number of FlowFiles, as configured by the <Maximum Number of Entries> property.

  • The bin has reached both the minimum number of bytes, as configured by the <Minimum Group Size> property, AND the minimum number of FlowFiles, as configured by the <Minimum Number of Entries> property.

  • The bin has reached the maximum age, as configured by the <Max Bin Age> property.

  • The maximum number of bins has been reached, as configured by the <Maximum number of Bins> property, and a new bin must be created.

  • The <Bin Termination Check> property is configured and a FlowFile is received that satisfies the specified condition.

Reason for Merge

Whenever the contents of a Bin are merged, an attribute with the name “merge.reason” will be added to the merged FlowFile. The listing below provides all possible values for this attribute with an explanation of each.

MAX_BYTES_THRESHOLD_REACHED

The bin has reached the maximum number of bytes, as configured by the <Maximum Group Size> property. When this threshold is reached, the contents of the Bin will be merged together, even if the Minimum Number of Entries has not yet been reached.

MAX_ENTRIES_THRESHOLD_REACHED

The bin has reached the maximum number of FlowFiles, as configured by the <Maximum Number of Entries> property. When this threshold is reached, the contents of the Bin will be merged together, even if the minimum number of bytes (Min Group Size) has not yet been reached.

MIN_THRESHOLDS_REACHED

The bin has reached both the minimum number of bytes, as configured by the <Minimum Group Size> property, AND the minimum number of FlowFiles, as configured by the <Minimum Number of Entries> property. The bin has not reached the maximum number of bytes (Max Group Size) OR the maximum number of FlowFiles (Maximum Number of Entries).

TIMEOUT

The Bin has reached the maximum age, as configured by the <Max Bin Age> property. If this threshold is reached, the contents of the Bin will be merged together, even if the Bin has not yet reached either of the minimum thresholds. Note that the age here is determined by when the Bin was created, NOT the age of the FlowFiles that reside within those Bins. As a result, if the Processor is stopped until it has 1 million FlowFiles queued, each one being 10 days old, but the Max Bin Age is set to “1 day,” the Max Bin Age will not be met for at least one full day, even though the FlowFiles themselves are much older than this threshold. If the Processor is stopped and restarted, all Bins are destroyed and recreated, and the timer is reset.

BIN_MANAGER_FULL

If an incoming FlowFile does not fit into any of the existing Bins (either due to the Maximum thresholds set, or due to the Correlation Attribute being used, etc.), then a new Bin must be created for the incoming FlowFiles. If the number of active Bins is already equal to the value of the <Maximum number of Bins> property, the oldest Bin will be merged in order to make room for the new Bin. In that case, the Bin Manager is said to be full, and this value will be used.

BIN_TERMINATION_SIGNAL

A FlowFile signaled that the Bin should be terminated by satisfying the configured <Bin Termination Check> property.

Note that the attribute value is minimally named, while the textual description is far more verbose. This is done for a few reasons. Firstly, storing a large value for the attribute can be more costly, utilizing more heap space and requiring more resources to process. Secondly, it’s more succinct, which makes it easier to talk about. Most importantly, though, it means that a processor such as RouteOnAttribute can be used, if necessary, to route based on the value of the attribute. In this way, the explanation can be further expanded or updated, without changing the value of the attribute and without disturbing existing flows.

MergeMetro

Merges incoming FlowFiles with matching FlowFiles on the "Metro Line", i.e. any queue connected to a PutMetro processor using the same Metro Line Controller as this processor. Incoming FlowFiles and FlowFiles on the Metro Line are matched based on the attribute set in the "Correlation Attribute Name" property. If no matching FlowFile can be found on the Metro Line and "Failure if not found" is unchecked, the incoming FlowFile is put back into the queue and penalized (see "Penalty Duration" in Settings).

Tags: virtimo, metro

Properties

Metro Controller

The processor uses this controller’s Metro Line to connect with PutMetro processors.

Correlation Attribute Name

The name of the attribute used to match and merge incoming FlowFiles with FlowFiles cached in the associated Metro Line.

Failure if not found

If checked, route an incoming FlowFile to the 'failure' relationship if no matching FlowFile is immediately found in the Metro Line. This should only ever be checked in conjunction with retries on the failure relationship, as a rendezvous attempt between PutMetro and GetMetro/MergeMetro may occasionally fail. If unchecked, the incoming FlowFile is put back in the queue and penalized (see 'Penalty Duration' in Settings).

Header

Text (UTF-8) specifying the header to use. If not specified, no header is supplied.

Footer

Text (UTF-8) specifying the footer to use. If not specified, no footer is supplied.

Demarcator

Text (UTF-8) specifying the demarcator to use. If not specified, no demarcator is supplied.

Reads Attributes

  • <Correlation Attribute Name>: Use this property to specify which attribute should be evaluated for matching FlowFiles.

Input Requirement

This component requires an incoming relationship.

MergeRecord

This Processor merges together multiple record-oriented FlowFiles into a single FlowFile that contains all of the Records of the input FlowFiles. This Processor works by creating 'bins' and then adding FlowFiles to these bins until they are full. Once a bin is full, all of the FlowFiles will be combined into a single output FlowFile, and that FlowFile will be routed to the 'merged' Relationship. A bin will consist of potentially many 'like FlowFiles'. In order for two FlowFiles to be considered 'like FlowFiles', they must have the same Schema (as identified by the Record Reader) and, if the <Correlation Attribute Name> property is set, the same value for the specified attribute. See Processor Usage and Additional Details for more information. NOTE: this processor should NOT be configured with Cron Driven for the Scheduling Strategy.

Use Cases

Combine together many arbitrary Records in order to create a single, larger file

Input Requirement: This component allows an incoming relationship.

  1. Configure the "Record Reader" to specify a Record Reader that is appropriate for the incoming data type.

  2. Configure the "Record Writer" to specify a Record Writer that is appropriate for the desired output data type.

  3. Set "Merge Strategy" to Bin-Packing Algorithm.

  4. Set the "Minimum Bin Size" to desired file size of the merged output file. For example, a value of 1 MB will result in not merging data until at least

  5. 1 MB of data is available (unless the Max Bin Age is reached first). If there is no desired minimum file size, leave the default value of 0 B.

  6. Set the "Minimum Number of Records" property to the minimum number of Records that should be included in the merged output file. For example, setting the value

  7. to 10000 ensures that the output file will have at least 10,000 Records in it (unless the Max Bin Age is reached first).

  8. Set the "Max Bin Age" to specify the maximum amount of time to hold data before merging. This can be thought of as a "timeout" at which time the Processor will

  9. merge whatever data it is, even if the "Minimum Bin Size" and "Minimum Number of Records" has not been reached. It is always recommended to set the value.

  10. A reasonable default might be 10 mins if there is no other latency requirement. .

  11. Connect the 'merged' Relationship to the next component in the flow. Auto-terminate the 'original' Relationship. .

Multi-Processor Use Cases

Combine together many Records that have the same value for a particular field in the data, in order to create a single, larger file

Keywords: merge, combine, aggregate, like records, similar data

PartitionRecord:

  1. Configure the "Record Reader" to specify a Record Reader that is appropriate for the incoming data type.

  2. Configure the "Record Writer" to specify a Record Writer that is appropriate for the desired output data type. .

  3. Add a single additional property. The name of the property should describe the field on which the data is being merged together.

  4. The property’s value should be a RecordPath that specifies which output FlowFile the Record belongs to. .

  5. For example, to merge together data that has the same value for the "productSku" field, add a property named productSku with a value of /productSku. .

  6. Connect the "success" Relationship to MergeRecord.

  7. Auto-terminate the "original" Relationship. .

MergeRecord:

  1. Configure the "Record Reader" to specify a Record Reader that is appropriate for the incoming data type.

  2. Configure the "Record Writer" to specify a Record Writer that is appropriate for the desired output data type.

  3. Set "Merge Strategy" to Bin-Packing Algorithm.

  4. Set the "Minimum Bin Size" to desired file size of the merged output file. For example, a value of 1 MB will result in not merging data until at least

  5. 1 MB of data is available (unless the Max Bin Age is reached first). If there is no desired minimum file size, leave the default value of 0 B.

  6. Set the "Minimum Number of Records" property to the minimum number of Records that should be included in the merged output file. For example, setting the value

  7. to 10000 ensures that the output file will have at least 10,000 Records in it (unless the Max Bin Age is reached first).

  8. Set the "Maximum Number of Records" property to a value at least as large as the "Minimum Number of Records." If there is no need to limit the maximum number of

  9. records per file, this number can be set to a value that will never be reached such as 1000000000.

  10. Set the "Max Bin Age" to specify the maximum amount of time to hold data before merging. This can be thought of as a "timeout" at which time the Processor will

  11. merge whatever data it is, even if the "Minimum Bin Size" and "Minimum Number of Records" has not been reached. It is always recommended to set the value.

  12. A reasonable default might be 10 mins if there is no other latency requirement.

  13. Set the value of the "Correlation Attribute Name" property to the name of the property that you added in the PartitionRecord Processor. For example, if merging data

  14. based on the "productSku" field, the property in PartitionRecord was named productSku so the value of the "Correlation Attribute Name" property should

  15. be productSku.

  16. Set the "Maximum Number of Bins" property to a value that is at least as large as the different number of values that will be present for the Correlation Attribute.

  17. For example, if you expect 1,000 different SKUs, set this value to at least 1001. It is not advisable, though, to set the value above 10,000. .

  18. Connect the 'merged' Relationship to the next component in the flow.

  19. Auto-terminate the 'original' Relationship. .

Tags: merge, record, content, correlation, stream, event

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Merge Strategy

Specifies the algorithm used to merge records. The 'Defragment' algorithm combines fragments that are associated by attributes back into a single cohesive FlowFile. The 'Bin-Packing Algorithm' generates a FlowFile populated by arbitrarily chosen FlowFiles

Correlation Attribute Name

If specified, two FlowFiles will be binned together only if they have the same value for this Attribute. If not specified, FlowFiles are bundled by the order in which they are pulled from the queue.

Attribute Strategy

Determines which FlowFile attributes should be added to the bundle. If 'Keep All Unique Attributes' is selected, any attribute on any FlowFile that gets bundled will be kept unless its value conflicts with the value from another FlowFile. If 'Keep Only Common Attributes' is selected, only the attributes that exist on all FlowFiles in the bundle, with the same value, will be preserved.

Minimum Number of Records

The minimum number of records to include in a bin

Maximum Number of Records

The maximum number of Records to include in a bin. This is a 'soft limit' in that if a FlowFile is added to a bin, all records in that FlowFile will be added, so this limit may be exceeded by up to the number of records in the last input FlowFile.

Minimum Bin Size

The minimum size for the bin

Maximum Bin Size

The maximum size for the bundle. If not specified, there is no maximum. This is a 'soft limit' in that if a FlowFile is added to a bin, all records in that FlowFile will be added, so this limit may be exceeded by up to the number of bytes in the last input FlowFile.

Max Bin Age

The maximum age of a Bin that will trigger a Bin to be complete. Expected format is <duration> <time unit> where <duration> is a positive integer and time unit is one of seconds, minutes, hours

Maximum Number of Bins

Specifies the maximum number of bins that can be held in memory at any one time. This number should not be smaller than the maximum number of concurrent threads for this Processor, or the bins that are created will often consist only of a single incoming FlowFile.

Relationships

  • failure: If the bundle cannot be created, all FlowFiles that would have been used to create the bundle will be transferred to failure

  • merged: The FlowFile containing the merged records

  • original: The FlowFiles that were used to create the bundle

Reads Attributes

  • fragment.identifier: Applicable only if the <Merge Strategy> property is set to Defragment. All FlowFiles with the same value for this attribute will be bundled together.

  • fragment.count: Applicable only if the <Merge Strategy> property is set to Defragment. This attribute must be present on all FlowFiles with the same value for the fragment.identifier attribute. All FlowFiles in the same bundle must have the same value for this attribute. The value of this attribute indicates how many FlowFiles should be expected in the given bundle.

Writes Attributes

  • record.count: The merged FlowFile will have a 'record.count' attribute indicating the number of records that were written to the FlowFile.

  • mime.type: The MIME Type indicated by the Record Writer

  • merge.count: The number of FlowFiles that were merged into this bundle

  • merge.bin.age: The age of the bin, in milliseconds, when it was merged and output. Effectively this is the greatest amount of time that any FlowFile in this bundle remained waiting in this processor before it was output

  • merge.uuid: UUID of the merged FlowFile that will be added to the original FlowFiles attributes

  • merge.completion.reason: This processor allows for several thresholds to be configured for merging FlowFiles. This attribute indicates which of the Thresholds resulted in the FlowFiles being merged. For an explanation of each of the possible values and their meanings, see the Processor’s Usage / documentation and see the 'Additional Details' page.

  • <Attributes from Record Writer>: Any Attribute that the configured Record Writer returns will be added to the FlowFile.

Input Requirement

This component requires an incoming relationship.

Additional Details

Introduction

The MergeRecord Processor allows the user to take many FlowFiles that consist of record-oriented data (any data format for which there is a Record Reader available) and combine the FlowFiles into one larger FlowFile. This may be preferable before pushing the data to a downstream system that prefers larger batches of data, such as HDFS, or in order to improve performance of a NiFi flow by reducing the number of FlowFiles that flow through the system (thereby reducing the contention placed on the FlowFile Repository, Provenance Repository, Content Repository, and FlowFile Queues).

The Processor creates several ‘bins’ to put the FlowFiles in. The maximum number of bins to use is set to 5 by default, but this can be changed by updating the value of the <Maximum Number of Bins> property. The number of bins is bound in order to avoid running out of Java heap space. Note: while the contents of a FlowFile are stored in the Content Repository and not in the Java heap space, the Processor must hold the FlowFile objects themselves in memory. As a result, these FlowFiles with their attributes can potentially take up a great deal of heap space and cause OutOfMemoryError’s to be thrown. In order to avoid this, if you expect to merge many small FlowFiles together, it is advisable to instead use a MergeRecord that merges no more than, say, 1,000 records into a bundle and then use a second MergeRecord to merge these small bundles into larger bundles. For example, to merge 1,000,000 records together, use a MergeRecord that uses a <Maximum Number of Records> of 1,000 and route the “merged” Relationship to a second MergeRecord that also sets the <Maximum Number of Records> to 1,000. The second MergeRecord will then merge 1,000 bundles of 1,000, which in effect produces bundles of 1,000,000.

How FlowFiles are Binned

How the Processor determines which bin to place a FlowFile in depends on a few different configuration options. Firstly, the Merge Strategy is considered. The Merge Strategy can be set to one of two options: Bin Packing Algorithm, or Defragment. When the goal is to simply combine smaller FlowFiles into one larger FlowFile, the Bin Packing Algorithm should be used. This algorithm picks a bin based on whether the FlowFile can fit in the bin according to its size and the <Maximum Bin Size> property and whether the FlowFile is ‘like’ the other FlowFiles in the bin. What it means for two FlowFiles to be ‘like FlowFiles’ is discussed at the end of this section.

The “Defragment” Merge Strategy can be used when records need to be explicitly assigned to the same bin. For example, if data is split apart using the SplitRecord Processor, each ‘split’ can be processed independently and later merged back together using this Processor with the Merge Strategy set to Defragment. In order for FlowFiles to be added to the same bin when using this configuration, the FlowFiles must have the same value for the “fragment.identifier” attribute. Each FlowFile with the same identifier must also have the same value for the “fragment.count” attribute (which indicates how many FlowFiles belong in the bin) and a unique value for the “fragment.index” attribute so that the FlowFiles can be ordered correctly.

In order to be added to the same bin, two FlowFiles must be ‘like FlowFiles.’ In order for two FlowFiles to be like FlowFiles, they must have the same schema, and if the <Correlation Attribute Name> property is set, they must have the same value for the specified attribute. For example, if the <Correlation Attribute Name> is set to “filename” then two FlowFiles must have the same value for the “filename” attribute in order to be binned together. If more than one attribute is needed in order to correlate two FlowFiles, it is recommended to use an UpdateAttribute processor before the MergeRecord processor and combine the attributes. For example, if the goal is to bin together two FlowFiles only if they have the same value for the “abc” attribute and the “xyz” attribute, then we could accomplish this by using UpdateAttribute and adding a property with name “correlation.attribute” and a value of “abc=${abc},xyz=${xyz}” and then setting MergeRecord’s <Correlation Attribute Name> property to “correlation.attribute”.

It is often useful to bin together only Records that have the same value for some field. For example, if we have point-of-sale data, perhaps the desire is to bin together records that belong to the same store, as identified by the 'storeId' field. This can be accomplished by making use of the PartitionRecord Processor ahead of MergeRecord. This Processor will allow one or more fields to be configured as the partitioning criteria and will create attributes for those corresponding values. An UpdateAttribute processor could then be used, if necessary, to combine multiple attributes into a single correlation attribute, as described above. See documentation for those processors for more details.

When a Bin is Merged

Above, we discussed how a bin is chosen for a given FlowFile. Once a bin has been created and FlowFiles added to it, we must have some way to determine when a bin is “full” so that we can bin those FlowFiles together into a “merged” FlowFile.

If the <Merge Strategy> property is set to “Bin Packing Algorithm” then the following rules will be evaluated. Firstly, in order for a bin to be full, both of the thresholds specified by the <Minimum Bin Size> and the <Minimum Number of Records> properties must be satisfied. If one of these properties is not set, then it is ignored. Secondly, if either the <Maximum Bin Size> or the <Maximum Number of Records> property is reached, then the bin is merged. That is, both of the minimum values must be reached but only one of the maximum values need be reached. Note that the <Maximum Number of Records> property is a “soft limit,” meaning that all records in a given input FlowFile will be added to the same bin, and as a result the number of records may exceed the maximum configured number of records. Once this happens, though, no more Records will be added to that same bin from another FlowFile. If the <Max Bin Age> is reached for a bin, then the FlowFiles in that bin will be merged, even if the minimum bin size and minimum number of records have not yet been met. Finally, if the maximum number of bins has been created (as specified by the <Maximum Number of Bins> property), and some input FlowFiles cannot fit into any of the existing bins, then the oldest bin will be merged to make room. This is done because otherwise we would not be able to add any additional FlowFiles to the existing bins and would have to wait until the Max Bin Age is reached (if ever) in order to merge any FlowFiles.
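
The completion rules above can be condensed into a short Python sketch (the Bin structure and parameter names are illustrative only, not the processor's internal API):

import time
from dataclasses import dataclass, field

@dataclass
class Bin:
    created_at: float = field(default_factory=time.time)
    record_count: int = 0
    byte_count: int = 0

def bin_is_full(b, min_records=1, min_bytes=0, max_records=None, max_bytes=None,
                max_age_seconds=None):
    """Bin-Packing completion rules as described above: every configured minimum
    must be met, but reaching either maximum, or the Max Bin Age, completes the
    bin on its own."""
    if max_age_seconds is not None and time.time() - b.created_at >= max_age_seconds:
        return True                                   # Max Bin Age acts as a timeout
    if max_records is not None and b.record_count >= max_records:
        return True                                   # a single maximum is enough
    if max_bytes is not None and b.byte_count >= max_bytes:
        return True
    return b.record_count >= min_records and b.byte_count >= min_bytes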

If the <Merge Strategy> property is set to “Defragment” then a bin is full only when the number of FlowFiles in the bin is equal to the number specified by the “fragment.count” attribute of one of the FlowFiles in the bin. All FlowFiles that have this attribute must have the same value for it, or else they will be routed to the “failure” relationship. It is not necessary that all FlowFiles have this value, but at least one FlowFile in the bin must have this value or the bin will never be complete. If all the necessary FlowFiles are not binned together by the point at which the bin times out (as specified by the <Max Bin Age> property), then the FlowFiles will all be routed to the 'failure' relationship instead of being merged together.
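
The Defragment completeness check can be sketched in the same hedged way (FlowFiles are represented here as plain dictionaries of attributes):

def defragment_bin_complete(flowfiles):
    """A Defragment bin is complete once the number of FlowFiles equals the
    declared fragment.count; conflicting counts send the bin to 'failure'."""
    counts = {ff["fragment.count"] for ff in flowfiles if "fragment.count" in ff}
    if len(counts) > 1:
        raise ValueError("conflicting fragment.count values -> route bin to 'failure'")
    if not counts:
        return False      # no FlowFile in the bin has declared the expected count yet
    return len(flowfiles) == int(next(iter(counts)))

# Two of three expected fragments have arrived, so the bin is not yet complete.
print(defragment_bin_complete([{"fragment.identifier": "x", "fragment.count": "3"},
                               {"fragment.identifier": "x", "fragment.count": "3"}]))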

Once a bin is merged into a single FlowFile, it can sometimes be useful to understand why exactly the bin was merged when it was. For example, if the maximum number of allowable bins is reached, a merged FlowFile may consist of far fewer records than expected. In order to help understand the behavior, the Processor will emit a JOIN Provenance Event when creating the merged FlowFile, and the JOIN event will include in it a “Details” field that explains why the bin was merged when it was. For example, the event will indicate “Records Merged due to: Bin is full” if the bin reached its minimum thresholds and no more subsequent FlowFiles were added to it. Or it may indicate “Records Merged due to: Maximum number of bins has been exceeded” if the bin was merged due to the configured maximum number of bins being filled and needing to free up space for a new bin.

When a Failure Occurs

When a bin is filled, the Processor is responsible for merging together all the records in those FlowFiles into a single FlowFile. If the Processor fails to do so for any reason (for example, a Record cannot be read from an input FlowFile), then all the FlowFiles in that bin are routed to the ‘failure’ Relationship. The Processor does not skip the single problematic FlowFile and merge the others. This behavior was chosen because of two different considerations. Firstly, without those problematic records, the bin may not truly be full, as the minimum bin size may not be reached without those records. Secondly, and more importantly, if the problematic FlowFile contains 100 “good” records before the problematic ones, those 100 records would already have been written to the “merged” FlowFile. We cannot un-write those records. If we were to then send those 100 records on and route the problematic FlowFile to ‘failure’ then in a situation where the “failure” relationship is eventually routed back to MergeRecord, we could end up continually duplicating those 100 successfully processed records.

Examples

To better understand how this Processor works, we will lay out a few examples. For the sake of simplicity of these examples, we will use CSV-formatted data and write the merged data as CSV-formatted data, but the format of the data is not really relevant, as long as there is a Record Reader that is capable of reading the data and a Record Writer capable of writing the data in the desired format.

Example 1 - Batching Together Many Small FlowFiles

When we want to batch together many small FlowFiles in order to create one larger FlowFile, we will accomplish this by using the “Bin Packing Algorithm” Merge Strategy. The idea here is to bundle together as many FlowFiles as we can within our minimum and maximum number of records and bin size. Consider that we have the following properties set:

Property Name                  Property Value
Merge Strategy                 Bin Packing Algorithm
Minimum Number of Records      3
Maximum Number of Records      5

Also consider that we have the following data on the queue, with the schema indicating a Name and an Age field:

FlowFile ID    FlowFile Contents
1              Mark, 33
2              John, 45
               Jane, 43
3              Jake, 3
4              Jan, 2

In this case, because we have not configured a Correlation Attribute, and because all FlowFiles have the same schema, the Processor will attempt to add all of these FlowFiles to the same bin. Because the Minimum Number of Records is 3 and the Maximum Number of Records is 5, all the FlowFiles will be added to the same bin. The output, then, is a single FlowFile with the following content:

Mark, 33
John, 45
Jane, 43
Jake, 3
Jan, 2

When the Processor runs, it will bin all the FlowFiles that it can get from the queue. After that, it will merge any bin that is “full enough.” So if we had only 3 FlowFiles on the queue, those 3 would have been added, and a new bin would have been created in the next iteration, once the 4th FlowFile showed up. However, if we had 8 FlowFiles queued up, only 5 would have been added to the first bin. The other 3 would have been added to a second bin, and that bin would then be merged since it reached the minimum threshold of 3 also.

MergeXml

Merges XML content. FlowFiles can be grouped and merged based on the value of the correlation attribute. The correlation value will be written to the 'xml.correlation' attribute. If a correlation attribute is given and a FlowFile does not include this attribute, it will be routed to the uncorrelated relationship. If no correlation attribute name is given, the files will be merged in the given order. If a FlowFile cannot be parsed as XML, it will be routed to the failure relationship.

Tags: virtimo, xml, merge

Properties

Root Tag Name

Name of the surrounding root tag.

Reuse Existing Roots

If the source document already contains a root tag with the defined name, that tag is not attached again; instead, its children are attached directly.

Minimum Number of Entries

The minimum number of files to include in a bin before merging. If Expression Language is used, it will be evaluated against the first matching FlowFile.

Maximum Number of Entries

The maximum number of files to include in a bin before merging. Only takes effect if no 'Correlation Attribute Name' is specified.

Correlation Attribute Name

FlowFiles with the same value for this attribute will be binned together and merged. If not specified, FlowFiles are bundled by the order in which they are pulled from the queue.

Inherit all Attributes

If checked, all attributes of all FlowFiles are inherited. If FlowFiles contain the same attribute but with different values, only the value from the last processed FlowFile will be inherited. If unchecked, only the attributes that are common to all FlowFiles (same name and value) are inherited.

Relationships

  • success: Successfully merged content.

  • failure: Failed to merge content.

  • uncorrelated: Correlation attribute could not be found.

Reads Attributes

  • <Correlation Attribute Name>: Use this property to specify which attribute will determine which FlowFiles will be merged together.

Writes Attributes

  • xml.correlation: The correlation attribute’s value.

  • xml.error: The XML parsing error.

Stateful

Scope: Local

Stores the correlation attribute this instance is working with.

Input Requirement

This component requires an incoming relationship.
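
To illustrate the merge behaviour described above, here is a minimal sketch using Python's standard library; the root-tag handling (including the 'Reuse Existing Roots' option) is an approximation, not the processor's implementation:

import xml.etree.ElementTree as ET

def merge_xml(documents, root_tag="records", reuse_existing_roots=True):
    """Wrap several XML documents in a common root tag; if a document already
    uses that root tag and reuse is enabled, attach its children directly."""
    merged = ET.Element(root_tag)
    for doc in documents:
        element = ET.fromstring(doc)
        if reuse_existing_roots and element.tag == root_tag:
            merged.extend(list(element))   # attach children, drop the duplicate root
        else:
            merged.append(element)
    return ET.tostring(merged, encoding="unicode")

print(merge_xml(["<records><a>1</a></records>", "<b>2</b>"]))
# <records><a>1</a><b>2</b></records>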

ModifyBytes

Discards a byte range at the start and end, or all content, of a binary file.

Tags: binary, discard, keep

Properties

Start Offset

Number of bytes removed at the beginning of the file.

End Offset

Number of bytes removed at the end of the file.

Remove All Content

Remove all content from the FlowFile, superseding the Start Offset and End Offset properties.

Relationships

  • success: Processed flowfiles.

Input Requirement

This component requires an incoming relationship.
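
The effect of the Start Offset, End Offset and Remove All Content properties can be pictured with a small sketch (the function and parameter names are illustrative):

def modify_bytes(content: bytes, start_offset: int = 0, end_offset: int = 0,
                 remove_all: bool = False) -> bytes:
    """Drop start_offset bytes from the front and end_offset bytes from the end,
    or everything if remove_all is set."""
    if remove_all:
        return b""
    end = len(content) - end_offset
    return content[start_offset:end] if end > start_offset else b""

assert modify_bytes(b"HEADER|payload|FOOTER", start_offset=7, end_offset=7) == b"payload"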

ModifyCompression

Changes the compression algorithm used to compress the contents of a FlowFile by decompressing the contents of FlowFiles using a user-specified compression algorithm and recompressing the contents using the specified compression format properties. This processor operates in a very memory-efficient way, so very large objects well beyond the heap size are generally fine to process.

Tags: content, compress, recompress, gzip, bzip2, lzma, xz-lzma2, snappy, snappy-hadoop, snappy framed, lz4-framed, deflate, zstd, brotli

Properties

Input Compression Strategy

The strategy to use for decompressing input FlowFiles

Output Compression Strategy

The strategy to use for compressing output FlowFiles

Output Compression Level

The compression level for output FlowFiles for supported formats. A lower value results in faster processing but less compression; a value of 0 indicates no compression (that is, simple archiving) for gzip, or minimal compression for xz-lzma2. Higher levels can mean much larger memory usage, as is the case with levels 7-9 for xz-lzma2, so be careful relative to heap size.

Output Filename Strategy

Processing strategy for filename attribute on output FlowFiles

Relationships

  • success: FlowFiles will be transferred to the success relationship on compression modification success

  • failure: FlowFiles will be transferred to the failure relationship on compression modification errors

Reads Attributes

  • mime.type: If the Decompression Format is set to 'use mime.type attribute', this attribute is used to determine the decompression type. Otherwise, this attribute is ignored.

Writes Attributes

  • mime.type: The appropriate MIME Type is set based on the value of the Compression Format property. If the Compression Format is 'no compression' this attribute is removed as the MIME Type is no longer known.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • CPU: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.
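
As an illustration of the streaming recompression idea (not the processor's code), the following sketch converts gzip input to bzip2 output using only the Python standard library; the file paths are placeholders, and formats such as zstd or brotli would require additional libraries:

import bz2
import gzip
import shutil

def recompress_gzip_to_bzip2(src_path: str, dst_path: str, level: int = 9) -> None:
    """Stream-decompress a gzip file and recompress it as bzip2 without loading
    the whole payload into memory."""
    with gzip.open(src_path, "rb") as src, bz2.open(dst_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst, length=64 * 1024)   # 64 KiB chunks keep memory flat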

ModifyPDF

PDF forms can be filled using this processor. Fill a form field by creating a Dynamic Property with the form field’s name and setting the value to what the form field should be filled with. Form field names are written (and can be found) in the attribute 'pdf.forms'.

Tags: virtimo, pdf

Properties

Dynamic Properties

The name of a PDF form field

The form fields will be filled with the respective values

Relationships

  • success: Successfully converted content.

  • failure: Failed to convert content.

Writes Attributes

  • pdf.forms: The names of the form fields contained in the PDF.

Input Requirement

This component requires an incoming relationship.

MonitorActivity

Monitors the flow for activity and sends out an indicator when the flow has not had any data for some specified amount of time and again when the flow’s activity is restored

Tags: monitor, flow, active, inactive, activity, detection

Properties

Threshold Duration

Determines how much time must elapse before considering the flow to be inactive

Continually Send Messages

If true, will send inactivity indicator continually every Threshold Duration amount of time until activity is restored; if false, will send an indicator only when the flow first becomes inactive

Inactivity Message

The message that will be the content of FlowFiles that are sent to the 'inactive' relationship

Activity Restored Message

The message that will be the content of FlowFiles that are sent to 'activity.restored' relationship

Wait for Activity

If set to true, when the processor is started or restarted it will only send an inactive indicator if there has been activity beforehand. Otherwise, it will send an inactive indicator even if there has not been any activity beforehand.

Reset State on Restart

When the processor gets started or restarted, if set to true, the initial state will always be active. Otherwise, the last reported flow state will be preserved.

Copy Attributes

If true, will copy all flow file attributes from the flow file that resumed activity to the newly created indicator flow file

Monitoring Scope

Specify how to determine the activeness of the flow. 'node' means that activeness is examined on each individual node separately. This can be useful if the DFM expects each node to receive flow files in a distributed manner. With 'cluster', the flow is considered active while at least one node is actively receiving flow files. If NiFi is running in standalone mode, this should be set to 'node'; if it is set to 'cluster', NiFi logs a warning message and acts with 'node' scope.

Reporting Node

Specify which node should send notification flow files to the inactive and activity.restored relationships. With 'all', every node in the cluster sends notification flow files. 'primary' means flow files will be sent only from the primary node. If NiFi is running in standalone mode, this should be set to 'all'; even if it is set to 'primary', NiFi acts as 'all'.

Relationships

  • success: All incoming FlowFiles are routed to success

  • activity.restored: This relationship is used to transfer an Activity Restored indicator when FlowFiles are routing to 'success' following a period of inactivity

  • inactive: This relationship is used to transfer an Inactivity indicator when no FlowFiles are routed to 'success' for Threshold Duration amount of time

Writes Attributes

  • inactivityStartMillis: The time at which Inactivity began, in the form of milliseconds since Epoch

  • inactivityDurationMillis: The number of milliseconds that the inactivity has spanned

Stateful

Scope: Cluster, Local

MonitorActivity stores the last activity timestamp on each node as state, so that it can examine activity cluster-wide. If 'Copy Attributes' is set to true, then flow file attributes are also persisted. In local scope, it stores the last known activity timestamp if the flow is inactive.

Input Requirement

This component requires an incoming relationship.
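
The inactivity detection itself reduces to a timestamp comparison; the following is a simplified, single-node sketch of the Threshold Duration check (names are illustrative):

import time

def flow_state(last_activity_epoch, threshold_seconds, now=None):
    """Return 'inactive' once no data has been seen for the Threshold Duration,
    otherwise 'active'."""
    now = time.time() if now is None else now
    return "inactive" if now - last_activity_epoch >= threshold_seconds else "active"

# 5 minute threshold, last FlowFile seen 6 minutes ago -> 'inactive'
print(flow_state(time.time() - 360, threshold_seconds=300))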

MoveAzureDataLakeStorage

Moves content within an Azure Data Lake Storage Gen 2. After the move, files will no longer be available at the source location.

Tags: azure, microsoft, cloud, storage, adlsgen2, datalake

Properties

ADLS Credentials

Controller Service used to obtain Azure Credentials.

Source Filesystem

Name of the Azure Storage File System from where the move should happen.

Source Directory

Name of the Azure Storage Directory from where the move should happen. The Directory Name cannot contain a leading '/'. The root directory can be designated by the empty string value.

Destination Filesystem

Name of the Azure Storage File System where the files will be moved.

Destination Directory

Name of the Azure Storage Directory where the files will be moved. The Directory Name cannot contain a leading '/'. The root directory can be designated by the empty string value. Non-existing directories will be created. If the original directory structure should be kept, the full directory path needs to be provided after the destination directory. e.g.: destdir/${azure.directory}

File Name

The filename

Conflict Resolution Strategy

Indicates what should happen when a file with the same name already exists in the output directory

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Relationships

  • success: Files that have been successfully written to Azure storage are transferred to this relationship

  • failure: Files that could not be written to Azure storage for some reason are transferred to this relationship

Writes Attributes

  • azure.source.filesystem: The name of the source Azure File System

  • azure.source.directory: The name of the source Azure Directory

  • azure.filesystem: The name of the Azure File System

  • azure.directory: The name of the Azure Directory

  • azure.filename: The name of the Azure File

  • azure.primaryUri: Primary location for file content

  • azure.length: The length of the Azure File

Input Requirement

This component requires an incoming relationship.

MoveHDFS

Rename existing files or a directory of files (non-recursive) on Hadoop Distributed File System (HDFS).

Tags: hadoop, HCFS, HDFS, put, move, filesystem, moveHDFS

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Conflict Resolution Strategy

Indicates what should happen when a file with the same name already exists in the output directory

Input Directory or File

The HDFS directory from which files should be read, or a single file to read.

Output Directory

The HDFS directory where the files will be moved to

HDFS Operation

The operation that will be performed on the source file

File Filter Regex

A Java Regular Expression for filtering Filenames; if a filter is supplied then only files whose names match that Regular Expression will be fetched, otherwise all files will be fetched

Ignore Dotted Files

If true, files whose names begin with a dot (".") will be ignored

Remote Owner

Changes the owner of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change owner

Remote Group

Changes the group of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change group

Relationships

  • success: Files that have been successfully renamed on HDFS are transferred to this relationship

  • failure: Files that could not be renamed on HDFS are transferred to this relationship

Reads Attributes

  • filename: The name of the file written to HDFS comes from the value of this attribute.

Writes Attributes

  • filename: The name of the file written to HDFS is stored in this attribute.

  • absolute.hdfs.path: The absolute path to the file on HDFS is stored in this attribute.

  • hadoop.file.url: The hadoop url for the file is stored in this attribute.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component allows an incoming relationship.

Notify

Caches a release signal identifier in the distributed cache, optionally along with the FlowFile’s attributes. Any flow files held at a corresponding Wait processor will be released once this signal in the cache is discovered.

Tags: map, cache, notify, distributed, signal, release

Properties

Release Signal Identifier

A value, or the results of an Attribute Expression Language statement, which will be evaluated against a FlowFile in order to determine the release signal cache key

Signal Counter Name

A value, or the results of an Attribute Expression Language statement, which will be evaluated against a FlowFile in order to determine the signal counter name. Signal counter name is useful when a corresponding Wait processor needs to know the number of occurrences of different types of events, such as success or failure, or destination data source names, etc.

Signal Counter Delta

A value, or the results of an Attribute Expression Language statement, which will be evaluated against a FlowFile in order to determine the signal counter delta. Specify how much the counter should increase. For example, if multiple signal events are processed in an upstream flow in a batch-oriented way, the number of events processed can be notified with this property at once. Zero (0) has a special meaning: it clears the target count back to 0, which is especially useful when used with the Wait processor’s Releasable FlowFile Count = Zero (0) mode, to provide an 'open-close-gate' type of flow control. One (1) can open a corresponding Wait processor, and Zero (0) can negate it, as if closing a gate.

Signal Buffer Count

Specify the maximum number of incoming flow files that can be buffered until signals are notified to the cache service. A larger buffer can provide better performance, as it reduces the number of interactions with the cache service by grouping signals by signal identifier when multiple incoming flow files share the same signal identifier.

Distributed Cache Service

The Controller Service that is used to cache release signals in order to release files queued at a corresponding Wait processor

Attribute Cache Regex

Any attributes whose names match this regex will be stored in the distributed cache to be copied to any FlowFiles released from a corresponding Wait processor. Note that the uuid attribute will not be cached regardless of this value. If blank, no attributes will be cached.

Relationships

  • success: All FlowFiles where the release signal has been successfully entered in the cache will be routed to this relationship

  • failure: When the cache cannot be reached, or if the Release Signal Identifier evaluates to null or empty, FlowFiles will be routed to this relationship

Writes Attributes

  • notified: All FlowFiles will have an attribute 'notified'. The value of this attribute is true if the FlowFile is notified, otherwise false.

Input Requirement

This component requires an incoming relationship.

PackageFlowFile

This processor will package FlowFile attributes and content into an output FlowFile that can be exported from NiFi and imported back into NiFi, preserving the original attributes and content.

Multi-Processor Use Cases

Send FlowFile content and attributes from one NiFi instance to another NiFi instance.

Notes: A Remote Process Group is preferred to send FlowFiles between two NiFi instances, but an alternative is to use PackageFlowFile then InvokeHTTP sending to ListenHTTP.

Keywords: flowfile, attributes, content, ffv3, flowfile-stream-v3, transfer

PackageFlowFile:

  1. "Maximum Batch Size" > 1 can help improve performance by batching many flowfiles together into 1 larger file that is transmitted by InvokeHTTP. .

  2. Connect the success relationship of PackageFlowFile to the input of InvokeHTTP.

ListenHTTP:

  1. "Listening Port" = a unique port number. .

  2. ListenHTTP automatically unpacks files that have attribute mime.type=application/flowfile-v3.

  3. If PackageFlowFile batches 99 FlowFiles into 1 file that InvokeHTTP sends, then the original 99 FlowFiles will be output by ListenHTTP.

InvokeHTTP:

  1. "HTTP Method" = "POST" to send data to ListenHTTP.

  2. "HTTP URL" should include the hostname, port, and path to the ListenHTTP.

  3. "Request Content-Type" = "${mime.type}" because PackageFlowFile output files have attribute mime.type=application/flowfile-v3. .

Export FlowFile content and attributes from NiFi to external storage and reimport.

Keywords: flowfile, attributes, content, ffv3, flowfile-stream-v3, offline, storage

PackageFlowFile:

  1. "Maximum Batch Size" > 1 can improve storage efficiency by batching many FlowFiles together into 1 larger file that is stored. .

  2. Connect the success relationship to the input of any NiFi egress processor for offline storage.

UnpackContent:

  1. "Packaging Format" = "application/flowfile-v3". .

  2. Connect the output of a NiFi ingress processor that reads files stored offline to the input of UnpackContent.

  3. If PackageFlowFile batches 99 FlowFiles into 1 file that is read from storage, then the original 99 FlowFiles will be output by UnpackContent.

Tags: flowfile, flowfile-stream, flowfile-stream-v3, package, attributes

Properties

Maximum Batch Size

Maximum number of FlowFiles to package into one output FlowFile using a best-effort, non-guaranteed approach. Multiple input queues can produce unexpected batching behavior.

Relationships

  • success: The packaged FlowFile is sent to this relationship

  • original: The FlowFiles that were used to create the package are sent to this relationship

Writes Attributes

  • mime.type: The mime.type will be changed to application/flowfile-v3

Input Requirement

This component requires an incoming relationship.

PaginatedJsonQueryElasticsearch

A processor that allows the user to run a paginated query (with aggregations) written with the Elasticsearch JSON DSL. It will use the flowfile’s content for the query unless the QUERY attribute is populated. Search After/Point in Time queries must include a valid "sort" field.

Tags: elasticsearch, elasticsearch5, elasticsearch6, elasticsearch7, elasticsearch8, query, scroll, page, read, json

Properties

Query Definition Style

How the JSON Query will be defined for use by the processor.

Query

A query in JSON syntax, not Lucene syntax. Ex: {"query":{"match":{"somefield":"somevalue"}}}. If this parameter is not set, the query will be read from the flowfile content. If the query (property and flowfile content) is empty, a default empty JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Query Clause

A "query" clause in JSON syntax, not Lucene syntax. Ex: {"match":{"somefield":"somevalue"}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Size

The maximum number of documents to retrieve in the query. If the query is paginated, this "size" applies to each page of the query, not the "size" of the entire result set.

Sort

Sort results by one or more fields, in JSON syntax. Ex: [{"price" : {"order" : "asc", "mode" : "avg"}}, {"post_date" : {"format": "strict_date_optional_time_nanos"}}]

Aggregations

One or more query aggregations (or "aggs"), in JSON syntax. Ex: {"items": {"terms": {"field": "product", "size": 10}}}

Fields

Fields of indexed documents to be retrieved, in JSON syntax. Ex: ["user.id", "http.response.*", {"field": "@timestamp", "format": "epoch_millis"}]

Script Fields

Fields to be created using script evaluation at query runtime, in JSON syntax. Ex: {"test1": {"script": {"lang": "painless", "source": "doc['price'].value * 2"}}, "test2": {"script": {"lang": "painless", "source": "doc['price'].value * params.factor", "params": {"factor": 2.0}}}}

Query Attribute

If set, the executed query will be set on each result flowfile in the specified attribute.

Index

The name of the index to use.

Type

The type of this document (used by Elasticsearch for indexing and searching).

Max JSON Field String Length

The maximum allowed length of a string value when parsing a JSON document or attribute.

Client Service

An Elasticsearch client service to use for running queries.

Search Results Split

Output a flowfile containing all hits or one flowfile for each individual hit or one flowfile containing all hits from all paged responses.

Search Results Format

Format of Hits output.

Aggregation Results Split

Output a flowfile containing all aggregations or one flowfile for each individual aggregation.

Aggregation Results Format

Format of Aggregation output.

Output No Hits

Output a "hits" flowfile even if no hits found for query. If true, an empty "hits" flowfile will be output even if "aggregations" are output.

Pagination Type

Pagination method to use. Not all types are available for all Elasticsearch versions, check the Elasticsearch docs to confirm which are applicable and recommended for your service.

Pagination Keep Alive

Pagination "keep_alive" period. Period Elasticsearch will keep the scroll/pit cursor alive in between requests (this is not the time expected for all pages to be returned, but the maximum allowed time for requests between page retrievals).

Dynamic Properties

The name of a URL query parameter to add

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the query request body. For SCROLL type queries, these parameters are only used in the initial (first page) query as the Elasticsearch Scroll API does not support the same query parameters for subsequent pages of data.

Relationships

  • failure: All flowfiles that fail for reasons unrelated to server availability go to this relationship.

  • aggregations: Aggregations are routed to this relationship.

  • hits: Search hits are routed to this relationship.

  • original: All original flowfiles that don’t cause an error to occur go to this relationship.

Writes Attributes

  • mime.type: application/json

  • aggregation.name: The name of the aggregation whose results are in the output flowfile

  • aggregation.number: The number of the aggregation whose results are in the output flowfile

  • page.number: The number of the page (request), starting from 1, in which the results were returned that are in the output flowfile

  • hit.count: The number of hits that are in the output flowfile

  • elasticsearch.query.error: The error message provided by Elasticsearch if there is an error querying the index.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: Care should be taken on the size of each page because each response from Elasticsearch will be loaded into memory all at once and converted into the resulting flowfiles.

Additional Details

This processor is intended for use with the Elasticsearch JSON DSL and Elasticsearch 5.X and newer. It is designed to be able to take a JSON query (e.g. from Kibana) and execute it as-is against an Elasticsearch cluster in a paginated manner. Like all processors in the “restapi” bundle, it uses the official Elastic client APIs, so it supports leader detection.

The query JSON to execute can be provided either in the Query configuration property or in the content of the flowfile. If the Query Attribute property is configured, the executed query JSON will be placed in the attribute provided by this property.

The query is paginated in Elasticsearch using one of the available methods - “Scroll” or “Search After” (optionally with a “Point in Time” for Elasticsearch 7.10+ with XPack enabled). The number of results per page can be controlled using the size parameter in the Query JSON. For Search After functionality, a sort parameter must be present within the Query JSON.

Search results and aggregation results can be split up into multiple flowfiles. Aggregation results will only be split at the top level because nested aggregations lose their context (and thus lose their value) if separated from their parent aggregation. Additionally, the results from all pages can be combined into a single flowfile (but the processor will only load each page of data into memory at any one time).

The following is an example query that would be accepted:

{
  "size": 10000,
  "sort": {
    "product": "desc"
  },
  "query": {
    "match": {
      "restaurant.keyword": "Local Pizzaz FTW Inc"
    }
  },
  "aggs": {
    "weekly_sales": {
      "date_histogram": {
        "field": "date",
        "interval": "week"
      },
      "aggs": {
        "items": {
          "terms": {
            "field": "product",
            "size": 10
          }
        }
      }
    }
  }
}
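
For comparison, the Search After pagination that this processor automates looks roughly like the following sketch against the plain Elasticsearch REST API (host, index and field names are placeholders; in practice a unique tie-breaker field should be part of the sort):

import requests

SEARCH_URL = "http://localhost:9200/sales/_search"   # placeholder host and index

body = {
    "size": 1000,                                     # page size
    "sort": [{"product": "desc"}],                    # a sort is required for search_after
    "query": {"match": {"restaurant.keyword": "Local Pizzaz FTW Inc"}},
}

while True:
    hits = requests.post(SEARCH_URL, json=body).json()["hits"]["hits"]
    if not hits:
        break                                         # no further pages
    for hit in hits:
        print(hit["_id"])                             # placeholder for real handling
    body["search_after"] = hits[-1]["sort"]           # resume after the last sort values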

ParseDocument

Parses incoming unstructured text documents. The output is formatted as "json-lines" with two keys: 'text' and 'metadata'.

Tags: text, embeddings, vector, machine learning, ML, artificial intelligence, ai, document, langchain, html, markdown

Properties

Input Format

The format of the input FlowFile. This dictates which TextLoader will be used to parse the input. Note that in order to process images or extract tables from PDF files, you must have both 'poppler' and 'tesseract' installed on your system.

Element Strategy

Specifies whether the input should be loaded as a single Document, or if each element in the input should be separated out into its own Document

Include Page Breaks

Specifies whether or not page breaks should be considered when creating Documents from the input

Metadata Fields

A comma-separated list of FlowFile attributes that will be added to the Documents' Metadata

Include Extracted Metadata

Whether or not to include the metadata that is extracted from the input in each of the Documents

ParseEvtx

Parses the contents of a Windows Event Log file (evtx) and writes the resulting XML to the FlowFile

Tags: logs, windows, event, evtx, message, file

Properties

Granularity

Output flow file for each Record, Chunk, or File encountered in the event log

Relationships

  • success: Any FlowFile that was successfully converted from evtx to XML

  • failure: Any FlowFile that encountered an exception during conversion will be transferred to this relationship with as much parsing as possible done

  • bad chunk: Any bad chunks of records will be transferred to this relationship in their original binary form

  • original: The unmodified input FlowFile will be transferred to this relationship

Reads Attributes

  • filename: The filename of the evtx file

Writes Attributes

  • filename: The output filename

  • mime.type: The output filetype (application/xml for success and failure relationships, original value for bad chunk and original relationships)

Input Requirement

This component requires an incoming relationship.

Additional Details

Description:

This processor is used to parse Windows event logs in the binary evtx format. The input flow files’ content should be evtx files. The processor has 4 outputs:

  • The original unmodified FlowFile

  • The XML resulting from parsing at the configured granularity

  • Failed parsing with partial output

  • Malformed chunk in binary form

Output XML Example:
<?xml version="1.0"?>
<Events>
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
        <System>
            <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}" Ev
                      entSourceName="Service Control Manager"/>
            <EventID Qualifiers="16384">7036</EventID>
            <Version>0</Version>
            <Level>4</Level>
            <Task>0</Task>
            <Opcode>0</Opcode>
            <Keywords>0x8080000000000000</Keywords>
            <TimeCreated SystemTime="2016-01-08 16:49:47.518"/>
            <EventRecordID>780</EventRecordID>
            <Correlation ActivityID="" RelatedActivityID=""/>
            <Execution ProcessID="480" ThreadID="596"/>
            <Channel>System</Channel>
            <Computer>win7-pro-vm</Computer>
            <Security UserID=""/>
        </System>
        <EventData>
            <Data Name="param1">Workstation</Data>
            <Data Name="param2">running</Data>
            <Binary>TABhAG4AbQBhAG4AVwBvAHIAawBzAHQAYQB0AGkAbwBuAC8ANAAAAA==</Binary>
        </EventData>
    </Event>
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
        <System>
            <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}"
                      EventSourceName="Service Control Manager"/>
            <EventID Qualifiers="16384">7036</EventID>
            <Version>0</Version>
            <Level>4</Level>
            <Task>0</Task>
            <Opcode>0</Opcode>
            <Keywords>0x8080000000000000</Keywords>
            <TimeCreated SystemTime="2016-01-08 16:49:47.535"/>
            <EventRecordID>781</EventRecordID>
            <Correlation ActivityID="" RelatedActivityID=""/>
            <Execution ProcessID="480" ThreadID="576"/>
            <Channel>System</Channel>
            <Computer>win7-pro-vm</Computer>
            <Security UserID=""/>
        </System>
        <EventData>
            <Data Name="param1">Cryptographic Services</Data>
            <Data Name="param2">running</Data>
            <Binary>QwByAHkAcAB0AFMAdgBjAC8ANAAAAA==</Binary>
        </EventData>
    </Event>
</Events>

ParseNetflowv5

Parses netflowv5 byte ingest and adds it to the NiFi flowfile as attributes or JSON content.

Tags: network, netflow, attributes, datagram, v5, packet, byte

Properties

Parsed fields destination

Indicates whether the results of the parser are written to the FlowFile content or to FlowFile attributes; if using flowfile-attribute, fields will be populated as attributes. If set to flowfile-content, the netflowv5 fields will be converted into a flat JSON object.

Relationships

  • success: Any FlowFile that is successfully parsed as netflowv5 data will be transferred to this Relationship.

  • failure: Any FlowFile that could not be parsed as a netflowv5 message will be transferred to this Relationship without any attributes being added

  • original: The original raw content

Reads Attributes

  • udp.port: Optionally read if packets are received from UDP datagrams.

Writes Attributes

  • netflowv5.header.*: The key and value generated by the parsing of the header fields.

  • netflowv5.record.*: The key and value generated by the parsing of the record fields.

Input Requirement

This component requires an incoming relationship.

Additional Details

The Netflowv5Parser processor parses the ingress netflowv5 datagram format and transfers it as either flowfile attributes or a JSON object. The netflowv5 format has a predefined schema, named “template”, for parsing netflowv5 records. More information: RFC-netflowv5 (https://www.cisco.com/c/en/us/td/docs/net_mgmt/netflow_collection_engine/3-6/user/guide/format.html)

Netflowv5 JSON Output Schema
{
  "port": int,
  "format": string,
  "header": {
    "version": int,
    "count": int,
    "sys_uptime": long,
    "unix_secs": long,
    "unix_nsecs": long,
    "flow_sequence": long,
    "engine_type": short,
    "engine_id": short,
    "sampling_interval": int
  },
  "record": {
    "srcaddr": string,
    "dstaddr": string,
    "nexthop": string,
    "input": int,
    "output": int,
    "dPkts": long,
    "dOctets": long,
    "first": long,
    "last": long,
    "srcport": int,
    "dstport": int,
    "pad1": short,
    "tcp_flags": short,
    "prot": short,
    "tos": short,
    "src_as": int,
    "dst_as": int,
    "src_mask": short,
    "dst_mask": short,
    "pad2": int
  }
}
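
As a point of reference, the fixed 24-byte NetFlow v5 header can be unpacked with Python's struct module into the same field names used in the schema above (a standalone sketch, not the processor's parser):

import struct

HEADER_FORMAT = "!HHIIIIBBH"          # network byte order, 24 bytes in total
HEADER_FIELDS = ("version", "count", "sys_uptime", "unix_secs", "unix_nsecs",
                 "flow_sequence", "engine_type", "engine_id", "sampling_interval")

def parse_netflowv5_header(datagram: bytes) -> dict:
    values = struct.unpack(HEADER_FORMAT, datagram[:struct.calcsize(HEADER_FORMAT)])
    return dict(zip(HEADER_FIELDS, values))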

ParseSyslog

Attempts to parse the contents of a Syslog message in accordance with the RFC5424 and RFC3164 formats and adds attributes to the FlowFile for each of the parts of the Syslog message. Note: Be mindful that RFC3164 is informational and a wide range of different implementations are present in the wild. If messages fail parsing, consider using RFC5424 or a generic parsing processor such as ExtractGrok.

Tags: logs, syslog, attributes, system, event, message

Properties

Character Set

Specifies the character set of the Syslog messages

Relationships

  • success: Any FlowFile that is successfully parsed as a Syslog message will be transferred to this Relationship.

  • failure: Any FlowFile that could not be parsed as a Syslog message will be transferred to this Relationship without any attributes being added

Writes Attributes

  • syslog.priority: The priority of the Syslog message.

  • syslog.severity: The severity of the Syslog message derived from the priority.

  • syslog.facility: The facility of the Syslog message derived from the priority.

  • syslog.version: The optional version from the Syslog message.

  • syslog.timestamp: The timestamp of the Syslog message.

  • syslog.hostname: The hostname or IP address of the Syslog message.

  • syslog.sender: The hostname of the Syslog server that sent the message.

  • syslog.body: The body of the Syslog message, everything after the hostname.

Input Requirement

This component requires an incoming relationship.

ParseSyslog5424

Attempts to parse the contents of a well-formed Syslog message in accordance with the RFC5424 format and adds attributes to the FlowFile for each of the parts of the Syslog message, including Structured Data. Structured Data will be written to attributes as one attribute per item id + parameter; see https://tools.ietf.org/html/rfc5424. Note: ParseSyslog5424 follows the specification more closely than ParseSyslog. If your Syslog producer does not follow the spec closely, with regards to using '-' for missing header entries for example, those logs will fail with this parser, where they would not fail with ParseSyslog.

Tags: logs, syslog, syslog5424, attributes, system, event, message

Properties

Character Set

Specifies the character set of the Syslog messages

NIL Policy

Defines how NIL values are handled for header fields.

Include Message Body in Attributes

If true, then the Syslog Message body will be included in the attributes.

Relationships

  • success: Any FlowFile that is successfully parsed as a Syslog message will be transferred to this Relationship.

  • failure: Any FlowFile that could not be parsed as a Syslog message will be transferred to this Relationship without any attributes being added

Writes Attributes

  • syslog.priority: The priority of the Syslog message.

  • syslog.severity: The severity of the Syslog message derived from the priority.

  • syslog.facility: The facility of the Syslog message derived from the priority.

  • syslog.version: The optional version from the Syslog message.

  • syslog.timestamp: The timestamp of the Syslog message.

  • syslog.hostname: The hostname or IP address of the Syslog message.

  • syslog.appname: The appname of the Syslog message.

  • syslog.procid: The procid of the Syslog message.

  • syslog.messageid: The messageid of the Syslog message.

  • syslog.structuredData: Multiple entries per structuredData of the Syslog message.

  • syslog.sender: The hostname of the Syslog server that sent the message.

  • syslog.body: The body of the Syslog message, everything after the hostname.

Input Requirement

This component requires an incoming relationship.

PartitionRecord

Splits, or partitions, record-oriented data based on the configured fields in the data. One or more properties must be added. The name of the property is the name of an attribute to add. The value of the property is a RecordPath to evaluate against each Record. Two records will go to the same outbound FlowFile only if they have the same value for each of the given RecordPaths. Because we know that all records in a given output FlowFile have the same value for the fields that are specified by the RecordPath, an attribute is added for each field. See Additional Details on the Usage page for more information and examples.

Use Cases

Separate records into separate FlowFiles so that all of the records in a FlowFile have the same value for a given field or set of fields.

Keywords: separate, split, partition, break apart, colocate, segregate, record, field, recordpath

Input Requirement: This component allows an incoming relationship.

  1. Choose a RecordReader that is appropriate based on the format of the incoming data.

  2. Choose a RecordWriter that writes the data in the desired output format.

  3. Add a single additional property. The name of the property should describe the type of data that is being used to partition the data. The property’s value should be a RecordPath that specifies which output FlowFile the Record belongs to.

  4. For example, if we want to separate records based on their transactionType field, we could add a new property named transactionType. The value of the property might be /transaction/type. An input FlowFile will then be separated into as few FlowFiles as possible such that each output FlowFile has the same value for the transactionType field.

Separate records based on whether or not they adhere to a specific criteria

Keywords: separate, split, partition, break apart, segregate, record, field, recordpath, criteria

Input Requirement: This component allows an incoming relationship.

  1. Choose a RecordReader that is appropriate based on the format of the incoming data.

  2. Choose a RecordWriter that writes the data in the desired output format.

  3. Add a single additional property. The name of the property should describe the criteria. The property’s value should be a RecordPath that returns true if the Record meets the criteria or false otherwise.

  4. For example, if we want to separate records based on whether or not they have a transaction total of more than $1,000, we could add a new property named largeTransaction with a value of /transaction/total > 1000. This will create two FlowFiles. In the first, all records will have a transaction total over 1000. In the second, all records will have a transaction total of 1000 or less. Each FlowFile will have an attribute named largeTransaction with a value of true or false.

Tags: record, partition, recordpath, rpath, segment, split, group, bin, organize

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Dynamic Properties

The name given to the dynamic property is the name of the attribute that will be used to denote the value of the associated RecordPath.

Each dynamic property represents a RecordPath that will be evaluated against each record in an incoming FlowFile. When the value of the RecordPath is determined for a Record, an attribute is added to the outgoing FlowFile. The name of the attribute is the same as the name of this property. The value of the attribute is the same as the value of the field in the Record that the RecordPath points to. Note that no attribute will be added if the value returned for the RecordPath is null or is not a scalar value (i.e., the value is an Array, Map, or Record).

Relationships

  • success: FlowFiles that are successfully partitioned will be routed to this relationship

  • failure: If a FlowFile cannot be partitioned from the configured input format to the configured output format, the unchanged FlowFile will be routed to this relationship

  • original: Once all records in an incoming FlowFile have been partitioned, the original FlowFile is routed to this relationship.

Writes Attributes

  • record.count: The number of records in an outgoing FlowFile

  • mime.type: The MIME Type that the configured Record Writer indicates is appropriate

  • fragment.identifier: All partitioned FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the partitioned FlowFiles that were created from a single parent FlowFile

  • fragment.count: The number of partitioned FlowFiles generated from the parent FlowFile

  • segment.original.filename: The filename of the parent FlowFile

  • <dynamic property name>: For each dynamic property that is added, an attribute may be added to the FlowFile. See the description for Dynamic Properties for more information.

Input Requirement

This component requires an incoming relationship.

Additional Details

PartitionRecord allows the user to separate out records in a FlowFile such that each outgoing FlowFile consists only of records that are “alike.” To define what it means for two records to be alike, the Processor makes use of NiFi’s RecordPath DSL.

In order to make the Processor valid, at least one user-defined property must be added to the Processor. The value of the property must be a valid RecordPath. Expression Language is supported and will be evaluated before attempting to compile the RecordPath. However, if Expression Language is used, the Processor is not able to validate the RecordPath beforehand and may result in having FlowFiles fail processing if the RecordPath is not valid when being used.

Once one or more RecordPaths have been added, those RecordPaths are evaluated against each Record in an incoming FlowFile. In order for Record A and Record B to be considered “like records,” both of them must have the same value for all RecordPaths that are configured. Only the values that are returned by the RecordPath are held in Java’s heap. The records themselves are written immediately to the FlowFile content. This means that for most cases, heap usage is not a concern. However, if the RecordPath points to a large Record field that is different for each record in a FlowFile, then heap usage may be an important consideration. In such cases, SplitRecord may be useful to split a large FlowFile into smaller FlowFiles before partitioning.

Once a FlowFile has been written, we know that all the Records within that FlowFile have the same value for the fields that are described by the configured RecordPaths. As a result, this means that we can promote those values to FlowFile Attributes. We do so by looking at the name of the property to which each RecordPath belongs. For example, if we have a property named country with a value of /geo/country/name, then each outbound FlowFile will have an attribute named country with the value of the /geo/country/name field. The addition of these attributes makes it very easy to perform tasks such as routing, or referencing the value in another Processor that can be used for configuring where to send the data, etc. However, for any RecordPath whose value is not a scalar value (i.e., the value is of type Array, Map, or Record), no attribute will be added.

Examples

To better understand how this Processor works, we will lay out a few examples. For the sake of these examples, let’s assume that our input data is JSON formatted and looks like this:

[
  {
    "name": "John Doe",
    "dob": "11/30/1976",
    "favorites": [
      "spaghetti",
      "basketball",
      "blue"
    ],
    "locations": {
      "home": {
        "number": 123,
        "street": "My Street",
        "city": "New York",
        "state": "NY",
        "country": "US"
      },
      "work": {
        "number": 321,
        "street": "Your Street",
        "city": "New York",
        "state": "NY",
        "country": "US"
      }
    }
  },
  {
    "name": "Jane Doe",
    "dob": "10/04/1979",
    "favorites": [
      "spaghetti",
      "football",
      "red"
    ],
    "locations": {
      "home": {
        "number": 123,
        "street": "My Street",
        "city": "New York",
        "state": "NY",
        "country": "US"
      },
      "work": {
        "number": 456,
        "street": "Our Street",
        "city": "New York",
        "state": "NY",
        "country": "US"
      }
    }
  },
  {
    "name": "Jacob Doe",
    "dob": "04/02/2012",
    "favorites": [
      "chocolate",
      "running",
      "yellow"
    ],
    "locations": {
      "home": {
        "number": 123,
        "street": "My Street",
        "city": "New York",
        "state": "NY",
        "country": "US"
      },
      "work": null
    }
  },
  {
    "name": "Janet Doe",
    "dob": "02/14/2007",
    "favorites": [
      "spaghetti",
      "reading",
      "white"
    ],
    "locations": {
      "home": {
        "number": 1111,
        "street": "Far Away",
        "city": "San Francisco",
        "state": "CA",
        "country": "US"
      },
      "work": null
    }
  }
]
Example 1 - Partition By Simple Field

For a simple case, let’s partition all the records based on the state that they live in. We can add a property named state with a value of /locations/home/state. The result will be that we will have two outbound FlowFiles. The first will contain an attribute with the name state and a value of NY. This FlowFile will consist of 3 records: John Doe, Jane Doe, and Jacob Doe. The second FlowFile will consist of a single record for Janet Doe and will contain an attribute named state that has a value of CA.
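
Outside NiFi, the grouping behavior in Example 1 can be sketched in a few lines of Python. This is a conceptual illustration only: the file name is hypothetical, the helper function stands in for the RecordPath /locations/home/state, and NiFi itself streams records into FlowFile content rather than holding them in in-memory lists.

import json
from collections import defaultdict

# Hypothetical file containing the JSON array shown above
with open("people.json") as f:
    records = json.load(f)

def home_state(record):
    # Rough stand-in for the RecordPath /locations/home/state
    return (record.get("locations") or {}).get("home", {}).get("state")

partitions = defaultdict(list)
for record in records:
    partitions[home_state(record)].append(record)

# Each bucket corresponds to one outbound FlowFile carrying a 'state' attribute
for state, group in partitions.items():
    print(state, [r["name"] for r in group])
# NY ['John Doe', 'Jane Doe', 'Jacob Doe']
# CA ['Janet Doe']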

Example 2 - Partition By Nullable Value

In the above example, there are three different values for the work location. If we use a RecordPath of /locations/work/state with a property name of state, then we will end up with two different FlowFiles. The first will contain records for John Doe and Jane Doe because they have the same value for the given RecordPath. This FlowFile will have an attribute named state with a value of NY.

The second FlowFile will contain the two records for Jacob Doe and Janet Doe, because the RecordPath will evaluate to null for both of them. This FlowFile will have no state attribute (unless such an attribute existed on the incoming FlowFile, in which case its value will be unaltered).

Example 3 - Partition By Multiple Values

Now let’s say that we want to partition records based on multiple different fields. We now add two properties to the PartitionRecord processor. The first property is named home and has a value of /locations/home. The second property is named favorite.food and has a value of /favorites[0] to reference the first element in the “favorites” array.

This will result in three different FlowFiles being created. The first FlowFile will contain records for John Doe and Jane Doe. It will contain an attribute named “favorite.food” with a value of “spaghetti.” However, because the home RecordPath points to a Record field rather than a scalar value, no “home” attribute will be added. In this case, both of these records have the same value for the first element of the “favorites” array and the same value for the home address. Janet Doe has the same value for the first element in the “favorites” array but has a different home address. Similarly, Jacob Doe has the same home address but a different value for the favorite food.

The second FlowFile will consist of a single record: Jacob Doe. This FlowFile will have an attribute named “favorite.food” with a value of “chocolate.” The third FlowFile will consist of a single record: Janet Doe. This FlowFile will have an attribute named “favorite.food” with a value of “spaghetti.”

PromptLLM

Submits a prompt to an LLM, writing the results either to a FlowFile attribute or to the contents of the FlowFile

Tags: text, chatgpt, gpt, gemini, machine learning, ML, artificial intelligence, ai, document, langchain

Properties

LLM Family

The name of the LLM family to be used (e.g. GPT, Gemini)

Google LLM name

The name of the Google (Gemini) model to use

OpenAI LLM name

The name of the OpenAI model to use

System Prompt

A system prompt is an initial instruction provided to a language model that defines its behavior, tone, and purpose during an interaction. It helps guide the model’s responses to align with specific goals or contexts. For example, a system prompt like "You are a helpful and friendly AI assistant. Your job is to provide clear, polite answers and assist with any questions" would direct the model to respond in a helpful and approachable manner. This may use FlowFile attributes via Expression Language and may also reference the FlowFile content by using the literal {flowfile_content} (including braces) in the prompt. If the FlowFile’s content is JSON formatted, a reference may also include JSONPath Expressions to reference specific fields in the FlowFile content, such as {$.page_content}. To use the literal expressions '{$variable}' or '{flowfile_content}', you must escape them like this: '{/$variable}' or '{/flowfile_content}'

User Prompt

A user prompt is a specific instruction or query provided to the LLM, usually by the user. For example, a user might ask an AI assistant "What is the best way to travel from London to Paris?" This may use FlowFile attributes via Expression Language and may also reference the FlowFile content by using the literal {flowfile_content} (including braces) in the prompt. If the FlowFile’s content is JSON formatted, a reference may also include JSONPath Expressions to reference specific fields in the FlowFile content, such as {$.page_content}. To use the literal expressions '{$variable}' or '{flowfile_content}', you must escape them like this: '{/$variable}' or '{/flowfile_content}'

Temperature

The Temperature parameter to submit to the LLM. A lower value will result in more consistent answers while a higher value will result in a more creative answer. The value must be between 0 and 2, inclusive.

Result Attribute

If specified, the result will be added to the attribute whose name is given. If not specified, the result will be written to the FlowFile’s content

API Key

The API Key to use

Request Timeout

The amount of time to wait before timing out the request

Max Tokens to Generate

The maximum number of tokens that the LLM should generate
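
For the GPT family, the properties above map onto an ordinary chat-completion request. The following is a rough standalone sketch using the openai Python package; it is an assumption made for illustration, not how the processor itself is implemented, and the model name, API key, and prompt text are example values only.

from openai import OpenAI

flowfile_content = "...contents of the incoming FlowFile..."   # placeholder text

client = OpenAI(api_key="YOUR_API_KEY", timeout=60.0)   # API Key, Request Timeout

response = client.chat.completions.create(
    model="gpt-4o-mini",   # OpenAI LLM name (example value)
    messages=[
        # System Prompt: sets behavior and tone
        {"role": "system", "content": "You are a helpful and friendly AI assistant."},
        # User Prompt, here with the FlowFile content substituted in
        {"role": "user", "content": "Summarize the following text:\n" + flowfile_content},
    ],
    temperature=0.7,   # Temperature: between 0 and 2, lower = more consistent
    max_tokens=256,    # Max Tokens to Generate
)

result = response.choices[0].message.content
# In NiFi this would be written to the Result Attribute or to the FlowFile content
print(result)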

PublishAMQP

Creates an AMQP Message from the contents of a FlowFile and sends the message to an AMQP Exchange. In a typical AMQP exchange model, the message that is sent to the AMQP Exchange will be routed based on the 'Routing Key' to its final destination in the queue (the binding). If due to some misconfiguration the binding between the Exchange, Routing Key and Queue is not set up, the message will have no final destination and will be returned (i.e., the data will not make it to the queue). If that happens, you will see a log message in both the app-log and a bulletin to that effect, and the FlowFile will be routed to the 'failure' relationship.

Tags: amqp, rabbit, put, message, send, publish

Properties

Exchange Name

The name of the AMQP Exchange the messages will be sent to. Usually provided by the AMQP administrator (e.g., 'amq.direct'). It is an optional property. If kept empty, the messages will be sent to the default AMQP exchange.

Routing Key

The name of the Routing Key that will be used by AMQP to route messages from the exchange to the destination queue(s). Usually provided by the administrator (e.g., 'myKey'). When messages are sent to the default exchange this property corresponds to the destination queue name; otherwise a binding from the Exchange to a Queue via the Routing Key must be set (usually by the AMQP administrator).

Headers Source

The source of the headers which will be applied to the published message.

Headers Pattern

Regular expression that will be evaluated against the FlowFile attributes to select the matching attributes and put as AMQP headers. Attribute name will be used as header key.

Header Separator

The character that is used to split the key and value for headers. The value must be exactly one character; otherwise an error will be raised.

Brokers

A comma-separated list of known AMQP Brokers in the format <host>:<port> (e.g., localhost:5672). If this is set, Host Name and Port are ignored. Only include hosts from the same AMQP cluster.

Host Name

Network address of AMQP broker (e.g., localhost). If Brokers is set, then this property is ignored.

Port

Numeric value identifying Port of AMQP broker (e.g., 5671). If Brokers is set, then this property is ignored.

Virtual Host

Virtual Host name which segregates AMQP system for enhanced security.

User Name

User Name used for authentication and authorization.

Password

Password used for authentication and authorization.

AMQP Version

AMQP Version. Currently only supports AMQP v0.9.1.

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Use Client Certificate Authentication

Authenticate using the SSL certificate rather than user name/password.

Relationships

  • success: All FlowFiles that are sent to the AMQP destination are routed to this relationship

  • failure: All FlowFiles that cannot be routed to the AMQP destination are routed to this relationship

Reads Attributes

  • amqp$appId: The App ID field to set on the AMQP Message

  • amqp$contentEncoding: The Content Encoding to set on the AMQP Message

  • amqp$contentType: The Content Type to set on the AMQP Message

  • amqp$headers: The headers to set on the AMQP Message, if 'Headers Source' is configured to use it. See the additional details of the processor.

  • amqp$deliveryMode: The numeric indicator for the Message’s Delivery Mode

  • amqp$priority: The Message priority

  • amqp$correlationId: The Message’s Correlation ID

  • amqp$replyTo: The value of the Message’s Reply-To field

  • amqp$expiration: The Message Expiration

  • amqp$messageId: The unique ID of the Message

  • amqp$timestamp: The timestamp of the Message, as the number of milliseconds since epoch

  • amqp$type: The type of message

  • amqp$userId: The ID of the user

  • amqp$clusterId: The ID of the AMQP Cluster

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in degraded performance.

Additional Details

Summary

This processor publishes the contents of the incoming FlowFile to an AMQP-based messaging system. At the time of writing this document the supported AMQP protocol version is v0.9.1.

The component is based on the RabbitMQ Client API. The following guide and tutorial may also help you brush up on some of the AMQP basics.

This processor does two things. It constructs an AMQP Message by extracting the FlowFile contents (both body and attributes). Once the message is constructed, it is sent to an AMQP Exchange. AMQP Properties will be extracted from the FlowFile and converted to com.rabbitmq.client.AMQP.BasicProperties to be sent along with the message. Upon success the incoming FlowFile is transferred to the success Relationship, and upon failure the FlowFile is penalized and transferred to the failure Relationship.

Where did my message go?

In a typical AMQP exchange model, the message that is sent to an AMQP Exchange will be routed, based on the Routing Key, to its final destination in the Queue; this is called a Binding. If, due to some misconfiguration, the binding between the Exchange, Routing Key and Queue is not set up, the message will have no final destination and will be returned (i.e., the data will not make it to the queue). If that happens you will see a log message in both the app-log and a bulletin to that effect. Fixing the binding (normally done by the AMQP administrator) will resolve the issue.

AMQP Properties

Attributes extracted from the FlowFile are considered candidates for AMQP properties if their names are prefixed with amqp$ (e.g., amqp$contentType=text/xml). To enrich a message with additional AMQP properties you may use an UpdateAttribute processor between the source processor and the PublishAMQP processor. The following is the list of available standard AMQP properties: “amqp$contentType”, “amqp$contentEncoding”, “amqp$headers” (if ‘Headers Source’ is set to ‘Attribute “amqp$headers” Value’), “amqp$deliveryMode”, “amqp$priority”, “amqp$correlationId”, “amqp$replyTo”, “amqp$expiration”, “amqp$messageId”, “amqp$timestamp”, “amqp$type”, “amqp$userId”, “amqp$appId”, “amqp$clusterId”.

AMQP Message Headers Source

The headers attached to the AMQP message by the processor depend on the “Headers Source” property value.

  1. Attribute “amqp$headers” Value - The processor will read the single attribute “amqp$headers”, split it based on the “Header Separator”, and then read the headers in key=value format.

  2. Attributes Matching Regex - The processor will pick FlowFile attributes that match the regex provided in “Attributes To Headers Regular Expression”. The name of the attribute is used as the header key.

Configuration Details

At the time of writing, this document only defines the essential configuration properties, which are suitable for most cases. Other properties will be defined later as this component progresses. Configuring PublishAMQP (a standalone client sketch follows this list):

  1. Exchange Name - [OPTIONAL] the name of the AMQP exchange the messages will be sent to. Usually provided by the administrator (e.g., ‘amq.direct’). It is an optional property. If kept empty, the messages will be sent to the default AMQP exchange. More on AMQP Exchanges can be found here.

  2. Routing Key - [REQUIRED] the name of the routing key that will be used by AMQP to route messages from the exchange to the destination queue(s). Usually provided by the administrator (e.g., ‘myKey’). When messages are sent to the default exchange this property corresponds to the destination queue name; otherwise a binding from the Exchange to a Queue via the Routing Key must be set (usually by the AMQP administrator). More on AMQP Exchanges and Bindings can be found here.

  3. Host Name - [REQUIRED] the name of the host where the AMQP broker is running. Usually provided by the administrator (e.g., ‘myhost.com’). Defaults to ‘localhost’.

  4. Port - [REQUIRED] the port number where the AMQP broker is running. Usually provided by the administrator (e.g., ‘2453’). Defaults to ‘5672’.

  5. User Name - [REQUIRED] user name to connect to AMQP broker. Usually provided by the administrator (e.g., ‘me’). Defaults to ‘guest’.

  6. Password - [REQUIRED] password to use with user name to connect to AMQP broker. Usually provided by the administrator. Defaults to ‘guest’.

  7. Use Certificate Authentication - [OPTIONAL] Use the SSL certificate common name for authentication rather than user name/password. This can only be used in conjunction with SSL. Defaults to ‘false’.

  8. Virtual Host - [OPTIONAL] Virtual Host name which segregates AMQP system for enhanced security. Please refer to this blog for more details on Virtual Host.
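
The configuration above corresponds to a fairly ordinary AMQP 0.9.1 publish. As a hedged, standalone sketch (not the processor's own Java client), the pika Python package can illustrate how the connection settings, amqp$ attributes and matched header attributes come together; host names, keys and attribute values below are example values only.

import pika

# Connection details: Host Name, Port, Virtual Host, User Name / Password
params = pika.ConnectionParameters(
    host="localhost", port=5672, virtual_host="/",
    credentials=pika.PlainCredentials("guest", "guest"),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()

flowfile_attributes = {
    "amqp$contentType": "text/xml",
    "amqp$deliveryMode": "2",
    "document.id": "1234",          # would be matched by the Headers Pattern regex
}

properties = pika.BasicProperties(
    content_type=flowfile_attributes["amqp$contentType"],
    delivery_mode=int(flowfile_attributes["amqp$deliveryMode"]),
    headers={"document.id": flowfile_attributes["document.id"]},
)

# Exchange Name + Routing Key decide where the broker delivers the message
channel.basic_publish(
    exchange="amq.direct",          # an empty string would target the default exchange
    routing_key="myKey",            # with the default exchange this is the queue name
    body=b"<doc>FlowFile content</doc>",
    properties=properties,
)
connection.close()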

PublishGCPubSub

Publishes the content of the incoming flowfile to the configured Google Cloud PubSub topic. The processor supports dynamic properties. If any dynamic properties are present, they will be sent along with the message in the form of 'attributes'.

Tags: google, google-cloud, gcp, message, pubsub, publish

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Project ID

Google Cloud Project ID

Topic Name

Name of the Google Cloud PubSub Topic

Message Derivation Strategy

The strategy used to publish the incoming FlowFile to the Google Cloud PubSub endpoint.

Record Reader

The Record Reader to use for incoming FlowFiles

Record Writer

The Record Writer to use in order to serialize the data before sending to GCPubSub endpoint

Input Batch Size

Maximum number of FlowFiles processed for each Processor invocation

Maximum Message Size

The maximum size of a Google PubSub message in bytes. Defaults to 1 MB (1048576 bytes)

Batch Size Threshold

Indicates the number of messages the cloud service should bundle together in a batch. If not set, only one message will be included in each batch.

Batch Bytes Threshold

Publish request gets triggered based on this Batch Bytes Threshold property and the Batch Size Threshold property, whichever condition is met first.

Batch Delay Threshold

Indicates the delay threshold to use for batching. After this amount of time has elapsed (counting from the first element added), the elements will be wrapped up in a batch and sent. This value should not be set too high, usually on the order of milliseconds. Otherwise, calls might appear to never complete.

API Endpoint

Override the gRPC endpoint in the form of [host:port]

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Dynamic Properties

Attribute name

Attributes to be set for the outgoing Google Cloud PubSub message
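
As a rough illustration of how dynamic properties become message attributes, the following standalone sketch uses the google-cloud-pubsub Python client. This is an assumption made for illustration; the project, topic and attribute names are examples, and NiFi obtains credentials via the GCP Credentials Provider Service rather than the environment.

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()                      # credentials from the environment
topic_path = publisher.topic_path("my-project", "my-topic")  # Project ID + Topic Name (examples)

# FlowFile content becomes the message data; dynamic properties become message attributes
future = publisher.publish(
    topic_path,
    data=b"FlowFile content goes here",
    source="nifi",              # example attribute (a dynamic property name/value)
    filename="data.json",       # another example attribute
)
print("Published message ID:", future.result())              # surfaced as gcp.pubsub.messageId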

Relationships

  • success: FlowFiles are routed to this relationship after a successful Google Cloud Pub/Sub operation.

  • failure: FlowFiles are routed to this relationship if the Google Cloud Pub/Sub operation fails.

  • retry: FlowFiles are routed to this relationship if the Google Cloud Pub/Sub operation fails but attempting the operation again may succeed.

Writes Attributes

  • gcp.pubsub.messageId: ID of the pubsub message published to the configured Google Cloud PubSub topic

  • gcp.pubsub.count.records: Count of pubsub messages published to the configured Google Cloud PubSub topic

  • gcp.pubsub.topic: Name of the Google Cloud PubSub topic the message was published to

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The entirety of the FlowFile’s content will be read into memory to be sent as a PubSub message.

See Also

PublishJMS

Creates a JMS Message from the contents of a FlowFile and sends it to a JMS Destination (queue or topic) as JMS BytesMessage or TextMessage. FlowFile attributes will be added as JMS headers and/or properties to the outgoing JMS message.

Tags: jms, put, message, send, publish

Properties

Connection Factory Service

The Controller Service that is used to obtain the Connection Factory. Alternatively, the 'JNDI *' or the 'JMS *' properties can also be used to configure the Connection Factory.

Destination Name

The name of the JMS Destination. Usually provided by the administrator (e.g., 'topic://myTopic' or 'myTopic').

Destination Type

The type of the JMS Destination. Could be one of 'QUEUE' or 'TOPIC'. Usually provided by the administrator. Defaults to 'QUEUE'

User Name

User Name used for authentication and authorization.

Password

Password used for authentication and authorization.

Connection Client ID

The client ID to be set on the connection, if set. For durable, non-shared consumers this is mandatory; for all others it is optional, and with shared consumers it is typically undesirable to set it. Please see the JMS spec for further details

Message Body Type

The type of JMS message body to construct.

Character Set

The name of the character set to use to construct or interpret TextMessages

Allow Illegal Characters in Header Names

Specifies whether header names containing illegal characters (usually hyphens and full stops) should still be sent to the JMS broker.

Attributes to Send as JMS Headers (Regex)

Specifies the Regular Expression that determines the names of FlowFile attributes that should be sent as JMS Headers

Maximum Batch Size

The maximum number of messages to publish or consume in each invocation of the processor.

Record Reader

The Record Reader to use for parsing the incoming FlowFile into Records.

Record Writer

The Record Writer to use for serializing Records before publishing them as a JMS Message.

JNDI Initial Context Factory Class

The fully qualified class name of the JNDI Initial Context Factory Class (java.naming.factory.initial).

JNDI Provider URL

The URL of the JNDI Provider to use as the value for java.naming.provider.url. See additional details documentation for allowed URL schemes.

JNDI Name of the Connection Factory

The name of the JNDI Object to lookup for the Connection Factory.

JNDI / JMS Client Libraries

Specifies jar files and/or directories to add to the ClassPath in order to load the JNDI / JMS client libraries. This should be a comma-separated list of files, directories, and/or URLs. If a directory is given, any files in that directory will be included, but subdirectories will not be included (i.e., it is not recursive).

JNDI Principal

The Principal to use when authenticating with JNDI (java.naming.security.principal).

JNDI Credentials

The Credentials to use when authenticating with JNDI (java.naming.security.credentials).

JMS Connection Factory Implementation Class

The fully qualified name of the JMS ConnectionFactory implementation class (e.g., org.apache.activemq.ActiveMQConnectionFactory).

JMS Client Libraries

Path to the directory with additional resources (e.g., JARs, configuration files, etc.) to be added to the classpath (defined as a comma-separated list of values). Such resources typically represent target JMS client libraries for the ConnectionFactory implementation.

JMS Broker URI

URI pointing to the network location of the JMS Message broker. Example for ActiveMQ: 'tcp://myhost:61616'. Examples for IBM MQ: 'myhost(1414)' and 'myhost01(1414),myhost02(1414)'.

JMS SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Dynamic Properties

The name of a Connection Factory configuration property.

Additional configuration property for the Connection Factory. It can be used when the Connection Factory is being configured via the 'JNDI *' or the 'JMS *' properties of the processor. For more information, see the Additional Details page.

Relationships

  • success: All FlowFiles that are sent to the JMS destination are routed to this relationship

  • failure: All FlowFiles that cannot be sent to JMS destination are routed to this relationship

Reads Attributes

  • jms_deliveryMode: This attribute becomes the JMSDeliveryMode message header. Must be an integer.

  • jms_expiration: This attribute becomes the JMSExpiration message header. Must be a long.

  • jms_priority: This attribute becomes the JMSPriority message header. Must be an integer.

  • jms_redelivered: This attribute becomes the JMSRedelivered message header.

  • jms_timestamp: This attribute becomes the JMSTimestamp message header. Must be a long.

  • jms_correlationId: This attribute becomes the JMSCorrelationID message header.

  • jms_type: This attribute becomes the JMSType message header. Must be an integer.

  • jms_replyTo: This attribute becomes the JMSReplyTo message header. Must be an integer.

  • jms_destination: This attribute becomes the JMSDestination message header. Must be an integer.

  • other attributes: All other attributes that do not start with jms_ are added as message properties.

  • other attributes .type: When an attribute will be added as a message property, a second attribute of the same name but with an extra .type at the end will cause the message property to be sent using that strong type. For example, attribute delay with value 12000 and another attribute delay.type with value integer will cause a JMS message property delay to be sent as an Integer rather than a String. Supported types are boolean, byte, short, integer, long, float, double, and string (which is the default).
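
The .type convention above is easy to picture outside NiFi: for every non-jms_ attribute, look for a sibling <name>.type attribute and coerce the value accordingly before setting it as a message property. The following is a conceptual Python sketch of that rule only; it is illustrative and not NiFi's implementation.

# Coercions supported by the .type convention; string is the default
COERCERS = {
    "boolean": lambda v: v.lower() == "true",
    "byte": int, "short": int, "integer": int, "long": int,
    "float": float, "double": float,
    "string": str,
}

def attributes_to_jms_properties(attributes: dict) -> dict:
    """Map FlowFile attributes to typed JMS message properties."""
    properties = {}
    for name, value in attributes.items():
        if name.startswith("jms_") or name.endswith(".type"):
            continue  # jms_* become headers; *.type entries only describe other attributes
        declared_type = attributes.get(name + ".type", "string")
        properties[name] = COERCERS.get(declared_type, str)(value)
    return properties

# 'delay' is sent as an Integer because of the companion 'delay.type' attribute
print(attributes_to_jms_properties({"delay": "12000", "delay.type": "integer", "trace": "abc"}))
# {'delay': 12000, 'trace': 'abc'}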

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in degraded performance.

Additional Details

Summary

This processor publishes the contents of the incoming FlowFile to a JMS compliant messaging system.

This processor does two things. It constructs a JMS Message by extracting the FlowFile contents (both body and attributes). Once the message is constructed, it is sent to a pre-configured JMS Destination. Standard JMS Headers will be extracted from the FlowFile and set on javax.jms.Message as JMS headers, while other FlowFile attributes will be set as properties of javax.jms.Message. Upon success the incoming FlowFile is transferred to the success Relationship, and upon failure the FlowFile is penalized and transferred to the failure Relationship.

Configuration Details

At the time of writing, this document only defines the essential configuration properties, which are suitable for most cases. Other properties will be defined later as this component progresses. Configuring PublishJMS:

  1. User Name - [OPTIONAL] User Name used for authentication and authorization when this processor obtains javax.jms.Connection from the pre-configured javax.jms.ConnectionFactory (see below).

  2. Password - [OPTIONAL] Password used in conjunction with User Name.

  3. Destination Name - [REQUIRED] the name of the javax.jms.Destination. Usually provided by administrator (e.g., ‘topic://myTopic’).

  4. Destination Type - [REQUIRED] the type of the javax.jms.Destination. Could be one of ‘QUEUE’ or ‘TOPIC’. Usually provided by the administrator. Defaults to ‘QUEUE’.

Connection Factory Configuration

There are multiple ways to configure the Connection Factory for the processor:

  • Connection Factory Service property - link to a pre-configured controller service (JndiJmsConnectionFactoryProvider or JMSConnectionFactoryProvider)

  • JNDI * properties - processor level configuration, the properties are the same as the properties of JndiJmsConnectionFactoryProvider controller service, the dynamic properties can also be used in this case

  • JMS * properties - processor level configuration, the properties are the same as the properties of JMSConnectionFactoryProvider controller service, the dynamic properties can also be used in this case

The preferred way is to use the Connection Factory Service property with a pre-configured controller service. It is also the most convenient method, because the controller service only needs to be configured once and can then be used by multiple processors.

However, some JMS client libraries may not work with the controller services due to incompatible Java ClassLoader handling between the 3rd party JMS client library and NiFi. Should you encounter java.lang.ClassCastException errors when using the controller services, please try to configure the Connection Factory via the ‘JNDI *’ or the ‘JMS *’ and the dynamic properties of the processor. For more details on these properties, see the documentation of the corresponding controller service (JndiJmsConnectionFactoryProvider for ‘JNDI *’ and JMSConnectionFactoryProvider for ‘JMS *’).

PublishKafka

Sends the contents of a FlowFile as either a message or as individual records to Apache Kafka using the Kafka Producer API. The messages to send may be individual FlowFiles, may be delimited using a user-specified delimiter (such as a new-line), or may be record-oriented data that can be read by the configured Record Reader. The complementary NiFi processor for fetching messages is ConsumeKafka. To produce a kafka tombstone message while using PublishStrategy.USE_WRAPPER, simply set the value of a record to 'null'.

Tags: Apache, Kafka, Record, csv, json, avro, logs, Put, Send, Message, PubSub

Properties

Kafka Connection Service

Provides connections to Kafka Broker for publishing Kafka Records

Topic Name

Name of the Kafka Topic to which the Processor publishes Kafka Records

Failure Strategy

Specifies how the processor handles a FlowFile if it is unable to publish the data to Kafka

Delivery Guarantee

Specifies the requirement for guaranteeing that a message is sent to Kafka. Corresponds to Kafka Client acks property.

Compression Type

Specifies the compression strategy for records sent to Kafka. Corresponds to Kafka Client compression.type property.

Max Request Size

The maximum size of a request in bytes. Corresponds to Kafka Client max.request.size property.

Transactions Enabled

Specifies whether to provide transactional guarantees when communicating with Kafka. If there is a problem sending data to Kafka, and this property is set to false, then the messages that have already been sent to Kafka will continue on and be delivered to consumers. If this is set to true, then the Kafka transaction will be rolled back so that those messages are not available to consumers. Setting this to true requires that the [Delivery Guarantee] property be set to [Guarantee Replicated Delivery].

Transactional ID Prefix

Specifies a prefix for the KafkaProducer transactional.id configuration; the transactional.id will be a generated UUID prefixed with the configured string.

Partitioner Class

Specifies which class to use to compute a partition id for a message. Corresponds to Kafka Client partitioner.class property.

Partition

Specifies the Kafka Partition destination for Records.

Message Demarcator

Specifies the string (interpreted as UTF-8) to use for demarcating multiple messages within a single FlowFile. If not specified, the entire content of the FlowFile will be used as a single message. If specified, the contents of the FlowFile will be split on this delimiter and each section sent as a separate Kafka message. To enter a special character such as 'new line', use CTRL+Enter or Shift+Enter, depending on your OS.

Record Reader

The Record Reader to use for incoming FlowFiles

Record Writer

The Record Writer to use in order to serialize the data before sending to Kafka

Publish Strategy

The format used to publish the incoming FlowFile record to Kafka.

Message Key Field

The name of a field in the Input Records that should be used as the Key for the Kafka message.

FlowFile Attribute Header Pattern

A Regular Expression that is matched against all FlowFile attribute names. Any attribute whose name matches the pattern will be added to the Kafka messages as a Header. If not specified, no FlowFile attributes will be added as headers.

Header Encoding

For any attribute that is added as a Kafka Record Header, this property indicates the Character Encoding to use for serializing the headers.

Kafka Key

The Key to use for the Message. If not specified, the FlowFile attribute 'kafka.key' is used as the message key, if it is present. Beware that setting the Kafka key and demarcating at the same time may potentially lead to many Kafka messages with the same key. Normally this is not a problem, as Kafka does not enforce or assume message and key uniqueness. Still, setting the demarcator and the Kafka key at the same time poses a risk of data loss on Kafka: during topic compaction, messages will be deduplicated based on this key.

Kafka Key Attribute Encoding

FlowFiles that are emitted have an attribute named 'kafka.key'. This property dictates how the value of the attribute should be encoded.

Record Key Writer

The Record Key Writer to use for outgoing FlowFiles

Record Metadata Strategy

Specifies whether the Record’s metadata (topic and partition) should come from the Record’s metadata field or if it should come from the configured Topic Name and Partition / Partitioner class properties
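
Outside NiFi, the core publish settings above (delivery guarantee, compression, max request size, key, and header pattern) map onto an ordinary Kafka producer. The following hedged sketch uses the kafka-python client with example broker, topic and header values; it is not the processor's internal client.

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # provided in NiFi by the Kafka Connection Service
    acks="all",                          # Delivery Guarantee -> Kafka acks
    compression_type="gzip",             # Compression Type
    max_request_size=1048576,            # Max Request Size (bytes)
)

# Attributes matching the FlowFile Attribute Header Pattern become record headers
headers = [("source.system", b"nifi")]

producer.send(
    "my-topic",                 # Topic Name
    key=b"order-1234",          # Kafka Key / Message Key Field
    value=b"FlowFile content or one demarcated message",
    headers=headers,
)
producer.flush()                # block until the broker has acknowledged the batch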

Relationships

  • success: FlowFiles for which all content was sent to Kafka.

  • failure: Any FlowFile that cannot be sent to Kafka will be routed to this Relationship

Reads Attributes

  • kafka.tombstone: If this attribute is set to 'true', the processor is not configured with a demarcator, and the FlowFile’s content is null, then a tombstone message with zero bytes will be sent to Kafka.

Writes Attributes

  • msg.count: The number of messages that were sent to Kafka for this FlowFile. This attribute is added only to FlowFiles that are routed to success.

Input Requirement

This component requires an incoming relationship.

See Also

PublishMQTT

Publishes a message to an MQTT topic

Tags: publish, MQTT, IOT

Properties

Broker URI

The URI(s) to use to connect to the MQTT broker (e.g., tcp://localhost:1883). The 'tcp', 'ssl', 'ws' and 'wss' schemes are supported. In order to use 'ssl', the SSL Context Service property must be set. When a comma-separated URI list is set (e.g., tcp://localhost:1883,tcp://localhost:1884), the processor will use a round-robin algorithm to connect to the brokers on connection failure.

MQTT Specification Version

The MQTT specification version when connecting with the broker. See the allowable value descriptions for more details.

Username

Username to use when connecting to the broker

Password

Password to use when connecting to the broker

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Session state

Whether to start afresh or resume previous flows. See the allowable value descriptions for more details.

Session Expiry Interval

After this interval the broker will expire the client and clear the session state.

Client ID

MQTT client ID to use. If not set, a UUID will be generated.

Topic

The topic to publish the message to.

Retain Message

Whether or not the retain flag should be set on the MQTT message.

Quality of Service (QoS)

The Quality of Service (QoS) to send the message with. Accepts three values '0', '1' and '2'; '0' for 'at most once', '1' for 'at least once', '2' for 'exactly once'. Expression language is allowed in order to support publishing messages with different QoS but the end value of the property must be either '0', '1' or '2'.

Record Reader

The Record Reader to use for parsing the incoming FlowFile into Records.

Record Writer

The Record Writer to use for serializing Records before publishing them as an MQTT Message.

Message Demarcator

This property gives you the option to publish multiple messages from a single FlowFile. It allows you to provide a string (interpreted as UTF-8) to use for demarcating the FlowFile content. This is an optional property; if not provided, and if no Record Reader/Writer is defined, each FlowFile will be published as a single message. To enter a special character such as 'new line', use CTRL+Enter or Shift+Enter, depending on the OS.

Connection Timeout (seconds)

Maximum time interval the client will wait for the network connection to the MQTT server to be established. The default timeout is 30 seconds. A value of 0 disables timeout processing, meaning the client will wait until the network connection either succeeds or fails.

Keep Alive Interval (seconds)

Defines the maximum time interval between messages sent or received. It enables the client to detect if the server is no longer available, without having to wait for the TCP/IP timeout. The client will ensure that at least one message travels across the network within each keep alive period. In the absence of a data-related message during the time period, the client sends a very small "ping" message, which the server will acknowledge. A value of 0 disables keepalive processing in the client.

Last Will Message

The message to send as the client’s Last Will.

Last Will Topic

The topic to send the client’s Last Will to.

Last Will Retain

Whether to retain the client’s Last Will.

Last Will QoS Level

QoS level to be used when publishing the Last Will Message.
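
The connection and publish settings above translate directly to a plain MQTT client. The following standalone sketch assumes the paho-mqtt 1.x Python API, with example broker, topic and credential values; it is not the processor's implementation.

import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="nifi-publisher")        # Client ID (a UUID is generated if unset)
client.username_pw_set("user", "password")              # Username / Password

# Last Will Message / Topic / Retain / QoS Level
client.will_set("status/nifi", payload="publisher offline", qos=1, retain=True)

# Broker URI tcp://localhost:1883, Keep Alive Interval of 60 seconds
client.connect("localhost", 1883, keepalive=60)

# Topic, Quality of Service (0/1/2) and Retain Message
info = client.publish("sensors/temperature", payload=b"21.5", qos=1, retain=False)
info.wait_for_publish()                                  # block until the broker confirms
client.disconnect()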

Relationships

  • success: FlowFiles that are sent successfully to the destination are transferred to this relationship.

  • failure: FlowFiles that failed to send to the destination are transferred to this relationship.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in degraded performance.

See Also

PublishSlack

Posts a message to the specified Slack channel. The content of the message can be either a user-defined message that makes use of Expression Language or the contents of the FlowFile can be sent as the message. If sending a user-defined message, the contents of the FlowFile may also be optionally uploaded as a file attachment.

Use Cases

Send specific text as a message to Slack, optionally including the FlowFile’s contents as an attached file.

Input Requirement: This component allows an incoming relationship.

  1. Set "Access Token" to the value of your Slack OAuth Access Token.

  2. Set "Channel" to the ID of the channel or the name of the channel prefixed with the # symbol. For example, "C0123456789" or "#general".

  3. Set "Publish Strategy" to "Use 'Message Text' Property".

  4. Set "Message Text" to the text that you would like to send as the Slack message.

  5. Set "Include FlowFile Content as Attachment" to "true" if the FlowFile’s contents should be attached as a file, or "false" to send just the message text without an attachment.

Send the contents of the FlowFile as a message to Slack.

Input Requirement: This component allows an incoming relationship.

  1. Set "Access Token" to the value of your Slack OAuth Access Token.

  2. Set "Channel" to the ID of the channel or the name of the channel prefixed with the # symbol. For example, "C0123456789" or "#general".

  3. Set "Publish Strategy" to "Send FlowFile Content as Message".

Multi-Processor Use Cases

Respond to a Slack message in a thread.

Keywords: slack, respond, reply, thread

Processor:

  1. Set "Destination" to "flowfile-attribute".

  2. Add a new property named "thread.ts" with a value of $.threadTs

  3. Add a new property named "message.ts" with a value of $.ts

  4. Add a new property named "channel.id" with a value of $.channel

  5. Add a new property named "user.id" with a value of $.user

  6. Connect the "matched" Relationship to PublishSlack.

PublishSlack:

  1. Set "Access Token" to the value of your Slack OAuth Access Token.

  2. Set "Channel" to ${'channel.id'}

  3. Set "Publish Strategy" to "Use 'Message Text' Property".

  4. Set "Message Text" to the text that you would like to send as the response. If desired, you can reference the user of the original message by including the text <@${'user.id'}>.

  5. For example: Hey, <@${'user.id'}>, thanks for asking…

  6. Set "Include FlowFile Content as Attachment" to "false".

  7. Set "Thread Timestamp" to ${'thread.ts':replaceEmpty( ${'message.ts'} )}

Tags: slack, conversation, chat.postMessage, social media, team, text, unstructured, write, upload, send, publish

Properties

Access Token

OAuth Access Token used for authenticating/authorizing the Slack request sent by NiFi. This may be either a User Token or a Bot Token. The token must be granted the chat:write scope. Additionally, in order to upload FlowFile contents as an attachment, it must be granted files:write.

Channel

The name or identifier of the channel to send the message to. If using a channel name, it must be prefixed with the # character. For example, #general. This is valid only for public channels. Otherwise, the unique identifier of the channel to publish to must be provided.

Publish Strategy

Specifies how the Processor will send the message or file to Slack.

Message Text

The text of the message to send to Slack.

Character Set

Specifies the name of the Character Set used to encode the FlowFile contents.

Include FlowFile Content as Attachment

Specifies whether or not the contents of the FlowFile should be uploaded as an attachment to the Slack message.

Max FlowFile Size

The maximum size of a FlowFile that can be sent to Slack. If any FlowFile exceeds this size, it will be routed to failure. This plays an important role because the entire contents of the file must be loaded into NiFi’s heap in order to send the data to Slack.

Thread Timestamp

The Timestamp identifier for the thread that this message is to be a part of. If not specified, the message will be a top-level message instead of being in a thread.

Methods Endpoint Url Prefix

Customizes the Slack client by setting the methodsEndpointUrlPrefix. If you need to use a different URL prefix for Slack API method calls, you can set it here. Default value: https://slack.com/api/

Relationships

  • success: FlowFiles are routed to success after being successfully sent to Slack

  • failure: FlowFiles are routed to 'failure' if unable to be sent to Slack for any other reason

  • rate limited: FlowFiles are routed to 'rate limited' if the Rate Limit has been exceeded

Writes Attributes

  • slack.channel.id: The ID of the Slack Channel from which the messages were retrieved

  • slack.ts: The timestamp of the slack messages that was sent; this is used by Slack as a unique identifier

Input Requirement

This component requires an incoming relationship.

Additional Details

Description:

PublishSlack allows for the ability to send messages to Slack using Slack’s chat.postMessage API. This Processor should be preferred over the deprecated PutSlack and PostSlack Processors, as it aims to incorporate the capabilities of both of those Processors, improve the maintainability, and ease the configuration for the user.

Slack Setup

In order to use this Processor, a Slack App must be created and installed in your Slack workspace. An OAuth User or Bot Token must be created for the App. The token must have the chat:write Token Scope. Please see Slack’s documentation for the latest information on how to create an Application and install it into your workspace.

Depending on the Processor’s configuration, you may also require additional Scopes. For example, if configured to upload the contents of the FlowFile as a message attachment, the files:write User Token Scope or Bot Token Scope must be granted. Additionally, the Channel to publish to may be provided either as a Channel ID or (for public Channels) a Channel Name. However, if a name such as #general is used, the token must be granted the channels:read scope in order to determine the Channel ID for you.

Rather than requiring the channels:read Scope, you may alternatively supply only Channel IDs for the “Channel” property. To determine the ID of a Channel, navigate to the desired Channel in Slack. Click the name of the Channel at the top of the screen. This provides a popup that provides information about the Channel. Scroll to the bottom of the popup, and you will be shown the Channel ID with the ability to click a button to Copy the ID to your clipboard.

At the time of this writing, the following steps may be used to create a Slack App with the necessary chat:write scope. However, these instructions are subject to change at any time, so it is best to read through Slack’s Quickstart Guide.

  • Create a Slack App. Click here to get started. From here, click the “Create New App” button and choose “From scratch.” Give your App a name and choose the workspace that you want to use for developing the app.

  • Creating your app will take you to the configuration page for your application. For example, https://api.slack.com/apps/<APP_IDENTIFIER>. From here, click on “OAuth & Permissions” in the left-hand menu. Scroll down to the “Scopes” section and click the “Add an OAuth Scope” button under ‘Bot Token Scopes’. Choose the chat:write scope.

  • Scroll back to the top, and under the “OAuth Tokens for Your Workspace” section, click the “Install to Workspace” button. This will prompt you to allow the application to be added to your workspace, if you have the appropriate permissions. Otherwise, it will generate a notification for a Workspace Owner to approve the installation. Additionally, it will generate a “Bot User OAuth Token”.

  • Copy the value of the “Bot User OAuth Token.” This will be used as the value for the PublishSlack Processor’s Access Token property.

  • The Bot must then be enabled for each Channel that you would like to publish messages to. In order to do that, in the Slack application, go to the Channel that you would like to publish to and press /. Choose the Add apps to this channel option, and add the Application that you created as a Bot to the channel.
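
For reference, the chat.postMessage call that PublishSlack wraps looks roughly like the following with Slack's Python SDK. This is a hedged standalone sketch, not the processor's code; the token, channel and thread timestamp are example values.

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token="xoxb-your-bot-token")   # Access Token (needs the chat:write scope)

try:
    response = client.chat_postMessage(
        channel="#general",                               # Channel name or ID
        text="Hey, processing for today's batch finished.",  # Message Text
        thread_ts="1700000000.000100",                    # Thread Timestamp (omit for top-level)
    )
    print("Message ts:", response["ts"])                  # exposed by the processor as slack.ts
except SlackApiError as err:
    # Rate-limited or failed calls would be routed to 'rate limited' / 'failure' in NiFi
    print("Slack error:", err.response["error"])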

PutAzureBlobStorage_v12

Puts content into a blob on Azure Blob Storage. The processor uses Azure Blob Storage client library v12.

Tags: azure, microsoft, cloud, storage, blob

Properties

Storage Credentials

Controller Service used to obtain Azure Blob Storage Credentials.

Container Name

Name of the Azure storage container. In the case of the PutAzureBlobStorage processor, the container can be created if it does not exist.

Create Container

Specifies whether to check if the container exists and to automatically create it if it does not. Permission to list containers is required. If false, this check is not made, but the Put operation will fail if the container does not exist.

Conflict Resolution Strategy

Specifies whether an existing blob will have its contents replaced upon conflict.

Blob Name

The full name of the blob

Resource Transfer Source

The source of the content to be transferred

File Resource Service

File Resource Service providing access to the local resource to be transferred

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In the case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Client-Side Encryption Key Type

Specifies the key type to use for client-side encryption.

Client-Side Encryption Key ID

Specifies the ID of the key to use for client-side encryption.

Client-Side Encryption Local Key

When using local client-side encryption, this is the raw key, encoded in hexadecimal
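
A standalone sketch of the equivalent upload with the azure-storage-blob v12 Python client is shown below. Account, container and blob names are example values, and NiFi handles credentials through the Storage Credentials controller service rather than in code.

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://myaccount.blob.core.windows.net",
    credential="ACCOUNT_KEY_OR_SAS_TOKEN",
)

container = service.get_container_client("my-container")   # Container Name
# Create Container = true would roughly correspond to:
# container.create_container()   # requires permission to list/create containers

with open("data.json", "rb") as data:                       # FlowFile content / File Resource
    blob_client = container.upload_blob(
        name="incoming/data.json",                          # Blob Name
        data=data,
        overwrite=True,   # Conflict Resolution Strategy: replace on conflict
    )

props = blob_client.get_blob_properties()
print(props.etag, props.size)        # surfaced by the processor as azure.etag / azure.length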

Relationships

  • success: All successfully processed FlowFiles are routed to this relationship

  • failure: Unsuccessful operations will be transferred to the failure relationship.

Writes Attributes

  • azure.container: The name of the Azure Blob Storage container

  • azure.blobname: The name of the blob on Azure Blob Storage

  • azure.primaryUri: Primary location of the blob

  • azure.etag: ETag of the blob

  • azure.blobtype: Type of the blob (either BlockBlob, PageBlob or AppendBlob)

  • mime.type: MIME Type of the content

  • lang: Language code for the content

  • azure.timestamp: Timestamp of the blob

  • azure.length: Length of the blob

  • azure.error.code: Error code reported during blob operation

  • azure.ignored: When Conflict Resolution Strategy is 'ignore', this property will be true/false depending on whether the blob was ignored.

Input Requirement

This component requires an incoming relationship.

PutAzureCosmosDBRecord

This processor is a record-aware processor for inserting data into Cosmos DB with Core SQL API. It uses a configured record reader and schema to read an incoming record set from the body of a Flowfile and then inserts those records into a configured Cosmos DB Container.

Tags: azure, cosmos, insert, record, put

Properties

Cosmos DB Connection Service

If configured, the controller service used to obtain the connection string and access key

Cosmos DB URI

Cosmos DB URI, typically in the form of https://{databaseaccount}.documents.azure.com:443/. Note that this host URL is for Cosmos DB with the Core SQL API, from the Azure Portal (Overview→URI)

Cosmos DB Access Key

Cosmos DB Access Key from Azure Portal (Settings→Keys). Choose a read-write key to enable database or container creation at run time

Cosmos DB Consistency Level

Choose from five consistency levels on the consistency spectrum. Refer to Cosmos DB documentation for their differences

Cosmos DB Name

The database name or id. This is used as the namespace for document collections or containers

Cosmos DB Container ID

The unique identifier for the container

Cosmos DB Partition Key

The partition key used to distribute data among servers

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema

Insert Batch Size

The number of records to group together for one single insert operation against Cosmos DB

Cosmos DB Conflict Handling Strategy

Choose whether to ignore or upsert when conflict error occurs during insertion

Relationships

  • success: All FlowFiles that are written to Cosmos DB are routed to this relationship

  • failure: All FlowFiles that cannot be written to Cosmos DB are routed to this relationship

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

PutAzureDataExplorer

Acts as an Azure Data Explorer sink which sends FlowFiles to the provided endpoint. Data can be sent through queued ingestion or streaming ingestion to the Azure Data Explorer cluster.

Tags: Azure, Kusto, ADX, Explorer, Data

Properties

Kusto Ingest Service

Azure Data Explorer Kusto Ingest Service

Database Name

Azure Data Explorer Database Name for ingesting data

Table Name

Azure Data Explorer Table Name for ingesting data

Ingest Mapping Name

The name of the mapping responsible for storing the data in the appropriate columns.

Data Format

The format of the data that is sent to Azure Data Explorer. Supported formats include: avro, csv, json

Partially Succeeded Routing Strategy

Defines where to route FlowFiles that resulted in a partially succeeded status.

Streaming Enabled

Whether to stream data to Azure Data Explorer.

Ingestion Ignore First Record

Defines whether to ignore the first record during ingestion.

Poll for Ingest Status

Determines whether to poll the ingestion status after an ingestion to Azure Data Explorer is completed

Ingest Status Polling Timeout

Defines the total amount of time to poll for ingestion status

Ingest Status Polling Interval

Defines the interval of time at which to poll for ingestion status
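
When status polling is enabled, the two properties above bound a simple poll-until-done loop. A minimal, hypothetical sketch of that pattern (check_status is a stand-in for the actual status lookup, not an Azure Data Explorer API call):

# Minimal sketch of polling bounded by a total timeout and a fixed interval.
# check_status() is a hypothetical stand-in returning e.g. "Pending", "Succeeded" or "Failed".
import time

def poll_ingest_status(check_status, timeout_seconds=300, interval_seconds=5):
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = check_status()
        if status != "Pending":
            return status              # terminal state reached
        time.sleep(interval_seconds)   # wait one polling interval before retrying
    raise TimeoutError("Ingestion status polling timed out")
python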

Relationships

  • success: Ingest processing succeeded

  • failure: Ingest processing failed

Input Requirement

This component requires an incoming relationship.

PutAzureDataLakeStorage

Writes the contents of a FlowFile as a file on Azure Data Lake Storage Gen 2

Tags: azure, microsoft, cloud, storage, adlsgen2, datalake

Properties

ADLS Credentials

Controller Service used to obtain Azure Credentials.

Filesystem Name

Name of the Azure Storage File System (also called Container). It is assumed to already exist.

Directory Name

Name of the Azure Storage Directory. The Directory Name cannot contain a leading '/'. The root directory can be designated by the empty string value. For the PutAzureDataLakeStorage processor, the directory will be created if it does not already exist.

File Name

The filename

Writing Strategy

Defines the approach for writing the Azure file.

Base Temporary Path

The path where the temporary directory will be created. The path name cannot contain a leading '/'. The root directory can be designated by the empty string value. Non-existing directories will be created. The temporary file directory name is _nifitempdirectory.

Conflict Resolution Strategy

Indicates what should happen when a file with the same name already exists in the output directory

Resource Transfer Source

The source of the content to be transferred

File Resource Service

File Resource Service providing access to the local resource to be transferred

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS version will be used by the processor.

Relationships

  • success: Files that have been successfully written to Azure storage are transferred to this relationship

  • failure: Files that could not be written to Azure storage for some reason are transferred to this relationship

Writes Attributes

  • azure.filesystem: The name of the Azure File System

  • azure.directory: The name of the Azure Directory

  • azure.filename: The name of the Azure File

  • azure.primaryUri: Primary location for file content

  • azure.length: The length of the Azure File

Input Requirement

This component requires an incoming relationship.

Additional Details

This processor is responsible for uploading files to Azure Data Lake Storage Gen2.

File uploading and cleanup process in case of “Write and Rename” strategy

New file upload

  1. A temporary file is created with a random prefix under the given path in ‘_nifitempdirectory’.

  2. Content is appended to temp file.

  3. Temp file is moved to the final destination directory and renamed to its original name.

  4. In case of appending or renaming failure, the temp file is deleted.

  5. In case of temporary file deletion failure, the temp file remains on the server.

Existing file upload

  • Processors with “fail” conflict resolution strategy will direct the FlowFile to “Failure” relationship.

  • Processors with “ignore” conflict resolution strategy will direct the FlowFile to “Success” relationship.

  • Processors with “replace” conflict resolution strategy:

  1. A temporary file is created with a random prefix under the given path in ‘_nifitempdirectory’.

  2. Content is appended to temp file.

  3. The temp file is moved to the final destination directory and renamed to its original name, overwriting the original file.

  4. In case of appending or renaming failure, the temp file is deleted and the original file remains intact.

  5. In case of temporary file deletion failure, both temp file and original file remain on the server.
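
The “Write and Rename” steps above follow a common temp-file-then-rename pattern. A minimal local-filesystem sketch of that pattern (illustrative only, not the processor’s Azure implementation; upload_write_and_rename is a hypothetical name):

# Illustrative sketch of the write-and-rename pattern on a local filesystem.
# The real processor performs the equivalent steps against Azure Data Lake Storage Gen2.
import os
import uuid

def upload_write_and_rename(content: bytes, dest_dir: str, filename: str) -> None:
    temp_dir = os.path.join(dest_dir, "_nifitempdirectory")
    os.makedirs(temp_dir, exist_ok=True)
    temp_path = os.path.join(temp_dir, f"{uuid.uuid4().hex}-{filename}")
    try:
        with open(temp_path, "wb") as f:   # steps 1-2: create temp file, append content
            f.write(content)
        # step 3: move/rename to the final destination, overwriting any existing file
        os.replace(temp_path, os.path.join(dest_dir, filename))
    except OSError:
        try:
            os.remove(temp_path)           # step 4: delete temp file on failure
        except OSError:
            pass                           # step 5: if deletion fails, the temp file remains
        raise
python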

File uploading and cleanup process in case of “Simple Write” strategy

New file upload

  1. An empty file is created at its final destination.

  2. Content is appended to the file.

  3. In case of appending failure, the file is deleted.

  4. In case of file deletion failure, the file remains on the server.

Existing file upload

  • Processors with “fail” conflict resolution strategy will direct the FlowFile to “Failure” relationship.

  • Processors with “ignore” conflict resolution strategy will direct the FlowFile to “Success” relationship.

  • Processors with “replace” conflict resolution strategy:

  1. An empty file is created at its final destination, overwriting the original file.

  2. Content is appended to the file.

  3. In case of appending failure, the file is deleted and the original file is not restored.

  4. In case of file deletion failure, the file remains on the server.

PutAzureEventHub

Send FlowFile contents to Azure Event Hubs

Tags: microsoft, azure, cloud, eventhub, events, streams, streaming

Properties

Event Hub Namespace

Namespace of Azure Event Hubs prefixed to Service Bus Endpoint domain

Event Hub Name

Name of Azure Event Hubs destination

Service Bus Endpoint

To support namespaces not in the default windows.net domain.

Transport Type

Advanced Message Queuing Protocol Transport Type for communication with Azure Event Hubs

Shared Access Policy Name

The name of the shared access policy. This policy must have Send claims.

Shared Access Policy Key

The key of the shared access policy. Either the primary or the secondary key can be used.

Use Azure Managed Identity

Choose whether or not to use the managed identity of Azure VM/VMSS

Partitioning Key Attribute Name

If specified, the value of the FlowFile attribute named by this property will be used as the partitioning key by the event hub.

Maximum Batch Size

Maximum number of FlowFiles processed for each Processor invocation

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: Any FlowFile that is successfully sent to the event hubs will be transferred to this Relationship.

  • failure: Any FlowFile that could not be sent to the event hub will be transferred to this Relationship.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The Processor buffers FlowFile contents in memory before sending

PutAzureQueueStorage_v12

Writes the content of the incoming FlowFiles to the configured Azure Queue Storage.

Tags: azure, microsoft, cloud, storage, queue, enqueue

Properties

Queue Name

Name of the Azure Storage Queue

Endpoint Suffix

Storage accounts in public Azure always use a common FQDN suffix. Override this endpoint suffix with a different suffix in certain circumstances (like Azure Stack or non-public Azure regions).

Credentials Service

Controller Service used to obtain Azure Storage Credentials.

Message Time To Live

Maximum time to allow the message to be in the queue

Visibility Timeout

The length of time during which the message will be invisible after it is read. If the processing unit fails to delete the message after it is read, then the message will reappear in the queue.

Request Timeout

The timeout for read or write requests to Azure Queue Storage. Defaults to 1 second.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS version will be used by the processor.

Relationships

  • success: All successfully processed FlowFiles are routed to this relationship

  • failure: Unsuccessful operations will be transferred to the failure relationship.

Input Requirement

This component requires an incoming relationship.

PutBigQuery

Writes the contents of a FlowFile to a Google BigQuery table. The processor is record based so the schema that is used is driven by the RecordReader. Attributes that are not matched to the target schema are skipped. Exactly once delivery semantics are achieved via stream offsets.

Tags: google, google cloud, bq, bigquery

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Project ID

Google Cloud Project ID

BigQuery API Endpoint

Can be used to override the default BigQuery endpoint. Default is bigquerystorage.googleapis.com:443. Format must be hostname:port.

Dataset

BigQuery dataset name (Note - The dataset must exist in GCP)

Table Name

BigQuery table name

Record Reader

Specifies the Controller Service to use for parsing incoming data.

Transfer Type

Defines the preferred transfer type: streaming or batching

Append Record Count

The number of records to be appended to the write stream at once. Applicable for both batch and stream types

Number of retries

How many retry attempts should be made before routing to the failure relationship.

Skip Invalid Rows

Sets whether to insert all valid rows of a request, even if invalid rows exist. If not set the entire insert request will fail if it contains an invalid row.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to this relationship after a successful Google BigQuery operation.

  • failure: FlowFiles are routed to this relationship if the Google BigQuery operation fails.

Writes Attributes

  • bq.records.count: Number of records successfully inserted

Input Requirement

This component requires an incoming relationship.

Additional Details

Streaming Versus Batching Data

PutBigQuery is record based and relies on the gRPC-based Write API using protocol buffers. The underlying stream supports both streaming and batching approaches.

Streaming

With streaming, the data appended to the stream is instantly available in BigQuery for reading. The number of records (rows) appended at once is configurable. Only one stream is established per FlowFile, so at the conclusion of FlowFile processing the stream is closed and a new one is opened for the next FlowFile. Exactly-once delivery semantics are supported via stream offsets.
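
Conceptually, exactly-once delivery with stream offsets means each append carries the offset of its first record and the server ignores appends whose offsets have already been written. A purely illustrative, in-memory simulation of that idea (FakeWriteStream and append_rows are hypothetical stand-ins, not the Google client API):

# Conceptual simulation of offset-based, exactly-once appends to a write stream.
class FakeWriteStream:
    def __init__(self):
        self.rows = []

    def append_rows(self, rows, offset):
        if offset < len(self.rows):       # offset already written: duplicate append is ignored
            return len(self.rows)
        assert offset == len(self.rows)   # appends must be contiguous
        self.rows.extend(rows)
        return len(self.rows)

stream = FakeWriteStream()
stream.append_rows(["r1", "r2"], offset=0)
stream.append_rows(["r1", "r2"], offset=0)   # retried append is de-duplicated
print(len(stream.rows))                      # 2
python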

Batching

Similarly to the streaming approach, one stream is opened for each FlowFile and records are appended to the stream. However, data is not available in BigQuery until it is committed by the processor at the end of the FlowFile processing.

Improvement opportunities

  • The table has to exist on the BigQuery side; it is not created automatically

  • The Write API supports multiple streams for parallel execution and transactionality across streams. This is not utilized at the moment, as it would be covered at the NiFi framework level.

The official Google Write API documentation provides additional details.

PutBoxFile

Puts content to a Box folder.

Tags: box, storage, put

Properties

Box Client Service

Controller Service used to obtain a Box API connection.

Folder ID

The ID of the folder where the file is uploaded. Please see Additional Details to obtain Folder ID.

Subfolder Name

The name (path) of the subfolder where files are uploaded. The subfolder name is relative to the folder specified by 'Folder ID'. Example: subFolder, subFolder1/subfolder2

Create Subfolder

Specifies whether to check if the subfolder exists and to automatically create it if it does not. Permission to list folders is required.

Filename

The name of the file to upload to the specified Box folder.

Conflict Resolution Strategy

Indicates what should happen when a file with the same name already exists in the specified Box folder.

Chunked Upload Threshold

The maximum size of the content which is uploaded at once. FlowFiles larger than this threshold are uploaded in chunks. Chunked upload is allowed for files larger than 20 MB. It is recommended to use chunked upload for files exceeding 50 MB.

Relationships

  • success: Files that have been successfully written to Box are transferred to this relationship.

  • failure: Files that could not be written to Box for some reason are transferred to this relationship.

Reads Attributes

  • filename: Uses the FlowFile’s filename as the filename for the Box object.

Writes Attributes

  • box.id: The id of the file

  • filename: The name of the file

  • path: The folder path where the file is located

  • box.size: The size of the file

  • box.timestamp: The last modified time of the file

  • error.code: The error code returned by Box

  • error.message: The error message returned by Box

Input Requirement

This component requires an incoming relationship.

Additional Details

Upload files to Box from NiFi
  1. Find Folder ID

    • Navigate to the folder in Box and enter it. The URL in your browser will include the ID at the end of the URL. For example, if the URL were https://app.box.com/folder/191632099757, the Folder ID would be 191632099757

  2. Set Folder ID in ‘Folder ID’ property
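
For reference, the Folder ID is simply the trailing numeric segment of the folder URL shown in step 1; a minimal sketch of extracting it (illustrative only):

# Minimal sketch: extract the Box Folder ID from a folder URL.
from urllib.parse import urlparse

def box_folder_id(url: str) -> str:
    # "https://app.box.com/folder/191632099757" -> "191632099757"
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]

print(box_folder_id("https://app.box.com/folder/191632099757"))
python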

PutBPCAuditLog

Writes audit log data into a Virtimo Business Process Center (BPC). If successful, the FlowFile is routed unmodified to the "success" relationship. If the BPC returns an error, the FlowFile is routed to the "failure" relationship and the error message is written in the "bpc.error.response" attribute.

Tags: virtimo, bpc

Properties

BPC Controller

Controller used to define the connection to the BPC. The API key used by the controller requires the 'AUDIT_LOG_CREATE_ENTRY' right.

BPC Audit Level

The log level for the audit log. Must be either 'INFO', 'DEBUG', 'ERROR', or 'WARNING'.

BPC Audit Originator

The user or system that writes audit data to the logs.

BPC Action

The current action in the flow that the log data refers to.

BPC Audit Description

Description of what is logged.

BPC Audit Old Values

If you want to log value changes, you can put the original data in here.

BPC Audit New Values

If you want to log value changes, you can put the new data in here.

Relationships

  • success: If the request was successfully sent to the BPC, the received FlowFile is routed here unmodified.

  • failure: Failed to write to the audit log.

Writes Attributes

  • bpc.status.code: The response code from BPC.

  • bpc.error.response: The error description from BPC.

Input Requirement

This component requires an incoming relationship.

PutBPCNotification

Sends a notification to a Virtimo Business Process Center (BPC). If successful, the FlowFile is routed unmodified to the "success" relationship. If the BPC returns an error, the FlowFile is routed to the "failure" relationship and the error message is written in the "bpc.error.response" attribute.

Tags: virtimo, bpc

Properties

BPC Controller

Controller used to define the connection to the BPC. The API key used by the controller requires the 'NOTIFICATION_ADD' right.

ID

ID of the notification. A random UUID gets set when not given.

Priority

Delivery priority of the notification.

Read Priority

Use expression language to read the priority from an attribute. The priority must be 'Silent'/'0', 'Toast'/'5' or 'Popup'/'10'. Defaults to toast if value is empty.

Subject

The subject of the notification.

Message

The message of the notification.

Recipients Type

The user information used to determine which users will receive the notification.

Read Recipients Type

Use expression language to read from an attribute. The value must be 'User', 'Role' or 'Organisation'.

Recipients

The recipient(s) of the notification, separated by commas: e.g. "bpcadmin, bpcuser". Depends on the used Recipients Type.

Originator

The originator of the notification, e.g. 'Iguasu'. If empty, the ID of the BPC Controller’s API Key is used.

Notification Icon

The Font Awesome icon which should be used when showing the notification. Specified using the icon’s HTML class name, e.g. 'fa-water'. If left empty, an icon will be chosen based on the BPC Notification Type.

Notification Type

The type of the notification.

Read Type

Use expression language to read from an attribute. The value must be 'Info', 'Warning', 'Error' or 'Link'. 'Info' is used if value is empty. Link Target Module and Link Route must also be specified in case the value is 'Link', but the properties will be ignored if the value isn’t 'Link'.

Link Target Module

The BPC Module that the link should redirect to.

Link Route

The route-elements of the target component within the Target Module that the link should redirect to. Route-elements should be separated with slashes, e.g. "_core/apiKeys/api/API-0"

Send raw JSON request

Manually create the JSON request to send notifications using BPC’s Notification API.

JSON

Please refer to BPC’s API documentation for details on how to construct the request. https://docs.virtimo.net/bpc-docs/latest/core/dev/api/index.html#_notification_api

Relationships

  • success: If the request was successfully sent to the BPC, the received FlowFile is routed here unmodified.

  • failure: Failed to send notification.

Writes Attributes

  • bpc.status.code: The response code from BPC.

  • bpc.error.response: The error description from BPC.

Input Requirement

This component requires an incoming relationship.

PutBPCProcessLog

Writes process log data into a Virtimo Business Process Center (BPC). The log data must be in the expected JSON format (see Additional Details). If successful, the FlowFile is routed unmodified to the "success" relationship. If the BPC returns an error, the FlowFile is routed to the "failure" relationship and the error message is written in the "bpc.error.response" attribute.

Tags: virtimo, bpc

Properties

BPC Controller

Controller used to define the connection to the BPC. The API key used by the controller requires the 'LOG_SERVICE_WRITE_DATA' and 'LOG_SERVICE_CONFIG_GET_INSTANCES' rights.

BPC Logger

Select the logger from available ones.

BPC Logger ID

The ID of the logger (i.e. the Component ID of the Log Service in BPC).

Input Type

Should the logged JSON data be taken from the FlowFile content or from the "BPC Entries JSON" property?

BPC Entries JSON

The full, raw JSON to be logged into the BPC. Filling the value from the BPC (using the bucket button) will read the data of the BPC Logger and create the JSON based on it. For this to work, the API key used by the BPC Controller requires the 'LOG_SERVICE_READ_DATA' right.

Debug Mode

Write the created value of the "BPC Entries JSON" property into the "bpc.request" attribute.

Relationships

  • success: If the request was successfully sent to the BPC, the received FlowFile is routed here unmodified.

  • failure: Failed to write to the process log.

Writes Attributes

  • bpc.status.code: The response code from BPC.

  • bpc.error.response: The error description from BPC.

Input Requirement

This component requires an incoming relationship.

PutCloudWatchMetric

Publishes metrics to Amazon CloudWatch. Metric can be either a single value, or a StatisticSet comprised of minimum, maximum, sum and sample count.

Tags: amazon, aws, cloudwatch, metrics, put, publish

Properties

Namespace

The namespace for the metric data for CloudWatch

Metric Name

The name of the metric

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Value

The value for the metric. Must be a double

Maximum

The maximum value of the sample set. Must be a double

Minimum

The minimum value of the sample set. Must be a double

Sample Count

The number of samples used for the statistic set. Must be a double

Sum

The sum of values for the sample set. Must be a double

Timestamp

A point in time expressed as the number of milliseconds since Jan 1, 1970 00:00:00 UTC. If not specified, the default value is set to the time the metric data was received

Unit

The unit of the metric. (e.g Seconds, Bytes, Megabytes, Percent, Count, Kilobytes/Second, Terabits/Second, Count/Second) For details see http://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_MetricDatum.html

Communications Timeout

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default, including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other compatible endpoints.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Dynamic Properties

Dimension Name

Allows dimension name/value pairs to be added to the metric. AWS supports a maximum of 10 dimensions.

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

Input Requirement

This component requires an incoming relationship.
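
As noted in the description, a metric can be published either as a single Value or as a StatisticSet built from the Minimum, Maximum, Sum and Sample Count properties. A minimal sketch of deriving those four fields from a batch of locally collected samples (illustrative only):

# Minimal sketch: derive the StatisticSet fields from a batch of samples.
def statistic_set(samples):
    return {
        "Minimum": min(samples),
        "Maximum": max(samples),
        "Sum": sum(samples),
        "SampleCount": len(samples),
    }

print(statistic_set([1.5, 3.0, 2.5]))
# {'Minimum': 1.5, 'Maximum': 3.0, 'Sum': 7.0, 'SampleCount': 3}
python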

PutDatabaseRecord

The PutDatabaseRecord processor uses a specified RecordReader to input (possibly multiple) records from an incoming flow file. These records are translated to SQL statements and executed as a single transaction. If any errors occur, the flow file is routed to failure or retry, and if the records are transmitted successfully, the incoming flow file is routed to success. The type of statement executed by the processor is specified via the Statement Type property, which accepts some hard-coded values such as INSERT, UPDATE, and DELETE, as well as 'Use statement.type Attribute', which causes the processor to get the statement type from a flow file attribute. IMPORTANT: If the Statement Type is UPDATE, then the incoming records must not alter the value(s) of the primary keys (or user-specified Update Keys). If such records are encountered, the UPDATE statement issued to the database may do nothing (if no existing records with the new primary key values are found), or could inadvertently corrupt the existing data (by changing records for which the new values of the primary keys exist).
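
To illustrate the record-to-statement translation described above, here is a minimal, generic sketch of how a single record could be rendered as a parameterized INSERT (illustrative only; the processor’s actual SQL generation also handles type mapping, identifier quoting and the other options listed below):

# Illustrative sketch: build a parameterized INSERT statement for one record.
def record_to_insert(table: str, record: dict):
    columns = list(record.keys())
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return sql, [record[c] for c in columns]

sql, params = record_to_insert("users", {"id": 1, "name": "alice"})
print(sql)     # INSERT INTO users (id, name) VALUES (?, ?)
print(params)  # [1, 'alice']
python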

Use Cases

Insert records into a database

Input Requirement: This component allows an incoming relationship.

Tags: sql, record, jdbc, put, database, update, insert, delete

Properties

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema.

Database Type

Database Type for generating statements specific to a particular service or vendor. The Generic Type supports most cases but selecting a specific type enables optimal processing or additional features.

Database Dialect Service

Database Dialect Service for generating statements specific to a particular service or vendor.

Statement Type

Specifies the type of SQL Statement to generate. Please refer to the database documentation for a description of the behavior of each operation. Please note that some Database Types may not support certain Statement Types. If 'Use statement.type Attribute' is chosen, then the value is taken from the statement.type attribute in the FlowFile. The 'Use statement.type Attribute' option is the only one that allows the 'SQL' statement type. If 'SQL' is specified, the value of the field specified by the 'Field Containing SQL' property is expected to be a valid SQL statement on the target database, and will be executed as-is.

Statement Type Record Path

Specifies a RecordPath to evaluate against each Record in order to determine the Statement Type. The RecordPath should equate to either INSERT, UPDATE, UPSERT, or DELETE. (Debezium style operation types are also supported: "r" and "c" for INSERT, "u" for UPDATE, and "d" for DELETE)

Data Record Path

If specified, this property denotes a RecordPath that will be evaluated against each incoming Record and the Record that results from evaluating the RecordPath will be sent to the database instead of sending the entire incoming Record. If not specified, the entire incoming Record will be published to the database.

Database Connection Pooling Service

The Controller Service that is used to obtain a connection to the database for sending records.

Catalog Name

The name of the catalog that the statement should update. This may not apply for the database that you are updating. In this case, leave the field empty. Note that if the property is set and the database is case-sensitive, the catalog name must match the database’s catalog name exactly.

Schema Name

The name of the schema that the table belongs to. This may not apply for the database that you are updating. In this case, leave the field empty. Note that if the property is set and the database is case-sensitive, the schema name must match the database’s schema name exactly.

Table Name

The name of the table that the statement should affect. Note that if the database is case-sensitive, the table name must match the database’s table name exactly.

Binary String Format

The format to be applied when decoding string values to binary.

Translate Field Names

If true, the Processor will attempt to translate field names into the appropriate column names for the table specified. If false, the field names must match the column names exactly, or the column will not be updated

Column Name Translation Strategy

The strategy used to normalize table column names. Column names will be uppercased to do case-insensitive matching irrespective of strategy

Column Name Translation Pattern

Column name will be normalized with this regular expression

Unmatched Field Behavior

If an incoming record has a field that does not map to any of the database table’s columns, this property specifies how to handle the situation

Unmatched Column Behavior

If an incoming record does not have a field mapping for all of the database table’s columns, this property specifies how to handle the situation

Update Keys

A comma-separated list of column names that uniquely identifies a row in the database for UPDATE statements. If the Statement Type is UPDATE and this property is not set, the table’s Primary Keys are used. In this case, if no Primary Key exists, the conversion to SQL will fail if Unmatched Column Behavior is set to FAIL. This property is ignored if the Statement Type is INSERT

Delete Keys

A comma-separated list of column names that uniquely identifies a row in the database for DELETE statements. If the Statement Type is DELETE and this property is not set, the table’s columns are used. This property is ignored if the Statement Type is not DELETE

Field Containing SQL

If the Statement Type is 'SQL' (as set in the statement.type attribute), this field indicates which field in the record(s) contains the SQL statement to execute. The value of the field must be a single SQL statement. If the Statement Type is not 'SQL', this field is ignored.

Allow Multiple SQL Statements

If the Statement Type is 'SQL' (as set in the statement.type attribute), this field indicates whether to split the field value by a semicolon and execute each statement separately. If any statement causes an error, the entire set of statements will be rolled back. If the Statement Type is not 'SQL', this field is ignored.

Quote Column Identifiers

Enabling this option will cause all column names to be quoted, allowing you to use reserved words as column names in your tables.

Quote Table Identifiers

Enabling this option will cause the table name to be quoted to support the use of special characters in the table name.

Max Wait Time

The maximum amount of time allowed for a running SQL statement; zero means there is no limit. A max time of less than 1 second is treated as zero.

Rollback On Failure

Specifies how to handle errors. By default (false), if an error occurs while processing a FlowFile, the FlowFile will be routed to the 'failure' or 'retry' relationship based on the error type, and the processor can continue with the next FlowFile. Instead, you may want to roll back the currently processed FlowFiles and stop further processing immediately. In that case, you can do so by enabling this 'Rollback On Failure' property. If enabled, failed FlowFiles will stay in the input relationship without being penalized and will be processed repeatedly until they are processed successfully or removed by other means. It is important to set an adequate 'Yield Duration' to avoid retrying too frequently.

Table Schema Cache Size

Specifies how many Table Schemas should be cached

Maximum Batch Size

Specifies the maximum number of SQL statements to be included in each batch sent to the database. Zero means the batch size is not limited, and all statements are put into a single batch, which can cause high memory usage for a very large number of statements.

Database Session AutoCommit

The autocommit mode to set on the database connection being used. If set to false, the operation(s) will be explicitly committed or rolled back (based on success or failure respectively). If set to true, the driver/database automatically handles the commit/rollback.

Relationships

  • success: A FlowFile is routed to this relationship after its records have been successfully written to the database.

  • failure: A FlowFile is routed to this relationship if the database cannot be updated and retrying the operation will also fail, such as an invalid query or an integrity constraint violation

  • retry: A FlowFile is routed to this relationship if the database cannot be updated but attempting the operation again may succeed

Reads Attributes

  • statement.type: If 'Use statement.type Attribute' is selected for the Statement Type property, the value of this attribute will be used to determine the type of statement (INSERT, UPDATE, DELETE, SQL, etc.) to generate and execute.

Writes Attributes

  • putdatabaserecord.error: If an error occurs during processing, the flow file will be routed to failure or retry, and this attribute will be populated with the cause of the error.

Input Requirement

This component requires an incoming relationship.

PutDistributedMapCache

Gets the content of a FlowFile and puts it to a distributed map cache, using a cache key computed from FlowFile attributes. If the cache already contains the entry and the cache update strategy is 'keep original', the entry is not replaced.

Tags: map, cache, put, distributed

Properties

Cache Entry Identifier

A FlowFile attribute, or the results of an Attribute Expression Language statement, which will be evaluated against a FlowFile in order to determine the cache key

Distributed Cache Service

The Controller Service that is used to cache flow files

Cache update strategy

Determines how the cache is updated if the cache already contains the entry

Max cache entry size

The maximum amount of data to put into cache

Relationships

  • success: Any FlowFile that is successfully inserted into cache will be routed to this relationship

  • failure: Any FlowFile that cannot be inserted into the cache will be routed to this relationship

Writes Attributes

  • cached: All FlowFiles will have an attribute 'cached'. The value of this attribute is true if the FlowFile is cached, otherwise false.

Input Requirement

This component requires an incoming relationship.

PutDropbox

Puts content to a Dropbox folder.

Tags: dropbox, storage, put

Properties

Dropbox Credential Service

Controller Service used to obtain Dropbox credentials (App Key, App Secret, Access Token, Refresh Token). See controller service’s Additional Details for more information.

Folder

The path of the Dropbox folder to upload files to. The folder will be created if it does not exist yet.

Filename

The full name of the file to upload.

Conflict Resolution Strategy

Indicates what should happen when a file with the same name already exists in the specified Dropbox folder.

Chunked Upload Threshold

The maximum size of the content which is uploaded at once. FlowFiles larger than this threshold are uploaded in chunks. Maximum allowed value is 150 MB.

Chunked Upload Size

Defines the size of a chunk. Used when a FlowFile’s size exceeds 'Chunked Upload Threshold' and content is uploaded in smaller chunks. It is recommended to specify chunked upload size smaller than 'Chunked Upload Threshold' and as multiples of 4 MB. Maximum allowed value is 150 MB.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: Files that have been successfully written to Dropbox are transferred to this relationship.

  • failure: Files that could not be written to Dropbox for some reason are transferred to this relationship.

Reads Attributes

  • filename: Uses the FlowFile’s filename as the filename for the Dropbox object.

Writes Attributes

  • error.message: The error message returned by Dropbox

  • dropbox.id: The Dropbox identifier of the file

  • path: The folder path where the file is located

  • filename: The name of the file

  • dropbox.size: The size of the file

  • dropbox.timestamp: The server modified time of the file

  • dropbox.revision: Revision of the file

Input Requirement

This component requires an incoming relationship.

PutDynamoDB

Puts a document into DynamoDB based on hash and range key. The table can have either hash and range keys or a hash key alone. Currently the supported key types are string and number, and the value can be a JSON document. When hash and range keys are used, both keys are required for the operation. The FlowFile content must be JSON. FlowFile content is mapped to the specified Json Document attribute in the DynamoDB item.

Tags: Amazon, DynamoDB, AWS, Put, Insert

Properties

Table Name

The DynamoDB table name

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Json Document attribute

The DynamoDB item attribute ('s' type in the schema) that holds the JSON document

Hash Key Name

The hash key name of the item

Range Key Name

The range key name of the item

Hash Key Value

The hash key value of the item

Range Key Value

Hash Key Value Type

The hash key value type of the item

Range Key Value Type

The range key value type of the item

Character set of document

Character set of data in the document

Batch items for each request (between 1 and 50)

The number of items to be processed in one batch

Communications Timeout

Endpoint Override URL

Endpoint URL to use instead of the AWS default, including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other compatible endpoints.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • unprocessed: FlowFiles are routed to unprocessed relationship when DynamoDB is not able to process all the items in the request. Typical reasons are insufficient table throughput capacity and exceeding the maximum bytes per request. Unprocessed FlowFiles can be retried with a new request.

Reads Attributes

  • dynamodb.item.hash.key.value: The item's hash key value

  • dynamodb.item.range.key.value: The item's range key value

Writes Attributes

  • dynamodb.key.error.unprocessed: DynamoDB unprocessed keys

  • dynmodb.range.key.value.error: DynamoDB range key error

  • dynamodb.key.error.not.found: DynamoDB key not found

  • dynamodb.error.exception.message: DynamoDB exception message

  • dynamodb.error.code: DynamoDB error code

  • dynamodb.error.message: DynamoDB error message

  • dynamodb.error.service: DynamoDB error service

  • dynamodb.error.retryable: DynamoDB error is retryable

  • dynamodb.error.request.id: DynamoDB error request id

  • dynamodb.error.status.code: DynamoDB error status code

  • dynamodb.item.io.error: IO exception message on creating item

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

PutDynamoDBRecord

Inserts items into DynamoDB based on record-oriented data. The record fields are mapped into DynamoDB item fields, including partition and sort keys if set. Depending on the number of records, the processor might execute the insert in multiple chunks in order to overcome DynamoDB’s limitation on batch writing. This might result in partially processed FlowFiles, in which case the FlowFile will be transferred to the "unprocessed" relationship with the necessary attribute to retry later without duplicating the already executed inserts.

Tags: Amazon, DynamoDB, AWS, Put, Insert, Record

Properties

Table Name

The DynamoDB table name

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema.

Partition Key Strategy

Defines the strategy the processor uses to assign partition key value to the inserted Items.

Partition Key Field

Defines the name of the partition key field in the DynamoDB table. Partition key is also known as hash key. Depending on the "Partition Key Strategy" the field value might come from the incoming Record or a generated one.

Partition Key Attribute

Specifies the FlowFile attribute that will be used as the value of the partition key when using "Partition by attribute" partition key strategy.

Sort Key Strategy

Defines the strategy the processor uses to assign sort key to the inserted Items.

Sort Key Field

Defines the name of the sort key field in the DynamoDB table. Sort key is also known as range key.

Communications Timeout

Endpoint Override URL

Endpoint URL to use instead of the AWS default, including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other compatible endpoints.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • unprocessed: FlowFiles are routed to unprocessed relationship when DynamoDB is not able to process all the items in the request. Typical reasons are insufficient table throughput capacity and exceeding the maximum bytes per request. Unprocessed FlowFiles can be retried with a new request.

Reads Attributes

  • dynamodb.chunks.processed: Number of chunks successfully inserted into DynamoDB. If not set, it is considered as 0

Writes Attributes

  • dynamodb.chunks.processed: Number of chunks successfully inserted into DynamoDB. If not set, it is considered as 0

  • dynamodb.key.error.unprocessed: DynamoDB unprocessed keys

  • dynmodb.range.key.value.error: DynamoDB range key error

  • dynamodb.key.error.not.found: DynamoDB key not found

  • dynamodb.error.exception.message: DynamoDB exception message

  • dynamodb.error.code: DynamoDB error code

  • dynamodb.error.message: DynamoDB error message

  • dynamodb.error.service: DynamoDB error service

  • dynamodb.error.retryable: DynamoDB error is retryable

  • dynamodb.error.request.id: DynamoDB error request id

  • dynamodb.error.status.code: DynamoDB error status code

  • dynamodb.item.io.error: IO exception message on creating item

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

  • NETWORK: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

Additional Details

Description

PutDynamoDBRecord provides the capability to insert multiple Items into a DynamoDB table from a record-oriented FlowFile. Compared to PutDynamoDB, this processor can process data in formats other than JSON and can add multiple fields to a given Item. PutDynamoDBRecord is also designed to insert bigger batches of data into the database.

Data types

The set of data types supported by DynamoDB does not fully overlap with the capabilities of the Record data structure, so some conversions and simplifications are necessary when inserting the data. These are (see the sketch after this list):

  • Numeric values are stored using a floating-point data structure within Items. In some cases this representation might cause issues with the accuracy.

  • Char is not a supported type within DynamoDB, these fields are converted into String values.

  • Enum types are stored as String fields, using the name of the given enum.

  • DynamoDB stores time and date related information as Strings.

  • Internal record structures are converted into maps.

  • Choice is not a supported data type, regardless of the actual wrapped data type, values enveloped in Choice are handled as Strings.

  • Unknown data types are handled as strings.
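
As a small illustration of these conversions (not the processor’s actual code; the field names and values are made up for the example):

# Illustrative conversion of a record with mixed field types into a DynamoDB-style Item.
from datetime import datetime, timezone
from enum import Enum

class Color(Enum):
    RED = 1

record = {
    "price": 9,                                   # numeric
    "initial": "x",                               # single character
    "color": Color.RED,                           # enum
    "created": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "details": {"weight": 2},                     # nested record
}

item = {
    "price": float(record["price"]),              # numerics stored as floating point
    "initial": str(record["initial"]),            # chars stored as Strings
    "color": record["color"].name,                # enums stored by name -> "RED"
    "created": record["created"].isoformat(),     # date/time stored as String
    "details": dict(record["details"]),           # internal records become maps
}
print(item)
python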

Limitations

Working with DynamoDB when batch inserting comes with two inherent limitations. First, the number of Items inserted in a single batch is limited to 25. In order to overcome this, during one execution, depending on the number of records in the incoming FlowFile, PutDynamoDBRecord might attempt multiple insert calls towards the database server. Using this approach, the flow does not have to work with this limitation in most cases.

Having multiple external actions comes with the risk of an unforeseen result at one of the steps. For example, when the incoming FlowFile consists of 70 records, it will be split into 3 chunks, with a single insert operation for every chunk. The first two chunks contain 25 Items each, and the third contains the remaining 20. In some cases it might occur that the first two insert operations succeed but the third one fails. In these cases the FlowFile is considered “partially processed” and is transferred to the “failure” or “unprocessed” relationship according to the nature of the issue. In order to keep the information about the successfully processed chunks, the processor assigns the “dynamodb.chunks.processed” attribute to the FlowFile, which has the number of successfully processed chunks as its value.

The most common reason for this behaviour comes from the other limitation inserts have with DynamoDB: the database has built-in supervision over the amount of inserted data. When a client reaches the “throughput limit”, the server refuses to process the insert request for a certain amount of time; see the DynamoDB documentation for more information. From the perspective of PutDynamoDBRecord these cases are considered temporary issues, and the FlowFile will be transferred to the “unprocessed” relationship, after which the processor will yield in order to avoid further throughput issues. (Other kinds of failures will result in transfer to the “failure” relationship.)

Retry

It is suggested to loop the “unprocessed” relationship back to the PutDynamoDBRecord processor in some way. FlowFiles transferred to that relationship are considered healthy and might be successfully processed at a later point. It is possible that a FlowFile contains such a high number of records that it needs more than two attempts to be fully inserted. The attribute “dynamodb.chunks.processed” is “rolled” through the attempts, which means that after each trigger it contains the total number of inserted chunks, making it possible for later attempts to continue from the right point without duplicating inserts. A sketch of this chunking and retry bookkeeping follows.
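
A minimal, illustrative sketch of that bookkeeping, splitting records into chunks of 25 and skipping chunks inserted by an earlier attempt (write_batch is a hypothetical stand-in, not the processor’s actual implementation):

# Illustrative sketch: chunk records into batches of 25 and resume after the
# chunks counted by the "dynamodb.chunks.processed" attribute.
def insert_in_chunks(records, write_batch, chunks_processed=0, chunk_size=25):
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    for index, chunk in enumerate(chunks):
        if index < chunks_processed:
            continue                  # skip chunks already inserted by an earlier attempt
        write_batch(chunk)            # hypothetical call that may raise on failure
        chunks_processed += 1         # would become the new attribute value on failure
    return chunks_processed
python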

Partition and sort keys

The processor supports multiple strategies for assigning partition key and sort key to the inserted Items. These are:

Partition Key Strategies

Partition By Field

The processor assigns one of the record fields as the partition key. The name of the record field is specified by the “Partition Key Field” property and the value will be the value of the record field with the same name.

Partition By Attribute

The processor assigns the value of a FlowFile attribute as the partition key. With this strategy all the Items within a FlowFile will share the same partition key value, and it is suggested to use it for tables that also have a sort key, in order to meet the primary key requirements of DynamoDB. The property “Partition Key Field” defines the name of the Item field and the property “Partition Key Attribute” specifies which attribute’s value will be assigned to the partition key. With this strategy the “Partition Key Field” must be different from the fields contained in the incoming records.

Generated UUID

By using this strategy the processor will generate a UUID identifier for every single Item. This identifier will be used as the value of the partition key. The name of the field used as partition key is defined by the property “Partition Key Field”. With this strategy the “Partition Key Field” must be different from the fields contained in the incoming records. When using this strategy, the partition key in the DynamoDB table must have String data type.

Sort Key Strategies

None

No sort key will be assigned to the Item. If the table definition expects one, using this strategy will result in unsuccessful inserts.

Sort By Field

The processor assigns one of the record fields as the sort key. The name of the record field is specified by the “Sort Key Field” property and the value will be the value of the record field with the same name.

Generate Sequence

The processor assigns a generated value to every Item based on the original record’s position in the incoming FlowFile (regardless of the chunks). The first Item will have the sort key 1, the second will have sort key 2, and so on. The generated keys are unique within a given FlowFile. The name of the record field is specified by the “Sort Key Field” property. With this strategy the “Sort Key Field” must be different from the fields contained in the incoming records. When using this strategy, the sort key in the DynamoDB table must have Number data type.

Examples
Using fields as partition and sort key

Setup

  • Partition Key Strategy: Partition By Field

  • Partition Key Field: class

  • Sort Key Strategy: Sort By Field

  • Sort Key Field: size

Note: both fields have to exist in the incoming records!

Result

Using this pair of strategies will result in Items identical to the incoming records (not counting the representational changes from the conversion). The fields specified by the properties are added to the Items normally, with the only difference being that they are flagged as (primary) key fields.

Input

[
  {
    "type": "A",
    "subtype": 4,
    "class": "t",
    "size": 1
  }
]
json

Output (stylized)

  • type: String field with value “A”

  • subtype: Number field with value 4

  • class: String field with value “t” and serving as partition key

  • size: Number field with value 1 and serving as sort key

Using FlowFile filename as partition key with generated sort key

Setup

  • Partition Key Strategy: Partition By Attribute

  • Partition Key Field: source

  • Partition Key Attribute: filename

  • Sort Key Strategy: Generate Sequence

  • Sort Key Field: sort

Result

The FlowFile’s filename attribute will be used as the partition key. In this case all the records within the same FlowFile will share the same partition key. In order to avoid collisions, if FlowFiles contain multiple records, using a sort key is suggested. In this case a generated sequence is used, which is guaranteed to be unique within a given FlowFile.

Input

[
  {
    "type": "A",
    "subtype": 4,
    "class": "t",
    "size": 1
  },
  {
    "type": "B",
    "subtype": 5,
    "class": "m",
    "size": 2
  }
]
json

Output (stylized)

First Item

  • source: String field with value “data46362.json” and serving as partition key

  • type: String field with value “A”

  • subtype: Number field with value 4

  • class: String field with value “t”

  • size: Number field with value 1

  • sort: Number field with value 1 and serving as sort key

Second Item

  • source: String field with value “data46362.json” and serving as partition key

  • type: String field with value “B”

  • subtype: Number field with value 5

  • class: String field with value “m”

  • size: Number field with value 2

  • sort: Number field with value 2 and serving as sort key

Using generated partition key

Setup

  • Partition Key Strategy: Generated UUID

  • Partition Key Field: identifier

  • Sort Key Strategy: None

Result

A generated UUID will be used as partition key. A different UUID will be generated for every Item.

Input

[
  {
    "type": "A",
    "subtype": 4,
    "class": "t",
    "size": 1
  }
]
json

Output (stylized)

  • identifier: String field with value “872ab776-ed73-4d37-a04a-807f0297e06e” and serving as partition key

  • type: String field with value “A”

  • subtype: Number field with value 4

  • class: String field with value “t”

  • size: Number field with value 1

PutElasticsearchJson

An Elasticsearch put processor that uses the official Elastic REST client libraries. Each FlowFile is treated as a document to be sent to the Elasticsearch _bulk API. Multiple FlowFiles can be batched together into each Request sent to Elasticsearch.

Tags: json, elasticsearch, elasticsearch5, elasticsearch6, elasticsearch7, elasticsearch8, put, index

Properties

Identifier Attribute

The name of the FlowFile attribute containing the identifier for the document. If the Index Operation is "index", this property may be left empty or evaluate to an empty value, in which case the document’s identifier will be auto-generated by Elasticsearch. For all other Index Operations, the attribute must evaluate to a non-empty value.

Index Operation

The type of the operation used to index (create, delete, index, update, upsert)

Index

The name of the index to use.

Type

The type of this document (used by Elasticsearch for indexing and searching).

Script

The script for the document update/upsert. Only applies to Update/Upsert operations. Must be parsable as JSON Object. If left blank, the FlowFile content will be used for document update/upsert

Scripted Upsert

Whether to add the scripted_upsert flag to the Upsert operation. Forces Elasticsearch to execute the Script whether or not the document exists; defaults to false. If the Upsert Document provided (from the FlowFile content) is empty, be sure to set the Client Service controller service’s Suppress Null/Empty Values to Never Suppress, or no "upsert" doc will be included in the request to Elasticsearch and the operation will not create a new document for the script to execute against, resulting in a "not_found" error

Dynamic Templates

The dynamic_templates for the document. Must be parsable as a JSON Object. Requires Elasticsearch 7+

Batch Size

The preferred number of FlowFiles to send over in a single batch

Character Set

Specifies the character set of the document data.

Max JSON Field String Length

The maximum allowed length of a string value when parsing a JSON document or attribute.

Client Service

An Elasticsearch client service to use for running queries.

Log Error Responses

If this is enabled, errors will be logged to the NiFi logs at the error log level. Otherwise, they will only be logged if debug logging is enabled on NiFi as a whole. The purpose of this option is to give the user the ability to debug failed operations without having to turn on debug logging.

Output Error Responses

If this is enabled, response messages from Elasticsearch marked as "error" will be output to the "error_responses" relationship. This does not impact the output of FlowFiles to the "successful" or "errors" relationships.

Treat "Not Found" as Success

If true, "not_found" Elasticsearch Document associated Records will be routed to the "successful" relationship, otherwise to the "errors" relationship. If Output Error Responses is "true" then "not_found" responses from Elasticsearch will be sent to the error_responses relationship.

Dynamic Properties

The name of the Bulk request header

Prefix: BULK: - adds the specified property name/value as a Bulk request header in the Elasticsearch Bulk API body used for processing. If the value is null or blank, the Bulk header will be omitted for the document operation. These parameters will override any matching parameters in the _bulk request body.

The name of a URL query parameter to add

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the _bulk request body. If FlowFiles are batched, only the first FlowFile in the batch is used to evaluate property values.

Relationships

  • failure: All flowfiles that fail for reasons unrelated to server availability go to this relationship.

  • errors: Record(s)/Flowfile(s) corresponding to Elasticsearch document(s) that resulted in an "error" (within Elasticsearch) will be routed here.

  • original: All flowfiles that are sent to Elasticsearch without request failures go to this relationship.

  • retry: All flowfiles that fail due to server/cluster availability go to this relationship.

  • successful: Record(s)/Flowfile(s) corresponding to Elasticsearch document(s) that did not result in an "error" (within Elasticsearch) will be routed here.

Writes Attributes

  • elasticsearch.put.error: The error message if there is an issue parsing the FlowFile, sending the parsed document to Elasticsearch or parsing the Elasticsearch response

  • elasticsearch.bulk.error: The _bulk response if there was an error during processing the document within Elasticsearch.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The Batch of FlowFiles will be stored in memory until the bulk operation is performed.

Additional Details

This processor is for accessing the Elasticsearch Bulk API. It provides the ability to configure bulk operations on a per-FlowFile basis, which is what separates it from PutElasticsearchRecord.

As part of the Elasticsearch REST API bundle, it uses a controller service to manage connection information and that controller service is built on top of the official Elasticsearch client APIs. That provides features such as automatic master detection against the cluster which is missing in the other bundles.

This processor builds one Elasticsearch Bulk API body per (batch of) FlowFiles. Care should be taken to batch FlowFiles into appropriately-sized chunks so that NiFi does not run out of memory and the requests sent to Elasticsearch are not too large for it to handle. When failures do occur, this processor is capable of attempting to route the FlowFiles that failed to an errors queue so that only failed FlowFiles can be processed downstream or replayed.

The index, operation and (optional) type fields are configured with default values. The ID (required for update, upsert and delete operations, optional for index and create) can be set as an attribute on the FlowFile(s).
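For orientation, the Bulk API body this processor assembles is newline-delimited JSON: each action metadata line is followed by the document source on the next line (delete actions carry no source line). A minimal sketch of the body for two batched FlowFiles indexed into a hypothetical test index with explicit IDs:

{ "index": { "_id": "1", "_index": "test" } }
{ "message": "Hello, world" }
{ "index": { "_id": "2", "_index": "test" } }
{ "message": "Hello again" }
json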

Dynamic Templates

Index and Create operations can use Dynamic Templates. The Dynamic Templates property must be parsable as a JSON object.

Example - Index with Dynamic Templates

{
  "message": "Hello, world"
}
json

With the Dynamic Templates property set to the following parsable JSON object:

{
  "message": "keyword_lower"
}
json

Would create Elasticsearch action:

{
  "index": {
    "_id": "1",
    "_index": "test",
    "dynamic_templates": {
      "message": "keyword_lower"
    }
  }
}
json
{
  "doc": {
    "message": "Hello, world"
  }
}
json
Update/Upsert Scripts

Update and Upsert operations can use a script. Scripts must contain all the elements required by Elasticsearch, e.g. source and lang. The Script property must be parsable as a JSON object.

If a script is defined for an upsert, the FlowFile content will be used as the upsert fields in the Elasticsearch action. If no script is defined, the FlowFile content will be used as the update doc (or doc_as_upsert for upsert operations).

Example - Update without Script

{
  "message": "Hello, world",
  "from": "john.smith"
}
json

Would create Elasticsearch action:

{
  "update": {
    "_id": "1",
    "_index": "test"
  }
}
json
{
  "doc": {
    "message": "Hello, world",
    "from": "john.smith"
  }
}
json

Example - Upsert with Script

{
  "counter": 1
}
json

With the Script property set to the following parsable JSON object:

{
  "source": "ctx._source.counter += params.param1",
  "lang": "painless",
  "params": {
    "param1": 1
  }
}
json

Would create Elasticsearch action:

{
  "update": {
    "_id": "1",
    "_index": "test"
  }
}
json
{
  "script": {
    "source": "ctx._source.counter += params.param1",
    "lang": "painless",
    "params": {
      "param1": 1
    }
  },
  "upsert": {
    "counter": 1
  }
}
json
Bulk Action Header Fields

Dynamic Properties can be defined on the processor with BULK: prefixes. Users must ensure that only known Bulk action fields are sent to Elasticsearch for the relevant index operation defined for the FlowFile; Elasticsearch will reject invalid combinations of index operation and Bulk action fields.

Example - Update with Retry on Conflict

{
  "message": "Hello, world",
  "from": "john.smith"
}
json

With the Dynamic Property below:

  • BULK:retry_on_conflict = 3

Would create Elasticsearch action:

{
  "update": {
    "_id": "1",
    "_index": "test",
    "retry_on_conflict": "3"
  }
}
json
{
  "doc": {
    "message": "Hello, world",
    "from": "john.smith"
  }
}
json
Index Operations

Valid values for “operation” are:

  • create

  • delete

  • index

  • update

  • upsert
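Of these, delete is the only operation whose Bulk entry consists of the action metadata line alone, with no accompanying document source. A sketch for a hypothetical document ID and index:

{
  "delete": {
    "_id": "1",
    "_index": "test"
  }
}
json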

PutElasticsearchRecord

A record-aware Elasticsearch put processor that uses the official Elastic REST client libraries. Each Record within the FlowFile is converted into a document to be sent to the Elasticsearch _bulk API. Multiple documents can be batched into each request sent to Elasticsearch. Each document’s Bulk operation can be configured using Record Path expressions.

Tags: json, elasticsearch, elasticsearch5, elasticsearch6, elasticsearch7, elasticsearch8, put, index, record

Properties

Index Operation

The type of the operation used to index (create, delete, index, update, upsert)

Index

The name of the index to use.

Type

The type of this document (used by Elasticsearch for indexing and searching).

@timestamp Value

The value to use as the @timestamp field (required for Elasticsearch Data Streams)

Max JSON Field String Length

The maximum allowed length of a string value when parsing a JSON document or attribute.

Client Service

An Elasticsearch client service to use for running queries.

Record Reader

The record reader to use for reading incoming records from flowfiles.

Batch Size

The number of records to send over in a single batch.

ID Record Path

A record path expression to retrieve the ID field for use with Elasticsearch. If left blank the ID will be automatically generated by Elasticsearch.

Retain ID (Record Path)

Whether to retain the existing field used as the ID Record Path.

Index Operation Record Path

A record path expression to retrieve the Index Operation field for use with Elasticsearch. If left blank the Index Operation will be determined using the main Index Operation property.

Index Record Path

A record path expression to retrieve the index field for use with Elasticsearch. If left blank the index will be determined using the main index property.

Type Record Path

A record path expression to retrieve the type field for use with Elasticsearch. If left blank the type will be determined using the main type property.

@timestamp Record Path

A RecordPath pointing to a field in the record(s) that contains the @timestamp for the document. If left blank the @timestamp will be determined using the main @timestamp property

Retain @timestamp (Record Path)

Whether to retain the existing field used as the @timestamp Record Path.

Script Record Path

A RecordPath pointing to a field in the record(s) that contains the script for the document update/upsert. Only applies to Update/Upsert operations. Field must be Map-type compatible (e.g. a Map or a Record) or a String parsable into a JSON Object

Scripted Upsert Record Path

A RecordPath pointing to a field in the record(s) that contains the scripted_upsert boolean flag, which forces Elasticsearch to execute the Script whether or not the document exists; defaults to false. If the Upsert document provided (from FlowFile content) will be empty, be sure to set the Client Service controller service’s Suppress Null/Empty Values to Never Suppress, otherwise no "upsert" doc will be included in the request to Elasticsearch and the operation will not create a new document for the script to execute against, resulting in a "not_found" error

Dynamic Templates Record Path

A RecordPath pointing to a field in the record(s) that contains the dynamic_templates for the document. Field must be Map-type compatible (e.g. a Map or Record) or a String parsable into a JSON Object. Requires Elasticsearch 7+

Date Format

Specifies the format to use when writing Date fields. If not specified, the default format 'yyyy-MM-dd' is used. If specified, the value must match the Java Simple Date Format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/25/2017).

Time Format

Specifies the format to use when writing Time fields. If not specified, the default format 'HH:mm:ss' is used. If specified, the value must match the Java Simple Date Format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when writing Timestamp fields. If not specified, the default format 'yyyy-MM-dd HH:mm:ss' is used. If specified, the value must match the Java Simple Date Format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/25/2017 18:04:15).

Log Error Responses

If this is enabled, errors will be logged to the NiFi logs at the error log level. Otherwise, they will only be logged if debug logging is enabled on NiFi as a whole. The purpose of this option is to give the user the ability to debug failed operations without having to turn on debug logging.

Output Error Responses

If this is enabled, response messages from Elasticsearch marked as "error" will be output to the "error_responses" relationship. This does not impact the output of flowfiles to the "successful" or "errors" relationships

Result Record Writer

The response from Elasticsearch will be examined for failed records and the failed records will be written to a record set with this record writer service and sent to the "errors" relationship. Successful records will be written to a record set with this record writer service and sent to the "successful" relationship.

Treat "Not Found" as Success

If true, "not_found" Elasticsearch Document associated Records will be routed to the "successful" relationship, otherwise to the "errors" relationship. If Output Error Responses is "true" then "not_found" responses from Elasticsearch will be sent to the error_responses relationship.

Group Results by Bulk Error Type

The errored records written to the "errors" relationship will be grouped by error type and the error related to the first record within the FlowFile will be added to the FlowFile as "elasticsearch.bulk.error". If "Treat "Not Found" as Success" is "false" then records associated with "not_found" Elasticsearch document responses will also be sent to the "errors" relationship.

Dynamic Properties

The name of the Bulk request header

Prefix: BULK: - adds the specified property name/value as a Bulk request header in the Elasticsearch Bulk API body used for processing. If the Record Path expression results in a null or blank value, the Bulk header will be omitted for the document operation. These parameters will override any matching parameters in the _bulk request body.

The name of a URL query parameter to add

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the _bulk request body

Relationships

  • failure: All flowfiles that fail for reasons unrelated to server availability go to this relationship.

  • errors: Record(s)/Flowfile(s) corresponding to Elasticsearch document(s) that resulted in an "error" (within Elasticsearch) will be routed here.

  • original: All flowfiles that are sent to Elasticsearch without request failures go to this relationship.

  • retry: All flowfiles that fail due to server/cluster availability go to this relationship.

  • successful: Record(s)/Flowfile(s) corresponding to Elasticsearch document(s) that did not result in an "error" (within Elasticsearch) will be routed here.

Writes Attributes

  • elasticsearch.put.error: The error message if there is an issue parsing the FlowFile records, sending the parsed documents to Elasticsearch or parsing the Elasticsearch response.

  • elasticsearch.put.error.count: The number of records that generated errors in the Elasticsearch _bulk API.

  • elasticsearch.put.success.count: The number of records that were successfully processed by the Elasticsearch _bulk API.

  • elasticsearch.bulk.error: The _bulk response if there was an error during processing the record within Elasticsearch.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The Batch of Records will be stored in memory until the bulk operation is performed.

Additional Details

This processor is for accessing the Elasticsearch Bulk API. It provides the ability to configure bulk operations on a per-record basis which is what separates it from PutElasticsearchJson. For example, it is possible to define multiple commands to index documents, followed by deletes, creates and update operations against the same index or other indices as desired.

As part of the Elasticsearch REST API bundle, it uses a controller service to manage connection information and that controller service is built on top of the official Elasticsearch client APIs. That provides features such as automatic master detection against the cluster which is missing in the other bundles.

This processor builds one Elasticsearch Bulk API body per record set. Care should be taken to split up record sets into appropriately-sized chunks so that NiFi does not run out of memory and the requests sent to Elasticsearch are not too large for it to handle. When failures do occur, this processor is capable of attempting to write the records that failed to an output record writer so that only failed records can be processed downstream or replayed.

Per-Record Actions

The index, operation and (optional) type fields are configured with default values that can be overridden using record path operations that find an index or type value in the record set. The ID and operation type (create, index, update, upsert or delete) can also be extracted in a similar fashion from the record set. A “@timestamp” field can be added to the data either using a default or by extracting it from the record set. This is useful if the documents are being indexed into an Elasticsearch Data Stream.

Example - per-record actions

The following is an example of a document exercising all of these features:

{
  "metadata": {
    "id": "12345",
    "index": "test",
    "type": "message",
    "operation": "index"
  },
  "message": "Hello, world",
  "from": "john.smith",
  "ts": "2021-12-03'T'14:00:00.000Z"
}
json
{
  "metadata": {
    "id": "12345",
    "index": "test",
    "type": "message",
    "operation": "delete"
  }
}
json

The record path operations below would extract the relevant data:

  • /metadata/id

  • /metadata/index

  • /metadata/type

  • /metadata/operation

  • /ts
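With those paths configured, the two example records above would produce one index action and one delete action. A sketch of the resulting Bulk action metadata lines (document source, type and @timestamp handling omitted for brevity):

{ "index": { "_id": "12345", "_index": "test" } }
{ "delete": { "_id": "12345", "_index": "test" } }
json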

Dynamic Templates

Index and Create operations can use Dynamic Templates from the Record; a record path operation can be configured to find the Dynamic Templates within the record set. Dynamic Templates fields in Records must either be a Map, a child Record or a string that can be parsed as a JSON object.

Example - Index with Dynamic Templates

{
  "message": "Hello, world",
  "dynamic_templates": "{\"message\": \"keyword_lower\"}"
}
json

The record path operation below would extract the relevant Dynamic Templates:

  • /dynamic_templates

Would create Elasticsearch action:

{
  "index": {
    "_id": "1",
    "_index": "test",
    "dynamic_templates": {
      "message": "keyword_lower"
    }
  }
}
json
{
  "doc": {
    "message": "Hello, world"
  }
}
json
Update/Upsert Scripts

Update and Upsert operations can use a script from the Record; a record path operation can be configured to find the script within the record set. Scripts must contain all the elements required by Elasticsearch, e.g. source and lang. Script fields in Records must either be a Map, a child Record or a string that can be parsed as a JSON object.

If a script is defined for an upsert, any fields remaining in the Record will be used as the upsert fields in the Elasticsearch action. If no script is defined, all Record fields will be used as the update doc (or doc_as_upsert for upsert operations).

Example - Update without Script

{
  "message": "Hello, world",
  "from": "john.smith"
}
json

Would create Elasticsearch action:

{
  "update": {
    "_id": "1",
    "_index": "test"
  }
}
json
{
  "doc": {
    "message": "Hello, world",
    "from": "john.smith"
  }
}
json

Example - Upsert with Script

{
  "counter": 1,
  "script": {
    "source": "ctx._source.counter += params.param1",
    "lang": "painless",
    "params": {
      "param1": 1
    }
  }
}
json

The record path operation below would extract the relevant script:

  • /script

Would create Elasticsearch action:

{
  "update": {
    "_id": "1",
    "_index": "test"
  }
}
json
{
  "script": {
    "source": "ctx._source.counter += params.param1",
    "lang": "painless",
    "params": {
      "param1": 1
    }
  },
  "upsert": {
    "counter": 1
  }
}
json
Bulk Action Header Fields

Dynamic Properties can be defined on the processor with BULK: prefixes. The value of the Dynamic Property is a record path operation to find the field value from the record set. Users must ensure that only known Bulk action fields are sent to Elasticsearch for the relevant index operation defined for the Record; Elasticsearch will reject invalid combinations of index operation and Bulk action fields.

Example - Update with Retry on Conflict

{
  "message": "Hello, world",
  "from": "john.smith",
  "retry": 3
}
json

The Dynamic Property and record path operation below would extract the relevant field:

  • BULK:retry_on_conflict = /retry

Would create Elasticsearch action:

{
  "update": {
    "_id": "1",
    "_index": "test",
    "retry_on_conflict": 3
  }
}
json
{
  "doc": {
    "message": "Hello, world",
    "from": "john.smith"
  }
}
json
Index Operations

Valid values for “operation” are:

  • create

  • delete

  • index

  • update

  • upsert

PutEmail

Sends an e-mail to configured recipients for each incoming FlowFile

Tags: email, put, notify, smtp

Properties

SMTP Hostname

The hostname of the SMTP host

SMTP Port

The Port used for SMTP communications

Authorization Mode

How to authorize sending email on the user’s behalf.

OAuth2 Access Token Provider

OAuth2 service that can provide access tokens.

SMTP Username

Username for the SMTP account

SMTP Password

Password for the SMTP account

SMTP Auth

Flag indicating whether authentication should be used

SMTP STARTTLS

Flag indicating whether Opportunistic TLS should be enabled using STARTTLS command

SMTP Socket Factory

Socket Factory to use for SMTP Connection

SMTP X-Mailer Header

X-Mailer used in the header of the outgoing email

Attributes to Send as Headers (Regex)

A Regular Expression that is matched against all FlowFile attribute names. Any attribute whose name matches the regex will be added to the Email messages as a Header. If not specified, no FlowFile attributes will be added as headers.

Content Type

Mime Type used to interpret the contents of the email, such as text/plain or text/html

From

Specifies the Email address to use as the sender. Comma separated sequence of addresses following RFC822 syntax.

To

The recipients to include in the To-Line of the email. Comma separated sequence of addresses following RFC822 syntax.

CC

The recipients to include in the CC-Line of the email. Comma separated sequence of addresses following RFC822 syntax.

BCC

The recipients to include in the BCC-Line of the email. Comma separated sequence of addresses following RFC822 syntax.

Reply-To

The recipients that will receive the reply instead of the From address (see RFC2822 §3.6.2). This feature is useful, for example, when the email is sent by a no-reply account. This field is optional. Comma separated sequence of addresses following RFC822 syntax.

Subject

The email subject

Message

The body of the email message

Flow file content as message

Specifies whether or not the FlowFile content should be the message of the email. If true, the 'Message' property is ignored.

Input Character Set

Specifies the character set of the FlowFile contents for reading input FlowFile contents to generate the message body or as an attachment to the message. If not set, UTF-8 will be the default value.

Attach File

Specifies whether or not the FlowFile content should be attached to the email

Include All Attributes In Message

Specifies whether or not all FlowFile attributes should be recorded in the body of the email message

Dynamic Properties

mail.propertyName

Dynamic property names that will be passed to the Mail session. Possible properties can be found in: https://javaee.github.io/javamail/docs/api/com/sun/mail/smtp/package-summary.html.
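For example, SMTP session properties from that package can be supplied as dynamic properties; the values shown below are purely illustrative:

  • mail.smtp.connectiontimeout = 30000

  • mail.smtp.timeout = 30000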

Relationships

  • success: FlowFiles that are successfully sent will be routed to this relationship

  • failure: FlowFiles that fail to send will be routed to this relationship

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The entirety of the FlowFile’s content (as a String object) will be read into memory in case the property to use the flow file content as the email body is set to true.

Additional Details

OAuth Authorization Mode

PutEmail can use OAuth2. The exact configuration may depend on the email provider.

OAuth with Gmail
Configure Gmail OAuth Client

The Gmail OAuth client can be used to send email on behalf of multiple different Gmail accounts, so this only needs to be done once.

  1. In the Google Development Console, create a project (if you don’t have one yet)

  2. Configure OAuth consent

  3. Create OAuth client. Select Desktop app as Application type. When the client has been created, take note of the Client ID and Client secret values as they will be needed later.

Retrieve Token for NiFi

Tokens are provided once the owner of the Gmail account has consented to the previously created client sending emails on their behalf. Consequently, this needs to be done for every Gmail account.

  1. Go to the following web page:

    Replace CLIENT_ID at the end with your Client ID.

  2. You may need to select the Google Account for which you want to consent. Click Continue twice.

  3. A page will appear with an Authorisation code that will have a message at the bottom like this:

    Authorisation code

    Please copy this code, switch to your application and paste it there:

    AUTHORISATION_CODE

  4. Execute the following command from terminal to fetch the access and refresh tokens.
    In case the curl command returns an error, please try again from step 1.

    curl https://accounts.google.com/o/oauth2/token -d grant_type=authorization_code -d redirect_uri="urn:ietf:wg:oauth:2.0:oob" -d client_id=CLIENT_ID -d client_secret=CLIENT_SECRET -d code=AUTHORISATION_CODE

    Replace CLIENT_ID, CLIENT_SECRET and AUTHORISATION_CODE with your values.

  5. The curl command returns a JSON response which contains the access token and refresh token:

{
  "access_token": "ACCESS_TOKEN",
  "expires_in": 3599,
  "refresh_token": "REFRESH_TOKEN",
  "scope": "https://mail.google.com/",
  "token_type": "Bearer"
}
json
Configure Token in NiFi
  1. On the PutEmail processor in the Authorization Mode property select Use OAuth2.

  2. In the OAuth2 Access Token Provider property select/create a StandardOauth2AccessTokenProvider controller service.

  3. On the StandardOauth2AccessTokenProvider controller service in the Grant Type property select Refresh Token.

  4. In the Refresh Token property enter the REFRESH_TOKEN returned by the curl command.

  5. In the Authorization Server URL enter

  6. Also fill in the Client ID and Client secret properties.

PutFile

Writes the contents of a FlowFile to the local file system

Tags: put, local, copy, archive, files, filesystem

Properties

Directory

The directory to which files should be written. You may use expression language such as /aa/bb/${path}

Conflict Resolution Strategy

Indicates what should happen when a file with the same name already exists in the output directory

Create Missing Directories

If true, then missing destination directories will be created. If false, flowfiles are penalized and sent to failure.

Maximum File Count

Specifies the maximum number of files that can exist in the output directory

Last Modified Time

Sets the lastModifiedTime on the output file to the value of this attribute. Format must be yyyy-MM-dd'T'HH:mm:ssZ. You may also use expression language such as ${file.lastModifiedTime}.

Permissions

Sets the permissions on the output file to the value of this attribute. Format must be either UNIX rwxrwxrwx with a - in place of denied permissions (e.g. rw-r--r--) or an octal number (e.g. 644). You may also use expression language such as ${file.permissions}.

Owner

Sets the owner on the output file to the value of this attribute. You may also use expression language such as ${file.owner}. Note that on many operating systems NiFi must be running as a super-user to have the permissions to set the file owner.

Group

Sets the group on the output file to the value of this attribute. You may also use expression language such as ${file.group}.

Relationships

  • success: Files that have been successfully written to the output directory are transferred to this relationship

  • failure: Files that could not be written to the output directory for some reason are transferred to this relationship

Reads Attributes

  • filename: The filename to use when writing the FlowFile to disk.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

See Also

PutFTP

Sends FlowFiles to an FTP Server

Tags: remote, copy, egress, put, ftp, archive, files

Properties

Hostname

The fully qualified hostname or IP address of the remote system

Port

The port that the remote system is listening on for file transfers

Username

Username

Password

Password for the user account

Remote Path

The path on the remote system from which to pull or push files

Create Directory

Specifies whether or not the remote directory should be created if it does not exist.

Batch Size

The maximum number of FlowFiles to send in a single connection

Connection Timeout

Amount of time to wait before timing out while creating a connection

Data Timeout

When transferring a file between the local and remote system, this value specifies how long is allowed to elapse without any data being transferred between systems

Conflict Resolution

Determines how to handle the problem of filename collisions

Dot Rename

If true, then the filename of the sent file is prepended with a "." and then renamed back to the original once the file is completely sent. Otherwise, there is no rename. This property is ignored if the Temporary Filename property is set.

Temporary Filename

If set, the filename of the sent file will be equal to the value specified during the transfer and after successful completion will be renamed to the original filename. If this value is set, the Dot Rename property is ignored.

Transfer Mode

The FTP Transfer Mode

Connection Mode

The FTP Connection Mode

Reject Zero-Byte Files

Determines whether or not Zero-byte files should be rejected without attempting to transfer

Last Modified Time

The lastModifiedTime to assign to the file after transferring it. If not set, the lastModifiedTime will not be changed. Format must be yyyy-MM-dd'T'HH:mm:ssZ. You may also use expression language such as ${file.lastModifiedTime}. If the value is invalid, the processor will not be invalid but will fail to change lastModifiedTime of the file.

Permissions

The permissions to assign to the file after transferring it. Format must be either UNIX rwxrwxrwx with a - in place of denied permissions (e.g. rw-r--r--) or an octal number (e.g. 644). If not set, the permissions will not be changed. You may also use expression language such as ${file.permissions}. If the value is invalid, the processor will not be invalid but will fail to change permissions of the file.

Use Compression

Indicates whether or not ZLIB compression should be used when transferring files

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN, SOCKS + AuthN

Internal Buffer Size

Set the internal buffer size for buffered data streams

Use UTF-8 Encoding

Tells the client to use UTF-8 encoding when processing files and filenames. If set to true, the server must also support UTF-8 encoding.

Dynamic Properties

pre.cmd._

The command specified in the key will be executed before doing a put. You may add these optional properties to send any commands to the FTP server before the file is actually transferred (before the put command). This option is only available for the PutFTP processor, as only FTP has this functionality. This is essentially the same as sending quote commands to an FTP server from the command line. While this is the same as sending a quote command, it is very important that you leave off the quote keyword itself.

post.cmd._

The command specified in the key will be executed after doing a put. You may add these optional properties to send any commands to the FTP server after the file is actually transferred (after the put command). This option is only available for the PutFTP processor, as only FTP has this functionality. This is essentially the same as sending quote commands to an FTP server from the command line. While this is the same as sending a quote command, it is very important that you leave off the quote keyword itself.
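For illustration, a dynamic property such as the hypothetical example below would issue a SITE command to the server before each transfer; note that the value is the raw FTP command with no quote keyword in front of it (the property name suffix is an assumption):

  • pre.cmd.1 = SITE UMASK 022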

Relationships

  • success: FlowFiles that are successfully sent will be routed to success

  • failure: FlowFiles that failed to send to the remote system; failure is usually looped back to this processor

  • reject: FlowFiles that were rejected by the destination system

Input Requirement

This component requires an incoming relationship.

See Also

PutGCSObject

Writes the contents of a FlowFile as an object in a Google Cloud Storage.

Tags: google, google cloud, gcs, archive, put

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Project ID

Google Cloud Project ID

Bucket

Bucket of the object.

Key

Name of the object.

Resource Transfer Source

The source of the content to be transferred

File Resource Service

File Resource Service providing access to the local resource to be transferred

Content Type

Content Type for the file, e.g. text/plain

CRC32C Checksum

CRC32C Checksum (encoded in Base64, big-Endian order) of the file for server-side validation.

Object ACL

Access Control to be attached to the object uploaded. Not providing this will revert to bucket defaults.

Server Side Encryption Key

An AES256 Encryption Key (encoded in base64) for server-side encryption of the object.

Overwrite Object

If false, the upload to GCS will succeed only if the object does not exist.

Content Disposition Type

Type of RFC-6266 Content Disposition to be attached to the object

GZIP Compression Enabled

Signals to the GCS Blob Writer whether GZIP compression during transfer is desired. False means do not gzip and can boost performance in many cases.

Storage API URL

Overrides the default storage URL. Configuring an alternative Storage API URL also overrides the HTTP Host header on requests as described in the Google documentation for Private Service Connections.

Number of retries

How many retry attempts should be made before routing to the failure relationship.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Dynamic Properties

The name of a User-Defined Metadata field to add to the GCS Object

Allows user-defined metadata to be added to the GCS object as key/value pairs

Relationships

  • success: FlowFiles are routed to this relationship after a successful Google Cloud Storage operation.

  • failure: FlowFiles are routed to this relationship if the Google Cloud Storage operation fails.

Reads Attributes

  • filename: Uses the FlowFile’s filename as the filename for the GCS object

  • mime.type: Uses the FlowFile’s MIME type as the content-type for the GCS object

Writes Attributes

  • gcs.bucket: Bucket of the object.

  • gcs.key: Name of the object.

  • gcs.size: Size of the object.

  • gcs.cache.control: Data cache control of the object.

  • gcs.component.count: The number of components which make up the object.

  • gcs.content.disposition: The data content disposition of the object.

  • gcs.content.encoding: The content encoding of the object.

  • gcs.content.language: The content language of the object.

  • mime.type: The MIME/Content-Type of the object

  • gcs.crc32c: The CRC32C checksum of object’s data, encoded in base64 in big-endian order.

  • gcs.create.time: The creation time of the object (milliseconds)

  • gcs.update.time: The last modification time of the object (milliseconds)

  • gcs.encryption.algorithm: The algorithm used to encrypt the object.

  • gcs.encryption.sha256: The SHA256 hash of the key used to encrypt the object

  • gcs.etag: The HTTP 1.1 Entity tag for the object.

  • gcs.generated.id: The service-generated ID for the object

  • gcs.generation: The data generation of the object.

  • gcs.md5: The MD5 hash of the object’s data encoded in base64.

  • gcs.media.link: The media download link to the object.

  • gcs.metageneration: The metageneration of the object.

  • gcs.owner: The owner (uploader) of the object.

  • gcs.owner.type: The ACL entity type of the uploader of the object.

  • gcs.uri: The URI of the object as a string.

Input Requirement

This component requires an incoming relationship.

PutGoogleDrive

Writes the contents of a FlowFile as a file in Google Drive.

Tags: google, drive, storage, put

Properties

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Folder ID

The ID of the shared folder. Please see Additional Details to set up access to Google Drive and obtain Folder ID.

Filename

The name of the file to upload to the specified Google Drive folder.

Conflict Resolution Strategy

Indicates what should happen when a file with the same name already exists in the specified Google Drive folder.

Chunked Upload Threshold

The maximum size of the content which is uploaded at once. FlowFiles larger than this threshold are uploaded in chunks.

Chunked Upload Size

Defines the size of a chunk. Used when a FlowFile’s size exceeds 'Chunked Upload Threshold' and content is uploaded in smaller chunks. Minimum allowed chunk size is 256 KB, maximum allowed chunk size is 1 GB.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: Files that have been successfully written to Google Drive are transferred to this relationship.

  • failure: Files that could not be written to Google Drive for some reason are transferred to this relationship.

Reads Attributes

  • filename: Uses the FlowFile’s filename as the filename for the Google Drive object.

Writes Attributes

  • drive.id: The id of the file

  • filename: The name of the file

  • mime.type: The MIME type of the file

  • drive.size: The size of the file

  • drive.timestamp: The last modified time or created time (whichever is greater) of the file. The original modified date of a file is preserved when it is uploaded to Google Drive, while 'Created time' reflects when the upload occurred. However, uploaded files can still be modified later.

  • error.code: The error code returned by Google Drive

  • error.message: The error message returned by Google Drive

Input Requirement

This component requires an incoming relationship.

Additional Details

Accessing Google Drive from NiFi

This processor uses Google Cloud credentials for authentication to access Google Drive. The following steps are required to prepare the Google Cloud and Google Drive accounts for the processors:

  1. Enable Google Drive API in Google Cloud

  2. Grant access to Google Drive folder

    • In Google Cloud Console navigate to IAM & Admin → Service Accounts.

    • Take a note of the email of the service account you are going to use.

    • Navigate to the folder in Google Drive which will be used as the base folder.

    • Right-click on the Folder → Share.

    • Enter the service account email.

  3. Find Folder ID

    • Navigate to the folder to be listed in Google Drive and enter it. The URL in your browser will include the ID at the end of the URL. For example, if the URL were https://drive.google.com/drive/folders/1trTraPVCnX5_TNwO8d9P_bz278xWOmGm, the Folder ID would be 1trTraPVCnX5_TNwO8d9P_bz278xWOmGm

  4. Set Folder ID in ‘Folder ID’ property

PutGridFS

Writes a file to a GridFS bucket.

Tags: mongo, gridfs, put, file, store

Properties

Client Service

The MongoDB client service to use for database connections.

Mongo Database Name

The name of the database to use

Bucket Name

The GridFS bucket where the files will be stored. If left blank, it will use the default value 'fs' that the MongoDB client driver uses.

File Name

The name of the file in the bucket that is the target of this processor. GridFS file names do not include path information because GridFS does not sort files into folders within a bucket.

File Properties Prefix

Attributes that have this prefix will be added to the file stored in GridFS as metadata.

Enforce Uniqueness

When enabled, this option will ensure that uniqueness is enforced on the bucket. It will do so by creating a MongoDB index that matches your selection. It should ideally be configured once when the bucket is created for the first time because it could take a long time to build on an existing bucket with a lot of data.

Hash Attribute

If uniqueness enforcement is enabled and the file hash is part of the constraint, this must be set to an attribute that exists on all incoming flowfiles.

Chunk Size

Controls the maximum size of each chunk of a file uploaded into GridFS.

Relationships

  • success: When the operation succeeds, the flowfile is sent to this relationship.

  • failure: When there is a failure processing the flowfile, it goes to this relationship.

  • duplicate: Flowfiles that fail the duplicate check are sent to this relationship.

Input Requirement

This component requires an incoming relationship.

Additional Details

Description:

This processor puts a file with one or more user-defined metadata values into GridFS in the configured bucket. It allows the user to define how big each file chunk will be during ingestion and provides some ability to intelligently attempt to enforce file uniqueness using filename or hash values instead of just relying on a database index.

GridFS File Attributes

PutGridFS allows for flowfile attributes that start with a configured prefix to be added to the GridFS document. These can be very useful later when working with GridFS for providing metadata about a file.

Chunk Size

GridFS splits a file into chunks stored as Mongo documents as the file is ingested into the database. The chunk size configuration parameter configures the maximum size of each chunk. This field should be left at its default value unless there is a specific business case to increase or decrease it.

Uniqueness Enforcement

There are four operating modes:

  • No enforcement at the application level.

  • Enforce by unique file name.

  • Enforce by unique hash value.

  • Use both hash and file name.

The hash value by default is taken from the attribute hash.value which can be generated by configuring a HashContent processor upstream of PutGridFS. Both this and the name option use a query on the existing data to see if a file matching that criteria exists before attempting to write the flowfile contents.

PutHDFS

Write FlowFile data to Hadoop Distributed File System (HDFS)

Tags: hadoop, HCFS, HDFS, put, copy, filesystem

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Directory

The parent HDFS directory to which files should be written. The directory will be created if it doesn’t exist.

Conflict Resolution Strategy

Indicates what should happen when a file with the same name already exists in the output directory

Append Mode

Defines the append strategy to use when the Conflict Resolution Strategy is set to 'append'.

Writing Strategy

Defines the approach for writing the FlowFile data.

Block Size

Size of each block as written to HDFS. This overrides the Hadoop Configuration

IO Buffer Size

Amount of memory to use to buffer file contents during IO. This overrides the Hadoop Configuration

Replication

Number of times that HDFS will replicate each file. This overrides the Hadoop Configuration

Permissions umask

A umask represented as an octal number which determines the permissions of files written to HDFS. This overrides the Hadoop property "fs.permissions.umask-mode". If this property and "fs.permissions.umask-mode" are undefined, the Hadoop default "022" will be used. If the PutHDFS target folder has a default ACL defined, the umask property is ignored by HDFS.

Remote Owner

Changes the owner of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change owner

Remote Group

Changes the group of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change group

Compression codec

Ignore Locality

Directs the HDFS system to ignore locality rules so that data is distributed randomly throughout the cluster

Resource Transfer Source

The source of the content to be transferred

File Resource Service

File Resource Service providing access to the local resource to be transferred

Relationships

  • success: Files that have been successfully written to HDFS are transferred to this relationship

  • failure: Files that could not be written to HDFS for some reason are transferred to this relationship

Reads Attributes

  • filename: The name of the file written to HDFS comes from the value of this attribute.

Writes Attributes

  • filename: The name of the file written to HDFS is stored in this attribute.

  • absolute.hdfs.path: The absolute path to the file on HDFS is stored in this attribute.

  • hadoop.file.url: The hadoop url for the file is stored in this attribute.

  • target.dir.created: The result(true/false) indicates if the folder is created by the processor.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

See Also

Additional Details

SSL Configuration:

Hadoop provides the ability to configure keystore and/or truststore properties. If you want to use an SSL-secured file system such as swebhdfs, you can use the Hadoop configurations instead of an SSL Context Service.

  1. create ‘ssl-client.xml’ to configure the truststores.

ssl-client.xml Properties:

Property | Default Value | Explanation
ssl.client.truststore.type | jks | Truststore file type
ssl.client.truststore.location | NONE | Truststore file location
ssl.client.truststore.password | NONE | Truststore file password
ssl.client.truststore.reload.interval | 10000 | Truststore reload interval, in milliseconds

ssl-client.xml Example:

<configuration>
    <property>
        <name>ssl.client.truststore.type</name>
        <value>jks</value>
    </property>
    <property>
        <name>ssl.client.truststore.location</name>
        <value>/path/to/truststore.jks</value>
    </property>
    <property>
        <name>ssl.client.truststore.password</name>
        <value>clientfoo</value>
    </property>
    <property>
        <name>ssl.client.truststore.reload.interval</name>
        <value>10000</value>
    </property>
</configuration>
xml
  2. put ‘ssl-client.xml’ in a location that will be found on the classpath, such as under the NiFi configuration directory.

  3. set the name of ‘ssl-client.xml’ as the value of hadoop.ssl.client.conf in the ‘core-site.xml’ used by the HDFS processors.

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>swebhdfs://{namenode.hostname:port}</value>
    </property>
    <property>
        <name>hadoop.ssl.client.conf</name>
        <value>ssl-client.xml</value>
    </property>
</configuration>
xml

PutKinesisFirehose

Sends the contents to a specified Amazon Kinesis Firehose. In order to send data to Firehose, the Firehose delivery stream name has to be specified.

Tags: amazon, aws, firehose, kinesis, put, stream

Properties

Amazon Kinesis Firehose Delivery Stream Name

The name of the Kinesis Firehose delivery stream

Batch Size

Batch size for messages (1-500).

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Max message buffer size

Max message buffer

Communications Timeout

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

Writes Attributes

  • aws.kinesis.firehose.error.message: Error message on posting message to AWS Kinesis Firehose

  • aws.kinesis.firehose.error.code: Error code for the message when posting to AWS Kinesis Firehose

  • aws.kinesis.firehose.record.id: Record id of the message posted to Kinesis Firehose

Input Requirement

This component requires an incoming relationship.

PutKinesisStream

Sends the contents to a specified Amazon Kinesis. In order to send data to Kinesis, the stream name has to be specified.

Tags: amazon, aws, kinesis, put, stream

Properties

Amazon Kinesis Stream Name

The name of Kinesis Stream

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Amazon Kinesis Stream Partition Key

The partition key attribute. If it is not set, a random value is used

Message Batch Size

Batch size for messages (1-500).

Max message buffer size (MB)

Max message buffer size in Mega-bytes

Communications Timeout

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

Writes Attributes

  • aws.kinesis.error.message: Error message on posting message to AWS Kinesis

  • aws.kinesis.error.code: Error code for the message when posting to AWS Kinesis

  • aws.kinesis.sequence.number: Sequence number for the message when posting to AWS Kinesis

  • aws.kinesis.shard.id: Shard id of the message posted to AWS Kinesis

Input Requirement

This component requires an incoming relationship.

PutLambda

Sends the contents to a specified Amazon Lambda Function. The AWS credentials used for authentication must have permission to execute the Lambda function (lambda:InvokeFunction). The FlowFile content must be JSON.

Tags: amazon, aws, lambda, put

Properties

Amazon Lambda Name

The Lambda Function Name

Amazon Lambda Qualifier (version)

The Lambda Function Version

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Communications Timeout

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

Writes Attributes

  • aws.lambda.result.function.error: Function error message in result on posting message to AWS Lambda

  • aws.lambda.result.status.code: Status code in the result for the message when posting to AWS Lambda

  • aws.lambda.result.payload: Payload in the result from AWS Lambda

  • aws.lambda.result.log: Log in the result of the message posted to Lambda

  • aws.lambda.exception.message: Exception message on invoking from AWS Lambda

  • aws.lambda.exception.cause: Exception cause on invoking from AWS Lambda

  • aws.lambda.exception.error.code: Exception error code on invoking from AWS Lambda

  • aws.lambda.exception.request.id: Exception request id on invoking from AWS Lambda

  • aws.lambda.exception.status.code: Exception status code on invoking from AWS Lambda

Input Requirement

This component requires an incoming relationship.

PutMetro

"Puts" FlowFiles on the "Metro Line", i.e. keeps FlowFiles queued in this processor’s incoming connection(s) until processed by another processor using with the same Metro Line Controller as this processor. Such processors can process a copy of a queued FlowFile, at which point the queued FlowFile will be routed to the 'success' relationship. Use dynamic properties to add attributes to queued FlowFiles - e.g. to map a common error reason that can then be evaluated after a ExitMetro processor.

Tags: virtimo, metro

Properties

Metro Line Controller

The processor uses this controller’s Metro Line to connect with GetMetro/ExitMetro processors.

Dynamic Properties

Attribute name

Attribute to be created. For example, to map a common error reason that can then be evaluated after the ExitMetro processor.

Relationships

  • success: The FlowFile was successfully transferred via metro.

Writes Attributes

  • metro.processed: Epoch milliseconds when FlowFile is cached.

  • <Dynamic Property’s name>: Use dynamic properties to add attributes with the dynamic property’s name and value.

Input Requirement

This component requires an incoming relationship.

PutMongo

Writes the contents of a FlowFile to MongoDB

Tags: mongodb, insert, update, write, put

Properties

Client Service

If configured, this property will use the assigned client service for connection pooling.

Mongo Database Name

The name of the database to use

Mongo Collection Name

The name of the collection to use

Mode

Indicates whether the processor should insert or update content

Upsert

When true, inserts a document if no document matches the update query criteria; this property is valid only when using update mode, otherwise it is ignored

Update Query Key

One or more comma-separated document key names used to build the update query criteria, such as _id

Update Query

Specify a full MongoDB query to be used for the lookup query to do an update/upsert. NOTE: this field is ignored if the 'Update Query Key' value is not empty.

Update Mode

Choose an update mode. You can either supply a JSON document to use as a direct replacement or specify a document that contains update operators like $set, $unset, and $inc. When Operators mode is enabled, the flowfile content is expected to be the operator part for example: {$set:{"key": "value"},$inc:{"count":1234}} and the update query will come from the configured Update Query property.

Update Method

MongoDB method for running collection update operations, such as updateOne or updateMany

Character Set

The Character Set in which the data is encoded

Relationships

  • success: All FlowFiles that are written to MongoDB are routed to this relationship

  • failure: All FlowFiles that cannot be written to MongoDB are routed to this relationship

Writes Attributes

  • mongo.put.update.match.count: The match count from result if update/upsert is performed, otherwise not set.

  • mongo.put.update.modify.count: The modify count from result if update/upsert is performed, otherwise not set.

  • mongo.put.upsert.id: The '_id' hex value if upsert is performed, otherwise not set.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

Additional Details

Description:

This processor is a general purpose processor for inserting, upserting and updating MongoDB collections.

Inserting Documents

Each flowfile is assumed to contain only a single MongoDB document to be inserted. The contents must be valid JSON; note that the relaxed syntax accepted by the Mongo shell should not be confused with valid JSON. This processor does not support batch writes at this time.
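For instance, a flowfile whose entire content is a single valid JSON object such as the following would be inserted as one document:

{
  "message": "Hello, world",
  "from": "john.smith"
}
json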

Updating and Upserting
Update Modes

There are two methods for choosing what gets written to a document when updating:

  • Whole document - the entire document is replaced with the contents of the flowfile.

  • With Operators Enabled - the document in the flowfile content will be assumed to have update operators such as $set and will be used to update particular fields. The whole document will not be replaced.

There are two ways to update:

  • Update Key - use one or more keys from the document.

  • Update Query - use a totally separate query that is not derived from the document.

Update Key

The update key method takes keys from the document and builds a query from them. It will attempt to parse the _id field as an ObjectID type if that is one of the keys that is specified in the configuration field. Multiple keys can be specified by separating them with commas. This configuration field supports Expression Language, so it can be derived in part or entirely from flowfile attributes.

Update Query

The update query method takes a valid JSON document as its value and uses it to find one or more documents to update. This field supports Expression Language, so it can be derived in part or entirely from flowfile attributes. It is possible, for instance, to put an attribute named update_query on a flowfile and specify ${update_query} in the configuration field, so it will load the value from the flowfile.
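Putting these together as a sketch: with Operators mode enabled, an Update Query such as the first object below combined with flowfile content such as the second would update only the listed fields of the matching document rather than replacing it (all values are illustrative):

{
  "_id": "12345"
}
json
{
  "$set": {
    "message": "Hello, world"
  },
  "$inc": {
    "count": 1
  }
}
json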

Upserts

If the upsert mode is enabled and no existing document matches the search criteria (be it a user-supplied query or one built from update keys), PutMongo will insert a new document with the properties that are specified in the JSON document provided in the flowfile content. This feature should be used carefully, as it can result in incomplete data being added to MongoDB.

PutMongoBulkOperations

Writes the contents of a FlowFile to MongoDB as bulk-update

Tags: mongodb, insert, update, write, put, bulk

Properties

Client Service

If configured, this property will use the assigned client service for connection pooling.

Mongo Database Name

The name of the database to use

Mongo Collection Name

The name of the collection to use

Ordered

Ordered execution of bulk-writes and break on error - otherwise arbitrary order and continue on error

Character Set

The Character Set in which the data is encoded

Relationships

  • success: All FlowFiles that are written to MongoDB are routed to this relationship

  • failure: All FlowFiles that cannot be written to MongoDB are routed to this relationship

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

PutMongoRecord

This processor is a record-aware processor for inserting/upserting data into MongoDB. It uses a configured record reader and schema to read an incoming record set from the body of a flowfile and then inserts/upserts batches of those records into a configured MongoDB collection. This processor does not support deletes. The number of documents to insert/upsert at a time is controlled by the "Batch Size" configuration property. This value should be set to a reasonable size to ensure that MongoDB is not overloaded with too many operations at once.

Tags: mongodb, insert, update, upsert, record, put

Properties

Client Service

If configured, this property will use the assigned client service for connection pooling.

Mongo Database Name

The name of the database to use

Mongo Collection Name

The name of the collection to use

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema

Batch Size

The number of records to group together for one single insert/upsert operation against MongoDB.

Ordered

Perform ordered or unordered operations

Bypass Validation

Enable or disable bypassing document schema validation during insert or update operations. Bypassing document validation is a Privilege Action in MongoDB. Enabling this property can result in authorization errors for users with limited privileges.

Update Key Fields

Comma separated list of fields based on which to identify documents that need to be updated. If this property is set, NiFi will attempt an upsert operation on all documents. If this property is not set, all documents will be inserted.

Update Mode

Choose between updating a single document or multiple documents per incoming record.

Relationships

  • success: All FlowFiles that are written to MongoDB are routed to this relationship

  • failure: All FlowFiles that cannot be written to MongoDB are routed to this relationship

Reads Attributes

  • mongodb.update.mode: Configurable parameter for controlling update mode on a per-flowfile basis. Acceptable values are 'one' and 'many' and controls whether a single incoming record should update a single or multiple Mongo documents.

Input Requirement

This component requires an incoming relationship.

PutParquet

Reads records from an incoming FlowFile using the provided Record Reader, and writes those records to a Parquet file. The schema for the Parquet file must be provided in the processor properties. This processor will first write a temporary dot file and upon successfully writing every record to the dot file, it will rename the dot file to its final name. If the dot file cannot be renamed, the rename operation will be attempted up to 10 times, and if still not successful, the dot file will be deleted and the flow file will be routed to failure. If any error occurs while reading records from the input, or writing records to the output, the entire dot file will be removed and the flow file will be routed to failure or retry, depending on the error.

Tags: put, parquet, hadoop, HDFS, filesystem, record

Properties

Hadoop Configuration Resources

A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS’s documentation.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Additional Classpath Resources

A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

Record Reader

The service for reading records from incoming flow files.

Directory

The parent directory to which files should be written. Will be created if it doesn’t exist.

Compression Type

The type of compression for the file being written.

Overwrite Files

Whether or not to overwrite existing files in the same directory with the same name. When set to false, flow files will be routed to failure when a file exists in the same directory with the same name.

Permissions umask

A umask represented as an octal number which determines the permissions of files written to HDFS. This overrides the Hadoop Configuration dfs.umaskmode

Remote Group

Changes the group of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change group

Remote Owner

Changes the owner of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change owner

Row Group Size

The row group size used by the Parquet writer. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Page Size

The page size used by the Parquet writer. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Dictionary Page Size

The dictionary page size used by the Parquet writer. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Max Padding Size

The maximum amount of padding that will be used to align row groups with blocks in the underlying filesystem. If the underlying filesystem is not a block filesystem like HDFS, this has no effect. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Enable Dictionary Encoding

Specifies whether dictionary encoding should be enabled for the Parquet writer

Enable Validation

Specifies whether validation should be enabled for the Parquet writer

Writer Version

Specifies the version used by Parquet writer

Avro Write Old List Structure

Specifies the value for 'parquet.avro.write-old-list-structure' in the underlying Parquet library

Avro Add List Element Records

Specifies the value for 'parquet.avro.add-list-element-records' in the underlying Parquet library

Remove CRC Files

Specifies whether the corresponding CRC file should be deleted upon successfully writing a Parquet file

Relationships

  • success: Flow Files that have been successfully processed are transferred to this relationship

  • failure: Flow Files that could not be processed due to an issue that cannot be retried are transferred to this relationship

  • retry: Flow Files that could not be processed due to issues that can be retried are transferred to this relationship

Reads Attributes

  • filename: The name of the file to write comes from the value of this attribute.

Writes Attributes

  • filename: The name of the file is stored in this attribute.

  • absolute.hdfs.path: The absolute path to the file is stored in this attribute.

  • hadoop.file.url: The hadoop url for the file is stored in this attribute.

  • record.count: The number of records written to the Parquet file

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

PutQdrant

Publishes JSON data to Qdrant. The incoming data must be in JSON Lines format (one JSON object per line), each with two keys: 'text' and 'metadata'. The text must be a string, while metadata must be a map with string values. Any additional fields will be ignored.

Use Cases

Create embeddings that semantically represent text content and upload to Qdrant - https://qdrant.tech/

Notes: This processor assumes that the data has already been formatted in JSONL format with the text to store in Qdrant provided in the 'text' field.
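
For example (the values are illustrative), each line of the incoming data might look like:

    {"text": "NiFi routes and transforms data.", "metadata": {"source": "docs", "page": "1"}}
    {"text": "Qdrant stores the resulting embeddings.", "metadata": {"source": "docs", "page": "2"}}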

Keywords: qdrant, embedding, vector, text, vectorstore, insert

  1. Configure 'Collection Name' to the name of the Qdrant collection to use.

  2. Configure 'Qdrant URL' to the fully qualified URL of the Qdrant instance.

  3. Configure 'Qdrant API Key' to the API Key to use in order to authenticate with Qdrant.

  4. Configure 'Prefer gRPC' to True if you want to use gRPC for interfacing with Qdrant.

  5. Configure 'Use HTTPS' to True if you want to use TLS (HTTPS) while interfacing with Qdrant.

  6. Configure 'Embedding Model' to indicate whether an OpenAI or Google embedding model should be used: 'OpenAI Model' or 'Google Model'.

  7. Configure 'Google API Key' or 'OpenAI API Key', depending on the chosen Embedding Model.

  8. Configure 'Google Model' or 'OpenAI Model' to the name of the model to use.

  9. Configure 'Force Recreate Collection' to True if you want to recreate the collection if it already exists.

  10. Configure 'Similarity Metric' to the similarity metric to use when querying Qdrant.

  11. If the documents to send to Qdrant contain a unique identifier (UUID), set the 'Document ID Field Name' property to the name of the field that contains the document ID.

  12. This property can be left blank, in which case a UUID will be generated based on the FlowFile’s filename.

Tags: qdrant, vector, vectordb, vectorstore, embeddings, ai, artificial intelligence, ml, machine learning, text, LLM

Properties

Document ID Field Name

Specifies the name of the field in the 'metadata' element of each document where the document’s ID can be found. If not specified, a UUID will be generated based on the FlowFile’s filename and an incremental number.

Force Recreate Collection

Specifies whether to recreate the collection if it already exists, essentially clearing the existing data.

Similarity Metric

Specifies the similarity metric when creating the collection.

PutRecord

The PutRecord processor uses a specified RecordReader to read (possibly multiple) records from an incoming flow file, and sends them to a destination specified by a Record Destination Service (i.e. record sink).

Tags: record, put, sink

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Destination Service

Specifies the Controller Service to use for writing out the query result records to some destination.

Include Zero Record Results

If no records are read from the incoming FlowFile, this property specifies whether or not an empty record set will be transmitted. The original FlowFile will still be routed to success, but if no transmission occurs, no provenance SEND event will be generated.

Relationships

  • success: The original FlowFile will be routed to this relationship if the records were transmitted successfully

  • failure: A FlowFile is routed to this relationship if the records could not be transmitted and retrying the operation will also fail

  • retry: The original FlowFile is routed to this relationship if the records could not be transmitted but attempting the operation again may succeed

Input Requirement

This component requires an incoming relationship.

PutRedisHashRecord

Puts record field data into Redis using a specified hash value, which is determined by a RecordPath to a field in each record containing the hash value. The record fields and values are stored as key/value pairs associated by the hash value. NOTE: Neither the evaluated hash value nor any of the field values can be null. If the hash value is null, the FlowFile will be routed to failure. For each of the field values, if the value is null, that field will not be set in Redis.
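
As a minimal sketch (the field names are hypothetical), with 'Hash Value Record Path' set to /userId, a record such as

    { "userId": "u-123", "name": "Alice", "email": "alice@example.com" }

would be stored under the Redis hash 'u-123', with the remaining fields written as its key/value pairs (the equivalent of a Redis HSET on that hash).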

Tags: put, redis, hash, record

Properties

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema

Redis Connection Pool

Hash Value Record Path

Specifies a RecordPath to evaluate against each Record in order to determine the hash value associated with all the record fields/values (see 'hset' in Redis documentation for more details). The RecordPath must point at exactly one field or an error will occur.

Data Record Path

This property denotes a RecordPath that will be evaluated against each incoming Record and the Record that results from evaluating the RecordPath will be sent to Redis instead of sending the entire incoming Record. The property defaults to the root '/' which corresponds to a 'flat' record (all fields/values at the top level of the Record).

Character Set

Specifies the character set to use when storing record field values as strings. All fields will be converted to strings using this character set before being stored in Redis.

Relationships

  • success: FlowFiles having all Records stored in Redis will be routed to this relationship

  • failure: FlowFiles containing Records with processing errors will be routed to this relationship

Writes Attributes

  • redis.success.record.count: Number of records written to Redis

Input Requirement

This component requires an incoming relationship.

PutS3Object

Writes the contents of a FlowFile as an S3 Object to an Amazon S3 Bucket.

Tags: Amazon, S3, AWS, Archive, Put

Properties

Bucket

The S3 Bucket to interact with

Object Key

The S3 Object Key to use. This is analogous to a filename for traditional file systems.

Region

The AWS Region to connect to.

AWS Credentials Provider Service

The Controller Service that is used to obtain an AWS credentials provider

Resource Transfer Source

The source of the content to be transferred

File Resource Service

File Resource Service providing access to the local resource to be transferred

Storage Class

Encryption Service

Specifies the Encryption Service Controller used to configure requests. PutS3Object: For backward compatibility, this value is ignored when 'Server Side Encryption' is set. FetchS3Object: Only needs to be configured in case of Server-side Customer Key, Client-side KMS and Client-side Customer Key encryptions.

Server Side Encryption

Specifies the algorithm used for server side encryption.

Content Type

Sets the Content-Type HTTP header indicating the type of content stored in the associated object. The value of this header is a standard MIME type. The AWS S3 Java client will attempt to determine the correct content type if one hasn’t been set yet. Users are responsible for ensuring a suitable content type is set when uploading streams. If no content type is provided and it cannot be determined from the filename, the default content type "application/octet-stream" will be used.

Content Disposition

Sets the Content-Disposition HTTP header indicating if the content is intended to be displayed inline or should be downloaded. Possible values are 'inline' or 'attachment'. If this property is not specified, the object’s content-disposition will be set to the filename. When 'attachment' is selected, '; filename=' plus the object key are automatically appended to form the final value 'attachment; filename="filename.jpg"'.

Cache Control

Sets the Cache-Control HTTP header indicating the caching directives of the associated object. Multiple directives are comma-separated.

Object Tags Prefix

Specifies the prefix which would be scanned against the incoming FlowFile’s attributes and the matching attribute’s name and value would be considered as the outgoing S3 object’s Tag name and Tag value respectively. For example: If the incoming FlowFile carries the attributes tagS3country, tagS3PII, the tag prefix to be specified would be 'tagS3'

Remove Tag Prefix

If set to 'True', the value provided for 'Object Tags Prefix' will be removed from the attribute(s) and then considered as the Tag name. For example: If the incoming FlowFile carries the attributes tagS3country, tagS3PII and the prefix is set to 'tagS3' then the corresponding tag names would be 'country' and 'PII'

Communications Timeout

The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

Expiration Time Rule

FullControl User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have Full Control for an object

Read Permission User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have Read Access for an object

Write Permission User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have Write Access for an object

Read ACL User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have permissions to read the Access Control List for an object

Write ACL User List

A comma-separated list of Amazon User IDs or E-mail addresses that specifies who should have permissions to change the Access Control List for an object

Owner

The Amazon ID to use for the object’s owner

Canned ACL

Amazon Canned ACL for an object, one of: BucketOwnerFullControl, BucketOwnerRead, LogDeliveryWrite, AuthenticatedRead, PublicReadWrite, PublicRead, Private; will be ignored if any other ACL/permission/owner property is specified

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Signer Override

The AWS S3 library uses Signature Version 4 by default but this property allows you to specify the Version 2 signer to support older S3-compatible services or even to plug in your own custom signer implementation.

Custom Signer Class Name

Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.

Custom Signer Module Location

Comma-separated list of paths to files and/or directories which contain the custom signer’s JAR file and its dependencies (if any).

Multipart Threshold

Specifies the file size threshold for switching from the PutS3Object API to the PutS3MultipartUpload API. Flow files bigger than this limit will be sent using the stateful multipart process. The valid range is 50MB to 5GB.

Multipart Part Size

Specifies the part size for use when the PutS3MultipartUpload API is used. Flow files will be broken into chunks of this size for the upload process, but the last part sent can be smaller since it is not padded. The valid range is 50MB to 5GB.

Multipart Upload AgeOff Interval

Specifies the interval at which existing multipart uploads in AWS S3 will be evaluated for ageoff. When the processor is triggered, it will initiate the ageoff evaluation if this interval has been exceeded.

Multipart Upload Max Age Threshold

Specifies the maximum age for existing multipart uploads in AWS S3. When the ageoff process occurs, any upload older than this threshold will be aborted.

Temporary Directory Multipart State

Directory in which, for multipart uploads, the processor will locally save state tracking the upload ID and parts uploaded, which must both be provided to complete the upload.

Use Chunked Encoding

Enables / disables chunked encoding for upload requests. Set it to false only if your endpoint does not support chunked uploading.

Use Path Style Access

Path-style access can be enforced by setting this property to true. Set it to true if your endpoint does not support virtual-hosted-style requests, only path-style requests.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Dynamic Properties

The name of a User-Defined Metadata field to add to the S3 Object

Allows user-defined metadata to be added to the S3 object as key/value pairs

Relationships

  • success: FlowFiles are routed to this Relationship after they have been successfully processed.

  • failure: If the Processor is unable to process a given FlowFile, it will be routed to this Relationship.

Reads Attributes

  • filename: Uses the FlowFile’s filename as the filename for the S3 object

Writes Attributes

  • s3.url: The URL that can be used to access the S3 object

  • s3.bucket: The S3 bucket where the Object was put in S3

  • s3.key: The S3 key under which the Object was put in S3

  • s3.contenttype: The content type of the S3 Object that was put in S3

  • s3.version: The version of the S3 Object that was put to S3

  • s3.exception: The class name of the exception thrown during processor execution

  • s3.additionalDetails: The S3 supplied detail from the failed operation

  • s3.statusCode: The HTTP error code (if available) from the failed operation

  • s3.errorCode: The S3 moniker of the failed operation

  • s3.errorMessage: The S3 exception message from the failed operation

  • s3.etag: The ETag of the S3 Object

  • s3.contentdisposition: The content disposition of the S3 Object that was put in S3

  • s3.cachecontrol: The cache-control header of the S3 Object

  • s3.uploadId: The uploadId used to upload the Object to S3

  • s3.expiration: A human-readable form of the expiration date of the S3 object, if one is set

  • s3.sseAlgorithm: The server side encryption algorithm of the object

  • s3.usermetadata: A human-readable form of the User Metadata of the S3 object, if any was set

  • s3.encryptionStrategy: The name of the encryption strategy, if any was set

Input Requirement

This component requires an incoming relationship.

Additional Details

Multi-part Upload Details

The upload uses either the PutS3Object method or the PutS3MultipartUpload method. The PutS3Object method sends the file in a single synchronous call, but it has a 5GB size limit. Larger files are sent using the PutS3MultipartUpload method. This multipart process saves state after each step so that a large upload can be resumed with minimal loss if the processor or cluster is stopped and restarted. A multipart upload consists of three steps:

  1. Initiate upload

  2. Upload the parts

  3. Complete the upload

For multipart uploads, the processor saves state locally tracking the upload ID and parts uploaded, which must both be provided to complete the upload. The AWS libraries select an endpoint URL based on the AWS region, but this can be overridden with the ‘Endpoint Override URL’ property for use with other S3-compatible endpoints. The S3 API specifies that the maximum file size for a PutS3Object upload is 5GB. It also requires that parts in a multipart upload must be at least 5MB in size, except for the last part. These limits establish the bounds for the Multipart Upload Threshold and Part Size properties.

Configuration Details
Object Key

The Object Key property value should not start with “/”.

Credentials File

The Credentials File property allows the user to specify the path to a file containing the AWS access key and secret key. The contents of the file should be in the following format:

    [default]
    accessKey=<access key>
    secretKey=<secret key>

Make sure the credentials file is readable by the NiFi service user.

When using the Credential File property, ensure that there are no values for the Access Key and Secret Key properties. The Value column should read “No value set” for both. Note: Do not check “Set empty string” for either as the empty string is considered a set value.

PutSalesforceObject

Creates new records for the specified Salesforce sObject. The type of the Salesforce object must be set in the input flowfile’s 'objectType' attribute. This processor cannot update existing records.

Tags: salesforce, sobject, put

Properties

Salesforce Instance URL

The URL of the Salesforce instance including the domain without additional path information, such as https://MyDomainName.my.salesforce.com

API Version

The version number of the Salesforce REST API appended to the URL after the services/data path. See Salesforce documentation for supported versions

Read Timeout

Maximum time allowed for reading a response from the Salesforce REST API

OAuth2 Access Token Provider

Service providing OAuth2 Access Tokens for authenticating using the HTTP Authorization Header

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema

Relationships

  • success: For FlowFiles created as a result of a successful execution.

  • failure: For FlowFiles created as a result of an execution error.

Reads Attributes

  • objectType: The Salesforce object type to upload records to. E.g. Account, Contact, Campaign.

Writes Attributes

  • error.message: The error message returned by Salesforce.

Input Requirement

This component requires an incoming relationship.

Additional Details

Description

Objects in Salesforce are database tables, their rows are known as records, and their columns are called fields. The PutSalesforceObject processor creates a new record in a Salesforce object. The Salesforce object must be set as the “objectType” attribute of an incoming flowfile. Check Salesforce documentation for object types and metadata. The processor utilizes NiFi record-based processing to allow arbitrary input format.

Example

If the “objectType” is set to “Account”, the following JSON input will create two records in the Account object with the names “SampleAccount1” and “SampleAccount2”.

[
  {
    "name": "SampleAccount1",
    "phone": "1111111111",
    "website": "www.salesforce1.com",
    "numberOfEmployees": "100",
    "industry": "Banking"
  },
  {
    "name": "SampleAccount2",
    "phone": "22222222",
    "website": "www.salesforce2.com",
    "numberOfEmployees": "200",
    "industry": "Banking"
  }
]

PutSFTP

Sends FlowFiles to an SFTP Server

Tags: remote, copy, egress, put, sftp, archive, files

Properties

Hostname

The fully qualified hostname or IP address of the remote system

Port

The port that the remote system is listening on for file transfers

Username

Username

Password

Password for the user account

Private Key Path

The fully qualified path to the Private Key file

Private Key Passphrase

Password for the private key

Remote Path

The path on the remote system from which to pull or push files

Create Directory

Specifies whether or not the remote directory should be created if it does not exist.

Disable Directory Listing

If set to 'true', directory listing is not performed prior to creating missing directories. By default, this processor executes a directory listing command to check for the target directory’s existence before creating missing directories. However, there are situations in which you might need to disable the directory listing, such as the following: directory listing might fail with some permission setups (e.g. chmod 100) on a directory; also, if another SFTP client creates the directory after this processor performs a listing but before this processor’s directory creation request finishes, an error is returned because the directory already exists.

Batch Size

The maximum number of FlowFiles to send in a single connection

Connection Timeout

Amount of time to wait before timing out while creating a connection

Data Timeout

When transferring a file between the local and remote system, this value specifies how long is allowed to elapse without any data being transferred between systems

Conflict Resolution

Determines how to handle the problem of filename collisions

Reject Zero-Byte Files

Determines whether or not Zero-byte files should be rejected without attempting to transfer

Dot Rename

If true, then the filename of the sent file is prepended with a "." and then renamed back to the original once the file is completely sent. Otherwise, there is no rename. This property is ignored if the Temporary Filename property is set.

Temporary Filename

If set, the filename of the sent file will be equal to the value specified during the transfer and after successful completion will be renamed to the original filename. If this value is set, the Dot Rename property is ignored.

Host Key File

If supplied, the given file will be used as the Host Key; otherwise, if the 'Strict Host Key Checking' property is set to true, the 'known_hosts' and 'known_hosts2' files from the ~/.ssh directory are used; if not, no host key file will be used

Last Modified Time

The lastModifiedTime to assign to the file after transferring it. If not set, the lastModifiedTime will not be changed. Format must be yyyy-MM-dd’T’HH:mm:ssZ. You may also use expression language such as ${file.lastModifiedTime}. If the value is invalid, the processor will not be invalid but will fail to change lastModifiedTime of the file.

Permissions

The permissions to assign to the file after transferring it. Format must be either UNIX rwxrwxrwx with a - in place of denied permissions (e.g. rw-r--r--) or an octal number (e.g. 644). If not set, the permissions will not be changed. You may also use expression language such as ${file.permissions}. If the value is invalid, the processor will not be invalid but will fail to change permissions of the file.

Remote Owner

Integer value representing the User ID to set on the file after transferring it. If not set, the owner will not be set. You may also use expression language such as ${file.owner}. If the value is invalid, the processor will not be invalid but will fail to change the owner of the file.

Remote Group

Integer value representing the Group ID to set on the file after transferring it. If not set, the group will not be set. You may also use expression language such as ${file.group}. If the value is invalid, the processor will not be invalid but will fail to change the group of the file.

Strict Host Key Checking

Indicates whether or not strict enforcement of hosts keys should be applied

Send Keep Alive On Timeout

Send a Keep Alive message every 5 seconds up to 5 times for an overall timeout of 25 seconds.

Use Compression

Indicates whether or not ZLIB compression should be used when transferring files

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN, SOCKS + AuthN

Ciphers Allowed

A comma-separated list of Ciphers allowed for SFTP connections. Leave unset to allow all. Available options are: 3des-cbc, 3des-ctr, aes128-cbc, aes128-ctr, aes128-gcm@openssh.com, aes192-cbc, aes192-ctr, aes256-cbc, aes256-ctr, aes256-gcm@openssh.com, arcfour, arcfour128, arcfour256, blowfish-cbc, blowfish-ctr, cast128-cbc, cast128-ctr, chacha20-poly1305@openssh.com, idea-cbc, idea-ctr, serpent128-cbc, serpent128-ctr, serpent192-cbc, serpent192-ctr, serpent256-cbc, serpent256-ctr, twofish-cbc, twofish128-cbc, twofish128-ctr, twofish192-cbc, twofish192-ctr, twofish256-cbc, twofish256-ctr

Key Algorithms Allowed

A comma-separated list of Key Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: ecdsa-sha2-nistp256, ecdsa-sha2-nistp256-cert-v01@openssh.com, ecdsa-sha2-nistp384, ecdsa-sha2-nistp384-cert-v01@openssh.com, ecdsa-sha2-nistp521, ecdsa-sha2-nistp521-cert-v01@openssh.com, rsa-sha2-256, rsa-sha2-512, ssh-dss, ssh-dss-cert-v01@openssh.com, ssh-ed25519, ssh-ed25519-cert-v01@openssh.com, ssh-rsa, ssh-rsa-cert-v01@openssh.com

Key Exchange Algorithms Allowed

A comma-separated list of Key Exchange Algorithms allowed for SFTP connections. Leave unset to allow all. Available options are: curve25519-sha256, curve25519-sha256@libssh.org, diffie-hellman-group-exchange-sha1, diffie-hellman-group-exchange-sha256, diffie-hellman-group1-sha1, diffie-hellman-group14-sha1, diffie-hellman-group14-sha256, diffie-hellman-group14-sha256@ssh.com, diffie-hellman-group15-sha256, diffie-hellman-group15-sha256@ssh.com, diffie-hellman-group15-sha384@ssh.com, diffie-hellman-group15-sha512, diffie-hellman-group16-sha256, diffie-hellman-group16-sha384@ssh.com, diffie-hellman-group16-sha512, diffie-hellman-group16-sha512@ssh.com, diffie-hellman-group17-sha512, diffie-hellman-group18-sha512, diffie-hellman-group18-sha512@ssh.com, ecdh-sha2-nistp256, ecdh-sha2-nistp384, ecdh-sha2-nistp521, ext-info-c

Message Authentication Codes Allowed

A comma-separated list of Message Authentication Codes allowed for SFTP connections. Leave unset to allow all. Available options are: hmac-md5, hmac-md5-96, hmac-md5-96-etm@openssh.com, hmac-md5-etm@openssh.com, hmac-ripemd160, hmac-ripemd160-96, hmac-ripemd160-etm@openssh.com, hmac-ripemd160@openssh.com, hmac-sha1, hmac-sha1-96, hmac-sha1-96@openssh.com, hmac-sha1-etm@openssh.com, hmac-sha2-256, hmac-sha2-256-etm@openssh.com, hmac-sha2-512, hmac-sha2-512-etm@openssh.com

Relationships

  • success: FlowFiles that are successfully sent will be routed to success

  • failure: FlowFiles that failed to send to the remote system; failure is usually looped back to this processor

  • reject: FlowFiles that were rejected by the destination system

Input Requirement

This component requires an incoming relationship.

See Also

PutSmbFile

Writes the contents of a FlowFile to a Samba network location. Use this processor instead of a CIFS mount if share access control is important. Configure the Hostname, Share and Directory accordingly: \\[Hostname]\[Share]\[path\to\Directory]

Tags: samba, smb, cifs, files, put

Properties

Hostname

The network host to which files should be written.

Share

The network share to which files should be written. This is the "first folder" after the hostname: \\hostname\[share]\dir1\dir2

Directory

The network folder to which files should be written. This is the remaining relative path after the share: \\hostname\share\[dir1\dir2]. You may use expression language.

Domain

The domain used for authentication. Optional, in most cases username and password are sufficient.

Username

The username used for authentication. If no username is set then anonymous authentication is attempted.

Password

The password used for authentication. Required if Username is set.

Create Missing Directories

If true, then missing destination directories will be created. If false, flowfiles are penalized and sent to failure.

Share Access Strategy

Indicates which shared access is granted on the file during the write. None is the most restrictive, but also the safest setting to prevent corruption.

Conflict Resolution Strategy

Indicates what should happen when a file with the same name already exists in the output directory

Batch Size

The maximum number of files to put in each iteration

Temporary Suffix

A temporary suffix which will be appended to the filename while it is being transferred. After the transfer is complete, the suffix will be removed.

SMB Dialect

The SMB dialect is negotiated between the client and the server by default to the highest common version supported by both ends. In some rare cases, the client-server communication may fail with the automatically negotiated dialect. This property can be used to set the dialect explicitly (e.g. to downgrade to a lower version) when such situations occur.

Use Encryption

Turns on/off encrypted communication between the client and the server. The property’s behavior is SMB dialect dependent: SMB 2.x does not support encryption and the property has no effect. In case of SMB 3.x, it is a hint/request to the server to turn encryption on if the server also supports it.

Enable DFS

Enables accessing Distributed File System (DFS) and following DFS links during SMB operations.

Timeout

Timeout for read and write operations.

Relationships

  • success: Files that have been successfully written to the output network path are transferred to this relationship

  • failure: Files that could not be written to the output network path for some reason are transferred to this relationship

Reads Attributes

  • filename: The filename to use when writing the FlowFile to the network folder.

Input Requirement

This component requires an incoming relationship.

PutSnowflakeInternalStage

Puts files into a Snowflake internal stage. The internal stage must be created in the Snowflake account beforehand. This processor can be connected to a StartSnowflakeIngest processor to ingest the file in the internal stage

Tags: snowflake, jdbc, database, connection, snowpipe

Properties

Snowflake Connection Provider

Specifies the Controller Service to use for creating SQL connections to Snowflake.

Internal Stage Type

The type of internal stage to use

Database

The database to use by default. The same as passing 'db=DATABASE_NAME' to the connection string.

Schema

The schema to use by default. The same as passing 'schema=SCHEMA' to the connection string.

Table

The name of the table in the Snowflake account.

Stage

The name of the internal stage in the Snowflake account to put files into.

Relationships

  • success: For FlowFiles of successful PUT operation

  • failure: For FlowFiles of failed PUT operation

Writes Attributes

  • snowflake.staged.file.path: Staged file path

Input Requirement

This component requires an incoming relationship.

Additional Details

Description

The PutSnowflakeInternalStage processor can upload a file to a Snowflake internal stage. Please note that named stages need to be created in your Snowflake account manually. See the documentation on how to set up an internal stage here. The processor requires an upstream connection and the incoming FlowFiles’ content will be uploaded to the stage. A unique name is generated for the staged file. While the processor may be used separately, it’s recommended to connect it to a StartSnowflakeIngest processor so that the uploaded file can be piped into your Snowflake table.

PutSNS

Sends the content of a FlowFile as a notification to the Amazon Simple Notification Service

Tags: amazon, aws, sns, topic, put, publish, pubsub

Properties

Amazon Resource Name (ARN)

The name of the resource to which notifications should be published

ARN Type

The type of Amazon Resource Name that is being used.

E-mail Subject

The optional subject to use for any subscribers that are subscribed via E-mail

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain an AWS credentials provider

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Communications Timeout

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Use JSON Structure

If true, the contents of the FlowFile must be JSON with a top-level element named 'default'. Additional elements can be used to send different messages to different protocols. See the Amazon SNS Documentation for more information.
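
A minimal sketch of such content (the protocol-specific keys shown here, 'email' and 'sqs', are optional examples):

    {
      "default": "Fallback message for all other protocols",
      "email": "Longer message for e-mail subscribers",
      "sqs": "Message delivered to SQS subscribers"
    }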

Character Set

The character set in which the FlowFile’s content is encoded

Message Group ID

If using FIFO, the message group to which the flowFile belongs

Deduplication Message ID

The token used for deduplication of sent messages

Dynamic Properties

A name of an attribute to be added to the notification

User specified dynamic Properties are added as attributes to the notification

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

Input Requirement

This component requires an incoming relationship.

See Also

PutSplunk

Sends logs to Splunk Enterprise over TCP, TCP + TLS/SSL, or UDP. If a Message Delimiter is provided, then this processor will read messages from the incoming FlowFile based on the delimiter, and send each message to Splunk. If a Message Delimiter is not provided then the content of the FlowFile will be sent directly to Splunk as if it were a single message.

Tags: splunk, logs, tcp, udp

Properties

Hostname

Destination hostname or IP address

Port

Destination port number

Max Size of Socket Send Buffer

The maximum size of the socket send buffer that should be used. This is a suggestion to the Operating System to indicate how big the socket buffer should be. If this value is set too low, the buffer may fill up before the data can be read, and incoming data will be dropped.

Idle Connection Expiration

The amount of time a connection should be held open without being used before closing the connection. A value of 0 seconds will disable this feature.

Timeout

The timeout for connecting to and communicating with the destination. Does not apply to UDP

Character Set

Specifies the character set of the data being sent.

Protocol

The protocol for communication.

Message Delimiter

Specifies the delimiter to use for splitting apart multiple messages within a single FlowFile. If not specified, the entire content of the FlowFile will be used as a single message. If specified, the contents of the FlowFile will be split on this delimiter and each section sent as a separate message. Note that if messages are delimited and some messages for a given FlowFile are transferred successfully while others are not, the messages will be split into individual FlowFiles, such that those messages that were successfully sent are routed to the 'success' relationship while other messages are sent to the 'failure' relationship.

SSL Context Service

Specifies the SSL Context Service to enable TLS socket communication

Relationships

  • success: FlowFiles that are sent successfully to the destination are sent out this relationship.

  • failure: FlowFiles that failed to send to the destination are sent out this relationship.

Input Requirement

This component requires an incoming relationship.

PutSplunkHTTP

Sends flow file content to the specified Splunk server over HTTP or HTTPS. Supports HEC Index Acknowledgement.

Tags: splunk, logs, http

Properties

Scheme

The scheme for connecting to Splunk.

Hostname

The IP address or hostname of the Splunk server.

HTTP Event Collector Port

The HTTP Event Collector HTTP Port Number.

Security Protocol

The security protocol to use for communicating with Splunk.

Owner

The owner to pass to Splunk.

HTTP Event Collector Token

HTTP Event Collector token starting with the string Splunk. For example 'Splunk 1234578-abcd-1234-abcd-1234abcd'

Username

The username to authenticate to Splunk.

Password

The password to authenticate to Splunk.

Splunk Request Channel

Identifier of the request channel to use.

Source

User-defined event source. Sets a default for all events when unspecified.

Source Type

User-defined event sourcetype. Sets a default for all events when unspecified.

Host

Specify with the host query string parameter. Sets a default for all events when unspecified.

Index

Index name. Specify with the index query string parameter. Sets a default for all events when unspecified.

Content Type

The media type of the event sent to Splunk. If not set, the "mime.type" flow file attribute will be used. If neither is specified, this information will not be sent to the server.

Character Set

The name of the character set.

Relationships

  • success: FlowFiles that are sent successfully to the destination are sent to this relationship.

  • failure: FlowFiles that failed to send to the destination are sent to this relationship.

Reads Attributes

  • mime.type: Used as the value for the HTTP Content-Type header if set.

Writes Attributes

  • splunk.acknowledgement.id: The indexing acknowledgement id provided by Splunk.

  • splunk.responded.at: The time of the Splunk response to the put request.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in degradation of performance.

Additional Details

PutSplunkHTTP

This processor serves as a counterpart for the PutSplunk processor. While the latter communicates over the TCP and UDP protocols, PutSplunkHTTP sends events to Splunk via HTTP or HTTPS. In this respect, this processor shows similarities with the GetSplunk processor, and the properties relevant to the connection with the Splunk server are identical. There are, however, some aspects unique to this processor:

Content details

PutSplunkHTTP allows the user to specify some metadata about the event being sent to Splunk. These include the “Character Set” and the “Content Type” of the flow file content, set using the matching properties. If the incoming flow file has a “mime.type” attribute, the processor will use it, unless the “Content Type” property is set, in which case the property will override the flow file attribute.

Event parameters

The “Source”, “Source Type”, “Host” and “Index” properties are optional and will be set by Splunk if unspecified. If set, the default values will be overwritten by user specified ones. For more details about the Splunk API, please visit this documentation.

Acknowledgements

HTTP Event Collector (HEC) in Splunk provides the possibility of index acknowledgement, which can be used to monitor the indexing status of individual events. PutSplunkHTTP supports this feature by enriching the outgoing flow file with the necessary information, making it possible for a later processor to poll the indexing status based on it. The necessary information for this is stored within the flow file attributes “splunk.acknowledgement.id” and “splunk.responded.at”.

For further steps of acknowledgement handling in NiFi side, please refer to QuerySplunkIndexingStatus processor. For more details about the index acknowledgement, please visit this documentation.

Error information

For more refined processing, flow files are enriched with additional information if possible. The information is stored in the flow file attribute “splunk.status.code” or “splunk.response.code”, depending on the success of the processing. The attribute “splunk.status.code” is always filled when the Splunk API call is executed and contains the HTTP status code of the response. If the flow file is transferred to the “failure” relationship, the “splunk.response.code” attribute might also be filled, based on the Splunk response code.

PutSQL

Executes a SQL UPDATE or INSERT command. The content of an incoming FlowFile is expected to be the SQL command to execute. The SQL command may use the ? character as a placeholder for parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format.
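
As a minimal illustration (the table and values are hypothetical), a FlowFile whose content is

    INSERT INTO users (id, name) VALUES (?, ?)

could carry the following attributes, where 4 and 12 are the JDBC type codes for INTEGER and VARCHAR respectively:

    sql.args.1.type  = 4
    sql.args.1.value = 42
    sql.args.2.type  = 12
    sql.args.2.value = Alice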

Tags: sql, put, rdbms, database, update, insert, relational

Properties

JDBC Connection Pool

Specifies the JDBC Connection Pool to use in order to convert the JSON message to a SQL statement. The Connection Pool is necessary in order to determine the appropriate database column types.

SQL Statement

The SQL statement to execute. The statement can be empty, a constant value, or built from attributes using Expression Language. If this property is specified, it will be used regardless of the content of incoming FlowFiles. If this property is empty, the content of the incoming FlowFile is expected to contain a valid SQL statement, to be issued by the processor to the database.

Support Fragmented Transactions

If true, when a FlowFile is consumed by this Processor, the Processor will first check the fragment.identifier and fragment.count attributes of that FlowFile. If the fragment.count value is greater than 1, the Processor will not process any FlowFile with that fragment.identifier until all are available; at that point, it will process all FlowFiles with that fragment.identifier as a single transaction, in the order specified by the FlowFiles’ fragment.index attributes. This provides atomicity of those SQL statements. If any statement of this transaction throws an exception during execution, the transaction will be rolled back. When a transaction rollback happens, none of these FlowFiles will be routed to 'success'. If <Rollback On Failure> is set to true, these FlowFiles will stay in the input relationship. If <Rollback On Failure> is set to false and any of these FlowFiles would be routed to 'retry', all of these FlowFiles will be routed to 'retry'; otherwise, they will be routed to 'failure'. If this value is false, these attributes will be ignored and the updates will occur independent of one another.

Database Session AutoCommit

The autocommit mode to set on the database connection being used. If set to false, the operation(s) will be explicitly committed or rolled back (based on success or failure respectively), if set to true the driver/database handles the commit/rollback.

Transaction Timeout

If the <Support Fragmented Transactions> property is set to true, specifies how long to wait for all FlowFiles for a particular fragment.identifier attribute to arrive before just transferring all of the FlowFiles with that identifier to the 'failure' relationship

Batch Size

The preferred number of FlowFiles to put to the database in a single transaction

Obtain Generated Keys

If true, any key that is automatically generated by the database will be added to the FlowFile that generated it using the sql.generated.key attribute. This may result in slightly slower performance and is not supported by all databases.

Rollback On Failure

Specifies how to handle errors. By default (false), if an error occurs while processing a FlowFile, the FlowFile will be routed to the 'failure' or 'retry' relationship based on the error type, and the processor can continue with the next FlowFile. Instead, you may want to roll back the currently processed FlowFiles and stop further processing immediately. In that case, you can do so by enabling this 'Rollback On Failure' property. If enabled, failed FlowFiles will stay in the input relationship without being penalized and will be processed repeatedly until processed successfully or removed by other means. It is important to set an adequate 'Yield Duration' to avoid retrying too frequently.

Relationships

  • success: A FlowFile is routed to this relationship after the database is successfully updated

  • failure: A FlowFile is routed to this relationship if the database cannot be updated and retrying the operation will also fail, such as an invalid query or an integrity constraint violation

  • retry: A FlowFile is routed to this relationship if the database cannot be updated but attempting the operation again may succeed

Reads Attributes

  • fragment.identifier: If the <Support Fragmented Transactions> property is true, this attribute is used to determine whether or not two FlowFiles belong to the same transaction.

  • fragment.count: If the <Support Fragmented Transactions> property is true, this attribute is used to determine how many FlowFiles are needed to complete the transaction.

  • fragment.index: If the <Support Fragmented Transactions> property is true, this attribute is used to determine the order in which the FlowFiles in a transaction should be evaluated.

  • sql.args.N.type: Incoming FlowFiles are expected to be parametrized SQL statements. The type of each Parameter is specified as an integer that represents the JDBC Type of the parameter.

  • sql.args.N.value: Incoming FlowFiles are expected to be parametrized SQL statements. The value of the Parameters are specified as sql.args.1.value, sql.args.2.value, sql.args.3.value, and so on. The type of the sql.args.1.value Parameter is specified by the sql.args.1.type attribute.

  • sql.args.N.format: This attribute is always optional, but default options may not always work for your data. Incoming FlowFiles are expected to be parametrized SQL statements. In some cases a format option needs to be specified; currently this is only applicable for binary data types, dates, times and timestamps. Binary Data Types (defaults to 'ascii') - ascii: each string character in your attribute value represents a single byte. This is the format provided by Avro Processors. base64: the string is a Base64 encoded string that can be decoded to bytes. hex: the string is hex encoded with all letters in upper case and no '0x' at the beginning. Dates/Times/Timestamps - Date, Time and Timestamp formats all support either custom formats or a named format ('yyyy-MM-dd', 'ISO_OFFSET_DATE_TIME') as specified according to java.time.format.DateTimeFormatter. If not specified, a long value input is expected to be a unix epoch (milliseconds from 1970/1/1), or a string value in 'yyyy-MM-dd' format for Date, 'HH:mm:ss.SSS' for Time (some database engines e.g. Derby or MySQL do not support milliseconds and will truncate milliseconds), or 'yyyy-MM-dd HH:mm:ss.SSS' for Timestamp.

Writes Attributes

  • sql.generated.key: If the database generated a key for an INSERT statement and the Obtain Generated Keys property is set to true, this attribute will be added to indicate the generated key, if possible. This feature is not supported by all database vendors.

Input Requirement

This component requires an incoming relationship.

PutSQS

Publishes a message to an Amazon Simple Queuing Service Queue

Tags: Amazon, AWS, SQS, Queue, Put, Publish

Properties

Queue URL

The URL of the queue to act upon

Region

AWS Credentials Provider Service

The Controller Service that is used to obtain an AWS credentials provider

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Delay

The amount of time to delay the message before it becomes available to consumers

Communications Timeout

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Message Group ID

If using FIFO, the message group to which the FlowFile belongs

Deduplication Message ID

The token used for deduplication of sent messages

Dynamic Properties

The name of a Message Attribute to add to the message

Allows the user to add key/value pairs as Message Attributes by adding a property whose name will become the name of the Message Attribute and value will become the value of the Message Attribute

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

Input Requirement

This component requires an incoming relationship.

See Also

PutSyslog

Sends Syslog messages to a given host and port over TCP or UDP. Messages are constructed from the "Message _" properties of the processor which can use expression language to generate messages from incoming FlowFiles. The properties are used to construct messages of the form: (<PRIORITY>)(VERSION )(TIMESTAMP) (HOSTNAME) (BODY) where version is optional. The constructed messages are checked against regular expressions for RFC5424 and RFC3164 formatted messages. The timestamp can be an RFC5424 timestamp with a format of "yyyy-MM-dd’T’HH:mm:ss.SZ" or "yyyy-MM-dd’T’HH:mm:ss.S+hh:mm", or it can be an RFC3164 timestamp with a format of "MMM d HH:mm:ss". If a constructed message does not form a valid Syslog message according to the above description, it is routed to the invalid relationship. Valid messages are sent to the Syslog server; successes are routed to the success relationship and failures are routed to the failure relationship.
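
Following the format described above (the values are illustrative), a priority of 34, version 1, an RFC5424 timestamp, a hostname, and a body would be combined into a message such as:

    <34>1 2024-05-01T12:30:45.003Z host.example.com Service started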

Tags: syslog, put, udp, tcp, logs

Properties

Hostname

The IP address or hostname of the Syslog server.

Protocol

The protocol for Syslog communication.

Port

The port for Syslog communication. Note that Expression language is not evaluated per FlowFile.

Max Size of Socket Send Buffer

The maximum size of the socket send buffer that should be used. This is a suggestion to the Operating System to indicate how big the socket buffer should be. If this value is set too low, the buffer may fill up before the data can be read, and incoming data will be dropped.

SSL Context Service

The Controller Service to use in order to obtain an SSL Context. If this property is set, syslog messages will be sent over a secure connection.

Idle Connection Expiration

The amount of time a connection should be held open without being used before closing the connection.

Timeout

The timeout for connecting to and communicating with the syslog server. Does not apply to UDP. Note that Expression language is not evaluated per FlowFile.

Batch Size

The number of incoming FlowFiles to process in a single execution of this processor.

Character Set

Specifies the character set of the Syslog messages. Note that Expression language is not evaluated per FlowFile.

Message Priority

The priority for the Syslog messages, excluding < >.

Message Version

The version for the Syslog messages.

Message Timestamp

The timestamp for the Syslog messages. The timestamp can be an RFC5424 timestamp with a format of "yyyy-MM-dd’T’HH:mm:ss.SZ" or "yyyy-MM-dd’T’HH:mm:ss.S+hh:mm", or it can be an RFC3164 timestamp with a format of "MMM d HH:mm:ss".

Message Hostname

The hostname for the Syslog messages.

Message Body

The body for the Syslog messages.

Relationships

  • success: FlowFiles that are sent successfully to Syslog are sent out this relationship.

  • failure: FlowFiles that failed to send to Syslog are sent out this relationship.

  • invalid: FlowFiles that do not form a valid Syslog message are sent out this relationship.

Input Requirement

This component requires an incoming relationship.

PutTCP

Sends serialized FlowFiles or Records over TCP to a configurable destination with optional support for TLS

Tags: remote, egress, put, tcp

Properties

Hostname

Destination hostname or IP address

Port

Destination port number

Max Size of Socket Send Buffer

The maximum size of the socket send buffer that should be used. This is a suggestion to the Operating System to indicate how big the socket buffer should be. If this value is set too low, the buffer may fill up before the data can be read, and incoming data will be dropped.

Idle Connection Expiration

The amount of time a connection should be held open without being used before closing the connection. A value of 0 seconds will disable this feature.

Timeout

The timeout for connecting to and communicating with the destination. Does not apply to UDP

Connection Per FlowFile

Specifies whether to send each FlowFile’s content on an individual connection.

SSL Context Service

Specifies the SSL Context Service to enable TLS socket communication

Transmission Strategy

Specifies the strategy used for reading input FlowFiles and transmitting messages to the destination socket address

Outgoing Message Delimiter

Specifies the delimiter to use when sending messages out over the same TCP stream. The delimiter is appended to each FlowFile message that is transmitted over the stream so that the receiver can determine when one message ends and the next message begins. Users should ensure that the FlowFile content does not contain the delimiter character to avoid errors. In order to use a new line character you can enter '\n'. For a tab character use '\t'. Finally for a carriage return use '\r'.
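
For example (illustrative), to send newline-delimited messages over a shared connection:

Outgoing Message Delimiter : \n

Each transmitted FlowFile is then followed by a newline, so a line-oriented receiver can tell where one message ends and the next begins.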

Character Set

Specifies the character set of the data being sent.

Record Reader

Specifies the Controller Service to use for reading Records from input FlowFiles

Record Writer

Specifies the Controller Service to use for writing Records to the configured socket address

Relationships

  • success: FlowFiles that are sent successfully to the destination are sent out this relationship.

  • failure: FlowFiles that failed to send to the destination are sent out this relationship.

Writes Attributes

  • record.count.transmitted: Count of records transmitted to configured destination address

Input Requirement

This component requires an incoming relationship.

See Also

PutUDP

The PutUDP processor receives a FlowFile and packages the FlowFile content into a single UDP datagram packet which is then transmitted to the configured UDP server. The user must ensure that the FlowFile content being fed to this processor is not larger than the maximum size for the underlying UDP transport. The maximum transport size will vary based on the platform setup but is generally just under 64KB. FlowFiles will be marked as failed if their content is larger than the maximum transport size.

Tags: remote, egress, put, udp

Properties

Hostname

Destination hostname or IP address

Port

Destination port number

Max Size of Socket Send Buffer

The maximum size of the socket send buffer that should be used. This is a suggestion to the Operating System to indicate how big the socket buffer should be. If this value is set too low, the buffer may fill up before the data can be read, and incoming data will be dropped.

Idle Connection Expiration

The amount of time a connection should be held open without being used before closing the connection. A value of 0 seconds will disable this feature.

Timeout

The timeout for connecting to and communicating with the destination. Does not apply to UDP

Relationships

  • success: FlowFiles that are sent successfully to the destination are sent out this relationship.

  • failure: FlowFiles that failed to send to the destination are sent out this relationship.

Input Requirement

This component requires an incoming relationship.

See Also

PutWebSocket

Sends messages to a WebSocket remote endpoint using a WebSocket session that is established by either ListenWebSocket or ConnectWebSocket.

Tags: WebSocket, publish, send

Properties

WebSocket Session Id

A NiFi Expression to retrieve the session id. If not specified, a message will be sent to all connected WebSocket peers for the WebSocket controller service endpoint.
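
For example, a common configuration (shown here only as an illustration) references the session id attribute written by ConnectWebSocket or ListenWebSocket on the incoming FlowFile:

WebSocket Session Id : ${websocket.session.id}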

WebSocket ControllerService Id

A NiFi Expression to retrieve the id of a WebSocket ControllerService.

WebSocket Endpoint Id

A NiFi Expression to retrieve the endpoint id of a WebSocket ControllerService.

WebSocket Message Type

The type of message content: TEXT or BINARY

Relationships

  • success: FlowFiles that are sent successfully to the destination are transferred to this relationship.

  • failure: FlowFiles that failed to send to the destination are transferred to this relationship.

Writes Attributes

  • websocket.controller.service.id: WebSocket Controller Service id.

  • websocket.session.id: Established WebSocket session id.

  • websocket.endpoint.id: WebSocket endpoint id.

  • websocket.message.type: TEXT or BINARY.

  • websocket.local.address: WebSocket server address.

  • websocket.remote.address: WebSocket client address.

  • websocket.failure.detail: Detail of the failure.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

PutZendeskTicket

Create Zendesk tickets using the Zendesk API.

Tags: zendesk, ticket

Properties

Web Client Service Provider

Controller service for HTTP client operations.

Subdomain Name

Name of the Zendesk subdomain.

User Name

Login user to Zendesk subdomain.

Authentication Type

Type of authentication to Zendesk API.

Authentication Credential

Password or authentication token for Zendesk login user.

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema.

Comment Body

The content or the path to the comment body in the incoming record.

Subject

The content or the path to the subject in the incoming record.

Priority

The content or the path to the priority in the incoming record.

Type

The content or the path to the type in the incoming record.

Dynamic Properties

The path in the request object to add. The value needs to be a valid JSON Pointer.

Additional property to be added to the Zendesk request object.

Relationships

  • success: For FlowFiles created as a result of a successful HTTP request.

  • failure: A FlowFile is routed to this relationship if the operation failed and retrying the operation will also fail, such as invalid data or schema.

Writes Attributes

  • record.count: The number of records processed.

  • error.code: The error code from the response.

  • error.message: The error message from the response.

Input Requirement

This component allows an incoming relationship.

Additional Details

Description

The processor uses the Zendesk API to ingest tickets into Zendesk. It can either send requests built directly from the FlowFile content or construct the request objects from the incoming records using a RecordReader.

Authentication

Zendesk API uses basic authentication. Either a password or an authentication token has to be provided. In Zendesk API Settings, it’s possible to generate authentication tokens, eliminating the need for users to expose their passwords. This approach also offers the advantage of fast token revocation when required.

Property values

There are multiple ways of providing property values to the request object:

Record Path:

The property value is going to be evaluated as a record path if the value is provided inside brackets starting with a ‘%’.

Example:

The incoming record looks like this:

{
  "record": {
    "description": "This is a sample description.",
    "issue\_type": "Immediate",
    "issue": {
      "name": "General error",
      "type": "Immediate"
    },
    "project": {
      "name": "Maintenance"
    }
  }
}

We are going to provide Record Path values for the Comment Body, Subject, Priority and Type processor attributes:

Comment Body : %{/record/description}
Subject : %{/record/issue/name}
Priority : %{/record/issue/type}
Type : %{/record/project/name}

The constructed request object that is going to be sent to the Zendesk API will look like this:

{
  "comment": {
    "body": "This is a sample description."
  },
  "subject": "General error",
  "priority": "Immediate",
  "type": "Maintenance"
}

Constant:

The property value is treated as a constant if the provided value doesn’t match the Record Path format.

Example:

We are going to provide constant values for the Comment Body, Subject, Priority and Type processor attributes:

Comment Body : Sample description
Subject : Sample subject
Priority : High
Type : Sample type

The constructed request object that is going to be sent to the Zendesk API will look like this:

{
  "comment": {
    "body": "Sample description"
  },
  "subject": "Sample subject",
  "priority": "High",
  "type": "Sample type"
}

Additional properties

The processor offers a set of frequently used Zendesk ticket attributes within its property list. However, users have the flexibility to include any number of additional properties using dynamic properties. The key of a dynamic property is a JSON Pointer that denotes the path within the request object, and its value is resolved in the same way as the predefined properties (as a Record Path or a constant). The possible Zendesk request attributes can be found in the Zendesk API documentation.

Property Key values:

The dynamic property key must be a valid Json Pointer value which has the following syntax rules:

  • The path starts with /.

  • Each segment is separated by /.

  • Each segment can be interpreted as either an array index or an object key.

Example:

We are going to add a new dynamic property to the processor:

/request/new_object : This is a new property
/request/new_array/0 : This is a new array element

The constructed request object will look like this:

{
  "request": {
    "new_object": "This is a new property",
    "new_array": [
      "This is a new array element"
    ]
  }
}

QueryAirtableTable

Query records from an Airtable table. Records are incrementally retrieved based on the last modified time of the records. Records can also be further filtered by setting the 'Custom Filter' property which supports the formulas provided by the Airtable API. This processor is intended to be run on the Primary Node only.

Tags: airtable, query, database

Properties

API URL

The URL for the Airtable REST API including the domain and the path to the API (e.g. https://api.airtable.com/v0).

Personal Access Token

The Personal Access Token (PAT) to use in queries. Should be generated on Airtable’s account page.

Base ID

The ID of the Airtable base to be queried.

Table ID

The name or the ID of the Airtable table to be queried.

Fields

Comma-separated list of fields to query from the table. Both the field’s name and ID can be used.

Custom Filter

Filter records by Airtable’s formulas.
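
For example, a formula such as the following (illustrative; the field names are hypothetical) limits the query to open, high-priority records:

AND({Status} = 'Open', {Priority} = 'High')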

Query Time Window Lag

The amount of lag to be applied to the query time window’s end point. Set this property to avoid missing records when the clocks of your local machine and the Airtable servers are not in sync. Must be greater than or equal to 1 second.

Web Client Service Provider

Web Client Service Provider to use for Airtable REST API requests

Query Page Size

Number of records to be fetched in a page. Should be between 1 and 100, inclusive.

Max Records Per FlowFile

The maximum number of result records that will be included in a single FlowFile. This will allow you to break up very large result sets into multiple FlowFiles. If no value is specified, then all records are returned in a single FlowFile.

Relationships

  • success: For FlowFiles created as a result of a successful query.

Writes Attributes

  • record.count: Sets the number of records in the FlowFile.

  • fragment.identifier: If 'Max Records Per FlowFile' is set then all FlowFiles from the same query result set will have the same value for the fragment.identifier attribute. This can then be used to correlate the results.

  • fragment.count: If 'Max Records Per FlowFile' is set then this is the total number of FlowFiles produced by a single ResultSet. This can be used in conjunction with the fragment.identifier attribute in order to know how many FlowFiles belonged to the same incoming ResultSet.

  • fragment.index: If 'Max Records Per FlowFile' is set then the position of this FlowFile in the list of outgoing FlowFiles that were all derived from the same result set FlowFile. This can be used in conjunction with the fragment.identifier attribute to know which FlowFiles originated from the same query result set and in what order FlowFiles were produced

Stateful

Scope: Cluster

The last successful query’s time is stored in order to enable incremental loading. The initial query returns all the records in the table and each subsequent query filters the records by their last modified time. In other words, if a record is updated after the last successful query, only the updated records will be returned in the next query. State is stored across the cluster, so this Processor can run only on the Primary Node and if a new Primary Node is selected, the new node can pick up where the previous one left off without duplicating the data.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Description

Airtable is a spreadsheet-database hybrid. In Airtable an application is called a base and each base can have multiple tables. A table consists of records (rows) and each record can have multiple fields (columns). The QueryAirtableTable processor can query records from a single base and table via Airtable’s REST API. The processor utilizes streams to be able to handle a large number of records. It can also split large record sets into multiple FlowFiles, just like a database processor.

Personal Access Token

Please note that API Keys have been deprecated; Airtable now provides Personal Access Tokens (PATs) instead. Airtable REST API calls require a PAT (Personal Access Token) that needs to be passed in a request. An Airtable account is required to generate the PAT.

API rate limit

The Airtable REST API limits the number of requests that can be sent on a per-base basis to avoid bottlenecks. Currently, this limit is 5 requests per second per base. If this limit is exceeded you can’t make another request for 30 seconds. It’s your responsibility to handle this rate limit by configuring Yield Duration and Run Schedule properly. It is recommended to start off with the default settings and to increase both parameters when rate limit issues occur.

Metadata API

Currently, the Metadata API of Airtable is unstable, and we don’t provide a way to use it. Until it becomes stable, you can set up a ConvertRecord or MergeRecord processor with a JsonTreeReader to read the content and convert it into a Record with a schema.

QueryAzureDataExplorer

Query Azure Data Explorer and stream JSON results to output FlowFiles

Tags: Azure, Data, Explorer, ADX, Kusto

Properties

Kusto Query Service

Azure Data Explorer Kusto Query Service

Database Name

Azure Data Explorer Database Name for querying

Query

Query to be run against Azure Data Explorer

Relationships

  • success: FlowFiles containing results of a successful Query

  • failure: FlowFiles containing original input associated with a failed Query

Writes Attributes

  • query.error.message: Azure Data Explorer query error message on failures

  • query.executed: Azure Data Explorer query executed

  • mime.type: Content Type set to application/json

Input Requirement

This component requires an incoming relationship.

QueryDatabaseTable

Generates a SQL select query, or uses a provided statement, and executes it to fetch all rows whose values in the specified Maximum Value column(s) are larger than the previously-seen maxima. Query result will be converted to Avro format. Expression Language is supported for several properties, but no incoming connections are permitted. The Environment/System properties may be used to provide values for any property containing Expression Language. If it is desired to leverage flow file attributes to perform these queries, the GenerateTableFetch and/or ExecuteSQL processors can be used for this purpose. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer or cron expression, using the standard scheduling methods. This processor is intended to be run on the Primary Node only. FlowFile attribute 'querydbtable.row.count' indicates how many rows were selected.

Tags: sql, select, jdbc, query, database

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain a connection to the database.

Database Type

Database Type for generating statements specific to a particular service or vendor. The Generic Type supports most cases but selecting a specific type enables optimal processing or additional features.

Database Dialect Service

Database Dialect Service for generating statements specific to a particular service or vendor.

Table Name

The name of the database table to be queried. When a custom query is used, this property is used to alias the query and appears as an attribute on the FlowFile.

Columns to Return

A comma-separated list of column names to be used in the query. If your database requires special treatment of the names (quoting, e.g.), each name should include such treatment. If no column names are supplied, all columns in the specified table will be returned. NOTE: It is important to use consistent column names for a given table for incremental fetch to work properly.

Additional WHERE clause

A custom clause to be added in the WHERE condition when building SQL queries.

Custom Query

A custom SQL query used to retrieve data. Instead of building a SQL query from other properties, this query will be wrapped as a sub-query. Query must have no ORDER BY statement.

Maximum-value Columns

A comma-separated list of column names. The processor will keep track of the maximum value for each column that has been returned since the processor started running. Using multiple columns implies an order to the column list, and each column’s values are expected to increase more slowly than the previous columns' values. Thus, using multiple columns implies a hierarchical structure of columns, which is usually used for partitioning tables. This processor can be used to retrieve only those rows that have been added/updated since the last retrieval. Note that some JDBC types such as bit/boolean are not conducive to maintaining maximum value, so columns of these types should not be listed in this property, and will result in error(s) during processing. If no columns are provided, all rows from the table will be considered, which could have a performance impact. NOTE: It is important to use consistent max-value column names for a given table for incremental fetch to work properly.
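
As a rough sketch (the table and column names are hypothetical), with Maximum-value Columns set to last_modified the first run fetches every row, and each later run issues a query of approximately this shape, using the highest value observed so far:

SELECT * FROM orders WHERE last_modified > '2024-05-01 12:00:00'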

Initial Load Strategy

How to handle existing rows in the database table when the processor is started for the first time (or its state has been cleared). The property will be ignored if any 'initial.maxvalue.*' dynamic property has also been configured.

Max Wait Time

The maximum amount of time allowed for a running SQL select query; zero means there is no limit. Max time less than 1 second will be equal to zero.

Fetch Size

The number of result rows to be fetched from the result set at a time. This is a hint to the database driver and may not be honored and/or exact. If the value specified is zero, then the hint is ignored. If using PostgreSQL, then 'Set Auto Commit' must be equal to 'false' to cause 'Fetch Size' to take effect.

Set Auto Commit

Allows enabling or disabling the auto commit functionality of the DB connection. Default value is 'No value set'. 'No value set' will leave the db connection’s auto commit mode unchanged. For some JDBC drivers such as the PostgreSQL driver, it is required to disable the auto commit functionality to get the 'Fetch Size' setting to take effect. When auto commit is enabled, the PostgreSQL driver ignores the 'Fetch Size' setting and loads all rows of the result set into memory at once. This could lead to a large amount of memory usage when executing queries which fetch large data sets. More details of this behaviour in the PostgreSQL driver can be found at https://jdbc.postgresql.org//documentation/head/query.html.
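
For example, when reading a large table through the PostgreSQL driver, a combination such as the following (values illustrative) keeps memory usage bounded by streaming the result set:

Set Auto Commit : false
Fetch Size : 10000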

Max Rows Per Flow File

The maximum number of result rows that will be included in a single FlowFile. This will allow you to break up very large result sets into multiple FlowFiles. If the value specified is zero, then all rows are returned in a single FlowFile.

Output Batch Size

The number of output FlowFiles to queue before committing the process session. When set to zero, the session will be committed when all result set rows have been processed and the output FlowFiles are ready for transfer to the downstream relationship. For large result sets, this can cause a large burst of FlowFiles to be transferred at the end of processor execution. If this property is set, then when the specified number of FlowFiles are ready for transfer, the session will be committed, thus releasing the FlowFiles to the downstream relationship. NOTE: The maxvalue.* and fragment.count attributes will not be set on FlowFiles when this property is set.

Maximum Number of Fragments

The maximum number of fragments. If the value specified is zero, then all fragments are returned. This prevents OutOfMemoryError when this processor ingests a huge table. NOTE: Setting this property can result in data loss, as the incoming results are not ordered, and fragments may end at arbitrary boundaries where rows are not included in the result set.

Normalize Table/Column Names

Whether to change non-Avro-compatible characters in column names to Avro-compatible characters. For example, colons and periods will be changed to underscores in order to build a valid Avro record.
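
For example, a column named order.total (hypothetical) would be written as order_total in the resulting Avro schema.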

Transaction Isolation Level

This setting will set the transaction isolation level for the database connection for drivers that support this setting

Use Avro Logical Types

Whether to use Avro Logical Types for DECIMAL/NUMBER, DATE, TIME and TIMESTAMP columns. If disabled, written as string. If enabled, Logical types are used and written as its underlying type, specifically, DECIMAL/NUMBER as logical 'decimal': written as bytes with additional precision and scale meta data, DATE as logical 'date-millis': written as int denoting days since Unix epoch (1970-01-01), TIME as logical 'time-millis': written as int denoting milliseconds since Unix epoch, and TIMESTAMP as logical 'timestamp-millis': written as long denoting milliseconds since Unix epoch. If a reader of written Avro records also knows these logical types, then these values can be deserialized with more context depending on reader implementation.

Default Decimal Precision

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'precision' denoting number of available digits is required. Generally, precision is defined by column data type definition or database engines default. However undefined precision (0) can be returned from some database engines. 'Default Decimal Precision' is used when writing those undefined precision numbers.

Default Decimal Scale

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'scale' denoting number of available decimal digits is required. Generally, scale is defined by column data type definition or database engines default. However when undefined precision (0) is returned, scale can also be uncertain with some database engines. 'Default Decimal Scale' is used when writing those undefined numbers. If a value has more decimals than specified scale, then the value will be rounded-up, e.g. 1.53 becomes 2 with scale 0, and 1.5 with scale 1.

Dynamic Properties

initial.maxvalue.<max_value_column>

Specifies an initial max value for max value column(s). Properties should be added in the format initial.maxvalue.<max_value_column>. This value is only used the first time the table is accessed (when a Maximum Value Column is specified).
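
For example (the column name and value are hypothetical), the following dynamic property makes the first query fetch only rows whose id is greater than 1000:

initial.maxvalue.id : 1000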

Relationships

  • success: Successfully created FlowFile from SQL query result set.

Writes Attributes

  • tablename: Name of the table being queried

  • querydbtable.row.count: The number of rows selected by the query

  • fragment.identifier: If 'Max Rows Per Flow File' is set then all FlowFiles from the same query result set will have the same value for the fragment.identifier attribute. This can then be used to correlate the results.

  • fragment.count: If 'Max Rows Per Flow File' is set then this is the total number of FlowFiles produced by a single ResultSet. This can be used in conjunction with the fragment.identifier attribute in order to know how many FlowFiles belonged to the same incoming ResultSet. If Output Batch Size is set, then this attribute will not be populated.

  • fragment.index: If 'Max Rows Per Flow File' is set then the position of this FlowFile in the list of outgoing FlowFiles that were all derived from the same result set FlowFile. This can be used in conjunction with the fragment.identifier attribute to know which FlowFiles originated from the same query result set and in what order FlowFiles were produced

  • maxvalue.*: Each attribute contains the observed maximum value of a specified 'Maximum-value Column'. The suffix of the attribute is the name of the column. If Output Batch Size is set, then this attribute will not be populated.

Stateful

Scope: Cluster

After performing a query on the specified table, the maximum values for the specified column(s) will be retained for use in future executions of the query. This allows the Processor to fetch only those records that have max values greater than the retained values. This can be used for incremental fetching, fetching of newly added rows, etc. To clear the maximum values, clear the state of the processor per the State Management documentation

Input Requirement

This component does not allow an incoming relationship.

QueryDatabaseTableRecord

Generates a SQL select query, or uses a provided statement, and executes it to fetch all rows whose values in the specified Maximum Value column(s) are larger than the previously-seen maxima. Query result will be converted to the format specified by the record writer. Expression Language is supported for several properties, but no incoming connections are permitted. The Environment/System properties may be used to provide values for any property containing Expression Language. If it is desired to leverage flow file attributes to perform these queries, the GenerateTableFetch and/or ExecuteSQL processors can be used for this purpose. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer or cron expression, using the standard scheduling methods. This processor is intended to be run on the Primary Node only. FlowFile attribute 'querydbtable.row.count' indicates how many rows were selected.

Use Cases

Retrieve all rows from a database table.

Keywords: jdbc, rdbms, cdc, database, table, stream

Input Requirement: This component allows an incoming relationship.

  1. Configure the "Database Connection Pooling Service" to specify a Connection Pooling Service so that the Processor knows how to connect to the database.

  2. Set the "Database Type" property to the type of database to query, or "Generic" if the database vendor is not listed.

  3. Set the "Table Name" property to the name of the table to retrieve records from.

  4. Configure the "Record Writer" to specify a Record Writer that is appropriate for the desired output format.

  5. Set the "Maximum-value Columns" property to a comma-separated list of columns whose values can be used to determine which values are new. For example, this might be set to

  6. an id column that is a one-up number, or a last_modified column that is a timestamp of when the row was last modified.

  7. Set the "Initial Load Strategy" property to "Start at Beginning".

  8. Set the "Fetch Size" to a number that avoids loading too much data into memory on the NiFi side. For example, a value of 1000 will load up to 1,000 rows of data.

  9. Set the "Max Rows Per Flow File" to a value that allows efficient processing, such as 1000 or 10000.

  10. Set the "Output Batch Size" property to a value greater than 0. A smaller value, such as 1 or even 20 will result in lower latency but also slightly lower throughput.

  11. A larger value such as 1000 will result in higher throughput but also higher latency. It is not recommended to set the value larger than 1000 as it can cause significant

  12. memory utilization. .
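
A minimal sketch of such a configuration, assuming a hypothetical orders table with a one-up id column and already-configured DBCPConnectionPool and JsonRecordSetWriter services:

Database Connection Pooling Service : DBCPConnectionPool
Database Type : Generic
Table Name : orders
Record Writer : JsonRecordSetWriter
Maximum-value Columns : id
Initial Load Strategy : Start at Beginning
Fetch Size : 1000
Max Rows Per Flow File : 10000
Output Batch Size : 20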

Perform an incremental load of a single database table, fetching only new rows as they are added to the table.

Keywords: incremental load, rdbms, jdbc, cdc, database, table, stream

Input Requirement: This component allows an incoming relationship.

  1. Configure the "Database Connection Pooling Service" to specify a Connection Pooling Service so that the Processor knows how to connect to the database.

  2. Set the "Database Type" property to the type of database to query, or "Generic" if the database vendor is not listed.

  3. Set the "Table Name" property to the name of the table to retrieve records from.

  4. Configure the "Record Writer" to specify a Record Writer that is appropriate for the desired output format.

  5. Set the "Maximum-value Columns" property to a comma-separated list of columns whose values can be used to determine which values are new. For example, this might be set to

  6. an id column that is a one-up number, or a last_modified column that is a timestamp of when the row was last modified.

  7. Set the "Initial Load Strategy" property to "Start at Current Maximum Values".

  8. Set the "Fetch Size" to a number that avoids loading too much data into memory on the NiFi side. For example, a value of 1000 will load up to 1,000 rows of data.

  9. Set the "Max Rows Per Flow File" to a value that allows efficient processing, such as 1000 or 10000.

  10. Set the "Output Batch Size" property to a value greater than 0. A smaller value, such as 1 or even 20 will result in lower latency but also slightly lower throughput.

  11. A larger value such as 1000 will result in higher throughput but also higher latency. It is not recommended to set the value larger than 1000 as it can cause significant

  12. memory utilization. .

Multi-Processor Use Cases

Perform an incremental load of multiple database tables, fetching only new rows as they are added to the tables.

Keywords: incremental load, rdbms, jdbc, cdc, database, table, stream

QueryDatabaseTableRecord:

  1. Configure the "Database Connection Pooling Service" to the same Connection Pool that was used in ListDatabaseTables.

  2. Set the "Database Type" property to the type of database to query, or "Generic" if the database vendor is not listed.

  3. Set the "Table Name" property to "${db.table.fullname}"

  4. Configure the "Record Writer" to specify a Record Writer that is appropriate for the desired output format.

  5. Set the "Maximum-value Columns" property to a comma-separated list of columns whose values can be used to determine which values are new. For example, this might be set to

  6. an id column that is a one-up number, or a last_modified column that is a timestamp of when the row was last modified.

  7. Set the "Initial Load Strategy" property to "Start at Current Maximum Values".

  8. Set the "Fetch Size" to a number that avoids loading too much data into memory on the NiFi side. For example, a value of 1000 will load up to 1,000 rows of data.

  9. Set the "Max Rows Per Flow File" to a value that allows efficient processing, such as 1000 or 10000.

  10. Set the "Output Batch Size" property to a value greater than 0. A smaller value, such as 1 or even 20 will result in lower latency but also slightly lower throughput.

  11. A larger value such as 1000 will result in higher throughput but also higher latency. It is not recommended to set the value larger than 1000 as it can cause significant

  12. memory utilization. .

ListDatabaseTables:

  1. Configure the "Database Connection Pooling Service" property to specify a Connection Pool that is applicable for interacting with your database. .

  2. Set the "Catalog" property to the name of the database Catalog;

  3. set the "Schema Pattern" property to a Java Regular Expression that matches all database Schemas that should be included; and

  4. set the "Table Name Pattern" property to a Java Regular Expression that matches the names of all tables that should be included.

  5. In order to perform an incremental load of all tables, leave the Catalog, Schema Pattern, and Table Name Pattern unset. .

  6. Leave the RecordWriter property unset. .

  7. Connect the 'success' relationship to QueryDatabaseTableRecord. .

Tags: sql, select, jdbc, query, database, record

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain a connection to the database.

Database Type

Database Type for generating statements specific to a particular service or vendor. The Generic Type supports most cases but selecting a specific type enables optimal processing or additional features.

Database Dialect Service

Database Dialect Service for generating statements specific to a particular service or vendor.

Table Name

The name of the database table to be queried. When a custom query is used, this property is used to alias the query and appears as an attribute on the FlowFile.

Columns to Return

A comma-separated list of column names to be used in the query. If your database requires special treatment of the names (quoting, e.g.), each name should include such treatment. If no column names are supplied, all columns in the specified table will be returned. NOTE: It is important to use consistent column names for a given table for incremental fetch to work properly.

Additional WHERE clause

A custom clause to be added in the WHERE condition when building SQL queries.

Custom Query

A custom SQL query used to retrieve data. Instead of building a SQL query from other properties, this query will be wrapped as a sub-query. Query must have no ORDER BY statement.

Record Writer

Specifies the Controller Service to use for writing results to a FlowFile. The Record Writer may use Inherit Schema to emulate the inferred schema behavior, i.e. an explicit schema need not be defined in the writer, and will be supplied by the same logic used to infer the schema from the column types.

Maximum-value Columns

A comma-separated list of column names. The processor will keep track of the maximum value for each column that has been returned since the processor started running. Using multiple columns implies an order to the column list, and each column’s values are expected to increase more slowly than the previous columns' values. Thus, using multiple columns implies a hierarchical structure of columns, which is usually used for partitioning tables. This processor can be used to retrieve only those rows that have been added/updated since the last retrieval. Note that some JDBC types such as bit/boolean are not conducive to maintaining maximum value, so columns of these types should not be listed in this property, and will result in error(s) during processing. If no columns are provided, all rows from the table will be considered, which could have a performance impact. NOTE: It is important to use consistent max-value column names for a given table for incremental fetch to work properly.

Initial Load Strategy

How to handle existing rows in the database table when the processor is started for the first time (or its state has been cleared). The property will be ignored if any 'initial.maxvalue.*' dynamic property has also been configured.

Max Wait Time

The maximum amount of time allowed for a running SQL select query; zero means there is no limit. Max time less than 1 second will be equal to zero.

Fetch Size

The number of result rows to be fetched from the result set at a time. This is a hint to the database driver and may not be honored and/or exact. If the value specified is zero, then the hint is ignored. If using PostgreSQL, then 'Set Auto Commit' must be equal to 'false' to cause 'Fetch Size' to take effect.

Set Auto Commit

Allows enabling or disabling the auto commit functionality of the DB connection. Default value is 'No value set'. 'No value set' will leave the db connection’s auto commit mode unchanged. For some JDBC drivers such as the PostgreSQL driver, it is required to disable the auto commit functionality to get the 'Fetch Size' setting to take effect. When auto commit is enabled, the PostgreSQL driver ignores the 'Fetch Size' setting and loads all rows of the result set into memory at once. This could lead to a large amount of memory usage when executing queries which fetch large data sets. More details of this behaviour in the PostgreSQL driver can be found at https://jdbc.postgresql.org//documentation/head/query.html.

Max Rows Per Flow File

The maximum number of result rows that will be included in a single FlowFile. This will allow you to break up very large result sets into multiple FlowFiles. If the value specified is zero, then all rows are returned in a single FlowFile.

Output Batch Size

The number of output FlowFiles to queue before committing the process session. When set to zero, the session will be committed when all result set rows have been processed and the output FlowFiles are ready for transfer to the downstream relationship. For large result sets, this can cause a large burst of FlowFiles to be transferred at the end of processor execution. If this property is set, then when the specified number of FlowFiles are ready for transfer, the session will be committed, thus releasing the FlowFiles to the downstream relationship. NOTE: The maxvalue.* and fragment.count attributes will not be set on FlowFiles when this property is set.

Maximum Number of Fragments

The maximum number of fragments. If the value specified is zero, then all fragments are returned. This prevents OutOfMemoryError when this processor ingests a huge table. NOTE: Setting this property can result in data loss, as the incoming results are not ordered, and fragments may end at arbitrary boundaries where rows are not included in the result set.

Normalize Table/Column Names

Whether to change characters in column names when creating the output schema. For example, colons and periods will be changed to underscores.

Use Avro Logical Types

Whether to use Avro Logical Types for DECIMAL/NUMBER, DATE, TIME and TIMESTAMP columns. If disabled, written as string. If enabled, Logical types are used and written as its underlying type, specifically, DECIMAL/NUMBER as logical 'decimal': written as bytes with additional precision and scale meta data, DATE as logical 'date-millis': written as int denoting days since Unix epoch (1970-01-01), TIME as logical 'time-millis': written as int denoting milliseconds since Unix epoch, and TIMESTAMP as logical 'timestamp-millis': written as long denoting milliseconds since Unix epoch. If a reader of written Avro records also knows these logical types, then these values can be deserialized with more context depending on reader implementation.

Default Decimal Precision

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'precision' denoting number of available digits is required. Generally, precision is defined by column data type definition or database engines default. However undefined precision (0) can be returned from some database engines. 'Default Decimal Precision' is used when writing those undefined precision numbers.

Default Decimal Scale

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'scale' denoting number of available decimal digits is required. Generally, scale is defined by column data type definition or database engines default. However when undefined precision (0) is returned, scale can also be uncertain with some database engines. 'Default Decimal Scale' is used when writing those undefined numbers. If a value has more decimals than specified scale, then the value will be rounded-up, e.g. 1.53 becomes 2 with scale 0, and 1.5 with scale 1.

Dynamic Properties

initial.maxvalue.<max_value_column>

Specifies an initial max value for max value column(s). Properties should be added in the format initial.maxvalue.<max_value_column>. This value is only used the first time the table is accessed (when a Maximum Value Column is specified).

Relationships

  • success: Successfully created FlowFile from SQL query result set.

Writes Attributes

  • tablename: Name of the table being queried

  • querydbtable.row.count: The number of rows selected by the query

  • fragment.identifier: If 'Max Rows Per Flow File' is set then all FlowFiles from the same query result set will have the same value for the fragment.identifier attribute. This can then be used to correlate the results.

  • fragment.count: If 'Max Rows Per Flow File' is set then this is the total number of FlowFiles produced by a single ResultSet. This can be used in conjunction with the fragment.identifier attribute in order to know how many FlowFiles belonged to the same incoming ResultSet. If Output Batch Size is set, then this attribute will not be populated.

  • fragment.index: If 'Max Rows Per Flow File' is set then the position of this FlowFile in the list of outgoing FlowFiles that were all derived from the same result set FlowFile. This can be used in conjunction with the fragment.identifier attribute to know which FlowFiles originated from the same query result set and in what order FlowFiles were produced

  • maxvalue.*: Each attribute contains the observed maximum value of a specified 'Maximum-value Column'. The suffix of the attribute is the name of the column. If Output Batch Size is set, then this attribute will not be populated.

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer.

  • record.count: The number of records output by the Record Writer.

Stateful

Scope: Cluster

After performing a query on the specified table, the maximum values for the specified column(s) will be retained for use in future executions of the query. This allows the Processor to fetch only those records that have max values greater than the retained values. This can be used for incremental fetching, fetching of newly added rows, etc. To clear the maximum values, clear the state of the processor per the State Management documentation

Input Requirement

This component does not allow an incoming relationship.

QueryQdrant

Queries Qdrant in order to gather a specified number of documents that are most closely related to the given query.

Use Cases

Semantically search for documents stored in Qdrant - https://qdrant.tech/

Keywords: qdrant, embedding, vector, text, vectorstore, search

  1. Configure 'Collection Name' to the name of the Qdrant collection to use.

  2. Configure 'Qdrant URL' to the fully qualified URL of the Qdrant instance.

  3. Configure 'Qdrant API Key' to the API Key to use in order to authenticate with Qdrant.

  4. Configure 'Prefer gRPC' to True if you want to use gRPC for interfacing with Qdrant.

  5. Configure 'Use HTTPS' to True if you want to use TLS (HTTPS) while interfacing with Qdrant.

  6. Configure 'Embedding Model' to indicate whether OpenAI embeddings, Google embeddings or HuggingFace embeddings should be used: 'Hugging Face Model', 'Google Model' or 'OpenAI Model'.

  7. Configure the 'OpenAI API Key', 'Google API Key' or 'HuggingFace API Key', depending on the chosen Embedding Model.

  8. Configure 'HuggingFace Model', 'Google Model' or 'OpenAI Model' to the name of the model to use.

  9. Configure 'Query' to the text of the query to send to Qdrant.

  10. Configure 'Number of Results' to the number of results to return from Qdrant.

  11. Configure 'Metadata Filter' to apply an optional metadata filter with the query. For example: { "author": "john.doe" }

  12. Configure 'Output Strategy' to indicate how the output should be formatted: 'Row-Oriented', 'Text', or 'Column-Oriented'.

  13. Configure 'Results Field' to the name of the field into which the results should be inserted, if the input FlowFile is JSON formatted.

  14. Configure 'Include Metadatas' to True if metadata should be included in the output.

  15. Configure 'Include Distances' to True if distances should be included in the output.

Tags: qdrant, vector, vectordb, vectorstore, embeddings, ai, artificial intelligence, ml, machine learning, text, LLM

Properties

Query

The text of the query to send to Qdrant.

Number of Results

The number of results to return from Qdrant.

Metadata Filter

Optional metadata filter to apply with the query. For example: { "author": "john.doe" }

QueryRecord

Evaluates one or more SQL queries against the contents of a FlowFile. The result of the SQL query then becomes the content of the output FlowFile. This can be used, for example, for field-specific filtering, transformation, and row-level filtering. Columns can be renamed, simple calculations and aggregations performed, etc. The Processor is configured with a Record Reader Controller Service and a Record Writer service so as to allow flexibility in incoming and outgoing data formats. The Processor must be configured with at least one user-defined property. The name of the Property is the Relationship to route data to, and the value of the Property is a SQL SELECT statement that is used to specify how input data should be transformed/filtered. The SQL statement must be valid ANSI SQL and is powered by Apache Calcite. If the transformation fails, the original FlowFile is routed to the 'failure' relationship. Otherwise, the data selected will be routed to the associated relationship. If the Record Writer chooses to inherit the schema from the Record, it is important to note that the schema that is inherited will be from the ResultSet, rather than the input Record. This allows a single instance of the QueryRecord processor to have multiple queries, each of which returns a different set of columns and aggregations. As a result, though, the schema that is derived will have no schema name, so it is important that the configured Record Writer not attempt to write the Schema Name as an attribute if inheriting the Schema from the Record. See the Processor Usage documentation for more information.

Use Cases

Filter out records based on the values of the records' fields

Keywords: filter out, remove, drop, strip out, record field, sql

Input Requirement: This component allows an incoming relationship.

  1. "Record Reader" should be set to a Record Reader that is appropriate for your data.

  2. "Record Writer" should be set to a Record Writer that writes out data in the desired format. .

  3. One additional property should be added.

  4. The name of the property should be a short description of the data to keep.

  5. Its value is a SQL statement that selects all columns from a table named FLOW_FILE for relevant rows.

  6. The WHERE clause selects the data to keep. I.e., it is the exact opposite of what we want to remove.

  7. It is recommended to always quote column names using double-quotes in order to avoid conflicts with SQL keywords.

  8. For example, to remove records where either the name is George OR the age is less than 18, we would add a property named "adults not george" with a value that selects records where the name is not George AND the age is greater than or equal to 18. So the value would be SELECT * FROM FLOWFILE WHERE "name" <> 'George' AND "age" >= 18 .

  9. Adding this property now gives us a new Relationship whose name is the same as the property name. So, the "adults not george" Relationship should be connected to the next Processor in our flow. .

Keep only specific records

Keywords: keep, filter, retain, select, include, record, sql

Input Requirement: This component allows an incoming relationship.

  1. "Record Reader" should be set to a Record Reader that is appropriate for your data.

  2. "Record Writer" should be set to a Record Writer that writes out data in the desired format. .

  3. One additional property should be added.

  4. The name of the property should be a short description of the data to keep.

  5. Its value is a SQL statement that selects all columns from a table named FLOW_FILE for relevant rows.

  6. The WHERE clause selects the data to keep.

  7. It is recommended to always quote column names using double-quotes in order to avoid conflicts with SQL keywords.

  8. For example, to keep only records where the person is an adult (aged 18 or older), add a property named "adults" with a value that is a SQL statement that selects records where the age is at least 18. So the value would be SELECT * FROM FLOWFILE WHERE "age" >= 18 .

  9. Adding this property now gives us a new Relationship whose name is the same as the property name. So, the "adults" Relationship should be connected to the next Processor in our flow. .

Keep only specific fields in a Record, where the names of the fields to keep are known

Keywords: keep, filter, retain, select, include, record, fields, sql

Input Requirement: This component allows an incoming relationship.

  1. "Record Reader" should be set to a Record Reader that is appropriate for your data.

  2. "Record Writer" should be set to a Record Writer that writes out data in the desired format. .

  3. One additional property should be added.

  4. The name of the property should be a short description of the data to keep, such as relevant fields.

  5. Its value is a SQL statement that selects the desired columns from a table named FLOW_FILE for relevant rows.

  6. There is no WHERE clause.

  7. It is recommended to always quote column names using double-quotes in order to avoid conflicts with SQL keywords.

  8. For example, to keep only the name, age, and address fields, add a property named relevant fields with a value of SELECT "name", "age", "address" FROM FLOWFILE .

  9. Adding this property now gives us a new Relationship whose name is the same as the property name. So, the relevant fields Relationship should be connected to the next Processor in our flow. .

Route record-oriented data for processing based on its contents

Keywords: record, route, conditional processing, field

Input Requirement: This component allows an incoming relationship.

  1. "Record Reader" should be set to a Record Reader that is appropriate for your data.

  2. "Record Writer" should be set to a Record Writer that writes out data in the desired format. .

  3. For each route that you want to create, add a new property.

  4. The name of the property should be a short description of the data that should be selected for the route.

  5. Its value is a SQL statement that selects all columns from a table named FLOW_FILE. The WHERE clause selects the data that should be included in the route.

  6. It is recommended to always quote column names using double-quotes in order to avoid conflicts with SQL keywords. .

  7. A new outbound relationship is created for each property that is added. The name of the relationship is the same as the property name. .

  8. For example, to route data based on whether or not it is a large transaction, we would add two properties:

  9. small transaction would have a value such as SELECT * FROM FLOWFILE WHERE transactionTotal < 100

  10. large transaction would have a value of SELECT * FROM FLOWFILE WHERE transactionTotal >= 100 .

Tags: sql, query, calcite, route, record, transform, select, update, modify, etl, filter, record, csv, json, logs, text, avro, aggregate

Properties

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema

Record Writer

Specifies the Controller Service to use for writing results to a FlowFile

Include Zero Record FlowFiles

When running the SQL statement against an incoming FlowFile, if the result has no data, this property specifies whether or not a FlowFile will be sent to the corresponding relationship

Default Decimal Precision

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'precision' denoting number of available digits is required. Generally, precision is defined by column data type definition or database engines default. However undefined precision (0) can be returned from some database engines. 'Default Decimal Precision' is used when writing those undefined precision numbers.

Default Decimal Scale

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'scale' denoting number of available decimal digits is required. Generally, scale is defined by column data type definition or database engines default. However when undefined precision (0) is returned, scale can also be uncertain with some database engines. 'Default Decimal Scale' is used when writing those undefined numbers. If a value has more decimals than specified scale, then the value will be rounded-up, e.g. 1.53 becomes 2 with scale 0, and 1.5 with scale 1.

Dynamic Properties

The name of the relationship to route data to

Each user-defined property specifies a SQL SELECT statement to run over the data, with the data that is selected being routed to the relationship whose name is the property name

Relationships

  • failure: If a FlowFile fails processing for any reason (for example, the SQL statement contains columns not present in the input data), the original FlowFile will be routed to this relationship

  • original: The original FlowFile is routed to this relationship

Dynamic Relationships

  • <Property Name>: Each user-defined property defines a new Relationship for this Processor.

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records selected by the query

  • QueryRecord.Route: The relation to which the FlowFile was routed

Input Requirement

This component requires an incoming relationship.

Additional Details

SQL Over Streams

QueryRecord provides users a tremendous amount of power by leveraging an extremely well-known syntax (SQL) to route, filter, transform, and query data as it traverses the system. In order to provide the Processor with the maximum amount of flexibility, it is configured with a Controller Service that is responsible for reading and parsing the incoming FlowFiles and a Controller Service that is responsible for writing the results out. By using this paradigm, users are not forced to convert their data from one format to another just to query it, and then transform the data back into the form that they want. Rather, the appropriate Controller Service can easily be configured and put to use for the appropriate data format.

Rather than providing a single “SQL SELECT Statement” type of Property, this Processor makes use of user-defined properties. Each user-defined property that is added to the Processor has a name that becomes a new Relationship for the Processor and a corresponding SQL query that will be evaluated against each FlowFile. This allows multiple SQL queries to be run against each FlowFile.

The SQL syntax that is supported by this Processor is ANSI SQL and is powered by Apache Calcite. Please note that identifiers are quoted using double-quotes, and column names/labels are case-insensitive.

As an example, let’s consider that we have a FlowFile with the following CSV data:

name, age, title
John Doe, 34, Software Engineer
Jane Doe, 30, Program Manager
Jacob Doe, 45, Vice President
Janice Doe, 46, Vice President

Now consider that we add the following properties to the Processor:

Property Name: Engineers
Property Value: SELECT * FROM FLOWFILE WHERE title LIKE '%Engineer%'

Property Name: VP
Property Value: SELECT name FROM FLOWFILE WHERE title = 'Vice President'

Property Name: Younger Than Average
Property Value: SELECT * FROM FLOWFILE WHERE age < (SELECT AVG(age) FROM FLOWFILE)

This Processor will now have five relationships: original, failure, Engineers, VP, and Younger Than Average. If there is a failure processing the FlowFile, then the original FlowFile will be routed to failure. Otherwise, the original FlowFile will be routed to original and one FlowFile will be routed to each of the other relationships, with the following values:

Engineers:

name, age, title
John Doe, 34, Software Engineer

VP:

name
Jacob Doe
Janice Doe

Younger Than Average:

name, age, title
John Doe, 34, Software Engineer
Jane Doe, 30, Program Manager

Note that this example is intended to illustrate the data that is input to and output from the Processor. The actual format of the data may vary, depending on the configuration of the Record Reader and Record Writer that are used. For example, here we assume that we are using a CSV Reader and a CSV Writer and that both are configured to have a header line. Had we used a JSON Writer instead, the output would have contained the same information but been presented as JSON. The user is able to choose whichever input and output formats make the most sense for his or her use case. The input and output formats need not be the same.

It is also worth noting that the outbound FlowFiles have two different schemas. The Engineers and Younger Than Average FlowFiles contain 3 fields: name, age, and title, while the VP FlowFile contains only the name field. In most cases, the Record Writer is configured to use whatever Schema is provided to it by the Record (this generally means that it is configured with a Schema Access Strategy of Inherit Record Schema). In such a case, this works well. However, if a Schema is supplied to the Record Writer explicitly, it is important to ensure that the Schema accounts for all fields. If not, then the fields that are missing from the Record Writer’s schema will simply not be present in the output.

SQL Over Hierarchical Data

One important detail that we must take into account when evaluating SQL over streams of arbitrary data is how we can handle hierarchical data, such as JSON, XML, and Avro. Because SQL was developed originally for relational databases, which represent “flat” data, it is easy to understand how this maps to other “flat” data like a CSV file, or even a “flat” JSON representation where all fields are primitive types. However, users often want to evaluate SQL over JSON or Avro data that is made up of many nested values. For example, consider the following JSON as input:

{
  "name": "John Doe",
  "title": "Software Engineer",
  "age": 40,
  "addresses": [
    {
      "streetNumber": 4820,
      "street": "My Street",
      "apartment": null,
      "city": "New York",
      "state": "NY",
      "country": "USA",
      "label": "work"
    },
    {
      "streetNumber": 327,
      "street": "Small Street",
      "apartment": 309,
      "city": "Los Angeles",
      "state": "CA",
      "country": "USA",
      "label": "home"
    }
  ],
  "project": {
    "name": "Apache NiFi",
    "maintainer": {
      "id": 28302873,
      "name": "Apache Software Foundation"
    },
    "debutYear": 2014
  }
}
json

Consider a query that will select the title and name of any person who has a home address in a different state than their work address. Here, we can only select the fields name, title, age, and addresses. In this scenario, addresses represents an Array of complex objects, i.e. Records. To accommodate this, QueryRecord provides User-Defined Functions that enable RecordPath to be used. RecordPath is a simple NiFi Domain-Specific Language (DSL) that allows users to reference a nested structure.

The primary User-Defined Function that will be used is named RPATH (short for Record Path). This function expects exactly two arguments: the Record to evaluate the RecordPath against, and the RecordPath to evaluate (in that order). So, to select the title and name of any person who has a home address in a different state than their work address, we can use the following SQL statement:

SELECT title, name
FROM FLOWFILE
WHERE RPATH(addresses, '/state[/label = ''home'']') <> RPATH(addresses, '/state[/label = ''work'']')
sql

To explain this query in English, we can say that it selects the “title” and “name” fields from any Record in the FlowFile for which there is an address whose “label” is “home” and another address whose “label” is “work” and for which the two addresses have different states.

Similarly, we could select the entire Record (all fields) of any person who has a “project” whose maintainer is the Apache Software Foundation using the query:

SELECT *
FROM FLOWFILE
WHERE RPATH(project, '/maintainer/name') = 'Apache Software Foundation'
sql

There is one caveat, though, when using RecordPath: the RPATH function returns an Object, which in JDBC is represented as an OTHER type. This is fine and does not affect anything when it is used as above. However, what if we wanted to use another SQL function on the result? For example, what if we wanted to use the SQL query SELECT * FROM FLOWFILE WHERE RPATH(project, '/maintainer/name') LIKE 'Apache%'? This would fail with a very long error such as:

3860 [pool-2-thread-1] ERROR org.apache.nifi.processors.standard.QueryRecord - QueryRecord[id=135e9bc8-0372-4c1e-9c82-9d9a5bfe1261] Unable to query FlowFile[0,174730597574853.mockFlowFile,0B] due to java.lang.RuntimeException: Error while compiling generated Java code: org.apache.calcite.DataContext root;  public org.apache.calcite.linq4j.Enumerable bind(final org.apache.calcite.DataContext root0) {   root = root0;   final org.apache.calcite.linq4j.Enumerable _inputEnumerable = ((org.apache.nifi.queryrecord.FlowFileTable) root.getRootSchema().getTable("FLOWFILE")).project(new int[] {     0,     1,     2,     3});   return new org.apache.calcite.linq4j.AbstractEnumerable(){       public org.apache.calcite.linq4j.Enumerator enumerator() {         return new org.apache.calcite.linq4j.Enumerator(){             public final org.apache.calcite.linq4j.Enumerator inputEnumerator = _inputEnumerable.enumerator();             public void reset() {               inputEnumerator.reset();             }              public boolean moveNext() {               while (inputEnumerator.moveNext()) {                 final Object[] inp3_ = (Object[]) ((Object[]) inputEnumerator.current())[3];                 if (new org.apache.nifi.processors.standard.QueryRecord.ObjectRecordPath().eval(inp3_, "/state[. = 'NY']") != null && org.apache.calcite.runtime.SqlFunctions.like(new org.apache.nifi.processors.standard.QueryRecord.ObjectRecordPath().eval(inp3_, "/state[. = 'NY']"), "N%")) {                   return true;                 }               }               return false;             }              public void close() {               inputEnumerator.close();             }              public Object current() {               final Object[] current = (Object[]) inputEnumerator.current();               return new Object[] {                   current[2],                   current[0]};             }            };       }      }; }   public Class getElementType() {   return java.lang.Object[].class; }   : java.lang.RuntimeException: Error while compiling generated Java code: org.apache.calcite.DataContext root;  public org.apache.calcite.linq4j.Enumerable bind(final org.apache.calcite.DataContext root0) {   root = root0;   final org.apache.calcite.linq4j.Enumerable _inputEnumerable = ((org.apache.nifi.queryrecord.FlowFileTable) root.getRootSchema().getTable("FLOWFILE")).project(new int[] {     0,     1,     2,     3});   return new org.apache.calcite.linq4j.AbstractEnumerable(){       public org.apache.calcite.linq4j.Enumerator enumerator() {         return new org.apache.calcite.linq4j.Enumerator(){             public final org.apache.calcite.linq4j.Enumerator inputEnumerator = _inputEnumerable.enumerator();             public void reset() {               inputEnumerator.reset();             }              public boolean moveNext() {               while (inputEnumerator.moveNext()) {                 final Object[] inp3_ = (Object[]) ((Object[]) inputEnumerator.current())[3];                 if (new org.apache.nifi.processors.standard.QueryRecord.ObjectRecordPath().eval(inp3_, "/state[. = 'NY']") != null && org.apache.calcite.runtime.SqlFunctions.like(new org.apache.nifi.processors.standard.QueryRecord.ObjectRecordPath().eval(inp3_, "/state[. 
= 'NY']"), "N%")) {                   return true;                 }               }               return false;             }              public void close() {               inputEnumerator.close();             }              public Object current() {               final Object[] current = (Object[]) inputEnumerator.current();               return new Object[] {                   current[2],                   current[0]};             }            };       }      }; }   public Class getElementType() {   return java.lang.Object[].class; }    3864 [pool-2-thread-1] ERROR org.apache.nifi.processors.standard.QueryRecord - java.lang.RuntimeException: Error while compiling generated Java code: org.apache.calcite.DataContext root;  public org.apache.calcite.linq4j.Enumerable bind(final org.apache.calcite.DataContext root0) {   root = root0;   final org.apache.calcite.linq4j.Enumerable _inputEnumerable = ((org.apache.nifi.queryrecord.FlowFileTable) root.getRootSchema().getTable("FLOWFILE")).project(new int[] {     0,     1,     2,     3});   return new org.apache.calcite.linq4j.AbstractEnumerable(){       public org.apache.calcite.linq4j.Enumerator enumerator() {         return new org.apache.calcite.linq4j.Enumerator(){             public final org.apache.calcite.linq4j.Enumerator inputEnumerator = _inputEnumerable.enumerator();             public void reset() {               inputEnumerator.reset();             }              public boolean moveNext() {               while (inputEnumerator.moveNext()) {                 final Object[] inp3_ = (Object[]) ((Object[]) inputEnumerator.current())[3];                 if (new org.apache.nifi.processors.standard.QueryRecord.ObjectRecordPath().eval(inp3_, "/state[. = 'NY']") != null && org.apache.calcite.runtime.SqlFunctions.like(new org.apache.nifi.processors.standard.QueryRecord.ObjectRecordPath().eval(inp3_, "/state[. 
= 'NY']"), "N%")) {                   return true;                 }               }               return false;             }              public void close() {               inputEnumerator.close();             }              public Object current() {               final Object[] current = (Object[]) inputEnumerator.current();               return new Object[] {                   current[2],                   current[0]};             }            };       }      }; }   public Class getElementType() {   return java.lang.Object[].class; }       at org.apache.calcite.avatica.Helper.wrap(Helper.java:37)   at org.apache.calcite.adapter.enumerable.EnumerableInterpretable.toBindable(EnumerableInterpretable.java:108)   at org.apache.calcite.prepare.CalcitePrepareImpl$CalcitePreparingStmt.implement(CalcitePrepareImpl.java:1237)   at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:331)  at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:230)  at org.apache.calcite.prepare.CalcitePrepareImpl.prepare2_(CalcitePrepareImpl.java:772)     at org.apache.calcite.prepare.CalcitePrepareImpl.prepare_(CalcitePrepareImpl.java:636)  at org.apache.calcite.prepare.CalcitePrepareImpl.prepareSql(CalcitePrepareImpl.java:606)    at org.apache.calcite.jdbc.CalciteConnectionImpl.parseQuery(CalciteConnectionImpl.java:229)     at org.apache.calcite.jdbc.CalciteConnectionImpl.prepareStatement_(CalciteConnectionImpl.java:211)  at org.apache.calcite.jdbc.CalciteConnectionImpl.prepareStatement(CalciteConnectionImpl.java:200)   at org.apache.calcite.jdbc.CalciteConnectionImpl.prepareStatement(CalciteConnectionImpl.java:90)    at org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:175)    at org.apache.nifi.processors.standard.QueryRecord.buildCachedStatement(QueryRecord.java:428)   at org.apache.nifi.processors.standard.QueryRecord.getStatement(QueryRecord.java:415)   at org.apache.nifi.processors.standard.QueryRecord.queryWithCache(QueryRecord.java:475)     at org.apache.nifi.processors.standard.QueryRecord.onTrigger(QueryRecord.java:311)  at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)     at org.apache.nifi.util.StandardProcessorTestRunner$RunProcessor.call(StandardProcessorTestRunner.java:255)     at org.apache.nifi.util.StandardProcessorTestRunner$RunProcessor.call(StandardProcessorTestRunner.java:249)     at java.util.concurrent.FutureTask.run(FutureTask.java:266)     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)  at java.lang.Thread.run(Thread.java:745) Caused by: org.codehaus.commons.compiler.CompileException: Line 21, Column 180: No applicable constructor/method found for actual parameters "java.lang.Object, java.lang.String"; candidates are: "public static boolean org.apache.calcite.runtime.SqlFunctions.like(java.lang.String, java.lang.String)", "public static boolean org.apache.calcite.runtime.SqlFunctions.like(java.lang.String, java.lang.String, java.lang.String)"    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10092)   at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7506)  at 
org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7376)     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7280)     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3850)     at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:183)  at org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3251)    at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3278)  at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4345)     at org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:2842)     at org.codehaus.janino.UnitCompiler.access$4800(UnitCompiler.java:183)  at org.codehaus.janino.UnitCompiler$8.visitMethodInvocation(UnitCompiler.java:2803)     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)     at org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:2830)  at org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:2924)     at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:183)  at org.codehaus.janino.UnitCompiler$8.visitBinaryOperation(UnitCompiler.java:2797)  at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:3768)  at org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:2830)  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1742)    at org.codehaus.janino.UnitCompiler.access$1200(UnitCompiler.java:183)  at org.codehaus.janino.UnitCompiler$4.visitIfStatement(UnitCompiler.java:935)   at org.codehaus.janino.Java$IfStatement.accept(Java.java:2157)  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:956)  at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:997)    at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:983)     at org.codehaus.janino.UnitCompiler.access$1000(UnitCompiler.java:183)  at org.codehaus.janino.UnitCompiler$4.visitBlock(UnitCompiler.java:933)     at org.codehaus.janino.Java$Block.accept(Java.java:2012)    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:956)  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1263)    at org.codehaus.janino.UnitCompiler.access$1500(UnitCompiler.java:183)  at org.codehaus.janino.UnitCompiler$4.visitWhileStatement(UnitCompiler.java:938)    at org.codehaus.janino.Java$WhileStatement.accept(Java.java:2244)   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:956)  at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:997)    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2283)     at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:820)   at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:792)   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:505)     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:656)     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:620)     at org.codehaus.janino.UnitCompiler.access$200(UnitCompiler.java:183)   at org.codehaus.janino.UnitCompiler$2.visitAnonymousClassDeclaration(UnitCompiler.java:343)     at org.codehaus.janino.Java$AnonymousClassDeclaration.accept(Java.java:894)     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:352)  at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4194)     at org.codehaus.janino.UnitCompiler.access$7300(UnitCompiler.java:183)  at 
org.codehaus.janino.UnitCompiler$10.visitNewAnonymousClassInstance(UnitCompiler.java:3260)   at org.codehaus.janino.Java$NewAnonymousClassInstance.accept(Java.java:4131)    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3278)  at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4345)     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1901)    at org.codehaus.janino.UnitCompiler.access$2100(UnitCompiler.java:183)  at org.codehaus.janino.UnitCompiler$4.visitReturnStatement(UnitCompiler.java:944)   at org.codehaus.janino.Java$ReturnStatement.accept(Java.java:2544)  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:956)  at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:997)    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2283)     at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:820)   at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:792)   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:505)     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:656)     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:620)     at org.codehaus.janino.UnitCompiler.access$200(UnitCompiler.java:183)   at org.codehaus.janino.UnitCompiler$2.visitAnonymousClassDeclaration(UnitCompiler.java:343)     at org.codehaus.janino.Java$AnonymousClassDeclaration.accept(Java.java:894)     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:352)  at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4194)     at org.codehaus.janino.UnitCompiler.access$7300(UnitCompiler.java:183)  at org.codehaus.janino.UnitCompiler$10.visitNewAnonymousClassInstance(UnitCompiler.java:3260)   at org.codehaus.janino.Java$NewAnonymousClassInstance.accept(Java.java:4131)    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3278)  at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4345)     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1901)    at org.codehaus.janino.UnitCompiler.access$2100(UnitCompiler.java:183)  at org.codehaus.janino.UnitCompiler$4.visitReturnStatement(UnitCompiler.java:944)   at org.codehaus.janino.Java$ReturnStatement.accept(Java.java:2544)  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:956)  at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:997)    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2283)     at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:820)   at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:792)   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:505)     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:391)     at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:183)   at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:345)     at org.codehaus.janino.Java$PackageMemberClassDeclaration.accept(Java.java:1139)    at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:352)  at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:320)  at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:383)     at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:315)   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:233)     at 
org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192)     at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:47)    at org.codehaus.janino.ClassBodyEvaluator.createInstance(ClassBodyEvaluator.java:340)   at org.apache.calcite.adapter.enumerable.EnumerableInterpretable.getBindable(EnumerableInterpretable.java:140)  at org.apache.calcite.adapter.enumerable.EnumerableInterpretable.toBindable(EnumerableInterpretable.java:105)   ... 24 common frames omitted

This happens because the LIKE function expects to compare String objects. That is, it expects a format of String LIKE String, and we have instead passed it Other LIKE String. To account for this, there exist a few other RecordPath functions: RPATH_STRING, RPATH_INT, RPATH_LONG, RPATH_FLOAT, and RPATH_DOUBLE, which can be used when you want the return type to be String, Integer, Long (64-bit Integer), Float, or Double, respectively. So the above query would instead need to be written as SELECT * FROM FLOWFILE WHERE RPATH_STRING(project, '/maintainer/name') LIKE 'Apache%', which will produce the desired output.
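
Written out in full, the corrected query reads:

SELECT *
FROM FLOWFILE
WHERE RPATH_STRING(project, '/maintainer/name') LIKE 'Apache%'
sql

The other typed variants follow the same pattern and are simply swapped in when the surrounding SQL function expects a numeric type rather than a String.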

Aggregate Functions

In order to evaluate SQL against a stream of data, the Processor treats each individual FlowFile as its own Table. Therefore, aggregate functions such as SUM and AVG will be evaluated against all Records in each FlowFile but will not span FlowFile boundaries. As an example, consider an input FlowFile in CSV format with the following data:

name, age, gender
John Doe, 40, Male
Jane Doe, 39, Female
Jimmy Doe, 4, Male
June Doe, 1, Female

Given this data, we may wish to perform a query that performs an aggregate function, such as MAX:

SELECT name
FROM FLOWFILE
WHERE age = (SELECT MAX(age) FROM FLOWFILE)
sql

The above query will select the name of the oldest person, namely John Doe. If a second FlowFile were to then arrive, its contents would be evaluated as an entirely new Table.
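
Because the supported dialect is ANSI SQL, standard grouping constructs can also be applied within a single FlowFile. The following is a minimal sketch against the CSV above; the person_count alias is illustrative:

SELECT gender, COUNT(*) AS person_count
FROM FLOWFILE
GROUP BY gender
sql

As with the other aggregate examples, the grouping is evaluated per FlowFile and does not span FlowFile boundaries.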

QuerySalesforceObject

Retrieves records from a Salesforce sObject. Users can add arbitrary filter conditions by setting the 'Custom WHERE Condition' property. The processor can also run a custom query, although record processing is not supported in that case. Supports incremental retrieval: users can define a field in the 'Age Field' property that will be used to determine when the record was created. When this property is set, the processor will retrieve only new records. Incremental loading and record-based processing are only supported in property-based queries. It’s also possible to define an initial cutoff value for the age, filtering out all older records even for the first run. In the case of 'Property Based Query', this processor should run on the Primary Node only. The FlowFile attribute 'record.count' indicates how many records were retrieved and written to the output. The processor can accept an optional input FlowFile and reference the FlowFile attributes in the query. When 'Include Deleted Records' is true, the processor will include deleted records (soft-deletes) in the results by using the 'queryAll' API. The 'IsDeleted' field will be automatically included in the results when querying deleted records.

Tags: salesforce, sobject, soql, query

Properties

Salesforce Instance URL

The URL of the Salesforce instance including the domain without additional path information, such as https://MyDomainName.my.salesforce.com

API Version

The version number of the Salesforce REST API appended to the URL after the services/data path. See Salesforce documentation for supported versions

Query Type

Choose to provide the query by parameters or a full custom query.

Custom SOQL Query

Specify the SOQL query to run.

sObject Name

The Salesforce sObject to be queried

Field Names

Comma-separated list of field names requested from the sObject to be queried. When this field is left empty, all fields are queried.

Record Writer

Service used for writing records returned from the Salesforce REST API

Age Field

The name of a TIMESTAMP field that will be used to filter records using a bounded time window. The processor will return only those records with a timestamp value newer than the timestamp recorded after the last processor run.

Initial Age Start Time

This property specifies the start time that the processor applies when running the first query.

Age Delay

The ending timestamp of the time window will be adjusted earlier by the amount configured in this property. For example, with a property value of 10 seconds, an ending timestamp of 12:30:45 would be changed to 12:30:35.

Custom WHERE Condition

A custom expression to be added in the WHERE clause of the query

Include Deleted Records

If true, the processor will include deleted records (IsDeleted = true) in the query results. When enabled, the processor will use the 'queryAll' API.

Read Timeout

Maximum time allowed for reading a response from the Salesforce REST API

Create Zero Record FlowFiles

Specifies whether or not to create a FlowFile when the Salesforce REST API does not return any records

OAuth2 Access Token Provider

Service providing OAuth2 Access Tokens for authenticating using the HTTP Authorization Header

Relationships

  • success: For FlowFiles created as a result of a successful query.

  • failure: The input flowfile gets sent to this relationship when the query fails.

  • original: The input flowfile gets sent to this relationship when the query succeeds.

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer.

  • record.count: Sets the number of records in the FlowFile.

  • total.record.count: Sets the total number of records in the FlowFile.

Stateful

Scope: Cluster

When 'Age Field' is set, after performing a query the time of execution is stored. Subsequent queries will be augmented with an additional condition so that only records that are newer than the stored execution time (adjusted with the optional value of 'Age Delay') will be retrieved. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.

Input Requirement

This component allows an incoming relationship.

Additional Details

Description

Objects in Salesforce are database tables, their rows are known as records, and their columns are called fields. The QuerySalesforceObject processor queries Salesforce objects and retrieves their records. The processor either constructs the query from processor properties or executes a custom SOQL (Salesforce Object Query Language) query, and retrieves the result record dataset using the Salesforce REST API. The ‘Query Type’ processor property allows the query to be built in two ways. The ‘Property Based Query’ option allows a ‘SELECT <fields> FROM <sObject name>’ type query to be defined, with the fields taken from the ‘Field Names’ property and the Salesforce object taken from the ‘sObject Name’ property, whereas the ‘Custom Query’ option allows you to supply an arbitrary SOQL query. By using ‘Custom Query’, the processor can accept an optional input FlowFile and reference the FlowFile attributes in the query. However, incremental loading and record-based processing are only supported with ‘Property Based Query’.
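
As an illustration, a ‘Custom Query’ can reference an attribute of the optional incoming FlowFile through Expression Language. The following SOQL is only a sketch; the account_name attribute and the Account fields are illustrative, not prescribed by the processor:

SELECT Id, Name
FROM Account
WHERE Name = '${account_name}'
sql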

OAuth2 Access Token Provider Service

The OAuth2 Access Token Provider Service handles Salesforce REST API authorization. In order to use OAuth2 authorization, create a new StandardOauth2AccessTokenProvider service and configure it as follows.

  • Authorization Server URL: It is the concatenation of the Salesforce URL and the token request service URL (/services/oauth2/token).

  • Grant Type: User Password.

  • Username: The email address registered in the Salesforce account.

  • Password: For the Password a Security token must be requested. Go to Profile → Settings and under the Reset My Security Token option, request one, which will be sent to the registered email address. The password is made up of the Salesforce account password and the Security token concatenated together without a space.

  • Client ID: Create a new Connected App within Salesforce. Go to Setup → On the left search panel find App Manager → Create New Connected App. Once it’s done, the Consumer Key goes to the Client ID property.

  • Client Secret: Available on the Connected App page under Consumer Secret.

Age properties

The age properties are important to avoid processing duplicate records. Age filtering provides a sliding window that starts with the processor’s prior run time and ends with the current run time minus the age delay. Only records that are within the sliding window are queried and processed. On the processor, the Age Field property must be a datetime field of the queried object (e.g. LastModifiedDate); its value is required to be greater than the processor’s previous run time and less than the current run time. The first run, for example, will query records whose LastModifiedDate field is earlier than the current run time. The second will look for records with LastModifiedDate fields that are later than the previous run time but earlier than the current run time.

The processor uses the Initial Age Start Time property as a specific timestamp that sets the beginning of the sliding window from which processing builds the initial query. The format must adhere to the Salesforce SOQL standards (see the Salesforce documentation). The Age Delay shifts the end of the window to be earlier than the current run time, if configured.
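
To illustrate the sliding window, assume ‘sObject Name’ is set to Account, ‘Field Names’ to Id, Name, LastModifiedDate and ‘Age Field’ to LastModifiedDate. The SOQL issued for a given run is then roughly of the following shape (a sketch only; the timestamps stand in for the previous run time and the current run time minus the Age Delay):

SELECT Id, Name, LastModifiedDate
FROM Account
WHERE LastModifiedDate > 2024-05-01T10:00:00Z AND LastModifiedDate < 2024-05-01T10:05:00Z
sql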

QuerySplunkIndexingStatus

Queries Splunk server in order to acquire the status of indexing acknowledgement.

Tags: splunk, logs, http, acknowledgement

Properties

Scheme

The scheme for connecting to Splunk.

Hostname

The IP address or hostname of the Splunk server.

HTTP Event Collector Port

The HTTP Event Collector HTTP Port Number.

Security Protocol

The security protocol to use for communicating with Splunk.

Owner

The owner to pass to Splunk.

HTTP Event Collector Token

HTTP Event Collector token starting with the string Splunk. For example 'Splunk 1234578-abcd-1234-abcd-1234abcd'

Username

The username to authenticate to Splunk.

Password

The password to authenticate to Splunk.

Splunk Request Channel

Identifier of the used request channel.

Maximum Waiting Time

The maximum time the processor tries to acquire acknowledgement confirmation for an index, from the point of registration. After the given amount of time, the processor considers the index as not acknowledged and transfers the FlowFile to the "unacknowledged" relationship.

Maximum Query Size

The maximum number of acknowledgement identifiers the outgoing query contains in one batch. It is recommended not to set it too low in order to reduce network communication.

Relationships

  • success: A FlowFile is transferred to this relationship when the acknowledgement was successful.

  • failure: A FlowFile is transferred to this relationship when the acknowledgement was not successful due to errors during the communication. FlowFiles that time out or are unknown to the Splunk server will be transferred to the "undetermined" relationship.

  • unacknowledged: A FlowFile is transferred to this relationship when the acknowledgement was not successful. This can happen when the acknowledgement did not happen within the time period set for Maximum Waiting Time. FlowFiles with an acknowledgement id unknown to the Splunk server will be transferred to this relationship after the Maximum Waiting Time is reached.

  • undetermined: A FlowFile is transferred to this relationship when the acknowledgement state is not determined. FlowFiles transferred to this relationship might be penalized. This happens when Splunk returns HTTP 200 but a false response for the acknowledgement id in the FlowFile attribute.

Reads Attributes

  • splunk.acknowledgement.id: The indexing acknowledgement id provided by Splunk.

  • splunk.responded.at: The time at which Splunk responded to the original put request.

Input Requirement

This component requires an incoming relationship.

See Also

Additional Details

QuerySplunkIndexingStatus

This processor is responsible for polling the Splunk server and determining whether a Splunk event was acknowledged at the time of execution. For more details about HEC Index Acknowledgement, please see the Splunk documentation.

Prerequisites

In order to work properly, the incoming flow files need to have the attributes “splunk.acknowledgement.id” and “splunk.responded.at” filled in properly. The flow file attribute “splunk.acknowledgement.id” should contain the “ackId” from the response Splunk returned to the original put call. The flow file attribute “splunk.responded.at” should contain the Unix Epoch time at which the put call was answered by Splunk. It is suggested to use the PutSplunkHTTP processor to execute the put call and set these attributes.

Unacknowledged and undetermined cases

Splunk serves information only about successful acknowledgements. In every other case it will return a value of false. This includes unsuccessful or ongoing indexing and unknown acknowledgement identifiers. In order to avoid infinite retries, QuerySplunkIndexingStatus gives the user the possibility to set a “Maximum Waiting Time”. Results with a value of false from Splunk within the specified waiting time will be handled as “undetermined” and are transferred to the “undetermined” relationship. Flow files outside of this time range will be queried as well and transferred to either the “acknowledged” or “unacknowledged” relationship, determined by the Splunk response. In order to determine whether the indexing of a given event is within the waiting time, the Unix Epoch of the original Splunk response is stored in the attribute “splunk.responded.at”. Setting “Maximum Waiting Time” too low might result in false negatives, as under higher load the Splunk server might index more slowly than expected.

Undetermined cases are normal in a healthy environment, as it is possible that NiFi asks for the indexing status before Splunk finishes and acknowledges it. These cases are safe to retry, and it is suggested to loop the “undetermined” relationship back to the processor for a later try. Flow files transferred into the “undetermined” relationship are penalized.

Performance

Please keep Splunk channel limitations in mind: there are multiple configuration parameters in Splunk which might have a direct effect on the performance and behaviour of the QuerySplunkIndexingStatus processor. For example, “max_number_of_acked_requests_pending_query” and “max_number_of_acked_requests_pending_query_per_ack_channel” might limit the number of ackIDs the Splunk server stores.

Also, it is suggested to execute the query in batches. The “Maximum Query Size” property may be used to fine-tune the maximum number of events the processor will query about in one API request. This serves as an upper limit for the batch, but the processor might execute the query with fewer events.

RemoveRecordField

Modifies the contents of a FlowFile that contains Record-oriented data (i.e. data that can be read via a RecordReader and written by a RecordWriter) by removing selected fields. This Processor requires that at least one user-defined Property be added. The name of the property is ignored by the processor, but could be a meaningful identifier for the user. The value of the property should indicate a RecordPath that determines the field to be removed. The processor executes the removals in the order in which these properties are added to the processor. Set the Record Writer's Schema Access Strategy to "Inherit Record Schema" in order to use the updated Record Schema that results from removing fields.

Use Cases

Remove one or more fields from a Record, where the names of the fields to remove are known.

Keywords: record, field, drop, remove, delete, expunge, recordpath

Input Requirement: This component allows an incoming relationship.

  1. Configure the Record Reader according to the incoming data format.

  2. Configure the Record Writer according to the desired output format.

  3. For each field that you want to remove, add a single new property to the Processor.

  4. The name of the property can be anything but it’s recommended to use a brief description of the field.

  5. The value of the property is a RecordPath that matches the field to remove.

  6. For example, to remove the name and email fields, add two Properties:

  7. name = /name

  8. email = /email

Tags: update, record, generic, schema, json, csv, avro, freeform, text, remove, delete

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Dynamic Properties

A description of the field to remove

Any field that matches the RecordPath set as the value will be removed.

Relationships

  • success: FlowFiles that are successfully transformed will be routed to this relationship

  • failure: If a FlowFile cannot be transformed from the configured input format to the configured output format, the unchanged FlowFile will be routed to this relationship

Writes Attributes

  • record.error.message: This attribute provides on failure the error message encountered by the Reader or Writer.

Input Requirement

This component requires an incoming relationship.

See Also

Additional Details

RemoveRecordField processor usage with examples

The RemoveRecordField processor is capable of removing fields from a NiFi record. The fields that should be removed from the record are identified by a RecordPath expression. To learn about RecordPath, please read the RecordPath Guide.

RemoveRecordField will update all Records within the FlowFile based upon the RecordPath(s) configured for removal. The Schema associated with the Record Reader configured to read the FlowFile content will be updated based upon the same RecordPath(s), considering the values remaining within the Records’ fields after removal. This updated schema can be used for output if the Record Writer has a Schema Access Strategy of Inherit Record Schema; otherwise the schema updates will be lost and the Records will be output using the Schema configured on the Writer.

Below are some examples that are intended to explain how to use the processor. In these examples the input data, the record schema, the output data and the output schema are all in JSON format for easy understanding. We assume that the processor is configured to use a JsonTreeReader and JsonRecordSetWriter controller service, but of course the processor works with other Reader and Writer controller services as well. In the examples it is also assumed that the record schema is provided explicitly as a FlowFile attribute (avro.schema attribute), and the Reader uses this schema to work with the FlowFile. The Writer’s Schema Access Strategy is “Inherit Record Schema” so that all modifications made to the schema by the processor are considered by the Writer. Schema Write Strategy of the Writer is set to “Set ‘avro.schema’ Attribute” so that the output FlowFile contains the schema as an attribute value.

Example 1:

Removing a field from a simple record

Input data:

{
  "id": 1,
  "name": "John Doe",
  "dateOfBirth": "1980-01-01"
}
json

Input schema:

{
  "type": "record",
  "name": "PersonRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "dateOfBirth",
      "type": "string"
    }
  ]
}
json

Field to remove:

/dateOfBirth

In this case the dateOfBirth field is removed from the record as well as the schema.

Output data:

{
  "id": 1,
  "name": "John Doe"
}
json

Output schema:

{
  "type": "record",
  "name": "PersonRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    }
  ]
}
json

Note that removing a field from a record differs from setting a field’s value to null. With RemoveRecordField, a field is completely removed from the record and its schema, regardless of whether the field is nullable.

Example 2:

Removing fields from a complex record

Let’s suppose we have an input record that contains a homeAddress and a mailingAddress field both of which contain a zip field and we want to remove the zip field from both of them.

Input data:

{
  "id": 1,
  "name": "John Doe",
  "homeAddress": {
    "zip": 1111,
    "street": "Main",
    "number": 24
  },
  "mailingAddress": {
    "zip": 1121,
    "street": "Airport",
    "number": 12
  }
}
json

Input schema:

{
  "name": "PersonRecord",
  "type": "record",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "homeAddress",
      "type": {
        "name": "address",
        "type": "record",
        "fields": [
          {
            "name": "zip",
            "type": "int"
          },
          {
            "name": "street",
            "type": "string"
          },
          {
            "name": "number",
            "type": "int"
          }
        ]
      }
    },
    {
      "name": "mailingAddress",
      "type": "address"
    }
  ]
}
json

The zip field from both addresses can be removed by specifying a “Field to remove 1” property on the processor with the value “/homeAddress/zip” and adding another dynamic property with the value “/mailingAddress/zip”. Alternatively, we can use a wildcard expression in the RecordPath in “Field To Remove 1” (with no need for a second dynamic property).

Field to remove:

/*/zip

The zip field is removed from both addresses.

Output data:

{
  "id": 1,
  "name": "John Doe",
  "homeAddress": {
    "street": "Main",
    "number": 24
  },
  "mailingAddress": {
    "street": "Airport",
    "number": 12
  }
}
json

The zip field is removed from the schema of both the homeAddress field and the mailingAddress field. However, if only “/homeAddress/zip” was specified to be removed, the schema of mailingAddress would be intact regardless of the fact that originally these two addresses shared the same schema.

Example 3:

Arrays

Let’s suppose we have an input record that contains an array of addresses.

Input data:

{
  "id": 1,
  "name": "John Doe",
  "addresses": [
    {
      "zip": 1111,
      "street": "Main",
      "number": 24
    },
    {
      "zip": 1121,
      "street": "Airport",
      "number": 12
    }
  ]
}
json

Input schema:

{
  "name": "PersonRecord",
  "type": "record",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "addresses",
      "type": {
        "type": "array",
        "items": {
          "name": "address",
          "type": "record",
          "fields": [
            {
              "name": "zip",
              "type": "int"
            },
            {
              "name": "street",
              "type": "string"
            },
            {
              "name": "number",
              "type": "int"
            }
          ]
        }
      }
    }
  ]
}
json
  • Case 1: removing one element from the array

Field to remove:

/addresses[0]

Output data:

{
  "id": 1,
  "name": "John Doe",
  "addresses": [
    {
      "zip": 1121,
      "street": "Airport",
      "number": 12
    }
  ]
}
json

Output schema:

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "addresses",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "addressesType",
          "fields": [
            {
              "name": "zip",
              "type": "int"
            },
            {
              "name": "street",
              "type": "string"
            },
            {
              "name": "number",
              "type": "int"
            }
          ]
        }
      }
    }
  ]
}
json

The first element of the array is removed. The schema of the output data is structurally the same as the input schema. Note that the name “PersonRecord” of the input schema changed to “nifiRecord” and the name “address” changed to “addressesType”. This is normal; NiFi generates these names for the output schema. These name changes occur regardless of whether the schema is actually modified.

  • Case 2: removing all elements from the array

Field to remove:

/addresses[*]

Output data:

{
  "id": 1,
  "name": "John Doe",
  "addresses": []
}
json

All elements of the array are removed; the result is an empty array. The output schema is the same as in Case 1, with no structural changes made to the schema.

  • Case 3: removing a field from certain elements of the array

Field to remove:

/addresses[0]/zip

Output data:

{
  "id": 1,
  "name": "John Doe",
  "addresses": [
    {
      "zip": null,
      "street": "Main",
      "number": 24
    },
    {
      "zip": 1121,
      "street": "Airport",
      "number": 12
    }
  ]
}
json

The output schema is the same as in Case 1, with no structural changes. The zip field of the array’s first element is set to null, since the value had to be deleted but the schema could not be modified, because the deletion is not applied to all elements in the array. In a case like this, the value of the field is set to null regardless of whether the field is nullable.

  • Case 4: removing a field from all elements of an array

Field to remove:

/addresses[*]/zip

Output data:

{
  "id": 1,
  "name": "John Doe",
  "addresses": [
    {
      "street": "Main",
      "number": 24
    },
    {
      "street": "Airport",
      "number": 12
    }
  ]
}
json

Output schema:

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "addresses",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "addressesType",
          "fields": [
            {
              "name": "street",
              "type": "string"
            },
            {
              "name": "number",
              "type": "int"
            }
          ]
        }
      }
    }
  ]
}
json

The zip field is removed from all elements of the array, and the schema is modified: the zip field is removed from the array’s element type.

The examples shown in Case 1, Case 2, Case 3 and Case 4 apply to both kinds of collections: arrays and maps. The schema of an array or a map is only modified if the field removal applies to all elements of the collection. Selecting all elements of an array can be performed with the [*] as well as the [0..-1] operator.

Important note: if there are e.g. 3 elements in the addresses array, and “/addresses[*]/zip” is removed, then the zip field is removed from the schema because it applies explicitly for all elements regardless of the actual number of elements in the array. However, if the path says “/addresses[0,1,2]/zip” then the schema is NOT modified (even though [0,1,2] means all the elements in this particular array), because it selects the first, second and third elements individually and does not express the intention to apply removal to all elements of the array regardless of the number of elements.

  • Case 5: removing multiple elements from an array

Fields to remove:

/addresses[0]
/addresses[0]

In this example we want to remove the first two elements of the array. To do that we need to specify two separate path expressions, each pointing to one array element. Each removal is executed on the result of the previous removal, and removals are executed in the order in which the properties containing record paths are specified on the Processor. First, “/addresses[0]” is removed, that is the address with zip code 1111 in the example. After this removal, the addresses array has a new first element, which is the second element of the original array (the address with zip code 1121). To remove this element, we need to issue “/addresses[0]” again. Trying to remove “/addresses[0,1]”, or filtering array elements with predicates when the target of the removal is multiple different array elements may produce unexpected results.

  • Case 6: array within an array

Let’s suppose we have a complex input record that has an array within an array.

Input data:

{
  "id": 1,
  "people": [
    {
      "id": 11,
      "addresses": [
        {
          "zip": 1111,
          "street": "Main",
          "number": 24
        },
        {
          "zip": 1121,
          "street": "Airport",
          "number": 12
        }
      ]
    },
    {
      "id": 22,
      "addresses": [
        {
          "zip": 2222,
          "street": "Ocean",
          "number": 24
        },
        {
          "zip": 2232,
          "street": "Sunset",
          "number": 12
        }
      ]
    },
    {
      "id": 33,
      "addresses": [
        {
          "zip": 3333,
          "street": "Dawn",
          "number": 24
        },
        {
          "zip": 3323,
          "street": "Spring",
          "number": 12
        }
      ]
    }
  ]
}
json

The following table summarizes what happens to the record and the schema for different RecordPaths:

Field To Remove: /people[0]/addresses[1]/zip
Is the schema modified? No
What happens to the record? The zip field of the first person’s second address is set to null.

Field To Remove: /people[*]/addresses[1]/zip
Is the schema modified? No
What happens to the record? The zip field of all people’s second address is set to null.

Field To Remove: /people[0]/addresses[*]/zip
Is the schema modified? No
What happens to the record? The zip field of the first person’s every address is set to null.

Field To Remove: /people[*]/addresses[*]/zip
Is the schema modified? Yes
What happens to the record? The zip field of every person’s every address is removed (from the schema AND the data).

The rules and examples shown for arrays apply for maps as well.

Example 4:

Choice datatype

Let’s suppose we have an input schema that contains a field of type CHOICE.

Input data:

{
  "id": 12,
  "name": "John Doe"
}
json

Input schema:

{
  "type": "record",
  "name": "nameRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": [
        "string",
        {
          "type": "record",
          "name": "nameType",
          "fields": [
            {
              "name": "firstName",
              "type": "string"
            },
            {
              "name": "lastName",
              "type": "string"
            }
          ]
        }
      ]
    }
  ]
}
json

In this example, the schema specifies the name field as a CHOICE, but in the data it is a simple string. If we remove “/name/firstName”, the data is not modified, but the schema is: the firstName field is removed from the schema only.

RenameRecordField

Renames one or more fields in each Record of a FlowFile. This Processor requires that at least one user-defined Property be added. The name of the Property should indicate a RecordPath that determines the field that should be updated. The value of the Property is the new name to assign to the Record Field that matches the RecordPath. The property value may use Expression Language to reference FlowFile attributes as well as the variables field.name, field.value, field.type, and record.index

Use Cases

Rename a field in each Record to a specific, known name.

Keywords: rename, field, static, specific, name

Input Requirement: This component allows an incoming relationship.

  1. Configure the 'Record Reader' according to the input format.

  2. Configure the 'Record Writer' according to the desired output format.

  3. Add a property to the Processor such that the name of the property is a RecordPath that identifies the field to rename. The value of the property is the new name for the field.

  4. For example, to rename the name field to full_name, add a property with a name of /name and a value of full_name.

  5. Many properties can be added following this pattern in order to rename multiple fields.

Rename a field in each Record to a name that is derived from a FlowFile attribute.

Keywords: rename, field, expression language, EL, flowfile, attribute

Input Requirement: This component allows an incoming relationship.

  1. Configure the 'Record Reader' according to the input format.

  2. Configure the 'Record Writer' according to the desired output format.

  3. Add a property to the Processor such that the name of the property is a RecordPath that identifies the field to rename. The value of the property is an Expression Language expression that can be used to determine the new name of the field.

  4. For example, to rename the addr field to whatever value is stored in the preferred_address_name attribute, add a property with a name of /addr and a value of ${preferred_address_name}.

  5. Many properties can be added following this pattern in order to rename multiple fields.

Rename a field in each Record to a new name that is derived from the current field name.

Notes: This might be used, for example, to add a prefix or a suffix to some fields, or to transform the name of the field by making it uppercase.

Keywords: rename, field, expression language, EL, field.name

Input Requirement: This component allows an incoming relationship.

  1. Configure the 'Record Reader' according to the input format.

  2. Configure the 'Record Writer' according to the desired output format.

  3. Add a property to the Processor such that the name of the property is a RecordPath that identifies the field to rename. The value of the property is an Expression Language expression that references the field.name property.

  4. For example, to rename all fields with a prefix of pre_, we add a property named /* with a value of pre_${field.name}. If we would like this to happen recursively, to nested fields as well, we use a property name of //* with the value of pre_${field.name}.

  5. To make all field names uppercase, we can add a property named //* with a value of ${field.name:toUpper()}.

  6. Many properties can be added following this pattern in order to rename multiple fields.

Tags: update, record, rename, field, generic, schema, json, csv, avro, log, logs

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Dynamic Properties

A RecordPath that identifies which field(s) to update

Allows users to specify a new name for each field that matches the RecordPath.

Relationships

  • success: FlowFiles that are successfully transformed will be routed to this relationship

  • failure: If a FlowFile cannot be transformed from the configured input format to the configured output format, the unchanged FlowFile will be routed to this relationship

Writes Attributes

  • record.index: This attribute provides the current row index and is only available inside the literal value expression.

Input Requirement

This component requires an incoming relationship.

ReplaceText

Updates the content of a FlowFile by searching for some textual value in the FlowFile content (via Regular Expression/regex, or literal value) and replacing the section of the content that matches with some alternate value. It can also be used to append or prepend text to the contents of a FlowFile.

Use Cases

Append text to the end of every line in a FlowFile

Keywords: raw text, append, line

Input Requirement: This component allows an incoming relationship.

  1. "Evaluation Mode" = "Line-by-Line"

  2. "Replacement Strategy" = "Append" .

  3. "Replacement Value" is set to whatever text should be appended to the line.

  4. For example, to insert the text <fin> at the end of every line, we would set "Replacement Value" to <fin>.

  5. We can also use Expression Language. So to insert the filename at the end of every line, we set "Replacement Value" to ${filename} .

Prepend text to the beginning of every line in a FlowFile

Keywords: raw text, prepend, line

Input Requirement: This component allows an incoming relationship.

  1. "Evaluation Mode" = "Line-by-Line"

  2. "Replacement Strategy" = "Prepend" .

  3. "Replacement Value" is set to whatever text should be prepended to the line.

  4. For example, to insert the text <start> at the beginning of every line, we would set "Replacement Value" to <start>.

  5. We can also use Expression Language. So to insert the filename at the beginning of every line, we set "Replacement Value" to ${filename} .

Replace every occurrence of a literal string in the FlowFile with a different value

Keywords: replace, string, text, literal

Input Requirement: This component allows an incoming relationship.

  1. "Evaluation Mode" = "Line-by-Line"

  2. "Replacement Strategy" = "Literal Replace"

  3. "Search Value" is set to whatever text is in the FlowFile that needs to be replaced.

  4. "Replacement Value" is set to the text that should replace the current text. .

  5. For example, to replace the word "spider" with "arachnid" we set "Search Value" to spider and set "Replacement Value" to arachnid. .

Transform every occurrence of a literal string in a FlowFile

Keywords: replace, transform, raw text

Input Requirement: This component allows an incoming relationship.

  1. "Evaluation Mode" = "Line-by-Line"

  2. "Replacement Strategy" = "Regex Replace"

  3. "Search Value" is set to a regular expression that matches the text that should be transformed in a capturing group.

  4. "Replacement Value" is set to a NiFi Expression Language expression that references $1 (in quotes to escape the reference name). .

  5. For example, if we wanted to lowercase any occurrence of WOLF, TIGER, or LION, we would use a "Search Value" of (WOLF|TIGER|LION) and a "Replacement Value" of ${'$1':toLower()}.

  6. If we want to replace any identifier with a hash of that identifier, we might use a "Search Value" of identifier: (.*) and a "Replacement Value" of identifier: ${'$1':hash('sha256')} .
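
As a worked illustration of the lowercase example above (the sample line is hypothetical), a line such as the first one below would be rewritten as the second:

Input line: A WOLF and a LION were spotted near the TIGER enclosure

Output line: A wolf and a lion were spotted near the tiger enclosure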

Completely replace the contents of a FlowFile to a specific text

Keywords: replace, raw text

Input Requirement: This component allows an incoming relationship.

  1. "Evaluation Mode" = "Entire text"

  2. "Replacement Strategy" = "Always Replace" .

  3. "Replacement Value" is set to the new text that should be written to the FlowFile. This text might include NiFi Expression Language to reference one or more attributes. .

Tags: Text, Regular Expression, Update, Change, Replace, Modify, Regex

Properties

Replacement Strategy

The strategy for how and what to replace within the FlowFile’s text content.

Search Value

The Search Value to search for in the FlowFile content. Only used for 'Literal Replace' and 'Regex Replace' matching strategies

Replacement Value

The value to insert using the 'Replacement Strategy'. Using "Regex Replace" back-references to Regular Expression capturing groups are supported, but back-references that reference capturing groups that do not exist in the regular expression will be treated as literal value. Back References may also be referenced using the Expression Language, as '$1', '$2', etc. The single-tick marks MUST be included, as these variables are not "Standard" attribute names (attribute names must be quoted unless they contain only numbers, letters, and _).

Text to Prepend

The text to prepend to the start of the FlowFile, or each line, depending on the configured value of the Evaluation Mode property

Text to Append

The text to append to the end of the FlowFile, or each line, depending on the configured value of the Evaluation Mode property

Character Set

The Character Set in which the file is encoded

Maximum Buffer Size

Specifies the maximum amount of data to buffer (per file or per line, depending on the Evaluation Mode) in order to apply the replacement. If 'Entire Text' (in Evaluation Mode) is selected and the FlowFile is larger than this value, the FlowFile will be routed to 'failure'. In 'Line-by-Line' Mode, if a single line is larger than this value, the FlowFile will be routed to 'failure'. A default value of 1 MB is provided, primarily for 'Entire Text' mode. In 'Line-by-Line' Mode, a value such as 8 KB or 16 KB is suggested. This value is ignored if the <Replacement Strategy> property is set to one of: Append, Prepend, Always Replace

Evaluation Mode

Run the 'Replacement Strategy' against each line separately (Line-by-Line) or buffer the entire file into memory (Entire Text) and run against that.

Line-by-Line Evaluation Mode

Run the 'Replacement Strategy' against each line separately (Line-by-Line) for all lines in the FlowFile, First Line (Header) alone, Last Line (Footer) alone, Except the First Line (Header) or Except the Last Line (Footer).

Relationships

  • success: FlowFiles that have been successfully processed are routed to this relationship. This includes both FlowFiles that had text replaced and those that did not.

  • failure: FlowFiles that could not be updated are routed to this relationship

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

ReplaceTextWithMapping

Updates the content of a FlowFile by evaluating a Regular Expression against it and replacing the section of the content that matches the Regular Expression with some alternate value provided in a mapping file.

Tags: Text, Regular Expression, Update, Change, Replace, Modify, Regex, Mapping

Properties

Regular Expression

The Regular Expression to search for in the FlowFile content

Matching Group

The number of the matching group of the provided regex to replace with the corresponding value from the mapping file (if it exists).

Mapping File

The name of the file (including the full path) containing the Mappings.

Mapping File Refresh Interval

The polling interval to check for updates to the mapping file. The default is 60s.

Character Set

The Character Set in which the file is encoded

Maximum Buffer Size

Specifies the maximum amount of data to buffer (per file) in order to apply the regular expressions. If a FlowFile is larger than this value, the FlowFile will be routed to 'failure'

Relationships

  • success: FlowFiles that have been successfully updated are routed to this relationship, as well as FlowFiles whose content does not match the given Regular Expression

  • failure: FlowFiles that could not be updated are routed to this relationship

Input Requirement

This component requires an incoming relationship.

ResizeImage

Resizes an image to user-specified dimensions. This Processor uses the image codecs registered with the environment that NiFi is running in. By default, this includes JPEG, PNG, BMP, WBMP, and GIF images.

Tags: resize, image, jpg, jpeg, png, bmp, wbmp, gif

Properties

Image Width (in pixels)

The desired number of pixels for the image’s width

Image Height (in pixels)

The desired number of pixels for the image’s height

Scaling Algorithm

Specifies which algorithm should be used to resize the image

Maintain aspect ratio

Specifies if the ratio of the input image should be maintained

Relationships

  • success: A FlowFile is routed to this relationship if it is successfully resized

  • failure: A FlowFile is routed to this relationship if it is not in the specified format

Input Requirement

This component requires an incoming relationship.

RespondCloud

Used in an IGUASU Gateway (running on an on-prem system) for responding to requests received from IGUASU (running on the Virtimo Cloud). Responses contain incoming FlowFiles' content and attributes. Requests from IGUASU are sent using an InvokeOnPrem-processor, and received by IGUASU Gateway using a ListenCloud-processor. Data is securely sent in either direction using a standing WSS connection initiated by IGUASU Gateway, meaning the on-prem system does not need to open external ports.

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

Client Controller

Controller for connecting to a Virtimo system and enabling WSS communication.

Relationships

  • success: FlowFiles will be routed to this Relationship after the response has been successfully sent to the requester.

  • failure: FlowFiles will be routed to this Relationship if the Processor is unable to respond to the requester. This may happen, for instance, if the connection times out or if NiFi is restarted before responding to the request.

Reads Attributes

  • hybrid.invoker: The ID of the invoker that sent the request and will receive the response. If this attribute is missing, no response is sent and the FlowFile will be routed to the 'failure' relationship.

  • hybrid.request: Used to match responses with requests. If this attribute is missing, no response is sent and the FlowFile will be routed to the 'failure' relationship.

Input Requirement

This component requires an incoming relationship.

RespondInubit

Used in IGUASU for responding to requests received from an (on-prem) INUBIT system. Responses contain incoming FlowFiles' content and attributes. Requests from INUBIT are sent using an IGUASU-Connector, and received by IGUASU using a ListenInubit-processor. Data is securely sent in either direction using a standing WSS connection initiated by INUBIT, meaning the INUBIT system does not need to open external ports.

Tags: hybrid, cloud, onprem, virtimo, websocket, inubit

Properties

Server Controller

Controller for letting other Virtimo systems connect and enabling WSS communication.

Relationships

  • success: FlowFiles will be routed to this Relationship after the response has been successfully sent to the requester.

  • failure: FlowFiles will be routed to this Relationship if the Processor is unable to respond to the requester. This may happen, for instance, if the connection times out or if NiFi is restarted before responding to the request.

Reads Attributes

  • hybrid.invoker: The ID of the invoker that sent the request and will receive the response. If this attribute is missing, no response is sent and the FlowFile will be routed to the 'failure' relationship.

  • hybrid.request: Used to match responses with requests. If this attribute is missing, no response is sent and the FlowFile will be routed to the 'failure' relationship.

Input Requirement

This component requires an incoming relationship.

RespondOnPrem

Used in IGUASU (running on the Virtimo Cloud) for responding to requests received from IGUASU Gateway (running on an on-prem system). Responses contain incoming FlowFiles' content and attributes. Requests from the IGUASU Gateway are sent using an InvokeCloud-processor, and received by IGUASU using a ListenOnPrem-processor. Data is securely sent in either direction using a standing WSS connection initiated by IGUASU Gateway, meaning the on-prem system does not need to open external ports.

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

Server Controller

Controller for letting other Virtimo systems connect and enabling WSS communication.

Relationships

  • success: FlowFiles will be routed to this Relationship after the response has been successfully sent to the requester.

  • failure: FlowFiles will be routed to this Relationship if the Processor is unable to respond to the requester. This may happen, for instance, if the connection times out or if NiFi is restarted before responding to the request.

Reads Attributes

  • hybrid.invoker: The ID of the invoker that sent the request and will receive the response. If this attribute is missing, no response is sent and the FlowFile will be routed to the 'failure' relationship.

  • hybrid.request: Used to match responses with requests. If this attribute is missing, no response is sent and the FlowFile will be routed to the 'failure' relationship.

Input Requirement

This component requires an incoming relationship.

RetryFlowFile

FlowFiles passed to this Processor have a 'Retry Attribute' value checked against a configured 'Maximum Retries' value. If the current attribute value is below the configured maximum, the FlowFile is passed to a retry relationship. The FlowFile may or may not be penalized in that condition. If the FlowFile’s attribute value exceeds the configured maximum, the FlowFile will be passed to a 'retries_exceeded' relationship. WARNING: If the incoming FlowFile has a non-numeric value in the configured 'Retry Attribute' attribute, it will be reset to '1'. You may choose to fail the FlowFile instead of performing the reset. Additional dynamic properties can be defined for any attributes you wish to add to the FlowFiles transferred to 'retries_exceeded'. These attributes support attribute expression language.

Tags: Retry, FlowFile

Properties

Retry Attribute

The name of the attribute that contains the current retry count for the FlowFile. WARNING: If the name matches an attribute already on the FlowFile that does not contain a numerical value, the processor will either overwrite that attribute with '1' or fail based on configuration.

Maximum Retries

The maximum number of times a FlowFile can be retried before being passed to the 'retries_exceeded' relationship

Penalize Retries

If set to 'true', this Processor will penalize input FlowFiles before passing them to the 'retry' relationship. This does not apply to the 'retries_exceeded' relationship.

Fail on Non-numerical Overwrite

If the FlowFile already has the attribute defined in 'Retry Attribute' that is not a number, fail the FlowFile instead of resetting that value to '1'

Reuse Mode

Defines how the Processor behaves if the retry FlowFile has a different retry UUID than the instance that received the FlowFile. This generally means that the attribute was not reset after being successfully retried by a previous instance of this processor.

Dynamic Properties

Exceeded FlowFile Attribute Key

One or more dynamic properties can be used to add attributes to FlowFiles passed to the 'retries_exceeded' relationship

Relationships

  • failure: The processor is configured such that a non-numerical value on 'Retry Attribute' results in a failure instead of resetting that value to '1'. This will immediately terminate the limited feedback loop. This relationship is also used when 'Maximum Retries' contains an attribute Expression Language expression that does not resolve to an Integer.

  • retries_exceeded: Input FlowFile has exceeded the configured maximum retry count, do not pass this relationship back to the input Processor to terminate the limited feedback loop.

  • retry: Input FlowFile has not exceeded the configured maximum retry count, pass this relationship back to the input Processor to create a limited feedback loop.

Reads Attributes

  • Retry Attribute: Will read the attribute or attribute expression language result as defined in 'Retry Attribute'

Writes Attributes

  • Retry Attribute: User defined retry attribute is updated with the current retry count

  • Retry Attribute .uuid: User defined retry attribute with .uuid that determines what processor retried the FlowFile last

Input Requirement

This component requires an incoming relationship.

RouteHL7

Routes incoming HL7 data according to user-defined queries. To add a query, add a new property to the processor. The name of the property will become a new relationship for the processor, and the value is an HL7 Query Language query. If a FlowFile matches the query, a copy of the FlowFile will be routed to the associated relationship.

Tags: HL7, healthcare, route, Health Level 7

Properties

Character Encoding

The Character Encoding that is used to encode the HL7 data

Dynamic Properties

Name of a Relationship

If a FlowFile matches the query, it will be routed to a relationship with the name of the property

Relationships

  • failure: Any FlowFile that cannot be parsed as HL7 will be routed to this relationship

  • original: The original FlowFile that comes into this processor will be routed to this relationship, unless it is routed to 'failure'

Writes Attributes

  • RouteHL7.Route: The name of the relationship to which the FlowFile was routed

Input Requirement

This component requires an incoming relationship.

RouteOnAttribute

Routes FlowFiles based on their Attributes using the Attribute Expression Language

Use Cases

Route data to one or more relationships based on its attributes using the NiFi Expression Language.

Keywords: attributes, routing, expression language

Input Requirement: This component allows an incoming relationship.

  1. Set the "Routing Strategy" property to "Route to Property name".

  2. For each route that a FlowFile might be routed to, add a new property. The name of the property should describe the route.

  3. The value of the property is an Attribute Expression Language expression that returns a boolean value indicating whether or not a given FlowFile will be routed to the associated relationship.

  4. For example, we might route data based on its file extension using the following properties:

  5. - "Routing Strategy" = "Route to Property Name"

  6. - "jpg" = "${filename:endsWith('.jpg')}"

  7. - "png" = "${filename:endsWith('.png')}"

  8. - "pdf" = "${filename:endsWith('.pdf')}" .

  9. The Processor will now have 3 relationships: jpg, png, and pdf. Each of these should be connected to the appropriate downstream processor. .

Keep data only if its attributes meet some criteria, such as its filename ends with .txt.

Keywords: keep, filter, remove, delete, expression language

Input Requirement: This component allows an incoming relationship.

  1. Add a new property for each condition that must be satisfied in order to keep the data.

  2. If the data should be kept in the case that any of the provided conditions is met, set the "Routing Strategy" property to "Route to 'matched' if any matches".

  3. If all conditions must be met in order to keep the data, set the "Routing Strategy" property to "Route to 'matched' if all match".

  4. For example, to keep files whose filename ends with .txt and have a file size of at least 1000 bytes, we will use the following properties:

  5. - "ends_with_txt" = "${filename:endsWith('.txt')}"

  6. - "large_enough" = "${fileSize:ge(1000)}

  7. - "Routing Strategy" = "Route to 'matched' if all match" .

  8. Auto-terminate the 'unmatched' relationship.

  9. Connect the 'matched' relationship to the next processor in the flow. .

Discard or drop a file based on attributes, such as filename.

Keywords: discard, drop, filter, remove, delete, expression language

Input Requirement: This component allows an incoming relationship.

  1. Add a new property for each condition that must be satisfied in order to drop the data.

  2. If the data should be dropped in the case that any of the provided conditions is met, set the "Routing Strategy" property to "Route to 'matched' if any matches".

  3. If all conditions must be met in order to drop the data, set the "Routing Strategy" property to "Route to 'matched' if all match".

  4. Here are a couple of examples for configuring the properties:

  5. Example 1 Use Case: Data should be dropped if its "uuid" attribute has an 'a' in it or ends with '0'.

  6. Here, we will use the following properties:

  7. - "has_a" = "${uuid:contains('a')}"

  8. - "ends_with_0" = "${uuid:endsWith('0')}

  9. - "Routing Strategy" = "Route to 'matched' if any matches"

  10. Example 2 Use Case: Data should be dropped if its 'uuid' attribute has an 'a' AND it ends with a '1'.

  11. Here, we will use the following properties:

  12. - "has_a" = "${uuid:contains('a')}"

  13. - "ends_with_1" = "${uuid:endsWith('1')}

  14. - "Routing Strategy" = "Route to 'matched' if all match" .

  15. Auto-terminate the 'matched' relationship.

  16. Connect the 'unmatched' relationship to the next processor in the flow. .

Multi-Processor Use Cases

Route record-oriented data based on whether or not the record’s values meet some criteria

Keywords: record, route, content, data

PartitionRecord:

  1. Choose a RecordReader that is appropriate based on the format of the incoming data.

  2. Choose a RecordWriter that writes the data in the desired output format.

  3. Add a single additional property. The name of the property should describe the criteria to route on. The property’s value should be a RecordPath that returns true if the Record meets the criteria or false otherwise. This adds a new attribute to the FlowFile whose name is equal to the property name.

  4. Connect the 'success' Relationship to RouteOnAttribute.

RouteOnAttribute:

  1. Set "Routing Strategy" to "Route to Property name" .

  2. Add two additional properties. For the first one, the name of the property should describe data that matches the criteria. The value is an Expression Language expression that checks if the attribute added by the PartitionRecord processor has a value of true. For example, ${criteria:equals('true')}.

  3. The second property should have a name that describes data that does not match the criteria. The value is an Expression Language that evaluates to the opposite of the first property value. For example, ${criteria:equals('true'):not()}. .

  4. Connect each of the newly created Relationships to the appropriate downstream processors. .

Tags: attributes, routing, Attribute Expression Language, regexp, regex, Regular Expression, Expression Language, find, text, string, search, filter, detect

Properties

Routing Strategy

Specifies how to determine which relationship to use when evaluating the Expression Language

Dynamic Properties

Relationship Name

Routes FlowFiles whose attributes match the Expression Language specified in the Dynamic Property Value to the Relationship specified in the Dynamic Property Key

Relationships

  • unmatched: FlowFiles that do not match any user-defined expression will be routed here

Dynamic Relationships

  • Name from Dynamic Property: FlowFiles that match the Dynamic Property’s Attribute Expression Language

Writes Attributes

  • RouteOnAttribute.Route: The relation to which the FlowFile was routed

Input Requirement

This component requires an incoming relationship.

Additional Details

Usage Example

This processor routes FlowFiles based on their attributes using the NiFi Expression Language. Users add properties with valid NiFi Expression Language Expressions as the values. Each Expression must return a value of type Boolean (true or false).

Example: The goal is to route all files with filenames that start with ABC down a certain path. Add a property with the following name and value:

  • property name: ABC

  • property value: ${filename:startsWith('ABC')}

In this example, all files with filenames that start with ABC will follow the ABC relationship.
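
Expressions can also combine several checks into a single property. For example (the property name and size threshold below are illustrative, not part of the processor), the following routes only files whose names start with ABC and that are larger than 1024 bytes:

  • property name: ABC_large

  • property value: ${filename:startsWith('ABC'):and(${fileSize:gt(1024)})}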

RouteOnContent

Applies Regular Expressions to the content of a FlowFile and routes a copy of the FlowFile to each destination whose Regular Expression matches. Regular Expressions are added as User-Defined Properties where the name of the property is the name of the relationship and the value is a Regular Expression to match against the FlowFile content. User-Defined properties do support the Attribute Expression Language, but the results are interpreted as literal values, not Regular Expressions

Tags: route, content, regex, regular expression, regexp, find, text, string, search, filter, detect

Properties

Match Requirement

Specifies whether the entire content of the file must match the regular expression exactly, or if any part of the file (up to Content Buffer Size) can contain the regular expression in order to be considered a match

Character Set

The Character Set in which the file is encoded

Content Buffer Size

Specifies the maximum amount of data to buffer in order to apply the regular expressions. If the size of the FlowFile exceeds this value, only the buffered portion of the content is evaluated and the remainder is ignored

Dynamic Properties

Relationship Name

Routes FlowFiles whose content matches the regular expression defined by Dynamic Property’s value to the Relationship defined by the Dynamic Property’s key

Relationships

  • unmatched: FlowFiles that do not match any of the user-supplied regular expressions will be routed to this relationship

Dynamic Relationships

  • Name from Dynamic Property: FlowFiles that match the Dynamic Property’s Regular Expression

Input Requirement

This component requires an incoming relationship.

RouteText

Routes textual data based on a set of user-defined rules. Each line in an incoming FlowFile is compared against the values specified by user-defined Properties. The mechanism by which the text is compared to these user-defined properties is defined by the 'Matching Strategy'. The data is then routed according to these rules, routing each line of the text individually.

Use Cases

Drop blank or empty lines from the FlowFile’s content.

Keywords: filter, drop, empty, blank, remove, delete, strip out, lines, text

Input Requirement: This component allows an incoming relationship.

  1. "Routing Strategy" = "Route to each matching Property Name"

  2. "Matching Strategy" = "Matches Regular Expression"

  3. "Empty Line" = "^$" .

  4. Auto-terminate the "Empty Line" relationship.

  5. Connect the "unmatched" relationship to the next processor in your flow. .

Remove specific lines of text from a file, such as those containing a specific word or having a line length over some threshold.

Keywords: filter, drop, empty, blank, remove, delete, strip out, lines, text, expression language

Input Requirement: This component allows an incoming relationship.

  1. "Routing Strategy" = "Route to each matching Property Name"

  2. "Matching Strategy" = "Satisfies Expression" .

  3. An additional property should be added named "Filter Out." The value should be a NiFi Expression Language Expression that can refer to two variables (in addition to FlowFile attributes): line, which is the line of text being evaluated; and lineNo, which is the line number in the file (starting with 1). The Expression should return true for any line that should be dropped. .

  4. For example, to remove any line that starts with a # symbol, we can set "Filter Out" to ${line:startsWith("#")}.

  5. We could also remove the first 2 lines of text by setting "Filter Out" to ${lineNo:le(2)}. Note that we use the le function because we want lines numbers less than or equal to 2, since the line index is 1-based. .

  6. Auto-terminate the "Filter Out" relationship.

  7. Connect the "unmatched" relationship to the next processor in your flow. .

Tags: attributes, routing, text, regexp, regex, Regular Expression, Expression Language, csv, filter, logs, delimited, find, string, search, filter, detect

Properties

Routing Strategy

Specifies how to determine which Relationship(s) to use when evaluating the lines of incoming text against the 'Matching Strategy' and user-defined properties.

Matching Strategy

Specifies how to evaluate each line of incoming text against the user-defined properties.

Character Set

The Character Set in which the incoming text is encoded

Ignore Leading/Trailing Whitespace

Indicates whether or not the whitespace at the beginning and end of the lines should be ignored when evaluating the line.

Ignore Case

If true, capitalization will not be taken into account when comparing values. E.g., matching against 'HELLO' or 'hello' will have the same result. This property is ignored if the 'Matching Strategy' is set to 'Satisfies Expression'.

Grouping Regular Expression

Specifies a Regular Expression to evaluate against each line to determine which Group the line should be placed in. The Regular Expression must have at least one Capturing Group that defines the line’s Group. If multiple Capturing Groups exist in the Regular Expression, the values from all Capturing Groups will be concatenated together. Two lines will not be placed into the same FlowFile unless they both have the same value for the Group (or neither line matches the Regular Expression). For example, to group together all lines in a CSV File by the first column, we can set this value to "(.*?),.*". Two lines that have the same Group but different Relationships will never be placed into the same FlowFile.

Dynamic Properties

Relationship Name

Routes data that matches the value specified in the Dynamic Property Value to the Relationship specified in the Dynamic Property Key.

Relationships

  • original: The original input file will be routed to this destination when the lines have been successfully routed to 1 or more relationships

  • unmatched: Data that does not satisfy the required user-defined rules will be routed to this Relationship

Dynamic Relationships

  • Name from Dynamic Property: FlowFiles that match the Dynamic Property’s value

Writes Attributes

  • RouteText.Route: The name of the relationship to which the FlowFile was routed.

  • RouteText.Group: The value captured by all capturing groups in the 'Grouping Regular Expression' property. If this property is not set or contains no capturing groups, this attribute will not be added.

Input Requirement

This component requires an incoming relationship.

RunMongoAggregation

A processor that runs an aggregation query whenever a flowfile is received.

Tags: mongo, aggregation, aggregate

Properties

Client Service

If configured, this property will use the assigned client service for connection pooling.

Mongo Database Name

The name of the database to use

Mongo Collection Name

The name of the collection to use

Character Set

Specifies the character set of the document data.

Query

The aggregation query to be executed.

Allow Disk Use

Set this to true to enable writing data to temporary files to prevent exceeding the maximum memory use limit during aggregation pipeline stages when handling large datasets.

JSON Type

By default, MongoDB’s Java driver returns "extended JSON". Some of the features of this variant of JSON may cause problems for other JSON parsers that expect only standard JSON types and conventions. This configuration setting controls whether to use extended JSON or provide a clean view that conforms to standard JSON.

Query Output Attribute

If set, the query will be written to a specified attribute on the output flowfiles.

Batch Size

The number of elements returned from the server in one batch.

Results Per FlowFile

How many results to put into a flowfile at once. The whole body will be treated as a JSON array of results.

Date Format

The date format string to use for formatting Date fields that are returned from Mongo. It is only applied when the JSON output format is set to Standard JSON.

Relationships

  • failure: The input flowfile gets sent to this relationship when the query fails.

  • original: The input flowfile gets sent to this relationship when the query succeeds.

  • results: The result set of the aggregation will be sent to this relationship.

Input Requirement

This component allows an incoming relationship.

Additional Details

Description:

This processor runs a MongoDB aggregation query based on user-defined settings. The following is an example of such a query (and what the expected input looks like):

[
  {
    "$project": {
      "domain": 1
    }
  },
  {
    "$group": {
      "_id": {
        "domain": "$domain"
      },
      "total": {
        "$sum": 1
      }
    }
  }
]
json
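
For illustration only (these documents are hypothetical), running the pipeline above against a collection containing the following documents:

[
  { "domain": "example.com", "page": "/a" },
  { "domain": "example.com", "page": "/b" },
  { "domain": "example.org", "page": "/" }
]
json

would produce results similar to the following (the order of the groups is not guaranteed):

[
  { "_id": { "domain": "example.com" }, "total": 2 },
  { "_id": { "domain": "example.org" }, "total": 1 }
]
json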

SampleRecord

Samples the records of a FlowFile based on a specified sampling strategy (such as Reservoir Sampling). The resulting FlowFile may be of a fixed number of records (in the case of reservoir-based algorithms) or some subset of the total number of records (in the case of probabilistic sampling), or a deterministic number of records (in the case of interval sampling).

Tags: record, sample, reservoir, range, interval

Properties

Record Reader

Specifies the Controller Service to use for parsing incoming data and determining the data’s schema

Record Writer

Specifies the Controller Service to use for writing results to a FlowFile

Sampling Strategy

Specifies which method to use for sampling records from the incoming FlowFile

Sampling Interval

Specifies the number of records to skip before writing a record to the outgoing FlowFile. This property is only used if Sampling Strategy is set to Interval Sampling. A value of zero (0) will cause no records to be included in the outgoing FlowFile, a value of one (1) will cause all records to be included, and a value of two (2) will cause half the records to be included, and so on.

Sampling Range

Specifies the range of records to include in the sample, from 1 to the total number of records. An example is '3,6-8,20-' which includes the third record, the sixth, seventh and eighth records, and all records from the twentieth record on. Commas separate intervals that don’t overlap, and an interval can be between two numbers (i.e. 6-8) or up to a given number (i.e. -5), or from a number to the number of the last record (i.e. 20-). If this property is unset, all records will be included.

Sampling Probability

Specifies the probability (as a percent from 0-100) of a record being included in the outgoing FlowFile. This property is only used if Sampling Strategy is set to Probabilistic Sampling. A value of zero (0) will cause no records to be included in the outgoing FlowFile, and a value of 100 will cause all records to be included in the outgoing FlowFile.

Reservoir Size

Specifies the number of records to write to the outgoing FlowFile. This property is only used if Sampling Strategy is set to reservoir-based strategies such as Reservoir Sampling.

Random Seed

Specifies a particular number to use as the seed for the random number generator (used by probabilistic strategies). Setting this property will ensure the same records are selected even when using probabilistic strategies.

Relationships

  • success: The FlowFile is routed to this relationship if the sampling completed successfully

  • failure: If a FlowFile fails processing for any reason (for example, any record is not valid), the original FlowFile will be routed to this relationship

  • original: The original FlowFile is routed to this relationship if sampling is successful

Writes Attributes

  • mime.type: The MIME type indicated by the record writer

  • record.count: The number of records in the resulting flow file

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

Additional Details

This processor takes in a record set and samples records from the set according to the specified sampling strategy. The available sampling strategies are:

  • Interval Sampling

    Select every Nth record based on the value of the Sampling Interval property. For example, if there are 100 records in the set and the Sampling Interval is set to 4, there will be 25 records in the output, namely every 4th record. This performs uniform sampling of the record set so is best suited for record sets that are uniformly distributed. For example a record set representing user information that is uniformly distributed will result in the output records also being uniformly distributed. The outgoing record count is deterministic and is exactly the total number of records divided by the Sampling Interval value.

  • Probabilistic Sampling

    Select each record with probability P, an integer percentage specified by the Sampling Probability value. For example, an incoming record set of 100 records with a Sampling Probability value of 20 should have roughly 20 records in the output. Use this when you want to output record sets of roughly the same size (but not exactly) and when you want each record to have the same “chance” to be selected for the output set. As another example, if you send the same flow file into the processor twice, a sampling strategy of Interval Sampling will always produce the same output, where Probabilistic Sampling may output different records (and a different total number of records).

  • Reservoir Sampling

    Select K records from a record set having N total values, where K is the value of the Reservoir Size property and each record has an equal probability of being selected (exactly K / N). For example, an incoming record set of 100 records with a Reservoir Size value of 20 should have exactly 20 records in the output, randomly chosen from the input record set. Use this when you want to control the exact number of output records and have each input record have the same probability of being selected. As another example, if you send the same flow file into the processor twice, a sampling strategy of Interval Sampling will always produce the same output (same records and number of records), where Probabilistic Sampling may output different records (and a different total number of records), and Reservoir Sampling may output different records but the same total number of records. Note that the reservoir is kept in-memory, so if the size of the reservoir is very large, it may cause memory issues.

The “Random Seed” property applies to strategies/algorithms that use a pseudorandom random number generator, such as Probabilistic Sampling and Reservoir Sampling. The property is optional but if set will guarantee the same records in a flow file will be selected by the algorithm each time. This is useful for testing flows using non-deterministic algorithms such as Probabilistic Sampling and Reservoir Sampling.
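
The reservoir idea can be sketched in a few lines of Groovy. This is only an illustration of the underlying algorithm (Algorithm R) over an in-memory list; it is not the processor's implementation.

def reservoirSample(List records, int k, Random rng = new Random()) {
    def reservoir = []
    records.eachWithIndex { rec, i ->
        if (i < k) {
            reservoir << rec             // fill the reservoir with the first k records
        } else {
            int j = rng.nextInt(i + 1)   // uniform index in [0, i]
            if (j < k) {
                reservoir[j] = rec       // replace an existing entry with probability k / (i + 1)
            }
        }
    }
    return reservoir
}

// Example: sample 20 of 100 values; fixing the seed makes the selection repeatable
def sample = reservoirSample((1..100).toList(), 20, new Random(42L))
groovy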

ScanAttribute

Scans the specified attributes of FlowFiles, checking to see if any of their values are present within the specified dictionary of terms

Tags: scan, attributes, search, lookup, find, text

Properties

Dictionary File

A new-line-delimited text file that includes the terms that should trigger a match. Empty lines are ignored. The contents of the text file are loaded into memory when the processor is scheduled and reloaded when the contents are modified.

Attribute Pattern

Regular Expression that specifies the names of attributes whose values will be matched against the terms in the dictionary

Match Criteria

If set to All Must Match, then FlowFiles will be routed to 'matched' only if all specified attributes' values are found in the dictionary. If set to At Least 1 Must Match, FlowFiles will be routed to 'matched' if any attribute specified is found in the dictionary

Dictionary Filter Pattern

A Regular Expression that will be applied to each line in the dictionary file. If the regular expression does not match the line, the line will not be included in the list of terms to search for. If a Matching Group is specified, only the portion of the term that matches that Matching Group will be used instead of the entire term. If not specified, all terms in the dictionary will be used and each term will consist of the text of the entire line in the file

Relationships

  • matched: FlowFiles whose attributes are found in the dictionary will be routed to this relationship

  • unmatched: FlowFiles whose attributes are not found in the dictionary will be routed to this relationship

Input Requirement

This component requires an incoming relationship.

ScanContent

Scans the content of FlowFiles for terms that are found in a user-supplied dictionary. If a term is matched, the UTF-8 encoded version of the term will be added to the FlowFile using the 'matching.term' attribute

Tags: aho-corasick, scan, content, byte sequence, search, find, dictionary

Properties

Dictionary File

The filename of the terms dictionary

Dictionary Encoding

Indicates how the dictionary is encoded. If 'text', dictionary terms are new-line delimited and UTF-8 encoded; if 'binary', dictionary terms are denoted by a 4-byte integer indicating the term length followed by the term itself
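
A binary dictionary in this layout can be produced with a short Groovy sketch. This assumes the 4-byte length is written as a standard Java DataOutputStream int (big-endian) and that the terms are UTF-8 encoded; the file name and terms are hypothetical.

// Write each term as a 4-byte length prefix followed by the term bytes
def terms = ['alpha', 'beta', 'gamma']
new File('/tmp/terms.dict').withDataOutputStream { out ->
    terms.each { term ->
        byte[] bytes = term.getBytes('UTF-8')
        out.writeInt(bytes.length)   // 4-byte length prefix
        out.write(bytes)             // the term itself
    }
}
groovy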

Relationships

  • matched: FlowFiles that match at least one term in the dictionary are routed to this relationship

  • unmatched: FlowFiles that do not match any term in the dictionary are routed to this relationship

Writes Attributes

  • matching.term: The term that caused the Processor to route the FlowFile to the 'matched' relationship; if FlowFile is routed to the 'unmatched' relationship, this attribute is not added

Input Requirement

This component requires an incoming relationship.

ScriptedFilterRecord

This processor provides the ability to filter records out of FlowFiles using a user-provided script. Every record is evaluated by the script, which must return a boolean value. Records for which the script returns "true" are routed to the "success" relationship in a batch. Other records are filtered out.

Tags: record, filter, script, groovy

Properties

Record Reader

The Record Reader to use for parsing the incoming FlowFile into Records

Record Writer

The Record Writer to use for serializing Records after they have been transformed

Script Language

The Language to use for the script

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Relationships

  • success: Matching records of the original FlowFile will be routed to this relationship. If there are no matching records, no FlowFile will be routed here.

  • failure: In case of any issue during processing the incoming FlowFile, the incoming FlowFile will be routed to this relationship.

  • original: After successful processing, the incoming FlowFile will be transferred to this relationship. This happens regardless of the number of filtered or remaining records.

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records within the flow file.

  • record.error.message: On failure, this attribute provides the error message encountered by the Reader or Writer.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

Additional Details

Description

The ScriptedFilterRecord Processor provides the ability to use a scripting language, such as Groovy, in order to remove Records from an incoming FlowFile. NiFi provides several different Processors that can be used to work with Records in different ways. Each of these processors has its pros and cons. The ScriptedFilterRecord is intended to work together with these processors and be used as a pre-processing step before handing the FlowFile to more resource-intensive Processors, like ScriptedTransformRecord.

The Processor expects a user-defined script in order to determine which Records should be kept and which should be filtered out. When creating a script, it is important to note that, unlike ExecuteScript, this Processor does not allow the script itself to expose Properties to be configured or define Relationships.

The provided script is evaluated once for each Record that is encountered in the incoming FlowFile. Each time the script is invoked, it is expected to return a boolean value, which is used as the basis for filtering: Records for which the script returns true are included in the outgoing FlowFile, which is routed to the success Relationship. Records for which the script returns false are not added to the output. In addition, the incoming FlowFile is transferred to the original Relationship without change. If the script returns an object that is not a boolean, the incoming FlowFile is routed to the failure Relationship instead and no FlowFile is routed to the success Relationship.

This Processor maintains a Counter: “Records Processed” indicating the number of Records that were passed to the script regardless of the result of the filtering.

Variable Bindings

While the script provided to this Processor does not need to provide boilerplate code or implement any classes/interfaces, it does need some way to access the Records and other information that it needs in order to perform its task. This is accomplished by using Variable Bindings. Each time that the script is invoked, each of the following variables will be made available to the script:

  • record: The Record that is to be processed. (Variable Class: Record)

  • recordIndex: The zero-based index of the Record in the FlowFile. (Variable Class: Long, a 64-bit signed integer)

  • log: The Processor’s Logger. Anything that is logged to this logger will be written to the logs as if the Processor itself had logged it. Additionally, a bulletin will be created for any log message written to this logger (though by default, the Processor will hide any bulletins with a level below WARN). (Variable Class: ComponentLog)

  • attributes: Map of key/value pairs that are the Attributes of the FlowFile. Both the keys and the values of this Map are of type String. This Map is immutable. Any attempt to modify it will result in an UnsupportedOperationException being thrown. (Variable Class: java.util.Map)

Return Value

Each time the script is invoked, it is expected to return a boolean value. Return values other than a boolean, including null, are treated as unexpected script behaviour and handled accordingly: processing is interrupted and the incoming FlowFile is transferred to the failure relationship without further execution.

Example Scripts
Filtering based on position

The following script will keep only the first 2 Records from a FlowFile and filter out all the rest.

Example Input (CSV):

name, allyOf
Decelea, Athens
Corinth, Sparta
Mycenae, Sparta
Potidaea, Athens

Example Output (CSV):

name, allyOf
Decelea, Athens
Corinth, Sparta

Example Script (Groovy):

return recordIndex < 2 ? true : false
groovy
Filtering based on Record contents

The following script will filter the Records based on their content. Any Record that satisfies the condition will be part of the FlowFile routed to the success Relationship.

Example Input (JSON):

[
  {
    "city": "Decelea",
    "allyOf": "Athens"
  },
  {
    "city": "Corinth",
    "allyOf": "Sparta"
  },
  {
    "city": "Mycenae",
    "allyOf": "Sparta"
  },
  {
    "city": "Potidaea",
    "allyOf": "Athens"
  }
]
json

Example Output (JSON):

[
  {
    "city": "Decelea",
    "allyOf": "Athens"
  },
  {
    "city": "Potidaea",
    "allyOf": "Athens"
  }
]
json

Example Script (Groovy):

if (record.getValue("allyOf") == "Athens") {
    return true;
} else {
    return false;
}
groovy

ScriptedPartitionRecord

Receives Record-oriented data (i.e., data that can be read by the configured Record Reader) and evaluates the user-provided script against each record in the incoming FlowFile. Each record is then grouped with other records that share the same partition, and a FlowFile is created for each group of records. Two records share the same partition if evaluating the script produces the same return value for both.

Tags: record, partition, script, groovy, segment, split, group, organize

Properties

Record Reader

The Record Reader to use for parsing the incoming FlowFile into Records

Record Writer

The Record Writer to use for serializing Records after they have been transformed

Script Language

The Language to use for the script

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Relationships

  • success: FlowFiles that are successfully partitioned will be routed to this relationship

  • failure: If a FlowFile cannot be partitioned from the configured input format to the configured output format, the unchanged FlowFile will be routed to this relationship

  • original: Once all records in an incoming FlowFile have been partitioned, the original FlowFile is routed to this relationship.

Writes Attributes

  • partition: The partition of the outgoing flow file. If the script indicates that the partition has a null value, the attribute will be set to the literal string "<null partition>" (without quotes). Otherwise, the attribute is set to the String representation of whatever value is returned by the script.

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records within the flow file.

  • record.error.message: On failure, this attribute provides the error message encountered by the Reader or Writer.

  • fragment.index: A one-up number that indicates the ordering of the partitioned FlowFiles that were created from a single parent FlowFile

  • fragment.count: The number of partitioned FlowFiles generated from the parent FlowFile

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component allows an incoming relationship.

Additional Details

Description

The ScriptedPartitionRecord provides the ability to use a scripting language, such as Groovy, to quickly and easily partition a Record based on its contents. There are multiple ways to achieve similar behaviour, such as using PartitionRecord, but working with user-provided scripts opens a wide range of possibilities for the decision logic used to partition the individual records.

The provided script is evaluated once for each Record that is encountered in the incoming FlowFile. Each time the script is invoked, it is expected to return an object or a null value. The string representation of the return value is used as the record’s “partition”; a null value is handled separately, without conversion into a string. All Records with the same partition are then batched into one FlowFile and routed to the success Relationship.

This Processor maintains a Counter named “Records Processed”, which represents the number of processed Records regardless of partitioning.

Variable Bindings

While the script provided to this Processor does not need to provide boilerplate code or implement any classes/interfaces, it does need some way to access the Records and other information that it needs in order to perform its task. This is accomplished by using Variable Bindings. Each time that the script is invoked, each of the following variables will be made available to the script:

  • record: The Record that is to be processed. (Variable Class: Record)

  • recordIndex: The zero-based index of the Record in the FlowFile. (Variable Class: Long, a 64-bit signed integer)

  • log: The Processor’s Logger. Anything that is logged to this logger will be written to the logs as if the Processor itself had logged it. Additionally, a bulletin will be created for any log message written to this logger (though by default, the Processor will hide any bulletins with a level below WARN). (Variable Class: ComponentLog)

  • attributes: Map of key/value pairs that are the Attributes of the FlowFile. Both the keys and the values of this Map are of type String. This Map is immutable. Any attempt to modify it will result in an UnsupportedOperationException being thrown. (Variable Class: java.util.Map)

Return Value

The script is invoked separately for each Record. It is acceptable to return any Object that can be represented as a String; this String value will be used as the partition of the given Record. Additionally, the script may return null.
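
As a hedged illustration, a script can also decide how to handle missing values itself rather than relying on the literal "<null partition>" value; the field name and fallback partition below are hypothetical.

// Group records with no "stellarType" value under an explicit "unknown" partition
def type = record.getValue('stellarType')
return type != null ? type : 'unknown'
groovy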

Example

The following script will partition the input on the value of the “stellarType” field.

Example Input (CSV):

starSystem, stellarType
Wolf 359, M
Epsilon Eridani, K
Tau Ceti, G
Groombridge 1618, K
Gliese 1, M

Example Output 1 (CSV) - for partition “M”:

starSystem, stellarType
Wolf 359,M
Gliese 1,M

Example Output 2 (CSV) - for partition “K”:

starSystem, stellarType
Epsilon Eridani,K
Groombridge 1618,K

Example Output 3 (CSV) - for partition “G”:

starSystem, stellarType
Tau Ceti,G

Note: the order of the outgoing FlowFiles is not guaranteed.

Example Script (Groovy):

return record.getValue("stellarType")
groovy

ScriptedTransformRecord

Provides the ability to evaluate a simple script against each record in an incoming FlowFile. The script may transform the record in some way, filter the record, or fork additional records. See Processor’s Additional Details for more information.

Tags: record, transform, script, groovy, update, modify, filter

Properties

Record Reader

The Record Reader to use for parsing the incoming FlowFile into Records

Record Writer

The Record Writer to use for serializing Records after they have been transformed

Script Language

The Language to use for the script

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Relationships

  • success: Each FlowFile that is successfully transformed will be routed to this Relationship

  • failure: Any FlowFile that cannot be transformed will be routed to this Relationship

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records in the FlowFile

  • record.error.message: On failure, this attribute provides the error message encountered by the Reader or Writer.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

Additional Details

Description

The ScriptedTransformRecord provides the ability to use a scripting language, such as Groovy, to quickly and easily update the contents of a Record. NiFi provides several different Processors that can be used to manipulate Records in different ways. Each of these processors has its pros and cons. The ScriptedTransformRecord is perhaps the most powerful and most versatile option. However, it is also the most error-prone, as it depends on writing custom scripts. It is also likely to yield the lowest performance, as processors and libraries written directly in Java are likely to perform better than interpreted scripts.

When creating a script, it is important to note that, unlike ExecuteScript, this Processor does not allow the script itself to expose Properties to be configured or define Relationships. This is a deliberate decision. If it is necessary to expose such configuration, the ExecuteScript processor should be used instead. By not exposing these elements, the script avoids the need to define a Class or implement methods with a specific method signature. Instead, the script can avoid any boilerplate code and focus purely on the task at hand.

The provided script is evaluated once for each Record that is encountered in the incoming FlowFile. Each time that the script is invoked, it is expected to return a Record object (See note below regarding Return Values). That Record is then written using the configured Record Writer. If the script returns a null value, the Record will not be written. If the script returns an object that is not a Record, the incoming FlowFile will be routed to the failure relationship.

This processor maintains two Counters: “Records Transformed” indicating the number of Records that were passed to the script and for which the script returned a Record, and “Records Dropped” indicating the number of Records that were passed to the script and for which the script returned a value of null.

Variable Bindings

While the script provided to this Processor does not need to provide boilerplate code or implement any classes/interfaces, it does need some way to access the Records and other information that it needs in order to perform its task. This is accomplished by using Variable Bindings. Each time that the script is invoked, each of the following variables will be made available to the script:

Variable Name Description Variable Class

record

The Record that is to be transformed.

Record

recordIndex

The zero-based index of the Record in the FlowFile.

Long (64-bit signed integer)

log

The Processor’s Logger. Anything that is logged to this logger will be written to the logs as if the Processor itself had logged it. Additionally, a bulletin will be created for any log message written to this logger (though by default, the Processor will hide any bulletins with a level below WARN).

ComponentLog

attributes

Map of key/value pairs that are the Attributes of the FlowFile. Both the keys and the values of this Map are of type String. This Map is immutable. Any attempt to modify it will result in an UnsupportedOperationException being thrown.

java.util.Map
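
As a minimal illustrative sketch (not part of the original examples; the log message and the use of the core filename attribute are purely illustrative), the following Groovy snippet combines these bindings to log the first Record of each FlowFile and then return the Record unchanged:

// 'record', 'recordIndex', 'attributes' and 'log' are bindings provided by the processor.
if (recordIndex == 0) {
    log.info("First record of ${attributes['filename']}: ${record.toMap()}")
}
return record
groovy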

Return Value

Each time that the script is invoked, it is expected to return a Record object or a Collection of Record objects. Those Records are then written using the configured Record Writer. If the script returns a null value, the Record will not be written. If the script returns an object that is not a Record or Collection of Records, the incoming FlowFile will be routed to the failure relationship.

The Record that is provided to the script is mutable. Therefore, it is a common pattern to update the record object in the script and simply return that same record object.

Note: Depending on the scripting language, a script with no explicit return value may return null or may return the last value that was referenced. Because returning null will result in dropping the Record and a non-Record return value will result in an Exception (and simply for the sake of clarity), it is important to ensure that the configured script has an explicit return value.
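
For example, a minimal Groovy sketch that makes the return value explicit (the “processedBy” field name is purely illustrative):

// Modify the Record, then return it explicitly instead of relying on the script's last expression.
record.setValue("processedBy", "nifi")
return record
groovy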

Adding New Fields

A very common usage of Record-oriented processors is to allow the Record Reader to infer its schema and have the Record Writer inherit the Record’s schema. In this scenario, it is important to note that the Record Writer will inherit the schema of the first Record that it encounters. Therefore, if the configured script will add a new field to a Record, it is important to ensure that the field is added to all Records (with a null value where appropriate).

See the Adding New Fields example for more details.

Performance Considerations

NiFi offers many different processors for updating records in various ways. While each of these has its own pros and cons, performance is often an important consideration. It is generally the case that standard processors, such as UpdateRecord, will perform better than script-oriented processors. However, this may not always be the case. For situations when performance is critical, the best case is to test both approaches to see which performs best.

A simple 5-minute benchmark was done to analyze the difference in performance. The script used simply modifies one field and returns the Record otherwise unmodified. The results are shown below. Note that no specifics are given regarding hardware, specifically because the results should not be used to garner expectations of absolute performance but rather to show relative performance between the different options.

Processor Script Used Records processed in 5 minutes

UpdateRecord

No Script. User-defined Property added with name /num and value 42

50.1 million

ScriptedTransformRecord - Using Language: Groovy

record.setValue("num", 42)
record

18.9 million

Example Scripts
Remove First Record

The following script will remove the first Record from each FlowFile that it encounters.

Example Input (CSV):

name, num
Mark, 42
Felicia, 3720
Monica, -3

Example Output (CSV):

name, num
Felicia, 3720
Monica, -3

Example Script (Groovy):

return recordIndex == 0 ? null : record
groovy
Replace Field Value

The following script will replace any field in a Record if the value of that field is equal to the value of the “Value To Replace” attribute. The value of that field will be replaced with whatever value is in the “Replacement Value” attribute.

Example Input Content (JSON):

[
  {
    "book": {
      "author": "John Doe",
      "date": "01/01/1980"
    }
  },
  {
    "book": {
      "author": "Jane Doe",
      "date": "01/01/1990"
    }
  }
]
json

Example Input Attributes:

Attribute Name Attribute Value

Value To Replace

Jane Doe

Replacement Value

Author Unknown

Example Output (JSON):

[
  {
    "book": {
      "author": "John Doe",
      "date": "01/01/1980"
    }
  },
  {
    "book": {
      "author": "Author Unknown",
      "date": "01/01/1990"
    }
  }
]
json

Example Script (Groovy):

def replace(rec) {
    rec.toMap().each { k, v ->
        // If the field value is equal to the attribute 'Value to Replace', then set the
        // field value to the 'Replacement Value' attribute.
        if (v?.toString()?.equals(attributes['Value to Replace'])) {
            rec.setValue(k, attributes['Replacement Value'])
        }

        // Call Recursively if the value is a Record
        if (v instanceof org.apache.nifi.serialization.record.Record) {
            replace(v)
        }
    }
}

replace(record)
return record
groovy
Pass-through

The following script allows each Record to pass through without altering the Record in any way.

Example Input: (any Records readable by the configured Record Reader)

Example Output: (identical to the input)

Example Script (Groovy):

record
groovy
Adding New Fields

The following script adds a new field named “favoriteColor” to all Records. Additionally, it sets an “isOdd” field to true on every Record whose zero-based index is odd.

It is important that all Records have the same schema. Since we want to add an “isOdd” value to Records 1 and 3, the schema for Records 0 and 2 must also account for this. As a result, we will add the field to all Records but use a null value for the Records whose index is even. See Adding New Fields for more information.

Example Input Content (CSV):

name, favoriteFood
John Doe, Spaghetti
Jane Doe, Pizza
Jake Doe, Sushi
June Doe, Hamburger

Example Output (CSV):

name, favoriteFood, favoriteColor, isOdd
John Doe, Spaghetti, Blue,
Jane Doe, Pizza, Blue, true
Jake Doe, Sushi, Blue,
June Doe, Hamburger, Blue, true

Example Script (Groovy):

import org.apache.nifi.serialization.record.RecordField;
import org.apache.nifi.serialization.record.RecordFieldType;

// Always set favoriteColor to Blue.
// Because we are calling #setValue with a String as the field name, the field type will be inferred.
record.setValue("favoriteColor", "Blue")

// Set the 'isOdd' field to true if the record index is odd. Otherwise, set the 'isOdd' field to `null`.
// Because the value may be `null` for the first Record (in fact, it always will be for this particular case),
// we need to ensure that the Record Writer's schema be given the correct type for the field. As a result, we will not call
// #setValue with a String as the field name but rather will pass a RecordField as the first argument, as the RecordField
// allows us to specify the type of the field.
// Also note that `RecordField` and `RecordFieldType` are `import`ed above.
record.setValue(new RecordField("isOdd", RecordFieldType.BOOLEAN.getDataType()), recordIndex % 2 == 1 ? true : null)

return record
groovy
Fork Record

The following script returns each Record that it encounters, plus another Record that is derived from the first but whose ‘num’ field is one less than the ‘num’ field of the input.

Example Input (CSV):

name, num
Mark, 42
Felicia, 3720
Monica, -3

Example Output (CSV):

name, num
Mark, 42
Mark, 41
Felicia, 3720
Felicia, 3719
Monica, -3
Monica, -4

Example Script (Groovy):

import org.apache.nifi.serialization.record.*

def derivedValues = new HashMap(record.toMap())
derivedValues.put('num', derivedValues['num'] - 1)
derived = new MapRecord(record.schema, derivedValues)
return [record, derived]
groovy

ScriptedValidateRecord

This processor provides the ability to validate records in FlowFiles using a user-provided script. The script is expected to receive a Record as its incoming argument and return a boolean value. Based on this result, the processor categorizes the records as "valid" or "invalid" and routes them to the respective relationship in batches. Additionally, the original FlowFile will be routed to the "original" relationship or, in case of unsuccessful processing, to the "failure" relationship.

Tags: record, validate, script, groovy

Properties

Record Reader

The Record Reader to use parsing the incoming FlowFile into Records

Record Writer

The Record Writer to use for serializing Records after they have been transformed

Script Language

The Language to use for the script

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Relationships

  • failure: In case of any issue during processing the incoming flow file, the incoming FlowFile will be routed to this relationship.

  • invalid: FlowFile containing the invalid records from the incoming FlowFile will be routed to this relationship. If there are no invalid records, no FlowFile will be routed to this Relationship.

  • original: After successful processing, the incoming FlowFile will be transferred to this relationship, regardless of whether any FlowFiles were routed to the "valid" or "invalid" relationships.

  • valid: FlowFile containing the valid records from the incoming FlowFile will be routed to this relationship. If there are no valid records, no FlowFile will be routed to this Relationship.

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records within the flow file.

  • record.error.message: On failure, this attribute provides the error message encountered by the Reader or Writer.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

Additional Details

Description

The ScriptedValidateRecord Processor provides the ability to use a scripting language, such as Groovy or Jython, in order to validate Records in an incoming FlowFile. The provided script will be evaluated against each Record in an incoming FlowFile. Each of those records will then be routed to either the “valid” or “invalid” FlowFile. As a result, each incoming FlowFile may be broken up into two individual FlowFiles (if some records are valid and some are invalid, according to the script), or the incoming FlowFile may have all of its Records kept together in a single FlowFile, routed to either “valid” or “invalid” (if all records are valid or if all records are invalid, according to the script).

The Processor expects a user defined script in order to determine the validity of the Records. When creating a script, it is important to note that, unlike ExecuteScript, this Processor does not allow the script itself to expose Properties to be configured or define Relationships.

The provided script is evaluated once for each Record that is encountered in the incoming FlowFile. Each time that the script is invoked, it is expected to return a boolean value, which is used to determine whether the given Record is valid: if the script returns true, the Record is considered valid and will be included in the outgoing FlowFile routed to the valid Relationship; if the script returns false, the Record will be added to the FlowFile routed to the invalid Relationship. Regardless of the number of incoming Records, the outgoing Records are batched: for one incoming FlowFile, at most one FlowFile is routed to each of the valid and invalid Relationships. If there are no valid (or invalid) Records, no FlowFile will be transferred to the respective Relationship. In addition, the incoming FlowFile will be transferred to the original Relationship without change. If the script returns an object that is not a boolean, the incoming FlowFile will be routed to the failure Relationship instead and no FlowFile will be routed to the valid or invalid Relationships.

This Processor maintains a Counter: “Records Processed” indicating the number of Records that were processed by the Processor.

Variable Bindings

While the script provided to this Processor does not need to provide boilerplate code or implement any classes/interfaces, it does need some way to access the Records and other information that it needs in order to perform its task. This is accomplished by using Variable Bindings. Each time that the script is invoked, each of the following variables will be made available to the script:

Variable Name Description Variable Class

record

The Record that is to be processed.

Record

recordIndex

The zero-based index of the Record in the FlowFile.

Long (64-bit signed integer)

log

The Processor’s Logger. Anything that is logged to this logger will be written to the logs as if the Processor itself had logged it. Additionally, a bulletin will be created for any log message written to this logger (though by default, the Processor will hide any bulletins with a level below WARN).

ComponentLog

attributes

Map of key/value pairs that are the Attributes of the FlowFile. Both the keys and the values of this Map are of type String. This Map is immutable. Any attempt to modify it will result in an UnsupportedOperationException being thrown.

java.util.Map

Return Value

Each time the script is invoked, it is expected to return a boolean value. Any other return value, including null, is treated as unexpected script behaviour: processing is interrupted and the incoming FlowFile is transferred to the failure relationship without further execution.
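
As a minimal illustrative sketch (using the “numberOfTrains” field from the examples below), a script can guard against this by always returning an explicit boolean:

// Return an explicit boolean; a null or non-boolean result routes the incoming FlowFile to 'failure'.
def trains = record.getValue("numberOfTrains")
return trains != null && trains.toString().isInteger()
groovy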

Example Scripts
Validating based on position

The following script will consider only the first 2 Records as valid.

Example Input (CSV):

company, numberOfTrains
Boston & Maine Railroad, 3
Chesapeake & Ohio Railroad, 2
Pennsylvania Railroad, 4
Reading Railroad, 2

Example Output (CSV) - valid Relationship:

company, numberOfTrains
Boston & Maine Railroad, 3
Chesapeake & Ohio Railroad, 2

Example Output (CSV) - invalid Relationship:

company, numberOfTrains
Pennsylvania Railroad, 4
Reading Railroad, 2

Example Script (Groovy):

return recordIndex < 2 ? true : false
groovy
Validating based on Record contents

The following script will filter the Records based on their content. Any Record that satisfies the condition will be part of the FlowFile routed to the valid Relationship; the others will be routed to the invalid Relationship.

Example Input (JSON):

[
  {
    "company": "Boston & Maine Railroad",
    "numberOfTrains": 3
  },
  {
    "company": "Chesapeake & Ohio Railroad",
    "numberOfTrains": -1
  },
  {
    "company": "Pennsylvania Railroad",
    "numberOfTrains": 2
  },
  {
    "company": "Reading Railroad",
    "numberOfTrains": 4
  }
]
json

Example Output (JSON) - valid Relationship:

[
  {
    "company": "Boston & Maine Railroad",
    "numberOfTrains": 3
  },
  {
    "company": "Pennsylvania Railroad",
    "numberOfTrains": 2
  },
  {
    "company": "Reading Railroad",
    "numberOfTrains": 4
  }
]
json

Example Output (JSON) - invalid Relationship:

[
  {
    "company": "Chesapeake & Ohio Railroad",
    "numberOfTrains": -1
  }
]
json

Example Script (Groovy):

if (record.getValue("numberOfTrains").toInteger() >= 0) {
    return true;
} else {
    return false;
}
groovy

SearchElasticsearch

A processor that allows the user to repeatedly run a paginated query (with aggregations) written with the Elasticsearch JSON DSL. Search After/Point in Time queries must include a valid "sort" field. The processor will retrieve multiple pages of results until either no more results are available or the Pagination Keep Alive expiration is reached, after which the query will restart with the first page of results being retrieved.

Tags: elasticsearch, elasticsearch5, elasticsearch6, elasticsearch7, elasticsearch8, query, scroll, page, search, json

Properties

Query Definition Style

How the JSON Query will be defined for use by the processor.

Query

A query in JSON syntax, not Lucene syntax. Ex: {"query":{"match":{"somefield":"somevalue"}}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Query Clause

A "query" clause in JSON syntax, not Lucene syntax. Ex: {"match":{"somefield":"somevalue"}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Size

The maximum number of documents to retrieve in the query. If the query is paginated, this "size" applies to each page of the query, not the "size" of the entire result set.

Sort

Sort results by one or more fields, in JSON syntax. Ex: [{"price" : {"order" : "asc", "mode" : "avg"}}, {"post_date" : {"format": "strict_date_optional_time_nanos"}}]

Aggregations

One or more query aggregations (or "aggs"), in JSON syntax. Ex: {"items": {"terms": {"field": "product", "size": 10}}}

Fields

Fields of indexed documents to be retrieved, in JSON syntax. Ex: ["user.id", "http.response.*", {"field": "@timestamp", "format": "epoch_millis"}]

Script Fields

Fields to be created using script evaluation at query runtime, in JSON syntax. Ex: {"test1": {"script": {"lang": "painless", "source": "doc['price'].value * 2"}}, "test2": {"script": {"lang": "painless", "source": "doc['price'].value * params.factor", "params": {"factor": 2.0}}}}

Query Attribute

If set, the executed query will be set on each result flowfile in the specified attribute.

Index

The name of the index to use.

Type

The type of this document (used by Elasticsearch for indexing and searching).

Max JSON Field String Length

The maximum allowed length of a string value when parsing a JSON document or attribute.

Client Service

An Elasticsearch client service to use for running queries.

Search Results Split

Output a flowfile containing all hits or one flowfile for each individual hit or one flowfile containing all hits from all paged responses.

Search Results Format

Format of Hits output.

Aggregation Results Split

Output a flowfile containing all aggregations or one flowfile for each individual aggregation.

Aggregation Results Format

Format of Aggregation output.

Output No Hits

Output a "hits" flowfile even if no hits found for query. If true, an empty "hits" flowfile will be output even if "aggregations" are output.

Pagination Type

Pagination method to use. Not all types are available for all Elasticsearch versions, check the Elasticsearch docs to confirm which are applicable and recommended for your service.

Pagination Keep Alive

Pagination "keep_alive" period. Period Elasticsearch will keep the scroll/pit cursor alive in between requests (this is not the time expected for all pages to be returned, but the maximum allowed time for requests between page retrievals).

Dynamic Properties

The name of a URL query parameter to add

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the query request body. For SCROLL type queries, these parameters are only used in the initial (first page) query as the Elasticsearch Scroll API does not support the same query parameters for subsequent pages of data.

Relationships

  • aggregations: Aggregations are routed to this relationship.

  • hits: Search hits are routed to this relationship.

Writes Attributes

  • mime.type: application/json

  • aggregation.name: The name of the aggregation whose results are in the output flowfile

  • aggregation.number: The number of the aggregation whose results are in the output flowfile

  • page.number: The number of the page (request), starting from 1, in which the results were returned that are in the output flowfile

  • hit.count: The number of hits that are in the output flowfile

  • elasticsearch.query.error: The error message provided by Elasticsearch if there is an error querying the index.

Stateful

Scope: Local

The pagination state (scrollId, searchAfter, pitId, hitCount, pageCount, pageExpirationTimestamp) is retained in between invocations of this processor until the Scroll/PiT has expired (when the current time is later than the last query execution plus the Pagination Keep Alive interval).

Input Requirement

This component does not allow an incoming relationship.

System Resource Considerations

  • MEMORY: Care should be taken on the size of each page because each response from Elasticsearch will be loaded into memory all at once and converted into the resulting flowfiles.

Additional Details

This processor is intended for use with the Elasticsearch JSON DSL and Elasticsearch 5.X and newer. It is designed to be able to take a JSON query (e.g. from Kibana) and execute it as-is against an Elasticsearch cluster in a paginated manner. Like all processors in the “restapi” bundle, it uses the official Elastic client APIs, so it supports leader detection.

The query to execute must be provided in the Query configuration property.

The query is paginated in Elasticsearch using one of the available methods - “Scroll” or “Search After” (optionally with a “Point in Time” for Elasticsearch 7.10+ with XPack enabled). The number of results per page can be controlled using the size parameter in the Query JSON. For Search After functionality, a sort parameter must be present within the Query JSON.

Search results and aggregation results can be split up into multiple flowfiles. Aggregation results will only be split at the top level because nested aggregations lose their context (and thus lose their value) if separated from their parent aggregation. Additionally, the results from all pages can be combined into a single flowfile (but the processor will only load each page of data into memory at any one time).

The following is an example query that would be accepted:

{
  "query": {
    "size": 10000,
    "sort": {
      "product": "desc"
    },
    "match": {
      "restaurant.keyword": "Local Pizzaz FTW Inc"
    }
  },
  "aggs": {
    "weekly_sales": {
      "date_histogram": {
        "field": "date",
        "interval": "week"
      },
      "aggs": {
        "items": {
          "terms": {
            "field": "product",
            "size": 10
          }
        }
      }
    }
  }
}
json
Query Pagination Across Processor Executions

This processor runs on a schedule in order to execute the same query repeatedly. Once a paginated query has been initiated within Elasticsearch, this processor will continue to retrieve results for that same query until no further results are available. After that point, a new paginated query will be initiated using the same Query JSON.

If the results are “Combined” from this processor, then the paginated query will run continually within a single invocation until no more results are available (then the processor will start a new paginated query upon its next invocation). If the results are “Split” or “Per Page”, then each invocation of this processor will retrieve the next page of results until either there are no more results or the paginated query expires within Elasticsearch.

Resetting Queries / Clearing Processor State

Local State is used to track the progress of a paginated query within this processor. If there is need to restart the query completely or change the processor configuration after a paginated query has already been started, be sure to “Clear State” of the processor once it has been stopped and before restarting.

Duplicate Results

This processor does not attempt to de-duplicate results between queries. For example, if the same query runs twice and (some or all of) the results are identical, the output will contain these same results for both invocations. This might happen if the NiFi Primary Node changes while a page of data is being retrieved, or if the processor state is cleared and the processor is then restarted.

This processor will continually run the same query unless the processor properties are updated, so unless the data in Elasticsearch has changed, the same data will be retrieved multiple times.

SegmentContent

Segments a FlowFile into multiple smaller segments on byte boundaries. Each segment is given the following attributes: fragment.identifier, fragment.index, fragment.count, segment.original.filename; these attributes can then be used by the MergeContent processor in order to reconstitute the original FlowFile
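
For illustration (sizes chosen arbitrarily): segmenting a 10 MB FlowFile with a Segment Size of 3 MB yields four segments of 3 MB, 3 MB, 3 MB and 1 MB; each segment carries the same fragment.identifier, its own fragment.index, and a fragment.count of 4, which MergeContent can use to reconstitute the original FlowFile.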

Tags: segment, split

Properties

Segment Size

The maximum data size in bytes for each segment

Relationships

  • original: The original FlowFile will be sent to this relationship

  • segments: All segments will be sent to this relationship. If the file was small enough that it was not segmented, a copy of the original is sent to this relationship as well as original

Writes Attributes

  • fragment.identifier: All segments produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the segments that were created from a single parent FlowFile

  • fragment.count: The number of segments generated from the parent FlowFile

  • segment.original.filename: The filename of the parent FlowFile

  • segment.original.filename: The filename will be updated to include the parent’s filename, the segment index, and the segment count

Input Requirement

This component requires an incoming relationship.

SendTrapSNMP

Sends information to SNMP Manager.

Tags: snmp, send, trap

Properties

SNMP Manager Host

The host of the SNMP Manager where the trap is sent.

SNMP Manager Port

The port of the SNMP Manager where the trap is sent.

SNMP Version

Three significant versions of SNMP have been developed and deployed. SNMPv1 is the original version of the protocol. More recent versions, SNMPv2c and SNMPv3, feature improvements in performance, flexibility and security.

SNMP Community

SNMPv1 and SNMPv2 use communities to establish trust between managers and agents. Most agents support three community names, one each for read-only, read-write and trap. These three community strings control different types of activities. The read-only community applies to get requests. The read-write community string applies to set requests. The trap community string applies to receipt of traps.

SNMP Security Level

SNMP version 3 provides extra security with the User Based Security Model (USM). The three levels of security are: 1. Communication without authentication and encryption (NoAuthNoPriv). 2. Communication with authentication and without encryption (AuthNoPriv). 3. Communication with authentication and encryption (AuthPriv).

SNMP Security Name

User name used for SNMP v3 Authentication.

SNMP Authentication Protocol

Hash based authentication protocol for secure authentication.

SNMP Authentication Passphrase

Passphrase used for SNMP authentication protocol.

SNMP Privacy Protocol

Privacy allows for encryption of SNMP v3 messages to ensure confidentiality of data.

SNMP Privacy Passphrase

Passphrase used for SNMP privacy protocol.

Number of Retries

Set the number of retries when requesting the SNMP Agent.

Timeout (ms)

Set the timeout in ms when requesting the SNMP Agent.

Enterprise OID

Enterprise is the vendor identification (OID) for the network management sub-system that generated the trap.

SNMP Trap Agent Address

The address where the SNMP Manager sends the trap.

Generic Trap Type

Generic trap type is an integer in the range of 0 to 6. See processor usage for details.

Specific Trap Type

Specific trap type is a number that further specifies the nature of the event that generated the trap in the case of traps of generic type 6 (enterpriseSpecific). The interpretation of this code is vendor-specific.

Trap OID Value

The value of the trap OID.

Relationships

  • success: All FlowFiles that have been successfully used to send an SNMP trap are routed to this relationship

  • failure: All FlowFiles for which the SNMP trap could not be sent are routed to this relationship

Input Requirement

This component allows an incoming relationship.

Additional Details

Summary

This processor generates and transmits SNMP Traps to the specified SNMP manager. Trap attributes can be given as processor properties, either predefined or set dynamically from FlowFiles using Expression Language. FlowFile attributes with the snmp$ prefix (e.g. snmp$1.2.3.4.5, where 1.2.3.4.5 is an OID) can be used to define additional PDU variables.

The allowable Generic Trap Types are:

  • 0 - Cold Start

  • 1 - Warm Start

  • 2 - Link Down

  • 3 - Link Up

  • 4 - Authentication Failure

  • 5 - EGP Neighbor Loss

  • 6 - Enterprise Specific

A Specific Trap Type can be set when the Enterprise Specific (6) generic trap type is chosen.
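
For illustration (all values hypothetical): configuring Generic Trap Type as Enterprise Specific (6), Specific Trap Type as 1234, and Enterprise OID as 1.3.6.1.4.1.343, while the incoming FlowFile carries an attribute named snmp$1.3.6.1.4.1.343.1.1 with the value "disk full", sends a trap whose PDU includes that additional variable binding.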

SetSNMP

Based on incoming FlowFile attributes, the processor will execute SNMP Set requests. When it finds attributes with the name snmp$<OID>, the processor will attempt to set the OID given in the attribute name to the value of that attribute.

Tags: snmp, set, oid

Properties

SNMP Agent Hostname

Hostname or network address of the SNMP Agent.

SNMP Agent Port

Port of the SNMP Agent.

SNMP Version

Three significant versions of SNMP have been developed and deployed. SNMPv1 is the original version of the protocol. More recent versions, SNMPv2c and SNMPv3, feature improvements in performance, flexibility and security.

SNMP Community

SNMPv1 and SNMPv2 use communities to establish trust between managers and agents. Most agents support three community names, one each for read-only, read-write and trap. These three community strings control different types of activities. The read-only community applies to get requests. The read-write community string applies to set requests. The trap community string applies to receipt of traps.

SNMP Security Level

SNMP version 3 provides extra security with the User Based Security Model (USM). The three levels of security are: 1. Communication without authentication and encryption (NoAuthNoPriv). 2. Communication with authentication and without encryption (AuthNoPriv). 3. Communication with authentication and encryption (AuthPriv).

SNMP Security Name

User name used for SNMP v3 Authentication.

SNMP Authentication Protocol

Hash based authentication protocol for secure authentication.

SNMP Authentication Passphrase

Passphrase used for SNMP authentication protocol.

SNMP Privacy Protocol

Privacy allows for encryption of SNMP v3 messages to ensure confidentiality of data.

SNMP Privacy Passphrase

Passphrase used for SNMP privacy protocol.

Number of Retries

Set the number of retries when requesting the SNMP Agent.

Timeout (ms)

Set the timeout in ms when requesting the SNMP Agent.

Relationships

  • success: All FlowFiles that have been successfully used to perform SNMP Set are routed to this relationship

  • failure: All FlowFiles that failed during the SNMP Set are routed to this relationship

Writes Attributes

  • snmp$<OID>: Response variable binding: OID (e.g. 1.3.6.1.4.1.343) and its value.

  • snmp$errorIndex: Denotes the variable binding in which the error occurred.

  • snmp$errorStatus: The snmp4j error status of the PDU.

  • snmp$errorStatusText: The description of error status.

  • snmp$nonRepeaters: The number of non repeater variable bindings in a GETBULK PDU (currently not supported).

  • snmp$requestID: The request ID associated with the PDU.

  • snmp$type: The snmp4j numeric representation of the type of the PDU.

  • snmp$typeString: The name of the PDU type.

Input Requirement

This component requires an incoming relationship.

Additional Details

Summary

This processor sends SNMP set requests to an SNMP agent in order to update information associated to a given OID. This processor supports SNMPv1, SNMPv2c and SNMPv3. The component is based on SNMP4J.

The processor constructs SNMP Set requests by extracting information from FlowFile attributes. The processor looks for attributes prefixed with snmp$. If such an attribute is found, the attribute name is split using the $ character. The second element must respect the OID format to be considered a valid OID. If there is a third element, it must represent the SMI Syntax integer value of the type of data associated with the given OID to allow a correct conversion. If there is no third element, the value is considered a String and will be sent as an OctetString object.
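
For illustration (the OIDs and values below are hypothetical; the type code 2 is assumed to be SNMP4J's standard SMI syntax value for Integer):

snmp$1.3.6.1.2.1.1.5.0 = nifi-node-01
snmp$1.3.6.1.4.1.343.1.2$2 = 42

The first attribute has no type element, so its value is sent as an OctetString; the second attribute's trailing $2 causes its value to be converted to an Integer before the Set request is issued.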

SignContentPGP

Sign content using OpenPGP Private Keys

Tags: PGP, GPG, OpenPGP, Encryption, Signing, RFC 4880

Properties

File Encoding

File Encoding for signing

Hash Algorithm

Hash Algorithm for signing

Signing Strategy

Strategy for writing files to success after signing

Private Key Service

PGP Private Key Service for generating content signatures

Private Key ID

PGP Private Key Identifier formatted as uppercase hexadecimal string of 16 characters used for signing

Relationships

  • success: Content signing succeeded

  • failure: Content signing failed

Writes Attributes

  • pgp.compression.algorithm: Compression Algorithm

  • pgp.compression.algorithm.id: Compression Algorithm Identifier

  • pgp.file.encoding: File Encoding

  • pgp.signature.algorithm: Signature Algorithm including key and hash algorithm names

  • pgp.signature.hash.algorithm.id: Signature Hash Algorithm Identifier

  • pgp.signature.key.algorithm.id: Signature Key Algorithm Identifier

  • pgp.signature.key.id: Signature Public Key Identifier

  • pgp.signature.type.id: Signature Type Identifier

  • pgp.signature.version: Signature Version Number

Input Requirement

This component requires an incoming relationship.

SplitAvro

Splits a binary encoded Avro datafile into smaller files based on the configured Output Size. The Output Strategy determines if the smaller files will be Avro datafiles, or bare Avro records with metadata in the FlowFile attributes. The output will always be binary encoded.

Tags: avro, split

Properties

Split Strategy

The strategy for splitting the incoming datafile. The Record strategy will read the incoming datafile by de-serializing each record.

Output Size

The number of Avro records to include per split file. In cases where the incoming file has fewer records than the Output Size, or when the total number of records does not divide evenly by the Output Size, it is possible to get a split file with fewer records.

Output Strategy

Determines the format of the output. Either Avro Datafile, or bare record. Bare record output is only intended for use with systems that already require it, and shouldn’t be needed for normal use.

Transfer Metadata

Whether or not to transfer metadata from the parent datafile to the children. If the Output Strategy is Bare Record, then the metadata will be stored as FlowFile attributes, otherwise it will be in the Datafile header.

Relationships

  • failure: If a FlowFile fails processing for any reason (for example, the FlowFile is not valid Avro), it will be routed to this relationship

  • original: The original FlowFile that was split. If the FlowFile fails processing, nothing will be sent to this relationship

  • split: All new files split from the original FlowFile will be routed to this relationship

Writes Attributes

  • fragment.identifier: All split FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the split FlowFiles that were created from a single parent FlowFile

  • fragment.count: The number of split FlowFiles generated from the parent FlowFile

  • segment.original.filename: The filename of the parent FlowFile

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

SplitContent

Splits incoming FlowFiles by a specified byte sequence

Tags: content, split, binary

Properties

Byte Sequence Format

Specifies how the <Byte Sequence> property should be interpreted

Byte Sequence

A representation of bytes to look for and upon which to split the source file into separate files

Keep Byte Sequence

Determines whether or not the Byte Sequence should be included with each Split

Byte Sequence Location

If <Keep Byte Sequence> is set to true, specifies whether the byte sequence should be added to the end of the first split or the beginning of the second; if <Keep Byte Sequence> is false, this property is ignored.
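
For illustration (values hypothetical): with Byte Sequence Format set to Text, Byte Sequence set to ###, and Keep Byte Sequence set to false, a FlowFile whose content is abc###def###ghi is split into three FlowFiles containing abc, def and ghi.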

Relationships

  • original: The original file

  • splits: All Splits will be routed to the splits relationship

Writes Attributes

  • fragment.identifier: All split FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the split FlowFiles that were created from a single parent FlowFile

  • fragment.count: The number of split FlowFiles generated from the parent FlowFile

  • segment.original.filename: The filename of the parent FlowFile

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The FlowFile with its attributes is stored in memory, not the content of the FlowFile. If many splits are generated due to the size of the content, or how the content is configured to be split, a two-phase approach may be necessary to avoid excessive use of memory.

SplitExcel

This processor splits a multi-sheet Microsoft Excel spreadsheet into multiple Microsoft Excel spreadsheets where each sheet from the original file is converted to an individual spreadsheet in its own flow file. Currently this processor is only capable of processing .xlsx (XSSF 2007 OOXML file format) Excel documents and not older .xls (HSSF '97(-2007) file format) documents. Please note all original cell styles are dropped and formulas are removed, leaving only the calculated values. Even a single-sheet Microsoft Excel spreadsheet is converted to its own flow file with all the original cell styles dropped and formulas removed.

Tags: split, text

Properties

Protection Type

Specifies whether an Excel spreadsheet is protected by a password or not.

Password

The password for a password protected Excel spreadsheet

Relationships

  • failure: If a FlowFile cannot be transformed from the configured input format to the configured output format, the unchanged FlowFile will be routed to this relationship.

  • original: The original FlowFile that was split into segments. If the FlowFile fails processing, nothing will be sent to this relationship

  • split: The individual Excel 'segments' of the original Excel FlowFile will be routed to this relationship.

Writes Attributes

  • fragment.identifier: All split Excel FlowFiles produced from the same parent Excel FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the split Excel FlowFiles that were created from a single parent Excel FlowFile

  • fragment.count: The number of split Excel FlowFiles generated from the parent Excel FlowFile

  • segment.original.filename: The filename of the parent Excel FlowFile

  • sheetname: The name of the Excel sheet from the original spreadsheet.

  • total.rows: The number of rows in the Excel sheet from the original spreadsheet.

Input Requirement

This component requires an incoming relationship.

SplitJson

Splits a JSON File into multiple, separate FlowFiles for an array element specified by a JsonPath expression. Each generated FlowFile is comprised of an element of the specified array and transferred to relationship 'split,' with the original file transferred to the 'original' relationship. If the specified JsonPath is not found or does not evaluate to an array element, the original file is routed to 'failure' and no files are generated.
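
For illustration (a hypothetical document): a JsonPath Expression of $.books[*] applied to {"books":[{"title":"A"},{"title":"B"}]} produces two FlowFiles containing {"title":"A"} and {"title":"B"}, while the original document is routed to 'original'.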

Tags: json, split, jsonpath

Properties

JsonPath Expression

A JsonPath expression that indicates the array element to split into JSON/scalar fragments.

Null Value Representation

Indicates the desired representation of JSON Path expressions resulting in a null value.

Max String Length

The maximum allowed length of a string value when parsing the JSON document

Relationships

  • failure: If a FlowFile fails processing for any reason (for example, the FlowFile is not valid JSON or the specified path does not exist), it will be routed to this relationship

  • original: The original FlowFile that was split into segments. If the FlowFile fails processing, nothing will be sent to this relationship

  • split: All segments of the original FlowFile will be routed to this relationship

Writes Attributes

  • fragment.identifier: All split FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the split FlowFiles that were created from a single parent FlowFile

  • fragment.count: The number of split FlowFiles generated from the parent FlowFile

  • segment.original.filename: The filename of the parent FlowFile

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The entirety of the FlowFile’s content (as a JsonNode object) is read into memory, in addition to all of the generated FlowFiles representing the split JSON. If many splits are generated due to the size of the JSON, or how the JSON is configured to be split, a two-phase approach may be necessary to avoid excessive use of memory.

SplitPCAP

Splits one pcap file into multiple pcap files based on a maximum size.

Tags: PCAP, Splitter, Network, Packet, Capture, Wireshark, TShark, TcpDump, WinDump, sniffers

Properties

PCAP Max Size

Maximum size of each output PCAP file. PCAP packets larger than the configured size result in routing FlowFiles to the failure relationship.

Relationships

  • failure: If a FlowFile cannot be transformed from the configured input format to the configured output format, the unchanged FlowFile will be routed to this relationship.

  • original: The original FlowFile that was split into segments. If the FlowFile fails processing, nothing will be sent to this relationship

  • split: The individual PCAP 'segments' of the original PCAP FlowFile will be routed to this relationship.

Writes Attributes

  • error.reason: The reason the FlowFile was sent to the failure relationship.

  • fragment.identifier: All split PCAP FlowFiles produced from the same parent PCAP FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the split PCAP FlowFiles that were created from a single parent PCAP FlowFile

  • fragment.count: The number of split PCAP FlowFiles generated from the parent PCAP FlowFile

  • segment.original.filename: The filename of the parent PCAP FlowFile

Input Requirement

This component requires an incoming relationship.

SplitRecord

Splits up an input FlowFile that is in a record-oriented data format into multiple smaller FlowFiles

Tags: split, generic, schema, json, csv, avro, log, logs, freeform, text

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Records Per Split

Specifies how many records should be written to each 'split' or 'segment' FlowFile

Relationships

  • failure: If a FlowFile cannot be transformed from the configured input format to the configured output format, the unchanged FlowFile will be routed to this relationship.

  • original: Upon successfully splitting an input FlowFile, the original FlowFile will be sent to this relationship.

  • splits: The individual 'segments' of the original FlowFile will be routed to this relationship.

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer for the FlowFiles routed to the 'splits' Relationship.

  • record.count: The number of records in the FlowFile. This is added to FlowFiles that are routed to the 'splits' Relationship.

  • fragment.identifier: All split FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the split FlowFiles that were created from a single parent FlowFile

  • fragment.count: The number of split FlowFiles generated from the parent FlowFile

  • segment.original.filename: The filename of the parent FlowFile

Input Requirement

This component requires an incoming relationship.

SplitText

Splits a text file into multiple smaller text files on line boundaries limited by maximum number of lines or total size of fragment. Each output split file will contain no more than the configured number of lines or bytes. If both Line Split Count and Maximum Fragment Size are specified, the split occurs at whichever limit is reached first. If the first line of a fragment exceeds the Maximum Fragment Size, that line will be output in a single split file which exceeds the configured maximum size limit. This component also allows one to specify that each split should include header lines. Header lines can be computed either by specifying the number of lines that constitute the header or by using a header marker to match against the read lines; if such a match happens, the corresponding line will be treated as a header. Keep in mind that upon the first failure of the header marker to match, no more matches will be performed and the rest of the data will be parsed as regular lines for a given split. If, after computation of the header, there is no more data, the resulting split will consist of only header lines.
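
For illustration (values chosen arbitrarily): with Header Line Count set to 1 and Line Split Count set to 2, an input of

id,name
1,Mark
2,Felicia
3,Monica

produces two splits: the first contains the header plus the lines for Mark and Felicia, the second contains the header plus the line for Monica, and fragment.count is 2 on both.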

Tags: split, text

Properties

Line Split Count

The number of lines that will be added to each split file, excluding header lines. A value of zero requires Maximum Fragment Size to be set, and line count will not be considered in determining splits.

Maximum Fragment Size

The maximum size of each split file, including header lines. NOTE: in the case where a single line exceeds this property (including headers, if applicable), that line will be output in a split of its own which exceeds this Maximum Fragment Size setting.

Header Line Count

The number of lines that should be considered part of the header; the header lines will be duplicated to all split files

Header Line Marker Characters

The first character(s) on the line of the datafile which signifies a header line. This value is ignored when Header Line Count is non-zero. The first line not containing the Header Line Marker Characters and all subsequent lines are considered non-header

Remove Trailing Newlines

Whether to remove newlines at the end of each split file. This should be false if you intend to merge the split files later. If this is set to 'true' and a FlowFile is generated that contains only 'empty lines' (i.e., consists only of \r and \n characters), the FlowFile will not be emitted. Note, however, that if header lines are specified, the resultant FlowFile will never be empty as it will consist of the header lines, so a FlowFile may be emitted that contains only the header lines.

Relationships

  • failure: If a file cannot be split for some reason, the original file will be routed to this destination and nothing will be routed elsewhere

  • original: The original input file will be routed to this destination when it has been successfully split into 1 or more files

  • splits: The split files will be routed to this destination when an input file is successfully split into 1 or more split files

Writes Attributes

  • text.line.count: The number of lines of text from the original FlowFile that were copied to this FlowFile

  • fragment.size: The number of bytes from the original FlowFile that were copied to this FlowFile, including header, if applicable, which is duplicated in each split FlowFile

  • fragment.identifier: All split FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the split FlowFiles that were created from a single parent FlowFile

  • fragment.count: The number of split FlowFiles generated from the parent FlowFile

  • segment.original.filename: The filename of the parent FlowFile

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The FlowFile with its attributes is stored in memory, not the content of the FlowFile. If many splits are generated due to the size of the content, or how the content is configured to be split, a two-phase approach may be necessary to avoid excessive use of memory.

SplitXml

Splits an XML File into multiple separate FlowFiles, each comprising a child or descendant of the original root element

Tags: xml, split

Properties

Split Depth

Indicates the XML-nesting depth to start splitting XML fragments. A depth of 1 means split the root’s children, whereas a depth of 2 means split the root’s children’s children and so forth.
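
For illustration: given <root><a><b/></a><c/></root>, a Split Depth of 1 produces FlowFiles for <a><b/></a> and <c/>, while a Split Depth of 2 produces a FlowFile for <b/> only.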

Relationships

  • failure: If a FlowFile fails processing for any reason (for example, the FlowFile is not valid XML), it will be routed to this relationship

  • original: The original FlowFile that was split into segments. If the FlowFile fails processing, nothing will be sent to this relationship

  • split: All segments of the original FlowFile will be routed to this relationship

Writes Attributes

  • fragment.identifier: All split FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the split FlowFiles that were created from a single parent FlowFile

  • fragment.count: The number of split FlowFiles generated from the parent FlowFile

  • segment.original.filename: The filename of the parent FlowFile

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: The entirety of the FlowFile’s content (as a Document object) is read into memory, in addition to all of the generated FlowFiles representing the split XML. A Document object can take approximately 10 times as much memory as the size of the XML. For example, a 1 MB XML document may use 10 MB of memory. If many splits are generated due to the size of the XML, a two-phase approach may be necessary to avoid excessive use of memory.

StartAwsPollyJob

Trigger an AWS Polly job. It should be followed by the GetAwsPollyJobStatus processor in order to monitor job status.

Tags: Amazon, AWS, ML, Machine Learning, Polly

Properties

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

Communications Timeout

JSON Payload

JSON request for AWS Machine Learning services. The Processor will use FlowFile content for the request when this property is not specified.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

Writes Attributes

  • awsTaskId: The task ID that can be used to poll for Job completion in GetAwsPollyJobStatus

Input Requirement

This component allows an incoming relationship.

Additional Details

StartAwsPollyJob

Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products. Polly’s Text-to-Speech (TTS) service uses advanced deep learning technologies to synthesize natural sounding human speech. With dozens of lifelike voices across a broad set of languages, you can build speech-enabled applications that work in many different countries.

Usage

Amazon ML Processors are implemented to utilize ML services based on the official AWS API Reference. You can find example JSON payloads in the documentation in the Request Syntax sections. For more details, please check the official Polly API reference. With this processor you will trigger a startSpeechSynthesisTask async call to the Polly Service. You can define the JSON payload as a property or provide it as the FlowFile content; the property has higher precedence. After the job is triggered, the serialized JSON response will be written to the output FlowFile. The awsTaskId attribute will be populated, which makes it easier to query the job status with the corresponding get-job-status processor.

JSON payload template (optional fields can be omitted; check the AWS documentation for more details) - example:

{
  "Engine": "string",
  "LanguageCode": "string",
  "LexiconNames": [
    "string"
  ],
  "OutputFormat": "string",
  "OutputS3BucketName": "string",
  "OutputS3KeyPrefix": "string",
  "SampleRate": "string",
  "SnsTopicArn": "string",
  "SpeechMarkTypes": [
    "string"
  ],
  "Text": "string",
  "TextType": "string",
  "VoiceId": "string"
}
json

StartAwsTextractJob

Trigger an AWS Textract job. It should be followed by the GetAwsTextractJobStatus processor in order to monitor job status.

Tags: Amazon, AWS, ML, Machine Learning, Textract

Properties

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

Communications Timeout

JSON Payload

JSON request for AWS Machine Learning services. The Processor will use FlowFile content for the request when this property is not specified.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Textract Type

Supported values: "Document Analysis", "Document Text Detection", "Expense Analysis"

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

Writes Attributes

  • awsTaskId: The task ID that can be used to poll for Job completion in GetAwsTextractJobStatus

  • awsTextractType: The selected Textract type, which can be used in GetAwsTextractJobStatus

Input Requirement

This component allows an incoming relationship.

Additional Details

StartAwsTextractJob

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

Usage

Amazon ML Processors are implemented to utilize ML services based on the official AWS API Reference. You can find example JSON payloads in the documentation in the Request Syntax sections. For more details, please check the official Textract API reference. With this processor you will trigger a startDocumentAnalysis, startDocumentTextDetection or startExpenseAnalysis async call, depending on the configured Textract Type. You can define the JSON payload as a property or provide it as the FlowFile content; the property has higher precedence. After the job is triggered, the serialized JSON response will be written to the output FlowFile. The awsTaskId attribute will be populated, which makes it easier to query the job status with the corresponding get-job-status processor.

Three different types of Textract tasks are supported: Document Analysis, Text Detection, Expense Analysis.

DocumentAnalysis

Starts the asynchronous analysis of an input document for relationships between detected items such as key-value pairs, tables, and selection elements. API Reference

Example payload:

{
  "ClientRequestToken": "string",
  "DocumentLocation": {
    "S3Object": {
      "Bucket": "string",
      "Name": "string",
      "Version": "string"
    }
  },
  "FeatureTypes": [
    "string"
  ],
  "JobTag": "string",
  "KMSKeyId": "string",
  "NotificationChannel": {
    "RoleArn": "string",
    "SNSTopicArn": "string"
  },
  "OutputConfig": {
    "S3Bucket": "string",
    "S3Prefix": "string"
  },
  "QueriesConfig": {
    "Queries": [
      {
        "Alias": "string",
        "Pages": [
          "string"
        ],
        "Text": "string"
      }
    ]
  }
}
json
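
As an illustrative sketch only (bucket and object names are placeholders), a minimal Document Analysis request asking for table and form extraction could look like the following; consult the AWS documentation for which fields are required in your setup:

{
  "DocumentLocation": {
    "S3Object": {
      "Bucket": "my-input-bucket",
      "Name": "documents/scan-001.pdf"
    }
  },
  "FeatureTypes": [
    "TABLES",
    "FORMS"
  ],
  "OutputConfig": {
    "S3Bucket": "my-output-bucket",
    "S3Prefix": "textract-results/"
  }
}
json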
ExpenseAnalysis

Starts the asynchronous analysis of invoices or receipts for data like contact information, items purchased, and vendor names. API Reference

Example payload:

{
  "ClientRequestToken": "string",
  "DocumentLocation": {
    "S3Object": {
      "Bucket": "string",
      "Name": "string",
      "Version": "string"
    }
  },
  "JobTag": "string",
  "KMSKeyId": "string",
  "NotificationChannel": {
    "RoleArn": "string",
    "SNSTopicArn": "string"
  },
  "OutputConfig": {
    "S3Bucket": "string",
    "S3Prefix": "string"
  }
}
json
StartDocumentTextDetection

Starts the asynchronous detection of text in a document. Amazon Textract can detect lines of text and the words that make up a line of text. API Reference

Example payload:

{
  "ClientRequestToken": "string",
  "DocumentLocation": {
    "S3Object": {
      "Bucket": "string",
      "Name": "string",
      "Version": "string"
    }
  },
  "JobTag": "string",
  "KMSKeyId": "string",
  "NotificationChannel": {
    "RoleArn": "string",
    "SNSTopicArn": "string"
  },
  "OutputConfig": {
    "S3Bucket": "string",
    "S3Prefix": "string"
  }
}
json

StartAwsTranscribeJob

Trigger an AWS Transcribe job. It should be followed by a GetAwsTranscribeStatus processor in order to monitor job status.

Tags: Amazon, AWS, ML, Machine Learning, Transcribe

Properties

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

Communications Timeout

JSON Payload

JSON request for AWS Machine Learning services. The Processor will use FlowFile content for the request when this property is not specified.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

Writes Attributes

  • awsTaskId: The task ID that can be used to poll for Job completion in GetAwsTranscribeJobStatus

Input Requirement

This component allows an incoming relationship.

Additional Details

Automatically convert speech to text

Usage

Amazon ML Processors are implemented to utilize ML services based on the official AWS API Reference. You can find example JSON payloads in the documentation under the Request Syntax sections. For more details, please check the official Transcribe API reference. With this processor you trigger a startTranscriptionJob async call to the AWS Transcribe service. You can define the JSON payload as a property or provide it as FlowFile content; the property takes precedence. After the job is triggered, the serialized JSON response will be written to the output FlowFile. The awsTaskId attribute will be populated, making it easier to query the job status with the corresponding get job status processor.

JSON payload template (optional fields can be omitted; check the AWS documentation for more details):

{
  "ContentRedaction": {
    "PiiEntityTypes": [
      "string"
    ],
    "RedactionOutput": "string",
    "RedactionType": "string"
  },
  "IdentifyLanguage": boolean,
  "IdentifyMultipleLanguages": boolean,
  "JobExecutionSettings": {
    "AllowDeferredExecution": boolean,
    "DataAccessRoleArn": "string"
  },
  "KMSEncryptionContext": {
    "string": "string"
  },
  "LanguageCode": "string",
  "LanguageIdSettings": {
    "string": {
      "LanguageModelName": "string",
      "VocabularyFilterName": "string",
      "VocabularyName": "string"
    }
  },
  "LanguageOptions": [
    "string"
  ],
  "Media": {
    "MediaFileUri": "string",
    "RedactedMediaFileUri": "string"
  },
  "MediaFormat": "string",
  "MediaSampleRateHertz": number,
  "ModelSettings": {
    "LanguageModelName": "string"
  },
  "OutputBucketName": "string",
  "OutputEncryptionKMSKeyId": "string",
  "OutputKey": "string",
  "Settings": {
    "ChannelIdentification": boolean,
    "MaxAlternatives": number,
    "MaxSpeakerLabels": number,
    "ShowAlternatives": boolean,
    "ShowSpeakerLabels": boolean,
    "VocabularyFilterMethod": "string",
    "VocabularyFilterName": "string",
    "VocabularyName": "string"
  },
  "Subtitles": {
    "Formats": [
      "string"
    ],
    "OutputStartIndex": number
  },
  "Tags": [
    {
      "Key": "string",
      "Value": "string"
    }
  ],
  "TranscriptionJobName": "string"
}
json
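
For illustration only, a minimal transcription request for an English MP3 file stored in S3 might look like the sketch below; the job name, media URI and output bucket are placeholders:

{
  "TranscriptionJobName": "nifi-transcription-example",
  "LanguageCode": "en-US",
  "MediaFormat": "mp3",
  "Media": {
    "MediaFileUri": "s3://my-input-bucket/audio/meeting.mp3"
  },
  "OutputBucketName": "my-transcribe-output-bucket"
}
json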

StartAwsTranslateJob

Trigger an AWS Translate job. It should be followed by a GetAwsTranslateJobStatus processor in order to monitor job status.

Tags: Amazon, AWS, ML, Machine Learning, Translate

Properties

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Region

Communications Timeout

JSON Payload

JSON request for AWS Machine Learning services. The Processor will use FlowFile content for the request when this property is not specified.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

  • original: Upon successful completion, the original FlowFile will be routed to this relationship.

Writes Attributes

  • awsTaskId: The task ID that can be used to poll for Job completion in GetAwsTranslateJobStatus

Input Requirement

This component allows an incoming relationship.

Additional Details

StartAwsTranslateJob

Amazon Translate is a neural machine translation service for translating text to and from English across a breadth of supported languages. Powered by deep-learning technologies, Amazon Translate delivers fast, high-quality, and affordable language translation. It provides a managed, continually trained solution, so you can easily translate company and user-authored content or build applications that require support across multiple languages. The machine translation engine has been trained on a wide variety of content across different domains to produce quality translations that serve any industry need.

Usage

Amazon ML Processors are implemented to utilize ML services based on the official AWS API Reference. You can find example JSON payloads in the documentation under the Request Syntax sections. For more details, please check the official Translate API reference. With this processor you trigger a startTextTranslationJob async call to the Translate service. You can define the JSON payload as a property or provide it as FlowFile content; the property takes precedence.

JSON payload template (optional fields can be omitted; check the AWS documentation for more details):

{
  "ClientToken": "string",
  "DataAccessRoleArn": "string",
  "InputDataConfig": {
    "ContentType": "string",
    "S3Uri": "string"
  },
  "JobName": "string",
  "OutputDataConfig": {
    "EncryptionKey": {
      "Id": "string",
      "Type": "string"
    },
    "S3Uri": "string"
  },
  "ParallelDataNames": [
    "string"
  ],
  "Settings": {
    "Formality": "string",
    "Profanity": "string"
  },
  "SourceLanguageCode": "string",
  "TargetLanguageCodes": [
    "string"
  ],
  "TerminologyNames": [
    "string"
  ]
}
json
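
As a sketch only (the job name, IAM role ARN and S3 URIs are placeholders), a minimal batch translation request from English to German could look like this:

{
  "JobName": "nifi-translate-example",
  "DataAccessRoleArn": "arn:aws:iam::123456789012:role/my-translate-data-access-role",
  "InputDataConfig": {
    "ContentType": "text/plain",
    "S3Uri": "s3://my-input-bucket/documents/"
  },
  "OutputDataConfig": {
    "S3Uri": "s3://my-output-bucket/translated/"
  },
  "SourceLanguageCode": "en",
  "TargetLanguageCodes": [
    "de"
  ]
}
json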

StartGcpVisionAnnotateFilesOperation

Trigger a Vision operation on file input. It should be followed by a GetGcpVisionAnnotateFilesOperationStatus processor in order to monitor operation status.

Tags: Google, Cloud, Machine Learning, Vision

Properties

JSON Payload

JSON request for the Google Cloud Vision service. The Processor will use FlowFile content for the request when this property is not specified.

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Output Bucket

Name of the GCS bucket where the output of the Vision job will be persisted. The value of this property applies when the JSON Payload property is configured. The JSON Payload property value can use Expression Language to reference the value of ${output-bucket}

Vision Feature Type

Type of GCP Vision Feature. The value of this property applies when the JSON Payload property is configured. The JSON Payload property value can use Expression Language to reference the value of ${vision-feature-type}

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

Writes Attributes

  • operationKey: A unique identifier of the operation returned by the Vision server.

Input Requirement

This component allows an incoming relationship.

Additional Details

Google Cloud Vision - Start Annotate Files Operation

Prerequisites

  • Make sure Vision API is enabled and the account you are using has the right to use it

  • Make sure the input file(s) are available in a GCS bucket

Usage

StartGcpVisionAnnotateFilesOperation is designed to trigger file annotation operations. This processor should be used in conjunction with the GetGcpVisionAnnotateFilesOperationStatus Processor. Outgoing FlowFiles contain the raw response to the request returned by the Vision server. The response is in JSON format and contains the result and additional metadata as written in the Google Vision API Reference documents.

Payload

The JSON Payload is a request in JSON format as documented in the Google Vision REST API reference document. The payload can be provided to the processor via the JSON Payload property or as FlowFile content; the property takes precedence over FlowFile content. Please make sure to delete the default value of the property if you want to use FlowFile content as the payload. A JSON payload template example:

{
  "requests": [
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "gs://${gcs.bucket}/${filename}"
        },
        "mimeType": "application/pdf"
      },
      "features": [
        {
          "type": "${vision-feature-type}",
          "maxResults": 4
        }
      ],
      "outputConfig": {
        "gcsDestination": {
          "uri": "gs://${output-bucket}/${filename}/"
        },
        "batchSize": 2
      }
    }
  ]
}
json
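
For reference, after NiFi Expression Language substitution the request sent to the Vision API might resolve to something like the sketch below; the bucket names and file name are placeholders taken from the incoming FlowFile and the processor properties:

{
  "requests": [
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "gs://my-input-bucket/report.pdf"
        },
        "mimeType": "application/pdf"
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION",
          "maxResults": 4
        }
      ],
      "outputConfig": {
        "gcsDestination": {
          "uri": "gs://my-output-bucket/report.pdf/"
        },
        "batchSize": 2
      }
    }
  ]
}
json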
Feature types
  • TEXT_DETECTION: Optical character recognition (OCR) for an image; text recognition and conversion to machine-coded text. Identifies and extracts UTF-8 text in an image.

  • DOCUMENT_TEXT_DETECTION: Optical character recognition (OCR) for a file (PDF/TIFF) or dense text image; dense text recognition and conversion to machine-coded text.

You can find more details at Google Vision Feature List

Example: How to set up a simple Annotate Files Flow

Prerequisites

  • Create an input and output bucket

  • Input files should be available in a GCS bucket

  • This bucket must not contain anything other than the input files

  • Set the bucket property of ListGCSBucket processor to your input bucket name

  • Keep the default value of JSON PAYLOAD property in StartGcpVisionAnnotateFilesOperation

  • Set the Output Bucket property to your output bucket name in StartGcpVisionAnnotateFilesOperation

  • Set up the GCP Credentials Provider Service for all GCP-related processors

Execution steps:

  • ListGCSBucket processor will return a list of files in the bucket at the first run.

  • ListGCSBucket will return only new items at subsequent runs.

  • StartGcpVisionAnnotateFilesOperation processor will trigger GCP Vision file annotation jobs based on the JSON payload.

  • StartGcpVisionAnnotateFilesOperation processor will populate the operationKey flow file attribute.

  • GetGcpVisionAnnotateFilesOperationStatus processor will periodically query status of the job.

StartGcpVisionAnnotateImagesOperation

Trigger a Vision operation on image input. It should be followed by a GetGcpVisionAnnotateImagesOperationStatus processor in order to monitor operation status.

Tags: Google, Cloud, Machine Learning, Vision

Properties

JSON Payload

JSON request for the Google Cloud Vision service. The Processor will use FlowFile content for the request when this property is not specified.

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Output Bucket

Name of the GCS bucket where the output of the Vision job will be persisted. The value of this property applies when the JSON Payload property is configured. The JSON Payload property value can use Expression Language to reference the value of ${output-bucket}

Vision Feature Type

Type of GCP Vision Feature. The value of this property applies when the JSON Payload property is configured. The JSON Payload property value can use Expression Language to reference the value of ${vision-feature-type}

Relationships

  • success: FlowFiles are routed to success relationship

  • failure: FlowFiles are routed to failure relationship

Writes Attributes

  • operationKey: A unique identifier of the operation returned by the Vision server.

Input Requirement

This component allows an incoming relationship.

Additional Details

Google Cloud Vision - Start Annotate Images Operation

Prerequisites

  • Make sure Vision API is enabled and the account you are using has the right to use it

  • Make sure the input image(s) are available in a GCS bucket under /input folder

Usage

StartGcpVisionAnnotateImagesOperation is designed to trigger image annotation operations. This processor should be used in conjunction with the GetGcpVisionAnnotateImagesOperationStatus Processor. Outgoing FlowFiles contain the raw response to the request returned by the Vision server. The response is in JSON format and contains the result and additional metadata as written in the Google Vision API Reference documents.

Payload

The JSON Payload is a request in JSON format as documented in the Google Vision REST API reference document. The payload can be provided to the processor via the JSON Payload property or as FlowFile content; the property takes precedence over FlowFile content. Please make sure to delete the default value of the property if you want to use FlowFile content as the payload. A JSON payload template example:

{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "gs://${gcs.bucket}/${filename}"
        }
      },
      "features": [
        {
          "type": "${vision-feature-type}",
          "maxResults": 4
        }
      ]
    }
  ],
  "outputConfig": {
    "gcsDestination": {
      "uri": "gs://${output-bucket}/${filename}/"
    },
    "batchSize": 2
  }
}
json
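
For reference, after Expression Language substitution an image annotation request might resolve to something like the following sketch; the bucket names and image name are placeholders:

{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "gs://my-input-bucket/photo-001.jpg"
        }
      },
      "features": [
        {
          "type": "LABEL_DETECTION",
          "maxResults": 4
        }
      ]
    }
  ],
  "outputConfig": {
    "gcsDestination": {
      "uri": "gs://my-output-bucket/photo-001.jpg/"
    },
    "batchSize": 2
  }
}
json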
Feature types
  • TEXT_DETECTION: Optical character recognition (OCR) for an image; text recognition and conversion to machine-coded text. Identifies and extracts UTF-8 text in an image.

  • DOCUMENT_TEXT_DETECTION: Optical character recognition (OCR) for a file (PDF/TIFF) or dense text image; dense text recognition and conversion to machine-coded text.

  • LANDMARK_DETECTION: Provides the name of the landmark, a confidence score and a bounding box in the image for the landmark.

  • LOGO_DETECTION: Provides a textual description of the entity identified, a confidence score, and a bounding polygon for the logo in the file.

  • LABEL_DETECTION: Provides generalized labels for an image.

  • etc.

You can find more details at Google Vision Feature List

Example: How to set up a simple Annotate Image Flow

Prerequisites

  • Create an input and output bucket

  • Input image files should be available in a GCS bucket

  • This bucket must not contain anything other than the input image files

  • Set the bucket property of ListGCSBucket processor to your input bucket name

  • Keep the default value of JSON PAYLOAD property in StartGcpVisionAnnotateImagesOperation

  • Set the Output Bucket property to your output bucket name in StartGcpVisionAnnotateImagesOperation

  • Set up the GCP Credentials Provider Service for all GCP-related processors

Execution steps:

  • ListGCSBucket processor will return a list of files in the bucket at the first run.

  • ListGCSBucket will return only new items at subsequent runs.

  • StartGcpVisionAnnotateImagesOperation processor will trigger GCP Vision image annotation jobs based on the JSON payload.

  • StartGcpVisionAnnotateImagesOperation processor will populate the operationKey flow file attribute.

  • GetGcpVisionAnnotateImagesOperationStatus processor will periodically query status of the job.

StartSnowflakeIngest

Ingests files from a Snowflake internal or external stage into a Snowflake table. The stage must be created in the Snowflake account beforehand. The result of the ingestion is not available immediately, so this processor can be connected to a GetSnowflakeIngestStatus processor to wait for the results.

Tags: snowflake, snowpipe, ingest

Properties

Ingest Manager Provider

Specifies the Controller Service to use for ingesting Snowflake staged files.

Relationships

  • success: For FlowFiles of successful ingest request

  • failure: For FlowFiles of failed ingest request

Reads Attributes

  • snowflake.staged.file.path: Staged file path

Input Requirement

This component requires an incoming relationship.

Additional Details

Description

The StartSnowflakeIngest processor triggers a Snowflake pipe ingestion for a staged file. Please note that the pipe has to be created in your Snowflake account manually. The processor requires an upstream connection that provides the path of the file to be ingested in the stage through the “snowflake.staged.file.path” attribute. This attribute is automatically filled in by the PutSnowflakeInternalStage processor when using an internal stage. In case a pipe copies data from an external stage, the attribute must be provided manually (e.g. with an UpdateAttribute processor). NOTE: Since Snowflake pipes ingest files asynchronously, this processor transfers FlowFiles to the “success” relationship when they’re marked for ingestion. In order to wait for the actual result of the ingestion, the processor may be connected to a downstream GetSnowflakeIngestStatus processor.

Example flow for internal stage

GetFile → PutSnowflakeInternalStage → StartSnowflakeIngest → GetSnowflakeIngestStatus

Example flow for external stage

ListS3 → UpdateAttribute (add the “snowflake.staged.file.path” attribute) → StartSnowflakeIngest → GetSnowflakeIngestStatus

TagS3Object

Adds or updates a tag on an Amazon S3 Object.

Tags: Amazon, S3, AWS, Archive, Tag

Properties

Bucket

The S3 Bucket to interact with

Object Key

The S3 Object Key to use. This is analogous to a filename for traditional file systems.

Region

The AWS Region to connect to.

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Tag Key

The key of the tag that will be set on the S3 Object

Tag Value

The value of the tag that will be set on the S3 Object

Append Tag

If set to true, the tag will be appended to the existing set of tags on the S3 object. Any existing tags with the same key as the new tag will be updated with the specified value. If set to false, the existing tags will be removed and the new tag will be set on the S3 object.

Version ID

The Version of the Object to tag

Communications Timeout

The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Endpoint Override URL

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Signer Override

The AWS S3 library uses Signature Version 4 by default but this property allows you to specify the Version 2 signer to support older S3-compatible services or even to plug in your own custom signer implementation.

Custom Signer Class Name

Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.

Custom Signer Module Location

Comma-separated list of paths to files and/or directories which contain the custom signer’s JAR file and its dependencies (if any).

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Relationships

  • success: FlowFiles are routed to this Relationship after they have been successfully processed.

  • failure: If the Processor is unable to process a given FlowFile, it will be routed to this Relationship.

Writes Attributes

  • s3.tag._: The tags associated with the S3 object will be written as part of the FlowFile attributes

  • s3.exception: The class name of the exception thrown during processor execution

  • s3.additionalDetails: The S3 supplied detail from the failed operation

  • s3.statusCode: The HTTP error code (if available) from the failed operation

  • s3.errorCode: The S3 moniker of the failed operation

  • s3.errorMessage: The S3 exception message from the failed operation

Input Requirement

This component requires an incoming relationship.

TailFile

"Tails" a file, or a list of files, ingesting data from the file as it is written to the file. The file is expected to be textual. Data is ingested only when a new line is encountered (carriage return or new-line character or combination). If the file to tail is periodically "rolled over", as is generally the case with log files, an optional Rolling Filename Pattern can be used to retrieve data from files that have rolled over, even if the rollover occurred while NiFi was not running (provided that the data still exists upon restart of NiFi). It is generally advisable to set the Run Schedule to a few seconds, rather than running with the default value of 0 secs, as this Processor will consume a lot of resources if scheduled very aggressively. At this time, this Processor does not support ingesting files that have been compressed when 'rolled over'.

Tags: tail, file, log, text, source

Properties

Tailing mode

Mode to use: Single file mode will tail only one file; Multiple files mode will look for a list of files. In Multiple files mode the Base directory is required.

File(s) to Tail

Path of the file to tail in Single file mode. In Multiple files mode, a regular expression used to find the files to tail in the base directory. If Recursive lookup is set to true, the regular expression will be used to match the path starting from the base directory (see additional details for examples).

Rolling Filename Pattern

If the file to tail "rolls over" as would be the case with log files, this filename pattern will be used to identify files that have rolled over so that if NiFi is restarted, and the file has rolled over, it will be able to pick up where it left off. This pattern supports wildcard characters * and ?, it also supports the notation ${filename} to specify a pattern based on the name of the file (without extension), and will assume that the files that have rolled over live in the same directory as the file being tailed. The same glob pattern will be used for all files.

Post-Rollover Tail Period

When a file is rolled over, the processor will continue tailing the rolled over file until it has not been modified for this amount of time. This allows for another process to rollover a file, and then flush out any buffered data. Note that when this value is set, and the tailed file rolls over, the new file will not be tailed until the old file has not been modified for the configured amount of time. Additionally, when using this capability, in order to avoid data duplication, this period must be set longer than the Processor’s Run Schedule, and the Processor must not be stopped after the file being tailed has been rolled over and before the data has been fully consumed. Otherwise, the data may be duplicated, as the entire file may be written out as the contents of a single FlowFile.

Base directory

Base directory used to look for files to tail. This property is required when using Multifile mode.

Initial Start Position

When the Processor first begins to tail data, this property specifies where the Processor should begin reading data. Once data has been ingested from a file, the Processor will continue from the last point from which it has received data.

State Location

Specifies where the state is located either local or cluster so that state can be stored appropriately in order to ensure that all data is consumed without duplicating data upon restart of NiFi

Recursive lookup

When using Multiple files mode, this property defines if files must be listed recursively or not in the base directory.

Lookup frequency

Only used in Multiple files mode. It specifies the minimum duration the processor will wait before listing again the files to tail.

Maximum age

Only used in Multiple files mode. It specifies the necessary minimum duration to consider that no new messages will be appended in a file regarding its last modification date. This should not be set too low to avoid duplication of data in case new messages are appended at a lower frequency.

Reread when NUL encountered

If this option is set to 'true', when a NUL character is read, the processor will yield and try to read the same part again later. (Note: Yielding may delay the processing of other files tailed by this processor, not just the one with the NUL character.) The purpose of this flag is to allow users to handle cases where reading a file may return temporary NUL values. NFS for example may send file contents out of order. In this case the missing parts are temporarily replaced by NUL values. CAUTION! If the file contains legitimate NUL values, setting this flag causes this processor to get stuck indefinitely. For this reason users should refrain from using this feature if they can help it and try to avoid having the target file on a file system where reads are unreliable.

Line Start Pattern

A Regular Expression to match against the start of a log line. If specified, any line that matches the expression, and any following lines, will be buffered until another line matches the Expression. In doing this, we can avoid splitting apart multi-line messages in the file. This assumes that the data is in UTF-8 format.

Pre-Allocated Buffer Size

Sets the amount of memory that is pre-allocated for each tailed file.

Max Buffer Size

When using the Line Start Pattern, there may be situations in which the data in the file being tailed never matches the Regular Expression. This would result in the processor buffering all data from the tailed file, which can quickly exhaust the heap. To avoid this, the Processor will buffer only up to this amount of data before flushing the buffer, even if it means ingesting partial data from the file.

Relationships

  • success: All FlowFiles are routed to this Relationship.

Writes Attributes

  • tailfile.original.path: Path of the original file the flow file comes from.

Stateful

Scope: Local, Cluster

Stores state about where in the Tailed File it left off so that on restart it does not have to duplicate data. State is stored either locally or in the cluster depending on the <State Location> property.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component does not allow an incoming relationship.

Additional Details

Introduction

This processor offers a powerful capability, allowing the user to periodically look at a file that is actively being written to by another process. When the file changes, the new lines are ingested. This Processor assumes that data in the file is textual.

Tailing a file from a filesystem is a seemingly simple but notoriously difficult task. This is because we are periodically checking the contents of a file that is being written to. The file may be constantly changing, or it may rarely change. The file may be “rolled over” (i.e., renamed) and it’s important that even after restarting the application (NiFi, in this case), we are able to pick up where we left off. Other additional complexities also come into play. For example, NFS mounted drives may indicate that data is readable but then return NUL bytes (Unicode 0) when attempting to read, as the actual bytes are not yet known (see the ‘Reread when NUL encountered’ property), and file systems have different timestamp granularities.

This Processor is designed to handle all of these different cases. This can lead to slightly more complex configuration, but this document should provide you with all you need to get started!

Modes

This processor is used to tail a file or multiple files, depending on the configured mode. The mode to choose depends on the logging pattern followed by the file(s) to tail. In any case, if there is a rolling pattern, the rolling files must be plain text files (compression is not supported at the moment).

  • Single file: the processor will tail the file with the path given in ‘File(s) to tail’ property.

  • Multiple files: the processor will look for files in the ‘Base directory’. It will look for files recursively according to the ‘Recursive lookup’ property and will tail all the files matching the regular expression provided in the ‘File(s) to tail’ property.

Rolling filename pattern

In case the ‘Rolling filename pattern’ property is used, when the processor detects that the file to tail has rolled over, the processor will look for possible missing messages in the rolled file. To do so, the processor will use the pattern to find the rolling files in the same directory as the file to tail.

In order to keep this property available in the ‘Multiple files’ mode when multiples files to tail are in the same directory, it is possible to use the ${filename} tag to reference the name (without extension) of the file to tail. For example, if we have:

/my/path/directory/my-app.log.1
/my/path/directory/my-app.log
/my/path/directory/application.log.1
/my/path/directory/application.log

the ‘rolling filename pattern’ would be ${filename}.log.*.

Descriptions for different modes and strategies

The ‘Single file’ mode assumes that the file to tail has always the same name even if there is a rolling pattern. Example:

/my/path/directory/my-app.log.2
/my/path/directory/my-app.log.1
/my/path/directory/my-app.log

and new log messages are always appended in my-app.log file.

If recursivity is set to ‘true’, the regular expression for the files to tail must cover the possible intermediate directories between the base directory and the files to tail. Example:

/my/path/directory1/my-app1.log
/my/path/directory2/my-app2.log
/my/path/directory3/my-app3.log

Base directory: /my/path
Files to tail: directory[1-3]/my-app[1-3].log
Recursivity: true

If the processor is configured with ‘Multiple files’ mode, two additional properties are relevant:

  • Lookup frequency: specifies the minimum duration the processor will wait before listing again the files to tail.

  • Maximum age: specifies the necessary minimum duration to consider that no new messages will be appended in a file regarding its last modification date. If the amount of time that has elapsed since the file was modified is larger than this period of time, the file will not be tailed. For example, if a file was modified 24 hours ago and this property is set to 12 hours, the file will not be tailed. But if this property is set to 36 hours, then the file will continue to be tailed.

It is necessary to pay attention to the ‘Lookup frequency’ and ‘Maximum age’ properties, as well as the frequency at which the processor is triggered, in order to achieve high performance. It is recommended to keep ‘Maximum age’ > ‘Lookup frequency’ > processor scheduling frequency to avoid missing data. It is also recommended not to set ‘Maximum age’ too low, because if messages are appended to a file after it has been considered “too old”, all the messages in that file may be read again, leading to data duplication.

If the processor is configured with ‘Multiple files’ mode, the ‘Rolling filename pattern’ property must be specific enough to ensure that only the rolling files will be listed and not other currently tailed files in the same directory (this can be achieved using the ${filename} tag).

Handling Multi-Line Messages

Most of the time, when we tail a file, we are happy to receive data periodically, however it was written to the file. There are scenarios, though, where we may have data written in such a way that multiple lines need to be retained together. Take, for example, the following lines of text that might be found in a log file:

2021-07-09 14:12:19,731 INFO [main] org.apache.nifi.NiFi Launching NiFi...
2021-07-09 14:12:19,915 INFO [main] o.a.n.p.AbstractBootstrapPropertiesLoader Determined default application properties path to be '/Users/mpayne/devel/nifi/nifi-assembly/target/nifi-1.14.0-SNAPSHOT-bin/nifi-1.14.0-SNAPSHOT/./conf/nifi.properties'
2021-07-09 14:12:19,919 INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 199 properties from /Users/mpayne/devel/nifi/nifi-assembly/target/nifi-1.14.0-SNAPSHOT-bin/nifi-1.14.0-SNAPSHOT/./conf/nifi.properties
2021-07-09 14:12:19,925 WARN Line 1 of Log Message
Line 2: This is an important warning.
Line 3: Please do not ignore this warning.
Line 4: These lines of text make sense only in the context of the original message.
2021-07-09 14:12:19,941 INFO [main] Final message in log file

In this case, we may want to ensure that the log lines are not ingested in such a way that our multi-line log message is broken up into Lines 1 and 2 in one FlowFile and Lines 3 and 4 in another. To accomplish this, the Processor exposes the Line Start Pattern property. If we set this Property to a value of \d{4}-\d{2}-\d{2}, then we are telling the Processor that each message should begin with 4 digits, followed by a dash, followed by 2 digits, a dash, and 2 digits. I.e., we are telling it that each message begins with a timestamp in yyyy-MM-dd format. Because of this, even if the Processor runs and sees only Lines 1 and 2 of our multiline log message, it will not ingest the data yet. It will wait until it sees the next message, which starts with a timestamp.

Note that, because of this, the last message that the Processor will encounter in the above situation is the “Final message in log file” line. At this point, the Processor does not know whether the next line of text it encounters will be part of this line or a new message. As such, it will not ingest this data. It will wait until either another message is encountered (that matches our regex) or until the file is rolled over (renamed). Because of this, there may be some delay in ingesting the last message in the file, if the process that writes to the file just stops writing at this point.

Additionally, we run the risk of the Regular Expression not matching the data in the file. This could result in buffering all the file’s content, which could cause NiFi to run out of memory. To avoid this, the Max Buffer Size property limits the amount of data that can be buffered. If this amount of data is buffered, it will be flushed to the FlowFile, even if another message hasn’t been encountered.

TransformXml

Transforms a FlowFile’s XML content by applying the provided XSLT. A new FlowFile is created with the transformed content and is routed to the 'success' relationship. If the transformation fails, the original FlowFile is routed to the 'failure' relationship

Tags: virtimo, xml, xslt, transform

Properties

XSLT Input Method

Choose from where to provide the XSLT transformation

XSLT Script

XSLT transformation to execute. WARNING: This property should not be used to store very large XSLT files.

XSLT File Name

The name (including full path) of the XSLT file to apply to the FlowFile XML content.

XSLT Lookup Controller

Controller used to lookup the XSLT transformations. WARNING: The controller should not be used to lookup large XSLT files.

XSLT Lookup Key

Key used to lookup the XSLT transformation from the XSLT Lookup Controller.

Surround input with <xml> tag

Use this to create an XML input from text files to parse them further with the processor - e.g. for JSON processing: fn:json-to-xml(.).

Support result documents

Result documents (xsl:result-document) enable a single XSLT transformation to create multiple output documents. The output documents can be transferred via custom outgoing relationships OR written as custom attributes in the success relationship. Transfer via a relationship by adding an 'href' attribute to the result document, e.g.: href="relationshipName". Write to an attribute instead by prefixing the attribute with 'a:', e.g.: href="a:attributeName"

Allow NiFi EL in XPath

Allow the use of the NiFi Expression Language within XPath with the namespace xmlns:nf="http://nifi.org" and a call like nf:eval('${UUID()}'). To use " characters you need to use &quot;.

Indent

Whether or not to indent the output.

Secure processing

Whether or not to mitigate various XML-related attacks like XXE (XML External Entity) attacks.

Resolve external entities

For security reasons, if external entities such as DTDs are not needed, it can be an option not to resolve them

Cache size

Maximum number of stylesheets to cache. Zero disables the cache.

Cache TTL after last access

The cache TTL (time-to-live) or how long to keep stylesheets in the cache after last access.

Dynamic Properties

XSLT parameter name

These XSLT parameters are passed to the transformer

Relationships

  • success: The FlowFile with transformed content will be routed to this relationship

  • failure: If a FlowFile fails processing for any reason (for example, the FlowFile is not valid XML), it will be routed to this relationship

Dynamic Relationships

  • Value of href attribute in XSLT result document: An XSLT result document’s output documents will be transferred here

Writes Attributes

  • Value of href attribute (must be prefixed with 'a:') in XSLT result document: An XSLT result document’s output documents will be written to this attribute

  • xslt-error: If the transformation fails, the error message will be written in this attribute.

Input Requirement

This component requires an incoming relationship.

TransformXml

Applies the provided XSLT file to the FlowFile XML payload. A new FlowFile is created with transformed content and is routed to the 'success' relationship. If the XSL transform fails, the original FlowFile is routed to the 'failure' relationship

Tags: xml, xslt, transform

Properties

XSLT file name

Provides the name (including full path) of the XSLT file to apply to the FlowFile XML content. One of the 'XSLT file name' and 'XSLT Lookup' properties must be defined.

XSLT Lookup

Controller lookup used to store XSLT definitions. One of the 'XSLT file name' and 'XSLT Lookup' properties must be defined. WARNING: note that the lookup controller service should not be used to store large XSLT files.

XSLT Lookup key

Key used to retrieve the XSLT definition from the XSLT lookup controller. This property must be set when using the XSLT controller property.

Indent

Whether or not to indent the output.

Secure processing

Whether or not to mitigate various XML-related attacks like XXE (XML External Entity) attacks.

Cache size

Maximum number of stylesheets to cache. Zero disables the cache.

Cache TTL after last access

The cache TTL (time-to-live) or how long to keep stylesheets in the cache after last access.

Dynamic Properties

An XSLT transform parameter name

These XSLT parameters are passed to the transformer

Relationships

  • success: The FlowFile with transformed content will be routed to this relationship

  • failure: If a FlowFile fails processing for any reason (for example, the FlowFile is not valid XML), it will be routed to this relationship

Input Requirement

This component requires an incoming relationship.

UnpackContent

Unpacks the content of FlowFiles that have been packaged with one of several different Packaging Formats, emitting one to many FlowFiles for each input FlowFile. Supported formats are TAR, ZIP, and FlowFile Stream packages.

Use Cases

Unpack Zip containing filenames with special characters, created on Windows with filename charset 'Cp437' or 'IBM437'.

Input Requirement: This component allows an incoming relationship.

  1. Set "Packaging Format" value to "zip" or "use mime.type attribute".

  2. Set "Filename Character Set" value to "Cp437" or "IBM437".

Tags: Unpack, un-merge, tar, zip, archive, flowfile-stream, flowfile-stream-v3

Properties

Packaging Format

The Packaging Format used to create the file

Filename Character Set

If supplied, this character set will be passed to the Zip utility to attempt to decode filenames using that specific character set. If not specified, the default platform character set will be used. This is useful if a Zip was created with a character set other than the platform default and the zip uses non-standard values to specify it.

File Filter

Only files contained in the archive whose names match the given regular expression will be extracted (tar/zip only)

Password

Password used for decrypting Zip archives encrypted with ZipCrypto or AES. Configuring a password disables support for alternative Zip compression algorithms.

Allow Stored Entries With Data Descriptor

Some zip archives contain stored entries with data descriptors which by spec should not happen. If this property is true they will be read anyway. If false and such an entry is discovered the zip will fail to process.

Relationships

  • success: Unpacked FlowFiles are sent to this relationship

  • failure: The original FlowFile is sent to this relationship when it cannot be unpacked for some reason

  • original: The original FlowFile is sent to this relationship after it has been successfully unpacked

Reads Attributes

  • mime.type: If the <Packaging Format> property is set to use mime.type attribute, this attribute is used to determine the FlowFile’s MIME Type. In this case, if the attribute is set to application/tar, the TAR Packaging Format will be used. If the attribute is set to application/zip, the ZIP Packaging Format will be used. If the attribute is set to application/flowfile-v3 or application/flowfile-v2 or application/flowfile-v1, the appropriate FlowFile Packaging Format will be used. If this attribute is missing, the FlowFile will be routed to 'failure'. Otherwise, if the attribute’s value is not one of those mentioned above, the FlowFile will be routed to 'success' without being unpacked. Use the File Filter property to only extract files matching a specific regular expression.

Writes Attributes

  • mime.type: If the FlowFile is successfully unpacked, its MIME Type is no longer known, so the mime.type attribute is set to application/octet-stream.

  • fragment.identifier: All unpacked FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute

  • fragment.index: A one-up number that indicates the ordering of the unpacked FlowFiles that were created from a single parent FlowFile

  • fragment.count: The number of unpacked FlowFiles generated from the parent FlowFile

  • segment.original.filename: The filename of the parent FlowFile. Extensions of .tar, .zip or .pkg are removed because the MergeContent processor automatically adds those extensions if it is used to rebuild the original FlowFile

  • file.lastModifiedTime: The date and time that the unpacked file was last modified (tar and zip only).

  • file.creationTime: The date and time that the file was created. For encrypted zip files this attribute always holds the same value as file.lastModifiedTime. For tar and unencrypted zip files it will be returned if available; otherwise it will be the same value as file.lastModifiedTime.

  • file.lastMetadataChange: The date and time the file’s metadata changed (tar only).

  • file.lastAccessTime: The date and time the file was last accessed (tar and unencrypted zip files only)

  • file.owner: The owner of the unpacked file (tar only)

  • file.group: The group owner of the unpacked file (tar only)

  • file.size: The uncompressed size of the unpacked file (tar and zip only)

  • file.permissions: The read/write/execute permissions of the unpacked file (tar and unencrypted zip files only)

  • file.encryptionMethod: The encryption method for entries in Zip archives

Input Requirement

This component requires an incoming relationship.

See Also

MergeContent

UpdateAttribute

Updates the Attributes for a FlowFile by using the Attribute Expression Language and/or deletes the attributes based on a regular expression

Use Cases

Add a new FlowFile attribute

Input Requirement: This component allows an incoming relationship.

  1. Leave "Delete Attributes Expression" and "Stateful Variables Initial Value" unset.

  2. Set "Store State" to "Do not store state".

  3. Add a new property. The name of the property will become the name of the newly added attribute.

  4. The value of the property will become the value of the newly added attribute. The value may use the NiFi Expression Language in order to reference other attributes or call Expression Language functions.

Overwrite a FlowFile attribute with a new value

Input Requirement: This component allows an incoming relationship.

  1. Leave "Delete Attributes Expression" and "Stateful Variables Initial Value" unset.

  2. Set "Store State" to "Do not store state".

  3. Add a new property. The name of the property will become the name of the attribute whose value will be overwritten.

  4. The value of the property will become the new value of the attribute. The value may use the NiFi Expression Language in order to reference other attributes or call Expression Language functions.

  5. For example, to change the txId attribute to the uppercase version of its current value, add a property named txId with a value of ${txId:toUpper()}.

Rename a file

Input Requirement: This component allows an incoming relationship.

  1. Leave "Delete Attributes Expression" and "Stateful Variables Initial Value" unset.

  2. Set "Store State" to "Do not store state".

  3. Add a new property whose name is filename and whose value is the desired filename.

  4. For example, to set the filename to abc.txt, add a property named filename with a value of abc.txt.

  5. To add the txId attribute as a prefix to the filename, add a property named filename with a value of ${txId}${filename}.

  6. Or, to make the filename more readable, separate the txId from the rest of the filename with a hyphen by using a value of ${txId}-${filename}.

Tags: attributes, modification, update, delete, Attribute Expression Language, state

Properties

Delete Attributes Expression

Regular expression for attributes to be deleted from FlowFiles. Existing attributes that match will be deleted regardless of whether they are updated by this processor.

Store State

Select whether or not state will be stored. Selecting 'Stateless' will offer the default functionality of purely updating the attributes on a FlowFile in a stateless manner. Selecting a stateful option will not only store the attributes on the FlowFile but also in the Processors state. See the 'Stateful Usage' topic of the 'Additional Details' section of this processor’s documentation for more information

Stateful Variables Initial Value

If using state to set/reference variables then this value is used to set the initial value of the stateful variable. This will only be used in the @OnScheduled method when state does not contain a value for the variable. This is required if running statefully but can be empty if needed.

Cache Value Lookup Cache Size

Specifies how many canonical lookup values should be stored in the cache

Dynamic Properties

A FlowFile attribute to update

Updates a FlowFile attribute specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value

Relationships

  • success: All successful FlowFiles are routed to this relationship

Writes Attributes

  • See additional details: This processor may write or remove zero or more attributes as described in additional details

Stateful

Scope: Local

Gives the option to store values not only on the FlowFile but as stateful variables to be referenced in a recursive manner.

Input Requirement

This component requires an incoming relationship.

Additional Details

Description:

This processor updates the attributes of a FlowFile using properties or rules that are added by the user. There are three ways to use this processor to add or modify attributes. One way is the “Basic Usage”; this allows you to set default attribute changes that affect every FlowFile going through the processor. The second way is the “Advanced Usage”; this allows you to make conditional attribute changes that only affect a FlowFile if it meets certain conditions. It is possible to use both methods in the same processor at the same time. The third way is the “Delete Attributes Expression”; this allows you to provide a regular expression and any attributes with a matching name will be deleted.

Please note that “Delete Attributes Expression” supersedes any updates that occur. If an existing attribute matches the “Delete Attributes Expression”, it will be removed whether it was updated or not. That said, the “Delete Attributes Expression” only applies to attributes that exist in the input FlowFile; if an attribute is added by this processor, the “Delete Attributes Expression” will not detect it.

Properties:

The properties in this processor are added by the user. The expression language is supported in user-added properties for this processor. See the NiFi Expression Language Guide to learn how to formulate proper expression language statements to perform the desired functions.

If an Attribute is added with the name alternate.identifier and that attribute’s value is a URI, an ADD_INFO Provenance Event will be registered, correlating the FlowFile with the given alternate identifier.

Relationships:

  • success

    • If the processor successfully updates the specified attribute(s), then the FlowFile follows this relationship.

  • set state fail

    • If the processor is running statefully, and fails to set the state after adding attributes to the FlowFile, then the FlowFile will be routed to this relationship.

Basic Usage

For basic usage, changes are made by adding a new processor property and referencing as its name the attribute you want to change. Then enter the desired attribute value as the Value. The Value can be as simple as any text string or it can be a NiFi Expression Language statement that specifies how to formulate the value. (See the NiFi Expression Language Usage Guide for details on crafting NiFi Expression Language statements.)

As an example, to alter the standard “filename” attribute so that it has “.txt” appended to the end of it, add a new property and make the property name “filename” (to reference the desired attribute), and as the value, use the NiFi Expression Language statement shown below:

  • Property: filename

  • Value: ${filename}.txt

The preceding example illustrates how to modify an existing attribute. If an attribute does not already exist, this processor can also be used to add a new attribute. For example, the following property could be added to create a new attribute called myAttribute that has the value myValue:

  • Property: myAttribute

  • Value: myValue

In this example, all FlowFiles passing through this processor will receive an additional FlowFile attribute called myAttribute with the value myValue. This type of configuration might be used in a flow where you want to tag every FlowFile with an attribute so that it can be used later in the flow, such as for routing in a RouteOnAttribute processor.

Advanced Usage

The preceding examples illustrate how to make changes to every FlowFile that goes through the processor. However, the UpdateAttribute processor may also be used to make conditional changes.

To change attributes based on some condition, use the Advanced User Interface (UI) in the processor by clicking the Advanced menu item in the Canvas context menu.

Clicking the Advanced menu item displays the Advanced UI. In the Advanced UI, Conditions and their associated Actions are entered as “Rules”. Each rule basically says, “If these conditions are met, then do this action.” One or more conditions may be used in a given rule, and they all must be met in order for the designated action(s) to be taken.

Adding Rules

To add the first rule, click on the “Create Rule” button in center of the screen. The Edit Rule form will display where the name, comments, conditions, and actions for the rule can be entered. Once the rule is defined, click the Add button. Additional rules can be added by clicking the button with the plus symbol located to the top right of the Rule listing.

Example Rules

This example has two rules: CheckForLargeFiles and CheckForGiantFiles. The CheckForLargeFiles rule has these conditions:

  • ${filename:equals('fileOfInterest')}

  • ${fileSize:toNumber():ge(1048576)}

  • ${fileSize:toNumber():lt(1073741824)}

Then it has this action for the filename attribute:

  • ${filename}.meg

Taken together, this rule says:

  • If the value of the filename attribute is fileOfInterest, and

  • If the fileSize is greater than or equal to (ge) one megabyte (1,048,576 bytes), and

  • If the fileSize is less than (lt) one gigabyte (1,073,741,824 bytes)

  • Then change the value of the filename attribute by appending “.meg” to the filename.

Adding another Rule

Continuing with this example, we can add another rule to check for files that are larger than one gigabyte. When we add this second rule, we can use the previous rule as a template, so to speak, by taking advantage of the “Clone Rule” option in the menu for the Rule that you want to clone. This will open a new Rule form with the existing Rule’s criteria pre-populated.

In this example, the CheckForGiantFiles rule has these conditions:

  • ${filename:equals('fileOfInterest')}

  • ${fileSize:toNumber():gt(1073741824)}

Then it has this action for the filename attribute:

  • ${filename}.gig

Taken together, this rule says:

  • If the value of the filename attribute is fileOfInterest, and

  • If the fileSize is greater than (gt) one gigabyte (1,073,741,824 bytes)

  • Then change the value of the filename attribute by appending “.gig” to the filename.

Combining the Basic Usage with the Advanced Usage

The UpdateAttribute processor allows you to make both basic usage changes (i.e., to every FlowFile) and advanced usage changes (i.e., conditional) at the same time; however, if they both affect the same attribute(s), then the conditional changes take precedence. This has the added benefit of supporting a type of “else” construct. In other words, if none of the rules match for the attribute, then the basic usage changes will be made.

Deleting Attributes

Deleting attributes is as simple as providing a regular expression for attribute names to be deleted. This can be a simple regular expression that will match a single attribute, or a more complex regular expression to match a group of similarly named attributes or even several individual attribute names.

  • lastUser - will delete an attribute with the name “lastUser”.

  • user.* - will delete attributes beginning with “user”, including for example “username”, “userName”, “userID”, and “users”. But it will not delete “User” or “localuser”.

  • (user.*|host.*|.*Date) - will delete “user”, “username”, “userName”, “hostInfo”, “hosts”, and “updateDate”, but not “User”, “HOST”, “update”, or “updatedate”.

The delete attributes function does not produce a Provenance Event if the alternate.identifier Attribute is deleted.

FlowFile Policy

Another setting in the Advanced UI is the FlowFile Policy. It is located in the upper-left corner of the UI, and it defines the processor’s behavior when multiple rules match. It may be changed using the slide toggle. By default, the FlowFile Policy is set to use a clone of the original FlowFile for each matching rule.

If the FlowFile policy is set to “use clone”, and multiple rules match, then a copy of the incoming FlowFile is created, such that the number of outgoing FlowFiles is equal to the number of rules that match. In other words, if two rules (A and B) both match, then there will be two outgoing FlowFiles, one for Rule A and one for Rule B. This can be useful in situations where you want to add an attribute to use as a flag for routing later. In this example, there will be two copies of the file available, one to route for the A path, and one to route for the B path.

If the FlowFile policy is set to “use original”, then all matching rules are applied to the same incoming FlowFile, and there is only one outgoing FlowFile with all the attribute changes applied. In this case, the order of the rules matters and the action for each rule that matches will be applied in that order. If multiple rules contain actions that update the same attribute, the action from the last matching rule will take precedence. Notably, you can drag and drop the rules into a certain order within the Rules list once the FlowFile Policy is set to “use original” and the user has toggled the “Reorder rules” control. While in this reordering mode, other Rule modifications are not allowed.

Filtering Rules

The Advanced UI supports the creation of an arbitrarily large number of rules. In order to manage large rule sets, the listing of rules may be filtered using the Filter mechanism in the lower left corner. Rules may be filtered by any text in the name, condition, or action.

Closing the Advanced UI

Once all changes have been saved in the Advanced UI, you can navigate back to the Canvas using the navigation at the top.

Stateful Usage

By selecting the “store state locally” option for the “Store State” property, UpdateAttribute will not only store the evaluated properties as attributes of the FlowFile but also as stateful variables to be referenced in a recursive fashion. This enables the processor to calculate things like the sum or count of incoming FlowFiles. A dynamic property can be referenced as a stateful variable like so:

  • Dynamic Property

    • key : theCount

    • value : ${getStateValue("theCount"):plus(1)}

This example will keep a count of the total number of FlowFiles that have passed through the processor. To use logic on top of State, simply use the “Advanced Usage” of UpdateAttribute. All Actions will be stored as stateful attributes as well as being added to FlowFiles. Using the “Advanced Usage” it is possible to keep track of things like a maximum value seen in the flow so far. This would be done by having a condition of ${getStateValue("maxValue"):lt(${value})} and an action that sets the attribute maxValue to ${value}. The “Stateful Variables Initial Value” property is used to initialize the stateful variables and is required to be set if running statefully. Some logic rules will require a very high initial value, such as using the Advanced rules to determine the minimum value. If stateful properties reference other stateful properties, then the value of the other stateful properties will be an iteration behind. For example, attempting to calculate the average of the incoming stream requires the sum and count. If all three properties are set in the same UpdateAttribute (like below), then the Average will never include the most recent values of count and sum:

  • Count

    • key : theCount

    • value : ${getStateValue("theCount"):plus(1)}

  • Sum

    • key : theSum

    • value : ${getStateValue("theSum"):plus(${flowfileValue})}

  • Average

    • key : theAverage

    • value : ${getStateValue("theSum"):divide(getStateValue("theCount"))}

Instead, since the average only relies on the theCount and theSum attributes (which are added to the FlowFile as well), there should be a follow-on stateless UpdateAttribute which properly calculates the average (a sketch follows below).

In the event that the processor is unable to get the state at the beginning of the onTrigger, the FlowFile will be pushed back to the originating relationship and the processor will yield. If the processor is able to get the state at the beginning of the onTrigger but unable to set the state after adding attributes to the FlowFile, the FlowFile will be transferred to “set state fail”. This is normally due to the state not being the most recent version (another thread has replaced the state with another version). In most use-cases this relationship should loop back to the processor, since the only affected attributes will be overwritten.

Note: Currently the only “stateful” option is to store state locally. This is because the current implementation of clustered state relies on ZooKeeper, and ZooKeeper isn’t designed for the type of load/throughput that UpdateAttribute with state would demand. In the future, if/when multiple different clustered state options are added, UpdateAttribute will be updated.
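
Returning to the average example, a minimal sketch of the follow-on stateless UpdateAttribute mentioned above (the theSum and theCount attribute names are carried over from the configuration above; the property name theAverage is illustrative):

  • Dynamic Property

    • key : theAverage

    • value : ${theSum:divide(${theCount})}

Because this second processor reads the attributes already written to the FlowFile rather than its own state, the result reflects the most recent count and sum.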

Combining the Advanced Usage with Stateful

The UpdateAttribute processor allows you to use advanced usage changes (i.e., conditional) in addition to storing the values in state at the same time. This allows UpdateAttribute to act as a stateful rules engine, enabling powerful concepts such as a Finite-State machine or keeping track of a min/max value. Working with both is relatively simple: when the processor would normally update an attribute (i.e., a conditional rule matches), the same update is stored to state. Referencing state via the advanced tab is done in the same way too, using “getStateValue”. Note: In the event the “use clone” policy is set and the state fails to be set, no clones will be generated and only the original FlowFile will be transferred to “set state fail”.
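
As a minimal sketch of the maximum-value tracking described above (the rule name is hypothetical, ${value} is assumed to be a numeric attribute on incoming FlowFiles, and an initial value of 0 assumes non-negative values):

  • Rule: TrackMaxValue

    • Condition : ${getStateValue("maxValue"):lt(${value})}

    • Action : set attribute maxValue to ${value}

  • Stateful Variables Initial Value : 0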

Notes about Concurrency and Stateful Usage

When using the stateful option, concurrent tasks should be used with caution. If every incoming FlowFile will update state, then it will be much more efficient to have only one task. This is because the first thing the onTrigger does is get the state, and the last thing it does is store the state if there are any updates. If it does not have the most recent initial state when it goes to update, it will fail and send the FlowFile to “set state fail”. This ensures that an update only succeeds when it was made with the most recent information. If it didn’t work in this mock-atomic way, there would be no guarantee that the state is accurate. When considering concurrency, the use-cases generally fall into one of three categories:

  • A data stream where each FlowFile updates state, ex. updating a counter

  • A data stream where a FlowFile doesn’t always update state, ex. a Finite-State machine

  • A data stream that doesn’t update state, plus a second “control” stream that updates state every time but is rare compared to the data stream, ex. a trigger

The first and last cases are relatively clear-cut in their guidance. For the first, concurrency should not be used: doing so will just waste CPU, and any benefits of concurrency will be wiped out by misses in state. The last case can easily be handled with concurrency; since updates are rare in the first place, it will be even rarer that two updates are processed at the same time and cause problems. The second case is a bit of a grey area. If updates are rare then concurrency can probably be used; if updates are frequent then concurrency would probably cause more problems than benefits. Regardless, testing to determine the appropriate tuning is the only true answer.

UpdateByQueryElasticsearch

Update documents in an Elasticsearch index using a query. The query can be loaded from a flowfile body or from the Query parameter. The loaded Query can contain any JSON accepted by Elasticsearch’s _update_by_query API, for example a "query" object to identify what documents are to be updated, plus a "script" to define the updates to perform.

Tags: elastic, elasticsearch, elasticsearch5, elasticsearch6, elasticsearch7, elasticsearch8, update, query

Properties

Query Definition Style

How the JSON Query will be defined for use by the processor.

Query

A query in JSON syntax, not Lucene syntax. Ex: {"query":{"match":{"somefield":"somevalue"}}}. If this parameter is not set, the query will be read from the flowfile content. If the query (property and flowfile content) is empty, a default empty JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Query Clause

A "query" clause in JSON syntax, not Lucene syntax. Ex: {"match":{"somefield":"somevalue"}}. If the query is empty, a default JSON Object will be used, which will result in a "match_all" query in Elasticsearch.

Script

A "script" to execute during the operation, in JSON syntax. Ex: {"source": "ctx._source.count++", "lang": "painless"}

Query Attribute

If set, the executed query will be set on each result flowfile in the specified attribute.

Index

The name of the index to use.

Type

The type of this document (used by Elasticsearch for indexing and searching).

Max JSON Field String Length

The maximum allowed length of a string value when parsing a JSON document or attribute.

Client Service

An Elasticsearch client service to use for running queries.

Dynamic Properties

The name of a URL query parameter to add

Adds the specified property name/value as a query parameter in the Elasticsearch URL used for processing. These parameters will override any matching parameters in the query request body.
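
For example (conflicts is a standard Elasticsearch URL parameter for _update_by_query, not a NiFi-specific name), a dynamic property could be added to continue processing when document version conflicts are encountered:

  • Dynamic Property

    • name : conflicts

    • value : proceed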

Relationships

  • success: If the "by query" operation succeeds, and a flowfile was read, it will be sent to this relationship.

  • failure: If the "by query" operation fails, and a flowfile was read, it will be sent to this relationship.

  • retry: All flowfiles that fail due to server/cluster availability go to this relationship.

Writes Attributes

  • elasticsearch.update.took: The amount of time that it took to complete the update operation in ms.

  • elasticsearch.update.error: The error message provided by Elasticsearch if there is an error running the update.

Input Requirement

This component allows an incoming relationship.

Additional Details

This processor executes an update operation against one or more indices using the _update_by_query handler. The query should be a valid Elasticsearch JSON DSL query (Lucene syntax is not supported). An optional Elasticsearch script can be specified to execute against the matched documents. An example query with script:

{
  "script": {
    "source": "ctx._source.count++",
    "lang": "painless"
  },
  "query": {
    "match": {
      "username.keyword": "john.smith"
    }
  }
}
json

To update all the contents of an index, this could be used:

{
  "query": {
    "match_all": {}
  }
}
json
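
When the Query Definition Style is configured to build the query from the individual clause properties instead of a single Query, the first example above could be expressed as the following property values (a sketch using the properties documented above):

  • Query Clause : {"match":{"username.keyword":"john.smith"}}

  • Script : {"source": "ctx._source.count++", "lang": "painless"}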

UpdateCounter

This processor allows users to set specific counters and key points in their flow. It is useful for debugging and basic counting functions.

Tags: counter, debug, instrumentation

Properties

Counter Name

The name of the counter you want to set the value of - supports expression language like ${counterName}

Delta

Adjusts the counter by the specified delta for each flow file received. May be a positive or negative integer.

Relationships

  • success: Counter was updated/retrieved

Reads Attributes

  • counterName: The name of the counter to update/get.

Input Requirement

This component requires an incoming relationship.

UpdateDatabaseTable

This processor uses a JDBC connection and incoming records to generate any database table changes needed to support the incoming records. It expects a 'flat' record layout, meaning none of the top-level record fields has nested fields that are intended to become columns themselves.

Tags: metadata, jdbc, database, table, update, alter

Properties

Record Reader

The service for reading incoming flow files. The reader is only used to determine the schema of the records, the actual records will not be processed.

Database Connection Pooling Service

The Controller Service that is used to obtain connection(s) to the database

Database Type

Database Type for generating statements specific to a particular service or vendor. The Generic Type supports most cases but selecting a specific type enables optimal processing or additional features.

Database Dialect Service

Database Dialect Service for generating statements specific to a particular service or vendor.

Catalog Name

The name of the catalog that the statement should update. This may not apply for the database that you are updating. In this case, leave the field empty. Note that if the property is set and the database is case-sensitive, the catalog name must match the database’s catalog name exactly.

Schema Name

The name of the database schema that the table belongs to. This may not apply for the database that you are updating. In this case, leave the field empty. Note that if the property is set and the database is case-sensitive, the schema name must match the database’s schema name exactly.

Table Name

The name of the database table to update. If the table does not exist, then it will either be created or an error thrown, depending on the value of the Create Table property.

Create Table Strategy

Specifies how to process the target table when it does not exist (e.g., create it or fail).

Primary Key Fields

A comma-separated list of record field names that uniquely identifies a row in the database. This property is only used if the specified table needs to be created, in which case the Primary Key Fields will be used to specify the primary keys of the newly-created table. IMPORTANT: Primary Key Fields must match the record field names exactly unless 'Quote Column Identifiers' is false and the database allows for case-insensitive column names. In practice it is best to specify Primary Key Fields that exactly match the record field names, and those will become the column names in the created table.

Translate Field Names

If true, the Processor will attempt to translate field names into the corresponding column names for the table specified, for the purposes of determining whether the field name exists as a column in the target table. NOTE: If the target table does not exist and is to be created, this property is ignored and the field names will be used as-is. If false, the field names must match the column names exactly, or the column may not be found and instead an error may be reported that the column already exists.

Column Name Translation Strategy

The strategy used to normalize table column names. Column names will be uppercased to do case-insensitive matching irrespective of strategy.

Column Name Translation Pattern

Column name will be normalized with this regular expression

Update Field Names

This property indicates whether to update the output schema such that the field names are set to the exact column names from the specified table. This should be used if the incoming record field names may not match the table’s column names in terms of upper- and lower-case. For example, this property should be set to true if the output FlowFile is destined for Oracle, for example, which expects the field names to match the column names exactly. NOTE: The value of the 'Translate Field Names' property is ignored when updating field names; instead they are updated to match the column name as returned by the database.

Record Writer

Specifies the Controller Service to use for writing results to a FlowFile. The Record Writer should use Inherit Schema to emulate the inferred schema behavior, i.e. an explicit schema need not be defined in the writer, and will be supplied by the same logic used to infer the schema from the column types. If Create Table Strategy is set to 'Create If Not Exists', the Record Writer’s output format must match the Record Reader’s format in order for the data to be placed in the created table location. Note that this property is only used if 'Update Field Names' is set to true and the field names do not all match the column names exactly. If no update is needed for any field names (or 'Update Field Names' is false), the Record Writer is not used and instead the input FlowFile is routed to success or failure without modification.

Quote Table Identifiers

Enabling this option will cause the table name to be quoted to support the use of special characters in the table name and/or forcing the value of the Table Name property to match the target table name exactly.

Quote Column Identifiers

Enabling this option will cause all column names to be quoted, allowing you to use reserved words as column names in your tables and/or forcing the record field names to match the column names exactly.

Query Timeout

Sets the number of seconds the driver will wait for a query to execute. A value of 0 means no timeout. NOTE: Non-zero values may not be supported by the driver.

Relationships

  • success: A FlowFile containing records routed to this relationship after the record has been successfully transmitted to the database.

  • failure: A FlowFile containing records routed to this relationship if the record could not be transmitted to the database.

Writes Attributes

  • output.table: This attribute is written on the flow files routed to the 'success' and 'failure' relationships, and contains the target table name.

  • output.path: This attribute is written on the flow files routed to the 'success' and 'failure' relationships, and contains the path on the file system to the table (or partition location if the table is partitioned).

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer, only if a Record Writer is specified and Update Field Names is 'true'.

  • record.count: Sets the number of records in the FlowFile, only if a Record Writer is specified and Update Field Names is 'true'.

Input Requirement

This component requires an incoming relationship.

UpdateRecord

Updates the contents of a FlowFile that contains Record-oriented data (i.e., data that can be read via a RecordReader and written by a RecordWriter). This Processor requires that at least one user-defined Property be added. The name of the Property should indicate a RecordPath that determines the field that should be updated. The value of the Property is either a replacement value (optionally making use of the Expression Language) or is itself a RecordPath that extracts a value from the Record. Whether the Property value is determined to be a RecordPath or a literal value depends on the configuration of the <Replacement Value Strategy> Property.

Use Cases

Combine multiple fields into a single field.

Keywords: combine, concatenate, recordpath

Input Requirement: This component allows an incoming relationship.

  1. "Replacement Value Strategy" = "Record Path Value" .

  2. A single additional property is added to the Processor. The name of the property is a RecordPath identifying the field to place the result in.

  3. The value of the property uses the CONCAT Record Path function to concatenate multiple values together, potentially using other string literal values.

  4. For example, to combine the title, firstName and lastName fields into a single field named fullName, we add a property with the name /fullName and a value of CONCAT(/title, ' ', /firstName, ' ', /lastName) .

Change the value of a record field to an explicit value.

Keywords: change, update, replace, transform

Input Requirement: This component allows an incoming relationship.

  1. "Replacement Value Strategy" = "Literal Value" .

  2. A single additional property is added to the Processor. The name of the property is a RecordPath identifying the field to place the result in.

  3. The value of the property is the explicit value to set the field to. For example, we can set any field with a name of txId, regardless of its level in the data’s hierarchy, to 1111-1111 by adding a property with a name of //txId and a value of 1111-1111 .

Copy the value of one record field to another record field.

Keywords: change, update, copy, recordpath, hierarchy, transform

Input Requirement: This component allows an incoming relationship.

  1. "Replacement Value Strategy" = "Record Path Value" .

  2. A single additional property is added to the Processor. The name of the property is a RecordPath identifying the field to update.

  3. The value of the property is a RecordPath identifying the field to copy the value from.

  4. For example, we can copy the value of /identifiers/all/imei to the identifier field at the root level, by adding a property named /identifier with a value of /identifiers/all/imei.

Enrich data by injecting the value of an attribute into each Record.

Keywords: enrich, attribute, change, update, replace, insert, transform

Input Requirement: This component allows an incoming relationship.

  1. "Replacement Value Strategy" = "Literal Value" .

  2. A single additional property is added to the Processor. The name of the property is a RecordPath identifying the field to place the result in.

  3. The value of the property is an Expression Language expression that references the attribute of interest. We can, for example, insert a new field name filename into each record by adding a property named /filename with a value of ${filename} .

Change the format of a record field’s value.

Notes: Use the RenameRecordField Processor in order to change a field’s name.

Keywords: change, update, replace, insert, transform, format, date/time, timezone, expression language

Input Requirement: This component allows an incoming relationship.

  1. "Replacement Value Strategy" = "Literal Value" .

  2. A single additional property is added to the Processor. The name of the property is a RecordPath identifying the field to update.

  3. The value is an Expression Language expression that references the field.value variable. For example, to change the date/time format of a field named txDate from year-month-day format to month/day/year format, we add a property named /txDate with a value of ${field.value:toDate('yyyy-MM-dd'):format('MM/dd/yyyy')}. We could also change the timezone of a timestamp field (and insert the timezone for clarity) by using a value of ${field.value:toDate('yyyy-MM-dd HH:mm:ss', 'UTC-0400'):format('yyyy-MM-dd HH:mm:ss Z', 'UTC')}.

Tags: update, record, generic, schema, json, csv, avro, log, logs, freeform, text

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records

Replacement Value Strategy

Specifies how to interpret the configured replacement values

Dynamic Properties

A RecordPath.

Allows users to specify values to use to replace fields in the record that match the RecordPath.

Relationships

  • success: FlowFiles that are successfully transformed will be routed to this relationship

  • failure: If a FlowFile cannot be transformed from the configured input format to the configured output format, the unchanged FlowFile will be routed to this relationship

Writes Attributes

  • record.index: This attribute provides the current row index and is only available inside the literal value expression.

  • record.error.message: This attribute provides on failure the error message encountered by the Reader or Writer.

Input Requirement

This component requires an incoming relationship.

See Also

Additional Details

UpdateRecord makes use of the NiFi RecordPath Domain-Specific Language (DSL) to allow the user to indicate which field(s) in the Record should be updated. Users do this by adding a User-defined Property to the Processor’s configuration. The name of the User-defined Property must be the RecordPath text that should be evaluated against each Record. The value of the Property specifies what value should go into that selected Record field.

When specifying the replacement value (the value of the User-defined Property), the user is able to specify a literal value such as the number 10; an Expression Language Expression to reference FlowFile attributes, such as ${filename}; or another RecordPath path from which to retrieve the desired value from the Record itself. Whether the value entered should be interpreted as a literal or a RecordPath path is determined by the value of the Replacement Value Strategy Property.

If a RecordPath is given and does not match any field in an input Record, that Property will be skipped and all other Properties will still be evaluated. If the RecordPath matches exactly one field, that field will be updated with the corresponding value. If multiple fields match the RecordPath, then all fields that match will be updated. If the replacement value is itself a RecordPath that does not match, then a null value will be set for the field. For instances where this is not the desired behavior, RecordPath predicates can be used to filter the fields that match so that no fields will be selected. See RecordPath Predicates for more information.

Below, we lay out some examples in order to provide clarity about the Processor’s behavior. For all the examples below, consider the example to operate on the following set of 2 (JSON) records:

[
  {
    "id": 17,
    "name": "John",
    "child": {
      "id": "1"
    },
    "siblingIds": [
      4,
      8
    ],
    "siblings": [
      {
        "name": "Jeremy",
        "id": 4
      },
      {
        "name": "Julia",
        "id": 8
      }
    ]
  },
  {
    "id": 98,
    "name": "Jane",
    "child": {
      "id": 2
    },
    "gender": "F",
    "siblingIds": [],
    "siblings": []
  }
]
json

For brevity, we will omit the corresponding schema and configuration of the RecordReader and RecordWriter. Instead, consider that the following sets of Properties are configured for the Processor, along with their associated outputs.

Example 1 - Replace with Literal

Here, we will replace the name of each Record with the name ‘Jeremy’ and set the gender to ‘M’:

Property Name Property Value

Replacement Value Strategy

Literal Value

/name

Jeremy

/gender

M

This will yield the following output:

[
  {
    "id": 17,
    "name": "Jeremy",
    "child": {
      "id": "1"
    },
    "gender": "M",
    "siblingIds": [
      4,
      8
    ],
    "siblings": [
      {
        "name": "Jeremy",
        "id": 4
      },
      {
        "name": "Julia",
        "id": 8
      }
    ]
  },
  {
    "id": 98,
    "name": "Jeremy",
    "child": {
      "id": 2
    },
    "gender": "M",
    "siblingIds": [],
    "siblings": []
  }
]
json

Note that even though the first record did not have a “gender” field in the input, one will be added after the “child” field, as that’s where the field is located in the schema.

Example 2 - Replace with RecordPath

This example will replace the value in one field of the Record with the value from another field. For this example, consider the following set of Properties:

Property Name Property Value

Replacement Value Strategy

Record Path Value

/name

/siblings[0]/name

This will yield the following output:

[
  {
    "id": 17,
    "name": "Jeremy",
    "child": {
      "id": "1"
    },
    "siblingIds": [
      4,
      8
    ],
    "siblings": [
      {
        "name": "Jeremy",
        "id": 4
      },
      {
        "name": "Julia",
        "id": 8
      }
    ]
  },
  {
    "id": 98,
    "name": null,
    "child": {
      "id": 2
    },
    "gender": "F",
    "siblingIds": [],
    "siblings": []
  }
]
json

Example 3 - Replace with Relative RecordPath

In the above example, we replaced the value of a field based on another RecordPath. That RecordPath was an “absolute RecordPath,” meaning that it starts with a “slash” character (/) and therefore specifies the path from the “root” or “outermost” element. However, sometimes we want to reference a field in such a way that the RecordPath is defined relative to the field being updated. This example does just that. For each of the siblings given in the “siblings” array, we will replace the sibling’s name with their id. To do so, we will configure the processor with the following properties:

Property Name Property Value

Replacement Value Strategy

Record Path Value

/siblings[*]/name

../id

Note that the RecordPath that was given for the value starts with .., which is a reference to the parent. We do this because the field that we are going to update is the “name” field of the sibling. To get to the associated “id” field, we need to go to the “name” field’s parent and then to its “id” child field. The above example results in the following output:

[
  {
    "id": 17,
    "name": "John",
    "child": {
      "id": "1"
    },
    "siblingIds": [
      4,
      8
    ],
    "siblings": [
      {
        "name": "4",
        "id": 4
      },
      {
        "name": "8",
        "id": 8
      }
    ]
  },
  {
    "id": 98,
    "name": "Jane",
    "child": {
      "id": 2
    },
    "gender": "F",
    "siblingIds": [],
    "siblings": []
  }
]
json

Example 4 - Replace Multiple Values

This example will replace the value of all fields that have the name “id”, regardless of where in the Record hierarchy the field is found. The value that it uses references the Expression Language, so for this example, let’s assume that the incoming FlowFile has an attribute named “replacement.id” that has a value of “91”:

Property Name Property Value

Replacement Value Strategy

Literal Value

//id

${replacement.id}

This will yield the following output:

[
  {
    "id": 91,
    "name": "John",
    "child": {
      "id": "91"
    },
    "siblingIds": [
      4,
      8
    ],
    "siblings": [
      {
        "name": "Jeremy",
        "id": 91
      },
      {
        "name": "Julia",
        "id": 91
      }
    ]
  },
  {
    "id": 91,
    "name": "Jane",
    "child": {
      "id": 91
    },
    "gender": "F",
    "siblingIds": [],
    "siblings": []
  }
]
json

It is also worth noting that in this example, some of the “id” fields were of type STRING, while others were of type INT. This is okay because the RecordReaders and RecordWriters should handle these simple type coercions for us.

Example 5 - Use Expression Language to Modify Value

This example will capitalize the value of all ‘name’ fields, regardless of where in the Record hierarchy the field is found. This is done by referencing the ‘field.value’ variable in the Expression Language. We can also access the field.name variable and the field.type variable.

Property Name Property Value

Replacement Value Strategy

Literal Value

//name

${field.value:toUpper()}

This will yield the following output:

[
  {
    "id": 17,
    "name": "JOHN",
    "child": {
      "id": "1"
    },
    "siblingIds": [
      4,
      8
    ],
    "siblings": [
      {
        "name": "JEREMY",
        "id": 4
      },
      {
        "name": "JULIA",
        "id": 8
      }
    ]
  },
  {
    "id": 98,
    "name": "JANE",
    "child": {
      "id": 2
    },
    "gender": "F",
    "siblingIds": [],
    "siblings": []
  }
]
json

ValidateCsv

Validates the contents of FlowFiles or a FlowFile attribute value against a user-specified CSV schema. Take a look at the additional documentation of this processor for some schema examples.

Tags: csv, schema, validation

Properties

Schema

The schema to be used for validation. It is expected to be a comma-delimited string representing the cell processors to apply. The following cell processors are allowed in the schema definition: [ParseBigDecimal, ParseBool, ParseChar, ParseDate, ParseDouble, ParseInt, ParseLong, Optional, DMinMax, Equals, ForbidSubStr, LMinMax, NotNull, Null, RequireHashCode, RequireSubStr, Strlen, StrMinMax, StrNotNullOrEmpty, StrRegEx, Unique, UniqueHashCode, IsIncludedIn]. Note: cell processors cannot be nested except with Optional. Schema is required if Header is false.

CSV Source Attribute

The name of the attribute containing CSV data to be validated. If this property is blank, the FlowFile content will be validated.

Header

True if the incoming flow file contains a header to ignore, false otherwise.

Delimiter character

Character used as 'delimiter' in the incoming data. Example: ,

Quote character

Character used as 'quote' in the incoming data. Example: "

End of line symbols

Symbols used as 'end of line' in the incoming data. Example: \n

Validation strategy

Strategy to apply when routing input files to output relationships.

Include all violations

If true, the validation.error.message attribute will include the list of all the violations for the first invalid line. Note that setting this property to true will slightly decrease performance because all columns will be validated. If false, a line is considered invalid as soon as a column is found violating the specified constraint, and only this violation for the first invalid line will be included in the validation.error.message attribute.

Relationships

  • invalid: FlowFiles that are not valid according to the specified schema, or no schema or CSV header can be identified, are routed to this relationship

  • valid: FlowFiles that are successfully validated against the schema are routed to this relationship

Writes Attributes

  • count.valid.lines: If line by line validation, number of valid lines extracted from the source data

  • count.invalid.lines: If line by line validation, number of invalid lines extracted from the source data

  • count.total.lines: If line by line validation, total number of lines in the source data

  • validation.error.message: For flow files routed to invalid, message of the first validation error

Input Requirement

This component requires an incoming relationship.

Additional Details

Usage Information

The Validate CSV processor is based on the super-csv library and the concept of Cell Processors. The corresponding java documentation can be found here.

The cell processors cannot be nested (except with Optional which gives the possibility to define a CellProcessor for values that could be null) and must be defined in a comma-delimited string as the Schema property.

The supported cell processors are:

  • ParseBigDecimal

  • ParseBool

  • ParseChar

  • ParseDate

  • ParseDouble

  • ParseInt

  • Optional

  • DMinMax

  • Equals

  • ForbidSubStr

  • LMinMax

  • NotNull

  • Null

  • RequireHashCode

  • RequireSubStr

  • Strlen

  • StrMinMax

  • StrNotNullOrEmpty

  • StrRegEx

  • Unique

  • UniqueHashCode

  • IsIncludedIn

Here are some examples:

Schema property: Null, ParseDate("dd/MM/yyyy"), Optional(ParseDouble())
Meaning: the input CSV has three columns, the first one can be null and has no specification, the second one must be a date formatted as expected, and the third one must be a double or null (no value).

Schema property: ParseBigDecimal(), ParseBool(), ParseChar(), ParseInt(), ParseLong()
Meaning: the input CSV has five columns, the first one must be a big decimal, the second one must be a boolean, the third one must be a char, the fourth one must be an integer and the fifth one must be a long.

Schema property: Equals(), NotNull(), StrNotNullOrEmpty()
Meaning: the input CSV has three columns, all the values of the first column must be equal to each other, all the values of the second column must be not null, and all the values of the third column are not null/empty string values.

Schema property: Strlen(4), StrMinMax(3,5), StrRegEx("@[a-z0-9\\.]")
Meaning: the input CSV has three columns, all the values of the first column must be 4 characters long, all the values of the second column must be between 3 and 5 characters (inclusive), and all the values of the last column must match the provided regular expression (email address).

Schema property: Unique(), UniqueHashCode()
Meaning: the input CSV has two columns. All the values of the first column must be unique (all the values are stored in memory). All the values of the second column must be unique (only hash codes of the input values are stored to ensure uniqueness).

Schema property: ForbidSubStr("test", "tset"), RequireSubStr("test")
Meaning: the input CSV has two columns. None of the values in the first column may contain any of the provided strings, and all the values of the second column must contain the provided string.
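
As a combined sketch (the column layout and sample rows below are hypothetical, for illustration only):

Schema property: ParseInt(), StrNotNullOrEmpty(), ParseDate("dd/MM/yyyy"), Optional(ParseBigDecimal())
Meaning: the input CSV has four columns; a row such as 42,Alice,01/02/2020,19.99 would be valid, whereas 42,,2020-01-02,abc would be invalid (empty name, wrong date format, and a non-numeric amount).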

ValidateJson

Validates the contents of FlowFiles against a configurable JSON Schema. See json-schema.org for specification standards. This Processor does not support input containing multiple JSON objects, such as newline-delimited JSON. If the input FlowFile contains newline-delimited JSON, only the first line will be validated.

Tags: JSON, schema, validation

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

JSON Schema Registry

Specifies the Controller Service to use for the JSON Schema Registry

JSON Schema

A URL or file path to the JSON schema or the actual JSON schema content

JSON Schema Version

The JSON schema specification

Max String Length

The maximum allowed length of a string value when parsing the JSON document

Relationships

  • failure: FlowFiles that cannot be read as JSON are routed to this relationship

  • invalid: FlowFiles that are not valid according to the specified schema are routed to this relationship

  • valid: FlowFiles that are successfully validated against the schema are routed to this relationship

Writes Attributes

  • json.validation.errors: If the flow file is routed to the invalid relationship, this attribute will contain the error message resulting from the validation failure.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: Validating JSON requires reading FlowFile content into memory

ValidateRecord

Validates the Records of an incoming FlowFile against a given schema. All records that adhere to the schema are routed to the "valid" relationship while records that do not adhere to the schema are routed to the "invalid" relationship. It is therefore possible for a single incoming FlowFile to be split into two individual FlowFiles if some records are valid according to the schema and others are not. Any FlowFile that is routed to the "invalid" relationship will emit a ROUTE Provenance Event with the Details field populated to explain why records were invalid. In addition, to gain further explanation of why records were invalid, DEBUG-level logging can be enabled for the "org.apache.nifi.processors.standard.ValidateRecord" logger.

Tags: record, schema, validate

Properties

Record Reader

Specifies the Controller Service to use for reading incoming data

Record Writer

Specifies the Controller Service to use for writing out the records. Regardless of the Controller Service schema access configuration, the schema that is used to validate records is used to write the valid results.

Record Writer for Invalid Records

If specified, this Controller Service will be used to write out any records that are invalid. If not specified, the writer specified by the "Record Writer" property will be used with the schema used to read the input records. This is useful, for example, when the configured Record Writer cannot write data that does not adhere to its schema (as is the case with Avro) or when it is desirable to keep invalid records in their original format while converting valid records to another format.

Schema Access Strategy

Specifies how to obtain the schema that should be used to validate records

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Text

The text of an Avro-formatted Schema

Allow Extra Fields

If the incoming data has fields that are not present in the schema, this property determines whether or not the Record is valid. If true, the Record is still valid. If false, the Record will be invalid due to the extra fields.

Strict Type Checking

If the incoming data has a Record where a field is not of the correct type, this property determines how to handle the Record. If true, the Record will be considered invalid. If false, the Record will be considered valid and the field will be coerced into the correct type (if possible, according to the type coercion supported by the Record Writer). This property controls how the data is validated against the validation schema.

Force Types From Reader’s Schema

If enabled, the processor will coerce every field to the type specified in the Reader’s schema. If the value of a field cannot be coerced to the type, the field will be skipped (will not be read from the input data), thus will not appear in the output. If not enabled, then every field will appear in the output but their types may differ from what is specified in the schema. For details please see the Additional Details page of the processor’s Help. This property controls how the data is read by the specified Record Reader.

Validation Details Attribute Name

If specified, when a validation error occurs, this attribute name will be used to leave the details. The number of characters will be limited by the property 'Maximum Validation Details Length'.

Maximum Validation Details Length

Specifies the maximum number of characters that validation details value can have. Any characters beyond the max will be truncated. This property is only used if 'Validation Details Attribute Name' is set

Relationships

  • failure: If the records cannot be read, validated, or written, for any reason, the original FlowFile will be routed to this relationship

  • invalid: Records that are not valid according to the schema will be routed to this relationship

  • valid: Records that are valid according to the schema will be routed to this relationship

Writes Attributes

  • mime.type: Sets the mime.type attribute to the MIME Type specified by the Record Writer

  • record.count: The number of records in the FlowFile routed to a relationship

Input Requirement

This component requires an incoming relationship.

Additional Details

Examples for the effect of Force Types From Reader’s Schema property

The processor first reads the data from the incoming FlowFile using the specified Record Reader, which uses a schema. Then, depending on the value of the Schema Access Strategy property, the processor can either use the reader’s schema, or a different schema to validate the data against. After that, the processor writes the data into the outgoing FlowFile using the specified Record Writer. If the data is valid, the validation schema is used by the writer. If the data is invalid, the writer uses the reader’s schema. The Force Types From Reader’s Schema property affects the first step: how strictly the reader’s schema should be applied when reading the data from the incoming FlowFile. By affecting how the data is read, the value of the Force Types From Reader’s Schema property also has an effect on what the output of the ValidateRecord processor is, and also whether the output is forwarded to the valid or the invalid relationship. Below are two examples where the value of this property affects the output significantly.

In both examples the input is in XML format and the output is in JSON. In the examples we assume that the same schema is used for reading, validation and writing.

Example 1

Schema:

{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
    {
      "name": "field1",
      "type": "string"
    },
    {
      "name": "field2",
      "type": "string"
    }
  ]
}
json

Input:

<test>
    <field1>
        <sub_field>content</sub_field>
    </field1>
    <field2>content_of_field_2</field2>
</test>
xml

Output if Force Types From Reader’s Schema = true (forwarded to the invalid relationship):

[
  {
    "field2": "content_of_field_2"
  }
]
json

Output if Force Types From Reader’s Schema = false (forwarded to the invalid relationship):

[
  {
    "field1": {
      "sub_field": "content"
    },
    "field2": "content_of_field_2"
  }
]
json

As you can see, the FlowFile is forwarded to the invalid relationship in both cases, since the input data does not match the provided Avro schema. However, if Force Types From Reader’s Schema = true, only those fields appear in the output that comply with the schema. If Force Types From Reader’s Schema = false, all fields appear in the output regardless of whether they comply with the schema or not.

Example 2

Schema:

{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
    {
      "name": "field1",
      "type": {
        "type": "array",
        "items": "string"
      }
    },
    {
      "name": "field2",
      "type": {
        "type": "array",
        "items": "string"
      }
    }
  ]
}
json

Input:

<test>
    <field1>content_1</field1>
    <field2>content_2</field2>
    <field2>content_3</field2>
</test>
xml

Output if Force Types From Reader’s Schema = true (forwarded to the valid relationship):

[
  {
    "field1": [
      "content_1"
    ],
    "field2": [
      "content_2",
      "content_3"
    ]
  }
]
json

Output if Force Types From Reader’s Schema = false (forwarded to the invalid relationship):

[
  {
    "field1": "content_1",
    "field2": [
      "content_2",
      "content_3"
    ]
  }
]
json

The schema expects two fields (field1 and field2), both of type ARRAY. field1 only appears once in the input XML document. If Force Types From Reader’s Schema = true, the processor forces this field to be in a type that complies with the schema. So it is put in an array with one element. Since this type coercion can be done, the output is routed to the valid relationship. If Force Types From Reader’s Schema = false the processor does not try to apply type coercion, thus field1 appears in the output as a single value. According to the schema, the processor expects an array for field1, but receives a single element so the output is routed to the invalid relationship.

Schema compliance (and getting routed to the valid or the invalid relationship) does not depend on what Writer is used to produce the output of the ValidateRecord processor. Let us suppose that we used the same schema and input as in Example 2, but instead of JsonRecordSetWriter, we used XMLRecordSetWriter to produce the output. Both in case of Force Types From Reader’s Schema = true and Force Types From Reader’s Schema = false the output is:

<test>
    <field1>content_1</field1>
    <field2>content_2</field2>
    <field2>content_3</field2>
</test>
xml

However, if Force Types From Reader’s Schema = true this output is routed to the valid relationship and if Force Types From Reader’s Schema = false it is routed to the invalid relationship.

ValidateXml

Validates XML contained in a FlowFile. By default, the XML is contained in the FlowFile content. If the 'XML Source Attribute' property is set, the XML to be validated is contained in the specified attribute. It is not recommended to use attributes to hold large XML documents; doing so could adversely affect system performance. Full schema validation is performed if the processor is configured with the XSD schema details. Otherwise, the only validation performed is to ensure the XML syntax is correct and well-formed, e.g. all opening tags are properly closed.

Tags: xml, schema, validation, xsd

Properties

Schema File

The file path or URL to the XSD Schema file that is to be used for validation. If this property is blank, only XML syntax/structure will be validated.

XML Source Attribute

The name of the attribute containing XML to be validated. If this property is blank, the FlowFile content will be validated.

Relationships

  • invalid: FlowFiles that are not valid according to the specified schema or contain invalid XML are routed to this relationship

  • valid: FlowFiles that are successfully validated against the schema, if provided, or verified to be well-formed XML are routed to this relationship

Writes Attributes

  • validatexml.invalid.error: If the flow file is routed to the invalid relationship the attribute will contain the error message resulting from the validation failure.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

  • MEMORY: While this processor supports processing XML within attributes, it is strongly discouraged to hold large amounts of data in attributes. In general, attribute values should be as small as possible and hold no more than a couple hundred characters.

Additional Details

Usage Information

In order to fully validate XML, a schema must be provided. The ValidateXML processor allows the schema to be specified in the property ‘Schema File’. The following example illustrates how an XSD schema and XML data work together.

Example XSD specification

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://namespace/1"
           xmlns:tns="http://namespace/1" elementFormDefault="unqualified">
    <xs:element name="bundle" type="tns:BundleType"></xs:element>

    <xs:complexType name="BundleType">
        <xs:sequence>
            <xs:element name="node" type="tns:NodeType" maxOccurs="unbounded" minOccurs="0"></xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="NodeType">
        <xs:sequence>
            <xs:element name="subNode" type="tns:SubNodeType" maxOccurs="unbounded" minOccurs="0"></xs:element>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="SubNodeType">
        <xs:sequence>
            <xs:element name="value" type="xs:string"></xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:schema>
xml

Given the schema defined in the above XSD, the following are valid XML data.

<ns:bundle xmlns:ns="http://namespace/1">
    <node>
        <subNode>
            <value>Hello</value>
        </subNode>
        <subNode>
            <value>World!</value>
        </subNode>
    </node>
</ns:bundle>
xml
<ns:bundle xmlns:ns="http://namespace/1">
    <node>
        <subNode>
            <value>Hello World!</value>
        </subNode>
    </node>
</ns:bundle>
xml

The following are invalid XML data. The resulting validatexml.invalid.error attribute is shown.

<ns:bundle xmlns:ns="http://namespace/1">
    <node>Hello World!</node>
</ns:bundle>
xml
validatexml.invalid.error: cvc-complex-type.2.3: Element 'node' cannot have character [children], because the type's content type is element-only.
<ns:bundle xmlns:ns="http://namespace/1">
    <node>
        <value>Hello World!</value>
    </node>
</ns:bundle>
xml
validatexml.invalid.error: cvc-complex-type.2.4.a: Invalid content was found starting with element 'value'. One of '{subNode}' is expected.

VerifyContentMAC

Calculates a Message Authentication Code using the provided Secret Key and compares it with the provided MAC property

Tags: Authentication, Signing, MAC, HMAC

Properties

Message Authentication Code Algorithm

Hashed Message Authentication Code Function

Message Authentication Code Encoding

Encoding of the Message Authentication Code

Message Authentication Code

The MAC to compare with the calculated value

Secret Key Encoding

Encoding of the Secret Key

Secret Key

Secret key to calculate the hash

Relationships

  • success: Signature Verification Succeeded

  • failure: Signature Verification Failed

Writes Attributes

  • mac.calculated: Calculated Message Authentication Code encoded by the selected encoding

  • mac.encoding: The Encoding of the Hashed Message Authentication Code

  • mac.algorithm: Hashed Message Authentication Code Algorithm

Input Requirement

This component requires an incoming relationship.

VerifyContentPGP

Verify signatures using OpenPGP Public Keys

Tags: PGP, GPG, OpenPGP, Encryption, Signing, RFC 4880

Properties

Public Key Service

PGP Public Key Service for verifying signatures with Public Key Encryption

Relationships

  • success: Signature Verification Succeeded

  • failure: Signature Verification Failed

Writes Attributes

  • pgp.literal.data.filename: Filename from Literal Data

  • pgp.literal.data.modified: Modified Date Time from Literal Data in milliseconds

  • pgp.signature.created: Signature Creation Time in milliseconds

  • pgp.signature.algorithm: Signature Algorithm including key and hash algorithm names

  • pgp.signature.hash.algorithm.id: Signature Hash Algorithm Identifier

  • pgp.signature.key.algorithm.id: Signature Key Algorithm Identifier

  • pgp.signature.key.id: Signature Public Key Identifier

  • pgp.signature.type.id: Signature Type Identifier

  • pgp.signature.version: Signature Version Number

Input Requirement

This component requires an incoming relationship.

Wait

Routes incoming FlowFiles to the 'wait' relationship until a matching release signal is stored in the distributed cache from a corresponding Notify processor. When a matching release signal is identified, a waiting FlowFile is routed to the 'success' relationship. The release signal entry is then removed from the cache. The attributes of the FlowFile that produced the release signal are copied to the waiting FlowFile if the Attribute Cache Regex property of the corresponding Notify processor is set properly. If there are multiple release signals in the cache identified by the Release Signal Identifier, and the Notify processor is configured to copy the FlowFile attributes to the cache, then the FlowFile passing the Wait processor receives the union of the attributes of the FlowFiles that produced the release signals in the cache (identified by Release Signal Identifier). Waiting FlowFiles will be routed to 'expired' if they exceed the Expiration Duration. If you need to wait for more than one signal, specify the desired number of signals via the 'Target Signal Count' property. This is particularly useful with processors that split a source FlowFile into multiple fragments, such as SplitText. In order to wait for all fragments to be processed, connect the 'original' relationship to a Wait processor, and the 'splits' relationship to a corresponding Notify processor. Configure the Notify and Wait processors to use the '${fragment.identifier}' as the value of 'Release Signal Identifier', and specify '${fragment.count}' as the value of 'Target Signal Count' in the Wait processor. It is recommended to use a prioritizer (for instance First In First Out) when using the 'wait' relationship as a loop.
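
As a minimal sketch of the split-and-wait pattern described above (the flow layout is assumed: SplitText's 'original' relationship feeds the Wait processor, and its 'splits' relationship feeds the processing path that ends in a Notify processor):

  • Wait processor

    • Release Signal Identifier : ${fragment.identifier}

    • Target Signal Count : ${fragment.count}

  • Notify processor

    • Release Signal Identifier : ${fragment.identifier}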

Tags: map, cache, wait, hold, distributed, signal, release

Properties

Release Signal Identifier

A value that specifies the key to a specific release signal cache. To decide whether the FlowFile that is being processed by the Wait processor should be sent to the 'success' or the 'wait' relationship, the processor checks the signals in the cache specified by this key.

Target Signal Count

The number of signals that need to be in the cache (specified by the Release Signal Identifier) in order for the FlowFile processed by the Wait processor to be sent to the 'success' relationship. If the number of signals in the cache has reached this number, the FlowFile is routed to the 'success' relationship and the number of signals in the cache is decreased by this value. If Signal Counter Name is specified, this processor checks a particular counter, otherwise it checks against the total number of signals in the cache.

Signal Counter Name

Within the cache (specified by the Release Signal Identifier) the signals may belong to different counters. If this property is specified, the processor checks the number of signals in the cache that belong to this particular counter. If not specified, the processor checks the total number of signals in the cache.

Wait Buffer Count

Specify the maximum number of incoming FlowFiles that can be buffered to check whether they can move forward. A larger buffer can provide better performance, as it reduces the number of interactions with the cache service by grouping FlowFiles by signal identifier. Only one signal identifier can be processed per processor execution.

Releasable FlowFile Count

A value, or the results of an Attribute Expression Language statement, which will be evaluated against a FlowFile in order to determine the releasable FlowFile count. This specifies how many FlowFiles can be released when the signal count reaches the Target Signal Count. Zero (0) has a special meaning: any number of FlowFiles can be released as long as the signal count matches the target.

Expiration Duration

Indicates the duration after which waiting FlowFiles will be routed to the 'expired' relationship

Distributed Cache Service

The Controller Service that is used to check for release signals from a corresponding Notify processor

Attribute Copy Mode

Specifies how to handle attributes copied from FlowFiles entering the Notify processor

Wait Mode

Specifies how to handle a FlowFile waiting for a notify signal

Wait Penalty Duration

If configured, after a signal identifier has been processed but did not meet the release criteria, the signal identifier is penalized, and FlowFiles having that signal identifier will not be processed again for the specified period of time, so that the signal identifier does not block others from being processed. This can be useful for use cases where a Wait processor is expected to process multiple signal identifiers, each signal identifier has multiple FlowFiles, and the order of releasing FlowFiles within a signal identifier is important. The FlowFile order can be configured with Prioritizers. IMPORTANT: There is a limit on the number of queued signals that can be processed, and the Wait processor may not be able to check all queued signal ids. See the additional details for the best practice.

Relationships

  • success: A FlowFile with a matching release signal in the cache will be routed to this relationship

  • failure: When the cache cannot be reached, or if the Release Signal Identifier evaluates to null or empty, FlowFiles will be routed to this relationship

  • expired: A FlowFile that has exceeded the configured Expiration Duration will be routed to this relationship

  • wait: A FlowFile with no matching release signal in the cache will be routed to this relationship

Writes Attributes

  • wait.start.timestamp: All FlowFiles will have an attribute 'wait.start.timestamp', which sets the initial epoch timestamp when the file first entered this processor. This is used to determine the expiration time of the FlowFile. This attribute is not written when the FlowFile is transferred to failure, expired or success

  • wait.counter.<counterName>: The name of each counter for which at least one signal has been present in the cache since the last time the cache was empty gets copied to the current FlowFile as an attribute.

Input Requirement

This component requires an incoming relationship.

Additional Details

Best practices to handle multiple signal ids at a Wait processor

When a Wait processor is expected to process multiple signal ids, by configuring ‘Release Signal Identifier’ with a FlowFile attribute Expression Language expression, there are a few things to consider in order to get the expected result. The processor configuration can vary based on your requirements. You will also need a high-level understanding of how the Wait processor works:

  • The Wait processor only processes a single signal id at a time

  • How frequently the Wait processor runs is defined in the ‘Run Schedule’

  • Which FlowFile is processed is determined by a Prioritizer

  • Not limited to the Wait processor, but for all processors, the order of queued FlowFiles in a connection is undefined if no Prioritizer is set

See the following sections for common patterns.

Release any FlowFile as soon as its signal is notified

This is the most common use case. FlowFiles are independent and can be released in any order.

Important configurations:

  • Use FirstInFirstOutPrioritizer (FIFO) at the ‘wait’ relationship (or the incoming connection if ‘Wait Mode’ is ‘Keep in the upstream connection’)

The following illustrates, for each Wait run cycle, the notified signal ids, the queued FlowFiles (in FIFO order), and what happens:

Wait run 1 (notified signals: B)

  • Queue index 1: FlowFile a, Signal ID A. This FlowFile is processed. But its signal is not found, and will be re-queued at the end of the queue.

  • Queue index 2: FlowFile b, Signal ID B

  • Queue index 3: FlowFile c, Signal ID C

Wait run 2 (notified signals: B)

  • Queue index 1: FlowFile b, Signal ID B. This FlowFile is processed and since its signal is notified, this one will be released to ‘success’.

  • Queue index 2: FlowFile c, Signal ID C

  • Queue index 3: FlowFile a, Signal ID A

Wait run 3 (no notified signals)

  • Queue index 1: FlowFile c, Signal ID C. This FlowFile will be processed at the next run.

  • Queue index 2: FlowFile a, Signal ID A

Release higher priority FlowFiles in each signal id

Multiple FlowFiles share the same signal id, and the order of releasing a FlowFile is important.

Important configurations:

  • Use a Prioritizer (or a set of Prioritizers) that suits your needs, other than FIFO, at the ‘wait’ relationship (or the incoming connection if ‘Wait Mode’ is ‘Keep in the upstream connection’), e.g. PriorityPrioritizer

  • Specify an adequate ‘Wait Penalty Duration’, e.g. "3 sec"

  • ‘Wait Penalty Duration’ should be greater than ‘Run Schedule’, e.g. "3 sec" > "1 sec"

  • Increase ‘Run Duration’ to avoid the limit on the number of signal ids that can be checked per run (see the note below)

The following illustrates, for each Wait run cycle, the notified signal ids, the signal penalties, the queued FlowFiles, and what happens. The example uses PriorityPrioritizer to control the order of processing FlowFiles within a signal id (the queue is ordered via the ‘priority’ attribute). If ‘Wait Penalty Duration’ is configured, the Wait processor tracks unreleased signal ids and a penalty indicating when each will be checked again.

Wait run 1 at 00:01 (notified signals: B; signal penalties: none)

  • Queue index 1: FlowFile a-1, Signal ID A, ‘priority’ 1. This FlowFile is processed. But its signal is not found. Penalized.

  • Queue index 2: FlowFile b-1, Signal ID B, ‘priority’ 1. Since a-1 and b-1 have the same priority ‘1’, b-1 may be processed before a-1. You can add another Prioritizer to define more specific ordering.

  • Queue index 3: FlowFile b-2, Signal ID B, ‘priority’ 2

Wait run 2 at 00:02 (notified signals: B; signal penalties: A until 00:04)

  • Queue index 1: FlowFile a-1, Signal ID A, ‘priority’ 1. This FlowFile is the first one according to the configured Prioritizer, but the signal id is penalized. So, this FlowFile is skipped at this execution.

  • Queue index 2: FlowFile b-1, Signal ID B, ‘priority’ 1. This FlowFile is processed.

  • Queue index 3: FlowFile b-2, Signal ID B, ‘priority’ 2

Wait run 3 at 00:03 (no notified signals; signal penalties: A until 00:04)

  • Queue index 1: FlowFile a-1, Signal ID A, ‘priority’ 1. This FlowFile is the first one but is still penalized.

  • Queue index 2: FlowFile b-2, Signal ID B, ‘priority’ 2. This FlowFile is processed, but its signal is not notified yet, thus will be penalized.

Wait run 4 at 00:04 (no notified signals; signal penalties: B until 00:06)

  • Queue index 1: FlowFile a-1, Signal ID A, ‘priority’ 1. This FlowFile is no longer penalized, and gets processed. But its signal is not notified yet, thus will be penalized again.

  • Queue index 2: FlowFile b-2, Signal ID B, ‘priority’ 2

The importance of ‘Run Duration’ when ‘Wait Penalty Duration’ is used

There is a limit on the number of signals that can be checked, based on the combination of ‘Run Schedule’ and ‘Wait Penalty Duration’. If this limit is hit, some FlowFiles may not be processed and will remain in the ‘wait’ relationship even if their signal ids are notified. Let’s say Wait is configured with:

  • Run Schedule = 1 sec

  • Wait Penalty Duration = 3 sec

  • Release Signal Identifier = ${uuid}

And there are 5 FlowFiles F1, F2 … F5 in the ‘wait’ relationship. Then the signal for F5 is notified. Wait will work as follows:

  • At 00:00 Wait checks the signal for F1, not found, and penalize F1 (till 00:03)

  • At 00:01 Wait checks the signal for F2, not found, and penalize F2 (till 00:04)

  • At 00:02 Wait checks the signal for F3, not found, and penalize F3 (till 00:05)

  • At 00:03 Wait checks the signal for F4, not found, and penalize F4 (till 00:06)

  • At 00:04 Wait checks the signal for F1 again, because it’s not penalized any longer

This cycle repeats, so F5 will not be released until one of F1 … F4 is released.

To mitigate this limitation, increasing ‘Run Duration’ is recommended. By increasing ‘Run Duration’, the Wait processor keeps getting scheduled for that duration. For example, with a ‘Run Duration’ of 500 ms, Wait should be able to loop through all 5 queued FlowFiles in a single run.
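
To make the cycle above easier to follow, here is a minimal plain-Java sketch (not NiFi code) of the behaviour just described. It assumes a fixed queue order, whole-second ticks, and that only F5’s signal has been notified; the class and variable names are made up for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WaitPenaltySketch {
    public static void main(String[] args) {
        List<String> queue = List.of("F1", "F2", "F3", "F4", "F5"); // 'wait' queue, FIFO order
        Set<String> notified = Set.of("F5");                        // only F5's signal has been notified
        Map<String, Integer> penalizedUntil = new HashMap<>();      // signal id -> second until which it is penalized
        int runSchedule = 1, waitPenalty = 3;                       // Run Schedule = 1 sec, Wait Penalty Duration = 3 sec

        for (int second = 0; second <= 8; second += runSchedule) {
            // Each run looks at the first FlowFile whose signal id is not currently penalized.
            for (String id : queue) {
                if (penalizedUntil.getOrDefault(id, Integer.MIN_VALUE) < second) {
                    boolean released = notified.contains(id);
                    System.out.printf("00:%02d checked %s, released=%b%n", second, id, released);
                    if (!released) {
                        penalizedUntil.put(id, second + waitPenalty); // penalize until it may be checked again
                    }
                    break; // only one signal id is processed per run
                }
            }
        }
        // The output cycles through F1..F4; F5 is never checked even though its signal is notified.
    }
}

With a longer ‘Run Duration’, the processor would perform several of these iterations back to back within one scheduled run and eventually reach F5.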

Using counters

A counter is basically a label to differentiate signals within the cache. (A cache in this context is a “container” that contains signals that have the same signal identifier.)

Let’s suppose that there are the following signals in the cache (note, that these are not FlowFiles on the incoming (or wait) connection of the Wait processor, like in the examples above, but release signals stored in the cache.)

  • Signal ID: A, Signal Counter Name: counter_1

  • Signal ID: A, Signal Counter Name: counter_1

  • Signal ID: A, Signal Counter Name: counter_2

In this state, the following FlowFile gets processed by the Wait processor (the FlowFile has a signal_counter_name attribute and the Wait processor is configured to use the value of this attribute as the value of the Signal Counter Name property):

  • FlowFile UUID: a-1, Signal ID: A, signal_counter_name: counter_3

Despite the fact that the cache identified by Signal ID “A” has signals in it, the FlowFile above will be sent to the ‘wait’ relationship, since there is no signal in the cache that belongs to the counter named “counter_3”.

Let’s suppose, that the state of the cache is the same as above, and the following FlowFile gets processed by the Wait processor:

  • FlowFile UUID: a-2, Signal ID: A, signal_counter_name: counter_1

The FlowFile is transmitted to the ‘success’ relationship, since cache “A” has signals in it and there are signals that belong to “counter_1”. The outgoing FlowFile will have the following attributes and their values appended to it:

  • wait.counter.counter_1 : 2

  • wait.counter.counter_2 : 1

  • wait.counter.total : 3
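
For illustration only, here is a small plain-Java sketch (not NiFi code) that reproduces the attribute values listed above from the cache contents shown earlier (two counter_1 signals and one counter_2 signal under Signal ID “A”); the class and variable names are made up.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CounterAttributesSketch {
    public static void main(String[] args) {
        // Counter names of the signals stored under Signal ID "A"
        List<String> signalsForA = List.of("counter_1", "counter_1", "counter_2");

        Map<String, Integer> attributes = new LinkedHashMap<>();
        for (String counter : signalsForA) {
            attributes.merge("wait.counter." + counter, 1, Integer::sum); // count per counter name
        }
        attributes.put("wait.counter.total", signalsForA.size());         // total number of signals

        attributes.forEach((k, v) -> System.out.println(k + " : " + v));
        // wait.counter.counter_1 : 2
        // wait.counter.counter_2 : 1
        // wait.counter.total : 3
    }
}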

The key point here is that counters can be used to differentiate between signals within the cache. If counters are used, a new attribute will be appended to the FlowFile passing the Wait processor for each counter. If a large number of counters are used within a cache, the FlowFile passing the Wait processor will have a large number of attributes appended to it. To avoid that, it is recommended to use multiple caches with a few counters in each, instead of one cache with many counters.

For example:

  • Cache identified by Release Signal ID “A” has counters: “counter_1” and “counter_2”

  • Cache identified by Release Signal ID “B” has counters: “counter_3” and “counter_4”

  • Cache identified by Release Signal ID “C” has counters: “counter_5” and “counter_6”

(Counter names do not need to be unique between caches, the counter name(s) used in cache “A” could be reused in cache “B” and “C” as well.)

Service

ADLSCredentialsControllerService

Defines credentials for ADLS processors.

Tags: azure, microsoft, cloud, storage, adls, credentials

Properties

Storage Account Name

The storage account name. There are certain risks in allowing the account name to be stored as a FlowFile attribute. While it does provide for a more flexible flow by allowing the account name to be fetched dynamically from a FlowFile attribute, care must be taken to restrict access to the event provenance data (e.g., by strictly controlling the policies governing provenance for this processor). In addition, the provenance repositories may be put on encrypted disk partitions.

Endpoint Suffix

Storage accounts in public Azure always use a common FQDN suffix. Override this endpoint suffix with a different suffix in certain circumstances (like Azure Stack or non-public Azure regions).

Credentials Type

Credentials type to be used for authenticating to Azure

Account Key

The storage account key. This is an admin-like password providing access to every container in this account. It is recommended one uses Shared Access Signature (SAS) token, Managed Identity or Service Principal instead for fine-grained control with policies. There are certain risks in allowing the account key to be stored as a FlowFile attribute. While it does provide for a more flexible flow by allowing the account key to be fetched dynamically from a FlowFile attribute, care must be taken to restrict access to the event provenance data (e.g., by strictly controlling the policies governing provenance for this processor). In addition, the provenance repositories may be put on encrypted disk partitions.

SAS Token

Shared Access Signature token (the leading '?' may be included). There are certain risks in allowing the SAS token to be stored as a FlowFile attribute. While it does provide for a more flexible flow by allowing the SAS token to be fetched dynamically from a FlowFile attribute, care must be taken to restrict access to the event provenance data (e.g., by strictly controlling the policies governing provenance for this processor). In addition, the provenance repositories may be put on encrypted disk partitions.

Managed Identity Client ID

Client ID of the managed identity. The property is required when User Assigned Managed Identity is used for authentication. It must be empty in case of System Assigned Managed Identity.

Service Principal Tenant ID

Tenant ID of the Azure Active Directory hosting the Service Principal.

Service Principal Client ID

Client ID (or Application ID) of the Client/Application having the Service Principal.

Service Principal Client Secret

Password of the Client/Application.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Additional Details

Security considerations of using Expression Language for sensitive properties

Allowing Expression Language for a property has the advantage of configuring the property dynamically via FlowFile attributes or Variable Registry entries. In case of sensitive properties, it also has a drawback of exposing sensitive information like passwords, security keys or tokens. When the value of a sensitive property comes from a FlowFile attribute, it travels by the FlowFile in clear text form and is also saved in the provenance repository. Variable Registry does not support the encryption of sensitive information either. Due to these, the sensitive credential data can be exposed to unauthorized parties.

Best practices for using Expression Language for sensitive properties:

  • use it only if necessary

  • control access to the flow and to provenance repository

  • encrypt disks storing FlowFiles and provenance data

  • if the sensitive data is a temporary token (like the SAS token), use a shorter lifetime and refresh the token periodically

ADLSCredentialsControllerServiceLookup

Provides an ADLSCredentialsService that can be used to dynamically select another ADLSCredentialsService. This service requires an attribute named 'adls.credentials.name' to be passed in, and will throw an exception if the attribute is missing. The value of 'adls.credentials.name' will be used to select the ADLSCredentialsService that has been registered with that name. This will allow multiple ADLSCredentialsServices to be defined and registered, and then selected dynamically at runtime by tagging flow files with the appropriate 'adls.credentials.name' attribute.

Tags: azure, microsoft, cloud, storage, adls, credentials

Properties

Dynamic Properties

The name to register ADLSCredentialsService

If 'adls.credentials.name' attribute contains the name of the dynamic property, then the ADLSCredentialsService (registered in the value) will be selected.

AmazonGlueSchemaRegistry

Provides a Schema Registry that interacts with the AWS Glue Schema Registry so that those Schemas that are stored in the Glue Schema Registry can be used in NiFi. When a Schema is looked up by name by this registry, it will find a Schema in the Glue Schema Registry with the same name.

Tags: schema, registry, aws, avro, glue

Properties

Schema Registry Name

The name of the Schema Registry

Region

The region of the cloud resources

Communications Timeout

Specifies how long to wait to receive data from the Schema Registry before considering the communications a failure

Cache Size

Specifies how many Schemas should be cached from the Schema Registry

Cache Expiration

Specifies how long a Schema that is cached should remain in the cache. Once this time period elapses, a cached version of a schema will no longer be used, and the service will have to communicate with the Schema Registry again in order to obtain the schema.

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

ApicurioSchemaRegistry

Provides a Schema Registry that interacts with the Apicurio Schema Registry so that those Schemas that are stored in the Apicurio Schema Registry can be used in NiFi. When a Schema is looked up by name by this registry, it will find a Schema in the Apicurio Schema Registry whose artifact identifier matches that name.

Tags: schema, registry, apicurio, avro

Properties

Schema Registry URL

The URL of the Schema Registry e.g. http://localhost:8080

Schema Group ID

The artifact Group ID for the schemas

Cache Size

Specifies how many Schemas should be cached from the Schema Registry. The cache size must be a non-negative integer. When it is set to 0, the cache is effectively disabled.

Cache Expiration

Specifies how long a Schema that is cached should remain in the cache. Once this time period elapses, a cached version of a schema will no longer be used, and the service will have to communicate with the Schema Registry again in order to obtain the schema.

Web Client Service Provider

Controller service for HTTP client operations

AvroReader

Parses Avro data and returns each Avro record as a separate Record object. The Avro data may contain the schema itself, or the schema can be externalized and accessed by one of the methods offered by the 'Schema Access Strategy' property.

Tags: avro, parse, record, row, reader, delimited, comma, separated, values

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Cache Size

Specifies how many Schemas should be cached

AvroRecordSetWriter

Writes the contents of a RecordSet in Binary Avro format.

Tags: avro, result, set, writer, serializer, record, recordset, row

Properties

Schema Write Strategy

Specifies how the schema for a Record should be added to the data.

Schema Cache

Specifies a Schema Cache to add the Record Schema to so that Record Readers can quickly lookup the schema.

Schema Reference Writer

Service implementation responsible for writing FlowFile attributes or content header with Schema reference information

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Compression Format

Compression type to use when writing Avro files. Default is None.

Cache Size

Specifies how many Schemas should be cached

Encoder Pool Size

Avro Writers require the use of an Encoder. Creation of Encoders is expensive, but once created, they can be reused. This property controls the maximum number of Encoders that can be pooled and reused. Setting this value too small can result in degraded performance, but setting it higher can result in more heap being used. This property is ignored if the Avro Writer is configured with a Schema Write Strategy of 'Embed Avro Schema'.

AvroSchemaRegistry

Provides a service for registering and accessing schemas. You can register a schema as a dynamic property where 'name' represents the schema name and 'value' represents the textual representation of the actual schema following the syntax and semantics of Avro’s Schema format.

Tags: schema, registry, avro, json, csv

Properties

Validate Field Names

Whether or not to validate the field names in the Avro schema based on Avro naming rules. If set to true, all field names must be valid Avro names, which must begin with [A-Za-z_], and subsequently contain only [A-Za-z0-9_]. If set to false, no validation will be performed on the field names.

Dynamic Properties

Schema name

Adds a named schema using the JSON string representation of an Avro schema
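
As an illustration, the value of such a dynamic property is plain Avro schema JSON. The hypothetical sketch below (the record name "user" and its fields are made up) simply parses such a value with the Avro library to show what a valid schema text looks like.

import org.apache.avro.Schema;

public class AvroSchemaExample {
    public static void main(String[] args) {
        // The kind of JSON that could be registered as the value of a dynamic property (e.g. one named "user")
        String schemaText = "{"
                + "\"type\":\"record\",\"name\":\"user\","
                + "\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":[\"null\",\"int\"],\"default\":null}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaText);
        System.out.println(schema.toString(true)); // pretty-printed schema
    }
}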

AWSCredentialsProviderControllerService

Defines credentials for Amazon Web Services processors. Uses default credentials without configuration. Default credentials support EC2 instance profile/role, default user profile, environment variables, etc. Additional options include access key / secret key pairs, credentials file, named profile, and assume role credentials.

Tags: aws, credentials, provider

Properties

Use Default Credentials

If true, uses the Default Credential chain, including EC2 instance profiles or roles, environment variables, default user credentials, etc.

Access Key ID

Secret Access Key

Credentials File

Path to a file containing AWS access key and secret key in properties file format.

Profile Name

The AWS profile name for credentials from the profile configuration file.

Use Anonymous Credentials

If true, uses Anonymous credentials

Assume Role ARN

The AWS Role ARN for cross account access. This is used in conjunction with Assume Role Session Name and other Assume Role properties.

Assume Role Session Name

The AWS Role Session Name for cross account access. This is used in conjunction with Assume Role ARN.

Assume Role Session Time

Session time for role based session (between 900 and 3600 seconds). This is used in conjunction with Assume Role ARN.

Assume Role External ID

External ID for cross-account access. This is used in conjunction with Assume Role ARN.

Assume Role SSL Context Service

SSL Context Service used when connecting to the STS Endpoint.

Assume Role Proxy Configuration Service

Proxy configuration for cross-account access, if needed within your environment. This will configure a proxy to request for temporary access keys into another AWS account.

Assume Role STS Region

The AWS Security Token Service (STS) region

Assume Role STS Endpoint Override

The default AWS Security Token Service (STS) endpoint ("sts.amazonaws.com") works for all accounts that are not for China (Beijing) region or GovCloud. You only need to set this property to "sts.cn-north-1.amazonaws.com.cn" when you are requesting session credentials for services in the China (Beijing) region or to "sts.us-gov-west-1.amazonaws.com" for GovCloud.

Assume Role STS Signer Override

The AWS STS library uses Signature Version 4 by default. This property allows you to plug in your own custom signer implementation.

Custom Signer Class Name

Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.

Custom Signer Module Location

Comma-separated list of paths to files and/or directories which contain the custom signer’s JAR file and its dependencies (if any).
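
As a rough sketch of what such a custom signer looks like, assuming the single sign method declared by the com.amazonaws.auth.Signer interface; the class name and the header added below are hypothetical placeholders, not real signing logic.

import com.amazonaws.SignableRequest;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.Signer;

public class MyCustomSigner implements Signer {

    @Override
    public void sign(SignableRequest<?> request, AWSCredentials credentials) {
        // A real implementation would compute an authorization signature from the
        // request and credentials; this placeholder only tags the request.
        request.addHeader("x-custom-signed-by", "MyCustomSigner");
    }
}

'Custom Signer Class Name' would then be set to the fully qualified name of such a class, and 'Custom Signer Module Location' to the JAR containing it and its dependencies.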

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

AzureBlobStorageFileResourceService

Provides an Azure Blob Storage file resource for other components.

Use Cases

Fetch a specific file from Azure Blob Storage. The service provides higher performance compared to fetch processors when the data should be moved between different storages without any transformation.

Input Requirement: This component allows an incoming relationship.

  1. "Container Name" = "${azure.container}"

  2. "Blob Name" = "${azure.blobname}" .

  3. The "Storage Credentials" property should specify an instance of the AzureStorageCredentialsService_v12 in order to provide credentials for accessing the storage container. .

Tags: azure, microsoft, cloud, storage, file, resource, blob

Properties

Storage Credentials

Controller Service used to obtain Azure Blob Storage Credentials.

Container Name

Name of the Azure storage container. In case of PutAzureBlobStorage processor, container can be created if it does not exist.

Blob Name

The full name of the blob

AzureCosmosDBClientService

Provides a controller service that configures a connection to Cosmos DB (Core SQL API) and provides access to that connection to other Cosmos DB-related components.

Tags: azure, cosmos, document, service

Properties

Cosmos DB URI

Cosmos DB URI, typically in the form of https://{databaseaccount}.documents.azure.com:443/. Note that this host URL is for Cosmos DB with Core SQL API, from the Azure Portal (Overview→URI).

Cosmos DB Access Key

Cosmos DB Access Key from Azure Portal (Settings→Keys). Choose a read-write key to enable database or container creation at run time

Cosmos DB Consistency Level

Choose from five consistency levels on the consistency spectrum. Refer to Cosmos DB documentation for their differences

AzureDataLakeStorageFileResourceService

Provides an Azure Data Lake Storage (ADLS) file resource for other components.

Use Cases

Fetch the specified file from Azure Data Lake Storage. The service provides higher performance compared to fetch processors when the data should be moved between different storages without any transformation.

Input Requirement: This component allows an incoming relationship.

  1. "Filesystem Name" = "${azure.filesystem}"

  2. "Directory Name" = "${azure.directory}"

  3. "File Name" = "${azure.filename}" .

  4. The "ADLS Credentials" property should specify an instance of the ADLSCredentialsService in order to provide credentials for accessing the filesystem. .

Tags: azure, microsoft, cloud, storage, adlsgen2, file, resource, datalake

Properties

ADLS Credentials

Controller Service used to obtain Azure Credentials.

Filesystem Name

Name of the Azure Storage File System (also called Container). It is assumed to be already existing.

Directory Name

Name of the Azure Storage Directory. The Directory Name cannot contain a leading '/'. The root directory can be designated by the empty string value. In case of the PutAzureDataLakeStorage processor, the directory will be created if not already existing.

File Name

The filename

AzureEventHubRecordSink

Format and send Records to Azure Event Hubs

Tags: azure, record, sink

Properties

Service Bus Endpoint

Provides the domain for connecting to Azure Event Hubs

Event Hub Namespace

Provides the host for connecting to Azure Event Hubs

Event Hub Name

Provides the Event Hub Name for connections

Transport Type

Advanced Message Queuing Protocol Transport Type for communication with Azure Event Hubs

Record Writer

Specifies the Controller Service to use for writing out the records.

Authentication Strategy

Strategy for authenticating to Azure Event Hubs

Shared Access Policy

The name of the shared access policy. This policy must have Send claims

Shared Access Policy Key

The primary or secondary key of the shared access policy

Partition Key

A hint for the Azure Event Hubs message broker on how to distribute messages across one or more partitions

AzureStorageCredentialsControllerService_v12

Provides credentials for Azure Storage processors using Azure Storage client library v12.

Tags: azure, microsoft, cloud, storage, blob, credentials, queue

Properties

Storage Account Name

The storage account name.

Endpoint Suffix

Storage accounts in public Azure always use a common FQDN suffix. Override this endpoint suffix with a different suffix in certain circumstances (like Azure Stack or non-public Azure regions).

Credentials Type

Credentials type to be used for authenticating to Azure

Account Key

The storage account key. This is an admin-like password providing access to every container in this account. It is recommended one uses Shared Access Signature (SAS) token, Managed Identity or Service Principal instead for fine-grained control with policies.

SAS Token

Shared Access Signature token (the leading '?' may be included)

Managed Identity Client ID

Client ID of the managed identity. The property is required when User Assigned Managed Identity is used for authentication. It must be empty in case of System Assigned Managed Identity.

Service Principal Tenant ID

Tenant ID of the Azure Active Directory hosting the Service Principal.

Service Principal Client ID

Client ID (or Application ID) of the Client/Application having the Service Principal.

Service Principal Client Secret

Password of the Client/Application.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

AzureStorageCredentialsControllerServiceLookup_v12

Provides an AzureStorageCredentialsService_v12 that can be used to dynamically select another AzureStorageCredentialsService_v12. This service requires an attribute named 'azure.storage.credentials.name' to be passed in, and will throw an exception if the attribute is missing. The value of 'azure.storage.credentials.name' will be used to select the AzureStorageCredentialsService_v12 that has been registered with that name. This will allow multiple AzureStorageCredentialsServices_v12 to be defined and registered, and then selected dynamically at runtime by tagging flow files with the appropriate 'azure.storage.credentials.name' attribute.

Tags: azure, microsoft, cloud, storage, blob, queue, credentials

Properties

Dynamic Properties

The name to register AzureStorageCredentialsService_v12

If 'azure.storage.credentials.name' attribute contains the name of the dynamic property, then the AzureStorageCredentialsService_v12 (registered in the value) will be selected.

BPCRecordSink

Sends records to Virtimo Business Process Center (BPC). BPC will log the records using the selected BPC Log Service.

Tags: virtimo, bpc

Properties

BPC Controller

Controller used to define the connection to the BPC. The API-Key used by the controller requires the 'LOG_SERVICE_WRITE_DATA' and 'LOG_SERVICE_CONFIG_GET_INSTANCES' rights.

BPC Logger

Select the logger from available ones.

BPC Logger ID

The ID of the logger (i.e. the Component ID of the Log Service in BPC).

Key Name

Records will be logged in BPC as 'parent' entries with an added key field using this name (each record’s key value will be a randomly generated UUID). This name should match the value of the 'idColumns' entry of the 'parent' entry in the BPC Log Service’s 'Keys' setting.

Max records sent per message

Send large record sets as separate messages to BPC. Each message will contain no more than this many records. If not specified, all records will be sent in a single message.

Suppress Null Values

Specifies how the writer should handle a null field

SYSDATE field name

If specified, a field with the given name and the value 'SYSDATE' will be included for each record sent. The BPC Log Service can be configured to replace 'SYSDATE' with the current date and time when logging the record. To properly configure the BPC Log Service, add an entry for the specified field within the BPC Log Service’s 'Fields' setting (for the parent index) using the data type 'timestamp'.

See Also

CEFReader

Parses CEF (Common Event Format) events, returning each row as a record. This reader allows for inferring a schema based on the first event in the FlowFile or providing an explicit schema for interpreting the values.

Tags: cef, record, reader, parser

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Raw Message Field

If set, the raw message will be added to the record, using the property value as the field name. This is not the same as the "rawEvent" extension field!

Invalid Field

Used when a line in the FlowFile cannot be parsed by the CEF parser. If set, instead of failing to process the FlowFile, a record with a single field is added; the field is named by this property and contains the raw message as its value.

DateTime Locale

The IETF BCP 47 representation of the Locale to be used when parsing date fields with long or short month names (e.g. may <en-US> vs. mai. <fr-FR>). The default value is generally safe. Only change it if you have issues parsing CEF messages.

Inference Strategy

Defines the set of fields that should be included in the schema and the way the fields are interpreted.

Schema Inference Cache

Specifies a Schema Cache to use when inferring the schema. If not populated, the schema will be inferred each time. However, if a cache is specified, the cache will first be consulted and if the applicable schema can be found, it will be used instead of inferring the schema.

Accept empty extensions

If set to true, empty extensions will be accepted and will be associated to a null value.

Additional Details

The CEFReader Controller Service serves as a means to read and interpret CEF messages.

The reader supports CEF Version 23. The expected format and the extension fields known by the Extension Dictionary are defined by the description of the ArcSight Common Event Format. The reader can work with Syslog prefixes and custom extensions. A couple of CEF message examples the reader can work with:

CEF:0|Company|Product|1.2.3|audit-login|Successful login|3|
Oct 12 04:16:11 localhost CEF:0|Company|Product|1.2.3|audit-login|Successful login|3|
Oct 12 04:16:11 localhost CEF:0|Company|Product|1.2.3|audit-login|Successful login|3|cn1Label=userid spt=46117 cn1=99999 cfp1=1.23  dst=127.0.0.1 c6a1=2345:0425:2CA1:0000:0000:0567:5673:23b5 dmac=00:0D:60:AF:1B:61 start=1479152665000 end=Jan 12 2017 12:23:45 dlat=456.789 loginsequence=123

Raw message

It is possible to preserve the original message in the produced record. This comes in handy when the message contains a Syslog prefix which is not part of the Record instance. In order to preserve the raw message, the “Raw Message Field” property must be set. The reader will use the value of this property as the field name and will add the raw message as a custom extension field. The value of the “Raw Message Field” must differ from the header fields and the extension fields known by the CEF Extension Dictionary. If the property is empty, the raw message will not be added.

When using a predefined schema, the field defined by the “Raw Message Field” property must appear in it as a STRING record field. If the schema is inferred, the field will be automatically added as an additional custom extension, regardless of the Inference Strategy.

Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field is omitted from the Record. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

  • Any data type can be coerced into a String type.

  • Any numeric data type (Byte, Short, Int, Long, Float, Double) can be coerced into any other numeric data type.

  • Any numeric value can be coerced into a Date, Time, or Timestamp type, by assuming that the Long value is the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • A String value can be coerced into a Date, Time, or Timestamp type, if its format matches the configured “Date Format,” “Time Format,” or “Timestamp Format.”

  • A String value can be coerced into a numeric value if the value is of the appropriate type. For example, the String value 8 can be coerced into any numeric type. However, the String value 8.2 can be coerced into a Double or Float type but not an Integer.

  • A String value of “true” or “false” (regardless of case) can be coerced into a Boolean value.

  • A String value that is not empty can be coerced into a Char type. If the String contains more than 1 character, the first character is used and the rest of the characters are ignored.

  • Any “date/time” type (Date, Time, Timestamp) can be coerced into any other “date/time” type.

  • Any “date/time” type can be coerced into a Long type, representing the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • Any “date/time” type can be coerced into a String. The format of the String is whatever DateFormat is configured for the corresponding property (Date Format, Time Format, Timestamp Format property).

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.
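
To make a few of these rules concrete, here is a tiny plain-Java sketch (not the reader's actual implementation) of the kinds of conversions described above: a numeric String to an int, a case-insensitive "true" to a boolean, epoch milliseconds to a Timestamp, and a non-empty String to its first character. The class and variable names are made up.

import java.sql.Timestamp;

public class CoercionSketch {
    public static void main(String[] args) {
        int asInt = Integer.parseInt("8");              // String "8" -> INT
        boolean asBool = Boolean.parseBoolean("TRUE");  // String "true"/"false" (any case) -> BOOLEAN
        Timestamp asTs = new Timestamp(1479152665000L); // numeric value -> TIMESTAMP (millis since epoch)
        char asChar = "abc".charAt(0);                  // non-empty String -> CHAR ('a', rest ignored)
        System.out.println(asInt + " " + asBool + " " + asTs + " " + asChar);
    }
}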

Schema inference

While NiFi’s Record API does require that each Record have a schema, it is often convenient to infer the schema based on the values in the data, rather than having to manually create a schema. This is accomplished by selecting a value of “Infer Schema” for the “Schema Access Strategy” property. When using this strategy, the Reader will determine the schema by first parsing all data in the FlowFile, keeping track of all fields that it has encountered and the type of each field. Once all data has been parsed, a schema is formed that encompasses all fields that have been encountered.

A common concern when inferring schemas is how to handle the condition of two values that have different types. For example, a custom extension field might have a Float value in one record and a String in another. In these cases, the inferred schema will contain a CHOICE data type with FLOAT and STRING options. Records will be allowed to have either value for the particular field.

The CEF format comes with a specification not only for the message format but also with directives for the content. Because of this, the data type of some fields is not determined by the actual value(s) in the FlowFile but by the CEF format. This includes header fields, which always have to appear and comply with the data types defined in the CEF format. Also, extension fields from the Extension Dictionary might or might not appear in the generated schema based on the FlowFile content, but if an extension field is added, its data type is bound by the CEF format. Custom extensions have no similar restrictions; their presence in the schema depends entirely on the FlowFile content.

Schema inference in CEFReader supports multiple strategies for convenience. These strategies determine which fields from the incoming CEF messages should be included in the schema. With this, one might filter out custom extensions, for example. It is important to mention that this has a serious effect on every further step in record processing. For example, using an Inference Strategy that omits fields together with a ConvertRecord processor will result in Records containing only a subset of the original fields.

Headers only

Using this strategy will result in a schema which contains only the header fields from the incoming message. All other fields (standard or custom extensions) will be ignored. The types of these fields are defined by the CEF format; regardless of the content of the message used as a template, their data types are also defined by the format.

Headers and extensions

In addition to the header fields, this strategy will include standard extensions from the messages in the FlowFile. This means that not all standard extensions will be part of the outgoing Record, only the ones the Schema Inference found in the incoming messages. The data types of these Record fields are determined by the CEF format, ignoring the actual value in the observed field.

With custom extensions inferred

While the types of the header and standard extension fields are bound by the CEF format, it is possible to add further fields to the message, called “custom extensions”. These fields are not part of the “Extension Dictionary”, thus their data type is not predefined. Using the “With custom extensions inferred” Inference Strategy, the CEFReader tries to determine the possible data type for these custom extension fields based on their value.

With custom extensions as strings

In some cases it is undesirable to let the Reader determine the type of the custom extensions. For convenience, CEFReader provides an Inference Strategy which, regardless of their values, considers custom extension fields as String data. Otherwise this strategy behaves like “With custom extensions inferred”.

Caching of Inferred Schemas

This Record Reader requires that, if a schema is to be inferred, all records be read in order to ensure that the schema that gets inferred is applicable for all records in the FlowFile. However, this can become expensive, especially if the data undergoes many different transformations. To alleviate the cost of inferring schemas, the Record Reader can be configured with a “Schema Inference Cache” by populating the property with that name. This is a Controller Service that can be shared by Record Readers and Record Writers.

Whenever a Record Writer is used to write data, if it is configured with a “Schema Cache,” it will also add the schema to the Schema Cache. This will result in an identifier for that schema being added as an attribute to the FlowFile.

Whenever a Record Reader is used to read data, if it is configured with a “Schema Inference Cache”, it will first look for a “schema.cache.identifier” attribute on the FlowFile. If the attribute exists, it will use the value of that attribute to lookup the schema in the schema cache. If it is able to find a schema in the cache with that identifier, then it will use that schema instead of reading, parsing, and analyzing the data to infer the schema. If the attribute is not available on the FlowFile, or if the attribute is available but the cache does not have a schema with that identifier, then the Record Reader will proceed to infer the schema as described above.

The end result is that users are able to chain together many different Processors to operate on Record-oriented data. Typically, only the first such Processor in the chain will incur the “penalty” of inferring the schema. For all other Processors in the chain, the Record Reader is able to simply lookup the schema in the Schema Cache by identifier. This allows the Record Reader to infer a schema accurately, since it is inferred based on all data in the FlowFile, and still allows this to happen efficiently since the schema will typically only be inferred once, regardless of how many Processors handle the data.

Handling invalid events

An event is considered invalid if it is malformed in a way that the underlying CEF parser cannot read it properly. CEFReader has two ways to deal with malformed events, determined by the usage of the “Invalid Field” property. If the property is not set, the reading will fail at the time of reading the first invalid event. If the property is set, the result of the read will be a record with a single field. The field is named based on the property, and the value of the field will be the original event text. By default, the “Invalid Field” property is not set.

When the “Invalid Field” property is set, the read records might contain both records representing well-formed CEF events and malformed ones as well. Because of this, further steps might be needed in order to separate these before further processing.

ConfluentEncodedSchemaReferenceReader

Reads Schema Identifier according to Confluent encoding as a header consisting of a byte marker and an integer represented as four bytes

Tags: confluent, schema, registry, kafka, avro

Properties

ConfluentEncodedSchemaReferenceWriter

Writes Schema Identifier according to Confluent encoding as a header consisting of a byte marker and an integer represented as four bytes

Tags: confluent, schema, registry, kafka, avro

Properties
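
To visualize the header layout that ConfluentEncodedSchemaReferenceReader and ConfluentEncodedSchemaReferenceWriter handle, here is a small plain-Java sketch. It assumes the common Confluent convention of a zero marker byte followed by the schema identifier as a 4-byte big-endian integer; the payload bytes and names below are arbitrary placeholders.

import java.nio.ByteBuffer;

public class ConfluentHeaderSketch {

    static byte[] write(int schemaId, byte[] payload) {
        ByteBuffer buffer = ByteBuffer.allocate(1 + 4 + payload.length);
        buffer.put((byte) 0);    // marker byte
        buffer.putInt(schemaId); // 4-byte schema identifier (big-endian by default)
        buffer.put(payload);     // encoded record follows the header
        return buffer.array();
    }

    static int readSchemaId(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        byte marker = buffer.get();
        if (marker != 0) {
            throw new IllegalArgumentException("Unexpected marker byte: " + marker);
        }
        return buffer.getInt();  // remaining bytes would be the encoded record
    }

    public static void main(String[] args) {
        byte[] framed = write(42, new byte[]{1, 2, 3});
        System.out.println(readSchemaId(framed)); // prints 42
    }
}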

ConfluentSchemaRegistry

Provides a Schema Registry that interacts with the Confluent Schema Registry so that those Schemas that are stored in the Confluent Schema Registry can be used in NiFi. The Confluent Schema Registry has a notion of a "subject" for schemas, which is their terminology for a schema name. When a Schema is looked up by name by this registry, it will find a Schema in the Confluent Schema Registry with that subject.

Tags: schema, registry, confluent, avro, kafka

Properties

Schema Registry URLs

A comma-separated list of URLs of the Schema Registry to interact with

SSL Context Service

Specifies the SSL Context Service to use for interacting with the Confluent Schema Registry

Communications Timeout

Specifies how long to wait to receive data from the Schema Registry before considering the communications a failure

Cache Size

Specifies how many Schemas should be cached from the Schema Registry

Cache Expiration

Specifies how long a Schema that is cached should remain in the cache. Once this time period elapses, a cached version of a schema will no longer be used, and the service will have to communicate with the Schema Registry again in order to obtain the schema.

Authentication Type

HTTP Client Authentication Type for Confluent Schema Registry

Username

Username for authentication to Confluent Schema Registry

Password

Password for authentication to Confluent Schema Registry

Dynamic Properties

request.header.*

Properties that begin with 'request.header.' are populated into a map and passed as http headers in REST requests to the Confluent Schema Registry

CSVReader

Parses CSV-formatted data, returning each row in the CSV file as a separate record. This reader allows for inferring a schema based on the first line of the CSV, if a 'header line' is present, or providing an explicit schema for interpreting the values. See Controller Service’s Usage for further documentation.

Tags: csv, parse, record, row, reader, delimited, comma, separated, values

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

CSV Parser

Specifies which parser to use to read CSV records. NOTE: Different parsers may support different subsets of functionality and may also exhibit different levels of performance.

Date Format

Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).

Time Format

Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).

CSV Format

Specifies which "format" the CSV data is in, or specifies if custom formatting should be used.

Value Separator

The character that is used to separate values/fields in a CSV Record. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Value Separator at runtime, then it will be skipped and the default Value Separator will be used.

Record Separator

Specifies the characters to use in order to separate CSV Records

Treat First Line as Header

Specifies whether or not the first line of CSV should be considered a Header or should be considered a record. If the Schema Access Strategy indicates that the columns must be defined in the header, then this property will be ignored, since the header must always be present and won’t be processed as a Record. Otherwise, if 'true', then the first line of CSV data will not be processed as a record and if 'false', then the first line will be interpreted as a record.

Ignore CSV Header Column Names

If the first line of a CSV is a header, and the configured schema does not match the fields named in the header line, this controls how the Reader will interpret the fields. If this property is true, then the field names mapped to each column are driven only by the configured schema and any fields not in the schema will be ignored. If this property is false, then the field names found in the CSV Header will be used as the names of the fields.

Quote Character

The character that is used to quote values so that escape characters do not have to be used. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Quote Character at runtime, then it will be skipped and the default Quote Character will be used.

Escape Character

The character that is used to escape characters that would otherwise have a specific meaning to the CSV Parser. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Escape Character at runtime, then it will be skipped and the default Escape Character will be used. Setting it to an empty string means no escape character should be used.

Comment Marker

The character that is used to denote the start of a comment. Any line that begins with this comment will be ignored.

Null String

Specifies a String that, if present as a value in the CSV, should be considered a null field instead of using the literal value.

Trim Fields

Whether or not white space should be removed from the beginning and end of fields

Character Set

The Character Encoding that is used to encode/decode the CSV file

Allow Duplicate Header Names

Whether duplicate header names are allowed. Header names are case-sensitive, for example "name" and "Name" are treated as separate fields. Handling of duplicate header names is CSV Parser specific (where applicable):

  • Apache Commons CSV - duplicate headers will result in column data "shifting" right with new fields created for "unknown_field_index_X" where "X" is the CSV column index number

  • Jackson CSV - duplicate headers will be de-duplicated with the field value being that of the right-most duplicate CSV column

  • FastCSV - duplicate headers will be de-duplicated with the field value being that of the left-most duplicate CSV column

Trim double quote

Whether or not to trim starting and ending double quotes. For example: with trim, the string '"test"' would be parsed to 'test'; without trim, it would be parsed to '"test"'. If set to 'false' it means full compliance with RFC-4180. Default value is true, with trim.

Additional Details

The CSVReader allows for interpreting input data as delimited Records. By default, a comma is used as the field separator, but this is configurable. It is common, for instance, to use a tab in order to read tab-separated values, or TSV.
There are pre-defined CSV formats in the reader like EXCEL. Further information regarding their settings can be found here: https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html
The reader allows for customization of the CSV Format, such as which character should be used to separate CSV fields, which character should be used for quoting and when to quote fields, which character should denote a comment, etc. The names of the fields may be specified either by having a “header line” as the first line in the CSV (in which case the Schema Access Strategy should be “Infer Schema” or “Use String Fields From Header”) or can be supplied by specifying the schema by using the Schema Text or looking up the schema in a Schema Registry.
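
As a library-level illustration (not the NiFi reader itself), the sketch below uses Apache Commons CSV, one of the parsers mentioned under the 'CSV Parser' property, to read tab-separated values with a header line, which is roughly what setting the Value Separator to a tab and 'Treat First Line as Header' to true amounts to. The sample data and names are made up.

import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class TsvSketch {
    public static void main(String[] args) throws Exception {
        String tsv = "name\tage\nJohn\t8\nJane\t10\n";
        // Tab as the value separator, first line treated as the header
        CSVFormat format = CSVFormat.DEFAULT
                .withDelimiter('\t')
                .withFirstRecordAsHeader();
        try (CSVParser parser = CSVParser.parse(new StringReader(tsv), format)) {
            for (CSVRecord record : parser) {
                System.out.println(record.get("name") + " -> " + record.get("age"));
            }
        }
    }
}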

Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field is omitted from the Record. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

  • Any data type can be coerced into a String type.

  • Any numeric data type (Byte, Short, Int, Long, Float, Double) can be coerced into any other numeric data type.

  • Any numeric value can be coerced into a Date, Time, or Timestamp type, by assuming that the Long value is the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • A String value can be coerced into a Date, Time, or Timestamp type, if its format matches the configured “Date Format,” “Time Format,” or “Timestamp Format.”

  • A String value can be coerced into a numeric value if the value is of the appropriate type. For example, the String value 8 can be coerced into any numeric type. However, the String value 8.2 can be coerced into a Double or Float type but not an Integer.

  • A String value of “true” or “false” (regardless of case) can be coerced into a Boolean value.

  • A String value that is not empty can be coerced into a Char type. If the String contains more than 1 character, the first character is used and the rest of the characters are ignored.

  • Any “date/time” type (Date, Time, Timestamp) can be coerced into any other “date/time” type.

  • Any “date/time” type can be coerced into a Long type, representing the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • Any “date/time” type can be coerced into a String. The format of the String is whatever DateFormat is configured for the corresponding property (Date Format, Time Format, Timestamp Format property).

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.
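To make these rules concrete, here is a hypothetical record whose raw values all arrive as Strings, shown alongside the schema type each value is coerced into (a sketch only; the field names and values are illustrative, not taken from any particular flow):

{
  "id":      { "raw value": "8",    "schema type": "int",     "coerced value": 8 },
  "balance": { "raw value": "8.2",  "schema type": "double",  "coerced value": 8.2 },
  "active":  { "raw value": "TRUE", "schema type": "boolean", "coerced value": true },
  "initial": { "raw value": "Jane", "schema type": "char",    "coerced value": "J" }
}

Attempting to coerce the raw value "8.2" into an int field, by contrast, would fail and an Exception would be thrown.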

Schema Inference

While NiFi’s Record API does require that each Record have a schema, it is often convenient to infer the schema based on the values in the data, rather than having to manually create a schema. This is accomplished by selecting a value of “Infer Schema” for the “Schema Access Strategy” property. When using this strategy, the Reader will determine the schema by first parsing all data in the FlowFile, keeping track of all fields that it has encountered and the type of each field. Once all data has been parsed, a schema is formed that encompasses all fields that have been encountered.

A common concern when inferring schemas is how to handle the condition of two values that have different types. For example, consider a FlowFile with the following two records:

name, age
John, 8
Jane, Ten

It is clear that the “name” field will be inferred as a STRING type. However, how should we handle the “age” field? Should the field be a CHOICE between INT and STRING? Should we prefer LONG over INT? Should we just use a STRING? Should the field be considered nullable?

To help understand how this Record Reader infers schemas, we have the following list of rules that are followed in the inference logic:

  • All fields are inferred to be nullable.

  • When two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), the inference engine prefers to use a “wider” data type over using a CHOICE data type. A data type “A” is said to be wider than data type “B” if and only if data type “A” encompasses all values of “B” in addition to other values. For example, the LONG type is wider than the INT type but not wider than the BOOLEAN type (and BOOLEAN is also not wider than LONG). INT is wider than SHORT. The STRING type is considered wider than all other types except MAP, RECORD, ARRAY, and CHOICE.

  • Before inferring the type of value, leading and trailing whitespace are removed. Additionally, if the value is surrounded by double-quotes (“), the double-quotes are removed. Therefore, the value 16 is interpreted the same as "16". Both will be interpreted as an INT. However, the value " 16" will be inferred as a STRING type because the white space is enclosed within double-quotes, which means that the white space is considered part of the value.

  • If the “Time Format,” “Timestamp Format,” or “Date Format” properties are configured, any value that would otherwise be considered a STRING type is first checked against the configured formats to see if it matches any of them. If the value matches the Timestamp Format, the value is considered a Timestamp field. If it matches the Date Format, it is considered a Date field. If it matches the Time Format, it is considered a Time field. In the unlikely event that the value matches more than one of the configured formats, they will be matched in the order: Timestamp, Date, Time. I.e., if a value matched both the Timestamp Format and the Date Format, the type that is inferred will be Timestamp. Because parsing dates and times can be expensive, it is advisable not to configure these formats if dates, times, and timestamps are not expected, or if processing the data as a STRING is acceptable. For use cases when this is important, though, the inference engine is intelligent enough to optimize the parsing by first checking several very cheap conditions. For example, the string’s length is examined to see if it is too long or too short to match the pattern. This results in far more efficient processing than would result if attempting to parse each string value as a timestamp.

  • The MAP type is never inferred.

  • The ARRAY type is never inferred.

  • The RECORD type is never inferred.

  • If a field exists but all values are null, then the field is inferred to be of type STRING.
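Applying these rules to the name/age example above: the values 8 and Ten are widened to STRING (STRING is wider than INT), and both fields are nullable. Expressed as a hypothetical Avro schema (a sketch only; the record name is arbitrary), the inferred schema would be equivalent to:

{
  "namespace": "nifi",
  "name": "inferred",
  "type": "record",
  "fields": [
    {
      "name": "name",
      "type": ["null", "string"]
    },
    {
      "name": "age",
      "type": ["null", "string"]
    }
  ]
}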

Caching of Inferred Schemas

This Record Reader requires that if a schema is to be inferred, that all records be read in order to ensure that the schema that gets inferred is applicable for all records in the FlowFile. However, this can become expensive, especially if the data undergoes many different transformations. To alleviate the cost of inferring schemas, the Record Reader can be configured with a “Schema Inference Cache” by populating the property with that name. This is a Controller Service that can be shared by Record Readers and Record Writers.

Whenever a Record Writer is used to write data, if it is configured with a “Schema Cache,” it will also add the schema to the Schema Cache. This will result in an identifier for that schema being added as an attribute to the FlowFile.

Whenever a Record Reader is used to read data, if it is configured with a “Schema Inference Cache”, it will first look for a “schema.cache.identifier” attribute on the FlowFile. If the attribute exists, it will use the value of that attribute to lookup the schema in the schema cache. If it is able to find a schema in the cache with that identifier, then it will use that schema instead of reading, parsing, and analyzing the data to infer the schema. If the attribute is not available on the FlowFile, or if the attribute is available but the cache does not have a schema with that identifier, then the Record Reader will proceed to infer the schema as described above.

The end result is that users are able to chain together many different Processors to operate on Record-oriented data. Typically, only the first such Processor in the chain will incur the “penalty” of inferring the schema. For all other Processors in the chain, the Record Reader is able to simply lookup the schema in the Schema Cache by identifier. This allows the Record Reader to infer a schema accurately, since it is inferred based on all data in the FlowFile, and still allows this to happen efficiently since the schema will typically only be inferred once, regardless of how many Processors handle the data.
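As a rough sketch of what this looks like in practice, a Record Writer configured with a Schema Cache adds a schema identifier attribute to the FlowFile, which downstream Record Readers then use for the cache lookup (the identifier value below is purely illustrative):

{
  "schema.cache.identifier": "b1946ac9-example-identifier"
}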

Examples
Example 1

As an example, consider a FlowFile whose contents consists of the following:

id, name, balance, join_date, notes
1, John, 48.23, 04/03/2007, "Our very
first customer!"
2, Jane, 1245.89, 08/22/2009,
3, Frank Franklin, "48481.29", 04/04/2016,

Additionally, let’s consider that this Controller Service is configured with the Schema Registry pointing to an AvroSchemaRegistry and the schema is configured as the following:

{
  "namespace": "nifi",
  "name": "balances",
  "type": "record",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "balance",
      "type": "double"
    },
    {
      "name": "join_date",
      "type": {
        "type": "int",
        "logicalType": "date"
      }
    },
    {
      "name": "notes",
      "type": "string"
    }
  ]
}

In the example above, we see that the ‘join_date’ column is a Date type. In order for the CSV Reader to be able to properly parse a value as a date, we need to provide the reader with the date format to use. In this example, we would configure the Date Format property to be MM/dd/yyyy to indicate that it is a two-digit month, followed by a two-digit day, followed by a four-digit year - each separated by a slash. In this case, the result will be that this FlowFile consists of 3 different records. The first record will contain the following values:

Field Name   Field Value
id           1
name         John
balance      48.23
join_date    04/03/2007
notes        Our very first customer!

The second record will contain the following values:

Field Name   Field Value
id           2
name         Jane
balance      1245.89
join_date    08/22/2009
notes

The third record will contain the following values:

Field Name   Field Value
id           3
name         Frank Franklin
balance      48481.29
join_date    04/04/2016
notes

Example 2 - Schema with CSV Header Line

When CSV data consists of a header line that outlines the column names, the reader provides a couple of different properties for configuring how to handle these column names. The “Schema Access Strategy” property as well as the associated properties (“Schema Registry,” “Schema Text,” and “Schema Name” properties) can be used to specify how to obtain the schema. If the “Schema Access Strategy” is set to “Use String Fields From Header” then the header line of the CSV will be used to determine the schema. Otherwise, a schema will be referenced elsewhere. But what happens if a schema is obtained from a Schema Registry, for instance, and the CSV Header indicates a different set of column names?

For example, let’s say that the following schema is obtained from the Schema Registry:

{
  "namespace": "nifi",
  "name": "balances",
  "type": "record",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "balance",
      "type": "double"
    },
    {
      "name": "memo",
      "type": "string"
    }
  ]
}

And the CSV contains the following data:

id, name, balance, notes
1, John Doe, 123.45, First Customer

Note here that our schema indicates that the final column is named “memo” whereas the CSV Header indicates that it is named “notes.”

In this case, the reader will look at the “Ignore CSV Header Column Names” property. If this property is set to “true” then the column names provided in the CSV will simply be ignored and the last column will be called “memo.” However, if the “Ignore CSV Header Column Names” property is set to “false” then the result will be that the last column will be named “notes” and each record will have a null value for the “memo” column.

With “Ignore CSV Header Column Names” property set to “true”:

Field Name   Field Value
id           1
name         John Doe
balance      123.45
memo         First Customer

With “Ignore CSV Header Column Names” property set to “false”:

Field Name   Field Value
id           1
name         John Doe
balance      123.45
notes        First Customer
memo         null

CSVRecordLookupService

A reloadable CSV file-based lookup service. When the lookup key is found in the CSV file, the columns are returned as a Record. All returned fields will be strings. The first line of the CSV file is considered the header.

Tags: lookup, cache, enrich, join, csv, reloadable, key, value, record

Properties

CSV File

Path to a CSV File in which the key value pairs can be looked up.

CSV Format

Specifies which "format" the CSV data is in, or specifies if custom formatting should be used.

Character Set

The Character Encoding that is used to decode the CSV file.

Lookup Key Column

The field in the CSV file that will serve as the lookup key. This is the field that will be matched against the property specified in the lookup processor.

Ignore Duplicates

Ignore duplicate keys for records in the CSV file.

Value Separator

The character that is used to separate values/fields in a CSV Record. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Value Separator at runtime, then it will be skipped and the default Value Separator will be used.

Quote Character

The character that is used to quote values so that escape characters do not have to be used. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Quote Character at runtime, then it will be skipped and the default Quote Character will be used.

Quote Mode

Specifies how fields should be quoted when they are written

Comment Marker

The character that is used to denote the start of a comment. Any line that begins with this character will be ignored.

Escape Character

The character that is used to escape characters that would otherwise have a specific meaning to the CSV Parser. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Escape Character at runtime, then it will be skipped and the default Escape Character will be used. Setting it to an empty string means no escape character should be used.

Trim Fields

Whether or not white space should be removed from the beginning and end of fields
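As an illustration (the file contents, column names, and key below are hypothetical), consider a lookup file such as the following with "id" configured as the Lookup Key Column:

id, name, region
1, John, EMEA
2, Jane, APAC

A lookup for the key "2" would then return a Record roughly equivalent to the following, with every field returned as a string:

{
  "id": "2",
  "name": "Jane",
  "region": "APAC"
}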

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

CSVRecordSetWriter

Writes the contents of a RecordSet as CSV data. The first line written will be the column names (unless the 'Include Header Line' property is false). All subsequent lines will be the values corresponding to the record fields.

Tags: csv, result, set, recordset, record, writer, serializer, row, tsv, tab, separated, delimited

Properties

Schema Write Strategy

Specifies how the schema for a Record should be added to the data.

Schema Cache

Specifies a Schema Cache to add the Record Schema to so that Record Readers can quickly lookup the schema.

Schema Reference Writer

Service implementation responsible for writing FlowFile attributes or content header with Schema reference information

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Date Format

Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).

Time Format

Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).

CSV Format

Specifies which "format" the CSV data is in, or specifies if custom formatting should be used.

CSV Writer

Specifies which writer implementation to use to write CSV records. NOTE: Different writers may support different subsets of functionality and may also exhibit different levels of performance.

Value Separator

The character that is used to separate values/fields in a CSV Record. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Value Separator at runtime, then it will be skipped and the default Value Separator will be used.

Include Header Line

Specifies whether or not the CSV column names should be written out as the first line.

Quote Character

The character that is used to quote values so that escape characters do not have to be used. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Quote Character at runtime, then it will be skipped and the default Quote Character will be used.

Escape Character

The character that is used to escape characters that would otherwise have a specific meaning to the CSV Parser. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Escape Character at runtime, then it will be skipped and the default Escape Character will be used. Setting it to an empty string means no escape character should be used.

Comment Marker

The character that is used to denote the start of a comment. Any line that begins with this character will be ignored.

Null String

Specifies a String that, if present as a value in the CSV, should be considered a null field instead of using the literal value.

Trim Fields

Whether or not white space should be removed from the beginning and end of fields

Quote Mode

Specifies how fields should be quoted when they are written

Record Separator

Specifies the characters to use in order to separate CSV Records

Include Trailing Delimiter

If true, a trailing delimiter will be added to each CSV Record that is written. If false, the trailing delimiter will be omitted.

Character Set

The Character Encoding that is used to encode/decode the CSV file
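As a rough sketch of the output, writing a record set containing the following two records (field names and values are illustrative) with "Include Header Line" set to true and the default comma Value Separator would produce one header line followed by one line per record:

{ "id": 1, "name": "John", "balance": 48.23 }
{ "id": 2, "name": "Jane", "balance": 1245.89 }

Resulting CSV content:

id,name,balance
1,John,48.23
2,Jane,1245.89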

DatabaseRecordLookupService

A relational-database-based lookup service. When the lookup key is found in the database, the specified columns (or all if Lookup Value Columns are not specified) are returned as a Record. Only one row will be returned for each lookup; duplicate database entries are ignored.

Tags: lookup, cache, enrich, join, rdbms, database, reloadable, key, value, record

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain a connection to the database

Table Name

The name of the database table to be queried. Note that this may be case-sensitive depending on the database.

Lookup Key Column

The column in the table that will serve as the lookup key. This is the column that will be matched against the property specified in the lookup processor. Note that this may be case-sensitive depending on the database.

Lookup Value Columns

A comma-delimited list of columns in the table that will be returned when the lookup key matches. Note that this may be case-sensitive depending on the database.

Cache Size

Specifies how many lookup values/records should be cached. The cache is shared for all tables and keeps a map of lookup values to records. Setting this property to zero means no caching will be done and the table will be queried for each lookup value in each record. If the lookup table changes often or the most recent data must be retrieved, do not use the cache.

Clear Cache on Enabled

Whether to clear the cache when this service is enabled. If the Cache Size is zero then this property is ignored. Clearing the cache when the service is enabled ensures that the service will first go to the database to get the most recent data.

Cache Expiration

Time interval to clear all cache entries. If the Cache Size is zero then this property is ignored.

Default Decimal Precision

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'precision' denoting the number of available digits is required. Generally, precision is defined by the column data type definition or the database engine's default. However, undefined precision (0) can be returned from some database engines. 'Default Decimal Precision' is used when writing those undefined precision numbers.

Default Decimal Scale

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'scale' denoting the number of available decimal digits is required. Generally, scale is defined by the column data type definition or the database engine's default. However, when undefined precision (0) is returned, scale can also be uncertain with some database engines. 'Default Decimal Scale' is used when writing those undefined numbers. If a value has more decimals than the specified scale, the value will be rounded; e.g. 1.53 becomes 2 with scale 0, and 1.5 with scale 1.

DatabaseRecordSink

Provides a service to write records using a configured database connection.

Tags: db, jdbc, database, connection, record

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain a connection to the database for sending records.

Catalog Name

The name of the catalog that the statement should update. This may not apply for the database that you are updating. In this case, leave the field empty

Schema Name

The name of the schema that the table belongs to. This may not apply for the database that you are updating. In this case, leave the field empty

Table Name

The name of the table that the statement should affect.

Translate Field Names

If true, the Processor will attempt to translate field names into the appropriate column names for the table specified. If false, the field names must match the column names exactly, or the column will not be updated

Unmatched Field Behavior

If an incoming record has a field that does not map to any of the database table’s columns, this property specifies how to handle the situation

Unmatched Column Behavior

If an incoming record does not have a field mapping for all of the database table’s columns, this property specifies how to handle the situation

Quote Column Identifiers

Enabling this option will cause all column names to be quoted, allowing you to use reserved words as column names in your tables.

Quote Table Identifiers

Enabling this option will cause the table name to be quoted to support the use of special characters in the table name.

Max Wait Time

The maximum amount of time allowed for a running SQL statement; zero means there is no limit. A max time of less than 1 second will be treated as zero.

DatabaseTableSchemaRegistry

Provides a service for generating a record schema from a database table definition. The service is configured with a table name and a database connection, and fetches the table metadata (i.e. table definition) such as column names, data types, nullability, etc.

Tags: schema, registry, database, table

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain a connection to the database for retrieving table information.

Catalog Name

The name of the catalog used to locate the desired table. This may not apply for the database that you are querying. In this case, leave the field empty. Note that if the property is set and the database is case-sensitive, the catalog name must match the database’s catalog name exactly.

Schema Name

The name of the schema that the table belongs to. This may not apply for the database that you are updating. In this case, leave the field empty. Note that if the property is set and the database is case-sensitive, the schema name must match the database’s schema name exactly. Also notice that if the same table name exists in multiple schemas and Schema Name is not specified, the service will find those tables and give an error if the different tables have the same column name(s).

DBCPConnectionPool

Provides a Database Connection Pooling Service. Connections can be requested from the pool and returned after usage.

Tags: dbcp, jdbc, database, connection, pooling, store

Properties

Database Connection URL

A database connection URL used to connect to a database. May contain database system name, host, port, database name and some parameters. The exact syntax of a database connection URL is specified by your DBMS.

Database Driver Class Name

Database driver class name

Database Driver Location(s)

Comma-separated list of files/folders and/or URLs containing the driver JAR and its dependencies (if any). For example '/var/tmp/mariadb-java-client-1.1.7.jar'

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Database User

Database user name

Password

The password for the database user

Max Wait Time

The maximum amount of time that the pool will wait (when there are no available connections) for a connection to be returned before failing, or -1 to wait indefinitely.

Max Total Connections

The maximum number of active connections that can be allocated from this pool at the same time, or negative for no limit.

Validation query

Validation query used to validate connections before returning them. When a connection is invalid, it is dropped and a new valid connection will be returned. Note that using validation might incur some performance penalty.

Minimum Idle Connections

The minimum number of connections that can remain idle in the pool without extra ones being created. Set to zero to allow no idle connections.

Max Idle Connections

The maximum number of connections that can remain idle in the pool without extra ones being released. Set to any negative value to allow unlimited idle connections.

Max Connection Lifetime

The maximum lifetime of a connection. After this time is exceeded the connection will fail the next activation, passivation or validation test. A value of zero or less means the connection has an infinite lifetime.

Time Between Eviction Runs

The time period to sleep between runs of the idle connection evictor thread. When non-positive, no idle connection evictor thread will be run.

Minimum Evictable Idle Time

The minimum amount of time a connection may sit idle in the pool before it is eligible for eviction.

Soft Minimum Evictable Idle Time

The minimum amount of time a connection may sit idle in the pool before it is eligible for eviction by the idle connection evictor, with the extra condition that at least a minimum number of idle connections remain in the pool. When the not-soft version of this option is set to a positive value, it is examined first by the idle connection evictor: when idle connections are visited by the evictor, idle time is first compared against it (without considering the number of idle connections in the pool) and then against this soft option, including the minimum idle connections constraint.

Dynamic Properties

JDBC property name

JDBC driver property name and value applied to JDBC connections.

SENSITIVE.JDBC property name

JDBC driver property name prefixed with 'SENSITIVE.' handled as a sensitive property.
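For illustration, a minimal configuration for a PostgreSQL database might look like the following (all values are examples only; the exact connection URL syntax, driver class name, and driver location depend on your DBMS and driver):

{
  "Database Connection URL": "jdbc:postgresql://localhost:5432/mydb",
  "Database Driver Class Name": "org.postgresql.Driver",
  "Database Driver Location(s)": "/var/lib/drivers/postgresql-42.7.3.jar",
  "Database User": "nifi",
  "Password": "********",
  "Max Total Connections": "8"
}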

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

DBCPConnectionPoolLookup

Provides a DBCPService that can be used to dynamically select another DBCPService. This service requires an attribute named 'database.name' to be passed in when asking for a connection, and will throw an exception if the attribute is missing. The value of 'database.name' will be used to select the DBCPService that has been registered with that name. This will allow multiple DBCPServices to be defined and registered, and then selected dynamically at runtime by tagging flow files with the appropriate 'database.name' attribute.

Tags: dbcp, jdbc, database, connection, pooling, store

Properties

Dynamic Properties

The name to register DBCPService

If 'database.name' attribute contains the name of the dynamic property, then the DBCPService (registered in the value) will be selected.
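For example (the names below are hypothetical), two dynamic properties could register two pools, and the value of the incoming FlowFile's 'database.name' attribute selects which one serves the connection:

{
  "dynamic properties": {
    "warehouse": "a DBCPConnectionPool pointing at the warehouse database",
    "reporting": "a DBCPConnectionPool pointing at the reporting database"
  },
  "flowfile attributes": {
    "database.name": "reporting"
  }
}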

DistributedMapCacheLookupService

Allows choosing a distributed map cache client to retrieve the value associated with a key. The coordinates that are passed to the lookup must contain the key 'key'.

Tags: lookup, enrich, key, value, map, cache, distributed

Properties

Distributed Cache Service

The Controller Service that is used to get the cached values.

Character Encoding

Specifies a character encoding to use.
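For example, the coordinates passed in by the lookup processor must carry the cache key under the name 'key' (the value below is illustrative):

{
  "key": "user-12345"
}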

ElasticSearchClientServiceImpl

A controller service for accessing an Elasticsearch client, using the Elasticsearch (low-level) REST Client.

Tags: elasticsearch, elasticsearch6, elasticsearch7, elasticsearch8, client

Properties

HTTP Hosts

A comma-separated list of HTTP hosts that host Elasticsearch query nodes. Note that the Host is included in requests as a header (typically including domain and port, e.g. elasticsearch:9200).

Path Prefix

Sets the path’s prefix for every request used by the http client. For example, if this is set to "/my/path", then any client request will become "/my/path/" + endpoint. In essence, every request’s endpoint is prefixed by this pathPrefix. The path prefix is useful for when Elasticsearch is behind a proxy that provides a base path or a proxy that requires all paths to start with '/'; it is not intended for other purposes and it should not be supplied in other scenarios

Authorization Scheme

Authorization Scheme used for optional authentication to Elasticsearch.

Username

The username to use with XPack security.

Password

The password to use with XPack security.

API Key ID

Unique identifier of the API key.

API Key

Encoded API key.

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections. This service only applies if the Elasticsearch endpoint(s) have been secured with TLS/SSL.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP

Connect timeout

Controls the amount of time, in milliseconds, before a timeout occurs when trying to connect.

Read timeout

Controls the amount of time, in milliseconds, before a timeout occurs when waiting for a response.

Charset

The charset to use for interpreting the response from Elasticsearch.

Suppress Null/Empty Values

Specifies how the writer should handle null and empty fields (including objects and arrays)

Enable Compression

Whether the REST client should compress requests using gzip content encoding and add the "Accept-Encoding: gzip" header to receive compressed responses

Send Meta Header

Whether to send a "X-Elastic-Client-Meta" header that describes the runtime environment. It contains information that is similar to what could be found in User-Agent. Using a separate header allows applications to use User-Agent for their own needs, e.g. to identify application version or other environment information

Strict Deprecation

Whether the REST client should return any response containing at least one warning header as a failure

Node Selector

Selects Elasticsearch nodes that can receive requests. Used to keep requests away from dedicated Elasticsearch master nodes

Sniff Cluster Nodes

Periodically sniff for nodes within the Elasticsearch cluster via the Elasticsearch Node Info API. If Elasticsearch security features are enabled (defaults to "true" for 8.x+), the Elasticsearch user must have the "monitor" or "manage" cluster privilege to use this API. Note that all HTTP Hosts (and those that may be discovered within the cluster using the Sniffer) must use the same protocol, e.g. http or https, and be contactable using the same client settings. Finally, the Elasticsearch "network.publish_host" must match one of the "network.bind_host" list entries; see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html for more information

Sniffer Interval

Interval between Cluster sniffer operations

Sniffer Request Timeout

Cluster sniffer timeout for node info requests

Sniff on Failure

Enable sniffing on failure, meaning that after each failure the Elasticsearch nodes list gets updated straightaway rather than at the following ordinary sniffing round

Sniffer Failure Delay

Delay between an Elasticsearch request failure and updating available Cluster nodes using the Sniffer

Dynamic Properties

The name of a Request Header to add

Adds the specified property name/value as a Request Header in the Elasticsearch requests.

Additional Details

Sniffing

The Elasticsearch Sniffer can be used to locate Elasticsearch Nodes within a Cluster to which you are connecting. This can be beneficial if your cluster dynamically changes over time, e.g. new Nodes are added to maintain performance during heavy load.

Sniffing can also be used to update the list of Hosts within the Cluster if a connection Failure is encountered during operation. In order to “Sniff on Failure”, you must also enable “Sniff Cluster Nodes”.

Not all situations make sense to use Sniffing, for example if:

  • Elasticsearch is situated behind a load balancer, which dynamically routes connections from NiFi

  • Elasticsearch is on a different network to NiFi

There may also be a need to set some of the Elasticsearch Networking Advanced Settings, such as network.publish_host, to ensure that the HTTP Hosts found by the Sniffer are accessible to NiFi. For example, Elasticsearch may use an internal network publish_host that is inaccessible to NiFi, when it should instead use an address/IP that NiFi can reach. It may also be necessary to add this same address to Elasticsearch’s network.bind_host list.

See the Elasticsearch blog post “Elasticsearch sniffing best practices: What, when, why, how” for more details.

Resources Usage Consideration

This Elasticsearch client relies on a RestClient using the Apache HTTP Async Client. By default, it will start one dispatcher thread and a number of worker threads used by the connection manager; there will be as many worker threads as the number of locally detected processors/cores on the NiFi host. Consequently, it is highly recommended to have only one instance of this controller service per remote Elasticsearch destination and to share that controller service across all the Elasticsearch processors of the NiFi flows. Having a very high number of instances could lead to resource starvation and result in OOM errors.

ElasticSearchLookupService

Lookup a record from Elasticsearch Server associated with the specified document ID. The coordinates that are passed to the lookup must contain the key 'id'.

Tags: lookup, enrich, record, elasticsearch

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Client Service

An ElasticSearch client service to use for running queries.

Index

The name of the index to read from

Type

The type of this document (used by Elasticsearch for indexing and searching)

Dynamic Properties

A JSONPath expression

Retrieves an object using JSONPath from the result document and places it in the return Record at the specified Record Path.

Additional Details

Description:

This lookup service uses ElasticSearch as its data source. Mappings in LookupRecord map record paths to paths within an ElasticSearch document. Example:

/user/name => user.contact.name

That would map the record path /user/name to an embedded document named contact with a field named name.

The query that is assembled from these is a boolean query where all the criteria are under the must list. In addition, wildcards are not supported right now and all criteria are translated into literal match queries.

Post-Processing

Because an ElasticSearch result might be structured differently than the record which will be enriched by this service, users can specify an additional set of mappings on this lookup service that map JsonPath operations to record paths. Example:

$.user.contact.email => /user/email_address

Would copy the field email from the embedded document contact into the record at that path.
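Putting the two examples above together as a sketch (the document content below is hypothetical), a lookup that matches the following Elasticsearch document, combined with the JsonPath mapping $.user.contact.email => /user/email_address, would place jane.doe@example.com at /user/email_address in the enriched record:

{
  "user": {
    "contact": {
      "name": "Jane Doe",
      "email": "jane.doe@example.com"
    }
  }
}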

ElasticSearchStringLookupService

Lookup a string value from Elasticsearch Server associated with the specified document ID. The coordinates that are passed to the lookup must contain the key 'id'.

Tags: lookup, enrich, value, key, elasticsearch

Properties

Client Service

An ElasticSearch client service to use for running queries.

Index

The name of the index to read from

Type

The type of this document (used by Elasticsearch for indexing and searching)

EmailRecordSink

Provides a RecordSinkService that can be used to send records in email using the specified writer for formatting.

Tags: email, smtp, record, sink, send, write

Properties

From

Specifies the Email address to use as the sender. Comma separated sequence of addresses following RFC822 syntax.

To

The recipients to include in the To-Line of the email. Comma separated sequence of addresses following RFC822 syntax.

CC

The recipients to include in the CC-Line of the email. Comma separated sequence of addresses following RFC822 syntax.

BCC

The recipients to include in the BCC-Line of the email. Comma separated sequence of addresses following RFC822 syntax.

Subject

The email subject

SMTP Hostname

The hostname of the SMTP Server that is used to send Email Notifications

SMTP Port

The Port used for SMTP communications

SMTP Auth

Flag indicating whether authentication should be used

SMTP Username

Username for the SMTP account

SMTP Password

Password for the SMTP account

SMTP STARTTLS

Flag indicating whether STARTTLS should be enabled. If the server does not support STARTTLS, the connection continues without the use of TLS

SMTP SSL

Flag indicating whether SSL should be enabled

SMTP X-Mailer Header

X-Mailer used in the header of the outgoing email

Record Writer

Specifies the Controller Service to use for writing out the records.

EmbeddedHazelcastCacheManager

A service that runs embedded Hazelcast and provides cache instances backed by it. The server does not ask for authentication; it is recommended to run it within a secured network.

Tags: hazelcast, cache

Properties

Hazelcast Cluster Name

Name of the Hazelcast cluster.

Hazelcast Port

Port for the Hazelcast instance to use.

Hazelcast Clustering Strategy

Specifies with what strategy the Hazelcast cluster should be created.

Hazelcast Instances

Only used with "Explicit" Clustering Strategy! List of NiFi instance host names which should be part of the Hazelcast cluster. Host names are separated by comma. The port specified in the "Hazelcast Port" property will be used as server port. The list must contain every instance that will be part of the cluster. Other instances will join the Hazelcast cluster as clients.

Additional Details

This service starts and manages an embedded Hazelcast instance. The cache manager has direct access to the instance - and the data stored in it. However, the instance opens a port for potential clients to join, and this cannot be prevented. Note that this might leave the instance open for rogue clients to join.

It is possible to have multiple independent Hazelcast instances on the same host (whether via EmbeddedHazelcastCacheManager or externally) without any interference by setting the properties accordingly. If there are no other instances, the default cluster name and port number can simply be used.

The service supports multiple ways to set up a Hazelcast cluster. This is controlled by the property, named “Hazelcast Clustering Strategy”. The following strategies may be used:

None

This is the default value. Used when sharing data between nodes is not required. With this value, every NiFi node in the cluster (if it is clustered) connects to its local Hazelcast server only. The Hazelcast servers do not form a cluster.

All Nodes

Can be used only when NiFi is clustered. Using this strategy will result in a single Hazelcast cluster consisting of the embedded instances of all the NiFi nodes. This strategy requires all Hazelcast servers to listen on the same port. Having different port numbers (based on an expression, for example) would prevent the cluster from forming.

The controller service automatically gathers the host list from the NiFi cluster itself when it is enabled. It is not required for all the nodes to have been successfully joined at this point, but the join must have been initiated. When the controller service is enabled at the start of the NiFi instance, the enabling of the service will be prevented until the node is considered clustered.

Hazelcast can accept nodes that join at a later time. As the new node has a comprehensive list of the expected instances - including the already existing ones and itself - Hazelcast will be able to reach the expected state. Beware: this may take significant time.

Explicit

Can be used only when NiFi is clustered. The Explicit Clustering Strategy allows more control over the Hazelcast cluster members. Like the All Nodes Clustering Strategy, this strategy works with a list of Hazelcast servers, but instance discovery is not automatic.

This strategy uses the property named “Hazelcast Instances” to determine the members of the Hazelcast cluster. This list of hosts must contain all the instances expected to be part of the cluster. The instance list may contain only hosts which are part of the NiFi cluster. The port specified in the “Hazelcast Port” property will be used as the Hazelcast server port.

In case the current node is not part of the instance list, the service will start a Hazelcast client. The client will then connect to the Hazelcast addresses specified in the instance list. Users of the service should not perceive any difference in functionality.
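As a sketch, an Explicit configuration for a three-node NiFi cluster might look like the following (host names and port are examples only):

{
  "Hazelcast Clustering Strategy": "Explicit",
  "Hazelcast Port": "5701",
  "Hazelcast Instances": "nifi-node-1,nifi-node-2,nifi-node-3"
}

A NiFi node whose host name is not in the list would join this Hazelcast cluster as a client rather than as a server.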

ExcelReader

Parses a Microsoft Excel document, returning each row in each sheet as a separate record. This reader allows for inferring a schema from all the required sheets or providing an explicit schema for interpreting the values. See the Controller Service’s Usage for further documentation. This reader is currently only capable of processing .xlsx (XSSF 2007 OOXML file format) Excel documents and not older .xls (HSSF '97(-2007) file format) documents.

Tags: excel, spreadsheet, xlsx, parse, record, row, reader, values, cell

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Starting Row

The row number of the first row to start processing (One based). Use this to skip over rows of data at the top of a worksheet that are not part of the dataset. When using the 'Use Starting Row' strategy this should be the column header row.

Required Sheets

Comma-separated list of Excel document sheet names whose rows should be extracted from the Excel document. If this property is left blank then all the rows from all the sheets will be extracted from the Excel document. The list of names is case sensitive. Any sheets not specified in this value will be ignored. An exception will be thrown if any specified sheet is not found.

Protection Type

Specifies whether an Excel spreadsheet is protected by a password or not.

Password

The password for a password protected Excel spreadsheet

Date Format

Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).

Time Format

Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).

Additional Details

The ExcelReader allows for interpreting input data as Records. Each row in an Excel spreadsheet is a record and each cell is considered a field. The reader allows for choosing which row to start from and which sheets in a spreadsheet to ingest. When using the “Use Starting Row” strategy, the field names will be assumed to be the column names from the configured starting row. If there are any column(s) from the starting row which are blank, they are automatically assigned a field name using the cell number prefixed with “column_”. When using the “Infer Schema” strategy, the field names will be assumed to be the cell numbers of each column prefixed with “column_”. Otherwise, the names of fields can be supplied when specifying the schema by using the Schema Text or looking up the schema in a Schema Registry.

Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field is omitted from the Record. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

  • Any data type can be coerced into a String type.

  • Any numeric data type (Byte, Short, Int, Long, Float, Double) can be coerced into any other numeric data type.

  • Any numeric value can be coerced into a Date, Time, or Timestamp type, by assuming that the Long value is the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • A String value can be coerced into a Date, Time, or Timestamp type, if its format matches the configured “Date Format,” “Time Format,” or “Timestamp Format.”

  • A String value can be coerced into a numeric value if the value is of the appropriate type. For example, the String value 8 can be coerced into any numeric type. However, the String value 8.2 can be coerced into a Double or Float type but not an Integer.

  • A String value of “true” or “false” (regardless of case) can be coerced into a Boolean value.

  • A String value that is not empty can be coerced into a Char type. If the String contains more than 1 character, the first character is used and the rest of the characters are ignored.

  • Any “date/time” type (Date, Time, Timestamp) can be coerced into any other “date/time” type.

  • Any “date/time” type can be coerced into a Long type, representing the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • Any “date/time” type can be coerced into a String. The format of the String is whatever DateFormat is configured for the corresponding property (Date Format, Time Format, Timestamp Format property).

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.

Use Starting Row and Schema Inference

While NiFi’s Record API does require that each Record have a schema, it is often convenient to infer the schema based on the values in the data, rather than having to manually create a schema. This is accomplished by selecting either value of “Use Starting Row” or “Infer Schema” for the “Schema Access Strategy” property. When using the “Use Starting Row” strategy, the Reader will determine the schema by parsing the first ten rows after the configured starting row of the data in the FlowFile all the while keeping track of all fields that it has encountered and the type of each field. A schema is then formed that encompasses all encountered fields. A schema can even be inferred if there are blank lines within those ten rows, but if they are all blank, then this strategy will fail to create a schema. When using the “Infer Schema” strategy, the Reader will determine the schema by first parsing all data in the FlowFile, keeping track of all fields that it has encountered and the type of each field. Once all data has been parsed, a schema is formed that encompasses all fields that have been encountered.

A common concern when inferring schemas is how to handle the condition of two values that have different types. For example, consider a FlowFile with the following two records:

name, age
John, 8
Jane, Ten

It is clear that the “name” field will be inferred as a STRING type. However, how should we handle the “age” field? Should the field be a CHOICE between INT and STRING? Should we prefer LONG over INT? Should we just use a STRING? Should the field be considered nullable?

To help understand how this Record Reader infers schemas, we have the following list of rules that are followed in the inference logic:

  • All fields are inferred to be nullable.

  • When two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), the inference engine prefers to use a “wider” data type over using a CHOICE data type. A data type “A” is said to be wider than data type “B” if and only if data type “A” encompasses all values of “B” in addition to other values. For example, the LONG type is wider than the INT type but not wider than the BOOLEAN type (and BOOLEAN is also not wider than LONG). INT is wider than SHORT. The STRING type is considered wider than all other types except MAP, RECORD, ARRAY, and CHOICE.

  • Before inferring the type of value, leading and trailing whitespace are removed. Additionally, if the value is surrounded by double-quotes (“), the double-quotes are removed. Therefore, the value 16 is interpreted the same as "16". Both will be interpreted as an INT. However, the value " 16" will be inferred as a STRING type because the white space is enclosed within double-quotes, which means that the white space is considered part of the value.

  • If the “Time Format,” “Timestamp Format,” or “Date Format” properties are configured, any value that would otherwise be considered a STRING type is first checked against the configured formats to see if it matches any of them. If the value matches the Timestamp Format, the value is considered a Timestamp field. If it matches the Date Format, it is considered a Date field. If it matches the Time Format, it is considered a Time field. In the unlikely event that the value matches more than one of the configured formats, they will be matched in the order: Timestamp, Date, Time. I.e., if a value matched both the Timestamp Format and the Date Format, the type that is inferred will be Timestamp. Because parsing dates and times can be expensive, it is advisable not to configure these formats if dates, times, and timestamps are not expected, or if processing the data as a STRING is acceptable. For use cases when this is important, though, the inference engine is intelligent enough to optimize the parsing by first checking several very cheap conditions. For example, the string’s length is examined to see if it is too long or too short to match the pattern. This results in far more efficient processing than would result if attempting to parse each string value as a timestamp.

  • The MAP type is never inferred.

  • The ARRAY type is never inferred.

  • The RECORD type is never inferred.

  • If a field exists but all values are null, then the field is inferred to be of type STRING.

Caching of Inferred Schemas

This Record Reader requires that if a schema is to be inferred, that all records be read in order to ensure that the schema that gets inferred is applicable for all records in the FlowFile. However, this can become expensive, especially if the data undergoes many different transformations. To alleviate the cost of inferring schemas, the Record Reader can be configured with a “Schema Inference Cache” by populating the property with that name. This is a Controller Service that can be shared by Record Readers and Record Writers.

Whenever a Record Writer is used to write data, if it is configured with a “Schema Cache,” it will also add the schema to the Schema Cache. This will result in an identifier for that schema being added as an attribute to the FlowFile.

Whenever a Record Reader is used to read data, if it is configured with a “Schema Inference Cache”, it will first look for a “schema.cache.identifier” attribute on the FlowFile. If the attribute exists, it will use the value of that attribute to lookup the schema in the schema cache. If it is able to find a schema in the cache with that identifier, then it will use that schema instead of reading, parsing, and analyzing the data to infer the schema. If the attribute is not available on the FlowFile, or if the attribute is available but the cache does not have a schema with that identifier, then the Record Reader will proceed to infer the schema as described above.

The end result is that users are able to chain together many different Processors to operate on Record-oriented data. Typically, only the first such Processor in the chain will incur the “penalty” of inferring the schema. For all other Processors in the chain, the Record Reader is able to simply lookup the schema in the Schema Cache by identifier. This allows the Record Reader to infer a schema accurately, since it is inferred based on all data in the FlowFile, and still allows this to happen efficiently since the schema will typically only be inferred once, regardless of how many Processors handle the data.

Examples
Example 1

As an example, consider a FlowFile whose contents are an Excel spreadsheet whose only sheet consists of the following:

id, name, balance, join_date, notes
1, John, 48.23, 04/03/2007, "Our very first customer!"
2, Jane, 1245.89, 08/22/2009,
3, Frank Franklin, "48481.29", 04/04/2016,

Additionally, let’s consider that this Controller Service is configured to skip the first line and is configured with the Schema Registry pointing to an AvroSchemaRegistry which contains the following schema:

{
  "namespace": "nifi",
  "name": "balances",
  "type": "record",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "balance",
      "type": "double"
    },
    {
      "name": "join_date",
      "type": {
        "type": "int",
        "logicalType": "date"
      }
    },
    {
      "name": "notes",
      "type": "string"
    }
  ]
}
json

In the example above, we see that the ‘join_date’ column is a Date type. In order for the Excel Reader to be able to properly parse a value as a date, we need to provide the reader with the date format to use. In this example, we would configure the Date Format property to be MM/dd/yyyy to indicate that it is a two-digit month, followed by a two-digit day, followed by a four-digit year - each separated by a slash. In this case, the result will be that this FlowFile consists of 3 different records. The first record will contain the following values:

Field Name    Field Value
id            1
name          John
balance       48.23
join_date     04/03/2007
notes         Our very first customer!

The second record will contain the following values:

Field Name    Field Value
id            2
name          Jane
balance       1245.89
join_date     08/22/2009
notes

The third record will contain the following values:

Field Name    Field Value
id            3
name          Frank Franklin
balance       48481.29
join_date     04/04/2016
notes

ExternalHazelcastCacheManager

A service that provides cache instances backed by Hazelcast running outside of NiFi.

Tags: hazelcast, cache

Properties

Hazelcast Cluster Name

Name of the Hazelcast cluster.

Hazelcast Server Address

Addresses of one or more of the Hazelcast instances, in host:port format, separated by commas.

Hazelcast Initial Backoff

The amount of time the client waits before it tries to reestablish connection for the first time.

Hazelcast Maximum Backoff

The maximum amount of time the client waits before it tries to reestablish connection.

Hazelcast Backoff Multiplier

A multiplier by which the wait time is increased before each attempt to reestablish connection.

Hazelcast Connection Timeout

The maximum amount of time the client tries to connect or reconnect before giving up.

Additional Details

This service connects to an external Hazelcast cluster (or standalone instance) as client. Hazelcast 4.0.0 or newer version is required. The connection to the server is kept alive using Hazelcast’s built-in reconnection capability. This might be fine-tuned by setting the following properties:

  • Hazelcast Initial Backoff

  • Hazelcast Maximum Backoff

  • Hazelcast Backoff Multiplier

  • Hazelcast Connection Timeout

If the service cannot connect or is abruptly disconnected, it tries to reconnect after a backoff time. The amount of time to wait before the first attempt is defined by the Initial Backoff. If the connection is still not successful, the client waits gradually longer between attempts until the waiting time reaches the value set in the ‘Hazelcast Maximum Backoff’ property (or the connection timeout, whichever is smaller). The backoff time after the first attempt is always based on the previous amount, multiplied by the Backoff Multiplier. Note: the real backoff time might differ slightly, as some “jitter” is added to the calculation in order to avoid regularity.
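
For example, with a Hazelcast Initial Backoff of 1 second, a Backoff Multiplier of 2 and a Hazelcast Maximum Backoff of 10 seconds (values chosen purely for illustration), the waits before successive reconnection attempts would be roughly 1 s, 2 s, 4 s, 8 s, 10 s, 10 s, and so on - before the jitter is applied and subject to the Hazelcast Connection Timeout.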

FreeFormTextRecordSetWriter

Writes the contents of a RecordSet as free-form text. The configured text is able to make use of the Expression Language to reference each of the fields that are available in a Record, as well as the attributes in the FlowFile and variables. If there is a name collision, the field name/value is used before attributes or variables. Each record in the RecordSet will be separated by a single newline character.

Tags: text, freeform, expression, language, el, record, recordset, resultset, writer, serialize

Properties

Text

The text to use when writing the results. This property will evaluate the Expression Language using any of the fields available in a Record (see the example below).

Character Set

The Character set to use when writing the data to the FlowFile
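
As an example of how the Text property is evaluated, consider the three balance records from the Excel example earlier in this document and a (hypothetical) Text value of:

${name} has a balance of ${balance}

Because each record is written on its own line, the output would be:

John has a balance of 48.23
Jane has a balance of 1245.89
Frank Franklin has a balance of 48481.29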

GCPCredentialsControllerService

Defines credentials for Google Cloud Platform processors. Uses Application Default credentials without configuration. Application Default credentials support the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a credential file, the config generated by gcloud auth application-default login, AppEngine/Compute Engine service accounts, etc.

Tags: gcp, credentials, provider

Properties

Use Application Default Credentials

If true, uses Google Application Default Credentials, which checks the GOOGLE_APPLICATION_CREDENTIALS environment variable for a filepath to a service account JSON key, the config generated by the gcloud sdk, the App Engine service account, and the Compute Engine service account.

Use Compute Engine Credentials

If true, uses Google Compute Engine Credentials of the Compute Engine VM Instance which NiFi is running on.

Service Account JSON File

Path to a file containing a Service Account key file in JSON format.

Service Account JSON

The raw JSON containing a Service Account keyfile.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Delegation Strategy

The Delegation Strategy determines which account is used when calls are made with the GCP Credential.

Delegation User

This user will be impersonated by the service account for API calls. API calls made using this credential will appear as if they are coming from the delegate user, with the delegate user’s access. Any scopes supplied from processors to this credential must have domain-wide delegation set up with the service account.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

GCSFileResourceService

Provides a Google Cloud Storage (GCS) file resource for other components.

Use Cases

Fetch a specific file from GCS. The service provides higher performance compared to fetch processors when the data should be moved between different storages without any transformation.

Input Requirement: This component allows an incoming relationship.

  1. "Bucket" = "${gcs.bucket}"

  2. "Name" = "${filename}" .

  3. The "GCP Credentials Provider Service" property should specify an instance of the GCPCredentialsService in order to provide credentials for accessing the bucket. .

Tags: file, resource, gcs

Properties

Bucket

Bucket of the object.

Name

Name of the object.

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

GrokReader

Provides a mechanism for reading unstructured text data, such as log files, and structuring the data so that it can be processed. The service is configured using Grok patterns. The service reads from a stream of data and splits each message that it finds into a separate Record, each containing the fields that are configured. If a line in the input does not match the expected message pattern, the line of text is either considered to be part of the previous message or is skipped, depending on the configuration, with the exception of stack traces. A stack trace that is found at the end of a log message is considered to be part of the previous message but is added to the 'stackTrace' field of the Record. If a record has no stack trace, it will have a NULL value for the stackTrace field (assuming that the schema does in fact include a stackTrace field of type String). Assuming that the schema includes a '_raw' field of type String, the raw message will be included in the Record.

Tags: grok, logs, logfiles, parse, unstructured, text, record, reader, regex, pattern, logstash

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Grok Patterns

Grok Patterns to use for parsing logs. If not specified, a built-in default Pattern file will be used. If specified, all patterns specified will override the default patterns. See the Controller Service’s Additional Details for a list of pre-defined patterns.

Grok Expressions

Specifies the format of a log line in Grok format. This allows the Record Reader to understand how to parse each log line. The property supports one or more Grok expressions. The Reader attempts to parse input lines according to the configured order of the expressions. If a line in the log file does not match any expressions, the line will be assumed to belong to the previous log message. If other Grok patterns are referenced by this expression, they need to be supplied in the Grok Pattern File property.

No Match Behavior

If a line of text is encountered and it does not match the given Grok Expression, and it is not part of a stack trace, this property specifies how the text should be processed.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Additional Details

The GrokReader Controller Service provides a means for parsing and structuring input that is made up of unstructured text, such as log files. Grok allows users to add a naming construct to Regular Expressions so that they can be composed in order to create expressions that are easier to manage and work with. This Controller Service consists of one Required Property and a few Optional Properties. The optional property named Grok Pattern File specifies the filename of a file that contains Grok Patterns that can be used for parsing log data. If not specified, a default patterns file will be used; its contents are provided below. There are also properties for specifying the schema to use when parsing data. The schema is not required. However, when data is parsed, a Record is created that contains all the fields present in the Grok Expression (explained below), and all fields are of type String. If a schema is chosen, a field can be declared to be a different, compatible type, such as a number. Additionally, if the schema does not contain one of the fields in the parsed data, that field will be ignored. This can be used to filter out fields that are not of interest.

Note: a _raw field is also added to preserve the original message.

The Required Property is named Grok Expression and specifies how to parse each incoming record. This is done by providing a Grok Expression such as: %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:thread}\] %{DATA:class} %{GREEDYDATA:message}. This Expression will parse Apache NiFi log messages. This is accomplished by specifying that a line begins with the TIMESTAMP_ISO8601 pattern (which is a Regular Expression defined in the default Grok Patterns File). The value that matches this pattern is then given the name timestamp. As a result, the value that matches this pattern will be assigned to a field named timestamp in the Record that is produced by this Controller Service.

No Match Behavior

If a line is encountered in the FlowFile that does not match the configured Grok Expression, it is assumed that the line is part of the previous message. If the line is the start of a stack trace, then the entire stack trace is read in and assigned to a field named STACK_TRACE. Otherwise, the line will be processed according to the value defined in the “No Match Behavior” property.

Append to Previous Message

The line is appended to the last field defined in the Grok Expression. This is typically done because the last field is a ‘message’ type of field, which can consist of new-lines.

Skip Line

The line is completely dismissed.

Raw Line

All fields will be null except the _raw field, which will contain the line, allowing further processing downstream.

Examples

Assuming two messages (<6>Feb 28 12:00:00 192.168.0.1 aliyun[11111]: [error] Syslog test and This is a bad message...) with the Grok Expression <%{POSINT:priority}>%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{WORD:ident}%{GREEDYDATA:message}, and assuming a JSON Writer, the following results will be generated:

Append to Previous Message
[
  {
    "priority": "6",
    "timestamp": "Feb 28 12:00:00",
    "hostname": "192.168.0.1",
    "ident": "aliyun",
    "message": "[11111]: [error] Syslog test\nThis is a bad message...",
    "stackTrace": null,
    "_raw": "<6>Feb 28 12:00:00 192.168.0.1 aliyun[11111]: [error] Syslog test\nThis is a bad message..."
  }
]
json
Skip Line
[
  {
    "priority": "6",
    "timestamp": "Feb 28 12:00:00",
    "hostname": "192.168.0.1",
    "ident": "aliyun",
    "message": "[11111]: [error] Syslog test",
    "stackTrace": null,
    "_raw": "<6>Feb 28 12:00:00 192.168.0.1 aliyun[11111]: [error] Syslog test"
  }
]
json
Raw Line
[
  {
    "priority": "6",
    "timestamp": "Feb 28 12:00:00",
    "hostname": "192.168.0.1",
    "ident": "aliyun",
    "message": "[11111]: [error] Syslog test",
    "stackTrace": null,
    "_raw": "<6>Feb 28 12:00:00 192.168.0.1 aliyun[11111]: [error] Syslog test"
  },
  {
    "priority": null,
    "timestamp": null,
    "hostname": null,
    "ident": null,
    "message": null,
    "stackTrace": null,
    "_raw": "This is a bad message..."
  }
]
json
Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field is omitted from the Record. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

  • Any data type can be coerced into a String type.

  • Any numeric data type (Byte, Short, Int, Long, Float, Double) can be coerced into any other numeric data type.

  • Any numeric value can be coerced into a Date, Time, or Timestamp type, by assuming that the Long value is the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • A String value can be coerced into a Date, Time, or Timestamp type, if its format matches the configured “Date Format,” “Time Format,” or “Timestamp Format.”

  • A String value can be coerced into a numeric value if the value is of the appropriate type. For example, the String value 8 can be coerced into any numeric type. However, the String value 8.2 can be coerced into a Double or Float type but not an Integer.

  • A String value of “true” or “false” (regardless of case) can be coerced into a Boolean value.

  • A String value that is not empty can be coerced into a Char type. If the String contains more than 1 character, the first character is used and the rest of the characters are ignored.

  • Any “date/time” type (Date, Time, Timestamp) can be coerced into any other “date/time” type.

  • Any “date/time” type can be coerced into a Long type, representing the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • Any “date/time” type can be coerced into a String. The format of the String is whatever DateFormat is configured for the corresponding property (Date Format, Time Format, Timestamp Format property).

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.
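
For example, using the Syslog records shown above: if the configured schema declared the priority field as an int, the String value "6" would be coerced into the integer 6; if a field were declared as a timestamp, the String value would be parsed according to the configured “Timestamp Format”; and any parsed field that is not present in the schema would simply be omitted from the Record.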

Examples

As an example, consider that this Controller Service is configured with the following properties:

Property Name    Property Value
Grok Expression  %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:thread}\] %{DATA:class} %{GREEDYDATA:message}

Additionally, let’s consider a FlowFile whose contents consist of the following:

2016-08-04 13:26:32,473 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1fa27ea5 has been interrupted; no longer leader for role 'Cluster Coordinator'
2016-08-04 13:26:32,474 ERROR [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController One
Two
Three
org.apache.nifi.exception.UnitTestException: Testing to ensure we are able to capture stack traces
        at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
        at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
        at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
        at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
        at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
        at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
        at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
Caused by: org.apache.nifi.exception.UnitTestException: Testing to ensure we are able to capture stack traces
    at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
    at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
    at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
    at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
    at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
    ... 12 common frames omitted
2016-08-04 13:26:35,475 WARN [Curator-Framework-0] org.apache.curator.ConnectionState Connection attempt unsuccessful after 3008 (greater than max timeout of 3000). Resetting connection and trying again with a new connection.

In this case, the result will be that this FlowFile consists of 3 different records. The first record will contain the following values:

Field Name     Field Value
timestamp      2016-08-04 13:26:32,473
level          INFO
thread         Leader Election Notification Thread-1
class          o.a.n.c.l.e.CuratorLeaderElectionManager
message        org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1fa27ea5 has been interrupted; no longer leader for role ‘Cluster Coordinator’
STACK_TRACE    null

The second record will contain the following values:

Field Name     Field Value
timestamp      2016-08-04 13:26:32,474
level          ERROR
thread         Leader Election Notification Thread-2
class          o.apache.nifi.controller.FlowController
message        One Two Three
STACK_TRACE    org.apache.nifi.exception.UnitTestException: Testing to ensure we are able to capture stack traces
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
               at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
               at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
               at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
               at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
               at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
               at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
               Caused by: org.apache.nifi.exception.UnitTestException: Testing to ensure we are able to capture stack traces
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.getElectedActiveCoordinatorAddress(NodeClusterCoordinator.java:185)
               ... 12 common frames omitted

The third record will contain the following values:

Field Name     Field Value
timestamp      2016-08-04 13:26:35,475
level          WARN
thread         Curator-Framework-0
class          org.apache.curator.ConnectionState
message        Connection attempt unsuccessful after 3008 (greater than max timeout of 3000). Resetting connection and trying again with a new connection.
STACK_TRACE    null

Default Patterns

The following patterns are available in the default Grok Pattern File:

# Log Levels
LOGLEVEL ([Aa]lert|ALERT|[Tt]race|TRACE|[Dd]ebug|DEBUG|[Nn]otice|NOTICE|[Ii]nfo|INFO|[Ww]arn?(?:ing)?|WARN?(?:ING)?|[Ee]rr?(?:or)?|ERR?(?:OR)?|[Cc]rit?(?:ical)?|CRIT?(?:ICAL)?|[Ff]atal|FATAL|[Ss]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)|FINE|FINER|FINEST|CONFIG

# Syslog Dates: Month Day HH:MM:SS
SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
PROG (?:[\w._/%-]+)
SYSLOGPROG %{PROG:program}(?:\[%{POSINT:pid}\])?
SYSLOGHOST %{IPORHOST}
SYSLOGFACILITY <%{NONNEGINT:facility}.%{NONNEGINT:priority}>
HTTPDATE %{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME} %{INT}

# Months: January, Feb, 3, 03, 12, December
MONTH \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b
MONTHNUM (?:0?[1-9]|1[0-2])
MONTHNUM2 (?:0[1-9]|1[0-2])
MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])

# Days: Monday, Tue, Thu, etc...
DAY (?:Mon(?:day)?|Tue(?:sday)?|Wed(?:nesday)?|Thu(?:rsday)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)

# Years?
YEAR (?>\d\d){1,2}
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
# '60' is a leap second in most time standards and thus is valid.
SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])

# datestamp is YYYY/MM/DD-HH:MM:SS.UUUU (or something like it)
DATE_US_MONTH_DAY_YEAR %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
DATE_US_YEAR_MONTH_DAY %{YEAR}[/-]%{MONTHNUM}[/-]%{MONTHDAY}
DATE_US %{DATE_US_MONTH_DAY_YEAR}|%{DATE_US_YEAR_MONTH_DAY}
DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
ISO8601_TIMEZONE (?:Z|[+-]%{HOUR}(?::?%{MINUTE}))
ISO8601_SECOND (?:%{SECOND}|60)
TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
DATE %{DATE_US}|%{DATE_EU}
DATESTAMP %{DATE}[- ]%{TIME}
TZ (?:[PMCE][SD]T|UTC)
DATESTAMP_RFC822 %{DAY} %{MONTH} %{MONTHDAY} %{YEAR} %{TIME} %{TZ}
DATESTAMP_RFC2822 %{DAY}, %{MONTHDAY} %{MONTH} %{YEAR} %{TIME} %{ISO8601_TIMEZONE}
DATESTAMP_OTHER %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{TZ} %{YEAR}
DATESTAMP_EVENTLOG %{YEAR}%{MONTHNUM2}%{MONTHDAY}%{HOUR}%{MINUTE}%{SECOND}


POSINT \b(?:[1-9][0-9]*)\b
NONNEGINT \b(?:[0-9]+)\b
WORD \b\w+\b
NOTSPACE \S+
SPACE \s*
DATA .*?
GREEDYDATA .*
QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))
UUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}

USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b

# Networking
MAC (?:%{CISCOMAC}|%{WINDOWSMAC}|%{COMMONMAC})
CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?
IPV4 (?<![0-9])(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))(?![0-9])
IP (?:%{IPV6}|%{IPV4})
HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)
HOST %{HOSTNAME}
IPORHOST (?:%{HOSTNAME}|%{IP})
HOSTPORT %{IPORHOST}:%{POSINT}

# paths
PATH (?:%{UNIXPATH}|%{WINPATH})
UNIXPATH (?>/(?>[\w_%!$@:.,-]+|\\.)*)+
TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
WINPATH (?>[A-Za-z]+:|\\)(?:\\[^\\?*]*)+
URIPROTO [A-Za-z]+(\+[A-Za-z+]+)?
URIHOST %{IPORHOST}(?::%{POSINT:port})?
# uripath comes loosely from RFC1738, but mostly from what Firefox
# doesn't turn into %XX
URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+
#URIPARAM \?(?:[A-Za-z0-9]+(?:=(?:[^&]*))?(?:&(?:[A-Za-z0-9]+(?:=(?:[^&]*))?)?)*)?
URIPARAM \?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]]*
URIPATHPARAM %{URIPATH}(?:%{URIPARAM})?
URI %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM})?

# Shortcuts
QS %{QUOTEDSTRING}

# Log formats
SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
COMMONAPACHELOG %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
COMBINEDAPACHELOG %{COMMONAPACHELOG} %{QS:referrer} %{QS:agent}

HadoopDBCPConnectionPool

Provides a Database Connection Pooling Service for Hadoop related JDBC services. This service requires that the Database Driver Location(s) contains some version of a hadoop-common JAR, or a shaded JAR that shades hadoop-common.

Tags: dbcp, jdbc, database, connection, pooling, store, hadoop

Properties

Database Connection URL

A database connection URL used to connect to a database. May contain database system name, host, port, database name and some parameters. The exact syntax of a database connection URL is specified by your DBMS.

Database Driver Class Name

Database driver class name

Database Driver Location(s)

Comma-separated list of files/folders and/or URLs containing the driver JAR and its dependencies. For example '/var/tmp/phoenix-client.jar'. NOTE: It is required that the resources specified by this property provide the classes from hadoop-common, such as Configuration and UserGroupInformation.

Hadoop Configuration Resources

A file, or comma separated list of files, which contain the Hadoop configuration (core-site.xml, etc.). Without this, Hadoop will search the classpath, or will revert to a default configuration. Note that to enable authentication with Kerberos, the appropriate properties must be set in the configuration files.

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Database User

Database user name

Password

The password for the database user

Max Wait Time

The maximum amount of time that the pool will wait (when there are no available connections) for a connection to be returned before failing, or -1 to wait indefinitely.

Max Total Connections

The maximum number of active connections that can be allocated from this pool at the same time, or negative for no limit.

Validation query

Validation query used to validate connections before returning them. When connection is invalid, it gets dropped and new valid connection will be returned. Note!! Using validation might have some performance penalty.

Minimum Idle Connections

The minimum number of connections that can remain idle in the pool without extra ones being created. Set to zero to allow no idle connections.

Max Idle Connections

The maximum number of connections that can remain idle in the pool without extra ones being released. Set to any negative value to allow unlimited idle connections.

Max Connection Lifetime

The maximum lifetime of a connection. After this time is exceeded the connection will fail the next activation, passivation or validation test. A value of zero or less means the connection has an infinite lifetime.

Time Between Eviction Runs

The time period to sleep between runs of the idle connection evictor thread. When non-positive, no idle connection evictor thread will be run.

Minimum Evictable Idle Time

The minimum amount of time a connection may sit idle in the pool before it is eligible for eviction.

Soft Minimum Evictable Idle Time

The minimum amount of time a connection may sit idle in the pool before it is eligible for eviction by the idle connection evictor, with the extra condition that at least a minimum number of idle connections remain in the pool. When the not-soft version of this option is set to a positive value, it is examined first by the idle connection evictor: when idle connections are visited by the evictor, idle time is first compared against it (without considering the number of idle connections in the pool) and then against this soft option, including the minimum idle connections constraint.

Dynamic Properties

The name of a Hadoop configuration property.

These properties will be set on the Hadoop configuration after loading any provided configuration files.
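
For example, adding a dynamic property named dfs.client.use.datanode.hostname with a value of true (a hypothetical illustration) would set that Hadoop property on the configuration after the files listed in 'Hadoop Configuration Resources' have been loaded.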

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

HazelcastMapCacheClient

An implementation of DistributedMapCacheClient that uses Hazelcast as the backing cache. This service relies on another controller service, set in the Hazelcast Cache Manager property, which manages the actual Hazelcast calls.

Tags: hazelcast, cache, map

Properties

Hazelcast Cache Manager

A Hazelcast Cache Manager which manages connections to Hazelcast and provides cache instances.

Hazelcast Cache Name

The name of a given cache. A Hazelcast cluster may handle multiple independent caches, each identified by a name. Clients using caches with the same name are working on the same data structure within Hazelcast.

Hazelcast Entry Lifetime

Indicates how long the written entries should exist in Hazelcast. Setting it to '0 secs' means that the data will exist until its deletion or until the Hazelcast server is shut down. Using EmbeddedHazelcastCacheManager as cache manager will not provide policies to limit the size of the cache.

Additional Details

This implementation of distributed map cache is backed by Hazelcast. The Hazelcast connection is provided and maintained by an instance of HazelcastCacheManager. One HazelcastCacheManager might serve multiple cache clients. This implementation uses the IMap data structure. The identifier of the Hazelcast IMap will be the same as the value of the property Hazelcast Cache Name. It is recommended for all HazelcastMapCacheClient instances to use different cache names.

The implementation supports the atomic method family defined in AtomicDistributedMapCacheClient. This is achieved by maintaining a revision number for every entry. The revision is an 8-byte long integer that is increased whenever the entry is updated. The revision is also preserved by modifications that are not part of the atomic method family, but those methods are intended mainly for regular management of the entries. It is not recommended to work with elements by mixing the two method families.

The convention for all the entries is to reserve the first 8 bytes for the revision. The rest of the content is the serialized payload.
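
Schematically, each stored entry value follows this layout (an informal sketch of the convention described above):

bytes 0-7 : revision (8-byte long integer, increased on update)
bytes 8-n : serialized payload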

HikariCPConnectionPool

Provides a Database Connection Pooling Service based on HikariCP. Connections can be requested from the pool and returned after use.

Tags: dbcp, hikari, jdbc, database, connection, pooling, store

Properties

Database Connection URL

A database connection URL used to connect to a database. May contain database system name, host, port, database name and some parameters. The exact syntax of a database connection URL is specified by your DBMS.

Database Driver Class Name

The fully-qualified class name of the JDBC driver. Example: com.mysql.jdbc.Driver

Database Driver Location(s)

Comma-separated list of files/folders and/or URLs containing the driver JAR and its dependencies (if any). For example '/var/tmp/mariadb-java-client-1.1.7.jar'

Kerberos User Service

Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos

Database User

Database user name

Password

The password for the database user

Max Wait Time

The maximum amount of time that the pool will wait (when there are no available connections) for a connection to be returned before failing, or 0 <time units> to wait indefinitely.

Max Total Connections

This property controls the maximum size that the pool is allowed to reach, including both idle and in-use connections. Basically this value will determine the maximum number of actual connections to the database backend. A reasonable value for this is best determined by your execution environment. When the pool reaches this size, and no idle connections are available, the service will block for up to connectionTimeout milliseconds before timing out.

Validation Query

Validation Query used to validate connections before returning them. When connection is invalid, it gets dropped and new valid connection will be returned. NOTE: Using validation might have some performance penalty.

Minimum Idle Connections

This property controls the minimum number of idle connections that HikariCP tries to maintain in the pool. If the idle connections dip below this value and total connections in the pool are less than 'Max Total Connections', HikariCP will make a best effort to add additional connections quickly and efficiently. It is recommended that this property be set equal to 'Max Total Connections'.

Max Connection Lifetime

The maximum lifetime of a connection. After this time is exceeded the connection will fail the next activation, passivation or validation test. A value of zero or less means the connection has an infinite lifetime.

Dynamic Properties

JDBC property name

Specifies a property name and value to be set on the JDBC connection(s). If Expression Language is used, evaluation will be performed upon the controller service being enabled. Note that no flow file input (attributes, e.g.) is available for use in Expression Language constructs for these properties.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

HttpRecordSink

Format and send Records to a configured URI using HTTP POST. The Record Writer formats the records, which are sent as the body of the HTTP POST request. JsonRecordSetWriter is often used with this service because many HTTP endpoints expect a JSON body.

Tags: http, post, record, sink

Properties

API URL

The URL which receives the HTTP requests.

Maximum Batch Size

Specifies the maximum number of records to send in the body of each HTTP request. Zero means the batch size is not limited, and all records are sent together in a single HTTP request.

Record Writer

Specifies the Controller Service to use for writing out the records.

Web Service Client Provider

Controller service to provide the HTTP client for sending the HTTP requests.

OAuth2 Access Token Provider

OAuth2 service that provides the access tokens for the HTTP requests.

HybridRESTClientController

Defines the address and credentials of a BPC, allowing BPC processors and services to send messages to the BPC.

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

Target URL

The URL of the BPC, including scheme, host, port and path. This URL will be used when sending messages.

Browser Base URL

The BPC’s base URL used to access the BPC from your browser (e.g. https://mycompany.virtimo.cloud/bpc). Provenance events where this service was used to send data will contain a link (built on this base URL) which you can click to view the data within the BPC from this browser. If not set, the 'URL'-property will be used.

Sender ID

How the BPC will identify messages received from this service. If the BPC receives messages from different IGUASU instances, please make sure that this ID is unique for each IGUASU instance.

API Key

The API key used to communicate with the Virtimo system. Use either this API key or basic authentication.

Basic Authentication Username

The username to be used by the client to authenticate against the Remote URL. Cannot include control characters (0-31), ':', or DEL (127). No longer supported as of BPC 4.

Basic Authentication Password

The password to be used by the client to authenticate against the Remote URL.

Connection Timeout

Max wait time for connection to remote service.

Read Timeout

Max wait time for response from remote service.

Idle Timeout

Max idle time before closing connection to the remote service.

Max Idle Connections

Max number of idle connections to keep open.

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL (https) connections. It is also used to connect to HTTPS Proxy.

HybridRESTServerController

Defines a REST server that can be called from suitable Virtimo systems (IGUASU, INUBIT, BPC).

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

Hostname

The Hostname to bind to. If not specified, will bind to all hosts

Listening Port

The Port to listen on for incoming HTTP requests

Request Expiration

Specifies how long a request should be left unanswered before being evicted from the cache and being responded to with a Service Unavailable status code

Maximum Thread Pool Size

The maximum number of threads to be used by the embedded Jetty server. The value can be set between 8 and 1000. The value of this property affects the performance of the flows and the operating system, therefore the default value should only be changed in justified cases. A value that is less than the default value may be suitable if only a small number of HTTP clients connect to the server. A greater value may be suitable if a large number of HTTP clients are expected to make requests to the server simultaneously.

SSL Context Service

The SSL Context Service used to secure the server. If specified, the server will only accept HTTPS requests, otherwise, the server will only accept HTTP requests.

SSL Client Authentication

Specifies whether or not the server should authenticate the client by its certificate. This value is ignored if the <SSL Context Service> property is not specified or the SSL Context provided uses only a KeyStore and not a TrustStore.

Basic Authentication Username

If set, a client can authenticate using this username.

Basic Authentication Password

If set, a client can authenticate using this password.

HybridWebSocketClientController

Client that initiates a standing connection to a Virtimo cloud system (IGUASU). The appropriate processors can receive and send messages via this connection without opening an external port on this system.

Multi-Processor Use Cases

Integrate your on-prem system using IGUASU Gateway with your IGUASU on the cloud.

Notes: IGUASU Gateway is a light-weight version of IGUASU ideal for securely integrating your on-prem system with IGUASU, allowing messages to be sent freely in both directions - without having to open any external ports on the on-prem system or using a VPN. IGUASU Gateway does not offer a UI, so the flow which will be used on the IGUASU Gateway must first be configured on IGUASU and then pushed to the IGUASU Gateway, which is illustrated in this use case. For details on how to install IGUASU Gateway on your on-prem system, please reach out to your contact person at Virtimo.

Processor:

  1. In the same process group, add processors as necessary for exchanging messages between IGUASU and IGUASU Gateway (InvokeCloud, ListenCloud and RespondCloud).

  2. Add processors as necessary for processing messages received from IGUASU and sending messages containing data from your on-prem system.

  3. Once finished, push the flow of the process group to the IGUASU Gateway.

  4. IGUASU Gateway should now be able to initiate a standing connection with IGUASU, allowing messages to be sent freely and securely.

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

URL

The URL of the Virtimo system to connect to, may include port and path. The scheme may be omitted but only WSS is supported.

Connection Timeout

Max wait time for connection to remote service.

Ping Interval

Interval by which a ping is made.

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL (https) connections. It is also used to connect to HTTPS Proxy.

Push Listeners Timeout

When pushing available listeners using this controller to the remote service, the request will time out if no response is received within this amount of time.

HybridWebSocketServerController

Server that expects standing connections initiated by one or more on-prem Virtimo systems. The appropriate processors can receive and send messages via this connection without opening an external port on the on-prem system.

Tags: hybrid, cloud, onprem, virtimo, websocket

Properties

Hostname

The Hostname to bind to. If not specified, will bind to all hosts

Listening Port

The Port to listen on for incoming HTTP requests

SSL Context Service

The SSL Context Service used to secure the server. The server only accepts HTTPS requests

Push Listeners Timeout

When pushing available listeners using this controller to the remote service, the request will time out if no response is received within this amount of time.

IPLookupService

A lookup service that provides several types of enrichment information for IP addresses. The service is configured by providing a MaxMind Database file and specifying which types of enrichment should be provided for an IP Address or Hostname. Each type of enrichment is a separate lookup, so configuring the service to provide all of the available enrichment data may be slower than returning only a portion of the available enrichments. In order to use this service, a lookup must be performed using key of 'ip' and a value that is a valid IP address or hostname. View the Usage of this component and choose to view Additional Details for more information, such as the Schema that pertains to the information that is returned.

Tags: lookup, enrich, ip, geo, ipgeo, maxmind, isp, domain, cellular, anonymous, tor

Properties

MaxMind Database File

Path to Maxmind IP Enrichment Database File

Lookup Geo Enrichment

Specifies whether or not information about the geographic information, such as cities, corresponding to the IP address should be returned

Lookup ISP

Specifies whether or not information about the Internet Service Provider corresponding to the IP address should be returned

Lookup Domain Name

Specifies whether or not information about the Domain Name corresponding to the IP address should be returned. If true, the lookup will contain second-level domain information, such as foo.com but will not contain bar.foo.com

Lookup Connection Type

Specifies whether or not information about the Connection Type corresponding to the IP address should be returned. If true, the lookup will contain a 'connectionType' field that (if populated) will contain a value of 'Dialup', 'Cable/DSL', 'Corporate', or 'Cellular'

Lookup Anonymous IP Information

Specifies whether or not information about whether or not the IP address belongs to an anonymous network should be returned.

Additional Details

The IPLookupService is powered by a MaxMind database and can return several different types of enrichment information about a given IP address. Below is the schema of the Record that is returned by this service (in Avro Schema format). The schema is for a single record that consists of several fields: geo, isp, domainName, connectionType, and anonymousIp. Each of these fields is nullable and will be populated only if the IP address that is searched for has the relevant information in the MaxMind database and if the Controller Service is configured to return such information. Because each of the fields requires a separate lookup in the database, it is advisable to retrieve only those fields that are of value.

{
  "name": "enrichmentRecord",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    {
      "name": "geo",
      "type": [
        "null",
        {
          "name": "cityGeo",
          "type": "record",
          "fields": [
            {
              "name": "city",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "name": "accuracy",
              "type": [
                "null",
                "int"
              ],
              "doc": "The radius, in kilometers, around the given location, where the IP address is believed to be"
            },
            {
              "name": "metroCode",
              "type": [
                "null",
                "int"
              ]
            },
            {
              "name": "timeZone",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "name": "latitude",
              "type": [
                "null",
                "double"
              ]
            },
            {
              "name": "longitude",
              "type": [
                "null",
                "double"
              ]
            },
            {
              "name": "country",
              "type": [
                "null",
                {
                  "type": "record",
                  "name": "country",
                  "fields": [
                    {
                      "name": "name",
                      "type": "string"
                    },
                    {
                      "name": "isoCode",
                      "type": "string"
                    }
                  ]
                }
              ]
            },
            {
              "name": "subdivisions",
              "type": {
                "type": "array",
                "items": {
                  "type": "record",
                  "name": "subdivision",
                  "fields": [
                    {
                      "name": "name",
                      "type": "string"
                    },
                    {
                      "name": "isoCode",
                      "type": "string"
                    }
                  ]
                }
              }
            },
            {
              "name": "continent",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "name": "postalCode",
              "type": [
                "null",
                "string"
              ]
            }
          ]
        }
      ]
    },
    {
      "name": "isp",
      "type": [
        "null",
        {
          "name": "ispEnrich",
          "type": "record",
          "fields": [
            {
              "name": "name",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "name": "organization",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "name": "asn",
              "type": [
                "null",
                "int"
              ]
            },
            {
              "name": "asnOrganization",
              "type": [
                "null",
                "string"
              ]
            }
          ]
        }
      ]
    },
    {
      "name": "domainName",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "connectionType",
      "type": [
        "null",
        "string"
      ],
      "doc": "One of 'Dialup', 'Cable/DSL', 'Corporate', 'Cellular'"
    },
    {
      "name": "anonymousIp",
      "type": [
        "null",
        {
          "name": "anonymousIpType",
          "type": "record",
          "fields": [
            {
              "name": "anonymous",
              "type": "boolean"
            },
            {
              "name": "anonymousVpn",
              "type": "boolean"
            },
            {
              "name": "hostingProvider",
              "type": "boolean"
            },
            {
              "name": "publicProxy",
              "type": "boolean"
            },
            {
              "name": "torExitNode",
              "type": "boolean"
            }
          ]
        }
      ]
    }
  ]
}
json

While this schema is fairly complex, it is a single record with 5 fields. This makes it quite easy to update an existing schema to allow for this record, by adding a new field to an existing schema and pasting in the schema above as the type.

For example, suppose that we have an existing schema that is as simple as:

{
  "name": "ipRecord",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    {
      "name": "ip",
      "type": "string"
    }
  ]
}
json

Now, let’s suppose that we want to add a new field named enrichment to the above schema. Further, let’s say that we want the new enrichment field to be nullable. We can do so by copying and pasting our enrichment schema from above thus:

{
  "name": "ipRecord",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    {
      "name": "ip",
      "type": "string"
    },
    {
      "name": "enrichment",
      "type": [
        "null",
        <Paste Enrichment Schema Here>
      ]
    }
  ]
}
json

JASN1Reader

Reads ASN.1 content and creates NiFi records. NOTE: ASN.1 schema preparation requires the JDK at runtime for model compilation.

Tags: asn, ans1, jasn.1, jasn1, record, reader, parser

Properties

Root Model Name

The model name in the form of 'MODULE-NAME.ModelType'. Mutually exclusive with and should be preferred to 'Root Model Class Name'. (See additional details for more information.)

Root Model Class Name

A canonical class name that is generated by the ASN.1 compiler to encode the ASN.1 input data. Mutually exclusive with 'Root Model Name'. Should be used when the former cannot be set properly. See additional details for more information.

ASN.1 Files

Comma-separated list of ASN.1 files.

Schema Preparation Strategy

When set, NiFi will perform additional preprocessing steps that create modified versions of the provided ASN files, removing unsupported features in a way that makes them less strict but otherwise still compatible with incoming data. The original files will remain intact and new ones will be created with the same names in the directory defined in the 'Additional Preprocessing Output Directory' property. For more information about these additional preprocessing steps please see Additional Details - Additional Preprocessing.

Schema Preparation Directory

When the processor is configured to do additional preprocessing, new modified schema files will be created in this directory. For more information about additional preprocessing please see description of the 'Do Additional Preprocessing' property or Additional Details - Additional Preprocessing.

Additional Details

Summary

This service creates record readers for ASN.1 input.

ASN.1 schema files (with full path) can be defined via the ASN.1 Files property as a comma separated list. The controller service preprocesses these files and generates sources that it uses for parsing data later.
Note that this preprocessing may take a while, especially when the schema files are large. The service remains in the Enabling state until this preprocessing is finished. Processors using the service are ready to be started at this point but probably won’t work properly until the service is fully Enabled.
Also note that the preprocessing phase can fail if there are problems with the ASN.1 schema files. The bulletin - as per usual - will show error messages related to the failure but interpreting those messages may not be straightforward. For help troubleshooting such messages please refer to the Troubleshooting section below.

The root model type can be defined via the Root Model Name property. Its format should be “MODULE-NAME.ModelType”. “MODULE-NAME” is the name of the ASN module defined at the beginning of the ASN schema file. “ModelType” is an ASN type defined in the ASN files that is not referenced by any other type. The reader created by this service expects ASN records of this root model type.
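
For example (the module and type names here are hypothetical), if an ASN.1 schema file begins with MY-PROTOCOL DEFINITIONS ::= BEGIN and defines a top-level type named TransactionRecord that no other type references, the Root Model Name would be MY-PROTOCOL.TransactionRecord.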

More than one root model type can be defined in the ASN schema files, but one service instance can only work with one such type at a time. Multiple different ASN data types can be processed by creating multiple instances of this service.

The ASN schema files are ultimately compiled into Java classes in a temporary directory when the service is enabled. (The directory is deleted when the service is disabled; the ASN schema files themselves remain.) The service actually needs the fully qualified name of the class compiled from the root model type, and it usually guesses this name correctly from Root Model Name.
However, there may be situations where this is not the case. Should this happen, one can take advantage of the fact that NiFi logs the temporary directory where the compiled Java classes can be found. Once the class corresponding to the root model type has been identified in that directory (usually easy to do by searching for it by name), its fully qualified name can be provided directly via the Root Model Class Name property. (Note that the service should be left Enabled while doing the search, as the temporary directory is deleted when the service is disabled. The service does need to be disabled in the end in order to set the property - letting it remove the directory - but this is not an issue, because the name of the root model class will be the same in the new temporary directory.)

Troubleshooting

The preprocessing is done in two phases:

  1. The first phase reads the ASN.1 schema files and parses them. Formatting errors are usually reported during this phase. Here are some possible error messages and the potential root causes of the issues:

    • line NNN:MMM: unexpected token: someFieldName - On the NNNth line, starting at the MMMth position, someFieldName is encountered which was unexpected. Usually this means someFieldName itself is fine but the previous field declaration doesn’t have a comma ‘,’ at the end.

    • line NNN:MMM: unexpected token: [ - On the NNNth line, starting at the MMMth position, the opening square bracket ‘[’ is encountered which was unexpected. Usually this is the index part of the field declaration and for some reason the field declaration is invalid. This can typically occur if the field name is invalid, e.g. starts with an uppercase letter. (Field names must start with a lowercase letter.)

  2. The second phase compiles the ASN.1 schema files into Java classes. Even if the ASN.1 files meet the formal requirements, due to the nature of the created Java files there are some extra limitations:

    • On certain systems type names are treated as case-insensitive. Because of this, two types whose names only differ in the cases of their letters may cause errors. For example if the ASN.1 schema files define both ‘SameNameWithDifferentCase’ and ‘SAMENAMEWithDifferentCase’, the following error may be reported:

      class SAMENAMEWithDifferentCase is public, should be declared in a file named SAMENAMEWithDifferentCase.java

    • Certain keywords cannot be used as field names. Known reserved keywords and the corresponding reported error messages are:

    • length

      incompatible types: com.beanit.asn1bean.ber.types.BerInteger cannot be converted to com.beanit.asn1bean.ber.BerLength
      incompatible types: boolean cannot be converted to java.io.OutputStream
      Some messages have been simplified; recompile with -Xdiags:verbose to get full output

Additional Preprocessing

NiFi doesn’t support every feature that the ASN standard allows. To alleviate problems when encountering ASN files with unsupported features, NiFi can perform additional preprocessing steps that create modified versions of the provided ASN files, removing unsupported features in a way that makes them less strict but otherwise still compatible with incoming data. This feature can be switched on via the ‘Schema Preparation Strategy’ property. The original files will remain intact and new ones will be created with the same names in the directory set in the ‘Schema Preparation Directory’ property. Please note that this is a best-effort attempt. It is strongly recommended to compare the resulting ASN files to the originals and make sure they are still appropriate.

The following modifications are applied:

  1. Constraints - Advanced constraints are not recognized as valid ASN elements by NiFi. This step will try to remove all types of constraints.
    E.g.

    field [3] INTEGER(SIZE(1..8,...,10|12|20)) OPTIONAL

    will be changed to

    field [3] INTEGER OPTIONAL

  2. Version brackets - NiFi will try to remove all version brackets and leave all defined fields as OPTIONAL.
    E.g.

    MyType ::= SEQUENCE { integerField1 INTEGER, integerField2 INTEGER, ..., -- comment1 [[ -- comment2 integerField3 INTEGER, integerField4 INTEGER, integerField5 INTEGER ]] }

    will be changed to

    MyType ::= SEQUENCE { integerField1 INTEGER, integerField2 INTEGER, ..., -- comment1 -- comment2 integerField3 INTEGER OPTIONAL, integerField4 INTEGER OPTIONAL, integerField5 INTEGER OPTIONAL }

  3. “Hugging” comments - This is not really an ASN feature but a potential error. The double dash comment indicator “--” should be separated from ASN elements.
    E.g.

    field [0] INTEGER(1..8)--comment

    will be changed to

    field [0] INTEGER(1..8) --comment

JettyWebSocketClient

Implementation of WebSocketClientService. This service uses the Jetty WebSocket client module to provide WebSocket session management throughout the application.

Tags: WebSocket, Jetty, client

Properties

Input Buffer Size

The size of the input buffer (data read from the network layer).

Max Text Message Size

The maximum size of a text message during parsing/generating.

Max Binary Message Size

The maximum size of a binary message during parsing/generating.

WebSocket URI

The WebSocket URI this client connects to.

SSL Context Service

The SSL Context Service to use in order to secure the connection. If specified, the client will connect using WSS; otherwise, it will connect using WS.

Connection Timeout

The timeout for connecting to the WebSocket URI.

Connection Attempt Count

The number of times to try and establish a connection.

Session Maintenance Interval

The interval between session maintenance activities. A WebSocket session established with a WebSocket server can be terminated due to different reasons including restarting the WebSocket server or timing out inactive sessions. This session maintenance activity is periodically executed in order to reconnect those lost sessions, so that a WebSocket client can reuse the same session id transparently after it reconnects successfully. The maintenance activity is executed until corresponding processors or this controller service is stopped.

User Name

The user name for Basic Authentication.

User Password

The user password for Basic Authentication.

Authentication Header Charset

The charset for Basic Authentication header base64 string.

Custom Authorization

Configures a custom HTTP Authorization Header as described in RFC 7235 Section 4.2. Setting a custom Authorization Header excludes configuring the User Name and User Password properties for Basic Authentication.

HTTP Proxy Host

The host name of the HTTP Proxy.

HTTP Proxy Port

The port number of the HTTP Proxy.

JettyWebSocketServer

Implementation of WebSocketServerService. This service uses the Jetty WebSocket server module to provide WebSocket session management throughout the application.

Tags: WebSocket, Jetty, server

Properties

Input Buffer Size

The size of the input buffer (data read from the network layer).

Max Text Message Size

The maximum size of a text message during parsing/generating.

Max Binary Message Size

The maximum size of a binary message during parsing/generating.

Listen Port

The port number on which this WebSocketServer listens.

SSL Context Service

The SSL Context Service to use in order to secure the server. If specified, the server will accept only WSS requests; otherwise, the server will accept only WS requests

SSL Client Authentication

Specifies whether or not the Processor should authenticate clients by their certificates. This value is ignored if the <SSL Context Service> Property is not specified or the SSL Context provided uses only a KeyStore and not a TrustStore.

Enable Basic Authentication

If enabled, client connection requests are authenticated with Basic authentication using the specified Login Provider.

Basic Authentication Path Spec

Specify a Path Spec to apply Basic Authentication.

Basic Authentication Roles

The authenticated user must have one of the specified roles. Multiple roles can be set as a comma-separated string. '*' represents any role, and '**' represents any role including no role.

Login Service

Specify which Login Service to use for Basic Authentication.

Users Properties File

Specify a property file containing users for Basic Authentication using HashLoginService. See http://www.eclipse.org/jetty/documentation/current/configuring-security.html for details.

JMSConnectionFactoryProvider

Provides a generic service to create vendor specific javax.jms.ConnectionFactory implementations. The Connection Factory can be served once this service is configured successfully.

Tags: jms, messaging, integration, queue, topic, publish, subscribe

Properties

JMS Connection Factory Implementation Class

The fully qualified name of the JMS ConnectionFactory implementation class (eg. org.apache.activemq.ActiveMQConnectionFactory).

JMS Client Libraries

Path to the directory with additional resources (eg. JARs, configuration files etc.) to be added to the classpath (defined as a comma separated list of values). Such resources typically represent target JMS client libraries for the ConnectionFactory implementation.

JMS Broker URI

URI pointing to the network location of the JMS Message broker. Example for ActiveMQ: 'tcp://myhost:61616'. Examples for IBM MQ: 'myhost(1414)' and 'myhost01(1414),myhost02(1414)'.

JMS SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Dynamic Properties

The name of a Connection Factory configuration property.

The properties that are set following Java Beans convention where a property name is derived from the 'set*' method of the vendor specific ConnectionFactory’s implementation. For example, 'com.ibm.mq.jms.MQConnectionFactory.setChannel(String)' would imply 'channel' property and 'com.ibm.mq.jms.MQConnectionFactory.setTransportType(int)' would imply 'transportType' property.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Additional Details

Description

This controller service serves as a general factory for providing vendor-specific instances of javax.jms.ConnectionFactory. It does so by allowing the user to configure vendor-specific properties, as well as to point to the location of the vendor-provided JMS client libraries, so that the correct implementation of javax.jms.ConnectionFactory can be found, loaded, instantiated and served to the dependent processors (see PublishJMS, ConsumeJMS).

All JMS vendors and ConnectionFactory implementations are supported as long as the configuration values can be set through set methods (see the detailed explanation in the last paragraph). However, some helpful accommodations are made for the following JMS vendors:

  • Apache ActiveMQ

  • IBM MQ

  • TIBCO EMS

  • Qpid JMS (AMQP 1.0)

This controller service exposes only a single mandatory static configuration property that is required across all implementations. The rest of the configuration properties are either optional or vendor specific.

The mandatory configuration property is:

  • JMS Connection Factory Implementation Class - The fully qualified name of the JMS ConnectionFactory implementation class (eg. org.apache.activemq.ActiveMQConnectionFactory).

The following static configuration properties are optional but required in many cases:

  • JMS Client Libraries - Path to the directory with additional resources (eg. JARs, configuration files, etc.) to be added to the classpath (defined as a comma separated list of values). Such resources typically represent target JMS client libraries for the ConnectionFactory implementation.

  • JMS Broker URI - URI pointing to the network location of the JMS Message broker. For example:

    • Apache ActiveMQ - tcp://myhost:1234 for single broker and failover:(tcp://myhost01:1234,tcp://myhost02:1234) for multiple brokers.

    • IBM MQ - myhost(1234) for single broker. myhost01(1234),myhost02(1234) for multiple brokers.

    • TIBCO EMS - tcp://myhost:1234 for single broker and tcp://myhost01:7222,tcp://myhost02:7222 for multiple brokers.

    • Qpid JMS (AMQP 1.0) - amqp[s]://myhost:1234 for single broker and failover:(amqp[s]://myhost01: 1234,amqp[s]://myhost02:1234) for multiple brokers.

The rest of the vendor-specific configuration is set through dynamic properties utilizing the Java Beans convention, where a property name is derived from the set method of the vendor-specific ConnectionFactory’s implementation. For example, com.ibm.mq.jms.MQConnectionFactory.setChannel(String) would imply a ‘channel’ property and com.ibm.mq.jms.MQConnectionFactory.setTransportType(int) would imply a ‘transportType’ property. For the list of available properties please consult the vendor-provided documentation.
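
To make the naming convention concrete, the following is a minimal sketch of how a dynamic property name such as channel could be mapped to a setter like setChannel(String) via reflection. It is illustrative only and not the service's actual code; the real service also converts the String property value to the setter's parameter type, which is omitted here.

import java.lang.reflect.Method;

public class BeanPropertySketch {

    // Derive the setter name from a property name, e.g. "channel" -> "setChannel".
    static String setterName(String propertyName) {
        return "set" + Character.toUpperCase(propertyName.charAt(0)) + propertyName.substring(1);
    }

    // Invoke the matching single-argument setter on the ConnectionFactory instance.
    static void applyProperty(Object connectionFactory, String propertyName, Object value) throws Exception {
        String setter = setterName(propertyName);
        for (Method method : connectionFactory.getClass().getMethods()) {
            if (method.getName().equals(setter) && method.getParameterCount() == 1) {
                method.invoke(connectionFactory, value);
                return;
            }
        }
        throw new IllegalArgumentException("No setter found for property: " + propertyName);
    }
}
java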

Besides the dynamic properties and set methods described in the previous section, some providers also support additional configuration via the Broker URI (as query parameters added to the URI).

Sample controller service configuration for IBM MQ
Property Value Static/Dynamic Comments

JMS Connection Factory Implementation

com.ibm.mq.jms.MQQueueConnectionFactory

Static

Vendor provided implementation of QueueConnectionFactory

JMS Client Libraries

/opt/mqm/java/lib

Static

Default installation path of client JAR files on Linux systems

JMS Broker URI

mqhost01(1414),mqhost02(1414)

Static

Connection Name List syntax. Colon separated host/port pair(s) is also supported

JMS SSL Context Service

Static

Only required if using SSL/TLS

channel

TO.BAR

Dynamic

Required when using the client transport mode

queueManager

PQM1

Dynamic

Name of queue manager. Always required.

transportType

1

Dynamic

Constant integer value corresponding to the client transport mode. Default value is “Bindings, then client”

JndiJmsConnectionFactoryProvider

Provides a service to lookup an existing JMS ConnectionFactory using the Java Naming and Directory Interface (JNDI).

Tags: jms, jndi, messaging, integration, queue, topic, publish, subscribe

Properties

JNDI Initial Context Factory Class

The fully qualified class name of the JNDI Initial Context Factory Class (java.naming.factory.initial).

JNDI Provider URL

The URL of the JNDI Provider to use as the value for java.naming.provider.url. See additional details documentation for allowed URL schemes.

JNDI Name of the Connection Factory

The name of the JNDI Object to lookup for the Connection Factory.

JNDI / JMS Client Libraries

Specifies jar files and/or directories to add to the ClassPath in order to load the JNDI / JMS client libraries. This should be a comma-separated list of files, directories, and/or URLs. If a directory is given, any files in that directory will be included, but subdirectories will not be included (i.e., it is not recursive).

JNDI Principal

The Principal to use when authenticating with JNDI (java.naming.security.principal).

JNDI Credentials

The Credentials to use when authenticating with JNDI (java.naming.security.credentials).

Dynamic Properties

The name of a JNDI Initial Context environment variable.

In order to perform a JNDI Lookup, an Initial Context must be established. When this is done, an Environment can be established for the context. Any dynamic/user-defined property that is added to this Controller Service will be added as an Environment configuration/variable to this Context.

Additional Details

Capabilities

This Controller Service allows users to reference a JMS Connection Factory that has already been established and made available via Java Naming and Directory Interface (JNDI) Server. Please see documentation from your JMS Vendor in order to understand the appropriate values to configure for this service.

A Connection Factory in Java is typically obtained via JNDI in code like below. The comments have been added in to explain how this maps to the Controller Service’s configuration.

Hashtable<String, Object> env = new Hashtable<>();
env.put(Context.INITIAL_CONTEXT_FACTORY, JNDI_INITIAL_CONTEXT_FACTORY); // Value for this comes from the "JNDI Initial Context Factory Class" property.
env.put(Context.PROVIDER_URL, JNDI_PROVIDER_URL); // Value for this comes from the "JNDI Provider URL" property.
env.put("My-Environment-Variable", "Environment-Variable-Value"); // This is accomplished by adding a user-defined property with name "My-Environment-Variable" and value "Environment-Variable-Value"

Context initialContext = new InitialContext(env);
ConnectionFactory connectionFactory = (ConnectionFactory) initialContext.lookup(JNDI_CONNECTION_FACTORY_NAME); // Value for this comes from the "JNDI Name of the Connection Factory" property
java

It is also important to note that, in order for this to work, the class named by the “JNDI Initial Context Factory Class” must be available on the classpath. The JMS provider specific client classes (like the class of the Connection Factory object to be retrieved from JNDI) must also be available on the classpath. In NiFi, this is accomplished by setting the “JNDI / JMS Client Libraries” property to point to one or more .jar files or directories (comma-separated values).

When the Controller Service is disabled and then re-enabled, it will perform the JNDI lookup again. Once the Connection Factory has been obtained, though, it will not perform another JNDI lookup until the service is disabled.

Example Configuration

As an example, the following configuration may be used to connect to ActiveMQ’s JMS Broker, using the Connection Factory provided via their embedded JNDI server:

Property Name Property Value

JNDI Initial Context Factory Class

org.apache.activemq.jndi.ActiveMQInitialContextFactory

JNDI Provider URL

tcp://jms-broker:61616

JNDI Name of the Connection Factory

ConnectionFactory

JNDI / JMS Client Libraries

/opt/apache-activemq-5.15.2/lib/

The above example assumes that there exists a host that is accessible with hostname “jms-broker” and that is running Apache ActiveMQ on port 61616 and also that the jar(s) containing the org.apache.activemq.jndi.ActiveMQInitialContextFactory class and the other JMS client classes can be found within the /opt/apache-activemq-5.15.2/lib/ directory.

Property Validation

The following component properties include additional validation to restrict allowed values:

  • JNDI Provider URL

JNDI Provider URL Validation

The default validation for JNDI Provider URL allows the following URL schemes:

  • file

  • jgroups

  • ssl

  • t3

  • t3s

  • tcp

  • udp

  • vm

The following Java System property can be configured to override the default allowed URL schemes:

  • org.apache.nifi.jms.cf.jndi.provider.url.schemes.allowed

The System property must contain a space-separated list of URL schemes. This property can be configured in the application bootstrap.conf as follows:

  • java.arg.jndiJmsUrlSchemesAllowed=-Dorg.apache.nifi.jms.cf.jndi.provider.url.schemes.allowed=ssl tcp
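
The effect of this System property can be pictured with a short sketch. This is illustrative only and not NiFi's validation code; it simply splits the space-separated list and checks the scheme of a configured JNDI Provider URL against it.

import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class SchemeAllowListSketch {
    public static void main(String[] args) {
        // Default allowed schemes; overridden by the System property shown above, if set.
        String allowed = System.getProperty(
                "org.apache.nifi.jms.cf.jndi.provider.url.schemes.allowed",
                "file jgroups ssl t3 t3s tcp udp vm");
        List<String> schemes = Arrays.asList(allowed.split(" "));

        String providerUrl = "tcp://jms-broker:61616";
        String scheme = URI.create(providerUrl).getScheme();

        System.out.println("Scheme '" + scheme + "' allowed: " + schemes.contains(scheme));
    }
}
java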

JsonConfigBasedBoxClientService

Provides Box client objects through which Box API calls can be used.

Tags: box, client, provider

Properties

Account ID

The ID of the Box account that owns the accessed resource. Same as 'User Id' under 'App Info' in the App 'General Settings'.

App Config File

Full path of an App config JSON file. See Additional Details for more information.

App Config JSON

The raw JSON containing an App config. See Additional Details for more information.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Additional Details

Setting up a Box App

This processor requires a pre-configured App under the Account owning the resources being accessed (https://app.box.com/developers/console).

The App should have the following configuration:

  • If you create a new App, select ‘Server Authentication (with JWT)’ authentication method.
    If you want to use an existing App, choose one with ‘OAuth 2.0 with JSON Web Tokens (Server Authentication)’ as Authentication method.

  • Should have a ‘Client ID’ and ‘Client Secret’.

  • ‘App Access Level’ should be ‘App + Enterprise Access’.

  • ‘Application Scopes’ should have ‘Write all files and folders in Box’ enabled.

  • ‘Advanced Features’ should have ‘Generate user access tokens’ and ‘Make API calls using the as-user header’ enabled.

  • Under ‘Add and Manage Public Keys’ generate a Public/Private Keypair and download the configuration JSON file (under App Settings). The full path of this file should be set in the ‘App Config File’ property.
    Note that you can download the configuration JSON with the keypair details only once, when you generate the keypair. This is also the only time Box will show you the private key.
    If you want to download the configuration JSON file later (under ‘App Settings’) - or if you want to use your own keypair - you need to edit the downloaded file and add the keypair details manually.

  • After all settings are done, the App needs to be reauthorized on the admin page. (‘Reauthorize App’ at https://app.box.com/master/custom-apps.)
    If the app is configured for the first time it needs to be added (‘Add App’) before it can be authorized.

JsonPathReader

Parses JSON records and evaluates user-defined JSON Paths against each JSON object. While the reader expects each record to be well-formed JSON, the content of a FlowFile may consist of many records, each as a well-formed JSON array or JSON object with optional whitespace between them, such as the common 'JSON-per-line' format. If an array is encountered, each element in that array will be treated as a separate record. User-defined properties define the fields that should be extracted from the JSON in order to form the fields of a Record. Any JSON field that is not extracted via a JSONPath will not be returned in the JSON Records.

Tags: json, jsonpath, record, reader, parser

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Max String Length

The maximum allowed length of a string value when parsing the JSON document

Allow Comments

Whether to allow comments when parsing the JSON document

Date Format

Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).

Time Format

Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).

Dynamic Properties

The field name for the record.

User-defined properties identify how to extract specific fields from a JSON object in order to create a Record

See Also

Additional Details

The JsonPathReader Controller Service parses FlowFiles that are in JSON format. User-defined properties specify how to extract all relevant fields from the JSON in order to create a Record. The Controller Service will not be valid unless at least one JSON Path is provided. Unlike the JsonTreeReader Controller Service, this service will return a record that contains only those fields that have been configured via JSON Path.

If the root of the FlowFile’s JSON is a JSON Array, each JSON Object found in that array will be treated as a separate Record, not as a single record made up of an array. If the root of the FlowFile’s JSON is a JSON Object, it will be evaluated as a single Record.

Supplying a JSON Path is accomplished by adding a user-defined property where the name of the property becomes the name of the field in the Record that is returned. The value of the property must be a valid JSON Path expression. This JSON Path will be evaluated against each top-level JSON Object in the FlowFile, and the result will be the value of the field whose name is specified by the property name. If any JSON Path is given but no field is present in the Schema with the proper name, then the field will be skipped.

This Controller Service must be configured with a schema. Each JSON Path that is evaluated and is found in the “root level” of the schema will produce a Field in the Record. I.e., the schema should match the Record that is created by evaluating all the JSON Paths. It should not match the “incoming JSON” that is read from the FlowFile.

Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field is omitted from the Record. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

  • Any data type can be coerced into a String type.

  • Any numeric data type (Byte, Short, Int, Long, Float, Double) can be coerced into any other numeric data type.

  • Any numeric value can be coerced into a Date, Time, or Timestamp type, by assuming that the Long value is the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • A String value can be coerced into a Date, Time, or Timestamp type, if its format matches the configured “Date Format,” “Time Format,” or “Timestamp Format.”

  • A String value can be coerced into a numeric value if the value is of the appropriate type. For example, the String value 8 can be coerced into any numeric type. However, the String value 8.2 can be coerced into a Double or Float type but not an Integer.

  • A String value of “true” or “false” (regardless of case) can be coerced into a Boolean value.

  • A String value that is not empty can be coerced into a Char type. If the String contains more than 1 character, the first character is used and the rest of the characters are ignored.

  • Any “date/time” type (Date, Time, Timestamp) can be coerced into any other “date/time” type.

  • Any “date/time” type can be coerced into a Long type, representing the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • Any “date/time” type can be coerced into a String. The format of the String is whatever DateFormat is configured for the corresponding property (Date Format, Time Format, Timestamp Format property). If no value is specified, then the value will be converted into a String representation of the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.
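
A few of these rules can be sketched in plain Java. The snippet below is illustrative only, assuming the example formats mentioned above; it is not the Controller Service's actual coercion code.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class CoercionSketch {
    public static void main(String[] args) {
        // String -> numeric works only when the value is of the appropriate type.
        int asInt = Integer.parseInt("8");           // succeeds
        double asDouble = Double.parseDouble("8.2"); // succeeds; Integer.parseInt("8.2") would fail

        // String -> Boolean, regardless of case.
        boolean asBoolean = Boolean.parseBoolean("TRUE");

        // String -> Date, when the value matches the configured "Date Format" (here MM/dd/yyyy).
        LocalDate asDate = LocalDate.parse("01/01/2017", DateTimeFormatter.ofPattern("MM/dd/yyyy"));

        // Non-empty String -> Char: the first character is used, the rest is ignored.
        char asChar = "NiFi".charAt(0);

        System.out.println(asInt + " " + asDouble + " " + asBoolean + " " + asDate + " " + asChar);
    }
}
java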

Schema Inference

While NiFi’s Record API does require that each Record have a schema, it is often convenient to infer the schema based on the values in the data, rather than having to manually create a schema. This is accomplished by selecting a value of “Infer Schema” for the “Schema Access Strategy” property. When using this strategy, the Reader will determine the schema by first parsing all data in the FlowFile, keeping track of all fields that it has encountered and the type of each field. Once all data has been parsed, a schema is formed that encompasses all fields that have been encountered.

A common concern when inferring schemas is how to handle the condition of two values that have different types. For example, consider a FlowFile with the following two records:

[
  {
    "name": "John",
    "age": 8,
    "values": "N/A"
  },
  {
    "name": "Jane",
    "age": "Ten",
    "values": [
      8,
      "Ten"
    ]
  }
]
json

It is clear that the “name” field will be inferred as a STRING type. However, how should we handle the “age” field? Should the field be a CHOICE between INT and STRING? Should we prefer LONG over INT? Should we just use a STRING? Should the field be considered nullable?

To help understand how this Record Reader infers schemas, we have the following list of rules that are followed in the inference logic:

  • All fields are inferred to be nullable.

  • When two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), the inference engine prefers to use a “wider” data type over using a CHOICE data type. A data type “A” is said to be wider than data type “B” if and only if data type “A” encompasses all values of “B” in addition to other values. For example, the LONG type is wider than the INT type but not wider than the BOOLEAN type (and BOOLEAN is also not wider than LONG). INT is wider than SHORT. The STRING type is considered wider than all other types except MAP, RECORD, ARRAY, and CHOICE.

  • If two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), but neither value is of a type that is wider than the other, then a CHOICE type is used. In the example above, the “values” field will be inferred as a CHOICE between a STRING and an ARRAY.

  • If the “Time Format,” “Timestamp Format,” or “Date Format” properties are configured, any value that would otherwise be considered a STRING type is first checked against the configured formats to see if it matches any of them. If the value matches the Timestamp Format, the value is considered a Timestamp field. If it matches the Date Format, it is considered a Date field. If it matches the Time Format, it is considered a Time field. In the unlikely event that the value matches more than one of the configured formats, they will be matched in the order: Timestamp, Date, Time. I.e., if a value matched both the Timestamp Format and the Date Format, the type that is inferred will be Timestamp. Because parsing dates and times can be expensive, it is advisable not to configure these formats if dates, times, and timestamps are not expected, or if processing the data as a STRING is acceptable. For use cases when this is important, though, the inference engine is intelligent enough to optimize the parsing by first checking several very cheap conditions. For example, the string’s length is examined to see if it is too long or too short to match the pattern. This results in far more efficient processing than would result if attempting to parse each string value as a timestamp.

  • The MAP type is never inferred. Instead, the RECORD type is used.

  • If a field exists but all values are null, then the field is inferred to be of type STRING.

Caching of Inferred Schemas

If a schema is to be inferred, this Record Reader requires that all records be read in order to ensure that the inferred schema is applicable to all records in the FlowFile. However, this can become expensive, especially if the data undergoes many different transformations. To alleviate the cost of inferring schemas, the Record Reader can be configured with a “Schema Inference Cache” by populating the property with that name. This is a Controller Service that can be shared by Record Readers and Record Writers.

Whenever a Record Writer is used to write data, if it is configured with a “Schema Cache,” it will also add the schema to the Schema Cache. This will result in an identifier for that schema being added as an attribute to the FlowFile.

Whenever a Record Reader is used to read data, if it is configured with a “Schema Inference Cache”, it will first look for a “schema.cache.identifier” attribute on the FlowFile. If the attribute exists, it will use the value of that attribute to lookup the schema in the schema cache. If it is able to find a schema in the cache with that identifier, then it will use that schema instead of reading, parsing, and analyzing the data to infer the schema. If the attribute is not available on the FlowFile, or if the attribute is available but the cache does not have a schema with that identifier, then the Record Reader will proceed to infer the schema as described above.

The end result is that users are able to chain together many different Processors to operate on Record-oriented data. Typically, only the first such Processor in the chain will incur the “penalty” of inferring the schema. For all other Processors in the chain, the Record Reader is able to simply lookup the schema in the Schema Cache by identifier. This allows the Record Reader to infer a schema accurately, since it is inferred based on all data in the FlowFile, and still allows this to happen efficiently since the schema will typically only be inferred once, regardless of how many Processors handle the data.
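
Conceptually, the interaction between the writer's "Schema Cache" and the reader's "Schema Inference Cache" looks roughly like the sketch below. It is a simplification using a plain in-memory map; the schema.cache.identifier attribute name comes from the description above, and everything else is illustrative.

import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class SchemaCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Writer side: cache the schema and return the identifier that is added
    // to the FlowFile as the "schema.cache.identifier" attribute.
    public String cacheSchema(String schemaText) {
        String identifier = UUID.randomUUID().toString();
        cache.put(identifier, schemaText);
        return identifier;
    }

    // Reader side: if the attribute is present and the cache holds that schema,
    // use it; otherwise fall back to inferring the schema from the data.
    public String resolveSchema(Map<String, String> flowFileAttributes, String data) {
        String identifier = flowFileAttributes.get("schema.cache.identifier");
        return Optional.ofNullable(identifier)
                .map(cache::get)
                .orElseGet(() -> inferSchema(data));
    }

    private String inferSchema(String data) {
        // Placeholder for the expensive "parse everything and infer" step.
        return "{\"type\":\"record\",\"name\":\"inferred\",\"fields\":[]}";
    }
}
java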

Examples

As an example, consider a FlowFile whose content contains the following JSON:

[
  {
    "id": 17,
    "name": "John",
    "child": {
      "id": "1"
    },
    "siblingIds": [
      4,
      8
    ],
    "siblings": [
      {
        "name": "Jeremy",
        "id": 4
      },
      {
        "name": "Julia",
        "id": 8
      }
    ]
  },
  {
    "id": 98,
    "name": "Jane",
    "child": {
      "id": 2
    },
    "gender": "F",
    "siblingIds": [],
    "siblings": []
  }
]
json

And the following schema has been configured:

{
  "namespace": "nifi",
  "name": "person",
  "type": "record",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "childId",
      "type": "long"
    },
    {
      "name": "gender",
      "type": "string"
    },
    {
      "name": "siblingNames",
      "type": {
        "type": "array",
        "items": "string"
      }
    }
  ]
}
json

If we configure this Controller Service with the following user-defined properties:

Property Name Property Value

id

$.id

name

$.name

childId

$.child.id

gender

$.gender

siblingNames

$.siblings[*].name

In this case, the FlowFile will generate two Records. The first record will consist of the following key/value pairs:

Field Name Field Value

id

17

name

John

childId

1

gender

null

siblingNames

array of two elements: Jeremy and Julia

The second record will consist of the following key/value pairs:

Field Name Field Value

id

98

name

Jane

childId

2

gender

F

siblingNames

empty array
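
For readers who want to experiment with these expressions outside NiFi, the following is a minimal sketch using the Jayway json-path library. The library choice is an assumption made for illustration; it is not necessarily how this Controller Service evaluates the expressions internally.

import com.jayway.jsonpath.JsonPath;
import java.util.List;

public class JsonPathSketch {
    public static void main(String[] args) {
        String json = "{\"id\":17,\"name\":\"John\",\"child\":{\"id\":\"1\"},"
                + "\"siblings\":[{\"name\":\"Jeremy\",\"id\":4},{\"name\":\"Julia\",\"id\":8}]}";

        // Each user-defined property behaves roughly like one of these lookups.
        Object childId = JsonPath.read(json, "$.child.id");                    // "1"
        List<String> siblingNames = JsonPath.read(json, "$.siblings[*].name"); // [Jeremy, Julia]

        System.out.println("childId = " + childId);
        System.out.println("siblingNames = " + siblingNames);
    }
}
java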

JsonRecordSetWriter

Writes the results of a RecordSet as either a JSON Array or one JSON object per line. If using Array output, then even if the RecordSet consists of a single row, it will be written as an array with a single element. If using One Line Per Object output, the JSON objects cannot be pretty-printed.

Tags: json, resultset, writer, serialize, record, recordset, row

Properties

Schema Write Strategy

Specifies how the schema for a Record should be added to the data.

Schema Cache

Specifies a Schema Cache to add the Record Schema to so that Record Readers can quickly lookup the schema.

Schema Reference Writer

Service implementation responsible for writing FlowFile attributes or content header with Schema reference information

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Date Format

Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).

Time Format

Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).

Pretty Print JSON

Specifies whether or not the JSON should be pretty printed

Suppress Null Values

Specifies how the writer should handle a null field

Allow Scientific Notation

Specifies whether or not scientific notation should be used when writing numbers

Output Grouping

Specifies how the writer should output the JSON records (e.g., as an array or as one object per line). Note that if 'One Line Per Object' is selected, then Pretty Print JSON must be false.

Compression Format

The compression format to use. Valid values are: GZIP, BZIP2, ZSTD, XZ-LZMA2, LZMA, Snappy, and Snappy Framed

Compression Level

The compression level to use; this is valid only when using GZIP compression. A lower value results in faster processing but less compression; a value of 0 indicates no compression but simply archiving

JsonTreeReader

Parses JSON into individual Record objects. While the reader expects each record to be well-formed JSON, the content of a FlowFile may consist of many records, each as a well-formed JSON array or JSON object with optional whitespace between them, such as the common 'JSON-per-line' format. If an array is encountered, each element in that array will be treated as a separate record. If the schema that is configured contains a field that is not present in the JSON, a null value will be used. If the JSON contains a field that is not present in the schema, that field will be skipped. See the Usage of the Controller Service for more information and examples.

Tags: json, tree, record, reader, parser

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Schema Inference Cache

Specifies a Schema Cache to use when inferring the schema. If not populated, the schema will be inferred each time. However, if a cache is specified, the cache will first be consulted and if the applicable schema can be found, it will be used instead of inferring the schema.

Starting Field Strategy

Start processing from the root node or from a specified nested node.

Starting Field Name

Skips forward to the given nested JSON field (array or object) to begin processing.

Schema Application Strategy

Specifies whether the schema is defined for the whole JSON or for the selected part starting from "Starting Field Name".

Max String Length

The maximum allowed length of a string value when parsing the JSON document

Allow Comments

Whether to allow comments when parsing the JSON document

Date Format

Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).

Time Format

Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).

See Also

Additional Details

The JsonTreeReader Controller Service reads a JSON Object and creates a Record object either for the entire JSON Object tree or a subpart (see “Starting Field Strategies” section). The Controller Service must be configured with a Schema that describes the structure of the JSON data. If any field exists in the JSON that is not in the schema, that field will be skipped. If the schema contains a field for which no JSON field exists, a null value will be used in the Record (or the default value defined in the schema, if applicable).

If the root element of the JSON is a JSON Array, each JSON Object within that array will be treated as its own separate Record. If the root element is a JSON Object, the JSON will all be treated as a single Record.

Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field is omitted from the Record. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

  • Any data type can be coerced into a String type.

  • Any numeric data type (Byte, Short, Int, Long, Float, Double) can be coerced into any other numeric data type.

  • Any numeric value can be coerced into a Date, Time, or Timestamp type, by assuming that the Long value is the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • A String value can be coerced into a Date, Time, or Timestamp type, if its format matches the configured “Date Format,” “Time Format,” or “Timestamp Format.”

  • A String value can be coerced into a numeric value if the value is of the appropriate type. For example, the String value 8 can be coerced into any numeric type. However, the String value 8.2 can be coerced into a Double or Float type but not an Integer.

  • A String value of “true” or “false” (regardless of case) can be coerced into a Boolean value.

  • A String value that is not empty can be coerced into a Char type. If the String contains more than 1 character, the first character is used and the rest of the characters are ignored.

  • Any “date/time” type (Date, Time, Timestamp) can be coerced into any other “date/time” type.

  • Any “date/time” type can be coerced into a Long type, representing the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • Any “date/time” type can be coerced into a String. The format of the String is whatever DateFormat is configured for the corresponding property (Date Format, Time Format, Timestamp Format property). If no value is specified, then the value will be converted into a String representation of the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.

Schema Inference

While NiFi’s Record API does require that each Record have a schema, it is often convenient to infer the schema based on the values in the data, rather than having to manually create a schema. This is accomplished by selecting a value of “Infer Schema” for the “Schema Access Strategy” property. When using this strategy, the Reader will determine the schema by first parsing all data in the FlowFile, keeping track of all fields that it has encountered and the type of each field. Once all data has been parsed, a schema is formed that encompasses all fields that have been encountered.

A common concern when inferring schemas is how to handle the condition of two values that have different types. For example, consider a FlowFile with the following two records:

[
  {
    "name": "John",
    "age": 8,
    "values": "N/A"
  },
  {
    "name": "Jane",
    "age": "Ten",
    "values": [
      8,
      "Ten"
    ]
  }
]
json

It is clear that the “name” field will be inferred as a STRING type. However, how should we handle the “age” field? Should the field be a CHOICE between INT and STRING? Should we prefer LONG over INT? Should we just use a STRING? Should the field be considered nullable?

To help understand how this Record Reader infers schemas, we have the following list of rules that are followed in the inference logic:

  • All fields are inferred to be nullable.

  • When two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), the inference engine prefers to use a “wider” data type over using a CHOICE data type. A data type “A” is said to be wider than data type “B” if and only if data type “A” encompasses all values of “B” in addition to other values. For example, the LONG type is wider than the INT type but not wider than the BOOLEAN type (and BOOLEAN is also not wider than LONG). INT is wider than SHORT. The STRING type is considered wider than all other types except MAP, RECORD, ARRAY, and CHOICE (see the sketch after this list).

  • If two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), but neither value is of a type that is wider than the other, then a CHOICE type is used. In the example above, the “values” field will be inferred as a CHOICE between a STRING and an ARRAY.

  • If the “Time Format,” “Timestamp Format,” or “Date Format” properties are configured, any value that would otherwise be considered a STRING type is first checked against the configured formats to see if it matches any of them. If the value matches the Timestamp Format, the value is considered a Timestamp field. If it matches the Date Format, it is considered a Date field. If it matches the Time Format, it is considered a Time field. In the unlikely event that the value matches more than one of the configured formats, they will be matched in the order: Timestamp, Date, Time. I.e., if a value matched both the Timestamp Format and the Date Format, the type that is inferred will be Timestamp. Because parsing dates and times can be expensive, it is advisable not to configure these formats if dates, times, and timestamps are not expected, or if processing the data as a STRING is acceptable. For use cases when this is important, though, the inference engine is intelligent enough to optimize the parsing by first checking several very cheap conditions. For example, the string’s length is examined to see if it is too long or too short to match the pattern. This results in far more efficient processing than would result if attempting to parse each string value as a timestamp.

  • The MAP type is never inferred. Instead, the RECORD type is used.

  • If a field exists but all values are null, then the field is inferred to be of type STRING.
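
The “wider type” preference mentioned in the rules above can be pictured with a small sketch. The logic below is illustrative only and restricted to the type relationships explicitly stated in this list; it is not the reader's actual inference engine.

import java.util.List;

public class WidenTypeSketch {

    enum InferredType { SHORT, INT, LONG, BOOLEAN, STRING, CHOICE }

    // Widening order stated above: SHORT is narrower than INT, INT narrower than LONG.
    private static final List<InferredType> NUMERIC_ORDER =
            List.of(InferredType.SHORT, InferredType.INT, InferredType.LONG);

    static InferredType merge(InferredType a, InferredType b) {
        if (a == b) {
            return a;
        }
        // Prefer the wider numeric type when both are numeric.
        if (NUMERIC_ORDER.contains(a) && NUMERIC_ORDER.contains(b)) {
            return NUMERIC_ORDER.indexOf(a) > NUMERIC_ORDER.indexOf(b) ? a : b;
        }
        // STRING is wider than the other simple types shown here.
        if (a == InferredType.STRING || b == InferredType.STRING) {
            return InferredType.STRING;
        }
        // Otherwise (e.g. BOOLEAN vs. LONG) neither is wider, so a CHOICE is used.
        return InferredType.CHOICE;
    }

    public static void main(String[] args) {
        System.out.println(merge(InferredType.INT, InferredType.LONG));     // LONG
        System.out.println(merge(InferredType.BOOLEAN, InferredType.LONG)); // CHOICE
    }
}
java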

Caching of Inferred Schemas

If a schema is to be inferred, this Record Reader requires that all records be read in order to ensure that the inferred schema is applicable to all records in the FlowFile. However, this can become expensive, especially if the data undergoes many different transformations. To alleviate the cost of inferring schemas, the Record Reader can be configured with a “Schema Inference Cache” by populating the property with that name. This is a Controller Service that can be shared by Record Readers and Record Writers.

Whenever a Record Writer is used to write data, if it is configured with a “Schema Cache,” it will also add the schema to the Schema Cache. This will result in an identifier for that schema being added as an attribute to the FlowFile.

Whenever a Record Reader is used to read data, if it is configured with a “Schema Inference Cache”, it will first look for a “schema.cache.identifier” attribute on the FlowFile. If the attribute exists, it will use the value of that attribute to lookup the schema in the schema cache. If it is able to find a schema in the cache with that identifier, then it will use that schema instead of reading, parsing, and analyzing the data to infer the schema. If the attribute is not available on the FlowFile, or if the attribute is available but the cache does not have a schema with that identifier, then the Record Reader will proceed to infer the schema as described above.

The end result is that users are able to chain together many different Processors to operate on Record-oriented data. Typically, only the first such Processor in the chain will incur the “penalty” of inferring the schema. For all other Processors in the chain, the Record Reader is able to simply lookup the schema in the Schema Cache by identifier. This allows the Record Reader to infer a schema accurately, since it is inferred based on all data in the FlowFile, and still allows this to happen efficiently since the schema will typically only be inferred once, regardless of how many Processors handle the data.

Starting Field Strategies

When using JsonTreeReader, two different starting field strategies can be selected. With the default Root Node strategy, the JsonTreeReader begins processing from the root element of the JSON and creates a Record object for the entire JSON Object tree, while the Nested Field strategy defines a nested field from which to begin processing.

Using the Nested Field strategy, a schema corresponding to the nested JSON part should be specified. In case of schema inference, the JsonTreeReader will automatically infer a schema from nested records.

Root Node Strategy

Consider the following JSON is read with the default Root Node strategy:

[
  {
    "id": 17,
    "name": "John",
    "child": {
      "id": "1"
    },
    "dob": "10-29-1982",
    "siblings": [
      {
        "name": "Jeremy",
        "id": 4
      },
      {
        "name": "Julia",
        "id": 8
      }
    ]
  },
  {
    "id": 98,
    "name": "Jane",
    "child": {
      "id": 2
    },
    "dob": "08-30-1984",
    "gender": "F",
    "siblingIds": [],
    "siblings": []
  }
]
json

Also, consider that the schema that is configured for this JSON is as follows (assuming that the AvroSchemaRegistry Controller Service is chosen to denote the Schema):

{
  "namespace": "nifi",
  "name": "person",
  "type": "record",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "gender",
      "type": "string"
    },
    {
      "name": "dob",
      "type": {
        "type": "int",
        "logicalType": "date"
      }
    },
    {
      "name": "siblings",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "fields": [
            {
              "name": "name",
              "type": "string"
            }
          ]
        }
      }
    }
  ]
}
json

Let us also assume that this Controller Service is configured with the “Date Format” property set to “MM-dd-yyyy”, as this matches the date format used for our JSON data. This will result in the JSON creating two separate records, because the root element is a JSON array with two elements.

The first Record will consist of the following values:

Field Name Field Value

id

17

name

John

gender

null

dob

10-29-1982

siblings

see below

Siblings

array with two elements, each of which is itself a Record:

Field Name Field Value

name

Jeremy

and:

Field Name Field Value

name

Julia

The second Record will consist of the following values:

Field Name Field Value

id

98

name

Jane

gender

F

dob

08-30-1984

siblings

empty array
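
The “Date Format” value of “MM-dd-yyyy” used above can be verified with a quick java.time snippet. This uses only the standard library and is not NiFi code.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateFormatSketch {
    public static void main(String[] args) {
        DateTimeFormatter format = DateTimeFormatter.ofPattern("MM-dd-yyyy");
        // The "dob" strings from the example JSON parse cleanly with this pattern.
        LocalDate first = LocalDate.parse("10-29-1982", format);
        LocalDate second = LocalDate.parse("08-30-1984", format);
        System.out.println(first + " / " + second);
    }
}
java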

Nested Field Strategy

Using the Nested Field strategy, consider the same JSON where the specified Starting Field Name is “siblings”. The schema that is configured for this JSON is as follows:

{
  "namespace": "nifi",
  "name": "siblings",
  "type": "record",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "id",
      "type": "int"
    }
  ]
}
json

The first Record will consist of the following values:

Field Name   Field Value
name         Jeremy
id           4

The second Record will consist of the following values:

Field Name   Field Value
name         Julia
id           8

Schema Application Strategies

When using JsonTreeReader with the Nested Field strategy and a “Schema Access Strategy” other than “Infer Schema”, the configured schema can describe either the entire original JSON (“Whole document” strategy) or only the nested field section (“Selected part” strategy).

Kafka3ConnectionService

Provides and manages connections to Kafka Brokers for producer or consumer operations.

Properties

Bootstrap Servers

Comma-separated list of Kafka Bootstrap Servers in the format host:port. Corresponds to Kafka bootstrap.servers property

Security Protocol

Security protocol used to communicate with brokers. Corresponds to Kafka Client security.protocol property

SASL Mechanism

SASL mechanism used for authentication. Corresponds to Kafka Client sasl.mechanism property

SASL Username

Username provided with configured password when using PLAIN or SCRAM SASL Mechanisms

SASL Password

Password provided with configured username when using PLAIN or SCRAM SASL Mechanisms

Kerberos User Service

Service supporting user authentication with Kerberos

Kerberos Service Name

The service name that matches the primary name of the Kafka server configured in the broker JAAS configuration

SSL Context Service

Service supporting SSL communication with Kafka brokers

Transaction Isolation Level

Specifies how the service should handle transaction isolation levels when communicating with Kafka. The uncommitted option means that messages will be received as soon as they are written to Kafka, but they will be pulled even if the producer later cancels the transaction. The committed option configures the service to not receive any messages for which the producer’s transaction was canceled, but this can result in some latency since the consumer must wait for the producer to finish its entire transaction instead of pulling messages as they become available. Corresponds to the Kafka isolation.level property.

Max Poll Records

Maximum number of records Kafka should return in a single poll.

Client Timeout

Default timeout for Kafka client operations. Mapped to Kafka default.api.timeout.ms. The Kafka request.timeout.ms property is derived from half of the configured timeout

Max Metadata Wait Time

The amount of time publisher will wait to obtain metadata or wait for the buffer to flush during the 'send' call before failing the entire 'send' call. Corresponds to Kafka max.block.ms property

Acknowledgment Wait Time

After sending a message to Kafka, this indicates the amount of time that the service will wait for a response from Kafka. If Kafka does not acknowledge the message within this time period, the service will throw an exception.

Dynamic Properties

The name of a Kafka configuration property.

These properties will be added to the Kafka configuration after loading any provided configuration properties. If a dynamic property represents a property that was already set, its value will be ignored and a WARN message will be logged. For the list of available Kafka properties, please refer to: http://kafka.apache.org/documentation.html#configuration.

KerberosKeytabUserService

Provides a mechanism for creating a KerberosUser from a principal and keytab that other components are able to use in order to perform authentication using Kerberos. By encapsulating this information into a Controller Service and allowing other components to make use of it (as opposed to specifying the principal and keytab directly in the processor) an administrator is able to choose which users are allowed to use which keytabs and principals. This provides a more robust security model for multi-tenant use cases.

Tags: Kerberos, Keytab, Principal, Credentials, Authentication, Security

Properties

Kerberos Principal

Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties

Kerberos Keytab

Kerberos keytab associated with the principal.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

KerberosPasswordUserService

Provides a mechanism for creating a KerberosUser from a principal and password that other components are able to use in order to perform authentication using Kerberos.

Tags: Kerberos, Password, Principal, Credentials, Authentication, Security

Properties

Kerberos Principal

Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties

Kerberos Password

Kerberos password associated with the principal.

KerberosTicketCacheUserService

Provides a mechanism for creating a KerberosUser from a principal and ticket cache that other components are able to use in order to perform authentication using Kerberos. By encapsulating this information into a Controller Service and allowing other components to make use of it an administrator is able to choose which users are allowed to use which ticket caches and principals. This provides a more robust security model for multi-tenant use cases.

Tags: Kerberos, Ticket, Cache, Principal, Credentials, Authentication, Security

Properties

Kerberos Principal

Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties

Kerberos Ticket Cache File

Kerberos ticket cache associated with the principal.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

LoggingRecordSink

Provides a RecordSinkService that can be used to log records to the application log (nifi-app.log, e.g.) using the specified writer for formatting.

Tags: record, sink, log

Properties

Record Writer

Specifies the Controller Service to use for writing out the records.

Log Level

The Log Level at which to log records (INFO, DEBUG, e.g.)

MapCacheClientService

Provides the ability to communicate with a MapCacheServer. This can be used in order to share a Map between nodes in a NiFi cluster

Tags: distributed, cache, state, map, cluster

Properties

Server Hostname

The name of the server that is running the DistributedMapCacheServer service

Server Port

The port on the remote server that is to be used when communicating with the DistributedMapCacheServer service

SSL Context Service

If specified, indicates the SSL Context Service that is used to communicate with the remote server. If not specified, communications will not be encrypted

Communications Timeout

Specifies how long to wait when communicating with the remote server before determining that there is a communications failure if data cannot be sent or received

See Also

MapCacheServer

Provides a map (key/value) cache that can be accessed over a socket. Interaction with this service is typically accomplished via a Map Cache Client Service.

Tags: distributed, cluster, map, cache, server, key/value

Properties

Port

The port to listen on for incoming connections

Maximum Cache Entries

The maximum number of cache entries that the cache can hold

Eviction Strategy

Determines which strategy should be used to evict values from the cache to make room for new entries

Persistence Directory

If specified, the cache will be persisted in the given directory; if not specified, the cache will be in-memory only

SSL Context Service

If specified, this service will be used to create an SSL Context that will be used to secure communications; if not specified, communications will not be secure

Maximum Read Size

The maximum number of network bytes to read for a single cache item

MetroLineController

Connects Metro processors so that FlowFiles can be exchanged between them.

Tags: virtimo, metro

Properties

Query Rendezvous Time

The time that is waited by another processor while GetMetro is offered a query for FlowFiles.

Exchange Rendezvous Time

The maximum time to wait when exchanging data between threads.

MongoDBControllerService

Provides a controller service that configures a connection to MongoDB and provides access to that connection to other Mongo-related components.

Tags: mongo, mongodb, service

Properties

Mongo URI

MongoURI, typically of the form: mongodb://host1[:port1][,host2[:port2],…]

Database User

Database user name

Password

The password for the database user

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Client Auth

Client authentication policy when connecting to secure (TLS/SSL) cluster. Possible values are REQUIRED, WANT, NONE. This property is only used when an SSL Context has been defined and enabled.

Write Concern

The write concern to use

MongoDBLookupService

Provides a lookup service based around MongoDB. Each key that is specified will be added to a query as-is. For example, if you specify the two keys, user and email, the resulting query will be { "user": "tester", "email": "tester@test.com" }. The query is limited to the first result (findOne in the Mongo documentation). If no "Lookup Value Field" is specified then the entire MongoDB result document minus the _id field will be returned as a record.
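
As a small illustration (the document below and its "status" field are hypothetical, not part of this service's documentation), suppose the lookup is invoked with the keys user and email and the matched MongoDB document looks as follows. With no "Lookup Value Field" configured, the lookup returns this document as a record with the _id field removed; with "Lookup Value Field" set to "status", only the value "active" is returned.

{
  "_id": "507f1f77bcf86cd799439011",
  "user": "tester",
  "email": "tester@test.com",
  "status": "active"
}
json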

Tags: mongo, mongodb, lookup, record

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Client Service

A MongoDB controller service to use with this lookup service.

Mongo Database Name

The name of the database to use

Mongo Collection Name

The name of the collection to use

Lookup Value Field

The field whose value will be returned when the lookup key(s) match a record. If not specified then the entire MongoDB result document minus the _id field will be returned as a record.

Projection

Specifies a projection for limiting which fields will be returned.

Neo4JCypherClientService

Provides a client service for managing connections to a Neo4J 4.X or newer database. Configuration information for the Neo4J driver that corresponds to most of the settings for this service can be found here: https://neo4j.com/docs/driver-manual/current/client-applications/#driver-configuration. This service was created as a result of the break in driver compatibility between Neo4J 3.X and 4.X and might be renamed in the future if and when Neo4J should break driver compatibility between 4.X and a future release.

Tags: graph, neo4j, cypher

Properties

Neo4j Connection URL

Neo4J endpoint to connect to.

Username

Username for accessing Neo4J

Password

Password for the Neo4J user. A dummy non-blank password is required even if authentication is disabled on the server.

Neo4J Max Connection Time Out (seconds)

The maximum time for establishing a connection to the Neo4j server.

Neo4J Max Connection Pool Size

The maximum connection pool size for Neo4j.

Neo4J Max Connection Acquisition Timeout

The maximum connection acquisition timeout.

Neo4J Idle Time Before Connection Test

The idle time before connection test.

Neo4J Max Connection Lifetime

The maximum connection lifetime

SSL Trust Chain PEM

Neo4J requires trust chains to be stored in a PEM file. If you want to use a custom trust chain rather than defaulting to the system trust chain, specify the path to a PEM file with the trust chain.

ParquetReader

Parses Parquet data and returns each Parquet record as a separate Record object. The schema will come from the Parquet data itself.

Tags: parquet, parse, record, row, reader

Properties

Avro Read Compatibility

Specifies the value for 'parquet.avro.compatible' in the underlying Parquet library

ParquetRecordSetWriter

Writes the contents of a RecordSet in Parquet format.

Tags: parquet, result, set, writer, serializer, record, recordset, row

Properties

Schema Write Strategy

Specifies how the schema for a Record should be added to the data.

Schema Cache

Specifies a Schema Cache to add the Record Schema to so that Record Readers can quickly lookup the schema.

Schema Reference Writer

Service implementation responsible for writing FlowFile attributes or content header with Schema reference information

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Cache Size

Specifies how many Schemas should be cached

Compression Type

The type of compression for the file being written.

Row Group Size

The row group size used by the Parquet writer. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Page Size

The page size used by the Parquet writer. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Dictionary Page Size

The dictionary page size used by the Parquet writer. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Max Padding Size

The maximum amount of padding that will be used to align row groups with blocks in the underlying filesystem. If the underlying filesystem is not a block filesystem like HDFS, this has no effect. The value is specified in the format of <Data Size> <Data Unit> where Data Unit is one of B, KB, MB, GB, TB.

Enable Dictionary Encoding

Specifies whether dictionary encoding should be enabled for the Parquet writer

Enable Validation

Specifies whether validation should be enabled for the Parquet writer

Writer Version

Specifies the version used by Parquet writer

Avro Write Old List Structure

Specifies the value for 'parquet.avro.write-old-list-structure' in the underlying Parquet library

Avro Add List Element Records

Specifies the value for 'parquet.avro.add-list-element-records' in the underlying Parquet library

INT96 Fields

List of fields with full path that should be treated as INT96 timestamps.

PEMEncodedSSLContextProvider

SSLContext Provider configurable using PEM Private Key and Certificate files.
Supports PKCS1 and PKCS8 encoding for Private Keys as well as X.509 encoding for Certificates.

Tags: PEM, SSL, TLS, Key, Certificate, PKCS1, PKCS8, X.509, ECDSA, Ed25519, RSA

Properties

TLS Protocol

TLS protocol version required for negotiating encrypted communications.

Private Key Source

Source of information for loading Private Key and Certificate Chain

Private Key

PEM Private Key encoded using either PKCS1 or PKCS8. Supported algorithms include ECDSA, Ed25519, and RSA

Private Key Location

PEM Private Key file location encoded using either PKCS1 or PKCS8. Supported algorithms include ECDSA, Ed25519, and RSA

Certificate Chain

PEM X.509 Certificate Chain associated with Private Key starting with standard BEGIN CERTIFICATE header

Certificate Chain Location

PEM X.509 Certificate Chain file location associated with Private Key starting with standard BEGIN CERTIFICATE header

Certificate Authorities Source

Source of information for loading trusted Certificate Authorities

Certificate Authorities

PEM X.509 Certificate Authorities trusted for verifying peers in TLS communications containing one or more standard certificates

PropertiesFileLookupService

A reloadable properties file-based lookup service

Tags: lookup, cache, enrich, join, properties, reloadable, key, value

Properties

Configuration File

A configuration file

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

ProtobufReader

Parses a Protocol Buffers message from binary format.

Tags: protobuf, record, reader, parser

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Proto Directory

Directory containing Protocol Buffers message definition (.proto) file(s).

Message Type

Fully qualified name of the Protocol Buffers message type including its package (eg. mypackage.MyMessage). The .proto files configured in 'Proto Directory' must contain the definition of this message type.

Additional Details

The ProtobufReader Controller Service reads and parses a Protocol Buffers Message from binary format and creates a Record object. The Controller Service must be configured with the same ‘.proto’ file that was used for the Message encoding, and the fully qualified name of the Message type including its package (e.g. mypackage.MyMessage). The Reader will always generate one record from the input data which represents the provided Protocol Buffers Message type. Further information about Protocol Buffers can be found here: protobuf.dev

Data Type Mapping

When a record is parsed from incoming data, the Controller Service maps the Proto Message field types to the corresponding NiFi data types. The mapping between the provided Message fields and the encoded input is always based on the field tag numbers. When a field is defined as ‘repeated’, its data type will be an array whose elements have the originally specified type. The following tables show which proto field type will correspond to which NiFi field type after the conversion.

Scalar Value Types
Proto Type   Proto Wire Type     NiFi Data Type
double       fixed64             double
float        fixed32             float
int32        varint              int
int64        varint              long
uint32       varint              long
uint64       varint              bigint
sint32       varint              long
sint64       varint              long
fixed32      fixed32             long
fixed64      fixed64             bigint
sfixed32     fixed32             int
sfixed64     fixed64             long
bool         varint              boolean
string       length delimited    string
bytes        length delimited    array[byte]

Composite Value Types
Proto Type   Proto Wire Type     NiFi Data Type
message      length delimited    record
enum         varint              enum
map          length delimited    map
oneof        -                   choice

Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field will be stored in the Record’s value list with its original type. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

  • Any data type can be coerced into a String type.

  • Any numeric data type (Int, Long, Float, Double) can be coerced into any other numeric data type.

  • Any numeric value can be coerced into a Date, Time, or Timestamp type, by assuming that the Long value is the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • A String value can be coerced into a Date, Time, or Timestamp type, if its format matches the configured “Date Format,” “Time Format,” or “Timestamp Format.”

  • A String value can be coerced into a numeric value if the value is of the appropriate type. For example, the String value 8 can be coerced into any numeric type. However, the String value 8.2 can be coerced into a Double or Float type but not an Integer.

  • A String value of “true” or “false” (regardless of case) can be coerced into a Boolean value.

  • A String value that is not empty can be coerced into a Char type. If the String contains more than 1 character, the first character is used and the rest of the characters are ignored.

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.
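
As a brief illustration (hypothetical field values, assuming a schema that declares id as an int, active as a boolean, and price as a double), the incoming String values "8", "TRUE", and "8.2" would be coerced into:

{
  "id": 8,
  "active": true,
  "price": 8.2
}
json

By contrast, the String value "8.2" arriving for the int field id could not be coerced, and an Exception would be thrown.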

Schema Access Strategy

Besides the common Schema Access strategies, such as getting the schema from a property value or from a Schema Registry, the ProtobufReader Controller Service offers another access strategy option called “Generate from Proto file”. When using this strategy, the Reader will generate the Record Schema from the provided ‘.proto’ file and Message type. This strategy is recommended when the user does not want to create the schema manually or when no type coercion is needed.
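
As an illustrative sketch (the Person message below is a hypothetical example, not taken from the bundled documentation), a definition such as message Person { string name = 1; int32 id = 2; repeated string emails = 3; } would, under the “Generate from Proto file” strategy and the mapping tables above, produce a Record Schema equivalent to the following Avro schema:

{
  "namespace": "nifi",
  "name": "Person",
  "type": "record",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "emails",
      "type": {
        "type": "array",
        "items": "string"
      }
    }
  ]
}
json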

Protobuf Any Field Type

Protocol Buffers offers further Message types called Well-Known Types. These are additionally provided messages that define complex structured types and wrappers for scalar types. The Any type is one of these Well-Known Types; it is used to store an arbitrary serialized Message along with a URL that describes the type of the serialized Message. Since the Message type and the embedded Message are available only once the Any Message has been populated with data, the ProtobufReader needs to do this Message processing at data conversion time. The Reader is capable of generating a schema for the embedded Message in the Any field and replacing it in the resulting Record schema.

Example

There is a Message called ‘TestMessage’ which has only one field, an Any typed field. There is another Message called ‘NestedMessage’ that we would like to store as a serialized Message in the value of ‘anyField’.

message Any {
    string type_url = 1;
    bytes value = 2;
}

message TestMessage {
    google.protobuf.Any anyField = 3;
}

message NestedMessage {
    string field_1 = 1;
    string field_2 = 2;
    string field_3 = 3;
}

proto

With normal data conversion our result would look like this:

{
  anyField: {
    type_url: "type.googleapis.com/NestedMessage"
    value: [
      84,
      101,
      115,
      116,
      32,
      98,
      121,
      116,
      101,
      115
    ]
  }
}
json

Result after the Protobuf Reader replaces the Any Message’s fields with the processed embedded Message:

{
  anyField: {
    field_1: "value 1",
    field_2: "value 2",
    field_3: "value 3"
  }
}
json

ReaderLookup

Provides a RecordReaderFactory that can be used to dynamically select another RecordReaderFactory. This will allow multiple RecordReaderFactories to be defined and registered, and then selected dynamically at runtime by referencing a FlowFile attribute in the Service to Use property.

Tags: lookup, parse, record, row, reader

Properties

Service to Use

Specifies the name of the user-defined property whose associated Controller Service should be used.

Dynamic Properties

Name of the RecordReader

RecordSetWriterLookup

Provides a RecordSetWriterFactory that can be used to dynamically select another RecordSetWriterFactory. This will allow multiple RecordSetWriterFactories to be defined and registered, and then selected dynamically at runtime by tagging FlowFiles with the attributes and referencing those attributes in the Service to Use property.

Tags: lookup, result, set, writer, serializer, record, recordset, row

Properties

Service to Use

Specifies the name of the user-defined property whose associated Controller Service should be used.

Dynamic Properties

Name of the RecordSetWriter

See Also

RecordSinkServiceLookup

Provides a RecordSinkService that can be used to dynamically select another RecordSinkService. This service requires an attribute named 'record.sink.name' to be passed in when asking for a connection, and will throw an exception if the attribute is missing. The value of 'record.sink.name' will be used to select the RecordSinkService that has been registered with that name. This will allow multiple RecordSinkServices to be defined and registered, and then selected dynamically at runtime by tagging flow files with the appropriate 'record.sink.name' attribute. Note that this controller service is not intended for use in reporting tasks that employ RecordSinkService instances, such as QueryNiFiReportingTask.

Tags: record, sink, lookup

Properties

Dynamic Properties

The name to register the specified RecordSinkService

If the 'record.sink.name' attribute matches the name of a dynamic property, then the RecordSinkService registered as that property's value will be selected.
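
For example (a hypothetical configuration, not a default of this service), a dynamic property named "slack" could point at a SlackRecordSink and a property named "log" at a LoggingRecordSink; a FlowFile carrying the attribute below would then be routed to the Slack sink:

{
  "record.sink.name": "slack"
}
json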

RedisConnectionPoolService

A service that provides connections to Redis.

Tags: redis, cache

Properties

Redis Mode

The type of Redis being communicated with - standalone, sentinel, or clustered.

Connection String

The connection string for Redis. In a standalone instance this value will be of the form hostname:port. In a sentinel instance this value will be the comma-separated list of sentinels, such as host1:port1,host2:port2,host3:port3. In a clustered instance this value will be the comma-separated list of cluster masters, such as host1:port,host2:port,host3:port.

Database Index

The database index to be used by connections created from this connection pool. See the 'databases' property in redis.conf; by default, databases 0-15 are available.

Communication Timeout

The timeout to use when attempting to communicate with Redis.

Cluster Max Redirects

The maximum number of redirects that can be performed when clustered.

Sentinel Master

The name of the sentinel master; required when Redis Mode is set to Sentinel.

Username

The username used to authenticate to the Redis server.

Password

The password used to authenticate to the Redis server. See the 'requirepass' property in redis.conf.

Sentinel Username

The username used to authenticate to the Redis sentinel server.

Sentinel Password

The password used to authenticate to the Redis Sentinel server. See the 'requirepass' and 'sentinel sentinel-pass' properties in sentinel.conf.

SSL Context Service

If specified, this service will be used to create an SSL Context that will be used to secure communications; if not specified, communications will not be secure

Pool - Max Total

The maximum number of connections that can be allocated by the pool (checked out to clients, or idle awaiting checkout). A negative value indicates that there is no limit.

Pool - Max Idle

The maximum number of idle connections that can be held in the pool, or a negative value if there is no limit.

Pool - Min Idle

The target for the minimum number of idle connections to maintain in the pool. If the configured value of Min Idle is greater than the configured value for Max Idle, then the value of Max Idle will be used instead.

Pool - Block When Exhausted

Whether or not clients should block and wait when trying to obtain a connection from the pool when the pool has no available connections. Setting this to false means an error will occur immediately when a client requests a connection and none are available.

Pool - Max Wait Time

The amount of time to wait for an available connection when Block When Exhausted is set to true.

Pool - Min Evictable Idle Time

The minimum amount of time an object may sit idle in the pool before it is eligible for eviction.

Pool - Time Between Eviction Runs

The amount of time between attempting to evict idle connections from the pool.

Pool - Num Tests Per Eviction Run

The number of connections to test per eviction run. A negative value indicates that all connections should be tested.

Pool - Test On Create

Whether or not connections should be tested upon creation.

Pool - Test On Borrow

Whether or not connections should be tested upon borrowing from the pool.

Pool - Test On Return

Whether or not connections should be tested upon returning to the pool.

Pool - Test While Idle

Whether or not connections should be tested while idle.

RedisDistributedMapCacheClientService

An implementation of DistributedMapCacheClient that uses Redis as the backing cache. This service relies on the WATCH, MULTI, and EXEC commands in Redis, which are not fully supported when Redis is clustered. As a result, this service can only be used with a Redis Connection Pool that is configured for standalone or sentinel mode. Sentinel mode can be used to provide high-availability configurations.

Tags: redis, distributed, cache, map

Properties

Redis Connection Pool

TTL

Indicates how long the data should exist in Redis. Setting '0 secs' would mean the data would exist forever

RestLookupService

Use a REST service to look up values.

Tags: rest, lookup, json, xml, http

Properties

URL

The URL for the REST endpoint. Expression language is evaluated against the lookup key/value pairs, not flowfile attributes.

Record Reader

The record reader to use for loading the payload and handling it as a record set.

Record Path

An optional record path that can be used to define where in a record to get the real data to merge into the record set to be enriched. See documentation for examples of when this might be useful.

Response Handling Strategy

Whether to return all responses or throw errors for unsuccessful HTTP status codes.

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Authentication Strategy

Authentication strategy to use with REST service.

OAuth2 Access Token Provider

Enables managed retrieval of OAuth2 Bearer Token applied to HTTP requests using the Authorization Header.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP, SOCKS + AuthN. In case of SOCKS, it is not guaranteed that the selected SOCKS Version will be used by the processor.

Basic Authentication Username

The username to be used by the client to authenticate against the Remote URL. Cannot include control characters (0-31), ':', or DEL (127).

Basic Authentication Password

The password to be used by the client to authenticate against the Remote URL.

Use Digest Authentication

Whether to communicate with the website using Digest Authentication. 'Basic Authentication Username' and 'Basic Authentication Password' are used for authentication.

Connection Timeout

Max wait time for connection to remote service.

Read Timeout

Max wait time for response from remote service.

Dynamic Properties

*

All dynamic properties are added as HTTP headers with the name as the header name and the value as the header value.

Additional Details

General

This lookup service has the following optional lookup coordinate keys:

  • request.method; defaults to ‘get’, valid values:

    • delete

    • get

    • post

    • put

  • request.body; contains a string representing JSON, XML, etc. to be sent with any of those methods except for “get”.

  • mime.type; specifies media type of the request body, required when ‘body’ is passed.

  • *; any other keys can be configured to pass variables to resolve target URLs. See ‘Dynamic URLs’ section below.
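
For illustration (the key names come from the list above; the body content and values are hypothetical), a lookup that performs a POST could be invoked with coordinates such as:

{
  "request.method": "post",
  "mime.type": "application/json",
  "request.body": "{\"user\": \"john.smith\"}"
}
json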

The record reader is used to consume the response of the REST service call and turn it into one or more records. The record path property is provided to allow for a lookup path to either a nested record or a single point deep in the REST response. Note: a valid schema must be built that encapsulates the REST response accurately in order for this service to work.

Headers

Headers are supported using dynamic properties. Just add a dynamic property and the name will be the header name and the value will be the value for the header. Expression language powered by input from the variable registry is supported.

Dynamic URLs

The URL property supports expression language through the lookup key/value pairs configured on the component using this lookup service (e.g. LookupRecord processor). The configuration specified by the user will be passed through to the expression language engine for evaluation. Note: flowfile attributes will be disregarded here for this property.

Ex. URL: http://example.com/service/${user.name}/friend/${friend.id}, combined with example record paths at the LookupRecord processor:

  • user.name ⇒ “/example/username”

  • friend.id ⇒ “/example/first_friend”

Would dynamically produce an endpoint of http://example.com/service/john.smith/friend/12345

Using Environment Properties with URLs

In addition to the lookup key/value pairs, environment properties / system variables can be referred from expression languages configured at the URL property.

Ex. URL: http://${apiServerHostname}:${apiServerPort}/service/${user.name}/friend/${friend.id}, combined with the previous example record paths, and environment properties:

  • apiServerHostname ⇒ “test.example.com”

  • apiServerPort ⇒ “8080”

Would dynamically produce an endpoint of http://test.example.com:8080/service/john.smith/friend/12345

S3FileResourceService

Provides an Amazon Web Services (AWS) S3 file resource for other components.

Use Cases

Fetch a specific file from S3. The service provides higher performance compared to fetch processors when the data should be moved between different storage systems without any transformation.

Input Requirement: This component allows an incoming relationship.

  1. "Bucket" = "${s3.bucket}"

  2. "Object Key" = "${filename}" .

  3. The "Region" property must be set to denote the S3 region that the Bucket resides in. .

  4. The "AWS Credentials Provider Service" property should specify an instance of the AWSCredentialsProviderService in order to provide credentials for accessing the bucket. .

Tags: Amazon, S3, AWS, file, resource

Properties

Bucket

The S3 Bucket to interact with

Object Key

The S3 Object Key to use. This is analogous to a filename for traditional file systems.

Region

The AWS Region to connect to.

AWS Credentials Provider Service

The Controller Service that is used to obtain AWS credentials provider

See Also

ScriptedLookupService

Allows the user to provide a scripted LookupService instance in order to enrich records from an incoming flow file.

Tags: lookup, record, script, invoke, groovy

Properties

Script Engine

No Script Engines found

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Dynamic Properties

Script Engine Binding property

Updates a script engine property specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

ScriptedReader

Allows the user to provide a scripted RecordReaderFactory instance in order to read/parse/generate records from an incoming flow file.

Tags: record, recordFactory, script, invoke, groovy

Properties

Script Engine

No Script Engines found

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Dynamic Properties

Script Engine Binding property

Updates a script engine property specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

ScriptedRecordSetWriter

Allows the user to provide a scripted RecordSetWriterFactory instance in order to write records to an outgoing flow file.

Tags: record, writer, script, invoke, groovy

Properties

Script Engine

No Script Engines found

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Dynamic Properties

Script Engine Binding property

Updates a script engine property specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

ScriptedRecordSink

Allows the user to provide a scripted RecordSinkService instance in order to transmit records to the desired target. The script must set a variable 'recordSink' to an implementation of RecordSinkService.

Tags: record, record sink, script, invoke, groovy

Properties

Script Engine

No Script Engines found

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Dynamic Properties

Script Engine Binding property

Updates a script engine property specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

SetCacheClientService

Provides the ability to communicate with a SetCacheServer. This can be used in order to share a Set between nodes in a NiFi cluster

Tags: distributed, cache, state, set, cluster

Properties

Server Hostname

The name of the server that is running the DistributedSetCacheServer service

Server Port

The port on the remote server that is to be used when communicating with the DistributedSetCacheServer service

SSL Context Service

If specified, indicates the SSL Context Service that is used to communicate with the remote server. If not specified, communications will not be encrypted

Communications Timeout

Specifies how long to wait when communicating with the remote server before determining that there is a communications failure if data cannot be sent or received

See Also

SetCacheServer

Provides a set (collection of unique values) cache that can be accessed over a socket. Interaction with this service is typically accomplished via a DistributedSetCacheClient service.

Tags: distributed, set, distinct, cache, server

Properties

Port

The port to listen on for incoming connections

Maximum Cache Entries

The maximum number of cache entries that the cache can hold

Eviction Strategy

Determines which strategy should be used to evict values from the cache to make room for new entries

Persistence Directory

If specified, the cache will be persisted in the given directory; if not specified, the cache will be in-memory only

SSL Context Service

If specified, this service will be used to create an SSL Context that will be used to secure communications; if not specified, communications will not be secure

Maximum Read Size

The maximum number of network bytes to read for a single cache item

SftpServerController

(TECHNOLOGY PREVIEW) Starts an SFTP server that clients can use to read/write files. Commands made by clients are routed to processors, where they are converted to FlowFiles.

Properties

Hostname

The Hostname to bind to. If not specified, will bind to all hosts

Listening Port

The Port to listen on for incoming SFTP connections

Request Expiration

Specifies how long a request should be left unanswered before being evicted from the cache and being responded to with a Service Unavailable status code

Maximum Thread Pool Size

The maximum number of threads to be used by the embedded Jetty server. The value can be set between 8 and 1000. The value of this property affects the performance of the flows and the operating system, therefore the default value should only be changed in justified cases. A value that is less than the default value may be suitable if only a small number of HTTP clients connect to the server. A greater value may be suitable if a large number of HTTP clients are expected to make requests to the server simultaneously.

SSL Context Service

The SSL Context Service used to secure the server. If the SSL Context Service has a Truststore, clients can authenticate using their private key.

Basic Authentication Username

If set, a client can authenticate using this username.

Basic Authentication Password

If set, a client can authenticate using this password.

SimpleCsvFileLookupService

A reloadable CSV file-based lookup service. The first line of the CSV file is considered to be the header.

Tags: lookup, cache, enrich, join, csv, reloadable, key, value

Properties

CSV File

Path to a CSV File in which the key value pairs can be looked up.

CSV Format

Specifies which "format" the CSV data is in, or specifies if custom formatting should be used.

Character Set

The Character Encoding that is used to decode the CSV file.

Lookup Key Column

The field in the CSV file that will serve as the lookup key. This is the field that will be matched against the property specified in the lookup processor.

Ignore Duplicates

Ignore duplicate keys for records in the CSV file.

Value Separator

The character that is used to separate values/fields in a CSV Record. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Value Separator at runtime, then it will be skipped and the default Value Separator will be used.

Quote Character

The character that is used to quote values so that escape characters do not have to be used. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Quote Character at runtime, then it will be skipped and the default Quote Character will be used.

Quote Mode

Specifies how fields should be quoted when they are written

Comment Marker

The character that is used to denote the start of a comment. Any line that begins with this character will be ignored.

Escape Character

The character that is used to escape characters that would otherwise have a specific meaning to the CSV Parser. If the property has been specified via Expression Language but the expression gets evaluated to an invalid Escape Character at runtime, then it will be skipped and the default Escape Character will be used. Setting it to an empty string means no escape character should be used.

Trim Fields

Whether or not white space should be removed from the beginning and end of fields

Lookup Value Column

Lookup value column.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

SimpleDatabaseLookupService

A relational-database-based lookup service. When the lookup key is found in the database, the specified lookup value column is returned. Only one value will be returned for each lookup; duplicate database entries are ignored.

Tags: lookup, cache, enrich, join, rdbms, database, reloadable, key, value

Properties

Database Connection Pooling Service

The Controller Service that is used to obtain connection to database

Table Name

The name of the database table to be queried. Note that this may be case-sensitive depending on the database.

Lookup Key Column

The column in the table that will serve as the lookup key. This is the column that will be matched against the property specified in the lookup processor. Note that this may be case-sensitive depending on the database.

Lookup Value Column

The column whose value will be returned when the Lookup value is matched

Cache Size

Specifies how many lookup values/records should be cached. The cache is shared for all tables and keeps a map of lookup values to records. Setting this property to zero means no caching will be done and the table will be queried for each lookup value in each record. If the lookup table changes often or the most recent data must be retrieved, do not use the cache.

Clear Cache on Enabled

Whether to clear the cache when this service is enabled. If the Cache Size is zero then this property is ignored. Clearing the cache when the service is enabled ensures that the service will first go to the database to get the most recent data.

Cache Expiration

Time interval to clear all cache entries. If the Cache Size is zero then this property is ignored.

SimpleKeyValueLookupService

Allows users to add key/value pairs as User-defined Properties. Each property that is added can be looked up by Property Name. The coordinates that are passed to the lookup must contain the key 'key'.

Tags: lookup, enrich, key, value

Properties

Dynamic Properties

A key that can be looked up

Allows users to add key/value pairs as User-defined Properties. Each property that is added can be looked up by Property Name. The coordinates that are passed to the lookup must contain the key 'key'.
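
For example (a hypothetical property), if a user-defined property named "region" is added with the value "eu-west-1", then a lookup invoked with the following coordinates returns "eu-west-1":

{
  "key": "region"
}
json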

SimpleRedisDistributedMapCacheClientService

An implementation of DistributedMapCacheClient that uses Redis as the backing cache. This service is intended to be used when a non-atomic DistributedMapCacheClient is required.

Tags: redis, distributed, cache, map

Properties

Redis Connection Pool

TTL

Indicates how long the data should exist in Redis. Setting '0 secs' would mean the data would exist forever

SimpleScriptedLookupService

Allows the user to provide a scripted LookupService instance in order to enrich records from an incoming flow file. The script is expected to return an optional string value rather than an arbitrary object (e.g. a record). Also, the scripted lookup service should implement StringLookupService; otherwise the getValueType() method must be implemented, even though it will be ignored, as SimpleScriptedLookupService returns String as the value type on the script’s behalf.

Tags: lookup, script, invoke, groovy

Properties

Script Engine

No Script Engines found

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Dynamic Properties

Script Engine Binding property

Updates a script engine property specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

SiteToSiteReportingRecordSink

Provides a service to write records using a configured RecordSetWriter over a Site-to-Site connection.

Tags: db, s2s, site, record

Properties

Record Writer

Specifies the Controller Service to use for writing out the records.

Destination URL

The URL of the destination NiFi instance or, if clustered, a comma-separated list of addresses in the format http(s)://host:port/nifi. This destination URL will only be used to initiate the Site-to-Site connection. The data sent by this reporting task will be load-balanced on all the nodes of the destination (if clustered).

Input Port Name

The name of the Input Port to deliver data to.

SSL Context Service

The SSL Context Service to use when communicating with the destination. If not specified, communications will not be secure.

Instance URL

The URL of this instance to use in the Content URI of each event.

Compress Events

Indicates whether or not to compress the data being sent.

Communications Timeout

Specifies how long to wait for a response from the destination before deciding that an error has occurred and canceling the transaction

Batch Size

Specifies how many records to send in a single batch, at most.

Transport Protocol

Specifies which transport protocol to use for Site-to-Site communication.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

SlackRecordSink

Format and send Records to a configured Channel using the Slack Post Message API. The service requires a Slack App with a Bot User configured for access to a Slack workspace. The Bot User OAuth Bearer Token is required for posting messages to Slack.

Tags: slack, record, sink

Properties

API URL

Slack Web API URL for posting text messages to channels. It only needs to be changed if Slack changes its API URL.

Access Token

Bot OAuth Token used for authenticating and authorizing the Slack request sent by NiFi.

Channel ID

Slack channel, private group, or IM channel to send the message to. Use Channel ID instead of the name.

Input Character Set

Specifies the character set of the records used to generate the Slack message.

Record Writer

Specifies the Controller Service to use for writing out the records.

Web Service Client Provider

Controller service to provide HTTP client for communicating with Slack API

SmbjClientProviderService

Provides access to SMB Sessions with shared authentication credentials.

Tags: samba, smb, cifs, files

Properties

Hostname

The network host of the SMB file server.

Port

Port to use for connection.

Share

The network share from which files should be listed. This is the "first folder" after the hostname: smb://hostname:port/[share]/dir1/dir2

Username

The username used for authentication.

Password

The password used for authentication.

Domain

The domain used for authentication. Optional; in most cases the username and password are sufficient.

SMB Dialect

The SMB dialect is negotiated between the client and the server by default to the highest common version supported by both ends. In some rare cases, the client-server communication may fail with the automatically negotiated dialect. This property can be used to set the dialect explicitly (e.g. to downgrade to a lower version) when such situations occur.

Use Encryption

Turns on/off encrypted communication between the client and the server. The property’s behavior is SMB dialect dependent: SMB 2.x does not support encryption and the property has no effect. In case of SMB 3.x, it is a hint/request to the server to turn encryption on if the server also supports it.

Enable DFS

Enables accessing Distributed File System (DFS) and following DFS links during SMB operations.

Timeout

Timeout for read and write operations.

SnowflakeComputingConnectionPool

Provides Snowflake Connection Pooling Service. Connections can be asked from pool and returned after usage.

Tags: snowflake, dbcp, jdbc, database, connection, pooling, store

Properties

Connection URL Format

The format of the connection URL.

Snowflake URL

Example connection string: jdbc:snowflake://[account].[region].snowflakecomputing.com/?[connection_params] The connection parameters can include db=DATABASE_NAME to avoid using qualified table names such as DATABASE_NAME.PUBLIC.TABLE_NAME

Account Locator

Snowflake account locator to use for connection.

Cloud Region

Snowflake cloud region to use for connection.

Cloud Type

Snowflake cloud type to use for connection.

Organization Name

Snowflake organization name to use for connection.

Account Name

Snowflake account name to use for connection.

Username

The Snowflake user name.

Password

The password for the Snowflake user.

Database

The database to use by default. The same as passing 'db=DATABASE_NAME' to the connection string.

Schema

The schema to use by default. The same as passing 'schema=SCHEMA' to the connection string.

Warehouse

The warehouse to use by default. The same as passing 'warehouse=WAREHOUSE' to the connection string.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests.

Validation query

Validation query used to validate connections before returning them. When a connection is invalid, it is dropped and a new valid connection will be returned. Note that using validation may incur a performance penalty.

Max Wait Time

The maximum amount of time that the pool will wait (when there are no available connections) for a connection to be returned before failing, or -1 to wait indefinitely.

Max Total Connections

The maximum number of active connections that can be allocated from this pool at the same time, or negative for no limit.

Minimum Idle Connections

The minimum number of connections that can remain idle in the pool without extra ones being created. Set to zero to allow no idle connections.

Max Idle Connections

The maximum number of connections that can remain idle in the pool without extra ones being released. Set to any negative value to allow unlimited idle connections.

Max Connection Lifetime

The maximum lifetime of a connection. After this time is exceeded the connection will fail the next activation, passivation or validation test. A value of zero or less means the connection has an infinite lifetime.

Time Between Eviction Runs

The time period to sleep between runs of the idle connection evictor thread. When non-positive, no idle connection evictor thread will be run.

Minimum Evictable Idle Time

The minimum amount of time a connection may sit idle in the pool before it is eligible for eviction.

Soft Minimum Evictable Idle Time

The minimum amount of time a connection may sit idle in the pool before it is eligible for eviction by the idle connection evictor, with the extra condition that at least a minimum number of idle connections remain in the pool. When the not-soft version of this option is set to a positive value, it is examined first by the idle connection evictor: when idle connections are visited by the evictor, idle time is first compared against it (without considering the number of idle connections in the pool) and then against this soft option, including the minimum idle connections constraint.

Dynamic Properties

JDBC property name

Snowflake JDBC driver property name and value applied to JDBC connections.

StandardAsanaClientProviderService

Common service to authenticate with Asana, and to work on a specified workspace.

Tags: asana, service, authentication

Properties

API URL

Base URL of the Asana API. Leave it at the default unless you have your own Asana instance served at a different URL (typical for on-premise installations).

Personal Access Token

Similarly to entering your username/password into a website, when you access your Asana data via the API you need to authenticate. Personal Access Token (PAT) is an authentication mechanism for accessing the API. You can generate a PAT from the Asana developer console. Refer to Asana Authentication Quick Start for detailed instructions on getting started.

Workspace

Specify which Asana workspace to use. Case sensitive. A workspace is the highest-level organizational unit in Asana. All projects and tasks have an associated workspace. An organization is a special kind of workspace that represents a company. In an organization, you can group your projects into teams.

StandardAzureCredentialsControllerService

Provide credentials to use with an Azure client.

Tags: azure, security, credentials, provider, session

Properties

Credential Configuration Strategy

Managed Identity Client ID

Client ID of the managed identity. The property is required when User Assigned Managed Identity is used for authentication. It must be empty in case of System Assigned Managed Identity.

StandardDatabaseDialectService

Database Dialect Service supporting ANSI SQL.
Supported Statement Types: ALTER, CREATE, SELECT

Tags: Relational, Database, JDBC, SQL

Properties

StandardDropboxCredentialService

Defines credentials for Dropbox processors.

Tags: dropbox, credentials, provider

Properties

App Key

App Key of the user’s Dropbox app. See Additional Details for more information.

App Secret

App Secret of the user’s Dropbox app. See Additional Details for more information.

Access Token

Access Token of the user’s Dropbox app. See Additional Details for more information about Access Token generation.

Refresh Token

Refresh Token of the user’s Dropbox app. See Additional Details for more information about Refresh Token generation.

Additional Details

Generating credentials for Dropbox authentication

StandardDropboxCredentialService requires “App Key”, “App Secret”, “Access Token” and “Refresh Token”.

This document describes how to generate these credentials using an existing Dropbox account.

Generate App Key and App Secret
  • Login with your Dropbox account.

  • If you already have an app created, go to Dropbox Developers page, click on “App Console” button and select your app. On the app’s info page you will find the “App key” and “App secret”.
    (See also Dropbox Getting Started, App Console tab, “Navigating the App Console” chapter.)

  • If you don’t have any apps, go to Dropbox Developers page and click on “Create app” button. (See also Dropbox Getting Started, App Console tab, “Creating a Dropbox app” chapter.)

    • On the next page select “Scoped access” and “Full Dropbox” as access type.

    • Provide a name for your app.

  • On the app’s info page you will find the “App key” and “App secret”. (See also Dropbox Getting Started, App Console tab, “Navigating the App Console” chapter.)

Set required permissions for your app

The “files.content.read” permission has to be enabled for the application to be able to read the files in Dropbox.

You can set permissions on the Dropbox Developers page.

  • Click on “App Console” button and select your app.

  • Go to “Permissions” tab and enable the “files.content.read” permission.

  • Click “Submit” button.

  • NOTE: In case you already have an Access Token and Refresh Token, those tokens have to be regenerated after the permission change. See “Generate Access Token and Refresh Token” chapter about token generation.

Generate Access Token and Refresh Token
  • Go to the following web page:

  • Click “Next” and click on “Allow” button on the next page.

  • An access code will be generated for you and displayed on the next page:

    “Access Code Generated. Enter this code into your_app_name to finish the process: your_generated_access_code”

  • Execute the following command from terminal to fetch the access and refresh tokens.

    Make sure you execute the curl command right after the access code generation, since the code expires very quickly.
    If the curl command returns an “invalid grant” error, generate a new access code (see the previous step).

    curl https://api.dropbox.com/oauth2/token -d code=your_generated_access_code -d grant_type=authorization_code -u your_app_key:your_app_secret

  • The curl command returns a JSON response which contains the “access_token” and “refresh_token”:

{
  "access_token": "sl.xxxxxxxxxxx"
  "expires_in": 14400,
  "refresh_token": "xxxxxx",
  "scope": "files.content.read files.metadata.read",
  "uid": "xxxxxx",
  "account_id": "dbid:xxxx"
}
json
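
If you later want to verify the Refresh Token outside NiFi, a request along the following lines (a minimal sketch against the same Dropbox token endpoint, using the placeholder values from above) should return a fresh short-lived access token:

curl https://api.dropbox.com/oauth2/token -d grant_type=refresh_token -d refresh_token=your_refresh_token -u your_app_key:your_app_secret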

StandardFileResourceService

Provides a file resource for other components. The file needs to be locally accessible to NiFi (e.g. local disk or mounted storage), and NiFi needs read permission to the file.

Tags: file, resource

Properties

File Path

Path to a file that can be accessed locally.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

StandardHashiCorpVaultClientService

A controller service for interacting with HashiCorp Vault.

Tags: hashicorp, vault, client

Properties

Configuration Strategy

Specifies the source of the configuration properties.

Vault URI

The URI of the HashiCorp Vault server (e.g., http://localhost:8200). Required if not specified in the Bootstrap HashiCorp Vault Configuration File.

Vault Authentication

Vault authentication method, as described in the Spring Vault Environment Configuration documentation (https://docs.spring.io/spring-vault/docs/2.3.x/reference/html/#vault.core.environment-vault-configuration).

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections to the HashiCorp Vault server.

Vault Properties Files

A comma-separated list of files containing HashiCorp Vault configuration properties, as described in the Spring Vault Environment Configuration documentation (https://docs.spring.io/spring-vault/docs/2.3.x/reference/html/#vault.core.environment-vault-configuration). All of the Spring property keys and authentication-specific property keys are supported.

Connection Timeout

The connection timeout for the HashiCorp Vault client

Read Timeout

The read timeout for the HashiCorp Vault client

Dynamic Properties

A Spring Vault configuration property name

Allows any Spring Vault property keys to be specified, as described in (https://docs.spring.io/spring-vault/docs/2.3.x/reference/html/#vault.core.environment-vault-configuration). See Additional Details for more information.

Additional Details

Configuring the Bootstrap HashiCorp Vault Configuration File

The ./conf/bootstrap-hashicorp-vault.conf file that comes with Apache NiFi is a convenient way to configure this controller service in a manner consistent with the HashiCorpVault sensitive property provider. Since this file is already used for configuring the Vault client for protecting sensitive properties in the NiFi configuration files (see the Administrator’s Guide), it’s a natural starting point for configuring the controller service as well.

An example configuration of this properties file is as follows:

# HTTP or HTTPS URI for HashiCorp Vault is required to enable the Sensitive Properties Provider
vault.uri=https://127.0.0.1:8200

# Optional file supports authentication properties described in the Spring Vault Environment Configuration
# https://docs.spring.io/spring-vault/docs/2.3.x/reference/html/#vault.core.environment-vault-configuration
#
# All authentication properties must be included in bootstrap-hashicorp-vault.conf when this property is not specified.
# Properties in bootstrap-hashicorp-vault.conf take precedence when the same values are defined in both files.
# Token Authentication is the default when the 'vault.authentication' property is not specified.
vault.authentication.properties.file=[full/path/to/vault-auth.properties]

# Optional Timeout properties
vault.connection.timeout=5 secs
vault.read.timeout=15 secs

# Optional TLS properties
vault.ssl.enabledCipherSuites=
vault.ssl.enabledProtocols=TLSv1.3
vault.ssl.key-store=[path/to/keystore.p12]
vault.ssl.key-store-type=PKCS12
vault.ssl.key-store-password=[keystore password]
vault.ssl.trust-store=[path/to/truststore.p12]
vault.ssl.trust-store-type=PKCS12
vault.ssl.trust-store-password=[truststore password]
properties

In order to use this file in the StandardHashiCorpVaultClientService, specify the following properties:

  • Configuration Strategy - Properties Files

  • Vault Properties Files - ./conf/bootstrap-hashicorp-vault.conf

If your bootstrap configuration includes the vault.authentication.properties.file containing additional authentication properties, this file will also need to be added to the Vault Properties Files property as a comma-separated value.

Configuring the Client using Direct Properties

However, if you want to specify or override properties directly in the controller service, you may do this by specifying a Configuration Strategy of ‘Direct Properties’. This can be useful if you are reusing an SSLContextService or want to parameterize the Vault configuration properties. Authentication-related properties can also be added as sensitive dynamic properties, as seen in the examples below.

Vault Authentication

Under the hood, the controller service uses Spring Vault, and directly supports the property keys specified in Spring Vault’s documentation. Following are some common examples of authentication with Vault.

======= Token Authentication

The simplest authentication scheme uses a rotating token, which is enabled by default in Vault. To specify this mechanism, select “TOKEN” from the “Vault Authentication” property (the default). However, since the token should rotate by nature, it is a best practice to use the ‘Properties Files’ Configuration Strategy, and keep the token value in an external properties file, indicating this filename in the ‘Vault Properties Files’ property. Then an external process can rotate the token in the file without updating NiFi configuration. In order to pick up the changed token, the controller service must be disabled and re-enabled.
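
A minimal sketch of such an external properties file (the filename and token value below are placeholders) needs nothing more than the Spring Vault token key:

# e.g. ./conf/vault-token.properties, listed in the 'Vault Properties Files' property
vault.token=<your vault token>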

For testing purposes, however, it may be more convenient to specify the token directly in the controller service. To do so, add a new Sensitive property named ‘vault.token’ and enter the token as the value.

======= Certificate Authentication

Certificate authentication must be enabled in the Vault server before it can be used from NiFi, but it uses the same TLS settings as the actual client connection, so no additional authentication properties are required. While these TLS settings can be provided in an external properties file, we will demonstrate configuring an SSLContextService instead.

First, create an SSLContextService controller service and configure the Filename, Password, and Type for both the Keystore and Truststore. Enable it, and assign it as the SSL Context Service in the Vault controller service. Then, simply specify “CERT” as the “Vault Authentication” property value.

======= Other Authentication Methods

To configure the other authentication methods, see the Spring Vault documentation linked above. All relevant properties should be added either to the external properties files referenced in the “Vault Properties Files” property if using the ‘Properties Files’ Configuration Strategy, or added as custom properties with the same name if using the ‘Direct Properties’ Configuration Strategy. For example, for the Azure authentication mechanism, properties will have to be added for ‘vault.azure-msi.azure-path’, ‘vault.azure-msi.role’, and ‘vault.azure-msi.identity-token-service’.
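
For example, under the ‘Direct Properties’ strategy these Azure keys could be added as sensitive dynamic properties, or placed in a file referenced by the “Vault Properties Files” property (a sketch with placeholder values):

vault.azure-msi.azure-path=<azure auth mount path>
vault.azure-msi.role=<vault role name>
vault.azure-msi.identity-token-service=<identity token service URI>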

StandardHttpContextMap

Provides the ability to store and retrieve HTTP requests and responses external to a Processor, so that multiple Processors can interact with the same HTTP request.

Tags: http, request, response

Properties

Maximum Outstanding Requests

The maximum number of HTTP requests that can be outstanding at any one time. Any attempt to register an additional HTTP Request will cause an error

Request Expiration

Specifies how long an HTTP Request should be left unanswered before being evicted from the cache and being responded to with a Service Unavailable status code

StandardJsonSchemaRegistry

Provides a service for registering and accessing JSON schemas. One can register a schema as a dynamic property where 'name' represents the schema name and 'value' represents the textual representation of the actual schema following the syntax and semantics of the JSON Schema format. Empty schemas and schemas consisting only of whitespace are not acceptable. The registry is heterogeneous in that it can store schemas of different schema draft versions. By default the registry is configured to store schemas of Draft 2020-12. When a schema is added, it is saved under the JSON Schema Version that is currently set.

Tags: schema, registry, json

Properties

JSON Schema Version

The JSON schema specification

Dynamic Properties

Schema Name

Adds a named schema using the JSON string representation of a JSON schema

StandardKustoIngestService

Sends batches of FlowFile content or streams FlowFile content to an Azure ADX cluster.

Tags: Azure, Data, Explorer, ADX, Kusto, ingest, azure

Properties

Authentication Strategy

Authentication method for access to Azure Data Explorer

Application Client ID

Azure Data Explorer Application Client Identifier for Authentication

Application Key

Azure Data Explorer Application Key for Authentication

Application Tenant ID

Azure Data Explorer Application Tenant Identifier for Authentication

Cluster URI

Azure Data Explorer Cluster URI

StandardKustoQueryService

Standard implementation of Kusto Query Service for Azure Data Explorer

Tags: Azure, Data, Explorer, ADX, Kusto

Properties

Cluster URI

Azure Data Explorer Cluster URI

Authentication Strategy

Authentication method for access to Azure Data Explorer

Application Client ID

Azure Data Explorer Application Client Identifier for Authentication

Application Tenant ID

Azure Data Explorer Application Tenant Identifier for Authentication

Application Key

Azure Data Explorer Application Key for Authentication

StandardOauth2AccessTokenProvider

Provides OAuth 2.0 access tokens that can be used as Bearer authorization header in HTTP requests. Can use either Resource Owner Password Credentials Grant or Client Credentials Grant. Client authentication can be done with either HTTP Basic authentication or in the request body.

Tags: oauth2, provider, authorization, access token, http

Properties

Authorization Server URL

The URL of the authorization server that issues access tokens.

Client Authentication Strategy

Strategy for authenticating the client against the OAuth2 token provider service.

Grant Type

The OAuth2 Grant Type to be used when acquiring an access token.

Username

Username on the service that is being accessed.

Password

Password for the username on the service that is being accessed.

Refresh Token

Refresh Token.

Client ID

Client secret

Scope

Space-delimited, case-sensitive list of scopes of the access request (as per the OAuth 2.0 specification)

Resource

Resource URI for the access token request defined in RFC 8707 Section 2

Audience

Audience for the access token request defined in RFC 8693 Section 2.1

Refresh Window

The service will attempt to refresh tokens expiring within the refresh window, subtracting the configured duration from the token expiration.

SSL Context Service

HTTP Protocols

HTTP Protocols supported for Application Layer Protocol Negotiation with TLS

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

StandardPGPPrivateKeyService

PGP Private Key Service provides Private Keys loaded from files or properties

Tags: PGP, GPG, OpenPGP, Encryption, Private, Key, RFC 4880

Properties

Keyring File

File path to PGP Keyring or Secret Key encoded in binary or ASCII Armor

Keyring

PGP Keyring or Secret Key encoded in ASCII Armor

Key Password

Password used for decrypting Private Keys

StandardPGPPublicKeyService

PGP Public Key Service providing Public Keys loaded from files

Tags: PGP, GPG, OpenPGP, Encryption, Private, Key, RFC 4880

Properties

Keyring File

File path to PGP Keyring or Public Key encoded in binary or ASCII Armor

Keyring

PGP Keyring or Public Key encoded in ASCII Armor

StandardPrivateKeyService

Private Key Service provides access to a Private Key loaded from configured sources

Tags: PEM, PKCS8

Properties

Key File

File path to Private Key structured using PKCS8 and encoded as PEM

Key

Private Key structured using PKCS8 and encoded as PEM

Key Password

Password used for decrypting Private Keys

StandardProxyConfigurationService

Provides a set of configurations for different NiFi components to use a proxy server.

Tags: Proxy

Properties

Proxy Type

Proxy type.

SOCKS Version

SOCKS Protocol Version

Proxy Server Host

Proxy server hostname or ip-address.

Proxy Server Port

Proxy server port number.

Proxy User Name

The name of the proxy client for user authentication.

Proxy User Password

The password of the proxy client for user authentication.

StandardRestrictedSSLContextService

Restricted implementation of the SSLContextService. Provides the ability to configure keystore and/or truststore properties once and reuse that configuration throughout the application, but only allows a restricted set of TLS/SSL protocols to be chosen (no SSL protocols are supported). The set of protocols selectable will evolve over time as new protocols emerge and older protocols are deprecated. This service is recommended over StandardSSLContextService if a component doesn’t expect to communicate with legacy systems since it is unlikely that legacy systems will support these protocols.

Tags: tls, ssl, secure, certificate, keystore, truststore, jks, p12, pkcs12, pkcs

Properties

Keystore Filename

The fully-qualified filename of the Keystore

Keystore Password

The password for the Keystore

Key Password

The password for the key. If this is not specified, but the Keystore Filename, Password, and Type are specified, then the Keystore Password will be assumed to be the same as the Key Password.

Keystore Type

The Type of the Keystore

Truststore Filename

The fully-qualified filename of the Truststore

Truststore Password

The password for the Truststore

Truststore Type

The Type of the Truststore

TLS Protocol

TLS Protocol Version for encrypted connections. Supported versions depend on the specific version of Java used.

StandardS3EncryptionService

Adds configurable encryption to S3 Put and S3 Fetch operations.

Tags: service, aws, s3, encryption, encrypt, decryption, decrypt, key

Properties

Encryption Strategy

Strategy to use for S3 data encryption and decryption.

Key ID or Key Material

For None and Server-side S3: not used. For Server-side KMS and Client-side KMS: the KMS Key ID must be configured. For Server-side Customer Key and Client-side Customer Key: the Key Material must be specified in Base64 encoded form. In case of Server-side Customer Key, the key must be an AES-256 key. In case of Client-side Customer Key, it can be an AES-256, AES-192 or AES-128 key.

KMS Region

The Region of the AWS Key Management Service. Only used in case of Client-side KMS.

Additional Details

Description

The StandardS3EncryptionService manages an encryption strategy and applies that strategy to various S3 operations.
Note: This service has no effect when a processor has the Server Side Encryption property set. To use this service with processors so configured, first create a service instance, set the Encryption Strategy to Server-side S3, disable the Server Side Encryption processor setting, and finally, associate the processor with the service.

Configuration Details
Encryption Strategy

The name of the specific encryption strategy for this service to use when encrypting and decrypting S3 operations.

  • None - no encryption is configured or applied.

  • Server-side S3 - encryption and decryption is managed by S3; no keys are required.

  • Server-side KMS - encryption and decryption are performed by S3 using the configured KMS key.

  • Server-side Customer Key - encryption and decryption are performed by S3 using the supplied customer key.

  • Client-side KMS - like the Server-side KMS strategy, with the encryption and decryption performed by the client.

  • Client-side Customer Key - like the Server-side Customer Key strategy, with the encryption and decryption performed by the client.

Key ID or Key Material

When configured for either the Server-side or Client-side KMS strategies, this field should contain the KMS Key ID.

When configured for either the Server-side or Client-side Customer Key strategies, this field should contain the key material, and that material must be base64 encoded.

All other encryption strategies ignore this field.
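
For example, a Base64 encoded 256-bit key suitable for the Customer Key strategies can be generated with a command such as the following (shown as an illustration; any tool that produces 32 random bytes and Base64 encodes them works equally well):

openssl rand -base64 32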

KMS Region

KMS key region, if any. This value must match the actual region of the KMS key if supplied.

StandardSnowflakeIngestManagerProviderService

Provides a Snowflake Ingest Manager for Snowflake pipe processors

Tags: snowflake, jdbc, database, connection

Properties

Account Identifier Format

The format of the account identifier.

Snowflake URL

Example host url: [account-locator].[cloud-region].[cloud].snowflakecomputing.com

Account Locator

Snowflake account locator to use for connection.

Cloud Region

Snowflake cloud region to use for connection.

Cloud Type

Snowflake cloud type to use for connection.

Organization Name

Snowflake organization name to use for connection.

Account Name

Snowflake account name to use for connection.

User Name

The Snowflake user name.

Private Key Service

Specifies the Controller Service that will provide the private key. The public key needs to be added to the user account in the Snowflake account beforehand.

Database

The database to use by default. The same as passing 'db=DATABASE_NAME' to the connection string.

Schema

The schema to use by default. The same as passing 'schema=SCHEMA' to the connection string.

Pipe

The Snowflake pipe to ingest from.

StandardSSLContextService

Standard implementation of the SSLContextService. Provides the ability to configure keystore and/or truststore properties once and reuse that configuration throughout the application. This service can be used to communicate with both legacy and modern systems. If you only need to communicate with non-legacy systems, then the StandardRestrictedSSLContextService is recommended as it only allows a specific set of SSL protocols to be chosen.

Tags: ssl, secure, certificate, keystore, truststore, jks, p12, pkcs12, pkcs, tls

Properties

Keystore Filename

The fully-qualified filename of the Keystore

Keystore Password

The password for the Keystore

Key Password

The password for the key. If this is not specified, but the Keystore Filename, Password, and Type are specified, then the Keystore Password will be assumed to be the same as the Key Password.

Keystore Type

The Type of the Keystore

Truststore Filename

The fully-qualified filename of the Truststore

Truststore Password

The password for the Truststore

Truststore Type

The Type of the Truststore

TLS Protocol

SSL or TLS Protocol Version for encrypted connections. Supported versions include insecure legacy options and depend on the specific version of Java used.

StandardWebClientServiceProvider

Web Client Service Provider with support for configuring standard HTTP connection properties

Tags: HTTP, Web, Client

Properties

Connect Timeout

Maximum amount of time to wait before failing during initial socket connection

Read Timeout

Maximum amount of time to wait before failing while reading socket responses

Write Timeout

Maximum amount of time to wait before failing while writing socket requests

Redirect Handling Strategy

Handling strategy for responding to HTTP 301 or 302 redirects received with a Location header

SSL Context Service

SSL Context Service overrides system default TLS settings for HTTPS communication

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests.

Syslog5424Reader

Provides a mechanism for reading RFC 5424 compliant Syslog data, such as log files, and structuring the data so that it can be processed.

Tags: syslog 5424, syslog, logs, logfiles, parse, text, record, reader

Properties

Character Set

Specifies the character set of the Syslog messages

Raw message

If true, the record will have a _raw field containing the raw message

Additional Details

The Syslog5424Reader Controller Service provides a means for parsing valid RFC 5424 Syslog messages. This service produces records with a set schema to match the specification.

The Required Property of this service is named Character Set and specifies the Character Set of the incoming text.

Schemas

When a record is parsed from incoming data, it is parsed into the RFC 5424 schema.

======= The RFC 5424 schema

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "priority",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "severity",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "facility",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "version",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "timestamp",
      "type": [
        "null",
        {
          "type": "long",
          "logicalType": "timestamp-millis"
        }
      ]
    },
    {
      "name": "hostname",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "body",
      "type": [
        "null",
        "string"
      ]
    },
    "name"
    :
    "appName",
    "type"
    :
    [
      "null",
      "string"
    ]
    },
    {
      "name": "procid",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "messageid",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "structuredData",
      "type": [
        "null",
        {
          "type": "map",
          "values": {
            "type": "map",
            "values": "string"
          }
        }
      ]
    }
  ]
}
json
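
For illustration, an input line such as the following (adapted from the examples in RFC 5424) would populate the priority, version, timestamp, hostname, appName, messageid and structuredData fields of this schema:

<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] An application event log entry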

SyslogReader

Attempts to parse the contents of a Syslog message in accordance with RFC5424 and RFC3164. In the case of RFC5424 formatted messages, structured data is not supported and will be returned as part of the message. Note: Be mindful that RFC3164 is informational and a wide range of different implementations are present in the wild.

Tags: syslog, logs, logfiles, parse, text, record, reader

Properties

Character Set

Specifies the character set of the Syslog messages

Raw message

If true, the record will have a _raw field containing the raw message

Additional Details

The SyslogReader Controller Service provides a means to parse the contents of a Syslog message in accordance with the RFC5424 and RFC3164 formats. This reader produces records with a set schema to match the common set of fields between the specifications.

The Required Property of this service is named Character Set and specifies the Character Set of the incoming text.

Schemas

When a record is parsed from incoming data, it is parsed into the Generic Syslog Schema.

======= The Generic Syslog Schema

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "priority",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "severity",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "facility",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "version",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "timestamp",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "hostname",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "body",
      "type": [
        "null",
        "string"
      ]
    }
  ]
}
json

TinkerpopClientService

This service interacts with a TinkerPop-compliant graph service, providing both script submission and bytecode submission capabilities. Script submission is the default, with the script command being sent to the Gremlin server as text. This should only be used for simple interactions with a TinkerPop-compliant server such as counts or other operations that do not require the injection of custom classes. Bytecode submission allows much more flexibility. When providing a JAR, custom serializers can be used and pre-compiled graph logic can be utilized by Groovy scripts provided by processors such as ExecuteGraphQueryRecord.

Tags: graph, gremlin

Properties

Script Submission Type

A selection that toggles between script submission and bytecode submission

Settings Specification

Selecting "Service-Defined Settings" connects using the setting on this service. Selecting "Yaml Settings" uses the specified YAML file for connection settings.

Remote Objects File

The remote-objects file YAML used for connecting to the gremlin server.

Extension JARs

A comma-separated list of Java JAR files to be loaded. This should contain any Serializers or other classes specified in the YAML file. Additionally, any custom classes required for the groovy script to work in the bytecode submission setting should also be contained in these JAR files.

Extension Classes

A comma-separated list of fully qualified Java class names that correspond to classes to implement. This is useful for services such as JanusGraph that need specific serialization classes. This configuration property has no effect unless a value for the Extension JAR field is also provided.

Contact Points

A comma-separated list of hostnames or IP addresses where a Gremlin-enabled server can be found.

Port

The port where Gremlin Server is running on each host listed as a contact point.

Path

The URL path where Gremlin Server is running on each host listed as a contact point.

Traversal Source Name

An optional property that lets you set the name of the remote traversal instance. This can be really important when working with databases like JanusGraph that support multiple backend traversal configurations simultaneously.

Username

The username used to authenticate with the Gremlin server. Note: when using a remote.yaml file, this username value (if set) will override any username set in the YAML file.

Password

The password used to authenticate with the Gremlin server. Note: when using a remote.yaml file, this password value (if set) will override any password set in the YAML file.

SSL Context Service

The SSL Context Service used to provide client certificate information for TLS/SSL connections.

Additional Details

Description:

This client service configures a connection to a Gremlin Server and allows Gremlin queries to be executed against the Gremlin Server. For more information on Gremlin and Gremlin Server, see the Apache Tinkerpop project.

This client service supports two different modes of operation: Script Submission and Bytecode Submission, described below.

Script Submission

Script submission is the default way to interact with the Gremlin server. The input script is sent to the Gremlin server as text. Because the script is shipped as a string, only simple queries are recommended (count, path, etc.) as there are no complex serializers available in this operation. This also means that NiFi will not be opinionated about what is returned: whatever the response from the TinkerPop server is, it will be deserialized assuming common Java types. In the case of a Map return, the values will be returned as a record in the FlowFile response; in all other cases, the return of the query will be coerced into a Map with key “result” and value being the result of your script submission for that specific response.

Serialization Issues in Script Submission

A common issue for first-time users creating Gremlin scripts is accidentally returning an unserializable object. Gremlin is a Groovy DSL, so it behaves like compiled Groovy, including returning the last statement in the script. This is an example of a Gremlin script that could cause unexpected failures:

g.V().hasLabel("person").has("name", "John Smith").valueMap()

The valueMap() step is not directly serializable and will fail. To fix that you have two potential options:

//Return a Map
g.V().hasLabel("person").has("name", "John Smith").valueMap().next()

Alternative:

g.V().hasLabel("person").has("name", "John Smith").valueMap()
true //Return boolean literal

Bytecode Submission

Bytecode submission is the more flexible of the two submission methods and will be much more performant in a production system. When combined with the YAML connection settings and a custom JAR, very complex graph queries can be run directly within the NiFi JVM, leveraging custom serializers to decrease serialization overhead.

Instead of submitting a script to the Gremlin server, which requires string serialization on both sides of the result set, the Groovy script is compiled within the NiFi JVM. This compiled script has the bindings of g (the GraphTraversalSource) and log (the NiFi logger) injected into the compiled code. By utilizing g, your result set is contained within NiFi and serialization is handled on the NiFi side, drastically decreasing the likelihood of serialization errors.

As the result returned cannot be known by NiFi to be a specific type, your Groovy script must return a Map<String, Object>, otherwise the response will be ignored. Here is an example:

Object results = g.V().hasLabel("person").has("name", "John Smith").valueMap().collect()
[result: results]

This will break up your response objects into an array within your result key, allowing further processing within NiFi if necessary.
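
As another minimal sketch (the vertex label and property names are placeholders), the injected log binding can be used alongside g, as long as the script still returns a Map:

// 'g' and 'log' are the bindings injected into the compiled script, as described above
def names = g.V().hasLabel("person").values("name").toList()
log.info("Matched ${names.size()} person vertices")
[result: names]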

UDPEventRecordSink

Format and send Records as UDP Datagram Packets to a configurable destination

Tags: UDP, event, record, sink

Properties

Hostname

Destination hostname or IP address

Port

Destination port number

Record Writer

Specifies the Controller Service to use for writing out the records.

Sender Threads

Number of worker threads allocated for handling socket communication

VolatileSchemaCache

Provides a Schema Cache that evicts elements based on a Least-Recently-Used algorithm. This cache is not persisted, so any restart of NiFi will result in the cache being cleared. Additionally, the cache will be cleared any time that the Controller Service is stopped and restarted.

Tags: record, schema, cache

Properties

Maximum Cache Size

The maximum number of Schemas to cache.

WindowsEventLogReader

Reads Windows Event Log data as XML content having been generated by ConsumeWindowsEventLog, ParseEvtx, etc. (see Additional Details) and creates Record object(s). If the root tag of the input XML is 'Events', the child content is expected to be a series of 'Event' tags, each of which will constitute a single record. If the root tag is 'Event', the content is expected to be a single 'Event' and thus a single record. No other root tags are valid. Only events of type 'System' are currently supported.

Tags: xml, windows, event, log, record, reader, parser

Properties

Additional Details

Description:

This controller service is used to parse Windows Event Log events in the form of XML input (possibly from ConsumeWindowsEventLog or ParseEvtx).

Input XML Example:
<Event xmlns="https://schemas.microsoft.com/win/2004/08/events/event">
    <System>
        <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}"
                  EventSourceName="Service Control Manager"/>
        <EventID Qualifiers="16384">7036</EventID>
        <Version>0</Version>
        <Level>4</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8080000000000000</Keywords>
        <TimeCreated SystemTime="2016-06-10T22:28:53.905233700Z"/>
        <EventRecordID>34153</EventRecordID>
        <Correlation/>
        <Execution ProcessID="684" ThreadID="3504"/>
        <Channel>System</Channel>
        <Computer>WIN-O05CNUCF16M.hdf.local</Computer>
        <Security/>
    </System>
    <EventData>
        <Data Name="param1">Smart Card Device Enumeration Service</Data>
        <Data>param2</Data>
        <Binary>5300630044006500760069006300650045006E0075006D002F0034000000</Binary>
    </EventData>
</Event>
xml
Output example (using ConvertRecord with JsonRecordSetWriter):
[
  {
    "System": {
      "Provider": {
        "Guid": "{555908d1-a6d7-4695-8e1e-26931d2012f4}",
        "Name": "Service Control Manager"
      },
      "EventID": 7036,
      "Version": 0,
      "Level": 4,
      "Task": 0,
      "Opcode": 0,
      "Keywords": "0x8080000000000000",
      "TimeCreated": {
        "SystemTime": "2016-06-10T22:28:53.905233700Z"
      },
      "EventRecordID": 34153,
      "Correlation": null,
      "Execution": {
        "ThreadID": 3504,
        "ProcessID": 684
      },
      "Channel": "System",
      "Computer": "WIN-O05CNUCF16M.hdf.local",
      "Security": null
    },
    "EventData": {
      "param1": "Smart Card Device Enumeration Service",
      "param2": "5300630044006500760069006300650045006E0075006D002F0034000000"
    }
  }
]
json

XMLFileLookupService

A reloadable XML file-based lookup service. This service uses Apache Commons Configuration. Example XML configuration file and how to access specific configuration can be found at http://commons.apache.org/proper/commons-configuration/userguide/howto_hierarchical.html. External entity processing is disabled.

Tags: lookup, cache, enrich, join, xml, reloadable, key, value

Properties

Configuration File

A configuration file

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

XMLReader

Reads XML content and creates Record objects. Records are expected in the second level of XML data, embedded in an enclosing root tag.

Tags: xml, record, reader, parser

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Parse XML Attributes

When 'Schema Access Strategy' is 'Infer Schema' and this property is 'true' then XML attributes are parsed and added to the record as new fields. When the schema is inferred but this property is 'false', XML attributes and their values are ignored.

Schema Inference Cache

Specifies a Schema Cache to use when inferring the schema. If not populated, the schema will be inferred each time. However, if a cache is specified, the cache will first be consulted and if the applicable schema can be found, it will be used instead of inferring the schema.

Expect Records as Array

This property defines whether the reader expects a FlowFile to consist of a single Record or a series of Records with a "wrapper element". Because XML does not provide a way to read a series of XML documents from a stream directly, it is common to combine many XML documents by concatenating them and then wrapping the entire XML blob with a "wrapper element", which will be ignored by the reader.

Attribute Prefix

If this property is set, the name of attributes will be prepended with a prefix when they are added to a record.

Field Name for Content

If tags with content (e. g. <field>content</field>) are defined as nested records in the schema, the name of the tag will be used as name for the record and the value of this property will be used as name for the field. If tags with content shall be parsed together with attributes (e. g. <field attribute="123">content</field>), they have to be defined as records. In such a case, the name of the tag will be used as the name for the record and the value of this property will be used as the name for the field holding the original content. The name of the attribute will be used to create a new record field, the content of which will be the value of the attribute. For more information, see the 'Additional Details…​' section of the XMLReader controller service’s documentation.

Date Format

Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).

Time Format

Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).

Additional Details

The XMLReader Controller Service reads XML content and creates Record objects. The Controller Service must be configured with a schema that describes the structure of the XML data. Fields in the XML data that are not defined in the schema will be skipped. Depending on whether the property “Expect Records as Array” is set to “false” or “true”, the reader either expects a single record or an array of records for each FlowFile.

Example: Single record

<record>
    <field1>content</field1>
    <field2>content</field2>
</record>
xml

An array of records has to be enclosed by a root tag. Example: Array of records

<root>
    <record>
        <field1>content</field1>
        <field2>content</field2>
    </record>
    <record>
        <field1>content</field1>
        <field2>content</field2>
    </record>
</root>
xml
Example: Simple Fields

The simplest kind of data within XML data are tags / fields only containing content (no attributes, no embedded tags). They can be described in the schema by simple types (e. g. INT, STRING, …).

<root>
    <record>
        <simple_field>content</simple_field>
    </record>
</root>
xml

This record can be described by a schema containing one field (e.g. of type string). By providing this schema, the reader expects zero or one occurrences of “simple_field” in the record.

{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
    {
      "name": "simple_field",
      "type": "string"
    }
  ]
}
json
Example: Arrays with Simple Fields

Arrays are considered as repetitive tags / fields in XML data. For the following XML data, “array_field” is considered to be an array enclosing simple fields, whereas “simple_field” is considered to be a simple field not enclosed in an array.

<record>
    <array_field>content</array_field>
    <array_field>content</array_field>
    <simple_field>content</simple_field>
</record>
xml

This record can be described by the following schema:

{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
    {
      "name": "array_field",
      "type": {
        "type": "array",
        "items": "string"
      }
    },
    {
      "name": "simple_field",
      "type": "string"
    }
  ]
}
json

If a field in a schema is embedded in an array, the reader expects zero, one or more occurrences of the field in a record. The field “array_field” principally also could be defined as a simple field, but then the second occurrence of this field would replace the first in the record object. Moreover, the field “simple_field” could also be defined as an array. In this case, the reader would put it into the record object as an array with one element.
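
For instance, declaring “simple_field” as an array instead of a plain string (a sketch reusing the Avro array type shown above) would make the reader wrap its single value in a one-element array:

{
  "name": "simple_field",
  "type": {
    "type": "array",
    "items": "string"
  }
}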

Example: Tags with Attributes

XML fields frequently not only contain content, but also attributes. The following record contains a field with an attribute “attr” and content:

<record>
    <field_with_attribute attr="attr_content">content of field</field_with_attribute>
</record>
xml

To parse the content of the field “field_with_attribute” together with the attribute “attr”, two requirements have to be fulfilled:

  • In the schema, the field has to be defined as record.

  • The property “Field Name for Content” has to be set.

  • As an option, the property “Attribute Prefix” also can be set.

For the example above, the following property settings are assumed:

Property Name Property Value

Field Name for Content

field_name_for_content

Attribute Prefix

prefix_

The schema can be defined as follows:

{
  "name": "test",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    {
      "name": "field_with_attribute",
      "type": {
        "name": "RecordForTag",
        "type": "record",
        "fields": [
          {
            "name": "attr",
            "type": "string"
          },
          {
            "name": "field_name_for_content",
            "type": "string"
          }
        ]
      }
    }
  ]
}
json

Note that the field “field_name_for_content” not only has to be defined in the property section, but also in the schema, whereas the prefix for attributes is not part of the schema. It will be appended when an attribute named “attr” is found at the respective position in the XML data and added to the record. The record object of the above example will be structured as follows:

Record (
    Record "field_with_attribute" (
        RecordField "prefix_attr" = "attr_content",
        RecordField "field_name_for_content" = "content of field"
    )
)

Principally, the field “field_with_attribute” could also be defined as a simple field. In this case, the attributes simply would be ignored. Vice versa, the simple field in example 1 above could also be defined as a record (assuming that the property “Field Name for Content” is set).

It is possible that the schema is not provided explicitly, but schema inference is used. For details on XML attributes and schema inference, see “Example: Tags with Attributes and Schema Inference” below.

Example: Tags within tags

XML data is frequently nested. In this case, tags enclose other tags:

<record>
    <field_with_embedded_fields attr="attr_content">
        <embedded_field>embedded content</embedded_field>
        <another_embedded_field>another embedded content</another_embedded_field>
    </field_with_embedded_fields>
</record>
xml

The enclosing fields always have to be defined as records, irrespective of whether they include attributes to be parsed or not. In this example, the tag “field_with_embedded_fields” encloses the fields “embedded_field” and “another_embedded_field”, which are both simple fields. The schema can be defined as follows:

{
  "name": "test",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    {
      "name": "field_with_embedded_fields",
      "type": {
        "name": "RecordForEmbedded",
        "type": "record",
        "fields": [
          {
            "name": "attr",
            "type": "string"
          },
          {
            "name": "embedded_field",
            "type": "string"
          },
          {
            "name": "another_embedded_field",
            "type": "string"
          }
        ]
      }
    }
  ]
}
json

Notice that this case does not require the property “Field Name for Content” to be set as this is only required for tags containing attributes and content.

Example: Tags with Attributes and Schema Inference

When the record’s schema is not provided but inferred based on the data itself, providing a value for the “Field Name for Content” property is especially important. (For detailed information on schema inference, see the “Schema Inference” section below.) Let’s focus on cases where an XML element (called <field_with_attribute> in the examples) has an XML attribute and some content and no sub-elements. For the examples below, let’s assume that a ConvertRecord processor is used, and it uses an XMLReader controller service and an XMLRecordSetWriter controller service. The settings for XMLReader are provided separately for each example. The settings for XMLRecordSetWriter are common for all the examples below. This way an XML to XML conversion is executed and comparing the input data with the output highlights the schema inference behavior. The same behavior can be observed if a different Writer controller service is used. XMLRecordSetWriter was chosen for these examples so that the input and the output are easily comparable. The settings of the common XMLRecordSetWriter are the following:

Property Name Property Value

Schema Access Strategy

Inherit Record Schema

Suppress Null Values

Never Suppress

XML Attributes and Schema Inference Example 1

The simplest case is when XML attributes are ignored completely during schema inference. To achieve this, the “Parse XML Attributes” property in XMLReader is set to “false”.

XMLReader settings:

Property Name Property Value

Schema Access Strategy

Infer Schema

Parse XML Attributes

false

Expect Records as Array

false

Field Name for Content

not set

Input:

<record>
    <field_with_attribute attr="attr_content">content of field</field_with_attribute>
</record>
xml

Output:

<record>
    <field_with_attribute>content of field</field_with_attribute>
</record>
xml

If “Parse XML Attributes” is “false”, the XML attribute is not parsed. Its name does not appear in the inferred schema and its value is ignored. The reader behaves as if the XML attribute was not there.

Important note: “Field Name for Content” was not set in this example. This could lead to data loss if “field_with_attribute” had child elements, similarly to what is described in “XML Attributes and Schema Inference Example 2” and “XML Attributes and Schema Inference Example 4”. To avoid that, “Field Name for Content” needs to be assigned a value that is different from any existing XML tags in the data, like in “XML Attributes and Schema Inference Example 6”.

XML Attributes and Schema Inference Example 2

XMLReader settings:

Property Name Property Value

Schema Access Strategy

Infer Schema

Parse XML Attributes

true

Expect Records as Array

false

Field Name for Content

not set

Input:

<record>
    <field_with_attribute attr="attr_content">content of field</field_with_attribute>
</record>
xml

As mentioned above, the element called “field_with_attribute” has an attribute and some content but no sub-element.

Output:

<record>
    <field_with_attribute>
        <attr>attr_content</attr>
        <value></value>
    </field_with_attribute>
</record>
xml

In the XMLReader’s settings, no value is set for the “Field Name for Content” property. In such cases the schema inference logic adds a field named “value” to the schema. However, since “Field Name for Content” is not set, the data processing logic is instructed not to consider the original content of the parent XML tag (<field_with_attribute>, the content of which is “content of field” in the example). So a new field named “value” appears in the schema but no value is assigned to it from the data, thus the field is empty. The XML attribute (named “attr”) is processed: a field named “attr” is added to the schema and the attribute’s value (“attr_content”) is assigned to it. In a case like this, the parent field’s original content is lost and a new field named “value” appears in the schema with no data assigned to it. This is to make sure that no data is overwritten in the record if it already contains a field named “value”. More on that case in Example 4 and Example 5.

XML Attributes and Schema Inference Example 3

In this example, the XMLReader’s “Field Name for Content” property is filled with the value “original_content”. The input data is the same as in the previous example.

XMLReader settings:

Property Name Property Value

Schema Access Strategy

Infer Schema

Parse XML Attributes

true

Expect Records as Array

false

Field Name for Content

original_content

Input:

<record>
    <field_with_attribute attr="attr_content">content of field</field_with_attribute>
</record>
xml

Output:

<record>
    <field_with_attribute>
        <attr>attr_content</attr>
        <original_content>content of field</original_content>
    </field_with_attribute>
</record>
xml

The XMLReader’s “Field Name for Content” property contains the value “original_content” (the concrete value is not important; what is important is that a value is provided and it does not clash with the name of any sub-element in <field_with_attribute>). This explicitly tells the XMLReader controller service to create a field named “original_content” and make the original content of the parent XML tag the value of that field. Adding the XML attribute named “attr” works just like in the first example. Since the <field_with_attribute> element had no child element with the name “original_content”, no data is lost.

XML Attributes and Schema Inference Example 4

In this example, XMLReader’s “Field Name for Content” property is left empty. In the input data, the <field_with_attribute> element has some content and a sub-element named <value>.

XMLReader settings:

Property Name Property Value

Schema Access Strategy

Infer Schema

Parse XML Attributes

true

Expect Records as Array

false

Field Name for Content

not set

Input:

<record>
    <field_with_attribute attr="attr_content">content of field<value>123</value>
    </field_with_attribute>
</record>
xml

Output:

<record>
    <field_with_attribute>
        <attr>attr_content</attr>
        <value>123</value>
    </field_with_attribute>
</record>
xml

The “Field Name for Content” property is not set, and the XML element has a sub-element named “value”. The name of the sub-element clashes with the default field name added to the schema by the schema inference logic (see Example 2). As seen in the output data, the input XML attribute’s value is added to the record just like in the previous examples. The value of the <value> element is retained, but the content of the <field_with_attribute> element that was outside the sub-element is lost.

XML Attributes and Schema Inference Example 5

In this example, XMLReader’s “Field Name for Content” property is given the value “value”. In the input data, the <field_with_attribute> element has some content and a sub-element named <value>. The name of the sub-element clashes with the value of the “Field Name for Content” property.

XMLReader settings:

Property Name Property Value

Schema Access Strategy

Infer Schema

Parse XML Attributes

true

Expect Records as Array

false

Field Name for Content

value

Input:

<record>
    <field_with_attribute attr="attr_content">content of field<value>123</value>
    </field_with_attribute>
</record>
xml

Output:

<record>
    <field_with_attribute>
        <attr>attr_content</attr>
        <value>content of field</value>
    </field_with_attribute>
</record>
xml

The “Field Name for Content” property’s value is “value”, and the XML element has a sub-element named “value”. The name of the sub-element clashes with the value of the “Field Name for Content” property. The value of the <value> element is replaced by the content of the <field_with_attribute> element, and the original content of the <value> element is lost.

XML Attributes and Schema Inference Example 6

To avoid losing any data, the XMLReader’s “Field Name for Content” property needs to be given a value that does not clash with any sub-element’s name in the input data. In this example the input data is the same as in the previous one, but the “Field Name for Content” property’s value is “original_content”, a value that does not clash with any sub-element name. No data is lost in this case.

XMLReader settings:

Property Name             Property Value
Schema Access Strategy    Infer Schema
Parse XML Attributes      true
Expect Records as Array   false
Field Name for Content    original_content

Input:

<record>
    <field_with_attribute attr="attr_content">content of field<value>123</value>
    </field_with_attribute>
</record>
xml

Output:

<record>
    <field_with_attribute>
        <attr>attr_content</attr>
        <value>123</value>
        <original_content>content of field</original_content>
    </field_with_attribute>
</record>
xml

As can be seen in the output data, the attribute has been added to the <field_with_attribute> element as a sub-element, the <value> element retained its value, and the original content of the <field_with_attribute> element has been added as a sub-element named “original_content”. This is because the value chosen for the “Field Name for Content” property does not clash with any of the existing sub-elements of the input XML element (<field_with_attribute>). No data is lost.

Example: Array of records

To further explain the logic of this reader, consider an example with an array of records. The following record contains the field “array_field”, which occurs repeatedly; each occurrence contains two embedded fields.

<record>
    <array_field>
        <embedded_field>embedded content 1</embedded_field>
        <another_embedded_field>another embedded content 1</another_embedded_field>
    </array_field>
    <array_field>
        <embedded_field>embedded content 2</embedded_field>
        <another_embedded_field>another embedded content 2</another_embedded_field>
    </array_field>
</record>
xml

This XML data can be parsed similarly to the data in example 4. However, the record defined in the schema of example 4 has to be embedded in an array.

{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
    {
      "name": "array_field",
      "type": {
        "type": "array",
        "items": {
          "name": "RecordInArray",
          "type": "record",
          "fields": [
            {
              "name": "embedded_field",
              "type": "string"
            },
            {
              "name": "another_embedded_field",
              "type": "string"
            }
          ]
        }
      }
    }
  ]
}
json
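
For illustration (this rendering is not part of the original example), parsing the XML above against this schema yields a single record whose “array_field” is an array of two nested records; in a JSON-like form it would look roughly like this:

{
  "array_field": [
    {
      "embedded_field": "embedded content 1",
      "another_embedded_field": "another embedded content 1"
    },
    {
      "embedded_field": "embedded content 2",
      "another_embedded_field": "another embedded content 2"
    }
  ]
}
json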
Example: Array in record

In XML data, arrays are frequently enclosed by tags:

<record>
    <field_enclosing_array>
        <element>content 1</element>
        <element>content 2</element>
    </field_enclosing_array>
    <field_without_array>content 3</field_without_array>
</record>
xml

For the schema, embedded tags have to be described by records. Therefore, the field “field_enclosing_array” is a record that embeds an array with elements of type string:

{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
    {
      "name": "field_enclosing_array",
      "type": {
        "name": "EmbeddedRecord",
        "type": "record",
        "fields": [
          {
            "name": "element",
            "type": {
              "type": "array",
              "items": "string"
            }
          }
        ]
      }
    },
    {
      "name": "field_without_array",
      "type": "string"
    }
  ]
}
json
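
As an added illustration (a JSON-like rendering, not part of the original example), the record parsed from the XML above contains a nested record for the enclosing tag, which in turn holds the array:

{
  "field_enclosing_array": {
    "element": [
      "content 1",
      "content 2"
    ]
  },
  "field_without_array": "content 3"
}
json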
Example: Maps

A map is a field embedding fields with different names:

<record>
    <map_field>
        <field1>content</field1>
        <field2>content</field2>
        ...
    </map_field>
    <simple_field>content</simple_field>
</record>
xml

This data can be processed using the following schema:

{
  "namespace": "nifi",
  "name": "test",
  "type": "record",
  "fields": [
    {
      "name": "map_field",
      "type": {
        "type": "map",
        "items": string
      }
    },
    {
      "name": "simple_field",
      "type": "string"
    }
  ]
}
json
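
As an added illustration (a JSON-like rendering, not part of the original example), the record parsed from the XML above exposes “map_field” as a map of string values keyed by the element names:

{
  "map_field": {
    "field1": "content",
    "field2": "content"
  },
  "simple_field": "content"
}
json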
Schema Inference

While NiFi’s Record API does require that each Record have a schema, it is often convenient to infer the schema based on the values in the data, rather than having to manually create a schema. This is accomplished by selecting a value of “Infer Schema” for the “Schema Access Strategy” property. When using this strategy, the Reader will determine the schema by first parsing all data in the FlowFile, keeping track of all fields that it has encountered and the type of each field. Once all data has been parsed, a schema is formed that encompasses all fields that have been encountered.

A common concern when inferring schemas is how to handle the condition of two values that have different types. For example, consider a FlowFile with the following two records:

<root>
    <record>
        <name>John</name>
        <age>8</age>
        <values>N/A</values>
    </record>
    <record>
        <name>Jane</name>
        <age>Ten</age>
        <values>8</values>
        <values>Ten</values>
    </record>
</root>
xml

It is clear that the “name” field will be inferred as a STRING type. However, how should we handle the “age” field? Should the field be a CHOICE between INT and STRING? Should we prefer LONG over INT? Should we just use a STRING? Should the field be considered nullable?

To help understand how this Record Reader infers schemas, we have the following list of rules that are followed in the inference logic:

  • All fields are inferred to be nullable.

  • When two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), the inference engine prefers to use a “wider” data type over using a CHOICE data type. A data type “A” is said to be wider than data type “B” if and only if data type “A” encompasses all values of “B” in addition to other values. For example, the LONG type is wider than the INT type but not wider than the BOOLEAN type (and BOOLEAN is also not wider than LONG). INT is wider than SHORT. The STRING type is considered wider than all other types except MAP, RECORD, ARRAY, and CHOICE.

  • If two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), but neither value is of a type that is wider than the other, then a CHOICE type is used. In the example above, the “values” field will be inferred as a CHOICE between a STRING and an ARRAY (a sketch of a schema inferred under these rules is shown after this list).

  • If the “Time Format,” “Timestamp Format,” or “Date Format” properties are configured, any value that would otherwise be considered a STRING type is first checked against the configured formats to see if it matches any of them. If the value matches the Timestamp Format, the value is considered a Timestamp field. If it matches the Date Format, it is considered a Date field. If it matches the Time Format, it is considered a Time field. In the unlikely event that the value matches more than one of the configured formats, they will be matched in the order: Timestamp, Date, Time. I.e., if a value matched both the Timestamp Format and the Date Format, the type that is inferred will be Timestamp. Because parsing dates and times can be expensive, it is advisable not to configure these formats if dates, times, and timestamps are not expected, or if processing the data as a STRING is acceptable. For use cases when this is important, though, the inference engine is intelligent enough to optimize the parsing by first checking several very cheap conditions. For example, the string’s length is examined to see if it is too long or too short to match the pattern. This results in far more efficient processing than would result if attempting to parse each string value as a timestamp.

  • The MAP type is never inferred. Instead, the RECORD type is used.

  • If two elements exist with the same name and the same parent (i.e., two sibling elements have the same name), the field will be inferred to be of type ARRAY.

  • If a field exists but all values are null, then the field is inferred to be of type STRING.
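
Applying these rules to the two example records above, a sketch of the inferred schema, rendered as an Avro schema, might look like the following. The record name and namespace are placeholders chosen for this sketch; the points to note are that “age” is inferred as a STRING (STRING is wider than INT), “values” is inferred as a CHOICE between a STRING and an ARRAY, and every field is nullable:

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "name",
      "type": ["string", "null"]
    },
    {
      "name": "age",
      "type": ["string", "null"]
    },
    {
      "name": "values",
      "type": ["string", { "type": "array", "items": "string" }, "null"]
    }
  ]
}
json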

Caching of Inferred Schemas

This Record Reader requires that if a schema is to be inferred, that all records be read in order to ensure that the schema that gets inferred is applicable for all records in the FlowFile. However, this can become expensive, especially if the data undergoes many different transformations. To alleviate the cost of inferring schemas, the Record Reader can be configured with a “Schema Inference Cache” by populating the property with that name. This is a Controller Service that can be shared by Record Readers and Record Writers.

Whenever a Record Writer is used to write data, if it is configured with a “Schema Cache,” it will also add the schema to the Schema Cache. This will result in an identifier for that schema being added as an attribute to the FlowFile.

Whenever a Record Reader is used to read data, if it is configured with a “Schema Inference Cache”, it will first look for a “schema.cache.identifier” attribute on the FlowFile. If the attribute exists, it will use the value of that attribute to lookup the schema in the schema cache. If it is able to find a schema in the cache with that identifier, then it will use that schema instead of reading, parsing, and analyzing the data to infer the schema. If the attribute is not available on the FlowFile, or if the attribute is available but the cache does not have a schema with that identifier, then the Record Reader will proceed to infer the schema as described above.

The end result is that users are able to chain together many different Processors to operate on Record-oriented data. Typically, only the first such Processor in the chain will incur the “penalty” of inferring the schema. For all other Processors in the chain, the Record Reader is able to simply lookup the schema in the Schema Cache by identifier. This allows the Record Reader to infer a schema accurately, since it is inferred based on all data in the FlowFile, and still allows this to happen efficiently since the schema will typically only be inferred once, regardless of how many Processors handle the data.

XMLRecordSetWriter

Writes a RecordSet to XML. The records are wrapped by a root tag.

Tags: xml, resultset, writer, serialize, record, recordset, row

Properties

Schema Write Strategy

Specifies how the schema for a Record should be added to the data.

Schema Cache

Specifies a Schema Cache to add the Record Schema to so that Record Readers can quickly lookup the schema.

Schema Reference Writer

Service implementation responsible for writing FlowFile attributes or content header with Schema reference information

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Date Format

Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).

Time Format

Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).

Suppress Null Values

Specifies how the writer should handle a null field

Pretty Print XML

Specifies whether or not the XML should be pretty printed

Omit XML Declaration

Specifies whether or not to include XML declaration

Name of Root Tag

Specifies the name of the XML root tag wrapping the record set. This property has to be defined if the writer is supposed to write multiple records in a single FlowFile.

Name of Record Tag

Specifies the name of the XML record tag wrapping the record fields. If this is not set, the writer will use the record name in the schema.

Wrap Elements of Arrays

Specifies how the writer wraps elements of fields of type array

Array Tag Name

Name of the tag used by property "Wrap Elements of Arrays" to write arrays

Character Set

The Character set to use when writing the data to the FlowFile

Additional Details

The XMLRecordSetWriter Controller Service writes record objects to XML. The Controller Service must be configured with a schema that describes the structure of the record objects. Multiple records are wrapped by a root node. The name of the root node can be configured via property. If no root node is configured, the writer expects only one record for each FlowFile (that will not be wrapped). As Avro does not support defining attributes for records, this writer currently does not support writing XML attributes.

Example: Simple records
RecordSet (
  Record (
    Field "name1" = "value1",
    Field "name2" = 42
  ),
  Record (
    Field "name1" = "value2",
    Field "name2" = 84
  )
)

This record can be described by the following schema:

{
  "name": "test",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    {
      "name": "name1",
      "type": "string"
    },
    {
      "name": "name2",
      "type": "int"
    }
  ]
}
json

Assuming that “root_name” has been configured as the name for the root node and “record_name” has been configured as the name for the record nodes, the writer will write the following XML:

<root_name>
    <record_name>
        <name1>value1</name1>
        <name2>42</name2>
    </record_name>
    <record_name>
        <name1>value2</name1>
        <name2>84</name2>
    </record_name>
</root_name>
xml

The writer can additionally be configured to control how null or missing values in records are treated:

RecordSet (
  Record (
    Field "name1" = "value1",
    Field "name2" = null
  ),
  Record (
    Field "name1" = "value2",
  )
)

If the writer is configured to always suppress missing or null values, only the field named “name1” will appear in the XML. If the writer is configured to suppress missing values only, the field named “name2” will appear in the XML as a node without content for the first record. If the writer is configured never to suppress anything, the field named “name2” will appear in the XML as a node without content for both records.

Example: Arrays

The writer can also be configured to control how arrays are written:

RecordSet (
  Record (
    Field "name1" = "value1",
    Field "array_field" = [ 1, 2, 3 ]
  )
)

This record can be described by the following schema:

{
  "name": "test",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    {
      "name": "array_field",
      "type": {
        "type": "array",
        "items": int
      }
    },
    {
      "name": "name1",
      "type": "string"
    }
  ]
}
json

If the writer is configured not to wrap arrays, it will transform the record to the following XML:

<root_name>
    <record_name>
        <name1>value1</name1>
        <array_field>1</array_field>
        <array_field>2</array_field>
        <array_field>3</array_field>
    </record_name>
</root_name>
xml

If the writer is configured to wrap arrays using the field name as wrapper and “elem” as tag name for element nodes, it will transform the record to the following XML:

<root_name>
    <record_name>
        <name1>value1</name1>
        <array_field>
            <elem>1</elem>
            <elem>2</elem>
            <elem>3</elem>
        </array_field>
    </record_name>
</root_name>
xml

If the writer is configured to wrap arrays using “wrap” as wrapper and the field name as tag name for element nodes, it will transform the record to the following XML:

<root_name>
    <record_name>
        <name1>value1</name1>
        <wrap>
            <array_field>1</array_field>
            <array_field>2</array_field>
            <array_field>3</array_field>
        </wrap>
    </record_name>
</root_name>
xml

YamlTreeReader

Parses YAML into individual Record objects. While the reader expects each record to be well-formed YAML, the content of a FlowFile may consist of many records, each as a well-formed YAML array or YAML object. If an array is encountered, each element in that array will be treated as a separate record. If the schema that is configured contains a field that is not present in the YAML, a null value will be used. If the YAML contains a field that is not present in the schema, that field will be skipped. Please note this controller service does not support resolving the use of YAML aliases. Any alias present will be treated as a string. See the Usage of the Controller Service for more information and examples.

Tags: yaml, tree, record, reader, parser

Properties

Schema Access Strategy

Specifies how to obtain the schema that is to be used for interpreting the data.

Schema Registry

Specifies the Controller Service to use for the Schema Registry

Schema Name

Specifies the name of the schema to lookup in the Schema Registry property

Schema Version

Specifies the version of the schema to lookup in the Schema Registry. If not specified then the latest version of the schema will be retrieved.

Schema Branch

Specifies the name of the branch to use when looking up the schema in the Schema Registry property. If the chosen Schema Registry does not support branching, this value will be ignored.

Schema Text

The text of an Avro-formatted Schema

Schema Reference Reader

Service implementation responsible for reading FlowFile attributes or content to determine the Schema Reference Identifier

Schema Inference Cache

Specifies a Schema Cache to use when inferring the schema. If not populated, the schema will be inferred each time. However, if a cache is specified, the cache will first be consulted and if the applicable schema can be found, it will be used instead of inferring the schema.

Starting Field Strategy

Start processing from the root node or from a specified nested node.

Starting Field Name

Skips forward to the given nested JSON field (array or object) to begin processing.

Schema Application Strategy

Specifies whether the schema is defined for the whole JSON or for the selected part starting from "Starting Field Name".

Max String Length

The maximum allowed length of a string value when parsing the JSON document

Allow Comments

Whether to allow comments when parsing the JSON document

Date Format

Specifies the format to use when reading/writing Date fields. If not specified, Date fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters, as in 01/01/2017).

Time Format

Specifies the format to use when reading/writing Time fields. If not specified, Time fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, HH:mm:ss for a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 18:04:15).

Timestamp Format

Specifies the format to use when reading/writing Timestamp fields. If not specified, Timestamp fields will be assumed to be number of milliseconds since epoch (Midnight, Jan 1, 1970 GMT). If specified, the value must match the Java java.time.format.DateTimeFormatter format (for example, MM/dd/yyyy HH:mm:ss for a two-digit month, followed by a two-digit day, followed by a four-digit year, all separated by '/' characters; and then followed by a two-digit hour in 24-hour format, followed by a two-digit minute, followed by a two-digit second, all separated by ':' characters, as in 01/01/2017 18:04:15).

See Also

Additional Details

The YamlTreeReader Controller Service reads a YAML Object and creates a Record object either for the entire YAML Object tree or a subpart (see “Starting Field Strategies” section). The Controller Service must be configured with a Schema that describes the structure of the YAML data. If any field exists in the YAML that is not in the schema, that field will be skipped. If the schema contains a field for which no YAML field exists, a null value will be used in the Record (or the default value defined in the schema, if applicable).

If the root element of the YAML is a YAML Array, each YAML Object within that array will be treated as its own separate Record. If the root element is a YAML Object, the YAML will all be treated as a single Record.

Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field is omitted from the Record. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

  • Any data type can be coerced into a String type.

  • Any numeric data type (Byte, Short, Int, Long, Float, Double) can be coerced into any other numeric data type.

  • Any numeric value can be coerced into a Date, Time, or Timestamp type, by assuming that the Long value is the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • A String value can be coerced into a Date, Time, or Timestamp type, if its format matches the configured “Date Format,” “Time Format,” or “Timestamp Format.”

  • A String value can be coerced into a numeric value if the value is of the appropriate type. For example, the String value 8 can be coerced into any numeric type. However, the String value 8.2 can be coerced into a Double or Float type but not an Integer.

  • A String value of “true” or “false” (regardless of case) can be coerced into a Boolean value.

  • A String value that is not empty can be coerced into a Char type. If the String contains more than 1 character, the first character is used and the rest of the characters are ignored.

  • Any “date/time” type (Date, Time, Timestamp) can be coerced into any other “date/time” type.

  • Any “date/time” type can be coerced into a Long type, representing the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

  • Any “date/time” type can be coerced into a String. The format of the String is whatever DateFormat is configured for the corresponding property (Date Format, Time Format, Timestamp Format property). If no value is specified, then the value will be converted into a String representation of the number of milliseconds since epoch (Midnight GMT, January 1, 1970).

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.
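
To make a few of these rules concrete, the following conceptual sketch (with hypothetical field names; this is not an Avro schema) pairs incoming string values with the values that result from coercing them against a schema that declares the fields as int, double, boolean, and char, respectively:

{
  "incoming": { "age": "8", "price": "8.2", "active": "true", "grade": "BC" },
  "declared": { "age": "int", "price": "double", "active": "boolean", "grade": "char" },
  "coerced":  { "age": 8, "price": 8.2, "active": true, "grade": "B" }
}
json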

Schema Inference

While NiFi’s Record API does require that each Record have a schema, it is often convenient to infer the schema based on the values in the data, rather than having to manually create a schema. This is accomplished by selecting a value of “Infer Schema” for the “Schema Access Strategy” property. When using this strategy, the Reader will determine the schema by first parsing all data in the FlowFile, keeping track of all fields that it has encountered and the type of each field. Once all data has been parsed, a schema is formed that encompasses all fields that have been encountered.

A common concern when inferring schemas is how to handle the condition of two values that have different types. For example, consider a FlowFile with the following two records:

[
  {
    "name": "John",
    "age": 8,
    "values": "N/A"
  },
  {
    "name": "Jane",
    "age": "Ten",
    "values": [
      8,
      "Ten"
    ]
  }
]
json

It is clear that the “name” field will be inferred as a STRING type. However, how should we handle the “age” field? Should the field be a CHOICE between INT and STRING? Should we prefer LONG over INT? Should we just use a STRING? Should the field be considered nullable?

To help understand how this Record Reader infers schemas, we have the following list of rules that are followed in the inference logic:

  • All fields are inferred to be nullable.

  • When two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), the inference engine prefers to use a “wider” data type over using a CHOICE data type. A data type “A” is said to be wider than data type “B” if and only if data type “A” encompasses all values of “B” in addition to other values. For example, the LONG type is wider than the INT type but not wider than the BOOLEAN type (and BOOLEAN is also not wider than LONG). INT is wider than SHORT. The STRING type is considered wider than all other types except MAP, RECORD, ARRAY, and CHOICE.

  • If two values are encountered for the same field in two different records (or two values are encountered for an ARRAY type), but neither value is of a type that is wider than the other, then a CHOICE type is used. In the example above, the “values” field will be inferred as a CHOICE between a STRING and an ARRAY.

  • If the “Time Format,” “Timestamp Format,” or “Date Format” properties are configured, any value that would otherwise be considered a STRING type is first checked against the configured formats to see if it matches any of them. If the value matches the Timestamp Format, the value is considered a Timestamp field. If it matches the Date Format, it is considered a Date field. If it matches the Time Format, it is considered a Time field. In the unlikely event that the value matches more than one of the configured formats, they will be matched in the order: Timestamp, Date, Time. I.e., if a value matched both the Timestamp Format and the Date Format, the type that is inferred will be Timestamp. Because parsing dates and times can be expensive, it is advisable not to configure these formats if dates, times, and timestamps are not expected, or if processing the data as a STRING is acceptable. For use cases when this is important, though, the inference engine is intelligent enough to optimize the parsing by first checking several very cheap conditions. For example, the string’s length is examined to see if it is too long or too short to match the pattern. This results in far more efficient processing than would result if attempting to parse each string value as a timestamp.

  • The MAP type is never inferred. Instead, the RECORD type is used.

  • If a field exists but all values are null, then the field is inferred to be of type STRING.

Caching of Inferred Schemas

This Record Reader requires that if a schema is to be inferred, that all records be read in order to ensure that the schema that gets inferred is applicable for all records in the FlowFile. However, this can become expensive, especially if the data undergoes many different transformations. To alleviate the cost of inferring schemas, the Record Reader can be configured with a “Schema Inference Cache” by populating the property with that name. This is a Controller Service that can be shared by Record Readers and Record Writers.

Whenever a Record Writer is used to write data, if it is configured with a “Schema Cache,” it will also add the schema to the Schema Cache. This will result in an identifier for that schema being added as an attribute to the FlowFile.

Whenever a Record Reader is used to read data, if it is configured with a “Schema Inference Cache”, it will first look for a “schema.cache.identifier” attribute on the FlowFile. If the attribute exists, it will use the value of that attribute to lookup the schema in the schema cache. If it is able to find a schema in the cache with that identifier, then it will use that schema instead of reading, parsing, and analyzing the data to infer the schema. If the attribute is not available on the FlowFile, or if the attribute is available but the cache does not have a schema with that identifier, then the Record Reader will proceed to infer the schema as described above.

The end result is that users are able to chain together many different Processors to operate on Record-oriented data. Typically, only the first such Processor in the chain will incur the “penalty” of inferring the schema. For all other Processors in the chain, the Record Reader is able to simply lookup the schema in the Schema Cache by identifier. This allows the Record Reader to infer a schema accurately, since it is inferred based on all data in the FlowFile, and still allows this to happen efficiently since the schema will typically only be inferred once, regardless of how many Processors handle the data.

Starting Field Strategies

When using YamlTreeReader, two different starting field strategies can be selected. With the default Root Node strategy, the YamlTreeReader begins processing from the root element of the YAML and creates a Record object for the entire YAML Object tree, while the Nested Field strategy defines a nested field from which to begin processing.

Using the Nested Field strategy, a schema corresponding to the nested YAML part should be specified. In case of schema inference, the YamlTreeReader will automatically infer a schema from nested records.

Root Node Strategy

Consider the following YAML is read with the default Root Node strategy:

- id: 17
  name: John
  child:
    id: "1"
  dob: 10-29-1982
  siblings:
    - name: Jeremy
      id: 4
    - name: Julia
      id: 8
- id: 98
  name: Jane
  child:
    id: 2
  dob: 08-30-1984
  gender: F
  siblingIds: []
  siblings: []
yml

Also, consider that the schema that is configured for this YAML is as follows (assuming that the AvroSchemaRegistry Controller Service is chosen to denote the Schema):

{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "id",
      "type": [
        "int",
        "null"
      ]
    },
    {
      "name": "name",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "child",
      "type": [
        {
          "type": "record",
          "name": "childType",
          "fields": [
            {
              "name": "id",
              "type": [
                "int",
                "string",
                "null"
              ]
            }
          ]
        },
        "null"
      ]
    },
    {
      "name": "dob",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "siblings",
      "type": [
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "siblingsType",
            "fields": [
              {
                "name": "name",
                "type": [
                  "string",
                  "null"
                ]
              },
              {
                "name": "id",
                "type": [
                  "int",
                  "null"
                ]
              }
            ]
          }
        },
        "null"
      ]
    },
    {
      "name": "gender",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "siblingIds",
      "type": [
        {
          "type": "array",
          "items": "string"
        },
        "null"
      ]
    }
  ]
}
json

Let us also assume that this Controller Service is configured with the “Date Format” property set to “MM-dd-yyyy”, as this matches the date format used for our YAML data. This will result in the YAML creating two separate records, because the root element is a YAML array with two elements.

The first Record will consist of the following values:

Field Name    Field Value
id            17
name          John
gender        null
dob           10-29-1982
siblings      array with two elements, each of which is itself a Record:

Field Name    Field Value
name          Jeremy
id            4

and:

Field Name    Field Value
name          Julia
id            8

The second Record will consist of the following values:

Field Name    Field Value
id            98
name          Jane
gender        F
dob           08-30-1984
siblings      empty array

Nested Field Strategy

Using the Nested Field strategy, consider the same YAML where the specified Starting Field Name is “siblings”. The schema that is configured for this YAML is as follows:

{
  "namespace": "nifi",
  "name": "siblings",
  "type": "record",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "id",
      "type": "int"
    }
  ]
}
json

The first Record will consist of the following values:

Field Name    Field Value
name          Jeremy
id            4

The second Record will consist of the following values:

Field Name    Field Value
name          Julia
id            8

Schema Application Strategies

When the Nested Field strategy is used and the “Schema Access Strategy” is not “Infer Schema”, the schema can be defined either for the entire original YAML (“Whole document” strategy) or only for the nested section selected by “Starting Field Name” (“Selected part” strategy).

ZendeskRecordSink

Create Zendesk tickets using the Zendesk API. The service requires a Zendesk account with configured access.

Tags: zendesk, record, sink

Properties

Web Client Service Provider

Controller service for HTTP client operations.

Subdomain Name

Name of the Zendesk subdomain.

User Name

Login user to Zendesk subdomain.

Authentication Type

Type of authentication to Zendesk API.

Authentication Credential

Password or authentication token for Zendesk login user.

Comment Body

The content or the path to the comment body in the incoming record.

Subject

The content or the path to the subject in the incoming record.

Priority

The content or the path to the priority in the incoming record.

Type

The content or the path to the type in the incoming record.

Cache Size

Specifies how many Zendesk tickets should be cached.

Cache Expiration

Specifies how long a Zendesk ticket that is cached should remain in the cache.

Additional Details

Description

The sink uses the Zendesk API to ingest tickets into Zendesk, using the incoming records to construct request objects.

Authentication

Zendesk API uses basic authentication. Either a password or an authentication token has to be provided. In Zendesk API Settings, it’s possible to generate authentication tokens, eliminating the need for users to expose their passwords. This approach also offers the advantage of fast token revocation when required.

Property values

There are multiple ways of providing property values to the request object:

Record Path:

The property value is going to be evaluated as a record path if the value is provided inside brackets starting with a ‘%’.

Example:

The incoming record looks like this:

{
  "record": {
    "description": "This is a sample description.",
    "issue\_type": "Immediate",
    "issue": {
      "name": "General error",
      "type": "Immediate"
    },
    "project": {
      "name": "Maintenance"
    }
  }
}
json

We are going to provide Record Path values for the Comment Body, Subject, Priority and Type processor attributes:

Comment Body : %{/record/description}
Subject : %{/record/issue/name}
Priority : %{/record/issue/type}
Type : %{/record/project/name}

The constructed request object that is going to be sent to the Zendesk API will look like this:

{
  "comment": {
    "body": "This is a sample description."
  },
  "subject": "General error",
  "priority": "Immediate",
  "type": "Maintenance"
}
json

Constant:

The property value is treated as a constant if the provided value does not match the Record Path format.

Example:

We are going to provide constant values for the Comment Body, Subject, Priority and Type processor attributes:

Comment Body : Sample description
Subject : Sample subject
Priority : High
Type : Sample type

The constructed request object that is going to be sent to the Zendesk API will look like this:

{
  "comment": {
    "body": "Sample description"
  },
  "subject": "Sample subject",
  "priority": "High",
  "type": "Sample type"
}
json
Additional properties

The processor offers a set of frequently used Zendesk ticket attributes within its property list. However, users have the flexibility to include any number of additional properties using dynamic properties. These dynamic properties use their keys as JSON Pointers, which denote paths within the request object, while their values are handled in the same way as the predefined property attributes. The possible Zendesk request attributes can be found in the Zendesk API documentation.

Property Key values:

The dynamic property key must be a valid Json Pointer value which has the following syntax rules:

  • The path starts with /.

  • Each segment is separated by /.

  • Each segment can be interpreted as either an array index or an object key.

Example:

We are going to add a new dynamic property to the processor:

/request/new_object : This is a new property
/request/new_array/0 : This is a new array element

The constructed request object will look like this:

{
  "request": {
    "new_object": "This is a new property",
    "new_array": [
      "This is a new array element"
    ]
  }
}
json
Caching

The sink caches Zendesk tickets with the same content in order to avoid duplicate issues. The cache size and expiration time can be set on the sink service.

Reporting Tasks

AzureLogAnalyticsProvenanceReportingTask

Publishes Provenance events to an Azure Log Analytics workspace.

Tags: azure, provenance, reporting, log analytics

Properties

Log Analytics Workspace Id

Log Analytics Workspace Id

Log Analytics Custom Log Name

Log Analytics Custom Log Name

Log Analytics Workspace Key

Azure Log Analytics Workspace Key

Application ID

The Application ID to be included in the metrics sent to Azure Log Analytics WS

Instance ID

Id of this NiFi instance to be included in the metrics sent to Azure Log Analytics WS

Job Name

The name of the exporting job

Log Analytics URL Endpoint Format

Log Analytics URL Endpoint Format

Event Type to Include

Comma-separated list of event types that will be used to filter the provenance events sent by the reporting task. Available event types are [CREATE, RECEIVE, FETCH, SEND, UPLOAD, REMOTE_INVOCATION, DOWNLOAD, DROP, EXPIRE, FORK, JOIN, CLONE, CONTENT_MODIFIED, ATTRIBUTES_MODIFIED, ROUTE, ADDINFO, REPLAY, UNKNOWN]. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative.

Event Type to Exclude

Comma-separated list of event types that will be used to exclude the provenance events sent by the reporting task. Available event types are [CREATE, RECEIVE, FETCH, SEND, UPLOAD, REMOTE_INVOCATION, DOWNLOAD, DROP, EXPIRE, FORK, JOIN, CLONE, CONTENT_MODIFIED, ATTRIBUTES_MODIFIED, ROUTE, ADDINFO, REPLAY, UNKNOWN]. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative. If an event type is included in Event Type to Include and excluded here, then the exclusion takes precedence and the event will not be sent.

Component Type to Include

Regular expression to filter the provenance events based on the component type. Only the events matching the regular expression will be sent. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative.

Component Type to Exclude

Regular expression to exclude the provenance events based on the component type. The events matching the regular expression will not be sent. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative. If a component type is included in Component Type to Include and excluded here, then the exclusion takes precedence and the event will not be sent.

Component ID to Include

Comma-separated list of component UUID that will be used to filter the provenance events sent by the reporting task. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative.

Component ID to Exclude

Comma-separated list of component UUID that will be used to exclude the provenance events sent by the reporting task. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative. If a component UUID is included in Component ID to Include and excluded here, then the exclusion takes precedence and the event will not be sent.

Component Name to Include

Regular expression to filter the provenance events based on the component name. Only the events matching the regular expression will be sent. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative.

Component Name to Exclude

Regular expression to exclude the provenance events based on the component name. The events matching the regular expression will not be sent. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative. If a component name is included in Component Name to Include and excluded here, then the exclusion takes precedence and the event will not be sent.

Start Position

If the Reporting Task has never been run, or if its state has been reset by a user, specifies where in the stream of Provenance Events the Reporting Task should start

Include Null Values

Indicates whether null values should be included in records. Defaults to false

Platform

The value to use for the platform field in each event.

Instance URL

The URL of this instance to use in the Content URI of each event.

Batch Size

Specifies how many records to send in a single batch, at most.

AzureLogAnalyticsReportingTask

Sends JVM metrics as well as Apache NiFi metrics to an Azure Log Analytics workspace. Apache NiFi metrics can be configured either globally or on the process-group level.

Tags: azure, metrics, reporting, log analytics

Properties

Send JVM Metrics

Send JVM Metrics in addition to the NiFi-metrics

Log Analytics Workspace Id

Log Analytics Workspace Id

Log Analytics Custom Log Name

Log Analytics Custom Log Name

Log Analytics Workspace Key

Azure Log Analytics Workspace Key

Application ID

The Application ID to be included in the metrics sent to Azure Log Analytics WS

Instance ID

Id of this NiFi instance to be included in the metrics sent to Azure Log Analytics WS

Process group ID(s)

If specified, the reporting task will send metrics for the configured ProcessGroup(s) only. Multiple IDs should be separated by a comma. If none of the group IDs can be found or no IDs are defined, the Root Process Group is used and global metrics are sent.

Job Name

The name of the exporting job

Log Analytics URL Endpoint Format

Log Analytics URL Endpoint Format

ControllerStatusReportingTask

Logs the 5-minute stats that are shown in the NiFi Summary Page for Processors and Connections, as well as optionally logging the deltas between the previous iteration and the current iteration. Processors' stats are logged using the org.apache.nifi.controller.ControllerStatusReportingTask.Processors logger, while Connections' stats are logged using the org.apache.nifi.controller.ControllerStatusReportingTask.Connections logger. These can be configured in the NiFi logging configuration to log to different files, if desired.

Tags: stats, log

Properties

Show Deltas

Specifies whether or not to show the difference in values between the current status and the previous status

Reporting Granularity

When reporting information, specifies the granularity of the metrics to report

Additional Details

Description:

Reporting Task that creates a log message for each Processor and each Connection in the flow. For Processors, the following information is included (sorted by descending Processing Timing):

  • Processor Name

  • Processor ID

  • Processor Type

  • Run Status

  • Flow Files In (5 mins)

  • FlowFiles Out (5 mins)

  • Bytes Read from Disk (5 mins)

  • Bytes Written to Disk (5 mins)

  • Number of Tasks Completed (5 mins)

  • Processing Time (5 mins)

For Connections, the following information is included (sorted by descending size of queued FlowFiles):

  • Connection Name

  • Connection ID

  • Source Component Name

  • Destination Component Name

  • Flow Files In (5 mins)

  • FlowFiles Out (5 mins)

  • FlowFiles Queued

It may be convenient to redirect the logging output of this ReportingTask to a log file separate from the typical application log. This can be accomplished by modifying the logback.xml file in the NiFi conf/ directory so that a logger with the name org.apache.nifi.controller.ControllerStatusReportingTask is configured to write to a separate log.

Additionally, it may be convenient to disable logging for Processors or for Connections or to split them into separate log files. This can be accomplished by using the loggers named org.apache.nifi.controller.ControllerStatusReportingTask.Processors and org.apache.nifi.controller.ControllerStatusReportingTask.Connections, respectively.

ExtendedPrometheusReportingTask

Reports metrics in Prometheus format by creating a /metrics HTTP(S) endpoint which can be used for external monitoring of the application. The reporting task reports a set of metrics regarding the JVM (optional) and the NiFi instance. Note that if the underlying Jetty server (i.e. the Prometheus endpoint) cannot be started (for example if two PrometheusReportingTask instances are started on the same port), this may cause a delay in shutting down NiFi while it waits for the server resources to be cleaned up. In addition to the original PrometheusReportingTask, this reporting task is able to export bulletins.

Tags: reporting, prometheus, metrics, time series data, virtimo, bulletins

Properties

Prometheus Metrics Endpoint Port

The Port where prometheus metrics can be accessed

Instance ID

Id of this NiFi instance to be included in the metrics sent to Prometheus

Metrics Reporting Strategy

The granularity on which to report metrics. Options include only the root process group, all process groups, or all components

Send JVM metrics

Send JVM metrics in addition to the NiFi metrics

Send Bulletins

Send Bulletins in addition to the NiFi metrics

SSL Context Service

The SSL Context Service to use in order to secure the server. If specified, the server will accept only HTTPS requests; otherwise, the server will accept only HTTP requests

Client Authentication

Specifies whether or not the Reporting Task should authenticate clients. This value is ignored if the <SSL Context Service> Property is not specified or the SSL Context provided uses only a KeyStore and not a TrustStore.

MonitorDiskUsage

Checks the amount of storage space available for the specified directory and warns (via a log message and a System-Level Bulletin) if the partition on which it lives exceeds some configurable threshold of storage space

Tags: disk, storage, warning, monitoring, repo

Properties

Threshold

The threshold at which a bulletin will be generated to indicate that the disk usage of the partition on which the monitored directory is found is of concern

Directory Location

The directory path of the partition to be monitored.

Directory Display Name

The name to display for the directory in alerts.

MonitorMemory

Checks the amount of Java Heap available in the JVM for a particular JVM Memory Pool. If the amount of space used exceeds some configurable threshold, will warn (via a log message and System-Level Bulletin) that the memory pool is exceeding this threshold.

Tags: monitor, memory, heap, jvm, gc, garbage collection, warning

Properties

Memory Pool

The name of the JVM Memory Pool to monitor. The allowed values for Memory Pools are platform and JVM dependent and may vary for different versions of Java and from published documentation. This reporting task will become invalidated if configured to use a Memory Pool that is not available on the currently running host platform and JVM

Usage Threshold

Indicates the threshold at which warnings should be generated. This can be a percentage or a Data Size

Reporting Interval

Indicates how often this reporting task should report bulletins while the memory utilization exceeds the configured threshold

QueryNiFiReportingTask

Publishes NiFi status information based on the results of a user-specified SQL query. The query may make use of the CONNECTION_STATUS, PROCESSOR_STATUS, BULLETINS, PROCESS_GROUP_STATUS, JVM_METRICS, CONNECTION_STATUS_PREDICTIONS, FLOW_CONFIG_HISTORY, or PROVENANCE tables, and can use any functions or capabilities provided by Apache Calcite. Note that the CONNECTION_STATUS_PREDICTIONS table is not available for querying if analytics are not enabled (see the nifi.analytics.predict.enabled property in nifi.properties). Attempting a query on the table when the capability is disabled will cause an error.

Tags: status, connection, processor, jvm, metrics, history, bulletin, prediction, process, group, provenance, record, sql, flow, config

Properties

SQL Query

The SQL SELECT statement specifies which tables to query and how data should be filtered/transformed. SQL SELECT can select from the CONNECTION_STATUS, PROCESSOR_STATUS, BULLETINS, PROCESS_GROUP_STATUS, JVM_METRICS, CONNECTION_STATUS_PREDICTIONS, or PROVENANCE tables. Note that the CONNECTION_STATUS_PREDICTIONS table is not available for querying if analytics are not enabled.

Record Destination Service

Specifies the Controller Service to use for writing out the query result records to some destination.

Include Zero Record Results

When running the SQL statement, if the result has no data, this property specifies whether or not the empty result set will be transmitted.

Default Decimal Precision

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'precision' denoting number of available digits is required. Generally, precision is defined by column data type definition or database engines default. However undefined precision (0) can be returned from some database engines. 'Default Decimal Precision' is used when writing those undefined precision numbers.

Default Decimal Scale

When a DECIMAL/NUMBER value is written as a 'decimal' Avro logical type, a specific 'scale' denoting number of available decimal digits is required. Generally, scale is defined by column data type definition or database engines default. However when undefined precision (0) is returned, scale can also be uncertain with some database engines. 'Default Decimal Scale' is used when writing those undefined numbers. If a value has more decimals than specified scale, then the value will be rounded-up, e.g. 1.53 becomes 2 with scale 0, and 1.5 with scale 1.

Stateful

Scope: Local

Stores the Reporting Task’s last execution time so that on restart the task knows where it left off.

Additional Details

Summary

This reporting task can be used to issue SQL queries against various NiFi metrics information, modeled as tables, and transmit the query results to some specified destination. The query may make use of the CONNECTION_STATUS, PROCESSOR_STATUS, BULLETINS, PROCESS_GROUP_STATUS, JVM_METRICS, CONNECTION_STATUS_PREDICTIONS, or PROVENANCE tables, and can use any functions or capabilities provided by Apache Calcite, including JOINs, aggregate functions, etc.

The results are transmitted to the destination using the configured Record Sink service, such as SiteToSiteReportingRecordSink (for sending via the Site-to-Site protocol) or DatabaseRecordSink (for sending the query result rows to a relational database).

The reporting task can handle items from the bulletin and provenance repositories uniquely, meaning that an item will only be processed once when the query is set to be unique. The query can be made unique by defining a time window with special SQL placeholders ($bulletinStartTime, $bulletinEndTime, $provenanceStartTime, $provenanceEndTime) that the reporting task evaluates at runtime. See the SQL Query Examples section.

Table Definitions

Below is a list of definitions for all the “tables” supported by this reporting task. Note that these are not persistent/materialized tables, rather they are non-materialized views for which the sources are re-queried at every execution. This means that a query executed twice may return different results, for example if new status information is available, or in the case of JVM_METRICS (for example), a new snapshot of the JVM at query-time.

CONNECTION_STATUS
Column                          Data Type
id                              String
groupId                         String
name                            String
sourceId                        String
sourceName                      String
destinationId                   String
destinationName                 String
backPressureDataSizeThreshold   String
backPressureBytesThreshold      long
backPressureObjectThreshold     long
isBackPressureEnabled           boolean
inputCount                      int
inputBytes                      long
queuedCount                     int
queuedBytes                     long
outputCount                     int
outputBytes                     long
maxQueuedCount                  int
maxQueuedBytes                  long

PROCESSOR_STATUS
Column                    Data Type
id                        String
groupId                   String
name                      String
processorType             String
averageLineageDuration    long
bytesRead                 long
bytesWritten              long
bytesReceived             long
bytesSent                 long
flowFilesRemoved          int
flowFilesReceived         int
flowFilesSent             int
inputCount                int
inputBytes                long
outputCount               int
outputBytes               long
activeThreadCount         int
terminatedThreadCount     int
invocations               int
processingNanos           long
runStatus                 String
executionNode             String

BULLETINS
Column                 Data Type
bulletinId             long
bulletinCategory       String
bulletinGroupId        String
bulletinGroupName      String
bulletinGroupPath      String
bulletinLevel          String
bulletinMessage        String
bulletinNodeAddress    String
bulletinNodeId         String
bulletinSourceId       String
bulletinSourceName     String
bulletinSourceType     String
bulletinTimestamp      Date
bulletinFlowFileUuid   String

PROCESS_GROUP_STATUS
Column                  Data Type
id                      String
groupId                 String
name                    String
bytesRead               long
bytesWritten            long
bytesReceived           long
bytesSent               long
bytesTransferred        long
flowFilesReceived       int
flowFilesSent           int
flowFilesTransferred    int
inputContentSize        long
inputCount              int
outputContentSize       long
outputCount             int
queuedContentSize       long
activeThreadCount       int
terminatedThreadCount   int
queuedCount             int
versionedFlowState      String
processingNanos         long

JVM_METRICS

The JVM_METRICS table has dynamic columns, in the sense that the “garbage collector runs” and “garbage collector time” columns appear once for each Java garbage collector in the JVM.
The column names end with the name of the garbage collector substituted for the <garbage_collector_name> expression below:

Column                                  Data Type
jvm_daemon_thread_count                 int
jvm_thread_count                        int
jvm_thread_states_blocked               int
jvm_thread_states_runnable              int
jvm_thread_states_terminated            int
jvm_thread_states_timed_waiting         int
jvm_uptime                              long
jvm_heap_used                           double
jvm_heap_usage                          double
jvm_non_heap_usage                      double
jvm_file_descriptor_usage               double
jvm_gc_runs_<garbage_collector_name>    long
jvm_gc_time_<garbage_collector_name>    long

CONNECTION_STATUS_PREDICTIONS
Column                                    Data Type
connectionId                              String
predictedQueuedBytes                      long
predictedQueuedCount                      int
predictedPercentBytes                     int
predictedPercentCount                     int
predictedTimeToBytesBackpressureMillis    long
predictedTimeToCountBackpressureMillis    long
predictionIntervalMillis                  long

PROVENANCE
Column               Data Type
eventId              long
eventType            String
timestampMillis      long
durationMillis       long
lineageStart         long
details              String
componentId          String
componentName        String
componentType        String
processGroupId       String
processGroupName     String
entityId             String
entityType           String
entitySize           long
previousEntitySize   long
updatedAttributes    Map<String,String>
previousAttributes   Map<String,String>
contentPath          String
previousContentPath  String
parentIds            Array
childIds             Array
transitUri           String
remoteIdentifier     String
alternateIdentifier  String

FLOW_CONFIG_HISTORY
Column                         Data Type
actionId                       int
actionTimestamp                long
actionUserIdentity             String
actionSourceId                 String
actionSourceName               String
actionSourceType               String
actionOperation                String
configureDetailsName           String
configureDetailsPreviousValue  String
configureDetailsValue          String
connectionSourceId             String
connectionSourceName           String
connectionSourceType           String
connectionDestinationId        String
connectionDestinationName      String
connectionDestinationType      String
connectionRelationship         String
moveGroup                      String
moveGroupId                    String
movePreviousGroup              String
movePreviousGroupId            String
purgeEndDate                   long

SQL Query Examples

Example: Select all fields from the CONNECTION_STATUS table:

SELECT * FROM CONNECTION_STATUS
sql

Example: Select connection IDs where time-to-backpressure (based on queue count) is less than 5 minutes:

SELECT connectionId FROM CONNECTION_STATUS_PREDICTIONS WHERE predictedTimeToCountBackpressureMillis < 300000
sql

Example: Get the unique bulletin categories associated with errors:

SELECT DISTINCT(bulletinCategory) FROM BULLETINS WHERE bulletinLevel = 'ERROR'
sql

Example: Select all fields from the BULLETINS table with time window:

SELECT * from BULLETINS WHERE bulletinTimestamp > $bulletinStartTime AND bulletinTimestamp <= $bulletinEndTime
sql

Example: Select all fields from the PROVENANCE table with time window:

SELECT * from PROVENANCE where timestampMillis > $provenanceStartTime and timestampMillis <= $provenanceEndTime
sql

Example: Select connection-related fields from the FLOW_CONFIG_HISTORY table:

SELECT connectionSourceName, connectionDestinationName, connectionRelationship
from FLOW_CONFIG_HISTORY
sql
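
Example (an additional illustration, not part of the original examples; it assumes the reporting task's SQL engine supports standard joins): combine each connection's current status with its backpressure predictions:

SELECT s.name, s.queuedCount, p.predictedTimeToCountBackpressureMillis
FROM CONNECTION_STATUS s
JOIN CONNECTION_STATUS_PREDICTIONS p ON s.id = p.connectionId
sql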

ScriptedReportingTask

Provides reporting and status information to a script. ReportingContext, ComponentLog, and VirtualMachineMetrics objects are made available as variables (context, log, and vmMetrics, respectively) to the script for further processing. The context makes various information available such as events, provenance, bulletins, controller services, process groups, Java Virtual Machine metrics, etc.

Tags: reporting, script, execute, groovy

Properties

Script Engine

No Script Engines found

Script File

Path to script file to execute. Only one of Script File or Script Body may be used

Script Body

Body of script to execute. Only one of Script File or Script Body may be used

Module Directory

Comma-separated list of paths to files and/or directories which contain modules required by the script.

Dynamic Properties

A script engine property to update

Updates a script engine property specified by the Dynamic Property’s key with the value specified by the Dynamic Property’s value

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

SiteToSiteBulletinReportingTask

Publishes Bulletin events using the Site To Site protocol. Note: only up to 5 bulletins are stored per component and up to 10 bulletins at controller level for a duration of up to 5 minutes. If this reporting task is not scheduled frequently enough some bulletins may not be sent.

Tags: bulletin, site, site to site

Properties

Destination URL

The URL of the destination NiFi instance or, if clustered, a comma-separated list of addresses in the format http(s)://host:port/nifi. This destination URL will only be used to initiate the Site-to-Site connection. The data sent by this reporting task will be load-balanced across all the nodes of the destination (if clustered).

Input Port Name

The name of the Input Port to deliver data to.

SSL Context Service

The SSL Context Service to use when communicating with the destination. If not specified, communications will not be secure.

Instance URL

The URL of this instance to use in the Content URI of each event.

Compress Events

Indicates whether or not to compress the data being sent.

Communications Timeout

Specifies how long to wait for a response from the destination before deciding that an error has occurred and canceling the transaction

Transport Protocol

Specifies which transport protocol to use for Site-to-Site communication.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Record Writer

Specifies the Controller Service to use for writing out the records.

Include Null Values

Indicates whether null values should be included in records. The default is false

Platform

The value to use for the platform field in each event.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Additional Details

The Site-to-Site Bulletin Reporting Task allows the user to publish Bulletin events using the Site To Site protocol. Note: only up to 5 bulletins are stored per component and up to 10 bulletins at controller level for a duration of up to 5 minutes. If this reporting task is not scheduled frequently enough some bulletins may not be sent.

Record writer

The user can define a Record Writer and directly specify the output format and data with the assumption that the input schema is the following:

{
  "type": "record",
  "name": "bulletins",
  "namespace": "bulletins",
  "fields": [
    {
      "name": "objectId",
      "type": "string"
    },
    {
      "name": "platform",
      "type": "string"
    },
    {
      "name": "bulletinId",
      "type": "long"
    },
    {
      "name": "bulletinCategory",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinGroupId",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinGroupName",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinGroupPath",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinLevel",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinMessage",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinNodeAddress",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinNodeId",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinSourceId",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinSourceName",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinSourceType",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bulletinTimestamp",
      "type": [
        "string",
        "null"
      ],
      "doc": "Format: yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
    },
    {
      "name": "bulletinFlowFileUuid",
      "type": [
        "string",
        "null"
      ]
    }
  ]
}
json

SiteToSiteMetricsReportingTask

Publishes the same metrics as the Ambari Reporting task using the Site To Site protocol.

Tags: status, metrics, site, site to site

Properties

Destination URL

The URL of the destination NiFi instance or, if clustered, a comma-separated list of addresses in the format http(s)://host:port/nifi. This destination URL will only be used to initiate the Site-to-Site connection. The data sent by this reporting task will be load-balanced across all the nodes of the destination (if clustered).

Input Port Name

The name of the Input Port to deliver data to.

SSL Context Service

The SSL Context Service to use when communicating with the destination. If not specified, communications will not be secure.

Instance URL

The URL of this instance to use in the Content URI of each event.

Compress Events

Indicates whether or not to compress the data being sent.

Communications Timeout

Specifies how long to wait for a response from the destination before deciding that an error has occurred and canceling the transaction

Transport Protocol

Specifies which transport protocol to use for Site-to-Site communication.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Record Writer

Specifies the Controller Service to use for writing out the records.

Include Null Values

Indicates whether null values should be included in records. The default is false

Hostname

The Hostname of this NiFi instance to be included in the metrics

Application ID

The Application ID to be included in the metrics

Output Format

The output format that will be used for the metrics. If Record Format is selected, a Record Writer must be provided. If Ambari Format is selected, the Record Writer property should be empty.

Additional Details

The Site-to-Site Metrics Reporting Task allows the user to publish NiFi’s metrics (as in the Ambari reporting task) to the same NiFi instance or another NiFi instance. This provides a great deal of power because it allows the user to make use of all the different Processors that are available in NiFi in order to process or distribute that data.

Ambari format

There are two available output formats. The first is the Ambari format, as defined by the Ambari Metrics Collector API, which is JSON with dynamic keys. If you use this format, the Jolt specification below may be useful for transforming the data.

[
  {
    "operation": "shift",
    "spec": {
      "metrics": {
        "*": {
          "metrics": {
            "*": {
              "$": "metrics.[#4].metrics.time",
              "@": "metrics.[#4].metrics.value"
            }
          },
          "*": "metrics.[&1].&"
        }
      }
    }
  }
]
json

This would transform the below sample:

{
  "metrics": [
    {
      "metricname": "jvm.gc.time.G1OldGeneration",
      "appid": "nifi",
      "instanceid": "8927f4c0-0160-1000-597a-ea764ccd81a7",
      "hostname": "localhost",
      "timestamp": "1520456854361",
      "starttime": "1520456854361",
      "metrics": {
        "1520456854361": "0"
      }
    },
    {
      "metricname": "jvm.thread_states.terminated",
      "appid": "nifi",
      "instanceid": "8927f4c0-0160-1000-597a-ea764ccd81a7",
      "hostname": "localhost",
      "timestamp": "1520456854361",
      "starttime": "1520456854361",
      "metrics": {
        "1520456854361": "0"
      }
    }
  ]
}
json

into:

{
  "metrics": [
    {
      "metricname": "jvm.gc.time.G1OldGeneration",
      "appid": "nifi",
      "instanceid": "8927f4c0-0160-1000-597a-ea764ccd81a7",
      "hostname": "localhost",
      "timestamp": "1520456854361",
      "starttime": "1520456854361",
      "metrics": {
        "time": "1520456854361",
        "value": "0"
      }
    },
    {
      "metricname": "jvm.thread_states.terminated",
      "appid": "nifi",
      "instanceid": "8927f4c0-0160-1000-597a-ea764ccd81a7",
      "hostname": "localhost",
      "timestamp": "1520456854361",
      "starttime": "1520456854361",
      "metrics": {
        "time": "1520456854361",
        "value": "0"
      }
    }
  ]
}
json
Record format

The second format leverages the record framework of NiFi, so the user can define a Record Writer and directly specify the output format and data with the assumption that the input schema is the following:

{
  "type": "record",
  "name": "metrics",
  "namespace": "metrics",
  "fields": [
    {
      "name": "appid",
      "type": "string"
    },
    {
      "name": "instanceid",
      "type": "string"
    },
    {
      "name": "hostname",
      "type": "string"
    },
    {
      "name": "timestamp",
      "type": "long"
    },
    {
      "name": "loadAverage1min",
      "type": "double"
    },
    {
      "name": "availableCores",
      "type": "int"
    },
    {
      "name": "FlowFilesReceivedLast5Minutes",
      "type": "int"
    },
    {
      "name": "BytesReceivedLast5Minutes",
      "type": "long"
    },
    {
      "name": "FlowFilesSentLast5Minutes",
      "type": "int"
    },
    {
      "name": "BytesSentLast5Minutes",
      "type": "long"
    },
    {
      "name": "FlowFilesQueued",
      "type": "int"
    },
    {
      "name": "BytesQueued",
      "type": "long"
    },
    {
      "name": "BytesReadLast5Minutes",
      "type": "long"
    },
    {
      "name": "BytesWrittenLast5Minutes",
      "type": "long"
    },
    {
      "name": "ActiveThreads",
      "type": "int"
    },
    {
      "name": "TotalTaskDurationSeconds",
      "type": "long"
    },
    {
      "name": "TotalTaskDurationNanoSeconds",
      "type": "long"
    },
    {
      "name": "jvmuptime",
      "type": "long"
    },
    {
      "name": "jvmheap_used",
      "type": "double"
    },
    {
      "name": "jvmheap_usage",
      "type": "double"
    },
    {
      "name": "jvmnon_heap_usage",
      "type": "double"
    },
    {
      "name": "jvmthread_statesrunnable",
      "type": [
        "int",
        "null"
      ]
    },
    {
      "name": "jvmthread_statesblocked",
      "type": [
        "int",
        "null"
      ]
    },
    {
      "name": "jvmthread_statestimed_waiting",
      "type": [
        "int",
        "null"
      ]
    },
    {
      "name": "jvmthread_statesterminated",
      "type": [
        "int",
        "null"
      ]
    },
    {
      "name": "jvmthread_count",
      "type": "int"
    },
    {
      "name": "jvmdaemon_thread_count",
      "type": "int"
    },
    {
      "name": "jvmfile_descriptor_usage",
      "type": "double"
    },
    {
      "name": "jvmgcruns",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "jvmgctime",
      "type": [
        "long",
        "null"
      ]
    }
  ]
}
json

SiteToSiteProvenanceReportingTask

Publishes Provenance events using the Site To Site protocol.

Tags: provenance, lineage, tracking, site, site to site

Properties

Destination URL

The URL of the destination NiFi instance or, if clustered, a comma-separated list of addresses in the format http(s)://host:port/nifi. This destination URL will only be used to initiate the Site-to-Site connection. The data sent by this reporting task will be load-balanced across all the nodes of the destination (if clustered).

Input Port Name

The name of the Input Port to deliver data to.

SSL Context Service

The SSL Context Service to use when communicating with the destination. If not specified, communications will not be secure.

Instance URL

The URL of this instance to use in the Content URI of each event.

Compress Events

Indicates whether or not to compress the data being sent.

Communications Timeout

Specifies how long to wait for a response from the destination before deciding that an error has occurred and canceling the transaction

Batch Size

Specifies how many records to send in a single batch, at most.

Transport Protocol

Specifies which transport protocol to use for Site-to-Site communication.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Record Writer

Specifies the Controller Service to use for writing out the records.

Include Null Values

Indicates whether null values should be included in records. The default is false

Platform

The value to use for the platform field in each event.

Event Type to Include

Comma-separated list of event types that will be used to filter the provenance events sent by the reporting task. Available event types are [CREATE, RECEIVE, FETCH, SEND, UPLOAD, REMOTE_INVOCATION, DOWNLOAD, DROP, EXPIRE, FORK, JOIN, CLONE, CONTENT_MODIFIED, ATTRIBUTES_MODIFIED, ROUTE, ADDINFO, REPLAY, UNKNOWN]. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative.

Event Type to Exclude

Comma-separated list of event types that will be used to exclude the provenance events sent by the reporting task. Available event types are [CREATE, RECEIVE, FETCH, SEND, UPLOAD, REMOTE_INVOCATION, DOWNLOAD, DROP, EXPIRE, FORK, JOIN, CLONE, CONTENT_MODIFIED, ATTRIBUTES_MODIFIED, ROUTE, ADDINFO, REPLAY, UNKNOWN]. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative. If an event type is included in Event Type to Include and excluded here, then the exclusion takes precedence and the event will not be sent.

Component Type to Include

Regular expression to filter the provenance events based on the component type. Only the events matching the regular expression will be sent. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative.

Component Type to Exclude

Regular expression to exclude the provenance events based on the component type. The events matching the regular expression will not be sent. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative. If a component type is included in Component Type to Include and excluded here, then the exclusion takes precedence and the event will not be sent.

Component ID to Include

Comma-separated list of component UUIDs that will be used to filter the provenance events sent by the reporting task. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative.

Component ID to Exclude

Comma-separated list of component UUIDs that will be used to exclude the provenance events sent by the reporting task. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative. If a component UUID is included in Component ID to Include and excluded here, then the exclusion takes precedence and the event will not be sent.

Component Name to Include

Regular expression to filter the provenance events based on the component name. Only the events matching the regular expression will be sent. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative.

Component Name to Exclude

Regular expression to exclude the provenance events based on the component name. The events matching the regular expression will not be sent. If no filter is set, all the events are sent. If multiple filters are set, the filters are cumulative. If a component name is included in Component Name to Include and excluded here, then the exclusion takes precedence and the event will not be sent.

Start Position

If the Reporting Task has never been run, or if its state has been reset by a user, specifies where in the stream of Provenance Events the Reporting Task should start

Stateful

Scope: Local

Stores the Reporting Task’s last event Id so that on restart the task knows where it left off.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Additional Details

The Site-to-Site Provenance Reporting Task allows the user to publish all the Provenance Events from a NiFi instance back to the same NiFi instance or another NiFi instance. This provides a great deal of power because it allows the user to make use of all the different Processors that are available in NiFi in order to process or distribute that data. When possible, it is advisable to send the Provenance data to a different NiFi instance than the one that this Reporting Task is running on, because when the data is received over Site-to-Site and processed, that in and of itself will generate Provenance events. As a result, there is a cycle that is created. However, the data is sent in batches (1,000 by default). This means that for each batch of Provenance events that are sent back to NiFi, the receiving NiFi will have to generate only a single event per component.

By default, when published to a NiFi instance, the Provenance data is sent as a JSON array. However, the user can define a Record Writer and directly specify the output format and data with the assumption that the input schema is defined as follows:

{
  "type": "record",
  "name": "provenance",
  "namespace": "provenance",
  "fields": [
    {
      "name": "eventId",
      "type": "string"
    },
    {
      "name": "eventOrdinal",
      "type": "long"
    },
    {
      "name": "eventType",
      "type": "string"
    },
    {
      "name": "timestampMillis",
      "type": "long"
    },
    {
      "name": "durationMillis",
      "type": "long"
    },
    {
      "name": "lineageStart",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "details",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "componentId",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "componentType",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "componentName",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "processGroupId",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "processGroupName",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "entityId",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "entityType",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "entitySize",
      "type": [
        "null",
        "long"
      ]
    },
    {
      "name": "previousEntitySize",
      "type": [
        "null",
        "long"
      ]
    },
    {
      "name": "updatedAttributes",
      "type": {
        "type": "map",
        "values": "string"
      }
    },
    {
      "name": "previousAttributes",
      "type": {
        "type": "map",
        "values": "string"
      }
    },
    {
      "name": "actorHostname",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "contentURI",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "previousContentURI",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "parentIds",
      "type": {
        "type": "array",
        "items": "string"
      }
    },
    {
      "name": "childIds",
      "type": {
        "type": "array",
        "items": "string"
      }
    },
    {
      "name": "platform",
      "type": "string"
    },
    {
      "name": "application",
      "type": "string"
    },
    {
      "name": "remoteIdentifier",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "alternateIdentifier",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "transitUri",
      "type": [
        "null",
        "string"
      ]
    }
  ]
}
json

SiteToSiteStatusReportingTask

Publishes Status events using the Site To Site protocol. The component type and name filter regexes are applied in combination: only components matching both regexes will be reported. However, all process groups are recursively searched for matching components, regardless of whether the process group itself matches the component filters.

Tags: status, metrics, history, site, site to site

Properties

Destination URL

The URL of the destination NiFi instance or, if clustered, a comma-separated list of addresses in the format http(s)://host:port/nifi. This destination URL will only be used to initiate the Site-to-Site connection. The data sent by this reporting task will be load-balanced across all the nodes of the destination (if clustered).

Input Port Name

The name of the Input Port to deliver data to.

SSL Context Service

The SSL Context Service to use when communicating with the destination. If not specified, communications will not be secure.

Instance URL

The URL of this instance to use in the Content URI of each event.

Compress Events

Indicates whether or not to compress the data being sent.

Communications Timeout

Specifies how long to wait for a response from the destination before deciding that an error has occurred and canceling the transaction

Batch Size

Specifies how many records to send in a single batch, at most.

Transport Protocol

Specifies which transport protocol to use for Site-to-Site communication.

Proxy Configuration Service

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Record Writer

Specifies the Controller Service to use for writing out the records.

Include Null Values

Indicates whether null values should be included in records. The default is false

Platform

The value to use for the platform field in each status record.

Component Type Filter Regex

A regex specifying which component types to report. Any component type matching this regex will be included. Component types are: Processor, RootProcessGroup, ProcessGroup, RemoteProcessGroup, Connection, InputPort, OutputPort

Component Name Filter Regex

A regex specifying which component names to report. Any component name matching this regex will be included.

Additional Details

The Site-to-Site Status Reporting Task allows the user to publish Status events using the Site To Site protocol. The component type and name filter regexes are applied in combination: only components matching both regexes will be reported. However, all process groups are recursively searched for matching components, regardless of whether the process group itself matches the component filters.

Record writer

The user can define a Record Writer and directly specify the output format and data with the assumption that the input schema is the following:

{
  "type": "record",
  "name": "status",
  "namespace": "status",
  "fields": [
    {
      "name": "statusId",
      "type": "string"
    },
    {
      "name": "timestampMillis",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "timestamp",
      "type": "string"
    },
    {
      "name": "actorHostname",
      "type": "string"
    },
    {
      "name": "componentType",
      "type": "string"
    },
    {
      "name": "componentName",
      "type": "string"
    },
    {
      "name": "parentId",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "parentName",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "parentPath",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "platform",
      "type": "string"
    },
    {
      "name": "application",
      "type": "string"
    },
    {
      "name": "componentId",
      "type": "string"
    },
    {
      "name": "activeThreadCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "flowFilesReceived",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "flowFilesSent",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "bytesReceived",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "bytesSent",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "queuedCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "bytesRead",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "bytesWritten",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "terminatedThreadCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "runStatus",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "bytesTransferred",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "flowFilesTransferred",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "inputContentSize",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "outputContentSize",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "queuedContentSize",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "versionedFlowState",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "activeRemotePortCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "inactiveRemotePortCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "receivedContentSize",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "receivedCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "sentContentSize",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "sentCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "averageLineageDuration",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "transmissionStatus",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "targetURI",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "inputBytes",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "inputCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "outputBytes",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "outputCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "transmitting",
      "type": [
        "boolean",
        "null"
      ]
    },
    {
      "name": "sourceId",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "sourceName",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "destinationId",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "destinationName",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "maxQueuedBytes",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "maxQueuedCount",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "queuedBytes",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "backPressureBytesThreshold",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "backPressureObjectThreshold",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "backPressureDataSizeThreshold",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "isBackPressureEnabled",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "processorType",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "averageLineageDurationMS",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "flowFilesRemoved",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "invocations",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "processingNanos",
      "type": [
        "long",
        "null"
      ]
    },
    {
      "name": "executionNode",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "counters",
      "type": [
        "null",
        {
          "type": "map",
          "values": "string"
        }
      ]
    }
  ]
}
json

Parameter Providers

AwsSecretsManagerParameterProvider

Fetches parameters from AWS SecretsManager. Each secret becomes a Parameter group, which can map to a Parameter Context, with key/value pairs in the secret mapping to Parameters in the group.

Tags: aws, secretsmanager, secrets, manager

Properties

Secret Listing Strategy

Strategy to use for listing secrets.

Secret Name Pattern

A Regular Expression matching on Secret Name that identifies Secrets whose parameters should be fetched. Any secrets whose names do not match this pattern will not be fetched.

Secret Names

Comma-separated list of secret names to fetch.

Region

AWS Credentials Provider Service

Service used to obtain an Amazon Web Services Credentials Provider

Communications Timeout

SSL Context Service

Specifies an optional SSL Context Service that, if provided, will be used to create connections

Additional Details

Mapping AWS Secrets to Parameter Contexts

The AwsSecretsManagerParameterProvider maps a Secret to a Parameter Context, with key/value pairs in the Secret mapping to parameters. To create a compatible secret from the AWS Console:

  1. From the Secrets Manager service, click the “Store a new Secret” button

  2. Select “Other type of secret”

  3. Under “Key/value”, enter your parameters, with the parameter names being the keys and the parameter values being the values. Click Next.

  4. Enter the Secret name. This will determine which Parameter Context receives the parameters. Continue through the rest of the wizard and finally click the “Store” button.

Alternatively, from the command line, run a command like the following:

aws secretsmanager create-secret --name "[Context]" --secret-string '{ "[Param]": "[secretValue]", "[Param2]": "[secretValue2]" }'

In this example, [Context] should be the intended name of the Parameter Context, [Param] and [Param2] should be parameter names, and [secretValue] and [secretValue2] should be the values of each respective parameter.

Configuring the Parameter Provider

AWS Secrets must be explicitly matched in the “Secret Name Pattern” property in order for them to be fetched. This prevents more than the intended Secrets from being pulled into NiFi.

AzureKeyVaultSecretsParameterProvider

Fetches parameters from Azure Key Vault Secrets. Each secret becomes a Parameter, which can be mapped to a Parameter Group by adding a secret tag named 'group-name'.

Tags: azure, keyvault, key, vault, secrets

Properties

Azure Credentials Service

Controller service used to obtain Azure credentials to be used with Key Vault client.

Key Vault URI

Vault URI of the Key Vault that contains the secrets

Group Name Pattern

A Regular Expression matching on the 'group-name' tag value that identifies Secrets whose parameters should be fetched. Any secrets without a 'group-name' tag value that matches this Regex will not be fetched.

Additional Details

Mapping Azure Key Vault Secrets to Parameter Contexts

The AzureKeyVaultSecretsParameterProvider maps a Secret to a Parameter, which can be grouped by adding a “group-name” tag. To create a compatible secret from the Azure Portal:

  1. Go to the “Key Vault” service

  2. Create your own Key Vault

  3. In your own Key Vault, navigate to “Secrets”

  4. Create a secret with the name corresponding to the parameter name, and the value corresponding to the parameter value. Under “Tags”, add a tag with a Key of “group-name” and a value of the intended Parameter Group name.

Alternatively, from the command line, run a command like the following once you have a Key Vault:

az keyvault secret set --vault-name [Vault Name] --name [Parameter Name] --value [Parameter Value] --tags group-name=[Parameter Group Name]

In this example, [Parameter Group Name] should be the intended name of the Parameter Group, [Parameter Name] should be the parameter name, and [Vault Name] should be the name you chose for your Key Vault in Azure.

Configuring the Parameter Provider

Azure Key Vault Secrets must be explicitly matched in the “Group Name Pattern” property in order for them to be fetched. This prevents more than the intended Secrets from being pulled into NiFi.

DatabaseParameterProvider

Fetches parameters from database tables

Tags: database, dbcp, sql

Properties

Database Type

Database Type for generating statements specific to a particular service or vendor. The Generic Type supports most cases but selecting a specific type enables optimal processing or additional features.

Database Dialect Service

Database Dialect Service for generating statements specific to a particular service or vendor.

Database Connection Pooling Service

The Controller Service that is used to obtain a connection to the database.

Parameter Grouping Strategy

The strategy used to group parameters.

Table Name

The name of the database table containing the parameters.

Table Names

A comma-separated list of names of the database tables containing the parameters.

Parameter Name Column

The name of a column containing the parameter name.

Parameter Value Column

The name of a column containing the parameter value.

Parameter Group Name Column

The name of a column containing the name of the parameter group into which the parameter should be mapped.

SQL WHERE clause

An optional SQL 'WHERE' clause by which to filter all results. The 'WHERE' keyword should not be included.

Additional Details

Providing Parameters from a Database

The DatabaseParameterProvider at its core maps database rows to Parameters, specified by a Parameter Name Column and Parameter Value Column. The Parameter Group name must also be accounted for, and may be specified in different ways using the Parameter Grouping Strategy.

Before discussing the actual configuration, note that in some databases, the words ‘PARAMETER’, ‘PARAMETERS’, ‘GROUP’, and even ‘VALUE’ are reserved words. If you choose a column name that is a reserved word in the database you are using, make sure to quote it per the database documentation.

Also note that you should use the preferred table name and column name case for your database. For example, Postgres prefers lowercase table and column names, while Oracle prefers capitalized ones. Choosing the appropriate case can avoid unexpected issues in configuring your DatabaseParameterProvider.

The default configuration uses a fully column-based approach, with the Parameter Group Name also specified by columns in the same table. An example of a table using this configuration would be:

PARAMETER_CONTEXTS

PARAMETER_NAME  PARAMETER_VALUE  PARAMETER_GROUP
param.foo       value-foo        group_1
param.bar       value-bar        group_1
param.one       value-one        group_2
param.two       value-two        group_2

Table 1: Database table example with Grouping Strategy = Column

In order to use the data from this table, set the following Properties:

  • Parameter Grouping Strategy - Column

  • Table Name - PARAMETER_CONTEXTS

  • Parameter Name Column - PARAMETER_NAME

  • Parameter Value Column - PARAMETER_VALUE

  • Parameter Group Name Column - PARAMETER_GROUP

Once fetched, the parameters in this example will look like this:

Parameter Group group_1:

  • param.foo - value-foo

  • param.bar - value-bar

Parameter Group group_2:

  • param.one - value-one

  • param.two - value-two
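
As a minimal sketch (not taken from the original documentation; the column types and lengths are assumptions and should follow your database's conventions), such a table could be created and populated as follows:

-- illustrative schema; adjust types, lengths and quoting for your database
CREATE TABLE PARAMETER_CONTEXTS (
    PARAMETER_NAME VARCHAR(255),
    PARAMETER_VALUE VARCHAR(255),
    PARAMETER_GROUP VARCHAR(255)
);

INSERT INTO PARAMETER_CONTEXTS (PARAMETER_NAME, PARAMETER_VALUE, PARAMETER_GROUP)
VALUES ('param.foo', 'value-foo', 'group_1'),
       ('param.bar', 'value-bar', 'group_1'),
       ('param.one', 'value-one', 'group_2'),
       ('param.two', 'value-two', 'group_2');
sql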

Grouping Strategy

The default Grouping Strategy is by Column, which allows you to specify the parameter Group name explicitly in the Parameter Group Column. Note that if the value in this column is NULL, an exception will be thrown.

The other Grouping Strategy is by Table, which maps each table to a Parameter Group and sets the Parameter Group Name to the table name. In this Grouping Strategy, the Parameter Group Column is not used. An example configuration using this strategy would be:

  • Parameter Grouping Strategy - Table

  • Table Names - KAFKA, S3

  • Parameter Name Column - PARAMETER_NAME

  • Parameter Value Column - PARAMETER_VALUE

An example of some tables that may be used with this strategy:

KAFKA

PARAMETER_NAME  PARAMETER_VALUE
brokers         http://localhost:9092
topic           my-topic
password        my-password

Table 2: ‘KAFKA’ Database table example with Grouping Strategy = Table

S3

PARAMETER_NAME     PARAMETER_VALUE
bucket             my-bucket
secret.access.key  my-key

Table 3: ‘S3’ Database table example with Grouping Strategy = Table

Once fetched, the parameters in this example will look like this:

Parameter Group KAFKA:

  • brokers - http://localhost:9092

  • topic - my-topic

  • password - my-password

Parameter Group S3:

  • bucket - my-bucket

  • secret.access.key - my-key

Filtering rows

If you need to include only some rows in a table as parameters, you can use the ‘SQL WHERE clause’ property. An example of this is as follows:

  • Parameter Grouping Strategy - Table

  • Table Names - KAFKA, S3

  • Parameter Name Column - PARAMETER_NAME

  • Parameter Value Column - PARAMETER_VALUE

  • SQL WHERE clause - OTHER_COLUMN = ‘my-parameters’

Here we are assuming there is another column, ‘OTHER_COLUMN’, in both the KAFKA and S3 tables. Only rows whose ‘OTHER_COLUMN’ value is ‘my-parameters’ will then be fetched from these tables.
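
Conceptually (this is an illustrative sketch, not the exact statement generated by the provider), the configuration above corresponds to a query of the following form being issued against each listed table:

SELECT PARAMETER_NAME, PARAMETER_VALUE FROM KAFKA WHERE OTHER_COLUMN = 'my-parameters'
sql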

EnvironmentVariableParameterProvider

Fetches parameters from environment variables

Tags: environment, variable

Properties

Parameter Group Name

The name of the parameter group that will be fetched. This indicates the name of the Parameter Context that may receive the fetched parameters.

Environment Variable Inclusion Strategy

Indicates how Environment Variables should be included

Include Environment Variables

Specifies environment variable names that should be included from the fetched environment variables.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

GcpSecretManagerParameterProvider

Fetches parameters from GCP Secret Manager. Each secret becomes a Parameter, which can be mapped to a Parameter Group by adding a GCP label named 'group-name'.

Tags: gcp, secret, manager

Properties

Group Name Pattern

A Regular Expression matching on the 'group-name' label value that identifies Secrets whose parameters should be fetched. Any secrets without a 'group-name' label value that matches this Regex will not be fetched.

Project ID

Google Cloud Project ID

GCP Credentials Provider Service

The Controller Service used to obtain Google Cloud Platform credentials.

Additional Details

Mapping GCP Secrets to Parameter Contexts

The GcpSecretManagerParameterProvider maps a Secret to a Parameter, which can be grouped by adding a “group-name” label. To create a compatible secret from the GCP Console:

  1. From the Secret Manager service, click the “Create Secret” button

  2. Enter the Secret name. This is the name of a parameter. Enter a value.

  3. Under “Labels”, add a label with a Key of “group-name” and a value of the intended Parameter Group name.

Alternatively, from the command line, run a command like the following:

printf "[Parameter Value]" | gcloud secrets create --labels=group-name="[Parameter Group Name]" "[Parameter Name]" --data-file=-

In this example, [Parameter Group Name] should be the intended name of the Parameter Group, [Parameter Name] should be the parameter name, and [Parameter Value] should be the value of the parameter.

Configuring the Parameter Provider

GCP Secrets must be explicitly matched in the “Group Name Pattern” property in order for them to be fetched. This prevents more than the intended Secrets from being pulled into NiFi.

HashiCorpVaultParameterProvider

Provides parameters from HashiCorp Vault Key/Value Version 1 and Version 2 Secrets. Each Secret represents a parameter group, which will map to a Parameter Context. The keys and values in the Secret map to Parameters.

Tags: hashicorp, vault, secret

Properties

HashiCorp Vault Client Service

The service used to interact with HashiCorp Vault

Key/Value Path

The HashiCorp Vault path to the Key/Value Secrets Engine

Key/Value Version

The version of the Key/Value Secrets Engine

Secret Name Pattern

A Regular Expression indicating which Secrets to include as parameter groups to map to Parameter Contexts by name.

KubernetesSecretParameterProvider

Fetches parameters from files, in the format provided by Kubernetes mounted secrets. Parameter groups are indicated by a set of directories, and files within the directories map to parameter names. The content of the file becomes the parameter value. Since Kubernetes mounted Secrets are base64-encoded, the parameter provider defaults to Base64-decoding the value of the parameter from the file.

Tags: file

Properties

Parameter Group Directories

A comma-separated list of directory absolute paths that will map to named parameter groups. Each directory that contains files will map to a parameter group, named after the innermost directory in the path. Files inside the directory will map to parameter names, whose values are the content of each respective file.

Parameter Value Byte Limit

The maximum byte size of a parameter value. Since parameter values are pulled from the contents of files, this is a safeguard that can prevent memory issues if large files are included.

Parameter Value Encoding

Indicates how parameter values are encoded inside Parameter files.

Restricted

Making changes to this component requires admin rights. Non-admins can still schedule this component.

Additional Details

Deriving Parameters from mounted Kubernetes Secret files

The KubernetesSecretParameterProvider maps a directory to a parameter group named after the directory, and the files within the directory to parameters. Each file’s name is mapped to a parameter, and the content of the file becomes the value. Hidden files and nested directories are ignored.

While this provider can be useful in a range of cases since it simply reads parameter values from local files, it particularly matches the mounted volume secret structure in Kubernetes. A full discussion of Kubernetes secrets is beyond the scope of this document, but a brief overview can illustrate how these secrets can be mapped to parameter groups.

Kubernetes Mounted Secrets Example

Assume a secret is configured as follows:

data:
  admin_username: my-username (base64-encoded)
  admin_password: my-password (base64-encoded)
  access_key: my-key (base64-encoded)
yml

Assume a deployment has the following configuration:

spec:
  volumes:
    - name: system-credentials
      secret:
        secretName: system-creds
        items:
          - key: admin_username
            path: sys.admin.username
          - key: admin_password
            path: sys.admin.password
          - key: access_key
            path: sys.access.key
  containers:
    - volumeMounts:
        - mountPath: /etc/secrets/system-credentials
          name: system-credentials
          readOnly: true
yml

Then, this secret will appear on disk as follows:

$ ls /etc/secrets/system-credentials
sys.access.key sys.admin.password sys.admin.username

Therefore, to map this secret to a parameter group that will populate a Parameter Context named ‘system-credentials’, you should simply provide the following configuration to the KubernetesSecretParameterProvider:

  • Parameter Group Directories - /etc/secrets/system-credentials

The ‘system-credentials’ parameter context will then contain the following parameters:

  • sys.access.key - my-key

  • sys.admin.username - my-username

  • sys.admin.password - my-password

OnePasswordParameterProvider

Fetches parameters from 1Password Connect Server

Tags: 1Password

Properties

Web Client Service Provider

Controller service for HTTP client operations.

Connect Server

HTTP endpoint of the 1Password Connect Server to connect to. Example: http://localhost:8080

Access Token

Access Token used for authentication against the 1Password APIs.

Flow Analysis Rules

DisallowComponentType

Produces rule violations for each component (i.e. processors or controller services) of a given type.

Tags: component, processor, controller service, type

Properties

Component Type

Components of the given type will produce a rule violation (i.e. they shouldn’t exist). Either the simple or the fully qualified name of the type should be provided.

RestrictBackpressureSettings

This rule will generate a violation if backpressure settings of a connection exceed configured thresholds. Improper configuration of backpressure settings can lead to decreased performance because of excessive swapping and can fill up the content repository with too much in-flight data.

Tags: connection, backpressure

Properties

Minimum Backpressure Object Count Threshold

This is the minimum value that should be set for the Object Count backpressure setting on connections. This can be used to prevent a user from setting a value of 0 which disables backpressure based on count.

Maximum Backpressure Object Count Threshold

This is the maximum value that should be set for the Object Count backpressure setting on connections. This can be used to prevent a user from setting a very high value that may be leading to a lot of swapping.

Minimum Backpressure Data Size Threshold

This is the minimum value that should be set for the Data Size backpressure setting on connections. This can be used to prevent a user from setting a value of 0 which disables backpressure based on size.

Maximum Backpressure Data Size Threshold

This is the maximum value that should be set for the Data Size backpressure setting on connections. This can be used to prevent a user from setting a very high value that may be filling up the content repo.

SystemSafeguard

This rule prevents some unfavourable and sometimes system-endangering configurations, such as an InvokeHTTP processor with an auto-terminated ‘Response’ relationship and a run schedule of ‘0’.

Tags: component, processor, controller service, type

Properties