Skip to main content

Setting output names and rollover / upload policies

ADLS Gen1 Writer,, ADLS Gen2 Writer Azure Blob Writer FileWriter, GCS Writer, HDFS Writer, and S3 Writer support the following options to define the paths and names for output files, when new files are created, and how many files are retained.GCS Writer

Dynamic output names

The blobname string in Azure Blob Writer, the bucketname and objectname strings in GCSWriter and S3Writer, the directory and filename strings in ADLS Gen1 Writer, ADLS Gen2 Writer, File Writer, and HDFSWriter, and the foldername string in AzureBlobWriter, GCS Writer, and S3Writer may include field-name tokens that will be replaced with values from the input stream. For example, if the input stream included yearString, monthString, and dayString fields, directory:'%yearString%/%monthString%/%dayString%' would create a new directory for each date, grouped by month and year subdirectories. If desirable, you may filter these fields from the output using the members property in DSVFormatter or JSONFormatter.

Note: If a bucketname in GCSWriter or S3Writer, directoryname in ADLS Gen1 Writer or ADLS Gen2 Writer, or foldername in S3Writer contains two field-name tokens, such as %field1%%field2%, the second creates a subfolder of the first.

When the target's input is the output of a CDC or DatabaseReader source, values from the WAEvent metadata or userdata map or JSONNodeEvent metadata map may be used in these names using the syntax %@metadata(<field name>)% or %@userdata(<field name>)%, for example, %@metadata(TableName)%. You may combine multiple metadata and/or userdata values, for example, %@metadata(name)%/%@userdata(TableName)%'. You may also mix field, metadata, and userdata values.

For S3Writer bucket names, do not include punctuation between values. Hyphens, the only punctuation allowed, will be added automatically. For more information, see Amazon's Rules for Bucket Naming.

Rollover and upload policies

rolloverpolicy or uploadpolicy trigger output file or blob rollover different ways depending on which parameter you specify:

parameter

rollover trigger

example

eventcount (or size)

specified number of events have been accumulated (in Kinesis Writer, this is specified as a size in bytes rather than a number of events)

 eventcount:10000

filesize

specified file size has been reached (value must be specified in megabytes, the maximum is 2048M)

filesize:1024M

interval

specified time has elapsed (use s for second or m for minute, h for hour)

interval:1h

You may specify both eventcount and interval, in which case rollover is triggered whenever one of the limits is reached. For example, eventcount:10000,interval:1h will start a new file after one hour has passed since the current file was created or after 10,000 events have been written to it, whichever happens first.

Caution

When the rollover policy includes an interval or eventcount, and the file or blob name includes the creation time, be sure that the writer will never receive events so quickly that it will create a second file with the same name, since that may result in lost data. You may work around this by using nanosecond precision in the creation time or by using a sequence number instead of the creation time.

If you drop and re-create the application and there are existing files in the output directory, start will fail with a "file ... already exists in the current directory" error. To retain the existing files in that directory, add a sequencestart parameter to the rollover policy, or add %<time format>%, %epochtime%, or %epochtimems% to the file name.

If file lineage is not enabled, when the writer is is stopped and restarted, the sequence will restart from the beginning, overwriting existing files, so you might want to back up the output directory before restarting.

  filelimit

By default, there is no limit to how many files are retained (filelimit:-1). You may change that by including the filelimit parameter in the rollover policy:

rolloverpolicy:'interval:1h,filelimit:24'

  sequencestart, incrementsequenceby

If you prefer to start with a value greater than 0 or increment by a value greater than 1, add the sequencestart or incrementsequenceby parameters to rolloverpolicy or uploadpolicy, for example:

rolloverpolicy:'eventcount:10000,sequencestart:10,incrementsequenceby:10'

File and blob names

By default, output file or blob names are the base name (specified in filename in ADLS Gen1 Writer, ADLS Gen2 Writer, File Writer, and HDFS Writer, blobname in AzureBlobWriter, or objectname in GCS Writer and S3 Writer) with a counter between the filename and extension. The counter starts at 00 and is incremented by one for each new file. For example, if filename is myfile.txt, the output files will be myfile.00.txtmyfile.01.txt, and so on. If filelimit is set to a positive value, the counter will start over at 00 when it reaches that number, overwriting the existing files. If filelimit is -1, the counter will increment indefinitely (in which case you might want to create a cron job to delete old log files).

You may customize the output file names using the following tokens. They are allowed only in the file name, not in the extension.

%n...%

Use one n for each digit you want in the file name. The sequence number may be placed elsewhere than at the end of the file name by including the %<n...>% token in the writer's filename property value:

filename:'batch_%nnn%_data.csv'

With that setting, rolled-over files will be named batch_000_data.csv, batch_001_data.csv, and so on.

%<time format>%, %epochtime%, %epochtimems%

The formatted file creation time may be included in the file name by including the %<time format>% token in the writer's filename property value:

filename:'batch_%yyyy-MM-dd_HH_mm_ss%_data.csv'

With that setting:

  • a file rolled over at 1:00 pm on June 1, 2016, would be named batch_2016-06-01_13_00_00_data.csv

  • one rolled over five minutes later would be named batch_2016-06-01_13_05_00_data.csv

and so on. You may use any valid Joda DateTimeFormat pattern.

The unformatted epoch file creation time may be included in the file name by using the %epochtime%  token instead. For millisecond resolution, use %epochtimems%.

Examples combining multiple options

The following are a few examples of the many ways these options can be combined.

properties

behavior

  • rolloverpolicy: 'EventCount:5000'

  • directory: '%yearString%/%monthString%/%dayString%'

  • filename: 'output_%yyyy-MM-dd_HH_mm_ss%

  • file is rolled over every 5,000 events

  • files are grouped by day in a year/month/day directory tree

  • file names include the timestamp

  • rolloverpolicy: 'EventCount:5000'

  • directory: '%yearString%/%monthString%/%dayString%'

  • filename: 'output_%nnnnnn%

  • file is rolled over every 5,000 events

  • each day's files are grouped in a directory in a year/month/day directory tree

  • file names include the six-digit sequence number

  • directory: '%tableName%/%City%/'

  • filename: 'output_%nnnnnn%.csv'

  • files are grouped by city in a table_name/city_name directory tree

  • file names include the six-digit sequence number

See also Kafka Writer for discussion of how to write to specific partitions based on field values.