HDFS Writer

Writes to files in the Hadoop Distributed File System (HDFS). 

Warning

If your version of Hadoop does not include the fix for HADOOP-10786, HDFSWriter may terminate due to Kerberos ticket expiration.

To write to MapR-FS, use MapRFSWriter. HDFSWriter and MapRFSWriter use the same properties, with two exceptions: the hadoopurl difference noted below, and the name of the configuration path property (hadoopConfigurationPath for HDFSWriter, MapRDBConfigurationPath for MapRFSWriter).

HDFS Writer properties

property

type

default value

notes

authentication policy

String

If the HDFS cluster uses Kerberos authentication, provide credentials in the format Kerberos, Principal:<Kerberos principal name>, KeytabPath:<fully qualified keytab file name>. Otherwise, leave blank. For example: authenticationpolicy:'Kerberos, Principal:nn/ironman@EXAMPLE.COM, KeytabPath:/etc/security/keytabs/nn.service.keytab'

Directory

String

The full path to the directory in which to write the files. See Setting output names and rollover / upload policies for advanced options.

File Name

String 

The base name of the files to be written. See Setting output names and rollover / upload policies.

flush policy

String

eventcount:10000, interval:30s

If data is not flushed properly with the default setting, you may use this property to specify how many events Striim will accumulate before writing and/or the maximum number of seconds that will elapse between writes. For example:

  • flushpolicy:'eventcount:5000'

  • flushpolicy:'interval:10s'

  • flushpolicy:'interval:10s, eventcount:5000'

Note that changing this setting may significantly degrade performance.

With a setting of 'eventcount:1', each event will be written immediately. This can be useful during development, debugging, testing, and troubleshooting.

hadoopConfigurationPath

String

If using Kerberos authentication, specify the path to Hadoop configuration files such as core-site.xml and hdfs-site.xml. If this path is incorrect or the configuration changes, authentication may fail.

hadoopurl

String

The URI for the HDFS cluster NameNode. See below for an example. The default HDFS NameNode IPC port is 8020 or 9000 (depending on the distribution). Port 50070 is for the web UI and should not be specified here.

For an HDFS cluster with high availability, use the value of the dfs.nameservices property from hdfs-site.xml with the syntax hadoopurl:'hdfs://<value>', for example, hadoopurl:'hdfs://mycluster'. When the current NameNode fails, Striim will automatically connect to the next one.

When using MapRFSWriter, you may start the URL with hdfs:// or maprfs:/// (there is no functional difference).

MapRDBConfigurationPath

String

Used by MapRFSWriter in place of hadoopConfigurationPath. See the notes for hadoopConfigurationPath.

Rollover on DDL

Boolean

True

Has effect only when the input stream is the output stream of a CDC reader source. With the default value of True, rolls over to a new file when a DDL event is received. Set to False to keep writing to the same file.

Rollover Policy

String

interval:60s

See Setting output names and rollover / upload policies.

This adapter has a choice of formatters. See Supported writer-formatter combinations for more information.
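
As a hedged illustration of the hadoopurl and flush policy notes above, the following sketch shows how a target for a high-availability cluster might look. It assumes that dfs.nameservices in hdfs-site.xml is set to mycluster, that the Hadoop configuration files are in /etc/hadoop/conf, and that the cluster does not use Kerberos, so authenticationpolicy is omitted. The target name hdfsHaOutput is hypothetical, and TypedCSVStream is the stream defined in the sample application below. Including hadoopconfigurationpath here is an assumption; the table above documents it only for Kerberos.

```tql
-- Sketch only, not a verified configuration.
-- Assumes dfs.nameservices=mycluster in /etc/hadoop/conf/hdfs-site.xml.
-- hdfsHaOutput is a hypothetical target name.
-- hadoopconfigurationpath is included as an assumption (documented above for Kerberos only).
CREATE TARGET hdfsHaOutput USING HDFSWriter(
  filename:'hdfstestOut.txt',
  directory:'/user/striim/PosAppOutput',
  hadoopurl:'hdfs://mycluster',
  hadoopconfigurationpath:'/etc/hadoop/conf',
  flushpolicy:'interval:10s, eventcount:5000'
)
FORMAT USING DSVFormatter (
)
INPUT FROM TypedCSVStream;
```

Because hadoopurl names the nameservice rather than a single host and port, the same target definition should keep working after a NameNode failover.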

HDFS Writer sample application

The following sample writes some of the PosApp sample data to files with the base name hdfstestOut.txt in the directory /user/striim/PosAppOutput of the specified HDFS instance:

CREATE SOURCE CSVSource USING FileReader (
  directory:'Samples/PosApp/AppData',
  WildCard:'posdata.csv',
  positionByEOF:false
)
PARSE USING DSVParser (
  header:'yes'
)
OUTPUT TO CsvStream;

CREATE TYPE CSVType (
  merchantId String,
  dateTime DateTime,
  hourValue Integer,
  amount Double,
  zip String
);
CREATE STREAM TypedCSVStream OF CSVType;

CREATE CQ CsvToPosData
INSERT INTO TypedCSVStream
SELECT data[1],
  TO_DATEF(data[4],'yyyyMMddHHmmss'),
  DHOURS(TO_DATEF(data[4],'yyyyMMddHHmmss')),
  TO_DOUBLE(data[7]),
  data[9]
FROM CsvStream;

CREATE TARGET hdfsOutput USING HDFSWriter(
  filename:'hdfstestOut.txt',
  hadoopurl:'hdfs://node8057.example.com:8020',
  flushpolicy:'interval:10s,eventcount:5000',
  authenticationpolicy:'Kerberos,Principal:striim/node8057.example.com@STRIIM.COM,KeytabPath:/etc/security/keytabs/striim.service.keytab',
  hadoopconfigurationpath:'/etc/hadoop/conf',
  directory:'/user/striim/PosAppOutput' 
)
FORMAT USING DSVFormatter (
)
INPUT FROM TypedCSVStream;
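
During development it can help to see each event land in HDFS as soon as it is processed. The variant below is a minimal sketch of the target above for a cluster without Kerberos: authenticationpolicy and hadoopconfigurationpath are omitted, and flushpolicy is set to 'eventcount:1' so that each event is written immediately. The target name hdfsDebugOutput and the file name hdfsdebugOut.txt are hypothetical, and the host name is the sample's placeholder.

```tql
-- Sketch only: a debugging variant of the hdfsOutput target above.
-- Assumes a cluster without Kerberos, so authenticationpolicy is left out.
-- hdfsDebugOutput and hdfsdebugOut.txt are hypothetical names.
CREATE TARGET hdfsDebugOutput USING HDFSWriter(
  filename:'hdfsdebugOut.txt',
  hadoopurl:'hdfs://node8057.example.com:8020',
  flushpolicy:'eventcount:1',
  directory:'/user/striim/PosAppOutput'
)
FORMAT USING DSVFormatter (
)
INPUT FROM TypedCSVStream;
```

Writing every event individually is likely to reduce throughput, so a setting like this is best kept to development, testing, and troubleshooting.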