Processing Fixed Width and Complex Files
Pointers
The first decision you will have to make is whether the file is structured at all. If it is a known type like CSV, JSON, Avro, XML, or Parquet, just use the record-based processors with the matching Record Reader.
If it is semi-structured, like a log file, GrokReader or ExtractGrok may work.
If it is CSV-like, you may be able to tweak the CSVReader to fit (say header or no header, a different delimiter) or try the other of the two CSV parsers NiFi ships with (Jackson or Apache Commons).
If it is a document format like PDF, Word, Excel, or RTF, I have a custom processor that uses Apache Tika that should be able to parse it into text. Once it is text you can probably work with it.
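For fixed-width files specifically, when no built-in reader fits, a common fallback is a small script run from ExecuteStreamCommand (or ExecuteScript). Below is a minimal sketch, assuming illustrative field names and column widths (not a real layout), that slices each line at fixed offsets and emits CSV that a downstream CSVReader can consume:

```python
#!/usr/bin/env python
# Hypothetical fixed-width layout: the field names and widths below are
# illustrative assumptions, not a real spec -- adjust them to your file.
import csv
import sys

# (name, width) pairs describing one fixed-width record
FIELDS = [("id", 10), ("name", 25), ("amount", 12), ("date", 8)]

def parse_line(line):
    """Slice one fixed-width line into a list of trimmed field values."""
    values, pos = [], 0
    for _, width in FIELDS:
        values.append(line[pos:pos + width].strip())
        pos += width
    return values

def main():
    # ExecuteStreamCommand pipes the FlowFile content to stdin and replaces
    # the content with whatever the script writes to stdout.
    writer = csv.writer(sys.stdout, lineterminator="\n")
    writer.writerow([name for name, _ in FIELDS])  # header row for CSVReader
    for line in sys.stdin:
        line = line.rstrip("\n")
        if line:
            writer.writerow(parse_line(line))

if __name__ == "__main__":
    main()
```

Point ExecuteStreamCommand at the script and the rewritten content can then flow into ConvertRecord or QueryRecord with a CSVReader.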
Examples
- https://community.cloudera.com/t5/Support-Questions/How-to-parse-w-fixed-width-instead-of-char-delimited/td-p/102597
- https://community.cloudera.com/t5/Support-Questions/Best-way-to-parse-Fixed-width-file-using-Nifi-Kindly-help/m-p/177637
- https://community.cloudera.com/t5/Support-Questions/Split-one-Nifi-flow-file-into-Multiple-flow-file-based-on/td-p/203387
- https://community.cloudera.com/t5/Support-Questions/Splitting-a-Nifi-flowfile-into-multiple-flowfiles/td-p/139930
- https://community.cloudera.com/t5/Support-Questions/How-to-ExtractText-from-flow-file-using-Nifi-Processor/td-p/190826
- https://community.cloudera.com/t5/Community-Articles/Running-SQL-on-FlowFiles-using-QueryRecord-Processor-Apache/ta-p/246671
- https://www.datainmotion.dev/2020/12/smart-stocks-with-flank-nifi-kafka.html
- https://www.datainmotion.dev/2021/01/flank-real-time-transit-information-for.html
- https://medium.com/@nlabadie/apache-nifi-netflow-to-syslog-117d46867ae1
- https://medium.com/@nlabadie/apache-nifi-sftp-csv-to-syslog-json-d9da6938defa
- https://medium.com/@nlabadie/apache-nifi-pulling-from-mysql-and-sending-to-syslog-181dd4ae969c
- https://stackoverflow.com/questions/59291548/how-to-use-nifi-extractgrok-properly
Documentation
- AttributesToCSV
- AttributesToJSON
- ConvertExcelToCSVProcessor
- ConvertRecord
- ConvertText
- CSVReader
- EvaluateJsonPath
- EvaluateXPath
- EvaluateXQuery
- ExecuteScript
- ExecuteStreamCommand
- ExtractGrok
- ExtractText
- FlattenJson
- ForkRecord
- GrokReader
- JsonPathReader
- JsonTreeReader
- JoltTransformJSON
- JoltTransformRecord
- LookupAttribute
- LookupRecord
- MergeContent
- MergeRecord
- ModifyBytes
- ParseSyslog*
- PartitionRecord
- QueryRecord
- ReaderLookup
- ReplaceText
- ReplaceTextWithMapping
- ScriptedReader
- ScriptedRecordSink
- ScriptedTransformRecord
- SegmentContent
- SplitContent
- SplitJson
- SplitRecord
- SplitText
- SplitXml
- SyslogReader
- TransformXml
- UnpackContent
- UpdateAttribute
- UpdateRecord
- ValidateCsv
- ValidateRecord
- ValidateXml
Custom Processors
- https://community.cloudera.com/t5/Community-Articles/Parsing-Any-Document-with-Apache-NiFi-1-5-with-Apache-Tika/ta-p/247672
- https://community.cloudera.com/t5/Community-Articles/Creating-HTML-from-PDF-Excel-and-Word-Documents-using-Apache/ta-p/247968
- https://github.com/tspannhw/nifi-extracttext-processor
Helper Projects, SDK, Libraries and Services
- https://tika.apache.org/ - Apache Tika can be integrated as a custom processor or called over REST and run as a separate server/service (see the sketch after this list).
- Cloudera Machine Learning - call a deployed model over REST and let it do the parsing. https://blog.cloudera.com/integrating-machine-learning-models-into-your-big-data-pipelines-in-real-time-with-no-coding/
- REST Service - there may be a service you can run locally or in the cloud that can parse the format; NiFi can call it.
- Python - use ExecuteStreamCommand and have Python, a shell script, or another OS executable do it (the fixed-width sketch under Pointers is one example).
- Spark - try custom Spark with Java, Python or Scala.
- Flink - try custom Flink with Java.
- XSLT
- XPath
- XQuery
- JsonPath
- Json
- https://github.com/AbsaOSS/cobrix
- https://github.com/tspannhw/EverythingApacheNiFi
- You may need to use a cache: https://www.datainmotion.dev/2021/01/flank-using-apache-kudu-as-cache-for.html
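For the Tika-as-a-service option above, here is a minimal sketch of calling a standalone tika-server over REST, assuming one is already running on its default port 9998 (the URL and file path are just examples):

```python
#!/usr/bin/env python
# Minimal sketch: send a binary document (PDF, Word, Excel, RTF, ...) to a
# locally running Apache Tika server and print the extracted plain text.
# Assumes tika-server is already listening on localhost:9998 (its default).
import sys
import requests

TIKA_URL = "http://localhost:9998/tika"  # default tika-server text endpoint

def extract_text(path):
    """PUT the raw bytes to Tika's /tika endpoint and return plain text."""
    with open(path, "rb") as f:
        resp = requests.put(TIKA_URL, data=f, headers={"Accept": "text/plain"})
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    print(extract_text(sys.argv[1]))
```

The same PUT could also be issued from NiFi itself (for example with InvokeHTTP), keeping the text extraction outside the flow's JVM.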