Price Comparisons Using Retail REST APIs with Apache NiFi, Kafka and Flink SQL
MiNiFi Agent Update March 2021
Cloudera Agent Availability
- https://docs.cloudera.com/cem/1.2.2/release-notes/topics/cem-minifi-cpp-agent-updates.html
- https://docs.cloudera.com/cem/1.2.2/release-notes/topics/cem-minifi-cpp-download-locations.html
Getting Started
MiNiFi (C++)
Version cpp-0.9.0
Release Date: 1 March 2021
Highlights of 0.9.0 release include:
- Added support for RocksDB-based content repository for better performance
- Added SQL extension
- Improved task scheduling
- Various C2 improvements
- Bug fixes and improvements to TailFile, ConsumeWindowsEventLog, MergeContent, CompressContent, PublishKafka, InvokeHTTP
- Implemented RetryFlowFile and smart handling of loopback connections
- Added a way to encrypt sensitive config properties and the flow configuration
- Implemented full S3 support
- Reduced memory footprint when working with many flow files
Build Notes:
It is advised that you use bootstrap.sh when not building on Windows.
https://cwiki.apache.org/confluence/display/MINIFI/Release+Notes#ReleaseNotes-Versioncpp-0.9.0
Download Now as Source or Pre-Built for Your Platform
https://nifi.apache.org/minifi/download.html
Kafka Replication with Cloudera Streams Replication Manager
Spring Data JPA Against Big Data Sources
New Features of Apache NiFi 1.13.2
Release Notes: https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.13.0
- ListenFTP
- UpdateHiveTable - applies Hive DDL changes to update the table schema (data drift, i.e. Hive schema migration)
- SampleRecord - different sampling approaches to records (Interval Sampling, Probabilistic Sampling, Reservoir Sampling)
- CDC Updates
- Kudu updates
- AMQP and MQTT Integration Upgrades
- ConsumeMQTT - readers and writers added
- HTTP access to NiFi is now configured by default to accept connections on 127.0.0.1/localhost only. If you want to allow broader HTTP access and you understand the security implications, you can still control that by changing the 'nifi.web.http.host' property in nifi.properties. That said, please take the time to configure proper HTTPS. We offer detailed instructions and tooling to assist.
- The ability to run NiFi with no GUI as MiNiFi/NiFi combined code base continues.
- Support for Kudu Dates (https://kudu.apache.org/releases/1.12.0/docs/release_notes.html)
- Updated GRPC versions
- Apache Calcite update
- PutDatabaseRecord update
- ConsumeMQTT: now with readers
- UpdateAttribute: set record.sink.name to kafka and recordreader.name to json.
- SampleRecord: sample a few of the records
- PutRecord: Use reader and destination service
- UpdateHiveTable: new sink
- [NIFI-7386] - AzureStorageCredentialsControllerService should also connect to storage emulator
- [NIFI-7429] - Add Status History capabilities for system level metrics
- [NIFI-7549] - Adding Hazelcast based implementation for DistributedMapCacheClient
- [NIFI-7624] - Build a ListenFTP processor
- [NIFI-7745] - Add a SampleRecord processor
- [NIFI-7796] - Add Prometheus metrics for total bytes received and bytes sent for components
- [NIFI-7801] - Add acknowledgement check to Splunk
- [NIFI-7821] - Create a Cassandra implementation of DistributedMapCacheClient
- [NIFI-7879] - Create record path function for UUID v5
- [NIFI-7906] - Add graph processor with flexibility to query graph database conditioned on flowfile content and attributes
- [NIFI-7989] - Add Hive "data drift" processor
- [NIFI-8136] - Allow State Management to be tied to Process Session
- [NIFI-8142] - Add "on conflict do nothing" feature to PutDatabaseRecord
- [NIFI-8146] - Allow RecordPath to be used for specifying operation type and data fields when using PutDatabaseRecord
- [NIFI-8175] - Add a WindowsEventLogReader
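The three sampling approaches listed for SampleRecord above (interval, probabilistic and reservoir sampling) can be illustrated with a small Python sketch. This is only a rough illustration of the general techniques, not the processor's actual implementation; the function names and parameters are hypothetical and chosen for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def interval_sample(records: Iterable[T], interval: int) -> List[T]:
    """Keep every `interval`-th record, starting with the first one."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return [rec for i, rec in enumerate(records) if i % interval == 0]


def probabilistic_sample(
    records: Iterable[T], probability: float, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep each record independently with the given probability (0.0 to 1.0)."""
    rng = rng or random.Random()
    return [rec for rec in records if rng.random() < probability]


def reservoir_sample(
    records: Iterable[T], size: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of `size` records from a stream of
    unknown length (classic reservoir sampling, Algorithm R)."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, rec in enumerate(records):
        if i < size:
            # Fill the reservoir with the first `size` records.
            reservoir.append(rec)
        else:
            # Replace a random slot with decreasing probability size/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < size:
                reservoir[j] = rec
    return reservoir
```

Interval sampling is deterministic, probabilistic sampling returns a varying number of records, and reservoir sampling always returns at most `size` records no matter how long the input is.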
Cloudera Flow Management on DataHub Public Cloud
Cloudera Edge Manager 1.2.2 Release
February 15, 2021
CEM MiNiFi C++ Agent - 1.21.01
Release includes:
- Support for JSON output in the ConsumeWindowsEventLog processor
- Full Expression Language support on Windows
- Full S3 support (List, Fetch, Get, Put)
Drivers to use with NiFi
Cloudera JDBC 2.6.20 driver for Apache Impala.
- [IMPJ-601] Updated third-party libraries
- The JDBC 4.1 driver has been updated to use the following libraries:
- log4j 2.2.1
- slf4j 1.7.30
- [IMPJ-607] Updated CDP support
- The driver now supports CDP 7.1
- For a list of supported CDP versions, see the Installation and Configuration Guide
- https://www.cloudera.com/downloads/connectors/impala/jdbc/2-6-20.html
- https://docs.cloudera.com/documentation/other/connectors/impala-jdbc/2-6-20.html
- https://www.datainmotion.dev/2021/02/ingest-into-cloud.html
- https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-dbcp-service-nar/1.13.0/org.apache.nifi.dbcp.DBCPConnectionPool/index.html
- https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_15.html
- https://dev.to/tspannhw/read-apache-impala-apache-kudu-tables-and-send-to-apache-kafka-in-bulk-easily-with-apache-nifi-4n3g
- https://www.datainmotion.dev/2019/04/oracle-golden-gate-to-apache-kafka-to.html
Ingest Into the Cloud
Other resources:
- https://www.cloudera.com/tutorials/collecting-data-with-cdp-public-cloud.html
- https://github.com/tspannhw/ClouderaPublicCloudCDFWorkshop
- https://github.com/tspannhw/EverythingApacheNiFi
- https://www.datainmotion.dev/2020/10/top-25-use-cases-of-cloudera-flow.html
- https://www.datainmotion.dev/2020/12/new-release-for-hdf-352-and-cloudera.html
- https://www.datainmotion.dev/2020/04/streaming-data-with-cloudera-data-flow.html
- https://www.datainmotion.dev/2020/07/using-cloudera-data-platform-with-flow.html
- https://github.com/tspannhw/cdp-datahub-azure-nifikafka
- https://github.com/tspannhw/ClouderaNow2020
Using Apache NiFi in OpenShift and Anywhere Else to Act as Your Global Integration Gateway
What does it look like?
Where Can I Run This Magic Engine:
Private Cloud, Public Cloud, Hybrid Cloud, VM, Bare Metal, Single Node, Laptop, Raspberry Pi: anywhere you have 1 GB of RAM and some CPU is a good place to run a powerful graphical integration and dataflow engine. You can also run MiNiFi C++ or Java agents if you want it even smaller.
Sounds Too Powerful and Expensive:
Apache NiFi is Open Source and can be run freely anywhere.
For What Use Cases:
Microservices, Images, Deep Learning and Machine Learning Models, Structured Data, Unstructured Data, NLP, Sentiment Analysis, Semistructured Data, Hive, Hadoop, MongoDB, ElasticSearch, SOLR, ETL/ELT, MySQL CDC, MySQL Insert/Update/Delete/Query, Hosting Unlimited REST Services, Interactive with Websockets, Ingesting Any REST API, Natively Converting JSON/XML/CSV/TSV/Logs/Avro/Parquet, Excel, PDF, Word Documents, Syslog, Kafka, JMS, MQTT, TCP/IP, UDP, FTP, sFTP, Files, Directories, Google Forms, Object Stores, NoSQL, Lookups, Hosting Web sites, Updates and live SQL on data streams.
MySQL/REST/MQTT/JMS/REST/Files/S3/Object Stores. You also have an expert available on NiFi here. https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_9.html.
This makes these tasks much easier to develop, deploy, manage and control. A single Data Engineer can now build, deploy and manage thousands of data streams in batch, microbatch and streams.
- https://www.datainmotion.dev/2020/12/simple-change-data-capture-cdc-with-sql.html
- https://www.datainmotion.dev/2021/01/flank-real-time-transit-information-for.html
- https://www.datainmotion.dev/2021/01/flank-using-apache-kudu-as-cache-for.html
- https://www.datainmotion.dev/2020/12/ingesting-websocket-data-for-live-stock.html
- https://www.datainmotion.dev/2020/12/smart-stocks-with-flank-nifi-kafka.html?es_id=1fb9486166
- https://dzone.com/articles/lets-build-a-simple-ingest-to-cloud-data-warehouse
- https://dzone.com/articles/real-time-streaming-deep-learning-pipelines-with-d
- https://www.datainmotion.dev/2020/10/automating-building-migration-backup.html
- https://www.datainmotion.dev/2020/10/tracking-satellites-with-apache-nifi.html
- https://www.datainmotion.dev/2020/07/ingesting-all-weather-data-with-apache.html
- https://dev.to/tspannhw/ingesting-all-the-weather-data-with-apache-nifi-2ho4
How about version control?
NiFi Registry provides easy-to-integrate version control with a full REST API and can export your flows to a Git repository such as GitHub.
- https://www.datainmotion.dev/2019/11/nifi-toolkit-cli-for-nifi-110.html
- https://pierrevillard.com/2018/04/09/automate-workflow-deployment-in-apache-nifi-with-the-nifi-registry/
- https://dzone.com/articles/devops-for-apache-nifi-17-and-more
- https://community.cloudera.com/t5/Community-Articles/Big-Data-DevOps-Apache-NiFi-Flow-Versioning-and-Automation/ta-p/247976
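As a rough sketch of what talking to the Registry REST API can look like, the Python below lists buckets from an unsecured NiFi Registry. The host and port in the usage note are assumptions, the helper names are hypothetical, and a secured Registry would need authentication that is not shown here.

```python
import json
from typing import Dict
from urllib.parse import quote
from urllib.request import urlopen


def registry_url(base_url: str, *path_parts: str) -> str:
    """Join a Registry base URL with the REST API root and path segments."""
    parts = "/".join(quote(p, safe="") for p in path_parts)
    return f"{base_url.rstrip('/')}/nifi-registry-api/{parts}"


def bucket_names(buckets_json: str) -> Dict[str, str]:
    """Map bucket name -> bucket identifier from a /buckets response body."""
    return {b["name"]: b["identifier"] for b in json.loads(buckets_json)}


def list_buckets(base_url: str) -> Dict[str, str]:
    """Fetch all buckets from an unsecured NiFi Registry (no auth handling)."""
    with urlopen(registry_url(base_url, "buckets"), timeout=10) as resp:
        return bucket_names(resp.read().decode("utf-8"))
```

For example, `list_buckets("http://localhost:18080")` would return a dictionary of bucket names and ids, assuming a Registry is listening there.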
DevOps?
- https://www.datainmotion.dev/2021/01/automating-starting-services-in-apache.html
- https://www.datainmotion.dev/2020/09/devops-working-with-parameter-contexts.html
- https://www.datainmotion.dev/2019/11/nifi-toolkit-cli-for-nifi-110.html
How about deployment?
Apache NiFi can run anywhere!
You can run Apache NiFi on a single VM or localhost or laptop:
https://nifi.apache.org/download.html
On OpenShift
https://catalog.redhat.com/software/containers/cdt-common-rns/nifi/6026bb6c2937380b51711b73
https://github.com/rromannissen/nifi-openshift
Apache NiFi Stateless can run all in RAM, one event at a time like a Job or Function as a Service:
https://github.com/SamHjelmfelt/OpenWhisk-YarnDeployment
Docker
https://hub.docker.com/r/apache/nifi
What if I don't like easy-to-use web UIs?
You can code everything with REST calls:
https://nifi.apache.org/docs/nifi-docs/rest-api/index.html
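A minimal sketch of one such REST call, assuming an unsecured NiFi instance reachable over plain HTTP: it asks the `/nifi-api/flow/about` endpoint for the running version. The helper names are hypothetical, and a secured cluster would need a token or client certificate that this sketch does not handle.

```python
import json
from urllib.request import urlopen


def about_url(base_url: str) -> str:
    """Build the URL of the 'about' endpoint for a NiFi base URL."""
    return f"{base_url.rstrip('/')}/nifi-api/flow/about"


def parse_version(about_json: str) -> str:
    """Extract the NiFi version string from an 'about' response body."""
    return json.loads(about_json)["about"]["version"]


def nifi_version(base_url: str) -> str:
    """Ask a running, unsecured NiFi instance which version it is."""
    with urlopen(about_url(base_url), timeout=10) as resp:
        return parse_version(resp.read().decode("utf-8"))
```

Calling `nifi_version("http://edge2ai-1.dim.local:8080")` would return the version string reported by that server, if it is reachable.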
Okay, maybe not that low-level, what about a CLI?
You can run and install it here: https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#nifi_CLI.
Can I get more information:
- https://github.com/tspannhw/EverythingApacheNiFi
- https://nifi.apache.org/docs.html
- https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
- https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html
- https://www.youtube.com/watch?v=RjWstt7nRVY
- https://nifi.apache.org/docs/nifi-docs/html/walkthroughs.html
- https://nifi.apache.org/docs/nifi-docs/html/overview.html
- https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
- https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
- https://nifi.apache.org/docs/nifi-docs/html/record-path-guide.html
- https://nifi.apache.org/docs/nifi-docs/rest-api/index.html
- https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
- https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html
- https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
I need support:
Cloudera is a partner of Amazon, Google, Microsoft, Oracle, IBM and thousands more, and you can trust it for enterprise cloud hosting, support and development. Cloudera employs a majority of the developers working on open source Apache NiFi.
https://www.cloudera.com/products/cdf.html
Automating Starting Services in Apache NiFi and Applying Parameters
Automate all the things! You can call these commands interactively or script all of them with awesome DevOps tools. Andre and Dan can tell you more about that.
Enable All NiFi Services on the Canvas
By running the enable command three times, I catch any stubborn services or ones that needed something else to be running first. This could be put into a loop that checks the status before trying again.
nifi pg-list
nifi pg-status
nifi pg-get-services
The NiFi CLI has interactive help available and also some good documentation:
https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#nifi_CLI
/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi pg-enable-services -u http://edge2ai-1.dim.local:8080 --processGroupId root
/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi pg-enable-services -u http://edge2ai-1.dim.local:8080 --processGroupId root
/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi pg-enable-services -u http://edge2ai-1.dim.local:8080 --processGroupId root
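The retry loop mentioned above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested deployment script: the cli.sh path and URL are copied from the commands above, while `enable_with_retry`, its `all_enabled` callback and the delay are hypothetical names introduced for illustration. The status check is left as a callback because the output format of `nifi pg-get-services` is not shown here.

```python
import subprocess
import time
from typing import Callable, List

# Values copied from the commands above; adjust for your environment.
CLI = "/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh"
BASE_URL = "http://edge2ai-1.dim.local:8080"


def enable_services_cmd(pg_id: str = "root") -> List[str]:
    """Build the pg-enable-services command line as an argument list."""
    return [CLI, "nifi", "pg-enable-services", "-u", BASE_URL,
            "--processGroupId", pg_id]


def run_cli(cmd: List[str]) -> int:
    """Run one CLI command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def enable_with_retry(
    run: Callable[[List[str]], int] = run_cli,
    all_enabled: Callable[[], bool] = lambda: False,  # hypothetical status check
    attempts: int = 3,
    delay_seconds: float = 5.0,
) -> bool:
    """Run pg-enable-services up to `attempts` times.

    Returns True as soon as `all_enabled()` reports success, otherwise
    False after the last attempt. With the default callback this simply
    runs the command three times, like the commands above.
    """
    for attempt in range(1, attempts + 1):
        run(enable_services_cmd())
        if all_enabled():
            return True
        if attempt < attempts:
            time.sleep(delay_seconds)  # give services time to come up
    return False
```

A real `all_enabled` callback would have to inspect the services, for example through `nifi pg-get-services` or the REST API.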
We could then start a process group if we wanted:
nifi pg-start -u http://edge2ai-1.dim.local:8080 -pgid 2c1860b3-7f21-36f4-a0b8-b415c652fc62
List all process groups
/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi pg-list -u http://edge2ai-1.dim.local:8080
List Parameters
/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi list-param-contexts -u http://edge2ai-1.dim.local:8080 -verbose
Set the parameter context for a process group; you can loop to do this for all of them.
- pgid => process group id
- pcid => parameter context id
I need to put this in a shell or Python script:
/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi pg-set-param-context -u http://edge2ai-1.dim.local:8080 -verbose -pgid 2c1860b3-7f21-36f4-a0b8-b415c652fc62 -pcid 39f0f296-0177-1000-ffff-ffffdccb6d90
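One way the Python version might look is sketched below. It assumes you already know which parameter context belongs to which process group; the `assignments` mapping and the function names are hypothetical, while the CLI path, URL and flags are taken from the command above.

```python
import subprocess
from typing import Callable, Dict, List

# Values copied from the command above; adjust for your environment.
CLI = "/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh"
BASE_URL = "http://edge2ai-1.dim.local:8080"


def set_param_context_cmd(pg_id: str, pc_id: str) -> List[str]:
    """Build the pg-set-param-context command line as an argument list."""
    return [CLI, "nifi", "pg-set-param-context", "-u", BASE_URL,
            "-verbose", "-pgid", pg_id, "-pcid", pc_id]


def apply_param_contexts(
    assignments: Dict[str, str],
    run: Callable[..., object] = subprocess.run,
) -> List[List[str]]:
    """Set a parameter context on each process group in `assignments`.

    `assignments` maps process group id -> parameter context id.
    Returns the commands that were run, in order.
    """
    commands = []
    for pg_id, pc_id in assignments.items():
        cmd = set_param_context_cmd(pg_id, pc_id)
        run(cmd, check=True)  # raises if the CLI exits with an error
        commands.append(cmd)
    return commands
```

The ids for the mapping could come from `nifi pg-list` and `nifi list-param-contexts`, as shown earlier.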
Example
https://github.com/tspannhw/ApacheConAtHome2020/blob/main/scripts/setupnifi.sh
You could also use the NiFi REST API or Dan's awesome Python API (https://nipyapi.readthedocs.io/en/latest/).
References
- https://www.datainmotion.dev/2020/09/devops-working-with-parameter-contexts.html
- https://www.datainmotion.dev/2019/11/nifi-toolkit-cli-for-nifi-110.html
- https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
- https://www.datainmotion.dev/2020/07/report-on-this-apache-nifi-1114-monitor.html
- https://github.com/tspannhw/EverythingApacheNiFi
- https://www.datainmotion.dev/2020/12/cloudera-data-platform-using-apache.html
- https://www.datainmotion.dev/2020/03/using-nifi-cli-to-restore-nifi-flows.html
- https://www.datainmotion.dev/2020/10/automating-building-migration-backup.html
- https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.5.2/nifi-toolkit/content/nifi_cli.html
- https://levelup.gitconnected.com/an-overview-of-apache-nifi-and-toolkit-cli-deployments-785978dbce3b
- https://pierrevillard.com/2018/04/09/automate-workflow-deployment-in-apache-nifi-with-the-nifi-registry/
- https://dzone.com/articles/devops-for-apache-nifi-17-and-more
Migrating from Apache Storm to Apache Flink
The first thing to do is not simply to pick everything up and dump it onto a new system, but to see what can be reconfigured, refactored or reimagined. For some routing, transformation or simple ingest applications or solution parts, you may want to use Apache NiFi.
For others, Spark or Spark Streaming can quickly meet your needs. For simple Thing to Kafka or Kafka to Thing flows, a flow with Kafka Connect is appropriate. For things that need to run in individual devices, containers or pods, you may want to move a small application to NiFi Stateless. Sometimes a simple Kafka Streams application will meet your needs.
For many use cases you can replace a compiled application with some solid Flink SQL code. For some discussions, check this out.
For some really good information on how to migrate Storm solutions to Flink, Cloudera has a well-documented solution for you:
Conceptual
https://docs.cloudera.com/csa/1.2.0/stormflink-migration/topics/csa-stormflink-concept.html
Architecture
https://docs.cloudera.com/csa/1.2.0/stormflink-migration/topics/csa-stormflink-architecture.html
Redistribution
https://docs.cloudera.com/csa/1.2.0/stormflink-migration/topics/csa-stormflink-redistribution.html
References
- https://docs.cloudera.com/csa/1.2.0/development/topics/csa-application-logic.html
- https://www.datainmotion.dev/2020/12/smart-stocks-with-flank-nifi-kafka.html
- https://www.datainmotion.dev/2019/11/introducing-mm-flank-apache-flink-stack.html
- https://www.datainmotion.dev/2020/07/flank-in-cloud-huge-cloudera-data.html
- https://www.datainmotion.dev/2020/05/flank-low-code-streaming-populating.html
- https://www.datainmotion.dev/2020/10/running-flink-sql-against-kafka-using.html
- https://github.com/tspannhw/EverythingApacheNiFi
- https://github.com/tspannhw/ApacheConAtHome2020
- https://github.com/tspannhw/SmartStocks
- https://github.com/tspannhw/ClouderaFlinkSQLForPartners
- https://github.com/tspannhw/SmartWeather