Getting Started with Spring Cloud Data Flow

In this article, I will show you how to get started with Spring Cloud Data Flow. Spring Cloud Data Flow is a great platform for building data integration and processing pipelines. It comes with a user-friendly graphical dashboard where you can define your streams, which makes working with data a pleasure.

The goal of this article is for you to be able to build some simple data pipelines by the time you finish reading. Before we get started, there are a few system requirements:

  • You should have JDK 8 installed, as at the time of writing Spring Cloud Data Flow is tricky to get working with JDK 9 (the JAXB libraries are missing).
  • You should have Docker installed. If you are not sure why this is useful, I have written an article explaining Docker as a development tool. If you would rather not install Docker, you will need to make MySQL, Redis, and RabbitMQ accessible from your machine some other way.
  • You should have Apache Maven installed on your machine. The official installation guide should be easy enough to follow.

Assuming that you have the tools required, we can get started!

Getting Spring Cloud Data Flow Server up and running

As I have mentioned, in order to get the platform running you need some middleware. The first piece is RabbitMQ. You could use Kafka for your stream communication instead, but to keep this tutorial simple we are going to go with RabbitMQ:

docker run --name dataflow-rabbit -p 15672:15672 -p 5672:5672 -d rabbitmq:3-management

Running this command starts a RabbitMQ Docker container on your machine, exposed on the default ports. You also get a management console that lets you check on the status of your broker.
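
If you want a quick sanity check, the management console is the easiest way. The guest/guest credentials below are the image defaults (the same pair we will later pass to the Data Flow server):

docker ps --filter name=dataflow-rabbit

Then open http://localhost:15672 in your browser and log in with guest/guest.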

In order to get analytics from Spring Cloud Data Flow, you will need Redis as well. This is not strictly required, but since it is little hassle, let's get it started. If you are running Data Flow in a production deployment, you will definitely want it:

docker run --name dataflow-redis -p 6379:6379 -d redis
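
A quick, optional way to confirm Redis is up (assuming you kept the container name above) is to ping it through redis-cli inside the container:

docker exec -it dataflow-redis redis-cli ping

You should get PONG back.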

The last prerequisite is a MySQL instance. If you do not provide one, you will end up with an in-memory H2 database powering Data Flow. The problem with that is that you lose all your data on a server restart. This can actually be desirable for testing, but it is incredibly frustrating to invest time configuring your Streams only to lose them when the server restarts. While creating the container, we will set a custom password and create a database for Data Flow:

docker run --name dataflow-mysql -e MYSQL_ROOT_PASSWORD=dataflow -e MYSQL_DATABASE=scdf -p 3306:3306 -d mysql:5.7
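
Once the container has finished initialising (it can take a few seconds on the first run), you can confirm the scdf database was created using the credentials we just set:

docker exec -it dataflow-mysql mysql -uroot -pdataflow -e "SHOW DATABASES;"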

With these three Docker containers up and running, you are ready to download and start the Data Flow server. You can download it from here: https://repo.spring.io/libs-release/org/springframework/cloud/spring-cloud-dataflow-server-local/1.3.0.RELEASE/spring-cloud-dataflow-server-local-1.3.0.RELEASE.jar. This is the latest version at the time of writing; the official project website may link to a newer release, but a newer version is not guaranteed to behave exactly the same.
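
If you prefer the command line to the browser, the same jar can be fetched with wget (or curl -O):

wget https://repo.spring.io/libs-release/org/springframework/cloud/spring-cloud-dataflow-server-local/1.3.0.RELEASE/spring-cloud-dataflow-server-local-1.3.0.RELEASE.jar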

We have downloaded the local version of the server. That means that the different applications composing our Streams will be deployed as local Java processes. There are Cloud Foundry and Kubernetes versions of the server available if you want something more production-ready.

Time to start the server. We will pass the MySQL and RabbitMQ parameters in the start command; the Redis defaults are good enough:

java -jar spring-cloud-dataflow-server-local-1.3.0.RELEASE.jar --spring.datasource.url=jdbc:mysql://localhost:3306/scdf --spring.datasource.username=root --spring.datasource.password=dataflow --spring.datasource.driver-class-name=org.mariadb.jdbc.Driver --spring.rabbitmq.host=127.0.0.1 --spring.rabbitmq.port=5672 --spring.rabbitmq.username=guest --spring.rabbitmq.password=guest

Spring Cloud Data Flow – First Look

Hopefully, your server started without a problem and you are seeing something like this in your console:

The beautiful logo shown when Spring Cloud Data Flow starts
Data Flow successfully started
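
If you would rather verify this from the command line, the server exposes a REST API on the same port as the dashboard; hitting the about endpoint (assuming the default port 9393) should return some JSON describing the server:

curl http://localhost:9393/about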

If you want, you can look into your MySQL instance, where you should see a number of newly created tables:

Data Flow creates multiple tables to keep track of things
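
If you do not have a separate MySQL client installed, the same peek works from inside the container (the exact table names can differ between Data Flow versions):

docker exec -it dataflow-mysql mysql -uroot -pdataflow scdf -e "SHOW TABLES;"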

Time to see Spring Cloud Data Flow itself! Go to http://localhost:9393/dashboard to see the dashboard:

Spring Cloud Data Flow dashboard

It looks a bit empty! This is because we have not loaded any starter apps yet. Spring Cloud Stream App Starters is a project that provides a multitude of ready-to-go starter apps for building Streams. You can read from FTP, HTTP, JDBC, Twitter and more, process the data, and save it to a multitude of destinations. Each application belongs to one of three main categories:

  • Source – the available sources of data. Your streaming pipelines start with one of these.
  • Processor – these take data and send it further along the processing pipeline. They sit in the middle.
  • Sink – the endpoints of the streams. This is where the data ends up.

New starters are constantly being added, and you can see the up-to-date list on the official project site. At the time of writing, we have:

  • Source: file, ftp, gemfire, gemfire-cq, http, jdbc, jms, load-generator, loggregator, mail, mongodb, mqtt, rabbit, s3, sftp, syslog, tcp, tcp-client, time, trigger, triggertask, twitterstream
  • Processor: aggregator, bridge, filter, groovy-filter, groovy-transform, header-enricher, httpclient, pmml, python-http, python-jython, scriptable-transform, splitter, tasklaunchrequest-transform, tcp-client, tensorflow, transform, twitter-sentiment
  • Sink: aggregate-counter, cassandra, counter, field-value-counter, file, ftp, gemfire, gpfdist, hdfs, hdfs-dataset, jdbc, log, mongodb, mqtt, pgcopy, rabbit, redis-pubsub, router, s3, sftp, task-launcher-cloudfoundry, task-launcher-local, task-launcher-yarn, tcp, throughput, websocket

This is an impressive list! So how do we get these apps into the Spring Cloud Data Flow server? It could not be easier. We are going to use the RabbitMQ + Maven flavour of the starters, since that matches how we set up the server. From the project website, the URL for the stable release is http://bit.ly/Celsius-SR1-stream-applications-rabbit-maven. We can supply this URL to the Data Flow server. In the Apps section of the dashboard, click the button for adding applications:

And then populate the URI and click the Import button:

Bulk Importing the starter apps

If all went well, then you should see multiple starter apps available.

Multiple starter apps available for use in Data Flow

Building our first Data Flow Stream

We are now ready to build our first Data Flow Stream. To do this, head to the Streams tab on the Dashboard and click the Create Stream button:

Click the highlighted button to create a new Stream

Here, we will create a stream that reads from an HTTP endpoint, upper-cases the content, and saves it all to a file in c:/dataflow-output (if you are on Windows; otherwise choose a different directory). The aim of this exercise is to show you how a Source, a Processor and a Sink connect together and how seamless it all is! Let's drag and drop the following into the workspace:

  • Source – HTTP
  • Processor – transform
  • Sink – file

You should see the following:

The three components that will create our stream

As you can see, there are red exclamation marks displayed. That means that the Stream definition is not yet valid. You can click on the tiny squares in the graphical representation to connect the components, or alternatively specify how the Stream should be composed in the text field:

http | transform | file

With that done, we just need to configure our stream accordingly. This can be done either by clicking the cog-wheel icon that appears when a component is selected in the graphical representation:

Click on the highlighted icon to configure the component

Or by using the text field. One thing the graphical interface gives you is quite a nice way of discovering and inputting the properties. For example, to configure the HTTP source we can simply set the port like this:

We can see what other properties are configurable

Let's set the remaining properties via the text field. The final Stream definition should look like this:

http --port=7171 | transform --expression=payload.toUpperCase() | file --directory=c:/dataflow-output

Great! We have our first stream definition. Now let's click the Create Stream button visible just above the text field and set the stream name to Upper-Case-Stream:

Creating our first stream

I have ticked the Deploy stream(s) box to have the stream automatically deployed. The stream should be deployed shortly:

You can see the stream successfully deployed

Trying out the Stream

It would not be much fun to create the stream and never try it! You can use Postman to send a few requests to the HTTP endpoint:

Sending example request
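
If you prefer the command line to Postman, curl does the job just as well (the payload text here is just an example):

curl -X POST -H "Content-Type: text/plain" -d "hello spring cloud data flow" http://localhost:7171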

You will quickly see that there are relevant queues and exchanges created in the connected RabbitMQ instance:

Queues are created, and so are Exchanges. Other processes could listen to those!

And finally, let's look into the directory and file where we wanted the results of our Stream to be saved:

The upper-cased results of processing in all its glory!
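
On Windows you can inspect the output from a command prompt; the file name is generated by the file sink, so substitute whatever you see listed in the directory:

dir c:\dataflow-output
type c:\dataflow-output\<generated-file-name>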

Congratulations! You have made it through the creation of your first Spring Cloud Data Flow Stream!

Where from here?

I hope that reading this introduction got you excited about using Spring Cloud Data Flow; I certainly enjoyed writing about it! You should also be aware that there is a Spring Cloud Data Flow Shell available if you need to work with the platform in a shell-only environment (or simply prefer to!).
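
As a rough sketch of what that looks like (the download URL follows the same pattern as the server jar, so double-check it against the project site for your version), you download the shell jar, run it, and it connects to a server on localhost:9393 by default:

wget https://repo.spring.io/libs-release/org/springframework/cloud/spring-cloud-dataflow-shell/1.3.0.RELEASE/spring-cloud-dataflow-shell-1.3.0.RELEASE.jar
java -jar spring-cloud-dataflow-shell-1.3.0.RELEASE.jar

Once inside, commands such as app list and stream list show the applications and streams we registered through the dashboard.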

There is much more to Spring Cloud Data Flow. You can create your own Sources, Processors, and Sinks. You can create Tasks (processes that run on demand) rather than Streams. You can design far more complicated processing workflows. All of this is there to be discovered and used; hopefully, with the knowledge from this article, you are ready to start exploring on your own.

9 thoughts on “Getting Started with Spring Cloud Data Flow”

  1. Thanks for the useful article. For version 1.5.2.RELEASE the bulk application import URL does not work. It is asking "Please provide a valid URI pointing to the respective properties file." Do you know what the reason is?

      1. Hi Bartosz, thanks for your comment. I finally fixed that by using this URL: http://bit.ly/Darwin-GA-stream-applications-rabbit-maven, which I found in the updated documentation. I have another question: we are using a Docker image, and each restart undoes the changes (the bulk import and stream definitions). Do you have any suggestion for how I can easily re-apply the changes to the Docker image each time? By the way, if Spring Cloud Data Flow is using the MySQL database to store data, why does it reset after every restart?

        1. Hey Mehdi, you have two options here if you do not want to lose your data.

          1. You can run MySQL outside of Docker. This is fairly easy and your state will be preserved. You can get MySQL 5.7 from here https://dev.mysql.com/downloads/mysql/5.7.html
          2. You can run MySQL with Docker while storing your data outside the container. This is explained here: https://hub.docker.com/_/mysql/#Where%20to%20Store%20Data It is a bit more difficult, but understanding Docker well will help you a lot in the future, so I recommend reading that and learning about Docker volumes: https://docs.docker.com/storage/volumes/ (see the sketch below).

          I hope that will solve your problems!
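
          As a minimal sketch of that second option, reusing the parameters from the article (the host path is just an example; a named volume as described in the Docker docs works as well):

          docker run --name dataflow-mysql -e MYSQL_ROOT_PASSWORD=dataflow -e MYSQL_DATABASE=scdf -p 3306:3306 -v /my/own/datadir:/var/lib/mysql -d mysql:5.7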

          1. Thanks Bartosz. Where can I change the database settings to point Spring Cloud Data Flow at my external MySQL database?

  2. I mean, based on what I found in the documentation, I need to pass database parameters to the jar file like this:
    java -jar spring-cloud-dataflow-server-local/target/spring-cloud-dataflow-server-local-{project-version}.jar \
    --spring.datasource.url=jdbc:mysql://localhost:3306/mydb \
    --spring.datasource.username= \
    --spring.datasource.password= \
    --spring.datasource.driver-class-name=org.mariadb.jdbc.Driver
    But what if I want to set MySQL as the default database and not have to pass these parameters every time?

  3. Sorry for asking so many questions. I also faced this issue: after deploying the stream, when I want to send a request to the http source using Postman I'm getting java.net.ConnectException: Connection refused (Connection refused). Do you have any idea what's going wrong?

  4. Hi Bartosz, thanks for the great tutorial.

    I am facing an issue while deploying the stream. It's giving an exception related to the ZooKeeper connection. Do we need to install ZooKeeper, and is there any specific setting required? Below is the stack trace:

    Caused by: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
    at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1232) ~[zkclient-0.9.jar!/:na]
    at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:156) ~[zkclient-0.9.jar!/:na]
    at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:130) ~[zkclient-0.9.jar!/:na]
    at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:76) ~[kafka_2.11-0.10.1.1.jar!/:na]
    at kafka.utils.ZkUtils$.apply(ZkUtils.scala:58) ~[kafka_2.11-0.10.1.1.jar!/:na]
    at kafka.utils.ZkUtils.apply(ZkUtils.scala) ~[kafka_2.11-0.10.1.1.jar!/:na]
    at org.springframework.cloud.stream.binder.kafka.provisioning.KafkaTopicProvisioner.createTopicAndPartitions(KafkaTopicProvisioner.java:171) ~[spring-cloud-stream-binder-kafka-core-1.3.1.RELEASE.jar!/:1.3.1.RELEASE]
    at org.springframework.cloud.stream.binder.kafka.provisioning.KafkaTopicProvisioner.createTopicsIfAutoCreateEnabledAndAdminUtilsPresent(KafkaTopicProvisioner.java:153) ~[spring-cloud-stream-binder-kafka-core-1.3.1.RELEASE.jar!/:1.3.1.RELEASE]
    at org.springframework.cloud.stream.binder.kafka.provisioning.KafkaTopicProvisioner.provisionConsumerDestination(KafkaTopicProvisioner.java:132) ~[spring-cloud-stream-binder-kafka-core-1.3.1.RELEASE.jar!/:1.3.1.RELEASE]
    at org.springframework.cloud.stream.binder.kafka.provisioning.KafkaTopicProvisioner.provisionConsumerDestination(KafkaTopicProvisioner.java:60) ~[spring-cloud-stream-binder-kafka-core-1.3.1.RELEASE.jar!/:1.3.1.RELEASE]
    at org.springframework.cloud.stream.binder.AbstractMessageChannelBinder.doBindConsumer(AbstractMessageChannelBinder.java:225) ~[spring-cloud-stream-1.3.1.RELEASE.jar!/:1.3.1.RELEASE]

    1. Hey shachi, it seems that you are trying to use the Kafka binder rather than RabbitMQ as in this tutorial. If you are planning on using Kafka, that's fine, but you will need to learn about that technology. If you are going to use RabbitMQ, you will need to make sure you set the correct properties as in this tutorial or the documentation: https://cloud.spring.io/spring-cloud-stream-app-starters/
