General

Q: What is Amazon Kinesis Data Analytics?
Amazon Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time. Amazon Kinesis Data Analytics reduces the complexity of building, managing, and integrating streaming applications with other Amazon Web Services services. You can quickly build SQL queries and sophisticated Apache Flink applications in a supported language such as Java or Scala using built-in templates and operators for common processing functions to organize, transform, aggregate, and analyze data at any scale.
 
Amazon Kinesis Data Analytics takes care of everything required to run your real-time applications continuously and scales automatically to match the volume and throughput of your incoming data. With Amazon Kinesis Data Analytics, you only pay for the resources your streaming applications consume. There is no minimum fee or setup cost.
 
Q: What is real-time stream processing and why do I need it?
Data is coming at us at lightning speeds due to an explosive growth of real-time data sources. Whether it is log data from mobile and web applications, purchase data from ecommerce sites, or sensor data from IoT devices, the data delivers information that can help companies learn about what their customers, organization, and business are doing right now. By having visibility into this data as it arrives, you can monitor your business in real time and quickly leverage new business opportunities. For example, making promotional offers to customers based on where they might be at a specific time, or monitoring social sentiment and changing customer attitudes to identify and act on new opportunities.

To take advantage of these opportunities, you need a different set of analytics tools for collecting and analyzing real-time streaming data than what has been available traditionally for static, stored data. With traditional analytics, you gather the information, store it in a database, and analyze it hours, days, or weeks later. Analyzing real-time data requires a different approach, different tools, and different services. Instead of running database queries on stored data, streaming analytics services process the data continuously before the data is stored. Streaming data flows at an incredible rate that can vary up and down all the time. Streaming analytics services need to process this data when it arrives, often at speeds of millions of events per hour.
 
Q: What can I do with Kinesis Data Analytics?
You can use Kinesis Data Analytics for many use cases to process data continuously and get insights in seconds or minutes rather than waiting days or even weeks. Kinesis Data Analytics enables you to quickly build end-to-end stream processing applications for log analytics, clickstream analytics, Internet of Things (IoT), ad tech, gaming, and more. The four most common use cases are streaming extract-transform-load (ETL), continuous metric generation, responsive real-time analytics, and interactive querying of data streams.

Streaming ETL

Streaming ETL applications enable you to clean, enrich, organize, and transform raw data prior to loading your data lake or data warehouse in real time, reducing or eliminating batch ETL steps. These applications can buffer small records into larger files prior to delivery, and perform sophisticated joins across streams and tables. For example, you can build an application that continuously reads IoT sensor data stored in Amazon Managed Streaming for Apache Kafka (Amazon MSK), organizes the data by sensor type, removes duplicate data, normalizes the data per a specified schema, and then delivers the data to Amazon S3.
 
Continuous metric generation

Continuous metric generation applications enable you to monitor and understand how your data is trending over time. Your applications can aggregate streaming data into critical information and seamlessly integrate it with reporting databases and monitoring services to serve your applications and users in real-time. With Kinesis Data Analytics, you can use SQL or Apache Flink code in a supported language to continuously generate time-series analytics over time windows. For example, you can build a live leaderboard for a mobile game by computing the top players every minute and then sending it to Amazon DynamoDB. Or, you can track the traffic to your website by calculating the number of unique website visitors every five minutes and then sending the processed results to Amazon Redshift.

Responsive real-time analytics

Responsive real-time analytics applications send real-time alarms or notifications when certain metrics reach predefined thresholds or, in more advanced cases, when your application detects anomalies using machine learning algorithms. These applications enable you to respond immediately to changes in your business in real time, such as predicting user abandonment in mobile apps and identifying degraded systems. For example, an application can compute the availability or success rate of a customer-facing API over time, and then send the results to Amazon CloudWatch. You can build another application to look for events that meet certain criteria, and then automatically notify the right customers using Amazon Kinesis Data Streams and Amazon Simple Notification Service (SNS).
 
Interactive analysis of data streams

Interactive analysis enables streaming data exploration in real time. With ad hoc queries or programs, you can inspect streams from Amazon MSK or Amazon Kinesis Data Streams and visualize how data looks within those streams. For example, you can view how a real-time metric that computes the average over a time window behaves and send the aggregated data to a destination of your choice. Interactive analysis also helps with iterative development of stream processing applications. The queries you build continuously update as new data arrives. With Amazon Kinesis Data Analytics Studio, you can deploy these queries to run continuously with autoscaling and durable state backups enabled.
 
Q: How do I get started with Apache Flink applications for Kinesis Data Analytics?
Sign in to the Amazon Kinesis Data Analytics console and create a new stream processing application. You can also use the Amazon CLI and Amazon SDKs. Once you create an application, go to your favorite integrated development environment, connect to Amazon Web Services, and install the open source Apache Flink libraries and Amazon SDKs in your language of choice. Apache Flink is an open source framework and engine for processing data streams. The extensible libraries include more than 25 pre-built stream processing operators, like window and aggregate, and Amazon Web Services service integrations, like Amazon MSK, Amazon Kinesis Data Streams, and Amazon Kinesis Data Firehose. Once built, you upload your code to Amazon Kinesis Data Analytics, and the service takes care of everything required to run your real-time applications continuously, including scaling automatically to match the volume and throughput of your incoming data.
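For illustration, a minimal application skeleton might look like the following sketch. It assumes the open source Apache Flink Kinesis connector (FlinkKinesisConsumer) is on your classpath; the stream name, region, and class name are placeholders, not part of any template.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class StreamingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Connector properties; the region and stream name are placeholders.
        Properties props = new Properties();
        props.setProperty(ConsumerConfigConstants.AWS_REGION, "cn-north-1");
        props.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        DataStream<String> input = env.addSource(
            new FlinkKinesisConsumer<>("input_events", new SimpleStringSchema(), props));

        input.print();  // Replace with your processing operators before deploying.

        env.execute("my-streaming-app");
    }
}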
 
You can get started from the Amazon Kinesis Data Analytics console and create a new Studio notebook. Once you start the notebook, you can open it in Apache Zeppelin to immediately write code in SQL, Python, or Scala. You can interactively develop applications using the notebook interface for Amazon Kinesis Data Streams, Amazon MSK, and Amazon S3 using built-in integrations, and various other sources with custom connectors. You can use all the operators that Apache Flink supports in Flink SQL and the Table API to perform ad hoc querying of your data streams and to develop your stream processing application. Once you are ready, with a few clicks, you can easily promote your code to a continuously running stream processing application with autoscaling and durable state.
 
Q: How do I get started with Apache Beam applications for Kinesis Data Analytics?
Using Apache Beam to create your Kinesis Data Analytics application is very similar to getting started with Apache Flink. Please follow the instructions in the question above, and be sure to install any components necessary for applications to run on Apache Beam, per the instructions in the Developer Guide. Note that Kinesis Data Analytics supports only the Apache Beam Java SDK.
 
Q: How do I get started with SQL applications for Kinesis Data Analytics?
Sign in to the Amazon Kinesis Data Analytics console and create a new stream processing application. You can also use the Amazon CLI and Amazon SDKs. You can build an end-to-end application in three simple steps: 1) configure incoming streaming data, 2) write your SQL queries, and 3) point to where you want the results loaded. Kinesis Data Analytics recognizes standard data formats such as JSON, CSV, and TSV, and automatically creates a baseline schema. You can refine this schema, or if your data is unstructured, you can define a new one using our intuitive schema editor. Then, the service applies the schema to the input stream and makes it look like a SQL table that is continually updated so that you can write standard SQL queries against it. You use our SQL editor to build your queries.
The SQL editor comes with all the bells and whistles including syntax checking and testing against live data. We also give you templates that provide the SQL code for anything from a simple stream filter to advanced anomaly detection and top-K analysis. Kinesis Data Analytics takes care of provisioning and elastically scaling all of the infrastructure to handle any data throughput. You don’t need to plan, provision, or manage infrastructure.
 
Q: What are the limits of Kinesis Data Analytics?
Kinesis Data Analytics elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. For detailed information on service limits for Apache Flink applications, see the Limits section of the Amazon Kinesis Data Analytics for Apache Flink Developer Guide. For detailed information on service limits for SQL applications, see the Limits section of the Amazon Kinesis Data Analytics for SQL Developer Guide.

Key concepts

Q: What is a Kinesis Data Analytics application?
An application is the Kinesis Data Analytics entity that you work with. Kinesis Data Analytics applications continuously read and process streaming data in real time. You write application code in a language supported by Apache Flink to process the incoming streaming data and produce output. Then, Kinesis Data Analytics writes the output to a configured destination.

Input – The streaming source for your application. In the input configuration, you map the streaming source to one or more in-application data streams. Data flows from your data sources into your in-application data streams. You process data from these in-application data streams using your application code, sending processed data to subsequent in-application data streams or destinations. You add inputs inside application code for Apache Flink applications and Studio notebooks, and via the API for Kinesis Data Analytics for SQL applications.
 
Application code – A series of Apache Flink operators or SQL statements that process input and produce output. In its simplest form, application code can be a single Apache Flink operator or SQL statement that reads from an in-application data stream associated with a streaming source and writes to an in-application data stream associated with an output. For a Studio notebook, this could be a simple Flink SQL select query, with the results shown in context within the notebook. You can write Apache Flink code in its supported languages for Kinesis Data Analytics for Apache Flink applications or Studio notebooks, or SQL code that splits the initial in-application data stream into multiple streams and applies additional logic to these separate streams for Kinesis Data Analytics for SQL applications.
 
Output – You can create one or more in-application streams to store intermediate results. You can then optionally configure an application output to persist data from specific in-application streams to an external destination. You add these outputs inside application code for Apache Flink applications and Studio notebooks, and via the API for Kinesis Data Analytics for SQL applications.
 
Q: What is an in-application data stream?
An in-application data stream is an entity that continuously stores data in your application for you to perform processing on. Your application continuously writes to and reads from in-application data streams. For Apache Flink and Studio applications, you interact with an in-application stream by processing data with stream operators. Operators transform one or more data streams into a new data stream. For SQL applications, you interact with an in-application stream in the same way you would a SQL table, by using SQL statements. You apply SQL statements to one or more data streams and insert the results into a new data stream.
 
Q: What application code is supported?
For Apache Flink applications, Kinesis Data Analytics supports applications built using the Apache Flink open source libraries and the Amazon SDKs. For SQL applications, Kinesis Data Analytics supports ANSI SQL with some extensions to the SQL standard that make it easier to work with streaming data. Kinesis Data Analytics Studio supports code built using Apache Flink-compatible SQL, Python, and Scala.

Managing applications

Q: How can I monitor the operations and performance of my Kinesis Data Analytics applications?
Amazon Web Services provides various tools that you can use to monitor your Kinesis Data Analytics applications. You can configure some of these tools to do the monitoring for you. For more information about how to monitor your application, see:

Monitoring Kinesis Data Analytics in the Amazon Kinesis Data Analytics for Apache Flink Developer Guide.
Monitoring Kinesis Data Analytics in the Amazon Kinesis Data Analytics for SQL Developer Guide.

Q: How do I manage and control access to my Kinesis Data Analytics applications?
Kinesis Data Analytics needs permissions to read records from the streaming data sources that you specify in your application. Kinesis Data Analytics also needs permissions to write your application output to destinations that you specify in your application output configuration. You can grant these permissions by creating IAM roles that Kinesis Data Analytics can assume. The permissions you grant to this role determine what Kinesis Data Analytics can do when the service assumes the role. For more information, see:

Granting Permissions in the Amazon Kinesis Data Analytics for Apache Flink Developer Guide.
Granting Permissions in the Amazon Kinesis Data Analytics for SQL Developer Guide.

Q: How does Kinesis Data Analytics scale my application?
Kinesis Data Analytics elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. Kinesis Data Analytics provisions capacity in the form of Amazon Kinesis Processing Units (KPU). One KPU provides you with 1 vCPU and 4 GB of memory.
For Apache Flink applications, Kinesis Data Analytics assigns 50 GB of running application storage per KPU, which your application uses for checkpoints and which is also available to you as temporary disk. A checkpoint is an up-to-date backup of a running application that is used to recover immediately from an application disruption. You can also control the parallel execution of your Kinesis Data Analytics for Apache Flink application tasks (such as reading from a source or executing an operator) using the Parallelism and ParallelismPerKPU parameters in the API. Parallelism defines the number of concurrent instances of a task. All operators, sources, and sinks execute with a defined parallelism, by default 1. ParallelismPerKPU defines the number of parallel tasks that can be scheduled per Kinesis Processing Unit (KPU) of your application, by default 1. For more information, see Scaling in the Amazon Kinesis Data Analytics for Apache Flink Developer Guide.
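For illustration, you might set these parameters through the UpdateApplication API. In this hedged Amazon CLI sketch, the application name and version are placeholders:

aws kinesisanalyticsv2 update-application \
  --application-name my-flink-app \
  --current-application-version-id 1 \
  --application-configuration-update '{
    "FlinkApplicationConfigurationUpdate": {
      "ParallelismConfigurationUpdate": {
        "ConfigurationTypeUpdate": "CUSTOM",
        "ParallelismUpdate": 4,
        "ParallelismPerKPUUpdate": 1,
        "AutoScalingEnabledUpdate": true
      }
    }
  }'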

For SQL applications, each streaming source is mapped to a corresponding in-application stream. While this is not required for many customers, you can use KPUs more efficiently by increasing the number of in-application streams that your source is mapped to by specifying the input parallelism parameter. Kinesis Data Analytics evenly assigns the streaming data source’s partitions, such as an Amazon Kinesis data stream’s shards, to the number of in-application data streams that you specified. For example, if you have a 10-shard Amazon Kinesis data stream as a streaming data source and you specify an input parallelism of two, Kinesis Data Analytics assigns five Amazon Kinesis shards to each of two in-application streams, named “SOURCE_SQL_STREAM_001” and “SOURCE_SQL_STREAM_002”. For more information, see Configuring Application Input in the Amazon Kinesis Data Analytics for SQL Developer Guide.
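As a sketch, the input parallelism is part of the input configuration that you pass to the CreateApplication or AddApplicationInput API. The ARNs below are placeholders, and the schema is omitted for brevity:

"Input": {
  "NamePrefix": "SOURCE_SQL_STREAM",
  "KinesisStreamsInput": {
    "ResourceARN": "arn:aws-cn:kinesis:cn-north-1:111122223333:stream/my-stream",
    "RoleARN": "arn:aws-cn:iam::111122223333:role/kda-input-role"
  },
  "InputParallelism": { "Count": 2 },
  "InputSchema": { ... }
}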

Q: What are the best practices associated with building and managing my Kinesis Data Analytics applications?
For information about best practices for Apache Flink applications, see the Best Practices section of the Amazon Kinesis Data Analytics for Apache Flink Developer Guide. The section covers best practices for fault tolerance, performance, logging, coding, and more.

For information about best practices for SQL applications, see the Best Practices section of the Amazon Kinesis Data Analytics for SQL Developer Guide. The section covers managing applications, defining input schema, connecting to outputs, and authoring application code.

Q: Can I access resources behind an Amazon VPC with a Kinesis Data Analytics for Apache Flink application?
Yes. You can access resources behind an Amazon VPC. You can learn how to configure your application for VPC access in the Using an Amazon VPC section of the Amazon Kinesis Data Analytics Developer Guide.

Q: Can a single Kinesis Data Analytics for Apache Flink application have access to multiple VPCs?
No. If multiple subnets are specified, they must all be in the same VPC. You can connect to other VPCs by peering your VPCs.

Q: Can a Kinesis Data Analytics for Apache Flink application that is connected to a VPC also access the internet and Amazon Web Services service endpoints?
Kinesis Data Analytics for Apache Flink applications and Kinesis Data Analytics Studio notebooks that are configured to access resources in a particular VPC do not have access to the internet by default. You can learn how to configure internet access for your application in the Internet and Service Access section of the Amazon Kinesis Data Analytics Developer Guide.

Pricing and billing

Q: How much does Kinesis Data Analytics cost?
With Amazon Kinesis Data Analytics, you pay only for what you use. There are no resources to provision or upfront costs associated with Amazon Kinesis Data Analytics.
You are charged an hourly rate based on the number of Amazon Kinesis Processing Units (or KPUs) used to run your streaming application. A single KPU is a unit of stream processing capacity comprised of 1 vCPU compute and 4 GB memory. Amazon Kinesis Data Analytics automatically scales the number of KPUs required by your stream processing application as the demands of memory and compute vary in response to processing complexity and the throughput of streaming data processed.
 
For Apache Flink applications, you are charged a single additional KPU per application, used for application orchestration. Apache Flink applications are also charged for running application storage and durable application backups. Running application storage is used for Amazon Kinesis Data Analytics’ stateful processing capabilities and is charged per GB-month. Durable application backups are optional and provide a point-in-time recovery point for applications, charged per GB-month.
 
For Amazon Kinesis Data Analytics Studio, in development or interactive mode, you are charged an additional KPU for application orchestration and one for interactive development. You are also charged for running application storage. You are not charged for durable application backups.
 
For more information about pricing, see the Amazon Kinesis Data Analytics pricing page.
 
Q: Am I charged for a Kinesis Data Analytics application that is running but not processing any data from the source?
For Apache Flink applications, you are charged a minimum of two KPUs and 50 GB running application storage if your Kinesis Data Analytics application is running. For SQL applications, you are charged a minimum of one KPU if your Kinesis Data Analytics application is running.
 
Q: Other than Kinesis Data Analytics costs, are there any other costs that I might incur?
Kinesis Data Analytics is a fully managed stream processing solution, independent from the streaming source that it reads data from and the destinations it writes processed data to. You will be billed independently for the services you read from and write to in your application.

Building Apache Flink applications

Q: What is Apache Flink?
Apache Flink is an open source framework and engine for stream and batch data processing. It makes streaming applications easy to build because it provides powerful operators and solves core streaming problems, such as duplicate processing, very well. Apache Flink provides data distribution, communication, and fault tolerance for distributed computations over data streams.
 
Q: How do I develop applications?
You can start by downloading the open source libraries that include the Amazon SDK, Apache Flink, and connectors for Amazon Web Services services. You can get instructions on how to download the libraries and create your first application in the Amazon Kinesis Data Analytics for Apache Flink Developer Guide.
 
Q: What does my application code look like?
You write your Apache Flink application code using data streams and stream operators. Application data streams are the data structure you perform processing against using your code. Data continuously flows from the sources into application data streams. One or more stream operators are used to define your processing on the application data streams, including transform, partition, aggregate, join, and window. Data streams and operators can be put together in serial and parallel chains. A short example using pseudo code is shown below.

// Read raw game events from a Kinesis stream.
DataStream<GameEvent> rawEvents = env.addSource(
    new KinesisStreamSource("input_events"));

// Map each event to a per-level user record.
DataStream<UserPerLevel> gameStream = rawEvents.map(
    event -> new UserPerLevel(event.gameMetadata.gameId,
        event.gameMetadata.levelId, event.userId));

// Key by game, then aggregate over one-minute tumbling windows.
gameStream.keyBy(event -> event.gameId)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .apply(...);

// Write the results to an output Kinesis stream.
gameStream.addSink(new KinesisStreamSink("myGameStateStream"));
Q: How do I use the operators?
Operators take an application data stream as input and send processed data to an application data stream as output. Operators can be put together to build applications with multiple steps and don’t require advanced knowledge of distributed systems to implement and operate.
 
Q: What operators are supported?
Kinesis Data Analytics for Apache Flink includes over 25 operators that can be used to solve a wide variety of use cases, including Map, KeyBy, aggregations, Window Join, and Window. Map allows you to perform arbitrary processing, taking one element from an incoming data stream and producing another element. KeyBy logically organizes data using a specified key, enabling you to process similar data points together. Aggregations perform processing across multiple keys, such as sum, min, and max. Window Join joins two data streams together on a given key and window. Window groups data using a key and a typically time-based operation, such as counting the number of unique items over a five-minute period.
You can build custom operators if these do not meet your needs. You can find more examples in the Operators section of the Amazon Kinesis Data Analytics for Apache Flink Developer Guide. You can find a full list of operators in the Operators section of the Apache Flink documentation.
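As an illustrative sketch that combines three of these operators (the PageView type and the views stream are assumptions, not part of the service):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Map: extract (userId, 1) pairs; KeyBy: group by user;
// Window + aggregation: count events per user every five minutes.
DataStream<Tuple2<String, Long>> counts = views
    .map(v -> Tuple2.of(v.userId, 1L))
    .returns(Types.TUPLE(Types.STRING, Types.LONG))
    .keyBy(t -> t.f0)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    .sum(1);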
 
Q: What integrations are supported in a Kinesis Data Analytics for Apache Flink application?
You can set up pre-built integrations with minimal code, or build your own integration to connect to virtually any data source. The open source libraries based on Apache Flink support streaming sources and destinations, or sinks, for the delivery of processed data. They also include support for data enrichment via asynchronous input/output connectors. A list of the specific connectors included in the open source libraries is shown below.

• Streaming data sources: Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams
• Destinations, or sinks: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon DynamoDB, Amazon Elasticsearch Service, and Amazon S3 (through file sink integrations).

Apache Flink also includes other connectors including Apache Kafka, Apache Cassandra, Elasticsearch and more.
 
Q: Are custom integrations supported?
You can add a source or destination to your application by building upon a set of primitives that enable you to read and write from files, directories, sockets, or anything that you can access over the internet. Apache Flink provides these primitives for data sources and data sinks. The primitives come with configurations such as the ability to read and write data continuously or once, asynchronously or synchronously, and much more. For example, you can set up an application that reads continuously from Amazon S3 by extending the existing file-based source integration.
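For illustration, a sketch of a continuously monitored Amazon S3 source using Apache Flink’s file-reading primitives, where the bucket path is a placeholder:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Re-scan the S3 prefix every 60 seconds and emit newly discovered records.
DataStream<String> lines = env.readFile(
    new TextInputFormat(new Path("s3://my-bucket/input/")),
    "s3://my-bucket/input/",
    FileProcessingMode.PROCESS_CONTINUOUSLY,
    60_000L);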
 
Q: What delivery model does Kinesis Data Analytics for Apache Flink applications provide?
Apache Flink applications in Kinesis Data Analytics use an “exactly once” delivery model if an application is built using idempotent operators, including sources and sinks. This means the processed data will impact downstream results once and only once. Checkpoints save the current application state and enable Kinesis Data Analytics for Apache Flink applications to recover the position of the application, providing the same semantics as a failure-free execution. Checkpoints for applications are provided via Apache Flink’s checkpointing functionality. By default, Kinesis Data Analytics for Apache Flink applications use exactly-once semantics. Your application will support exactly-once processing semantics if you design it using sources, operators, and sinks that utilize Apache Flink’s exactly-once semantics.
 
Q: Do I have access to local storage from my application storage?
Yes. Kinesis Data Analytics for Apache Flink provides your application with 50 GB of running application storage per Kinesis Processing Unit (KPU). Kinesis Data Analytics scales storage with your application. Running application storage is used for saving application state via checkpoints. It is also accessible to your application code to use as temporary disk for caching data or any other purpose. Kinesis Data Analytics can remove data from running application storage that is not saved via checkpoints (e.g., in operators, sources, or sinks) at any time. All data stored in running application storage is encrypted at rest.
 
Q: How does Kinesis Data Analytics for Apache Flink automatically backup my application?
Kinesis Data Analytics automatically backs up your running application’s state using checkpoints and snapshots. Checkpoints save the current application state and enable Kinesis Data Analytics for Apache Flink applications to recover the position of the application to provide the same semantics as a failure-free execution. Checkpoints utilize running application storage. Snapshots save a point in time recovery point for applications. Snapshots utilize durable application backups.
 
Q: What are application snapshots?
Snapshots enable you to create and restore your application to a previous point in time. This enables you to maintain previous application state and roll back your application at any time. You control how many snapshots you have at any given time, from zero to thousands. Snapshots use durable application backups, and Kinesis Data Analytics charges you based on their size. Kinesis Data Analytics encrypts data saved in snapshots by default. You can delete individual snapshots through the API, or all snapshots by deleting your application.
 
Q: What versions of Apache Flink are supported?
Amazon Kinesis Data Analytics for Apache Flink supports Apache Flink 1.6, 1.8, and 1.11. Apache Flink 1.11 in Kinesis Data Analytics supports Java Development Kit version 11, Python 3.7, and Scala 2.12. You can find more information in the Creating Applications section of the Amazon Kinesis Data Analytics for Apache Flink Developer Guide.

Building Amazon Kinesis Data Analytics Studio applications


Q: How do I develop a Studio application?
You can start from the Amazon Kinesis Data Analytics Studio, Amazon Kinesis Data Streams, or Amazon MSK consoles with a few clicks to launch a serverless notebook to immediately query data streams and perform interactive data analytics.
 
Interactive data analytics: You can write code in the notebook in SQL, Python, or Scala to interact with your streaming data, with query response times in seconds. You can use built-in visualizations to explore the data and view real-time insights on your streaming data from within your notebook, and easily develop stream processing applications powered by Apache Flink.

Stream processing application: Once your code is ready to run in production, you can promote it with a single click to a stream processing application that processes gigabytes of data per second, without servers. You can click ‘Deploy as stream processing application’ in the notebook interface or issue a single command in the CLI, and Studio takes care of all the infrastructure management necessary for you to run your stream processing application at scale, with autoscaling and durable state enabled, just as in an Amazon Kinesis Data Analytics for Apache Flink application.

Q: What does my application code look like?
You can write code in the notebook in your preferred language of SQL, Python, or Scala using Apache Flink’s Table API. The Table API is a high-level abstraction and relational API that supports a superset of SQL’s capabilities. It offers familiar operations such as select, filter, join, group by, aggregate, etc., along with stream-specific concepts like windowing. You use %<interpreter> to specify the language to be used in a section of the notebook, and can easily switch between languages. Interpreters are Apache Zeppelin plug-ins that enable developers to specify a language or data processing engine for each section of the notebook. You can also build user-defined functions and reference them to improve code functionality.
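For example, a notebook paragraph that runs a Flink SQL scan-and-filter query might look like the following sketch, where the trades table and its columns are assumed to be defined elsewhere in the notebook:

%flink.ssql(type=update)
SELECT ticker, price
FROM trades
WHERE price > 100;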

Q: What SQL operations are supported?
You can perform SQL operations such as Scan and Filter (SELECT, WHERE), Aggregations (GROUP BY, GROUP BY WINDOW, HAVING), Set (UNION, UNION ALL, INTERSECT, IN, EXISTS), Order (ORDER BY, LIMIT), Joins (INNER, OUTER, Timed Window – BETWEEN, AND, joining with temporal tables – tables that track changes over time), Top N, deduplication, and pattern recognition. Some of these queries, such as GROUP BY, OUTER JOIN, and Top N, are “results updating” for streaming data, which means that the results are continuously updating as the streaming data is processed. Other DDL statements such as CREATE, ALTER, and DROP are also supported. For a complete list of queries and samples, see https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/queries.html.
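As a sketch of a results-updating windowed aggregation in Flink SQL (the trades table and its proctime time attribute are assumptions for illustration):

SELECT
  ticker,
  TUMBLE_END(proctime, INTERVAL '1' MINUTE) AS window_end,
  AVG(price) AS avg_price
FROM trades
GROUP BY TUMBLE(proctime, INTERVAL '1' MINUTE), ticker;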

Q: How are Python and Scala supported?
Apache Flink’s Table API supports Python and Scala through language integration using Python strings and Scala expressions. The operations supported are very similar to the SQL operations supported, including select, order, group, join, filter, and windowing. A full list of operations and samples is included here: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/tableApi.html
 
Q: What versions of Apache Flink and Apache Zeppelin are supported?
Kinesis Data Analytics Studio supports Apache Flink 1.11 and Apache Zeppelin 0.9.

Q: What integrations are supported in a Kinesis Data Analytics Studio application by default?
  • Data sources: Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon S3
  • Destinations, or sinks: Amazon MSK, Amazon Kinesis Data Streams, and Amazon S3
 
Q: Are custom integrations supported?
You can configure additional integrations with a few additional steps and lines of Apache Flink code (Python, Scala, or Java) to define connections with all Apache Flink supported integrations including destinations such as Amazon Elasticsearch Service, Amazon ElastiCache for Redis, Amazon Aurora, Amazon Redshift, Amazon DynamoDB, Amazon Keyspaces, and more. You can attach executables for these custom connectors when you create or configure your Studio application.
 
Q: Should I develop with Kinesis Data Analytics Studio or Kinesis Data Analytics SQL?
We recommend getting started with Kinesis Data Analytics Studio as it offers a more comprehensive stream processing experience with exactly-once processing. Kinesis Data Analytics Studio offers stream processing application development in your language of choice (SQL, Python, and Scala), scales to GB/s of processing, supports long running computations over hours or even days, performs code updates within seconds, handles multiple input streams, and works with a variety of input streams including Amazon Kinesis Data Streams and Amazon MSK.

Building SQL applications

Configuring input for SQL applications

Q: What inputs are supported in a Kinesis Data Analytics SQL application?
SQL applications in Kinesis Data Analytics support two types of inputs: streaming data sources and reference data sources. A streaming data source is continuously generated data that is read into your application for processing. A reference data source is static data that your application uses to enrich data coming in from streaming sources. Each application can have no more than one streaming data source and no more than one reference data source. An application continuously reads and processes new data from streaming data sources, including Amazon Kinesis Data Streams or Amazon Kinesis Data Firehose. An application reads a reference data source, including Amazon S3, in its entirety for use in enriching the streaming data source through SQL JOINs.
 
Q: What is a reference data source?
A reference data source is static data that your application uses to enrich data coming in from streaming sources. You store reference data as an object in your S3 bucket. When the SQL application starts, Kinesis Data Analytics reads the S3 object and creates an in-application SQL table to store the reference data. Your application code can then join it with an in-application stream. You can update the data in the SQL table by calling the UpdateApplication API.
 
Q: How do I set up a streaming data source in my SQL application?
A streaming data source can be an Amazon Kinesis data stream or an Amazon Kinesis Data Firehose delivery stream. Your Kinesis Data Analytics SQL application continuously reads new data from streaming data sources as it arrives in real time. The data is made accessible in your SQL code through an in-application stream. An in-application stream acts like a SQL table because you can create, insert, and select from it. However, the difference is that an in-application stream is continuously updated with new data from the streaming data source.

You can use the Amazon Web Services Management Console to add a streaming data source. You can learn more about sources in the Configuring Application Input section of the Kinesis Data Analytics for SQL Developer Guide.
 
Q: How do I set up a reference data source in my SQL application?
A reference data source can be an Amazon S3 object. Your Kinesis Data Analytics SQL application reads the S3 object in its entirety when it starts running. The data is made accessible in your SQL code through a table. The most common use case for using a reference data source is to enrich the data coming from the streaming data source using a SQL JOIN.

Using the Amazon CLI, you can add a reference data source by specifying the S3 bucket, object, IAM role, and associated schema. Kinesis Data Analytics loads this data when you start the application, and reloads it each time you make any update API call.
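For illustration, such a call might look like the following Amazon CLI sketch; every name, ARN, and the two-column CSV schema are placeholders:

aws kinesisanalytics add-application-reference-data-source \
  --application-name my-sql-app \
  --current-application-version-id 5 \
  --reference-data-source '{
    "TableName": "COMPANY_NAMES",
    "S3ReferenceDataSource": {
      "BucketARN": "arn:aws-cn:s3:::my-bucket",
      "FileKey": "reference/companies.csv",
      "ReferenceRoleARN": "arn:aws-cn:iam::111122223333:role/kda-reference-role"
    },
    "ReferenceSchema": {
      "RecordFormat": {
        "RecordFormatType": "CSV",
        "MappingParameters": {
          "CSVMappingParameters": {
            "RecordRowDelimiter": "\n",
            "RecordColumnDelimiter": ","
          }
        }
      },
      "RecordColumns": [
        { "Name": "ticker_symbol", "SqlType": "VARCHAR(4)" },
        { "Name": "company_name", "SqlType": "VARCHAR(64)" }
      ]
    }
  }'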
 
Q: What data formats are supported for SQL applications?
SQL applications in Kinesis Data Analytics can detect the schema and automatically parse UTF-8 encoded JSON and CSV records using the DiscoverInputSchema API. This schema is applied to the data read from the stream as part of the insertion into an in-application stream.

For other UTF-8 encoded data that does not use a delimiter, that uses a different delimiter than CSV, or in cases where the discovery API did not fully discover the schema, you can define a schema using the interactive schema editor or use string manipulation functions to structure your data. For more information, see Using the Schema Discovery Feature and Related Editing in the Amazon Kinesis Data Analytics for SQL Developer Guide.
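For example, a hedged Amazon CLI sketch of the discovery call, with placeholder ARNs:

aws kinesisanalytics discover-input-schema \
  --resource-arn arn:aws-cn:kinesis:cn-north-1:111122223333:stream/my-stream \
  --role-arn arn:aws-cn:iam::111122223333:role/kda-read-role \
  --input-starting-position-configuration '{"InputStartingPosition": "NOW"}'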
 
Q: How is my input stream exposed to my SQL code?
Kinesis Data Analytics for SQL applies your specified schema and inserts your data into one or more in-application streams for streaming sources, and into a single SQL table for reference sources. The default number of in-application streams meets the needs of most use cases. You should increase this number if you find that your application is not keeping up with the latest data in your source stream, as measured by the MillisBehindLatest CloudWatch metric. The number of in-application streams required is impacted by both the amount of throughput in your source stream and your query complexity. The parameter for specifying the number of in-application streams mapped to your source stream is called input parallelism.

Authoring application code for SQL applications

Application code is a series of SQL statements that process input and produce output. These SQL statements operate on in-application streams and reference tables. An in-application stream is like a continuously updating table on which you can perform the SELECT and INSERT SQL operations. Your configured sources and destinations are exposed to your SQL code through in-application streams. You can also create additional in-application streams to store intermediate query results.

You can use the following pattern to work with in-application streams:

• Always use a SELECT statement in the context of an INSERT statement. When you select rows, you insert results into another in-application stream.
• Use an INSERT statement in the context of a pump. You use a pump to make an INSERT statement continuous, and write to an in-application stream.
• You use a pump to tie in-application streams together, selecting from one in-application stream and inserting into another in-application stream.

The following SQL code provides a simple, working application: 
-- Create a destination stream for the processed output.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    change DOUBLE,
    price DOUBLE);

-- Use a pump to continuously insert selected columns from the source stream.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS 
  INSERT INTO "DESTINATION_SQL_STREAM"    
    SELECT STREAM ticker_symbol, change, price    
    FROM "SOURCE_SQL_STREAM_001";
For more information about application code, see Application Code in the Amazon Kinesis Data Analytics for SQL Developer Guide.
 
Q: How does Kinesis Data Analytics help me with writing SQL code?
Kinesis Data Analytics includes a library of analytics templates for common use cases including streaming filters, tumbling time windows, and anomaly detection. You can access these templates from the SQL editor in the Amazon Web Services Management Console. After you create an application and navigate to the SQL editor, the templates are available in the upper-left corner of the console.
 
Q: How can I perform real-time anomaly detection in Kinesis Data Analytics?
Kinesis Data Analytics includes pre-built SQL functions for several advanced analytics, including one for anomaly detection. You can simply call this function from your SQL code to detect anomalies in real time. Kinesis Data Analytics uses the Random Cut Forest algorithm to implement anomaly detection.
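A minimal sketch using the RANDOM_CUT_FOREST function (stream and column names are illustrative):

CREATE OR REPLACE STREAM "ANOMALY_STREAM" (
    ticker_symbol VARCHAR(4),
    price DOUBLE,
    anomaly_score DOUBLE);

-- Score each incoming record; higher scores indicate likely anomalies.
CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
  INSERT INTO "ANOMALY_STREAM"
    SELECT STREAM ticker_symbol, price, ANOMALY_SCORE
    FROM TABLE(RANDOM_CUT_FOREST(
      CURSOR(SELECT STREAM ticker_symbol, price FROM "SOURCE_SQL_STREAM_001")));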

Configuring destinations in SQL applications

Q: What destinations are supported?
Kinesis Data Analytics for SQL supports up to three destinations per application. You can persist SQL results to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service (through Amazon Kinesis Data Firehose), as well as Amazon Kinesis Data Streams. You can write to a destination not directly supported by Kinesis Data Analytics by sending SQL results to Amazon Kinesis Data Streams and leveraging its integration with Amazon Lambda to send them to a destination of your choice.
 
Q: How do I set up a destination?
In your application code, you write the output of SQL statements to one or more in-application streams. Optionally, you can add an output configuration to your application to persist everything written to specific in-application streams to up to three external destinations. These external destinations can be an Amazon S3 bucket, an Amazon Redshift table, an Amazon Elasticsearch Service domain (through Amazon Kinesis Data Firehose), and an Amazon Kinesis data stream. Each application supports up to three destinations, which can be any combination of the above. For more information, see Configuring Output Streams in the Amazon Kinesis Data Analytics for SQL Developer Guide.
 
Q: My preferred destination is not directly supported. How can I send SQL results to this destination?
You can use Amazon Lambda to write to a destination that is not directly supported using Kinesis Data Analytics for SQL applications. We recommend that you write results to an Amazon Kinesis data stream, and then use Amazon Lambda to read the processed results and send it to the destination of your choice. For more information, see the Example: Amazon Lambda Integration in the Amazon Kinesis Data Analytics for SQL Developer Guide. Alternatively, you can use a Kinesis Data Firehose delivery stream to load the data into Amazon S3, and then trigger an Amazon Lambda function to read that data and send it to the destination of your choice.
 
Q: What delivery model does Kinesis Data Analytics provide?
SQL applications in Kinesis Data Analytics use an "at least once" delivery model for application output to the configured destinations. Kinesis Data Analytics applications take internal checkpoints, which are points in time when output records have been delivered to the destinations with no data loss. The service uses the checkpoints as needed to ensure that your application output is delivered at least once to the configured destinations. For more information about the delivery model, see Configuring Application Output in the Amazon Kinesis Data Analytics for SQL Developer Guide.

Comparison to other stream processing solutions

Q: How does Amazon Kinesis Data Analytics for SQL differ from running my own application using the Amazon Kinesis Client Library?
The Amazon Kinesis Client Library (KCL) is a pre-built library that helps you build consumer applications for reading and processing data from an Amazon Kinesis data stream. The KCL handles complex issues such as adapting to changes in data stream volume, load balancing streaming data, coordinating distributed services, and processing data with fault-tolerance. The KCL enables you to focus on business logic while building applications.

With Kinesis Data Analytics, you can process and query real-time, streaming data. You use standard SQL to process your data streams, so you don’t have to learn any new programming languages. You just point Kinesis Data Analytics to an incoming data stream, write your SQL queries, and then specify where you want the results loaded. Kinesis Data Analytics uses the KCL to read data from streaming data sources as one part of your underlying application. The service abstracts this from you, as well as many of the more complex concepts associated with using the KCL, such as checkpointing.

If you want a fully managed solution and you want to use SQL to process the data from your data stream, you should use Kinesis Data Analytics. Use the KCL if you need to build a custom processing solution whose requirements are not met by Kinesis Data Analytics, and you are able to manage the resulting consumer application.