Q: What is Amazon Kinesis Data Streams?
Amazon Kinesis Data Streams is a fully managed streaming data service. You can continuously add various types of data such as clickstreams, application logs, and social media to a Kinesis stream from hundreds of thousands of sources. Within seconds, the data will be available for your Kinesis Applications to read and process from the stream.
Q: What does Amazon Kinesis Data Streams manage on my behalf?
Amazon Kinesis Data Streams manages the infrastructure, storage, networking, and configuration needed to stream your data at the level of your data throughput. You do not have to worry about provisioning, deployment, or ongoing maintenance of hardware, software, or other services for your data streams. In addition, Amazon Kinesis Data Streams synchronously replicates data across three facilities in an AWS Region, providing high availability and data durability.
Q: What can I do with Amazon Kinesis Data Streams?
Amazon Kinesis Data Streams is useful for rapidly moving data off data producers and then continuously processing the data, whether to transform it before emitting to a data store, run real-time metrics and analytics, or derive more complex data streams for further processing. The following are typical scenarios for using Amazon Kinesis Data Streams:
- Accelerated log and data feed intake: Instead of waiting to batch up the data, you can have your data producers push data to a Kinesis stream as soon as the data is produced, preventing data loss in case of data producer failures. For example, system and application logs can be continuously added to a stream and be available for processing within seconds.
- Real-time metrics and reporting: You can extract metrics and generate reports from Kinesis stream data in real-time. For example, your Kinesis Application can work on metrics and reporting for system and application logs as the data is streaming in, rather than wait to receive data batches.
- Real-time data analytics: With Amazon Kinesis Data Streams, you can run real-time streaming data analytics. For example, you can add clickstreams to your Kinesis stream and have your Kinesis Application run analytics in real time, gaining insights from your data in minutes instead of hours or days.
- Complex stream processing: You can create Directed Acyclic Graphs (DAGs) of Kinesis Applications and data streams. In this scenario, one or more Kinesis Applications can add data to another Kinesis stream for further processing, enabling successive stages of stream processing.
Q: How do I use Kinesis Streams?
After you sign up for Amazon Web Services, you can start using Kinesis Streams by:
- Creating a Kinesis stream through either the Kinesis Data Streams Management Console or the CreateStream operation.
- Configuring your data producers to continuously add data to your stream.
- Building your Kinesis Applications to read and process data from your stream, using either the Kinesis Data Streams API or the Kinesis Client Library (KCL).
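The first two steps above can be sketched with the AWS SDK for Python (boto3). The stream name, shard count, and event payload below are illustrative assumptions, and the client is passed in as a parameter (in practice, `boto3.client("kinesis")`):

```python
import json

def create_stream(client, name, shard_count):
    # Step 1: create the stream (CreateStream operation).
    client.create_stream(StreamName=name, ShardCount=shard_count)

def put_event(client, name, event, partition_key):
    # Step 2: a data producer adds one record to the stream.
    resp = client.put_record(
        StreamName=name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )
    return resp["SequenceNumber"]
```

Step 3, the consuming side, is typically handled by a KCL-based application, as described later in this FAQ.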
Q: What are the limits of Kinesis Streams?
The throughput of a Kinesis stream is designed to scale without limits by increasing the number of shards within a stream. However, there are certain limits you should keep in mind while using Amazon Kinesis Data Streams:
- By default, records of a stream are accessible for up to 24 hours from the time they are added to the stream; you can extend this period by enabling extended data retention.
- The maximum size of a data blob (the data payload before Base64-encoding) within one record is 1 megabyte (MB).
- Each shard can support up to 1000 PUT records per second.
For more information about other API level limits, see Kinesis Data Streams Limits.
Q: How does Amazon Kinesis Data Streams differ from Amazon SQS?
Amazon Kinesis Data Streams enables real-time processing of streaming big data. It provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Kinesis Applications. The Kinesis Client Library (KCL) delivers all records for a given partition key to the same record processor, making it easier to build multiple applications reading from the same Kinesis stream (for example, to perform counting, aggregation, and filtering).
Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable hosted queue for storing messages as they travel between computers. Amazon SQS lets you easily move data between distributed application components and helps you build applications in which messages are processed independently (with message-level ack/fail semantics), such as automated workflows.
Q: When should I use Amazon Kinesis Data Streams, and when should I use Amazon SQS?
We recommend Amazon Kinesis Data Streams for use cases with requirements that are similar to the following:
- Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.
- Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.
- Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data. You want both applications to consume data from the same stream concurrently and independently.
- Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Data Streams stores data for up to 24 hours by default, you can run the audit application up to 24 hours behind the billing application.
We recommend Amazon SQS for use cases with requirements that are similar to the following:
- Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.
- Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.
- Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Data Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).
- Leveraging Amazon SQS’s ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.
Key Amazon Kinesis Data Streams Concepts
A shard is the base throughput unit of a Kinesis stream. One shard provides a capacity of 1MB/sec data input and 2MB/sec data output. One shard can support up to 1000 PUT records per second. You specify the number of shards needed when you create a stream. For example, you can create a stream with two shards. This stream has a throughput of 2MB/sec data input and 4MB/sec data output, and allows up to 2000 PUT records per second. You can dynamically add or remove shards from your stream via resharding as your data throughput changes.
A record is the unit of data stored in a Kinesis stream. A record is composed of a sequence number, partition key, and data blob. The data blob is the data of interest your data producer adds to a stream. The maximum size of a data blob (the data payload before Base64-encoding) is 1 megabyte (MB).
A partition key is used to segregate and route records to different shards of a stream. A partition key is specified by your data producer while adding data to a Kinesis stream. For example, assume you have a stream with two shards (shard 1 and shard 2). You can configure your data producer to use two partition keys (key A and key B) so that all records with key A are added to shard 1 and all records with key B are added to shard 2.
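Under the hood, Kinesis maps each partition key to a 128-bit integer hash key using MD5 and routes the record to the shard whose hash key range contains it. A minimal sketch of that routing (the two-shard ranges in the usage example are illustrative):

```python
import hashlib

def hash_key(partition_key: str) -> int:
    # Kinesis derives a 128-bit hash key from the partition key via MD5.
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")

def shard_for(partition_key, shard_ranges):
    # shard_ranges: list of (shard_id, starting_hash_key, ending_hash_key), inclusive.
    hk = hash_key(partition_key)
    for shard_id, start, end in shard_ranges:
        if start <= hk <= end:
            return shard_id
    raise ValueError("hash key outside all shard ranges")
```

Because the mapping is deterministic, all records with the same partition key always land on the same shard (until the stream is resharded).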
A sequence number is a unique identifier for each record. A sequence number is assigned by Amazon Kinesis Data Streams when a data producer calls the PutRecord or PutRecords operation to add data to a Kinesis stream. Sequence numbers for the same partition key generally increase over time; the longer the time period between PutRecord or PutRecords requests, the larger the sequence numbers become.
Creating Amazon Kinesis Data Streams
Q: How do I create an Amazon Kinesis stream?
You can create a Kinesis stream through either the Kinesis Data Streams Management Console or the CreateStream operation. Once the stream's status changes from CREATING to ACTIVE, you can start adding data to it.
Q: How do I decide the throughput of my Kinesis stream?
The throughput of a Kinesis stream is determined by the number of shards within the stream. Follow the steps below to estimate the initial number of shards your stream needs. Note that you can dynamically adjust the number of shards within your stream via resharding.
- Estimate the average size of the record written to the stream in kilobytes (KB), rounded up to the nearest 1 KB. (average_data_size_in_KB)
- Estimate the number of records written to the stream per second. (number_of_records_per_second)
- Decide the number of Kinesis Applications consuming data concurrently and independently from the stream. (number_of_consumers)
- Calculate the incoming write bandwidth in KB (incoming_write_bandwidth_in_KB), which is equal to the average_data_size_in_KB multiplied by the number_of_records_per_second.
- Calculate the outgoing read bandwidth in KB (outgoing_read_bandwidth_in_KB), which is equal to the incoming_write_bandwidth_in_KB multiplied by the number_of_consumers.
You can then calculate the initial number of shards (number_of_shards) your stream needs using the following formula:
number_of_shards = max (incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB/2000)
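As a sketch, the sizing steps above translate directly into code (the values in the example below are hypothetical):

```python
import math

def number_of_shards(avg_record_kb, records_per_sec, num_consumers):
    # Steps 4-5: incoming write and outgoing read bandwidth in KB/sec.
    incoming_kb = avg_record_kb * records_per_sec
    outgoing_kb = incoming_kb * num_consumers
    # One shard supports 1000 KB/sec input and 2000 KB/sec output.
    return max(math.ceil(incoming_kb / 1000), math.ceil(outgoing_kb / 2000), 1)
```

For example, 1 KB records at 3000 records/sec with one consumer works out to max(3, 2) = 3 shards.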
Q: What is the minimum throughput I can request for my Kinesis stream?
The throughput of a Kinesis stream scales in units of shards. A single shard represents the minimum throughput of a stream: 1MB/sec data input and 2MB/sec data output.
Q: What is the maximum throughput I can request for my Kinesis stream?
The throughput of a Kinesis stream is designed to scale without limits. By default, each account can provision 10 shards per region. You can use the Kinesis Data Streams Limits form to request more than 10 shards within a single region.
Q: How can record size affect the throughput of my Kinesis stream?
A shard provides 1MB/sec data input rate and supports up to 1000 PUT records per sec. Therefore, if the record size is less than 1KB, the actual data input rate of a shard will be less than 1MB/sec, limited by the maximum number of PUT records per second.
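In other words, a shard's effective input rate is the smaller of the two per-shard caps. A quick illustration (treating 1 MB as 10^6 bytes for simplicity):

```python
def max_input_bytes_per_sec(record_size_bytes):
    # A shard caps input at 1 MB/sec AND at 1000 records/sec,
    # so small records hit the record-count cap first.
    return min(1_000_000, 1000 * record_size_bytes)
```

With 512-byte records the shard tops out at 512 KB/sec, half its nominal input bandwidth; records of 1 KB or larger reach the full 1 MB/sec.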
Adding Data to Amazon Kinesis Data Streams
Q: What is the difference between PutRecord and PutRecords?
The PutRecord operation allows a single data record within an API call, while the PutRecords operation allows multiple data records within a single API call. For more information about the PutRecord and PutRecords operations, see PutRecord and PutRecords.
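When using PutRecords, you batch records client-side before each call. A minimal chunking helper is sketched below; the 500-records-per-call cap reflects the documented PutRecords per-call limit:

```python
def batch_records(records, max_per_call=500):
    # Split a list of records into PutRecords-sized batches.
    return [records[i:i + max_per_call] for i in range(0, len(records), max_per_call)]
```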
Q: What is Kinesis Producer Library (KPL)?
The Kinesis Producer Library (KPL) is an easy-to-use, highly configurable library that helps you put data into a Kinesis stream. KPL presents a simple, asynchronous, and reliable interface that enables you to quickly achieve high producer throughput with minimal client resources.
Q: What programming languages or platforms can I use to access Amazon Kinesis Data Streams API?
The Amazon Kinesis Data Streams API is available through the AWS SDKs, which support a wide range of programming languages and platforms.
Q: What programming language is Kinesis Producer Library (KPL) available in?
The core of the Kinesis Producer Library (KPL) is built as a C++ module and can be compiled to work on any platform with a recent C++ compiler. The library is currently available with a Java interface. We are looking to add support for other programming languages.
Q: What happens if the capacity limits of a Kinesis stream are exceeded while the data producer adds data to the stream?
The capacity limits of a Kinesis stream are defined by the number of shards within the stream. The limits can be exceeded by either data throughput or the number of PUT records. While the capacity limits are exceeded, the put data call will be rejected with a ProvisionedThroughputExceeded exception. If this is due to a temporary rise in the stream’s input data rate, retries by the data producer will eventually lead to completion of the requests. If this is due to a sustained rise in the stream’s input data rate, you should increase the number of shards within your stream to provide enough capacity for the put data calls to consistently succeed. In both cases, Amazon CloudWatch metrics allow you to learn about the change in the stream’s input data rate and the occurrence of ProvisionedThroughputExceeded exceptions.
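A producer-side retry for the temporary case can be sketched as exponential backoff around the put call. The exception class here is a stand-in for illustration (boto3 actually surfaces this as a ClientError with error code ProvisionedThroughputExceededException):

```python
import time

class ProvisionedThroughputExceeded(Exception):
    # Stand-in for the service error; not the real boto3 exception type.
    pass

def put_with_retries(put_fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    # Retry the put call with exponential backoff on throughput errors.
    for attempt in range(max_attempts):
        try:
            return put_fn()
        except ProvisionedThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # sustained overload: reshard instead of retrying forever
            sleep(base_delay * (2 ** attempt))
```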
Q: What data is counted against the data throughput of a Kinesis stream during a PutRecord or PutRecords call?
Your data blob, partition key, and stream name are required parameters of a PutRecord or PutRecords call. The size of your data blob (before Base64 encoding) and partition key will be counted against the data throughput of your Kinesis stream, which is determined by the number of shards within the stream.
Q: What is enhanced fan-out?
Enhanced fan-out is an optional feature for Amazon Kinesis Data Streams consumers that provides logical 2 MB/sec throughput pipes between consumers and shards. This allows customers to scale the number of consumers reading from a data stream in parallel, while maintaining high performance.
Q: How do consumers use enhanced fan-out?
Consumers must first register themselves with the Amazon Kinesis Data Streams service. By default, consumer registration activates enhanced fan-out. If you are using the KCL, KCL version 2.x takes care of registering your consumers automatically, and uses the name of the KCL application as the consumer name. Once registered, all registered consumers will have their own logical enhanced fan-out throughput pipes provisioned for them. Then, consumers use the HTTP/2 SubscribeToShard API to retrieve data inside of these throughput pipes. The HTTP/1 GetRecords API does not currently support enhanced fan-out, so you will need to upgrade to KCL 2.x, or alternatively register your consumer and have the consumer call the SubscribeToShard API.
Q: How is enhanced fan-out utilized by a consumer?
Consumers utilize enhanced fan-out by retrieving data with the SubscribeToShard API. The name of the registered consumer is used within the SubscribeToShard API, which leads to utilization of the enhanced fan-out benefit provided to the registered consumer.
Q: When should I use enhanced fan-out?
You should use enhanced fan-out if you have, or expect to have, multiple consumers retrieving data from a stream in parallel, or if you have at least one consumer that requires the use of the SubscribeToShard API to provide sub-200ms data delivery speeds between producers and consumers.
Q: Can I have consumers using enhanced fan-out, and others not using it?
Yes, you can have multiple consumers using enhanced fan-out and others not using enhanced fan-out at the same time. The use of enhanced fan-out does not impact the limits of shards for traditional GetRecords usage.
Q: Is there a limit on the number of consumers using enhanced fan-out on a given stream?
There is a default limit of 5 consumers using enhanced fan-out per data stream. If you need more than 5, please submit a limit increase request through AWS support. Keep in mind that you can have more than 5 total consumers reading from a stream by having 5 consumers using enhanced fan-out and other consumers not using enhanced fan-out at the same time.
Q: How do consumers register to use enhanced fan-out and the HTTP/2 SubscribeToShard API?
We recommend using KCL 2.x, which will automatically register your consumer and use both enhanced fan-out and the HTTP/2 SubscribeToShard API. Otherwise, you can manually register a consumer using the RegisterStreamConsumer API and then you can use the SubscribeToShard API with the name of the consumer you registered.
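A minimal sketch of manual registration with boto3 (the stream ARN and consumer name below are placeholders, and the client is injected as a parameter):

```python
def register_consumer(client, stream_arn, consumer_name):
    # RegisterStreamConsumer enables enhanced fan-out for this consumer;
    # the returned ConsumerARN is then used with SubscribeToShard.
    resp = client.register_stream_consumer(
        StreamARN=stream_arn,
        ConsumerName=consumer_name,
    )
    return resp["Consumer"]["ConsumerARN"]
```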
Q: Is there a cost associated with the use of enhanced fan-out?
Yes, there is an on-demand hourly cost for every combination of shard in a stream and consumer (a consumer-shard hour) registered to use enhanced fan-out, in addition to a data retrieval cost for every GB retrieved. See the Amazon Kinesis Data Streams pricing page for more details.
Q: How is a consumer-shard hour calculated?
A consumer-shard hour is calculated by multiplying the number of registered stream consumers by the number of shards in the stream. For example, if a consumer-shard hour costs ¥ 0.1211, for a 10 shard data stream, this consumer using enhanced fan-out would be able to read from 10 shards, and thus incur a consumer-shard hour charge of ¥ 1.211 per hour (1 consumer x 10 shards x ¥ 0.1211 per consumer-shard hour). If there were two consumers registered for enhanced fan-out simultaneously, the total consumer-shard hour charge would be ¥ 2.422 per hour (2 consumers x 10 shards x ¥ 0.1211).
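The arithmetic is a straightforward product. The helper below reproduces the calculation (the ¥ 0.1211 rate is only an illustrative figure, not a quoted price; see the pricing page for actual rates):

```python
def enhanced_fanout_hourly_cost(num_consumers, num_shards, rate_per_consumer_shard_hour):
    # consumer-shard hours accrued per clock hour = consumers x shards.
    return num_consumers * num_shards * rate_per_consumer_shard_hour
```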
Q: Does consumer-shard hour billing for enhanced fan-out automatically prorate if I terminate or start a consumer within the hour?
Yes, you will only pay for the prorated portion of the hour the consumer was registered to use enhanced fan-out.
Q: How does billing for enhanced fan-out data retrievals work?
You pay a low per GB rate that is metered per byte of data retrieved by consumers using enhanced fan-out. There is no payload roundup or delivery minimum.
Q: Do I need to change my producers or my data stream to use enhanced fan-out?
No, enhanced fan-out can be activated without impacting data producers or data streams.
Reading and Processing Data from Amazon Kinesis Data Streams
Q: What is Kinesis Client Library (KCL)?
Kinesis Client Library (KCL) for Java | Python | Ruby | Node.js | .NET is a pre-built library that helps you easily build Kinesis Applications for reading and processing data from a Kinesis stream. KCL handles complex issues such as adapting to changes in stream volume, load balancing streaming data, coordinating distributed services, and processing data with fault tolerance. KCL enables you to focus on business logic while building applications.
Q: What is Kinesis Connector Library?
Kinesis Connector Library is a pre-built library that helps you easily integrate Kinesis Streams with other AWS services and third-party tools. Kinesis Client Library (KCL) for Java | Python | Ruby | Node.js | .NET is required for using Kinesis Connector Library. The current version of this library provides connectors to Amazon DynamoDB, Amazon S3, and Elasticsearch. The library also includes sample connectors of each type, plus Apache Ant build files for running the samples.
Q: What is Kinesis Storm Spout?
Kinesis Storm Spout is a pre-built library that helps you easily integrate Kinesis Streams with Apache Storm. The current version of Kinesis Storm Spout fetches data from a Kinesis stream and emits it as tuples. You add the spout to your Storm topology to leverage Kinesis Streams as a reliable, scalable stream capture, storage, and replay service.
Q: What programming language are Kinesis Client Library (KCL), Kinesis Connector Library, and Kinesis Storm Spout available in?
Kinesis Client Library (KCL) is currently available in Java, Python, Ruby, Node.js, and .NET. Kinesis Connector Library and Kinesis Storm Spout are currently available in Java. We are looking to add support for other programming languages.
Q: Do I have to use Kinesis Client Library (KCL) for my Kinesis Application?
No, you can also use the Kinesis Data Streams API to build your Kinesis Application. However, we recommend using Kinesis Client Library (KCL) for Java | Python | Ruby | Node.js | .NET if applicable, because it performs the heavy-lifting tasks associated with distributed stream processing, making application development more productive.
Q: How does Kinesis Client Library (KCL) interact with a Kinesis Application?
Kinesis Client Library (KCL) for Java | Python | Ruby | Node.js | .NET acts as an intermediary between Amazon Kinesis Data Streams and your Kinesis Application. KCL uses the IRecordProcessor interface to communicate with your application. Your application implements this interface, and KCL calls into your application code using the methods in this interface.
For more information about building application with KCL, see Developing Consumer Applications for Kinesis Streams Using the Kinesis Client Library.
Q: What are a worker and a record processor generated by Kinesis Client Library (KCL)?
A Kinesis Application can have multiple application instances, and a worker is the processing unit that maps to each application instance. A record processor is the processing unit that processes data from a shard of a Kinesis stream. One worker maps to one or more record processors. One record processor maps to one shard and processes records from that shard.
At startup, an application calls into Kinesis Client Library (KCL) for Java | Python | Ruby | Node.js | .NET to instantiate a worker. This call provides KCL with configuration information for the application, such as the stream name and AWS credentials. This call also passes a reference to an IRecordProcessorFactory implementation. KCL uses this factory to create new record processors as needed to process data from the stream. KCL communicates with these record processors using the IRecordProcessor interface.
Q: How does Kinesis Client Library (KCL) keep track of data records being processed by a Kinesis Application?
Kinesis Client Library (KCL) for Java | Python | Ruby | Node.js | .NET automatically creates an Amazon DynamoDB table for each Kinesis Application to track and maintain state information such as resharding events and sequence number checkpoints. The DynamoDB table shares its name with the application, so make sure your application name doesn’t conflict with any existing DynamoDB tables under the same account within the same region.
All workers associated with the same application name are assumed to be working together on the same Kinesis stream. If you run an additional instance of the same application code, but with a different application name, KCL treats the second instance as an entirely separate application also operating on the same stream.
Please note that your account will be charged for the costs associated with the Amazon DynamoDB table in addition to the costs associated with Amazon Kinesis Data Streams.
For more information about how KCL tracks application state, see Tracking Kinesis Application state.
Q: How can I automatically scale up the processing capacity of my Kinesis Application using Kinesis Client Library (KCL)?
You can create multiple instances of your Kinesis Application and have these application instances run across a set of Amazon EC2 instances that are part of an Auto Scaling group. As processing demand increases, additional Amazon EC2 instances running your application are launched automatically. Kinesis Client Library (KCL) for Java | Python | Ruby | Node.js | .NET will generate a worker for each new instance and automatically move record processors from overloaded existing instances to the new instance.
Q: Why does a GetRecords call return an empty result even though there is data within my Kinesis stream?
One possible reason is that there is no record at the position specified by the current shard iterator. This can happen even if you are using TRIM_HORIZON as the shard iterator type. A Kinesis stream represents a continuous stream of data. You should call the GetRecords operation in a loop; the record will be returned when the shard iterator advances to the position where the record is stored.
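Such a polling loop might look like the following sketch (the client is injected, and the stream and shard names are placeholders). Empty batches are expected and simply mean "keep iterating":

```python
def read_shard(client, stream, shard_id, max_batches=10):
    # Start from the oldest available record in the shard.
    it = client.get_shard_iterator(
        StreamName=stream,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    records = []
    for _ in range(max_batches):
        resp = client.get_records(ShardIterator=it, Limit=100)
        records.extend(resp["Records"])      # may be empty; not an error
        it = resp.get("NextShardIterator")
        if it is None:                       # shard has been closed
            break
    return records
```

A production consumer would checkpoint its position and loop indefinitely rather than stopping after a fixed number of batches; the KCL handles this for you.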
Q: What happens if the capacity limits of a Kinesis stream are exceeded while Kinesis Application reads data from the stream?
The capacity limits of a Kinesis stream are defined by the number of shards within the stream. The limits can be exceeded by either data throughput or the number of read data calls. While the capacity limits are exceeded, the read data call will be rejected with a ProvisionedThroughputExceeded exception. If this is due to a temporary rise in the stream’s output data rate, retries by the Kinesis Application will eventually lead to completion of the requests. If this is due to a sustained rise in the stream’s output data rate, you should increase the number of shards within your stream to provide enough capacity for the read data calls to consistently succeed. In both cases, Amazon CloudWatch metrics allow you to learn about the change in the stream’s output data rate and the occurrence of ProvisionedThroughputExceeded exceptions.
Managing Amazon Kinesis Data Streams
Q: How do I change the throughput of my Kinesis stream?
You can change the throughput of a Kinesis stream by adjusting the number of shards within the stream (resharding). There are two types of resharding operations: shard split and shard merge. In a shard split, a single shard is divided into two shards, which increases the throughput of the stream. In a shard merge, two shards are merged into a single shard, which decreases the throughput of the stream. For more information about Kinesis Streams resharding, see Resharding a Stream.
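For example, an even shard split passes the midpoint of the parent shard's hash key range as NewStartingHashKey. A hedged sketch with boto3 (client injected; the shard dict is shaped like DescribeStream output):

```python
def even_split_hash_key(start_hash_key: int, end_hash_key: int) -> int:
    # Midpoint of the parent range; each child shard gets roughly half the keys.
    return (start_hash_key + end_hash_key) // 2

def split_evenly(client, stream_name, shard):
    start = int(shard["HashKeyRange"]["StartingHashKey"])
    end = int(shard["HashKeyRange"]["EndingHashKey"])
    client.split_shard(
        StreamName=stream_name,
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str(even_split_hash_key(start, end)),
    )
```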
Q: How often can I and how long does it take to change the throughput of my Kinesis stream?
A resharding operation such as a shard split or shard merge takes a few seconds, and you can perform only one resharding operation at a time. Therefore, for a Kinesis stream with only one shard, it takes a few seconds to double the throughput by splitting that shard. For a stream with 1000 shards, doubling the throughput requires 1000 sequential splits, which takes about 30K seconds (8.3 hours). We recommend increasing the throughput of your stream ahead of the time when extra throughput is needed.
Q: Does Amazon Kinesis Data Streams remain available when I change the throughput of my Kinesis stream via resharding?
Yes. You can continue adding data to and reading data from your Kinesis stream while resharding is in progress.
Q: How do I monitor the operations and performance of my Kinesis stream?
Amazon Kinesis Data Streams Management Console displays key operational and performance metrics such as throughput of data input and output of your Kinesis streams. Kinesis Streams also integrates with Amazon CloudWatch so that you can collect, view, and analyze CloudWatch metrics for your streams. For more information about Kinesis Streams metrics, see Monitoring Kinesis Data Streams with Amazon CloudWatch.
Q: How do I manage and control access to my Kinesis stream?
Kinesis Streams integrates with AWS Identity and Access Management (IAM), a service that enables you to securely control access to your AWS services and resources for your users. For example, you can create a policy that only allows a specific user or group to add data to your Kinesis stream. For more information about access management and control of your stream, see Controlling Access to Kinesis Data Streams Resources using IAM.
Q: How do I log API calls made to my Kinesis stream for security analysis and operational troubleshooting?
Amazon Kinesis Data Streams integrates with Amazon CloudTrail, a service that records AWS API calls for your account and delivers log files to you. For more information about API call logging and a list of supported Amazon Kinesis Data Streams API operations, see Logging Kinesis Data Streams API calls Using Amazon CloudTrail.
Q: How do I effectively manage my Kinesis streams and the costs associated with these streams?
Amazon Kinesis Data Streams allows you to tag your Kinesis streams for easier resource and cost management. A tag is a user-defined label expressed as a key-value pair that helps organize AWS resources. For example, you can tag your streams by cost centers so that you can categorize and track your Amazon Kinesis Data Streams costs based on cost centers. For more information about Amazon Kinesis Data Streams tagging, see Tagging Your Kinesis Streams.
Pricing and Billing
Q: How much does Amazon Kinesis Data Streams cost?
Amazon Kinesis Data Streams uses simple pay-as-you-go pricing. There are no upfront costs or minimum fees, and you only pay for the resources you use. The cost of Amazon Kinesis Data Streams has two core dimensions, plus two optional ones:
- Hourly Shard cost determined by the number of shards within your Kinesis stream.
- PUT Payload Unit cost determined by the number of 25KB payload units that your data producers add to your stream.
- Extended data retention is an optional cost determined by the number of shard hours incurred by your data stream. When extended data retention is enabled, you pay the extended retention rate for each shard in your stream.
- Enhanced fan-out is an optional cost with two cost dimensions: consumer-shard hours and data retrievals. Consumer-shard hours reflect the number of shards in a stream multiplied by the number of consumers using enhanced fan-out. Data retrievals are determined by the number of GBs delivered to consumers using enhanced fan-out.
For more information about Amazon Kinesis Data Streams costs, sign into the Billing Console.
Q: Does my PUT Payload Unit cost change by using PutRecords operation instead of PutRecord operation?
PUT Payload Unit charge is calculated based on the number of 25KB payload units added to your Kinesis stream. PUT Payload Unit cost is consistent when using PutRecords operation or PutRecord operation.
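The per-record charge can be sketched as rounding the payload up to 25KB units. Note that 25 KB is taken as 25 x 1024 bytes here for illustration; the exact unit definition is on the pricing page:

```python
import math

def put_payload_units(record_size_bytes):
    # Each record is billed as ceil(size / 25 KB) payload units,
    # whether it is sent via PutRecord or PutRecords.
    return math.ceil(record_size_bytes / (25 * 1024))
```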
Q: Other than Amazon Kinesis Data Streams costs, are there any other costs that I might incur with my Kinesis Data Streams usage?
Kinesis Client Library (KCL) uses an Amazon DynamoDB table to track state information for record processing. If you use KCL for your Kinesis Applications, you will be charged for Amazon DynamoDB resources in addition to Amazon Kinesis Data Streams costs.
Please note that this is a common but not exhaustive case; other costs may apply depending on your architecture.