It is entirely possible that what I am going to describe here is an edge case not many people hit with their Kafka deployments. However, in my experience, when Kafka is used to ingest large volumes of data, it makes perfect sense. Considering that every now and then people ask for a cold storage feature on the Kafka mailing list, I am not the only one who would find this useful.
According to the Kafka website: Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. It is fast: it can handle hundreds of megabytes of data from thousands of clients. It is durable: Kafka persists the data on disk and provides data replication to ensure no data loss. It is scalable: Kafka systems can be scaled without a hassle.
To understand where the need for cold storage comes from, it is important to understand how Kafka system works and what trade-offs a Kafka based system needs to deal with.
Kafka stores all data on disk. The servers running Kafka are limited by the amount of storage they can handle. It is always an arbitrary number defined by the amount of storage available to the single Kafka server. Single Kafka node within the cluster may be responsible for one or more topics. Each topic is stored in files called segments. All segments of a topic have an index file describing which parts of the queue reside in which segment. The segment size is configurable. The total amount of data kept for a topic is either the cumulative size of the topic or the oldest message to be kept. For example, a value of two weeks or 500GB of data; depending on which is reached first, that value will be used to determine how much data to keep. If a segment size for such topic is 1GB, the maximum of 500 of segments will be kept.
The process presented above applies to a regular topic. This kind of topic stores all the data in sequence. Kafka also offers compacted topics. These are more like key / value store.
The process of cleaning up old segments is called log compaction. In case of regular topics, log compaction will remove all segments falling out of range of data to be kept. In case of compacted topics, only the most recent value for a given key is kept.
The cold storage case applies only to regular topics.
The problem with regular topics, as they are implemented today in Kafka, is that, it is not possible to move excess segments out from under Kafka management without restarting Kafka broker.
Consider a data processing system using Kafka ingesting hundreds of megabytes of data every second. Such system will have a certain number of servers, call them terminators, used by millions of devices / applications as a connection point. These servers accept the data, they don’t do any processing, just put the data in Kafka. The volume of data is so large that the system needs a certain trade-off: how much data should be kept in Kafka and what to do with historical data? Historical data is important. It needs to be stored somewhere in a format ready for post-processing in case of, for example, having to introduce a new feature during the life-time of a product.
Today, most of the implementations solve this problem in the following way: the raw data is ingested with Kafka. Some processes work with that data by reading Kafka topics and crunching for real-time dashboards and whatnot. There is also a process somewhere which exists only for the purpose of reading all ingested data out of Kafka and putting it in the raw format in external storage. S3 or Swift Object Store come to mind.
There are two drawbacks of such solution. First: the storage of the raw data is basically another format. It needs naming rules to know how to access the data in the future, compression, transport mechanism, verification, replication. Second: the data is already in Kafka, so why the need for consuming it out for Kafka, putting the load on the system and using additional processing power for applying compression and moving the data out to external storage?
Kafka comes with a feature which allows it to build the index file from an arbitrary segment file. What this means is: instead of consuming the topic and uploading raw data to external storage, it is possible to move a segment file as it is persisted on disk. The data in the topic is still stored in its raw format, albeit inside of a segment file. The advantage of storing segments files is such that there is no additional cost of consuming for cold storage purpose and no cost in feeding back to Kafka for reprocessing. One can simply download the segment file and use Kafka libraries — no need to run Kafka cluster at all — to read data out. Such segment files can be processed in parallel with Storm, Spark, Hadoop, or any other sufficient tool.
Nothing stops anyone from doing this today. The simplest way is to have a program running on the Kafka machine which would check when the old segments are closed, copy them to external storage and let Kafka simply deal with old segments as it does now. This is, however, another one of those “roll it out yourself” approaches. Kafka could help. There are two possible options I can think of.
First approach: if Kafka provided a notification mechanism and could trigger a program when a segment file is to be discarded, it would become feasible to provide a standard method of moving data to cold storage in reaction to those events. Once the program finishes backing the segments up, it could tell Kafka “it is now safe to delete these segments”.
The second option is to provide an additional value for the log.cleanup.policy setting, call it cold-storage. In case of this value, Kafka would move the segment files — which otherwise would be deleted — to another destination on the server. They can be picked up from there and moved to the cold storage.
Both of these options ensure that Kafka data can be archived without having to interfere with Kafka process itself. Considering data replication features in Kafka, the former method seems more plausible. It would free the cluster operator from having to track file system changes. Furthermore, it could be implemented in such a way that, if there are any listeners waiting for those events for a given topic, Kafka switches to this mode automatically. It does put a responsibility on the operator to ensure flawless execution of the archive process but it is an opt-in mechanism — the operator is aware of the behavior.
A well defined, standardized method for moving Kafka segments to the cold storage would significantly improve the availability and processing of historical data.