Transaction processing is a way of computing that divides work into individual, indivisible operations, called transactions.[1] A transaction processing system (TPS) is a software system, or software/hardware combination, that supports transaction processing.
History
The first transaction processing system was SABRE, made by IBM for American Airlines, which became operational in 1970. Designed to process up to 83,000 transactions a day, the system ran on two IBM 7090 computers. SABRE was migrated to IBM System/360 computers in 1972, and became an IBM product first as the Airline Control Program (ACP) and later as the Transaction Processing Facility (TPF). In addition to airlines, TPF is used by large banks, credit card companies, and hotel chains.
The Hewlett-Packard NonStop system (formerly Tandem NonStop), introduced in 1976, was a hardware and software system designed for online transaction processing (OLTP). The systems were designed for transaction processing and provided an extreme level of availability and data integrity.
List of transaction processing systems
- IBM Transaction Processing Facility (TPF) – 1960. Unlike most other transaction processing systems, TPF is a dedicated operating system for transaction processing on IBM System z mainframes. It was originally the Airline Control Program (ACP).
- IBM Information Management System (IMS) – 1966. A joint hierarchical database and information management system with extensive transaction processing capabilities. Runs on OS/360 and successors.
- IBM Customer Information Control System (CICS) – 1969. A transaction manager designed for rapid, high-volume online processing, CICS originally used standard system datasets, but now has a connection to IBM's DB2 relational database system. Runs on OS/360 and successors, DOS/360 and successors, IBM AIX, VM, and OS/2. Non-mainframe versions are called TXSeries.
- Tuxedo – 1980s. Transactions for Unix, Extended for Distributed Operations developed by AT&T Corporation, now owned by Oracle Corporation. Tuxedo is a cross-platform TPS.
- UNIVAC Transaction Interface Package (TIP) – 1970s. A transaction processing monitor for UNIVAC 1100/2200 series computers.[2]
- Burroughs Corporation supported transaction processing capabilities in its MCP operating systems using GEMCOS (Generalized Message Control System, 1980). As of 2012, UNISYS ClearPath Enterprise Servers include Transaction Server, 'an extremely flexible, high-performance message and application control system.'[3]
- Digital Equipment Corporation (DEC) Application Control and Management System (ACMS) – 1985. 'Provides an environment for creating and controlling online transaction processing (OLTP) applications on the VMS operating system.'[4][5] Runs on VAX/VMS systems.
- Digital Equipment Corporation (DEC) Message Control System (MCS-10) for PDP-10 TOPS-10 systems.
- Honeywell Multics Transaction Processing Feature (TP) – 1979.[6]
- Transaction Management eXecutive (TMX) was NCR Corporation's proprietary transaction processing system running on NCR Tower 5000-series systems. This system was used mainly by financial institutions in the 1980s and 1990s.
- Hewlett-Packard NonStop system – 1976. NonStop is an integrated hardware and software system specifically designed for transaction processing. Originally from Tandem Computers.
- Transarc Encina – 1991.[7] Transarc was purchased by IBM in 1994. Encina was discontinued as a product and folded into IBM's TXSeries.[8] Encina support was discontinued in 2006.
Processing types
Transaction processing is distinct from and can be contrasted with other computer processing models, such as batch processing, time-sharing, and real-time processing.[9]
Batch processing
Batch processing is the execution of a series of programs (jobs) on a computer without manual intervention. Several transactions, called a batch, are collected and processed at the same time. The results of each transaction are not immediately available when the transaction is being entered;[1] there is a time delay.
Real-time processing
'Real time systems attempt to guarantee an appropriate response to a stimulus or request quickly enough to affect the conditions that caused the stimulus.'[9] Each transaction in real-time processing is unique; it is not part of a group of transactions.
Transaction processing
A transaction processing system (TPS) is a type of information system that collects, stores, modifies and retrieves the data transactions of an enterprise. Transaction processing systems also attempt to provide predictable response times to requests, although this is not as critical as for real-time systems. Rather than allowing the user to run arbitrary programs as in time-sharing, transaction processing allows only predefined, structured transactions. Each transaction is usually of short duration, and the processing activity for each transaction is programmed in advance.
Transaction processing system features
The following features are considered important in evaluating transaction processing systems.[9]
Performance
Fast performance with a rapid response time is critical. Transaction processing systems are usually measured by the number of transactions they can process in a given period of time.
Continuous availability
The system must be available during the time period when the users are entering transactions. Many organizations rely heavily on their TPS; a breakdown will disrupt operations or even stop the business.
Data integrity
The system must be able to handle hardware or software problems without corrupting data. Multiple users must be protected from attempting to change the same piece of data at the same time; for example, two operators must not be able to sell the same seat on an airplane.
Ease of use
Often users of transaction processing systems are casual users. The system should be simple for them to understand, protect them from data-entry errors as much as possible, and allow them to easily correct their errors.
Databases for transaction processing
The following features are desirable in a database system used in transaction processing systems:
- Good data placement: The database should be designed to support the access patterns of many simultaneous users.
- Short transactions: Short transactions enable quick processing, reduce contention among concurrent users, and keep the system responsive.
- Real-time backup: Backups should be scheduled during periods of low activity to avoid slowing down the server.
- High normalization: This lowers redundant information, increasing speed and improving concurrency; it also improves backups.
- Archiving of historical data: Uncommonly used data are moved into other databases or backed up tables. This keeps tables small and also improves backup times.
- Good hardware configuration: Hardware must be able to handle many users and provide quick response times.
Backup procedures
A Dataflow Diagram of backup and recovery procedures
Since business organizations have become very dependent on transaction processing, a breakdown may disrupt the business' regular routine and stop its operation for a certain amount of time. In order to prevent data loss and minimize disruptions there have to be well-designed backup and recovery procedures. The recovery process can rebuild the system when it goes down.
Recovery process
A TPS may fail for many reasons such as system failure, human errors, hardware failure, incorrect or invalid data, computer viruses, software application errors or natural or man-made disasters. As it's not possible to prevent all failures, a TPS must be able to detect and correct errors when they occur and cope with failures. A TPS will go through a recovery of the database which may involve the backup, journal, checkpoint, and recovery manager:
- Journal: A journal maintains an audit trail of transactions and database changes. Two kinds of log are used: a transaction log records all the essential data for each transaction, including data values, time of transaction and terminal number, while a database change log contains before and after copies of records that have been modified by transactions.
- Checkpoint: The purpose of checkpointing is to provide a snapshot of the data within the database. A checkpoint, in general, is any identifier or other reference that identifies the state of the database at a point in time. Modifications to database pages are performed in memory and are not necessarily written to disk after every update. Therefore, the database system must periodically perform a checkpoint to write these in-memory updates to the storage disk. Writing these updates to disk creates a point in time from which the database system can apply the changes contained in a transaction log during recovery after an unexpected shutdown or crash. If a checkpoint is interrupted and a recovery is required, then the database system must start recovery from a previous successful checkpoint. Checkpointing can be either transaction-consistent or non-transaction-consistent (also called fuzzy checkpointing). Transaction-consistent checkpointing produces a persistent database image that is sufficient to recover the database to the state that was externally perceived at the moment the checkpoint was started: all modifications made by transactions committed at that moment are fully present, though the image does not necessarily include transactions committed later. Non-transaction-consistent checkpointing results in a persistent database image that is insufficient on its own to perform a recovery of the database state; additional information is needed, typically contained in transaction logs, since such a checkpoint cannot be recovered to a consistent database without all log records generated for transactions open at checkpoint time. Depending on the type of database management system implemented, a checkpoint may incorporate storage pages (user data), indexes, or both. If no indexes are incorporated into the checkpoint, indexes must be created when the database is restored from the checkpoint image.
- Recovery Manager: A recovery manager is a program which restores the database to a correct condition which allows transaction processing to be restarted.
Depending on how the system failed, two different recovery procedures can be used. Generally, the procedure involves restoring data that has been collected from a backup device and then running the transaction processing again. The two types of recovery are backward recovery and forward recovery:
- Backward recovery: used to undo unwanted changes to the database. It reverses the changes made by transactions which have been aborted.
- Forward recovery: starts with a backup copy of the database. Transactions recorded in the journal between the time the backup was made and the present are then reprocessed, as in the sketch below.
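To make the two procedures concrete, here is a minimal, illustrative sketch of forward recovery in Java. The in-memory map stands in for the database and the record type for journal entries; no real database product's API is used.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ForwardRecoverySketch {
    // After-images of committed changes, as a database change log would hold.
    record JournalEntry(long timestamp, String key, String afterImage) {}

    // Restore the last full backup, then redo journaled changes
    // committed after the backup was taken.
    static Map<String, String> recover(Map<String, String> backupImage,
                                       long backupTimestamp,
                                       List<JournalEntry> journal) {
        Map<String, String> db = new HashMap<>(backupImage); // restore backup
        for (JournalEntry e : journal) {
            if (e.timestamp() > backupTimestamp) {
                db.put(e.key(), e.afterImage()); // redo the committed change
            }
        }
        return db;
    }

    public static void main(String[] args) {
        Map<String, String> backup = Map.of("seat-12A", "free");
        List<JournalEntry> journal =
                List.of(new JournalEntry(100, "seat-12A", "sold"));
        System.out.println(recover(backup, 50, journal)); // {seat-12A=sold}
    }
}
```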
Types of back-up procedures
There are two main types of back-up procedures: grandfather-father-son and partial backups:
Grandfather-father-son
This procedure involves taking complete backups of all data at regular intervals – daily, weekly, monthly, or whatever is appropriate. Multiple generations of backup are retained, often three, which gives rise to the name. The most recent backup is the son, the previous one the father, and the oldest backup is the grandfather. This method is commonly used for a batch transaction processing system with magnetic tape. If the system fails during a batch run, the master file is recreated by restoring the son backup and then restarting the batch. However, if the son backup fails, is corrupted or destroyed, then the previous generation of backup (the father) is used. Likewise, if that fails, then the generation of backup previous to the father (i.e. the grandfather) is required. Of course, the older the generation, the more out of date the data may be.
Partial backups
A partial backup involves copying only the records that have changed since the last backup. For example, a full backup could be performed weekly, and partial backups taken nightly. Recovery using this scheme involves restoring the last full backup and then restoring all partial backups in order, to produce an up-to-date database. Taking partial backups is quicker than taking only complete backups, at the expense of longer recovery time.
Backup plus journal
This technique is also used in conjunction with regular complete backups. The master file is backed up at regular intervals. Completed transactions since the last backup are stored separately and are called journals, or journal files. The master file can be recreated by restoring the last complete backup and then reprocessing transactions from the journal files. This will produce the most up-to-date copy of the database, but recovery may take longer because of the time required to process a volume of journal records.
Advantages
- Batch or real-time processing available.
- Reduction in processing time, lead time and order cycle time.
- Reduction in inventory, personnel and ordering costs.
- Increase in productivity and customer satisfaction.
References
- ^ IBM Corporation. 'CICS Transaction Server for z/OS, Version 3.2 Transaction processing'. Retrieved November 12, 2012.
- ^ 'Terminals Help Manage Aluminum Firm's Production'. Computerworld. July 26, 1976. Retrieved November 14, 2012.
- ^ UNISYS Corporation (2012). Transaction Server for ClearPath MCP Configuration Guide (PDF).
- ^ Digital Equipment Corporation (1989). VAX ACMS Guide to Creating Transaction Processing Applications.
- ^ Bell, Gordon. 'Digital Computing Timeline (1985)'. Retrieved November 15, 2012.
- ^ Van Vleck, Thomas. 'Multics Glossary -T-'. Retrieved November 15, 2012.
- ^ Transarc. 'Corporate Overview'. Archived from the original on February 3, 1999. Retrieved November 16, 2012.
- ^ IBM Corporation. 'TXSeries for Multiplatforms'. Retrieved November 16, 2012.
- ^ a b c Schuster, Stewart A. (June 15, 1981). 'In Depth: Relational Data Base Management'. Computerworld. Retrieved November 16, 2012.
Further reading
- Gerhard Weikum, Gottfried Vossen, Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery, Morgan Kaufmann, 2002, ISBN 1-55860-508-8
According to a recent report by IBM Marketing Cloud, “90 percent of the data in the world today has been created in the last two years alone, creating 2.5 quintillion bytes of data every day – and with new devices, sensors and technologies emerging, the data growth rate will likely accelerate even more”.
Technically, this means our Big Data processing world is going to be more complex and more challenging. And a lot of use cases (e.g. mobile app ads, fraud detection, cab booking, patient monitoring, etc.) need data processing in real-time, as and when data arrives, in order to make quick actionable decisions. This is why distributed stream processing has become very popular in the Big Data world.
Today there are a number of open source streaming frameworks available. Interestingly, almost all of them are quite new and have been developed only in the last few years. So it is quite easy for a newcomer to get confused in understanding and differentiating among streaming frameworks. In this post I will first talk about types and aspects of stream processing in general, and then compare the most popular open source streaming frameworks: Flink, Spark Streaming, Storm, Kafka Streams and Samza. I will try to explain (very briefly) their use cases, strengths, limitations, similarities and differences. The objective is that by the end of this article, you should have a better understanding of the state of the streaming world in the open source landscape today. Let's begin.
What is Streaming/Stream Processing :
The most simplistic definition I found was : a type of data processing engine that is designed with infinite data sets in mind. Nothing more.
Unlike batch processing, where data is bounded with a start and an end in a job and the job finishes after processing that finite data, streaming is meant for processing unbounded data coming in real-time, continuously, for days, months, years and forever. As such a streaming application is always meant to be up and running, it is a challenge to implement and maintain.
Important Aspects of Stream Processing:
There are some important characteristics and terms associated with stream processing which we should be aware of in order to understand the strengths and limitations of any streaming framework:
- Delivery Guarantees : This means the guarantee that, no matter what, a particular incoming record in a streaming engine will be processed. It can be one of 3 types: Atleast-once (the record will be processed at least one time even in case of failures), Atmost-once (the record may not be processed in case of failures) or Exactly-once (the record will be processed one and exactly one time even in case of failures). Obviously Exactly-once is desirable but hard to achieve in distributed systems, and it always comes with some tradeoff in performance.
- Fault Tolerance : In case of failures like node failures, network failures, etc., the framework should be able to recover and restart processing from the point where it left off. This is achieved by checkpointing the state of the streaming job to some persistent storage from time to time. For example, while processing data from Kafka, one can checkpoint the Kafka offsets to Zookeeper after a record has been processed; on failure, processing starts again from the checkpointed offsets (see the consumer sketch after this list).
- State Management : For stateful processing requirements, where we need to maintain some state (e.g. counts of each distinct word seen in the records), the framework should provide some mechanism to preserve and update state information.
- Performance : This includes latency (how soon a record can be processed), throughput (records processed per second) and scalability. Latency should be as low as possible and throughput as high as possible. It is difficult to achieve both at the same time, so the effort should be to strive for both and find a sweet balance.
- Advanced Features : (Event Time Processing, Watermarks, Windowing) These are features needed if stream processing requirements are complex, for example processing records based on the time when they were generated at the source (event time processing). To know more in detail, I highly recommend going through these must-read posts by Tyler Akidau of Google: part1 and part2. They contain the essence of where stream processing is headed.
- Maturity : Important from an adoption point of view; it is highly desirable that the framework is already proven and battle-tested at scale by big companies, as that makes it more likely to get good community support and help on StackOverflow.
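To make the fault-tolerance and delivery-guarantee points concrete, here is a minimal sketch of at-least-once processing with the plain Kafka consumer API. The topic name and the processing logic are illustrative placeholders; the pattern is simply "commit offsets only after processing", so a crash replays unacknowledged records instead of losing them.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        props.put("enable.auto.commit", "false"); // we checkpoint offsets manually
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // illustrative topic name
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());
                }
                // Commit only after the batch is processed: a crash before this
                // line replays the records (at-least-once), never drops them.
                consumer.commitSync();
            }
        }
    }

    private static void process(String value) { // placeholder business logic
        System.out.println("processed: " + value);
    }
}
```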
Two Types of Stream Processing:
Now being aware of the terms we just discussed, it is now easy to understand that there are 2 approaches to implement a Streaming framework:
- Native Streaming : Also known as native stream processing. It means every incoming record is processed as soon as it arrives, without waiting for others. There are some continuously running processes (which we call operators/tasks/bolts depending upon the framework) which run forever, and every record passes through these processes to get processed. Examples: Storm, Flink, Kafka Streams, Samza.
- Micro-batching : Also known as fast batching. It means incoming records are batched together every few seconds and then processed in a single mini batch, with a delay of a few seconds. Examples: Spark Streaming, Storm-Trident.
Both approaches have some advantages and disadvantages.
Native streaming feels natural, as every record is processed as soon as it arrives, allowing the framework to achieve the minimum latency possible. But it also means that it is hard to achieve fault tolerance without compromising throughput, as for each record we need to track and checkpoint once processed. On the other hand, state management is easy, as there are long-running processes which can maintain the required state easily.
Micro-batching, on the other hand, is quite the opposite. Fault tolerance comes for free, as each mini batch is essentially a batch job, and throughput is also high, as processing and checkpointing are done in one shot for a group of records. But this comes at some cost in latency, and it does not feel like natural streaming. Efficient state management is also a challenge to maintain.
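As a rough, framework-independent illustration of the micro-batching idea (deliberately simplified, not taken from any real engine): buffer incoming records for a fixed interval, then process the whole buffer in one shot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class MicroBatchSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> incoming = new LinkedBlockingQueue<>();
        incoming.addAll(List.of("a", "b", "c", "d")); // stand-in for a live source

        long batchIntervalMs = 1000; // each record waits up to ~1s before processing
        while (!incoming.isEmpty()) {
            List<String> batch = new ArrayList<>();
            long deadline = System.currentTimeMillis() + batchIntervalMs;
            String record;
            while (System.currentTimeMillis() < deadline
                    && (record = incoming.poll(50, TimeUnit.MILLISECONDS)) != null) {
                batch.add(record); // buffer until the interval elapses
            }
            if (!batch.isEmpty()) {
                // Processing and checkpointing would happen once per batch here,
                // which is why throughput is high but per-record latency suffers.
                System.out.println("processing batch: " + batch);
            }
        }
    }
}
```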
Streaming Frameworks One By One:
Storm :
Storm is the Hadoop of the streaming world. It is the oldest open source streaming framework and one of the most mature and reliable ones. It is true streaming and is good for simple event-based use cases. I have shared details about Storm at length in these posts: part1 and part2.
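For a flavor of the programming model, here is a minimal Storm topology sketch, assuming Storm 2.x APIs: a toy spout emits events and a bolt processes each one individually, with no batching anywhere.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class AlertTopologySketch {
    // Toy source: emits one event every 100 ms.
    public static class EventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        @Override
        public void open(Map<String, Object> conf, TopologyContext ctx,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }
        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("sensor-reading"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("event"));
        }
    }

    // Each tuple is handled as soon as it arrives: true per-record streaming.
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("event: " + input.getStringByField("event"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout());
        builder.setBolt("printer", new PrinterBolt(), 2) // two parallel tasks
               .shuffleGrouping("events");
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("alert-demo", new Config(), builder.createTopology());
            Utils.sleep(10_000); // run locally for ten seconds, then shut down
        }
    }
}
```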
Advantages:
- Very low latency, true streaming, mature, and high throughput
- Excellent for non-complicated streaming use cases
Disadvantages
- No implicit support for state management
- No advanced features like event time processing, aggregation, windowing, sessions, watermarks, etc.
- Atleast-once guarantee only
Spark Streaming :
Spark has emerged as the true successor of Hadoop in batch processing and the first framework to fully support the Lambda Architecture (where both batch and streaming are implemented; batch for correctness, streaming for speed). It is immensely popular, mature and widely adopted. Spark Streaming comes for free with Spark, and it uses micro-batching for streaming. Before the 2.0 release, Spark Streaming had some serious feature and performance limitations, but with the 2.0+ releases it has come in a new avatar called Structured Streaming, equipped with many good features like custom memory management (along similar lines to Flink) called Tungsten, watermarks, event time processing support, etc. Structured Streaming is also much more abstract, and with the 2.3.0 release there is an option to switch between micro-batching and a continuous streaming mode. Continuous streaming mode promises sub-second latency like Storm and Flink, but it is still at an early stage with many limitations in supported operations (watermarks, group by, etc. are not supported), so it suits simple streaming use cases, like those of Storm, with very low latency requirements.
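Here is a minimal Structured Streaming sketch showing event-time windowing with a watermark, written against the Java API. The socket source, port, and window/watermark durations are illustrative choices, not recommendations.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("windowed-counts").master("local[*]").getOrCreate();

        // Socket source for demo purposes; includeTimestamp adds an
        // arrival-time "timestamp" column that we treat as event time here.
        Dataset<Row> lines = spark.readStream()
                .format("socket")
                .option("host", "localhost").option("port", 9999)
                .option("includeTimestamp", true)
                .load();

        Dataset<Row> counts = lines
                .withWatermark("timestamp", "10 minutes")       // tolerate late data
                .groupBy(window(col("timestamp"), "5 minutes"), // event-time window
                         col("value"))
                .count();

        StreamingQuery query = counts.writeStream()
                .outputMode("update")
                .format("console")
                .start();
        query.awaitTermination();
    }
}
```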
Advantages:
- Supports Lambda architecture, comes free with Spark
- High throughput; good for many use cases where sub-second latency is not required
- Fault tolerance by default due to micro-batch nature
- Simple to use higher level APIs
- Big community and aggressive improvements
- Exactly Once
Disadvantages
- Not true streaming, not suitable for low latency requirements
- Too many parameters to tune, and hard to get right. I have written a post on my personal experience while tuning Spark Streaming
- Stateless by nature
- Lags behind Flink in many advanced features
Flink :
Flink is from a similar academic background to Spark: while Spark came from UC Berkeley, Flink came from TU Berlin. Like Spark, it also supports the Lambda architecture. But the approach and implementation are quite different from Spark's. While Spark is essentially a batch engine, with Spark Streaming as micro-batching, a special case of Spark batch, Flink is essentially a true streaming engine, treating batch as a special case of streaming with bounded data. Though the APIs of both frameworks look similar from a developer's point of view, they have no similarity in implementation. In Flink, each function like map, filter, reduce, etc. is implemented as a long-running operator (similar to a Bolt in Storm).
Flink looks like a true successor to Storm, just as Spark succeeded Hadoop in batch. A minimal sketch of this operator model follows.
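A minimal Flink DataStream word count in Java: each chained operation below (flatMap, keyBy, sum) becomes a long-running operator processing records one at a time. The socket source is an illustrative stand-in for a real source such as Kafka.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : line.split("\\s+")) {
                   out.collect(Tuple2.of(word, 1));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT)) // lambdas lose generics
           .keyBy(t -> t.f0)  // partition the stream by word
           .sum(1)            // rolling per-word count, kept as operator state
           .print();

        env.execute("streaming-word-count");
    }
}
```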
Advantages:
- Leader of innovation in open source Streaming landscape
- First true streaming framework with all advanced features like event time processing, watermarks, etc.
- Low latency with high throughput, configurable according to requirements
- Auto-adjusting, not too many parameters to tune
- Exactly Once
- Getting widely adopted by big companies at scale, like Uber and Alibaba.
Disadvantages
- A little late to the game; there was a lack of adoption initially
- Community is not as big as Spark's, but growing at a fast pace now
- No known adoption of Flink Batch as of now; it is only popular for streaming.
Kafka Streams :
Kafka Streams, unlike other streaming frameworks, is a lightweight library. It is useful for streaming data from Kafka, doing transformations and then sending the results back to Kafka. We can think of it as a library similar to a Java Executor Service thread pool, but with inbuilt support for Kafka. It can be integrated well with any application and works out of the box.
Due to its lightweight nature, it can be used in microservices-type architectures. It is no match for Flink in terms of performance, but at the same time it does not need a separate cluster to run, is very handy, and is very quick and easy to deploy and start working with. Depending upon whether the associated application is distributed or single-node, Kafka Streams will be distributed or single-node as well. Internally it uses Kafka consumer groups and works on the Kafka log philosophy.
This post by Kafka and Flink authors thoroughly explains the use cases of Kafka Streams vs Flink Streaming.
EDIT 01/05/2018: One major advantage of Kafka Streams is that its processing is Exactly Once end to end. This is possible because both the source and the destination are Kafka, and Exactly Once is supported from the Kafka 0.11 release (around June 2017). To enable this feature, we just need to enable a flag and it will work out of the box (see the sketch below). More details are shared here and here.
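A minimal Kafka Streams word-count sketch with that flag set. The topic names are illustrative; the processing-guarantee property is the real Kafka Streams config key for exactly-once.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());
        // The exactly-once "flag" mentioned above (Kafka 0.11+):
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
                StreamsConfig.EXACTLY_ONCE);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("input-topic");
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
                .groupBy((key, word) -> word)
                .count(); // state is kept in a local RocksDB store

        counts.toStream().to("counts-topic",
                Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

This runs inside an ordinary JVM application, with no dedicated cluster, which is exactly the microservices fit described above.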
Advantages:
- Very lightweight library; good for microservices and IoT applications
- Exactly Once (Kafka 0.11 onwards).
- Does not need dedicated cluster
- Inherits all Kafka good characteristics
- Supports stream joins; internally uses RocksDB for maintaining state.
Disadvantages
- Tightly coupled with Kafka; cannot be used without Kafka in the picture
- Quite new, still in its infancy; yet to be tested at big companies
- Not for heavy-lifting work like Spark Streaming or Flink.
Samza :
I will cover Samza only briefly. From 100 feet above, Samza looks very similar to Kafka Streams in approach; there are many similarities. Both of these frameworks were developed by the same people, who implemented Samza at LinkedIn and then founded Confluent, where they wrote Kafka Streams. Both technologies are tightly coupled with Kafka: they take raw data from Kafka and put processed data back into Kafka, and they use the same Kafka log philosophy. Samza is a kind of scaled-up version of Kafka Streams: while Kafka Streams is a library intended for microservices, Samza is full-fledged cluster processing which runs on Yarn. Samza needs Kafka for source/sink and Yarn for stream processing, in the same way MapReduce needs HDFS for source/sink and Yarn for batch processing.
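A minimal sketch of Samza's low-level task API, which is the part that most resembles Kafka Streams' per-record model. The output system/stream names are illustrative, and the Yarn and job configuration files are omitted.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Reads records from one Kafka stream and writes an uppercased copy
// to another; Samza invokes process() once per incoming message.
public class UppercaseTask implements StreamTask {
    private static final SystemStream OUTPUT =
            new SystemStream("kafka", "events-upper"); // illustrative stream

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        collector.send(new OutgoingMessageEnvelope(OUTPUT, message.toUpperCase()));
    }
}
```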
Advantages :
- Very good at maintaining large states of information (good for the use case of joining streams), using RocksDB and the Kafka log.
- Fault-tolerant and highly performant through Kafka's properties
- One of the options to consider if already using Yarn and Kafka in the processing pipeline.
- Good Yarn citizen
- Low latency, high throughput; mature and tested at scale
Disadvantages :
- Tightly coupled with Kafka and Yarn; not easy to use if either of these is not in your processing pipeline.
- Atleast-once processing guarantee, although this might change after the Kafka 0.11 release, which brings Exactly Once.
- Lack of advanced streaming features like watermarks, sessions, triggers, etc.
Comparison of Streaming Frameworks:
We can only compare technologies with similar offerings. While Storm, Kafka Streams and Samza look great for simpler use cases, the real competition is clearly between the heavyweights with advanced features: Spark vs Flink.
When we talk about comparison, we generally tend to ask: Show me the numbers:)
Benchmarking is a good way to compare only when it has been done by third parties. For example, one of the well-known stream processing benchmarks is the Yahoo benchmark.
But this was before Spark Streaming 2.0, when it had limitations with RDDs and Project Tungsten was not in place. Now with Structured Streaming post the 2.0 release, Spark Streaming is catching up fast, and it seems like there is going to be a tough fight ahead.
Recently, benchmarking has become something of an open cat fight between Spark and Flink. Spark recently did a benchmarking comparison with Flink, to which Flink developers responded with another benchmark, after which the Spark guys edited the post. It is better not to rely completely on benchmarks in stream processing, because even a small tweak in configuration or use case can completely change the numbers. Nothing is better than doing a small POC ourselves before arriving at a conclusion.
As of today, it looks like Flink is leading the streaming analytics space, being first with most of the desired aspects like exactly once, throughput, latency, state management, fault tolerance, and advanced features. This has been possible because of some of Flink's true innovations, like lightweight snapshots and off-heap custom memory management.
One important concern with Flink until recently was its maturity and adoption level, but now big companies like Uber, Alibaba and CapitalOne are using Flink streaming at massive scale, certifying the potential of Flink Streaming. Recently, Uber open-sourced their latest streaming analytics framework, called AthenaX, which is built on top of the Flink engine. In this post, they discuss at length how they moved their streaming analytics from Storm to Apache Samza and now to Flink.
One important point to note, if you have not already noticed, is that all the native streaming frameworks which support state management (Flink, Kafka Streams, Samza) internally use RocksDB. RocksDB is unique in the sense that it maintains persistent local state on each node, and it helps boost the performance of processing systems by avoiding network calls. It has become a crucial part of new streaming systems. I have shared detailed info on RocksDB in one of my previous posts.
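For a feel of what these frameworks get from RocksDB, here is a minimal sketch of its embedded Java API (org.rocksdb). The state path and key are illustrative; the point is that reads and writes hit a local on-disk store, with no network call involved.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class LocalStateSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary(); // load the native library once per JVM
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/stream-state")) {
            db.put("word:hello".getBytes(), "42".getBytes()); // local state update
            byte[] count = db.get("word:hello".getBytes());   // fast local read
            System.out.println(new String(count));            // prints 42
        }
    }
}
```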
How to Choose the Best Streaming Framework :
This is the most important part. And the honest answer is: it depends :)
As developers, we cannot be biased toward any framework. We should keep in mind that no single processing framework can be a silver bullet for every use case. Every framework will always have some strengths and some limitations. Still, with some experience, I will share a few pointers which might help in taking decisions:
- It depends on the use case: If the use case is simple, there is no need to go for the latest and greatest framework if it is complicated to learn and implement. A lot depends on how much we are willing to invest for how much we want in return. For example, if it is a simple IoT-style event-based alerting system, Storm or Kafka Streams is perfectly fine to work with.
- Future considerations: At the same time, we also need to consciously consider what the possible future use cases will be. Is it possible that demands for advanced features like event time processing, aggregation, stream joins, etc. will come in the future? If the answer is yes, or maybe, then it is worth considering a framework with advanced features like Spark Streaming or Flink. Once invested in and implemented, it is not easy to switch frameworks later, as it involves substantial effort and time. For example, in one of my past projects we had a Storm pipeline which had been up and running for 2-3 years and was working perfectly fine. One fine day, a requirement came to deduplicate incoming events based on an ID and report only unique events. This demanded state management, which is not inherently supported by Storm. As a workaround, I implemented it using an in-memory hashmap with rolling time-based eviction, but it had the limitation that the entire state would go away on restart. We also hit some issues while making such changes, which I have shared in one of my previous posts. The point I am trying to make is: if we try to implement something ourselves which the framework does not implicitly provide, we are landing in unknown territory and are bound to hit unknown issues.
- Existing tech stack: One more important point is to consider the existing tech stack. If the existing pipeline already leverages Kafka, then Kafka Streams or Samza might be an easier fit. Similarly, if the processing pipeline is based on the Lambda architecture and Spark or Flink is already in place for batch processing, then it makes sense to consider Spark Streaming or Flink Streaming. For example, in one of my previous projects I already had Spark batch in the pipeline, so when the streaming requirement came, it was quite easy to pick Spark Streaming, which required almost the same skill set and a similar code base.
In short, if we understand the strengths and limitations of the frameworks along with our use cases well, then it becomes easier to pick one, or at least to filter down the available options. Lastly, it is always good to do a POC once a couple of options have been selected. Everyone has different taste buds after all.
Not Covered :
There are some other promising frameworks, like Apache Apex, which I have not been able to cover because I have not explored them yet. I will try to update this or write successor posts as I get to know newer frameworks. It is also important to know that there is an initiative from Google to bring the different batch and stream processing engines onto common ground. They are leading an open source project called Apache Beam to develop a common SDK where we can write APIs independent of the framework, and then run the same code on any engine (called a runner). Currently supported engines are Flink, Spark and Apex (the open source ones) and Google Dataflow (Google proprietary).
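A minimal Beam sketch of the "write once, run on any runner" idea: the pipeline below is plain Beam Java, and the runner (Flink, Spark, Dataflow, ...) is selected through the pipeline options at launch time. File paths are illustrative.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamCountSketch {
    public static void main(String[] args) {
        // Pass e.g. --runner=FlinkRunner (or SparkRunner) on the command line
        // to pick the engine; the pipeline code itself does not change.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("/tmp/input.txt"))
         .apply("CountPerLine", Count.perElement()) // counts identical lines
         .apply("Format", MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply("WriteCounts", TextIO.write().to("/tmp/output"));

        p.run().waitUntilFinish();
    }
}
```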
Conclusion :
The Apache streaming space is evolving at such a fast pace that .........