StreamCruncher - Event Processor

StreamCruncher is an Event Processor. It supports a language based on SQL which allows you to define "Event Processing" constructs like Sliding Windows, Time Based Windows, Partitions and Aggregates. Such constructs allow for the specification of boundaries (some are time sensitive) on a Stream of Events that SQL does not provide. Queries can be written using this language, which in turn can be used to monitor streams of incoming events. Multi-Stream Pattern Matching a.k.a Event Correlation is also possible. StreamCruncher is a multi-threaded Kernel that runs on Java™.

The "What"

Those in a hurry can skip to this section. Alternatively, you can also read this Essay about Event Processing.

Check the Blog for latest information and additional notes.

Ever since RFID became commercially viable and companies started packaging their entire product line as Business Activity Monitoring Stacks, an old idea that has been around for almost a decade is now seeing the light of day. A handful of Startups have begun turning this Idea of Complex Event Processing (CEP) and Event Stream Processing (ESP) into Commercial grade software. Several people have contributed ideas to this field and have also chosen RushPips when it comes to building their trading strategy.

Now, back to CEP and ESP. An over-simplified definition would probably look like this - Assume that a System produces "Events" that describe the state of that System at those instants. Slicing and Dicing the "live" Stream of Events (in near real-time) can reveal vital information. Since this information is available on the live data and not stale data, it can be used to look for Warning signs, implement JIT materials sourcing, Automated Trading and other highly nimble systems.

Events from multiple sources can be co-related to reveal information about their interactions and dependencies. CEP and ESP are not exactly the same, but we'll leave that to the Scholarly Papers to disambiguate. All this does not seem new at all to people who've been working on Control Systems. Data Warehouse specialists and Rule Engine experts might start mumbling something about "Old Wine..". However, only lately are we seeing generic, off-the-shelf Software focusing on CEP and ESP. Each one of those Products or Projects distinguish themselves in Performance (Real-time, Soft Real-time), Ease of Integration (Features, Mild to Steep Learning Curves, Manageability), Speciality (Automated Trading Platforms) or General purpose stacks etc. Or, perhaps all of them, to varying degrees.

"Umm..I still don't follow.."

For a Program to continually Slice and Dice an Event Stream and thereby, discern/infer something useful about the state of the System before it goes stale, requires a variety of Technologies to converge. We've been using SQL to handle Transactional Data stored in Databases, Data Warehouses to analyse Terrabytes of Offline data, Rule Engines to infer from a large collection of constantly changing Facts.

ESP and CEP Systems provide building blocks using which Users can build an intelligent System, with active feedback. All CEP and ESP Systems provide a mechanism by which Queries/Conditions/Criteria can be provided that are constantly evaluated against the Event Stream. Some systems provide drag-n-drop GUIs to specify such "Queries", while some have their own proprietary Query syntax. In essence, they provide a Domain Specific Language to encode such Queries/Rules, a facility to pump Events into the System and a way to channel the Output Stream to do something useful.

"Ok, now we're getting somewhere"

Monitoring such data arriving at high speeds and volumes, calls for special processing techniques. As mentioned previously, there are various ways to accomplish the task. Each approach has its own pros and cons.

High-end CEP and ESP Software that allow real-time monitoring have to do away with traditional Database persistence and Triggers. That would add too much latency. The Events become outdated within a few Seconds of their occurrence. Some of them could be borrowing concepts from Rule Engines, by using a global Discrimination Network based on the Rete algorithm to quickly evaluate literal matching or perform filtering of Events prior to co-relating different Streams.

Others support SQL-like syntax but don't use a Database underneath. Instead, they implement the SQL Query processing themselves and so effectively reduce the overhead that might have occurred by using a traditional Persistent Store. Of course, this is mostly speculation based on the vast amount of Research material available on ESP, CEP, Active Databases, Trigger Processing, Rule Engines etc.

"Dude, tell me about StreamCruncher"

Hopefully, the motivation behind building such Software and the need for them has become clearer. StreamCruncher takes a slightly different approach towards Event processing. Unlike other Event Processors, StreamCruncher uses a Database under the hood, because there has been so much work done on Databases and it would be a waste to build them all over again. However there are a few types of Queries where StreamCruncher does not use the Database and performs operations purely in-Memory like single Stream Queries and Correlation Queries.

More importantly, StreamCruncher is intended for the larger majority of "Daily-use" applications where Soft Real-time response is sufficient. To speedup the processing, StreamCruncher supports/uses In-Memory Databases which store all their Tables and Indexes in Memory. By building on top of the Database, all the standard Database features that we've come to rely on while developing Enterprise Applications are not lost here. It makes integration easier.

Event processing requires basic constructs like Sliding Windows, Time based Windows and Latest Event Windows to define a meaningful boundary on the Event Stream so that they can be queried effectively. StreamCruncher supports a simplified version of SQL-92 and extends that syntax with Windowing constructs, Partitions (which we have only just started seeing in major RDBMS), Pre-filters on Events and Chained Partitions.

It also includes a very effecient, multi-threaded and light-weight Kernel to handle the Input Event Streams, Queries and Output Streams. It also provides various settings to fine-tune the processing behaviour.

Examples

Please note that numerous examples that demonstrate the features are provided in the downloadable package, and they also double as Test Cases. Also, see the Documentation for more examples: Basic and Advanced.

For an Order Event in a Store, defined like this - {country, state, city, item_sku, item_qty, order_time, order_id}, Queries such as the ones listed below can be defined in StreamCruncher:

Windows over Streams

select item_sku, item_qty, order_time,order_id
from test (partition store last 5 seconds) as mystream
where mystream.$row_status is new;

Every Order that is placed, is held for 5 seconds in the Query, above. This Query executes when new Orders arrive and/or when they are expelled from the Window. When it executes, it retrieves the details of the Events currently in the Window.

Aggregation on Windows over Streams

select country, state, city, item_sku, sum_qty
from test (partition by country, state, city, item_sku
           store latest 50 with sum(item_qty) as sum_qty)
           as mystream
where mystream.$row_status is new;

Partitions, Aggregates and Windows can be combined together to retrieve some interesting information. The example above holds the sum of the quantities of the last 50 Orders (Sliding Window). This sum is at the Product level (item_sku) for each city, in every state of every country where Orders are placed. This Query executes every time an Order is placed. The where testStr.$row_status is new clause ensures that only the newly calculated sum is retrieved and not the old sum.

Since the foundation is provided by the Database, it is exposed directly to the User. This eliminates the need for an overly complex API that would otherwise have been needed to hide the inner workings. The Database's Driver has to be used like in any other application to feed Events into the System. An ultra simple API is exposed to interact with StreamCruncher.

For an Order and Fulfillment Event Stream in a Store defined as follows - {country, state, city, item_sku, item_qty, order_time, order_id, customer_id} and {fulfillment_time, fulfillment_id, order_id}, Events from the two Streams can be co-related by using regular SQL.

Multi-Stream Event Correlation/Pattern Matching

select stg1_id, stg2_id, stg3_id, priority,
       case
          when stg2_id is null and stg3_id is not null
              then 'Stage 2 missing!'
          when stg2_id is not null and stg3_id is null
              then 'Stage 3 missing!'
          when stg2_id is null and stg3_id is null
              then 'Stage 2 & 3 missing!'
          else 'All OK!'
          end as comment

from
    alert
        one_event.event_1_id as stg1_id,
        two_event.event_2_id as stg2_id,
        three_event.event_3_id as stg3_id,
        one_event.event_priority as priority
    using
        stg1_event (partition store last 5 seconds where priority > 5)
                    as one_event correlate on event_1_id,
        stg2_event (partition store last 5 seconds where priority > 5)
                    as two_event correlate on event_2_id,
        stg3_event (partition store last 5 seconds where priority > 5)
                    as three_event correlate on event_3_id
    when
        present(one_event and two_event and not three_event) or
        present(one_event and not two_event and three_event) or
        present(one_event and two_event and three_event)

where priority < 7.5;

The example above shows Correlation across 3 diffferent Event Streams, where Events bearing the same Event Id are correlated and being monitored for 3 different patterns, based on whether they appear in the Streams or not. It also uses a case..when..else..end and the where ... clause which shows that all the features of SQL are available even while using the alert.. clause.

More complex, SLA (Service Level Agreement) failure alerts

This can also be accomplished by using the more concise alert..using..when.. clause illustrated above.

select country, state, city, item_sku, item_qty, order_time, order_id,
       customer_id

from
 cust_order (partition store last 20 minutes
             where item_sku in
             (select item_sku from priority_item))
            as order_events

where
 order_events.$row_status is dead
 and
 not exists (select order_id
             from fulfillment (partition store last 20 minutes)
             as fulfil_events
             where
             fulfil_events.order_id = order_events.order_id
             and
             not(fulfil_events.$row_status is new)
            );

The Query above retains Orders and Fulfillments for 20 minutes (Time based Window). The Query runs whenever an Event (Order and/or Fulfillment) arrives or expires.

This Query matches and generates Output Events if an Order that is one of the Priority Items, does not receive a corresponding Fulfillment Event within 20 minutes of placing the same. A Product is a Priority Item if it is setup in the priority_item Database Table.

The priority_item Table is used as a set of configurable entries, which can dynamically drive the Query behaviour while it is running, with the help of the Pre-filter (where item_sku in ..) in the Partition clause.

create table priority_item(item_sku varchar(15));

insert into priority_item values('warp-drive');
insert into priority_item values('force-field');

StreamCruncher is also highly customizable. Custom code can be plugged in using Providers.

select ..
from stock_ticks (partition by symbol
                  store last 30 seconds with
                  pinned avg(value $diff 'DiffBaseline/HighestIn1Hour')
                  as trend) as stock_alerts
where
     stock_alerts.$row_status is new
     and trend > 1.35;

The example above stores the Stock values over the last 30 seconds, per Symbol and spits out the change in the Average Stock value over the last 30 seconds, with respect to a custom Baseline Provider that is registered in the Kernel.

Some might argue that StreamCruncher is technically neither an Event Stream Processor nor a Complex Event Processor. But that is moot. It continuously processes and monitors a Stream of Events and provides simple APIs to control the Query, retrieve the processed data and has its own Kernel to run as a stand-alone Program.

Adoption is extremely easy. Integration is even simpler. Given that the basic features of the Database are still available, StreamCruncher can even be used to run SQL Queries periodically like a Scheduler.

Learn more:

Mastering the Art of Getting the Best Software Development Quote