StreamCruncher

Syntax

StreamCruncher 's Query syntax is based on a simplified version of the Oracle-7 ANTLR grammar. Since the Kernel supports many Databases that comply with the ANSI SQL standards to varying degrees, only the simplest form of SQL-92 is allowed to be used in the Queries.

The complete syntax is provided as a single image which can be viewed here. (You might have to "enable Popups" in your Browser). In the diagram, Ovals mean terminals and literals. Rectangles mean non-terminals.

Other topics

Performance

StreamCruncher requires the JVM Heap and GC settings to be set to meet the needs of the Application. The following setting proved to be quite good on a 1.6 GHz Intel Pentium M running Windows XP Home with Sun JDK 1.6 RC. However, these settings have to be modified to suit different hardware, Query and Event Stream behaviour.

-server -XX:+UseBiasedLocking -XX:+DoEscapeAnalysis -XX:CompileThreshold=5000 -XX:ThreadStackSize=256 -XX:+UseParallelGC -XX:ParallelGCThreads=1 -XX:MaxGCPauseMillis=100 -Xms768m -Xmx768m

Since the underlying Database plays a major role, tuning the Database Server is also crucial. Please consult the Database manuals for details. Indexes must be created and used. Only the essential ones such as the one on the Event Id may be required. The Index on the Id column of the Output Event gets created automatically, by the Kernel. The Query must also be written efficiently. All the rules for writing good, performant SQL Queries apply here too.

Memory

StreamCruncher stores everything in memory. When the Kernel shuts down, the Queries and their Windows, Partitions, Aggregates etc are discarded. When the Kernel restarts, they are created afresh as and when Events arrive. The Events themselves are not stored in memory, but in the Database. However, they are still moved in and out of StreamCruncher to maintain the Window/Partition values. If Partitions are used, sometimes Events are buffered by the Kernel before they enter the Windows. This can happen in Partitions with "Sliding Windows" and "Time based Windows" with the max N clause.

Partitions with "Sliding Windows" once created, will remain in the Kernel because the Windows never become empty again. "Pinned" Aggregated Partitions also stay in memory once created. These things should also be considered while estimating the total memory that will be required. If the Kernel is configured to use an In-memory Database that resides on the same machine, then obviously, much larger RAM is required than it would for just the Kernel. Partitions without Aggregates create copies of the original Events in their Windows, whether they are directly from a Stream or from another Partition in the Chain. The amount of RAM required would thus increase proportionately.

Sufficient tests must be conducted which will help estimate the maximum number of Queries to run, rate and volume of Input Events and Output Events that will pass through the Kernel, the Partitions that will remain in memory and so on.

Load distribution

Load-balancing and clustering over multiple Kernel instances is not supported.

Although the Kernel is multi-threaded, each Query runs for the most part, in a single Thread. However, if the Query has Partitions, then each Partition is processed in parallel (Not the individual small Windows, but the whole set of Windows in a Partition), but the final Query execution is single-threaded. Partitions also split their work between Threads so that Pre-Filtering and Pre-fetching from the Stream can go on in parallel with Partition processing and Query execution.

It might be good in some cases to run the Query less frequently, so that enough Events have accumulated. Running the Query too frequently (small, millisecond intervals) may add noticable latency. However, different settings must be tried before deciding on a configuration.

The Query can be split into more than one Query, by using the Pre-filter feature in Partitions. Since each Query runs in its own group of Threads, the Stream can be partitioned across multiple Queries as shown below, thereby spreading the load across multiple Queries.

Query 1:
--------
select country, state, city, item_sku, item_qty

from
 cust_order (partition store last 20 minutes
  where country = 'India') as order_events

where
 order_events.$row_status is new;

Query 2:
--------
select country, state, city, item_sku, item_qty

from
 cust_order (partition store last 20 minutes
  where country = 'US') as order_events

where
 order_events.$row_status is new;

Multi-Core Chips or Multi-Processor Hardware for both the Database Server and the Kernel, with plenty of RAM should be used for best results. The Database Server must ideally reside on the same Hardware as the Kernel. The Kernel JVM and the Database Servers must be tuned well.

Event Processor

StreamCruncher - Syntax & Other topics