Reference Architecture
The key technology components of the reference architecture include:
1. Information sources, depicted on the left of the illustration. These encompass a variety of machine and human actors transmitting potentially thousands of real-time messages per second.
2. A highly scalable messaging system to help bring these feeds into the architecture as well as normalize them and send them in for further processing.
3. A Complex Event Processing tier that can process these feeds at scale to understand the relationships among them. These relationships are defined by business owners in a non-technical language or by developers in a technical one.
4. When specific patterns that indicate potential fraud are matched, business process workflows are created. These workflows follow a well-defined process that is modeled in advance by the business.
5. Data that has business relevance and needs to be kept for offline or batch processing can be handled using a Java Data Grid and/or a storage platform. The idea is to deploy Hadoop-oriented workloads (MapReduce or machine learning) to understand fraud patterns as they occur over a period of time.
6. Scale-out is the preferred deployment approach, as it helps the architecture scale linearly as the load placed on the system increases over time.
Here is a high-level depiction of the reference architecture:
Illustration 1: Reference Architecture for a Fraud Detection Application
Messaging Broker Tier
The messaging broker tier is the first point of entry into the system. It fundamentally hosts a set of message queues. The broker tier needs to be highly scalable while supporting a variety of protocols and cross-language clients written in Java, C, C++, C#, Ruby, Perl, Python and PHP. Using various messaging patterns to support real-time messaging, this tier integrates applications, endpoints and devices quickly and efficiently. The architecture of this tier needs to be flexible enough to be deployed in various configurations and to connect to customized solutions at every endpoint, payment outlet, partner, or device.
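The normalization step that the messaging tier performs can be illustrated with a minimal sketch. It assumes two hypothetical feeds ("atm" and "web") with made-up JSON field names, and it uses Python's in-process `queue.Queue` as a stand-in for a real message broker. None of these names come from the architecture itself.

```python
import json
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Common schema that every feed is normalized into (illustrative)."""
    source: str
    account_id: str
    amount: float      # currency units
    timestamp: float   # seconds since the epoch


def normalize(source: str, raw: str) -> Event:
    """Map a source-specific JSON payload onto the common Event schema."""
    data = json.loads(raw)
    if source == "atm":
        # Hypothetical ATM feed: amount in cents, time in milliseconds.
        return Event(source, data["acct"], data["cents"] / 100.0,
                     data["ts_ms"] / 1000.0)
    if source == "web":
        # Hypothetical web feed: amount as a decimal string, time in seconds.
        return Event(source, data["account_id"], float(data["amount"]),
                     float(data["timestamp"]))
    raise ValueError(f"unknown source: {source}")


# Stand-in for a broker queue; a real deployment would use a message broker.
inbound = queue.Queue()


def publish(source: str, raw: str) -> None:
    """Normalize a raw message and enqueue it for downstream processing."""
    inbound.put(normalize(source, raw))
```

Normalizing at the point of entry means the downstream tiers only ever see one event shape, regardless of which client or protocol produced the message.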
Complex Event Processing Tier
The Complex Event Processing (CEP) portion of the implementation is, in this scenario, an independent software module that is nonetheless fully integrated with the rest of the platform and runs on horizontally scaled infrastructure. Typically, the CEP tier has the following capabilities:
- understand and handle events as first class citizens of the platform
- select a set of interesting events in a cloud or stream of events
- detect the relevant relationships (patterns) among these events
- take appropriate actions based on the patterns detected
CEP allows the architecture to process multiple events with the goal of identifying the meaningful ones. This process involves:
- Detection of specific events
- Correlation of multiple discrete events based on causality, event attributes, and timing
- Abstraction into higher-level (i.e. complex or composite) events
It is this ability to detect, correlate and determine business relevance that powers a truly active decision-making capability.
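The detect, correlate and abstract steps can be sketched with a single hand-written rule. This is not how a CEP engine is implemented; it is a minimal illustration that assumes a hypothetical "velocity" rule (three or more transactions on one account within 60 seconds) and assumes events arrive in timestamp order. The class name, thresholds and the shape of the composite event are all invented for the example.

```python
from collections import defaultdict, deque


class VelocityRule:
    """Flags accounts with many transactions inside a sliding time window."""

    def __init__(self, threshold=3, window_seconds=60.0):
        self.threshold = threshold
        self.window = window_seconds
        # Recent (timestamp, amount) pairs, correlated by account id.
        self._recent = defaultdict(deque)

    def on_event(self, account_id, timestamp, amount):
        """Process one event; return a composite event or None."""
        recent = self._recent[account_id]
        recent.append((timestamp, amount))

        # Timing correlation: drop events that fell out of the window.
        while recent and timestamp - recent[0][0] > self.window:
            recent.popleft()

        # Abstraction: several discrete events become one higher-level event.
        if len(recent) >= self.threshold:
            return {
                "type": "RapidTransactions",
                "account_id": account_id,
                "count": len(recent),
                "total": sum(a for _, a in recent),
                "window_start": recent[0][0],
                "window_end": timestamp,
            }
        return None
```

In a real deployment such rules would be expressed in the rule language of the CEP engine rather than coded by hand, but the structure is the same: a per-key window, a pattern condition, and a derived event that downstream tiers can act on.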
Business Rules and Process Management System (BRMS)
The BPM tier is invoked for downstream handling as specific events are detected. BPM processes and business rules can be defined by non-technical as well as technical users of the fraud detection platform, as shown in Illustration 2.
The BPM tier essentially spins up new processes that can be entirely automated or can have a human in the loop to process fraudulent events. The outcome of such a process can take many forms. For instance, one outcome might be a call to a customer by a call center representative. Another could be an update to a datastore that can be queried by a business intelligence application.
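The difference between a fully automated process and one with a human in the loop can be sketched as a small state machine. This is an illustration only, not a BPM engine: the state names, the `auto_block_total` threshold and the in-memory `case_store` (standing in for the datastore a business intelligence application would query) are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ProcessInstance:
    process_id: int
    alert: dict
    state: str = "STARTED"
    history: list = field(default_factory=lambda: ["STARTED"])


class FraudWorkflow:
    """Starts one process instance per detected alert (illustrative)."""

    def __init__(self, auto_block_total=1000.0):
        self.auto_block_total = auto_block_total  # hypothetical threshold
        self._ids = count(1)
        self.call_queue = []   # tasks waiting for a call center representative
        self.case_store = {}   # stand-in for a queryable datastore

    def start(self, alert):
        proc = ProcessInstance(next(self._ids), alert)
        if alert["total"] >= self.auto_block_total:
            # Fully automated path: no human task is created.
            self._transition(proc, "CARD_BLOCKED")
        else:
            # Human-in-the-loop path: a representative calls the customer.
            self._transition(proc, "CUSTOMER_CALL_PENDING")
            self.call_queue.append(proc.process_id)
        self.case_store[proc.process_id] = proc
        return proc

    def complete_call(self, process_id, confirmed_fraud):
        """Record the outcome of the customer call and finish the process."""
        proc = self.case_store[process_id]
        if proc.state != "CUSTOMER_CALL_PENDING":
            raise ValueError(f"process {process_id} has no pending call")
        self.call_queue.remove(process_id)
        self._transition(
            proc, "CARD_BLOCKED" if confirmed_fraud else "CLOSED_NO_FRAUD")
        return proc

    @staticmethod
    def _transition(proc, new_state):
        proc.state = new_state
        proc.history.append(new_state)
```

Keeping the state history on each instance gives reporting tools an audit trail of how every case was handled.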
Illustration 2: CEP/BPM Layer for Fraud Detection Application
Storage Tier
Business requirements point to a need for two distinct data tiers.
1. Some data needs to be pulled in near real time, accessed with low latency, and used in calculations. The design principle here is "Write Many and Read Many", with the ability to scale out tiers of servers.
Java-based in-memory datagrids (IMDGs) are well suited to this use case because they support a very high write rate. Data Grid (JDG) is a highly scalable and proven implementation of a distributed datagrid that gives users the ability to store, access, modify and transfer extremely large amounts of distributed data. Further, JDG offers a universal namespace through which applications can pull in data from different sources for all of the above functionality. A key advantage here is that datagrids can pool memory and scale out horizontally across a cluster of servers. In addition, computation can be pushed into the tier of servers running the datagrid rather than pulling data into the computation tier.
As data volumes increase, datagrids can scale linearly to accommodate them. The standard means of doing so is through techniques such as data distribution and replication. Replicas are simply copies of the same segment or piece of data, stored across (that is, distributed over) a cluster of servers for fault tolerance and speedy access. Smart clients can retrieve data from a subset of servers by understanding the topology of the grid. This speeds up query performance for tools such as business intelligence dashboards and web portals that serve the business community. Datagrids also support policies that expire data that is no longer needed or is transient (i.e., has passed a certain time window).
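Distribution, replication, topology-aware reads and expiration can be shown together in a toy, single-process model. This is not the JDG API. The class and method names are invented, nodes are plain dictionaries, and key placement uses a simple hash-modulo scheme, whereas production grids typically use more elaborate placement such as consistent hashing. Time is passed in explicitly so the behavior is deterministic.

```python
import hashlib


class ToyDataGrid:
    """In-process model of a distributed cache with replicas and expiry."""

    def __init__(self, node_names, num_owners=2):
        if not 1 <= num_owners <= len(node_names):
            raise ValueError("num_owners must be between 1 and the node count")
        self.node_names = list(node_names)
        self.num_owners = num_owners
        self.nodes = {name: {} for name in self.node_names}

    def owners(self, key):
        """Return the nodes holding a key: a primary plus its replicas.

        A smart client can compute this locally and contact only these nodes.
        """
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.node_names)
        return [self.node_names[(start + i) % len(self.node_names)]
                for i in range(self.num_owners)]

    def put(self, key, value, now, ttl_seconds=None):
        """Write the entry to every owner, optionally with a time to live."""
        expires_at = None if ttl_seconds is None else now + ttl_seconds
        for name in self.owners(key):
            self.nodes[name][key] = (value, expires_at)

    def get(self, key, now, failed=()):
        """Read from the first healthy owner that holds a live entry."""
        for name in self.owners(key):
            if name in failed:
                continue  # fault tolerance: fall through to a replica
            entry = self.nodes[name].get(key)
            if entry is None:
                continue
            value, expires_at = entry
            if expires_at is not None and now >= expires_at:
                del self.nodes[name][key]  # expiration policy
                continue
            return value
        return None
```

With `num_owners=2`, a read still succeeds when the primary owner is listed in `failed`, which is the fault-tolerance property that replication is meant to provide.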
2. The second data access pattern that needs to be supported is storage for older data. This is typically large-scale historical data. The primary data access principle here is "Write Once, Read Many." This layer contains the immutable, constantly growing master dataset stored on a distributed file system such as HDFS. Besides serving as a storage mechanism, this layer can hold data in a format suitable for consumption by any tool within the Apache Hadoop ecosystem, such as Hive, Pig, or Mahout.
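The kind of batch job that runs over this immutable dataset can be sketched in the MapReduce style using plain Python. The sketch does not use Hadoop. The record fields and function names are assumptions made for illustration, and the job simply totals transaction amounts per account per day (UTC).

```python
from collections import defaultdict
from datetime import datetime, timezone


def map_record(record):
    """Map phase: emit ((account, day), amount) for one stored record."""
    day = datetime.fromtimestamp(
        record["timestamp"], tz=timezone.utc).strftime("%Y-%m-%d")
    yield (record["account_id"], day), record["amount"]


def reduce_amounts(key, amounts):
    """Reduce phase: summarize all amounts that share one key."""
    amounts = list(amounts)
    return {"account_id": key[0], "day": key[1],
            "count": len(amounts), "total": sum(amounts)}


def run_batch(master_dataset):
    """Run map, shuffle (group by key) and reduce over read-only records."""
    grouped = defaultdict(list)
    for record in master_dataset:
        for key, value in map_record(record):
            grouped[key].append(value)
    return [reduce_amounts(key, values)
            for key, values in sorted(grouped.items())]
```

Because the master dataset is never modified, a job like this can be rerun at any time to rebuild its output from scratch, for example after the aggregation logic changes.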