
The flip side of rule engines, using Drools as an example, plus some valuable tips


Hey there! This post is devoted to business rule engines and is written solely from my research and experience with the open-source Drools engine, which has been used in production for more than a year in mission-critical systems. Undoubtedly, this engine is great in many respects. But I want to explain what you should think about before using it, and when not to use it at all, because this is an important decision: once you adopt it in your project, it will be quite costly to remove later. I am going to be quick on the theory here, pointing you at issues and important things. If you don't understand something, feel free to ask me, but first please refer to the official documentation. I'll try to outline the major issues I've come across; surely there are many more.

A rule engine is not a magical box that does something new. It is intended to be a tool that provides a higher level of abstraction so you can focus less on reinventing the wheel.

On one of our projects we had a critical system with thousands of small "if then else if…" business requirements that were initially implemented as a sequence of checks in Java. Gradually we realized this clutter had turned into a mess that was very difficult to maintain, so we decided to adopt an open-source rule engine framework. After some high-level research we chose Drools, primarily because it is free and open source; no one on the team had experience with commercial rule engines, however.

Ok, let’s get to it!


The order of rule execution

Rule engines are a good fit when you don't know the order in which rules are executed, i.e. the order doesn't matter. Frequently, to get there, you have to decompose your logic into separate, simple, flat rules that don't depend on each other. In a nutshell, a rule engine is like a black box that has inputs and outputs. This is a necessary but not sufficient condition. Moreover, when rules change data in objects that are used by other rules, complexity increases greatly. Hence you should always stay vigilant about the order in which different rules execute.

Otherwise, if the order is deterministic, you are frequently better off using a BPM engine or something else.
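
As a rough illustration in plain Java (the Order fact and the checks are invented for this sketch, and this is not Drools syntax): order-independent rules behave like independent predicates that only read the facts; as soon as one rule mutates data that another rule reads, order starts to matter.

import java.util.List;
import java.util.function.Predicate;

public class FlatRulesExample {

    static class Order {
        double amount;
        String country;
        Order(double amount, String country) { this.amount = amount; this.country = country; }
    }

    public static void main(String[] args) {
        // Each "rule" only reads the fact and reports a violation, so the outcome
        // is the same regardless of the order in which the rules are evaluated.
        List<Predicate<Order>> rules = List.of(
                o -> o.amount > 0,                    // amount must be positive
                o -> o.amount < 10_000,               // amount must stay below the limit
                o -> !"RESTRICTED".equals(o.country)  // country must not be restricted
        );

        Order order = new Order(250.0, "DE");
        boolean valid = rules.stream().allMatch(r -> r.test(order));
        System.out.println("Order valid: " + valid);
    }
}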

Clear domain model with stable public API

Rule engines are not a fit if your system doesn't have a clear domain model, and most non-mature systems don't have one: the model evolves and sometimes takes time to understand, and even clear requirements are not always so clear in practice. Why am I saying this? Well, because a good rule of thumb is that:

“Rules should work only with the domain model of your system, through a clear public API around it. If they don't, your rules will ultimately turn into an unmanageable mess. Period!”

It seems natural, but many developers don't realize it. Imagine what happens if your rules start using internal code or other APIs that are not part of the domain logic. With such an approach the rules become aware of more than they need to, such as internal implementations/behaviours and other dependencies, i.e. tightly coupled code. It's just like OOP or SOA principles, but in terms of rules. Tightly coupled code and rules are very difficult to change, all the more so if the code is not encapsulated. Eventually you will find yourself modifying code that breaks rules every time business requirements change.

Idempotence, or exactly once

All rules must be able to run multiple times without errors, yielding the same result. It's like writing a bash script, and it is also considered a good rule of thumb. Otherwise you may encounter lots of weird misbehaviour.
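
A hedged illustration with an invented Order fact (plain Java, not Drools syntax): an idempotent action assigns a derived value, while a non-idempotent one accumulates state, so re-firing it changes the result.

// Hypothetical Order fact, used only for this sketch.
class Order {
    double total = 100.0;
    double discount = 0.0;
}

class DiscountRules {
    // Idempotent: re-running the action always leaves the same state behind.
    static void applyDiscountIdempotent(Order o) {
        o.discount = o.total * 0.10;
    }

    // Not idempotent: every extra activation inflates the discount further.
    static void applyDiscountNonIdempotent(Order o) {
        o.discount += o.total * 0.10;
    }
}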

Forget debugging tools, 100% test coverage is needed

Rule engines are difficult to refactor; you have to analyze the rules and/or keep the whole dependency graph in mind. In Drools it is impossible to trace rules line by line: Drools rules are declarative and there is no mechanism to debug them. Therefore you should ideally cover 100% of your rules with JUnit tests. One hint is to use the Spock framework on top of Groovy as a behaviour-driven development tool, where each rule is simply covered by a test. Tests usually come directly from business specs. Each rule should do only one simple thing, as minimal as possible.
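
For example, a minimal JUnit sketch of such a per-rule test might look like the following; the session name "ksession-rules", the Order fact and the rule's expected effect are assumptions for illustration, not taken from a real project.

import org.junit.Assert;
import org.junit.Test;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class DiscountRuleTest {

    // Minimal stand-in for the domain fact the rule under test works with.
    public static class Order {
        private final double amount;
        private double discount;
        public Order(double amount) { this.amount = amount; }
        public double getAmount() { return amount; }
        public double getDiscount() { return discount; }
        public void setDiscount(double discount) { this.discount = discount; }
    }

    @Test
    public void largeOrderGetsDiscount() {
        // Assumes a kmodule.xml defining a session named "ksession-rules"
        // and a DRL rule that grants a discount to large orders.
        KieServices kieServices = KieServices.Factory.get();
        KieContainer kieContainer = kieServices.getKieClasspathContainer();
        KieSession session = kieContainer.newKieSession("ksession-rules");
        try {
            Order order = new Order(15_000.0);
            session.insert(order);
            session.fireAllRules();

            // One rule, one observable effect, asserted straight from the business spec.
            Assert.assertTrue(order.getDiscount() > 0.0);
        } finally {
            session.dispose();
        }
    }
}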

Cost of change

No one has any idea whether there are conflicting rules when a new one is added or an existing one is changed, so a complete regression test must be run whenever rules change. Memory is sometimes a big issue too, and you can do very little to reduce consumption since you are using a third-party framework.

Some rules need to be triggered twice or more

Some of you may point out that Drools can invalidate facts and re-trigger rules, but this again turns into an unobvious, tangled mess that gets in the way of quickly understanding why the rules yielded a given result. It's up to you, but you never know whether you will have to flush or invalidate rules in the long run.

Rete Algorithm vs your Domain API and traversals

The majority of modern rule engines, including Drools, are built on the so-called Rete algorithm. Make sure that the object model your rules work with can be built by the Rete algorithm in working memory. If it can't, traversing your object graph from within the rules will be a big performance issue! Bear in mind that if your public domain API does traversal internally (e.g. some of your methods traverse the graph for you) and this happens in an LHS (left-hand side) clause, it is very bad for Drools' performance, since Drools has its own flexible querying model tied directly to the Rete tree. Drools itself should be able to build the tree from your API and traverse it efficiently.

Performance and external systems

What if you are required to save a set of operations to a database, each operation must be checked via the rule engine, and the next operation depends on the result of the previous one? This imposes performance problems. Ideally, all operations would be validated and then saved in one go, so think it through in advance. I know Drools has rule groups, but my point is that it gets very complicated.

Rules Centralization

The same business rules may apply across different services, leading to redundancy and governance challenges, and imposing a burden on us to keep the content of those rules in sync over time. Thus it is usually recommended to move them into a dedicated rule service. The downside is that such a service becomes an additional architectural dependency for other services, and the responsibility to build and maintain a centralized rule service can introduce various headaches. When we need to change currently active rules or introduce new ones, we must ensure that these changes do not have negative or unforeseen impacts on the existing rule architecture, the worst of which is a cascading effect causing exceptions that compound across multiple rules (and multiple services); all of this has to be thoroughly tested.

And please! Make sure that your services know nothing about the rules themselves and only talk to the rule service via a clear contract.

Conclusion

This write-up only scratched the surface of the issues from my experience. To be honest, for a large number of systems I would say "no" to applying rule engine frameworks in my future projects. Initially they seem pretty easy, but eventually you realize that third-party rule engines are more pain than benefit, especially when it comes to interdependencies and the domain model, and project maintenance becomes very costly because of this. It's not as simple as coding a bunch of if statements: the rule engine gives you tools to handle rule relationships, but you still have to be able to hold all of that in your mind.

Most importantly, it is quite difficult to use a rule engine effectively and correctly. A team should have solid knowledge of how to use it, and its understanding and implementation of the business model should be mature. Rule engines work well when the logic is really flat and independent across rules and facts. Each framework also restricts you to its specific model. It's often not worth it!

Project Euler: a list of interesting problems



If you are not aware of it, a website called Project Euler hosts hundreds of algorithmic problems. Even though most of them are related to math, it's a good resource to warm up and train your brain in coding. You can use any programming language you want and track your progress.

Here's a list of Project Euler problems that I find interesting for their diversity, aimed at improving not only math but also programming skills (No. / Problem Title):

11 – Largest product in a grid
12 – Highly divisible triangular number
15 – Lattice paths
24 – Lexicographic permutations
54 – Poker hands
59 – XOR decryption
62 – Cubic permutations
67 – Maximum path sum II
68 – Magic 5-gon ring
78 – Coin partitions
79 – Passcode derivation
81 – Path sum: two ways
86 – Cuboid route
94 – Almost equilateral triangles
96 – Sudoku
100 – Arranged probability
107 – Minimal network
109 – Darts
114 – Counting block combinations I
115 – Counting block combinations II
116 – Red, green or blue tiles
117 – Red, green, and blue tiles
145 – How many reversible numbers are there below one-billion?
148 – Exploring Pascal’s triangle
150 – Searching a triangular array for a sub-triangle having minimum-sum
154 – Exploring Pascal’s pyramid
165 – Intersections
166 – Criss Cross
181 – Investigating in how many ways objects of two different colours can be grouped
182 – RSA encryption
186 – Connectedness of a network
194 – Coloured Configurations
208 – Robot Walks
209 – Circular Logic
232 – The Race
267 – Billionaire
275 – Balanced Sculptures
280 – Ant and seeds

Note that it is highly recommended to solve the problems one by one, because solving an earlier problem often gives a clue for later ones.

Command And Query Responsibility Segregation and Event Sourcing: What you should think about in advance


After quite a while this article has finally come out! I started writing it about 6 months ago but, because of many factors in my life, I procrastinated. This text is mainly for those who want to see the major pros and cons of this model; you won't find the nuts and bolts here. For those, move to the reference section, which lists really good URLs to start with. From here on the pattern will be called CQRS/ES for simplicity.

2 main goals of the pattern

Some developers think this pattern is used solely for scalability, forgetting that it also greatly reduces data model complexity.

Events and snapshots – never remove, always append

It is well known that append-only architectures distribute more easily than updating architectures because there are far fewer locks to deal with. Append-only operations require no locks on the database side, which increases both performance and scalability.

CQRS + Event Model

One of the largest issues when using Event Sourcing is that you cannot ask the system a query such as "Give me all users whose first names are 'Greg'". This is because there is no representation of current state.

With Event Sourcing the event model is also the persistence model on the Write side. This drastically lowers costs of development as no conversion between the models is needed.
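
As a minimal sketch of what that write-side persistence might look like, assuming a hypothetical MoneyDeposited event and an append-only EventStore interface (all names invented for illustration):

import java.time.Instant;
import java.util.List;
import java.util.UUID;

// Events are simple, immutable facts about something that has already happened.
final class MoneyDeposited {
    final UUID accountId;
    final long amountCents;
    final Instant occurredAt;

    MoneyDeposited(UUID accountId, long amountCents, Instant occurredAt) {
        this.accountId = accountId;
        this.amountCents = amountCents;
        this.occurredAt = occurredAt;
    }
}

// Hypothetical append-only store: the event model *is* the persistence model.
interface EventStore {
    void append(UUID streamId, Object event);   // no updates, no deletes
    List<Object> loadStream(UUID streamId);     // replay in the original order
}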

Synchronizing write and read sides

The choice of integration model, though, is very important, as translation and synchronization between models can become a very expensive undertaking. The best-suited approach is the introduction of events: events are a well-known integration pattern and offer the best mechanism for model synchronization.

CQRS suggests using 2 different databases: one optimized for reading (denormalized) and another for writing. It is a very tempting way to design systems. The only big problem with this solution is that you somehow need to keep the 2 data stores in sync.

You must definitely have a mechanism to recreate the read side from the write side. You might need to do this if the read-side data store got out of sync for some reason, or because you needed to modify the structure of the read-side data store to support a new query. Remember that the CQRS pattern does not mandate different stores on the read and write sides: you could use a single relational store with a schema in third normal form and a set of denormalized views over that schema. However, replaying events is a very convenient mechanism for resynchronizing the read-side data store with the write-side data store.
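
A rough sketch of such a rebuild, reusing the hypothetical EventStore and MoneyDeposited event from the sketch above (all names invented): the read model is dropped and re-populated by replaying every event in order.

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical denormalized read model: account id -> current balance in cents.
class AccountBalanceView {
    final Map<UUID, Long> balances = new HashMap<>();

    void apply(Object event) {
        if (event instanceof MoneyDeposited) {
            MoneyDeposited e = (MoneyDeposited) event;
            balances.merge(e.accountId, e.amountCents, Long::sum);
        }
        // ...handle the other event types of the system here
    }
}

class ReadSideRebuilder {
    // Recreate the read side from scratch by replaying the write-side history.
    static AccountBalanceView rebuild(EventStore store, Iterable<UUID> allStreams) {
        AccountBalanceView view = new AccountBalanceView();
        for (UUID streamId : allStreams) {
            store.loadStream(streamId).forEach(view::apply);
        }
        return view;
    }
}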

Storage technology replaceability

CQRS/ES makes it easy to change your technologies. For example, you could start with a file-based event store for prototyping and development, and later switch to a Windows Azure table-based store for production.

Relational vs NoSQL solutions and the cost of maintenance – think twice

Let's mention the major issues you face when using a relational database as the technology for this model.

Pitfall 1: Limit in number of columns per table

The read side is always kept in a denormalized form for performance reasons and usually requires a huge number of columns. If you run, for example, an Oracle 11g database and hit its maximum of 1000 columns per table, you won't be able to add more columns.

Pitfall 2: Adding new columns

In many cases, in order to add new data to the domain model you must release a bunch of applications/services along with changing the schema of the read-side database, especially when a column is not empty but already contains data. Here I prefer a NoSQL database such as MongoDB with a relaxed schema (data are kept in JSON documents), so you don't have to change a database schema at all. A CQRS/ES system with a NoSQL database might use a simple event storage for the write side, with a blob column in which data are kept as history, and a schema-less NoSQL database for the read side.

Eventual consistency – think twice

While not required, it is common to use messaging between the write and read sides. This means that the system will be in an inconsistent state from time to time. This pattern is not a fit for, e.g., financial transactions where strong consistency is required between the 2 sides. However, in the majority of cases strong consistency is needed only on the write side, whereas the read side is used for querying and reporting.

Simple vs costly

Events are simple objects that describe what has happened in the system. By simply saving events, you are avoiding the complications associated with saving complex domain objects to a relational store; namely, the object-relational impedance mismatch.

The event-based model may be slightly more costly due to the need to define events, but this cost is relatively low, and the model also offers a smaller impedance mismatch to bridge, which helps make up for it. The event-based model also offers all of the benefits discussed in "Events", which further reduce the initial cost of creating the events. That said, the CQRS and Event Sourcing model is actually less expensive in most cases.

Flexibility

As long as you have a stream of events, you can project it to any form, even a conventional SQL database. For instance, my favorite approach is to project event streams into JSON documents stored in cloud storage.

Another great benefit of the separate read layer is that it does not suffer from an impedance mismatch: it is connected directly to the data model, which can make queries much easier to optimize. On the write side you no longer need to worry about how your locks impact queries, and on the read side your database can be read-only.

Performance

Write-side:

Because events are immutable, you can use an append-only operation when you save them. Events are also simple, standalone objects. Both these factors can lead to better performance and scalability for the system than approaches that use complex relational storage models. On top of this, events reduce network volume compared to the full snapshots sent in a pure CQRS model.

Read-Side:

A normalized database schema can fail to provide adequate response times because of the excessive table JOIN operations. Despite advances in relational database technology, a JOIN operation is still very expensive compared to a single-table read.

Scalability

Command: In most systems, especially web systems, the Command side generally processes a very small number of transactions as a percentage of the whole. Scalability therefore is not always important.

Query: In most systems, especially web systems, the Query side generally processes a very large number of transactions as a percentage of the whole (often times 2 or more orders of magnitude). Scalability is most often needed for the query side.

Distributed development teams

The architecture can be viewed as three distinct decoupled areas. The first is the client: it consumes DTOs and produces commands. The second is the domain: it consumes commands and produces events. The third is the read model: it consumes events and produces DTOs. The decoupled nature of these three areas can be extremely valuable in terms of team characteristics. Instead of working in a single vertical slice, the team can work on three concurrent slices: the client, the domain, and the read model. This allows the number of developers on a project to scale much better, since the areas are isolated from each other and developers cause fewer conflicts when making changes. It would be reasonable to nearly triple a team's size without introducing much additional conflict, because little extra communication between the developers is needed.

Conclusion

There is no one right way to implement CQRS. It is of course not recommended to use this model for critical projects if you have no experience with it. Note that it is not possible to create an optimal solution for searching, reporting, and processing transactions using a single model.

References

Introduction to Domain Driven Design, CQRS and Event Sourcing

Command-Query-Responsibility-Segregation (CQRS) Makes Domain Models Practical

CQRS and Event Sourcing

CQRS – Did you mean CARS?

A CQRS and ES Deep Dive

Clarified CQRS

Distributed transactions and scalability issues in large-scale distributed systems


Distributed transactions are the main evil of scalability

It is very hard to scale distributed transactions to an extremely high level; moreover, they reduce throughput. Unlike a transaction on a local database, a distributed transaction involves altering data on multiple nodes: it can be a database plus a JMS broker, or just a set of different databases. As an example, let's recall the classical two-phase commit (2PC from here on), a type of atomic commitment protocol, in a back-end service with a high volume of transactions. This protocol provides ACID-like properties for global transaction processing. I won't go into the details of how it works under the hood; I'll just say that the C (Consistency) from ACID is the main evil of scalability in distributed systems. It puts a great burden on the system due to its complex coordination algorithm, and overall throughput can drop by a factor of several. Locks in all of the data sources are held for the duration of the 2PC, and longer-held locks create the risk of higher contention. The 2PC coordinator also represents a single point of failure, which is unacceptable for critical systems. For systems that have a reasonably high volume of messages, or tight SLAs, it's worth giving up strong consistency for throughput.

But how can I live without distributed transactions to achieve higher scalability?

Algorithms such as 2PC provide "exactly once" semantics, whereas we will use "at least once" semantics. The difference is that the developer has to handle duplicates in application code. Most queueing technologies provide acknowledgements that a message has been accepted (handling it is a separate matter), and databases provide local transactions. With these building blocks we can deal with downstream failures without coordination. Read on!

Idempotence and fault tolerance

In math, idempotence is as simple as this:

f(f(x)) = f(x)

That is, the result stays the same no matter how many times a function is applied to the same argument. In the distributed world, idempotence implies that an operation can be invoked repeatedly without changing the result. Why do I need it? Because we must somehow deal with duplicate requests in case of a system failure. Let's make it clear with an example. A client application sends a financial transaction to a server (there might be a load-balanced cluster of them or just one) and waits for an acknowledgment. For some reason, at this particular time:

  • A server goes down or
  • Client goes down or
  • Network failure happens

In all of these 3 cases the client application did not get an acknowledgment message (reply) from the server about the transaction status. Of course, the client should then retry the transaction, and the server must ensure that the financial transaction is applied effectively once even though it is delivered "at least once". Here idempotence comes to the rescue: the server must remember the state, i.e. that a transaction with this id has already been processed, including the saved acknowledgement message, so that it can detect the duplicate and reply with the same acknowledgement. We don't have expensive distributed transactions anymore; "at least once" is a more relaxed and scalable approach. That is, instead of locking resources everywhere, we can assume that messages will arrive at least once and process duplicates idempotently.
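
A simplified sketch of this idea with invented names: the server keeps the acknowledgement keyed by transaction id, so a retried request is answered from the stored state instead of being processed twice.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IdempotentTransactionHandler {

    // In production this would be a durable store shared by all server instances.
    private final Map<String, String> acknowledgements = new ConcurrentHashMap<>();

    String handle(String transactionId, Runnable businessOperation) {
        // computeIfAbsent applies the operation at most once per transaction id;
        // a retry of the same id simply gets the acknowledgement stored earlier.
        return acknowledgements.computeIfAbsent(transactionId, id -> {
            businessOperation.run();
            return "OK:" + id;
        });
    }
}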

Optimistic locking

Even though this technique is quite old, it goes well with idempotence. If two people are trying to change the same entity at the same time, we don't lock database records; rather, we use a concept of versioning and, optionally, uniqueness. The idea is to store a version with each entity record in the database and, before saving, make sure it hasn't changed. A simple example is a self-service check-in kiosk at the airport where people can select a vacant seat from the seat map.


Each seat has version = 1. When multiple people make their choice in parallel, before proceeding the system simply checks whether the seat's version has changed. If it has, the user is notified that the seat was taken while she was deciding. This is a very simple example where the version can only be 1 or 2; a more complicated situation arises in order-management systems where an order might go through many versions, but that doesn't change how optimistic locking works. The approach again yields a great trade-off in terms of speed because we don't use a locking mechanism.
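
In SQL terms the version check often becomes a conditional update; a hedged JDBC-style sketch (table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class SeatReservation {

    // Returns true if the seat was reserved; false means the version changed underneath us.
    boolean reserve(Connection connection, long seatId, long expectedVersion, String passenger)
            throws SQLException {
        String sql = "UPDATE seat SET passenger = ?, version = version + 1 "
                   + "WHERE id = ? AND version = ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, passenger);
            ps.setLong(2, seatId);
            ps.setLong(3, expectedVersion);
            return ps.executeUpdate() == 1;   // 0 rows updated => optimistic lock conflict
        }
    }
}

If executeUpdate() reports 0 affected rows, somebody else bumped the version first, and the caller has to re-read the seat and decide what to do next.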

Local atomic transactions and unique constraints

Local atomic transactions are usually restricted to a single store. They are primarily needed to apply a set of operations atomically to a single resource (e.g. a relational database) and to ensure the correct ordering of operations within a transaction. In some cases we can do away with transactions altogether, particularly if we don't care about the order of operations; then we can process operations asynchronously, again leading to better throughput. Sometimes a model that seems to require ordering can be redesigned for asynchronous operations.

Putting it all together

In order to achieve greater throughput, a system should follow these principles:

  1. Operations can be retried after a downstream failure, relying on idempotence.
  2. Don't use transactions where optimistic locking can do the job – it's much cheaper.
  3. Local transactions, based on a single-phase commit per resource, are more scalable than distributed ones and increase overall application availability.
  4. Messages may be reordered.

Wrapping up

Systems as great as Google's Bigtable don't support traditional cross-row ACID transactions because they would impose a heavy overhead on a highly distributed data storage model. I was lucky to use all of the above techniques in my applications too, involving mission-critical financial transactions, and I must say that a few years ago not many people knew about them, but now I hear about them more and more often. Oh yeah, I almost forgot! I urge you to read the great article written by Pat Helland that covers even more use cases; I came across it during my research. And remember: you can live without distributed transactions if you handle idempotence and downstream failures correctly.

References

1. Your Coffee Shop Doesn’t Use Two-Phase Commit by Gregor Hohpe.

2. Idempotence Is Not a Medical Condition by Pat Helland.

3. 2PC or not 2PC, Wherefore Art Thou XA? by Dan Pritchett.

4. Life beyond Distributed Transactions: an Apostate’s Opinion by Pat Helland.

Service Discovery in distributed systems


After quite a while I've decided to continue writing my blog. There are always lots of ideas to share with you but, as usual, a lack of time.

Introduction

Service Discovery is an architectural pattern and a key component of most distributed systems and service oriented architectures.

In simple terms it is a central gateway for all client applications that want to use other services. A client application doesn't have to know where a particular service is located (usually IP:port); moreover, a service can be moved or redeployed to arbitrary boxes for any reason. There's no need to change connection details, update configs or anything else – Service Discovery takes care of that. As a client app you just need to ask it for access to other services. The pattern pays off especially when there are hundreds of client applications and/or dozens of services.

Most of the work in this article relies on trial and error as well as some research. I also urge you to read a great survey of open-source Service Discovery frameworks on Jason Wilder's blog.

In this particular article I won't discuss CAP properties, as they vary depending on the implementation. But CAP is important; if you want to know more about it, I recommend this article devoted to the CAP theorem.

High level components

Service Discovery is the process of finding a suitable service, and its location, for a task requested by a service consumer. Let's break Service Discovery down into 3 main components:

Service Requestor

Any consuming application that wants to use some service, that is, a client application, consumer, user, etc.

Service Provider

In the most basic scenario there is a Service Provider that finds the requested service and talks to the Service Requestor.

Service Registry

The Service Registry stores information about all services, such as IP:port, access rights and so forth. It can be either static or dynamic.

This model comes directly from SOA but can be adapted to your needs. The conversation among the 3 entities above might differ depending on the implementation. As a quick example of such a segregation, refer to Web Service Discovery.

Discovery options

1. Static service registry

This is a very basic and easy option with one big drawback – it is static. Every time a service moves to another box, the static registry has to be updated manually. We won't go further here, as it is quite obvious and offers little flexibility.

2. Announcement

This pattern implies that as soon as a service is up, it registers itself in the Service Registry – such a notification is sometimes called an announcement. This simple approach greatly simplifies maintenance of the Service Registry. It's also worth mentioning that in the distributed world we always care about the "no single point of failure" principle, keeping at least 2 redundant nodes per service. The Service Registry is usually implemented as a separate, independent, self-sufficient component. For simplicity, the registry can be a simple replicated memcached, Ehcache or other storage, not necessarily a cache.
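
A toy, in-memory sketch of a registry supporting such announcements (names invented; no replication, health checking or security):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Minimal registry: service name -> list of "host:port" endpoints.
class ServiceRegistry {

    private final Map<String, List<String>> endpoints = new ConcurrentHashMap<>();

    // Announcement: a service instance registers itself as soon as it starts.
    void announce(String serviceName, String hostAndPort) {
        endpoints.computeIfAbsent(serviceName, name -> new CopyOnWriteArrayList<>())
                 .add(hostAndPort);
    }

    // Lookup performed by the Service Provider on behalf of a Service Requestor.
    List<String> lookup(String serviceName) {
        return endpoints.getOrDefault(serviceName, List.of());
    }
}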

A few words about ZooKeeper as a solution for Service Discovery.

A counterexample is ZooKeeper, which doesn't follow the shared-nothing architectural pattern: the Service Registry is coupled with the Service Provider. In ZooKeeper's world the coordination nodes in an ensemble are NOT independent of each other, because they share the Service Registry and there is write contention due to message ordering. That, together with synchronous replication among nodes, leads to ZooKeeper's poor write scalability. In terms of requirements, though, most applications involved in configuration management, including Service Discovery, need frequent reads rather than writes, so overall scalability is not blocked.

Let's pick up where we left off – announcement.

The process of announcement is depicted below:

[Figure: announcement flow]

3. Multicasting

Using the IGMP protocol, a Service Provider subscribes to a particular multicast group, and all services send a multicast message to this group. The downside of this approach is that IGMP is optional for IPv4 and not all hardware supports it.

Health checking

Once a service is registered in the Service Registry, the Service Provider needs to know whether it is alive and healthy (that is, the service is 100% available).

Option 1 – on reconnect: as soon as a Service Requestor realizes that it has lost the connection to the service it was working with, it goes back to the Service Provider to get a new one. Each time a Service Requestor accesses the Service Provider, the provider performs health checking (almost the same as heartbeats, plus probably some other checks) to make sure the service is alive, maintains the Service Registry and returns a response to the client with information about a healthy service.

Option 2 – heartbeating: the Service Provider receives heartbeats from registered services at regular intervals. Once a heartbeat is lost, the Service Provider removes the corresponding record from the Service Registry. This is a good option for getting more real-time information, though it is more difficult to implement. This is how ZooKeeper and other distributed systems work.
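
A rough sketch of the heartbeat bookkeeping behind option 2 (the timeout value and names are arbitrary): the registry records the last heartbeat per instance and evicts anything that has gone silent.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatMonitor {

    private static final long TIMEOUT_MILLIS = 10_000;   // arbitrary example timeout

    // instance id ("host:port") -> timestamp of the last heartbeat received
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    void onHeartbeat(String instanceId) {
        lastHeartbeat.put(instanceId, System.currentTimeMillis());
    }

    // Called periodically so that dead instances disappear from the Service Registry.
    void evictDeadInstances() {
        long now = System.currentTimeMillis();
        lastHeartbeat.entrySet().removeIf(entry -> now - entry.getValue() > TIMEOUT_MILLIS);
    }
}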

[Figure: handling a service going down]

In both options, if a service goes down clients go back to the Service Provider. Which strategy to use is up to the developer; option 1 is more complex but the better one in most scenarios.

Load Balancing

Load balancing can be easily implemented inside the Service Provider, since it becomes the central gateway through which all Service Requestors access other services. The simplest variation is round-robin. I won't go into details, as it is a different topic and you can find more on the internet.
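
For completeness, a trivial round-robin sketch over the endpoints a registry might return (purely illustrative):

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Trivial round-robin choice over the endpoints returned by the registry.
class RoundRobinBalancer {

    private final AtomicInteger counter = new AtomicInteger();

    // Assumes a non-empty endpoint list.
    String pick(List<String> endpoints) {
        int index = Math.floorMod(counter.getAndIncrement(), endpoints.size());
        return endpoints.get(index);
    }
}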

Authentication

Again, as with load balancing, you might want to apply expiring tokens or cryptographic keys here.

Cooperation of all components

At a very high level, the communication between components can happen as in the following picture:

[Figure: service request flow]

Summary

We have defined what Service Discovery is, broken it down into 3 components and described a few basic approaches.

HighLoad Conference 2013


Today I was lucky to attend the annual developer conference HighLoad++ 2013, held for the 7th time. Some conference sessions are translated into English and available here: http://www.highload.co/freeonline/


Listeners heard many buzzwords like horizontal scalability, replication, sharding and NoSQL. Although I missed the first day of the conference, what I really liked was not just the new trends in technology but the speakers' invaluable experience on real projects, with pros and cons, especially:

1. "MySQL versus something else" by Mark Callaghan [Facebook] on storage efficiency, a framework for analysis and benchmarking, and database algorithms: https://www.facebook.com/MySQLatFacebook


2. "Cassandra vs In-Memory Data Grid in eCommerce" by Alexander Soloviev [Grid Dynamics] was very informative, with deep analysis and benchmarking of different cases and their pros and cons.

3. “Query Optimizer in MariaDB: now w/o indices” by Sergey Golubchik [Monty Program Ab]

MinHash algorithm, or how to quickly find similarities between 2 documents


MinHash is a technique from Locality Sensitive Hashing that allows you to find similarities between 2 sets. It is a buzzword frequently encountered in the Data Mining and Data Science fields of CS. What is surprising is that this method was invented in 1997 and used in the AltaVista web search engine back in the 90s to find similarities among web documents. It can also be used for:

  • Finding duplicates
  • Near-duplicate image detection
  • Near-neighbor search

Basically, the algorithm can be applied to anything that can be represented by numbers.

Let's start with a bit of math from probability theory and statistics.

Define the following measure for two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|

This is the so-called Jaccard coefficient.

Where J ∈ [0, 1]:

J = 0 if A ∩ B = ∅, that is, the 2 sets are disjoint and there are no similarities;

J = 1 if A ∩ B = A = B, that is, the 2 sets are identical.

A and B are more similar the closer their Jaccard coefficient is to 1.
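
For small sets the coefficient can of course be computed directly; a tiny Java helper for reference:

import java.util.HashSet;
import java.util.Set;

class Jaccard {

    // Exact Jaccard coefficient |A ∩ B| / |A ∪ B|, fine for small sets.
    static <T> double coefficient(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;   // convention: two empty sets are considered identical
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}

For the two example sets used in the code at the end of this post, {THIS, IS, THE, CASE} and {THAT, IS, THE, CASE}, the exact value is 3/5 = 0.6, which is what the MinHash estimate should approach as the number of hash functions grows.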

This simple formula becomes cumbersome when the sets are quite large, e.g. 2 web documents of more than 1 MB in size. Ouch, that's a lot: 1 MB of text data is 1,048,576 characters, provided that 1 ASCII char = 8 bits (for a Unicode charset it is, of course, more).

Now that we understand a bit of theory, let's try to apply hashing to the Jaccard coefficient. Whenever I hear "hashing", it usually points to randomized algorithms.

OK, let's move on. The main idea is that similar objects hash to the same bucket: the probability of collision is higher for similar objects.

Here we give an example for 2 sets A and B but the algorithm can be applied to any number of sets.

1. Essentially, we need to construct a set of k independent hash functions <h1, h2, h3, …, hk> at random, where k = O(1/ε²) for ε > 0, so that the expected error of the estimate is at most ε. For example, 400 hashes would be required to estimate J(A,B) with an expected error less than or equal to 0.05. So k can be varied to increase or decrease the likelihood of false negatives.

2. Next, for each set (A and B) and each hash function h_i, we initialize the min-hash value min_i to infinity.

3. For each element s in both sets A and B we compute h_i(s) for every hash function, such that: if h_i(s) < min_i then min_i = h_i(s). Eventually we have a vector of k min-hash values for each of the sets A and B.

4. If the 2 sets A and B are similar, then the probability P(min_i(A) = min_i(B)) = |A ∩ B| / |A ∪ B| is high – and this is exactly the Jaccard coefficient!

5. We compute the fraction of hash functions for which the min-hash values of A and B coincide to estimate how similar the 2 sets are. The general formula is: Similarity = (number of identical min-hash values) / k.

In the real world this requires more thorough consideration of parameters, hash functions, etc. However, to demonstrate the algorithm I wrote some simple Java code:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class LSHMinHash<T> {

    private final int hashes[];
    private final int numOfHashes;
    private final int numOfSets;
    private final Set<T> setA;
    private final Set<T> setB;
    private final Map<T, boolean[]> matrix;
    private final int[][] minHashes;

    public LSHMinHash(double e, Set<T> setA, Set<T> setB){
        this.numOfHashes = (int)(1 / (e * e));
        this.numOfSets = 2;
        this.setA = setA;
        this.setB = setB;
        matrix = buildSetMatrix(setA, setB);
        minHashes = initMinHashes(numOfSets, numOfHashes);
        hashes = computeHashes(numOfHashes);
    }

    private Map<T,boolean[]> buildSetMatrix(Set<T> setA, Set<T> setB) {

        Map<T,boolean[]> matrix = new HashMap<T,boolean[]>();

        for(T element : setA){
            matrix.put(element, new boolean[] { true, false } );
        }

        for(T element : setB){
            if(matrix.containsKey(element)){
                matrix.put(element, new boolean[] { true, true } );
            }else if(!matrix.containsKey(element)){
                matrix.put(element, new boolean[] { false, true } );
            }
        }

        return matrix;
    }

    private int[][] initMinHashes(int numOfSets, int numOfHashes) {
        int[][] minHashes = new int[numOfSets][numOfHashes];

        for (int i = 0; i < numOfSets; i++) {
            for (int j = 0; j < numOfHashes; j++) {
                minHashes[i][j] = Integer.MAX_VALUE;
            }
        }
        return minHashes;
    }

    private int[] computeHashes(int numOfHashes) {
        // One pseudo-random seed per hash function (fixed seed keeps runs reproducible).
        int[] hashes = new int[numOfHashes];
        Random r = new Random(31);

        for (int i = 0; i < numOfHashes; i++){
            int a = r.nextInt();
            int b = r.nextInt();
            int c = r.nextInt();
            hashes[i] = a * (a * b * c >> 4) + a * a * b * c + c;
        }
        return hashes;
    }

    private void computeMinHashForSet(Set<T> set, int setIndex){
        for(T element : matrix.keySet()) {
            if (!set.contains(element)) {
                continue;
            }
            // Combine each hash function's seed with the element's own hash and
            // keep the minimum value seen so far for this set (the "min-hash").
            for (int i = 0; i < numOfHashes; i++) {
                int hashValue = hashes[i] ^ element.hashCode();
                if (hashValue < minHashes[setIndex][i]) {
                    minHashes[setIndex][i] = hashValue;
                }
            }
        }
    }

    private double computeMinHash(int[][] minHashes, int numOfHashes) {
        int identicalMinHashes = 0;
        for (int i = 0; i < numOfHashes; i++){
            if (minHashes[0][i] == minHashes[1][i]) {
                identicalMinHashes++;
            }
        }
        return (1.0 * identicalMinHashes) / numOfHashes;
    }

    public double findSimilarities() {
        computeMinHashForSet(setA, 0);
        computeMinHashForSet(setB, 1);
        return computeMinHash(minHashes, numOfHashes);
    }

    public static void main(String[] args){
        Set<String> setA = new HashSet<String>();
        setA.add("THIS");
        setA.add("IS ");
        setA.add("THE");
        setA.add("CASE");

        Set<String> setB = new HashSet<String>();
        setB.add("THAT");
        setB.add("IS ");
        setB.add("THE");
        setB.add("CASE");

        double errorFactor = 0.001;

        LSHMinHash<String> minHash = new LSHMinHash<String>(errorFactor, setA, setB);
        System.out.println(minHash.findSimilarities());
    }
}