Efficient Collection Equality Comparison
This is a tiny Java library with a single method for testing if two Java collections contain the same elements:
se.llbit.util.CollectionComparison.isEqualCollection(a, b)
The method returns true if, and only if, the two input collections A and B contain the same elements, compared using x.equals(y), with the same number of occurrences of each element in both collections. The algorithm is efficient and well tested. A sketch of a correctness proof is available below.
##Background
I have seen many implementations of collection comparison algorithms in various projects. Many of those implementations were incorrect in some way. Comparing collections is more difficult than it first seems, and getting everything right is not easy.
Common incorrect Java solutions are:
a.containsAll(b)- completely wrong, only checks thatais a subset ofbif neither collection has duplicates.a.containsAll(b) && b.containsAll(a)- works only whenaandbhave no duplicates, i.e. they are sets.a.size() == b.size() && a.containsAll(b) && b.containsAll(a)- still only works whenaandbhave no duplicates.
A correct implementation of collection equality comparison is available in Apache Commons Collections 4.0, CollectionUtils.isEqualCollection. However, when I last looked at the Apache Commons algorithm I found that some performance improvements could be made to that algorithm. I documented an improved algorithm in this blog post. Since then I have further improved the algorithm. The final version is:
boolean isEqualCollection(Collection<?> a, Collection<?> b) {
if (a.size() != b.size()) {
return false;
}
Map<Object, Integer> map = new HashMap<Object, Integer>();
for (Object o : a) {
Integer val = map.get(o);
int count;
if (val != null) {
count = val.intValue();
} else {
count = 0;
}
map.put(o, Integer.valueOf(count + 1));
}
for (Object o : b) {
Integer val = map.get(o);
int count;
if (val != null) {
count = val.intValue();
if (count == 0) {
return false;
}
} else {
return false;
}
map.put(o, Integer.valueOf(count - 1));
}
return true;
}
My original blog post describes what I improved in the Apache Commons algorithm, but the blog post did not provide tests for my algorithm. After I posted that blog post I have copied the code in several of my projects, and I figured it would be useful to create a public Git repository with tests and give it an Open Source license. The code and tests in this repository are licensed under the Modified BSD License. The tests check some special cases that incorrect collection comparison algorithms fail to handle, and the tests cover all statements in the implementation.
##Correctness Proof Sketch
I have made a sketch of a proof for correctness here. I'm not used to writing proofs, so I hope it's not too poorly structured. If I made an error please report a bug on the issue tracker of this repository!
Definition: Two collections A and B are equal if and only if each element x in A occurs the same number of times in B, and each element y in B occurs the same number of times in A. Two elements x and y are considered equal if x.equals(y) returns true and x.equals(y) is equal to y.equals(x).
A description of how the algorithm works:
- An initial test checks that the number of elements in
A, denotedsize(A), is equal to the number of elements inB. Ifsize(A) != size(B)then the algorithm is done and returnsfalse. - A map containing occurrence counts of elements in
Ais built. If|x|denotes the number of occurrences of an elementxin collectionA, thenmap.get(x) == |x|for eachxinAafter the first loop. - The second loop updates the occurrence map by iterating over all elements in
B. For each elementoinB, the following happens:- (1) If there was no record for element
oin the occurrence map, then the algorithm exits with return valuefalsebecauseooccurred inBand notA. - Otherwise, the algorithm checks the occurrence count stored in the map for element
oand there is a choice:- (2) If the occurrence count was zero, then
ooccurs more often inBthanAand the method exits with return valuefalse. - Otherwise, the record for element
oin the occurrence map is updated by decreasing the count by one.
- (2) If the occurrence count was zero, then
- (1) If there was no record for element
- If the execution passes the second loop without returning, then the collections are equal so
trueis returned.
Lemma 1: The second loop ensures that every element in B occurs in A at least as many times as in B.
Proof: Suppose that some element x occurs in B but not in A, then there is no record of x in the occurrence map before the second loop, and if execution reaches the first occurrence of x in B then case (1) returns false. Otherwise, if there are n occurrences of x in A, and n+k occurrences of x in B, and k>0, then when the n+1th occurrence of x in B is encountered case (2) returns false.
Lemma: The algorithm returns true if and only if A and B are equal.
Proof: There are three possibilities:
- Either an element
xoccurs more times inBthan inA. - Or, an element
xoccurs more times inAthan inB. - Otherwise
AandBare equal.
Consider each case in turn:
- If an element
xoccurs more often inBthanAthen:- If
size(A) != size(B)the initial size check returnsfalse. - Otherwise the second loop returns
falsedue to either case (1) or (2) in the second loop at the latest wheno == x.
- If
- Otherwise, if an element
xoccurs more often inAthanBthen:- If
size(A) != size(B)the initial size check returnsfalse. - Otherwise there must be some other element
y != xthat occurs more often inBthanAand either case (1) or (2) in the second loop returnsfalse.
- If
- Otherwise,
AandBare equal. The execution reaches the last statement and returnstrue.
##Using the library
The library is available from The Central Repository!
To build using the tiny collection comparison library, just declare an extra dependency in your build script. For example as in this Gradle build:
apply plugin: 'maven'
apply plugin: 'java'
repositories {
mavenCentral()
}
dependencies {
compile 'se.llbit:collcompare:1.0.0'
}
You can then use the collection equality method in your code:
import static se.llbit.util.CollectionComparison.isEqualCollection;
...
isEqualCollection(a, b);
##Version History
- 1.0.0 Initial version.
- 1.0.1 Improved algorithm.
- Removed a redundant loop that checked the occurrence map after the second loop.
- Added correctness proof in README.
- Added an additional test case to reach 100% test coverage.
- Made the
CollectionComparisonclassabstract.