PDD

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams

Лицензия

Лицензия

Группа

Группа

com.github.jparkie
Идентификатор

Идентификатор

pdd
Последняя версия

Последняя версия

0.1.1
Дата

Дата

Тип

Тип

jar
Описание

Описание

PDD
Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams
Ссылка на сайт

Ссылка на сайт

https://github.com/jparkie/PDD
Система контроля версий

Система контроля версий

https://github.com/jparkie/PDD

Скачать pdd

Имя Файла Размер
pdd-0.1.1.pom
pdd-0.1.1.jar 21 KB
pdd-0.1.1-sources.jar 15 KB
pdd-0.1.1-javadoc.jar 60 KB
Обзор

Как подключить последнюю версию

<!-- https://jarcasting.com/artifacts/com.github.jparkie/pdd/ -->
<dependency>
    <groupId>com.github.jparkie</groupId>
    <artifactId>pdd</artifactId>
    <version>0.1.1</version>
</dependency>
// https://jarcasting.com/artifacts/com.github.jparkie/pdd/
implementation 'com.github.jparkie:pdd:0.1.1'
// https://jarcasting.com/artifacts/com.github.jparkie/pdd/
implementation ("com.github.jparkie:pdd:0.1.1")
'com.github.jparkie:pdd:jar:0.1.1'
<dependency org="com.github.jparkie" name="pdd" rev="0.1.1">
  <artifact name="pdd" type="jar" />
</dependency>
@Grapes(
@Grab(group='com.github.jparkie', module='pdd', version='0.1.1')
)
libraryDependencies += "com.github.jparkie" % "pdd" % "0.1.1"
[com.github.jparkie/pdd "0.1.1"]

Зависимости

test (1)

Идентификатор библиотеки Тип Версия
junit : junit jar 4.11

Модули Проекта

Данный проект не имеет модулей.

ProbabilisticDeDuplicator (PDD)

Build Status codecov

Implementation of Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams as described by Suman K. Bera, Sourav Dutta, Ankur Narang, and Souvik Bhattacherjee.

This library seeks to provide a production-oriented library for probabilistically de-duplicating unbounded data streams in real-time streaming scenarios (i.e. Storm, Spark, Flink, and Samza) while utilizing a fixed bound on memory.

Accordingly, this library implements three novel Bloom Filter algorithms from the prior-mentioned paper all of which are shown to converge faster towards stability and to improve false-negative rates (FNR) by 2 to 300 times in comparison with Stable Bloom Filters.

Downloads

Maven

<dependency>
  <groupId>com.github.jparkie</groupId>
  <artifactId>pdd</artifactId>
  <version>0.1.1</version>
</dependency>

<dependency>
  <groupId>com.github.jparkie</groupId>
  <artifactId>pdd</artifactId>
  <version>0.1.2-SNAPSHOT</version>
</dependency>

Gradle

compile 'com.github.jparkie:pdd:0.1.1'

compile 'com.github.jparkie:pdd:0.1.2-SNAPSHOT'

Usage

This library provides three implementations of a ProbabilisticDeDuplicator:

  1. Biased Sampling based Bloom Filter (BSBF).

  2. Biased Sampling based Bloom Filter with Single Deletion (BSBFSD).

  3. Randomized Load Balanced Biased Sampling based Bloom Filter (RLBSBF).

Basic

final long NUM_BITS = 8 * 8L * 1024L * 1024L;
ProbabilisticDeDuplicator deDuplicator = null;

// Creates a BSBFDeDuplicator with 8MB of RAM and false-positive probability at 3%.
deDuplicator = RLBSBFDeDuplicator.create(NUM_BITS, 0.03D);

// Creates a BSBFDeDuplicator with 8MB of RAM and 5 hashing functions..
deDuplicator = new RLBSBFDeDuplicator(NUM_BITS, 5);

// The number of bits that the ProbabilisticDeDuplicator should use.
// Output: 67108864
System.out.println(deDuplicator.numBits());

// The number of hash functions that the ProbabilisticDeDuplicator should use.
// Output: 5
System.out.println(deDuplicator.numHashFunctions());

// Probabilistically classifies whether a given element is a distinct or a duplicate element.
// This operation does record the result into its history.
// Output: true
System.out.println(deDuplicator.classifyDistinct("Hello".getBytes()));
// Output: false
System.out.println(deDuplicator.classifyDistinct("Hello".getBytes()));

// Probabilistically peeks whether a given element is a distinct or a duplicate element.
// This operation does not record the result into its history.
// Output: true
System.out.println(deDuplicator.peekDistinct("World".getBytes()));
// Output: true
System.out.println(deDuplicator.peekDistinct("World".getBytes()));

// Version 0.1.2+: Calculate the probability that a distinct element of the stream is reported as duplicate.
// Output: Probability between 0 and 1.
System.out.println(deDuplicator.estimateFpp(actuallyDistinctProbability));
// Version 0.1.2+: Calculate the probability that a duplicate element of the stream is reported as distinct.
// Output: Probability between 0 and 1.
System.out.println(deDuplicator.estimateFnp(actuallyDistinctProbability));

// Reset the history of the ProbabilisticDeDuplicator.
deDuplicator.reset();

Binary Serialization

PDD provides serializers for each ProbabilisticDeDuplicator implementation to write to and to read from a versioned binary format.

// After Version 0.1.2:
// final ProbabilisticDeDuplicatorSerializer<RLBSBFDeDuplicator> serializer =
//                 RLBSBFDeDuplicatorSerializers.VERSION_2;

// Before Version 0.1.2:
final RLBSBFDeDuplicatorSerializer serializer = new RLBSBFDeDuplicatorSerializer();

final RLBSBFDeDuplicator deDuplicator = new RLBSBFDeDuplicator(64L, 1);
final Random random = new Random();
final byte[] element = new byte[128];
random.nextBytes(element);
assertTrue(deDuplicator.classifyDistinct(element));
final ByteArrayOutputStream out = new ByteArrayOutputStream();
serializer.writeTo(deDuplicator, out);
out.close();
final ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
final RLBSBFDeDuplicator serialized = serializer.readFrom(in);
in.close();
assertEquals(deDuplicator, serialized);

Java Serialization

PDD overrides the default object serialization for each ProbabilisticDeDuplicator implementation.

final RLBSBFDeDuplicator deDuplicator = new RLBSBFDeDuplicator(64L, 1);
final Random random = new Random();
final byte[] element = new byte[128];
random.nextBytes(element);
assertTrue(deDuplicator.classifyDistinct(element));
final ByteArrayOutputStream out = new ByteArrayOutputStream();
final ObjectOutputStream oos = new ObjectOutputStream(out);
oos.writeObject(deDuplicator);
oos.close();
out.close();
final ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
final ObjectInputStream ois = new ObjectInputStream(in);
final RLBSBFDeDuplicator serialized = (RLBSBFDeDuplicator) ois.readObject();
ois.close();
in.close();
assertEquals(deDuplicator, serialized);

Build

$ git clone https://github.com/jparkie/PDD.git
$ cd PDD/
$ ./gradlew build

References

Bera, S.K., Dutta, S., Narang, A., Bhattacherjee, S.: Advanced Bloom filter based algorithms for efficient approximate data de-duplication in streams (2012)

Версии библиотеки

Версия
0.1.1
0.1.0