原文:https://blog.csdn.net/qq_18495465/article/details/78500472
BloomFilter(大数据去重)+Redis(持久化)策略
 背景
 之前在重构一套文章爬虫系统时,其中有块逻辑是根据文章标题去重,原先去重的方式是,插入文章之前检查待插入文章的标题是否在ElasticSearch中存在,这无疑加重了ElasticSearch的负担也势必会影响程序的性能!
BloomFilter算法
 简介:布隆过滤器实际上是一个很长的二进制向量和一系列随机映射函数。布隆过滤器可以用于检索一个元素是否在一个集合中。它的优点是空间效率和查询时间都远远超过一般的算法,缺点是有一定的误识别率和删除困难。
 原理:当一个元素被加入集合时,通过K个散列函数将这个元素映射成一个位数组中的K个点,把它们置为1。检索时,我们只要看看这些点是不是都是1就(大约)知道集合中有没有它了:如果这些点有任何一个0,则被检元素一定不在;如果都是1,则被检元素很可能在。
 优点:相比于其它的数据结构,布隆过滤器在空间和时间方面都有巨大的优势。布隆过滤器存储空间和插入/查询时间都是常数(O(k))。而且它不存储元素本身,在某些对保密要求非常严格的场合有优势。
 缺点:一定的误识别率和删除困难。
 结合以上几点及去重需求(容忍误判,会误判在,在则丢,无妨),决定使用BlomFilter。
 思想
 位数组和k个散列函数
位数组
 初始状态时,BloomFilter是一个长度为m的位数组,每一位都置为0。 
添加元素(k个独立的hash函数)
 添加元素时,对x使用k个哈希函数得到k个哈希值,对m取余,对应的bit位设置为1。 
判断元素是否存在
 判断y是否属于这个集合,对y使用k个哈希函数得到k个哈希值,对m取余,所有对应的位置都是1,则认为y属于该集合(哈希冲突,可能存在误判),否则就认为y不属于该集合。
 图中y1不是集合中的元素,y2属于这个集合或者是一个false positive。 
  
 BloomFilter有以下参数:
m 位数组的长度
 n 加入其中元素的数量
 k 哈希函数的个数
 f False Positive
 问题:如何根据输入元素个数n,确定位数组的大小m和哈希函数的个数k?
BloomFilter的f满足下列公式:
  
 在给定m和n时,能够使f最小化的k值为:
  
 此时给出的f为:
  
 根据以上公式,对于任意给定的f,我们有:
  
 同时,我们需要k个hash来达成这个目标:
  
 由于k必须取整数,我们在Bloom Filter的程序实现中,还应该使用上面的公式来求得实际的f:
  
 以上3个公式是程序实现Bloom Filter的关键公式。
 故可以通过调节参数,使用Hash函数的个数,位数组的大小来降低失误率。
实现
 可以使用JDK自带的BitSet来实现,但存在两个问题:OOM和持久化问题。
 结合Redis的BitMap能够解决,唯一需要注意的是Redis的BitMap只支持2^32大小,对应到内存也就是512MB,数组的下标最大只能是2^32-1。不过这个限制可以通过构建多个Redis的Bitmap通过hash取模的方式分散一下即可。万分之一的误判率,512MB可以放下2亿条数据。
 好了,扯了这么多,贴代码!(注:在MagnusS/Java-BloomFilter的基础上加上了Redis持久化的实现)
@Component
 public class BloomFilter<E> {
    @Autowired
     private RedisTemplate<String, E> redisTemplate;
    @Value("${bloomfilter.expireDays}")
     private long expireDays;
    // total length of the Bloom filter
     private int sizeOfBloomFilter;
     // expected (maximum) number of elements to be added
     private int expectedNumberOfFilterElements;
     // number of hash functions
     private int numberOfHashFunctions;
     // encoding used for storing hash values as strings
     private final Charset charset = Charset.forName("UTF-8");
     // MD5 gives good enough accuracy in most circumstances. Change to SHA1 if it's needed
     private static final String hashName = "MD5";
     private static final MessageDigest digestFunction;
    // The digest method is reused between instances
     static {
         MessageDigest tmp;
         try {
             tmp = java.security.MessageDigest.getInstance(hashName);
         } catch (NoSuchAlgorithmException e) {
             tmp = null;
         }
         digestFunction = tmp;
     }
    public BloomFilter() {
         this(0.0001, 600000);
     }
    /**
      * Constructs an empty Bloom filter.
      *
      * @param m is the total length of the Bloom filter.
      * @param n is the expected number of elements the filter will contain.
      * @param k is the number of hash functions used.
      */
     public BloomFilter(int m, int n, int k) {
         this.sizeOfBloomFilter = m;
         this.expectedNumberOfFilterElements = n;
         this.numberOfHashFunctions = k;
     }
    /**
      * Constructs an empty Bloom filter with a given false positive probability.
      * The size of bloom filter and the number of hash functions is estimated
      * to match the false positive probability.
      *
      * @param falsePositiveProbability is the desired false positive probability.
      * @param expectedNumberOfElements is the expected number of elements in the Bloom filter.
      */
     public BloomFilter(double falsePositiveProbability, int expectedNumberOfElements) {
         this((int) Math.ceil((int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2))) * expectedNumberOfElements / Math.log(2)), // m = ceil(kn/ln2)
                 expectedNumberOfElements,
                 (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))); // k = ceil(-ln(f)/ln2)
     }
    /**
      * Adds an object to the Bloom filter. The output from the object's
      * toString() method is used as input to the hash functions.
      *
      * @param element is an element to register in the Bloom filter.
      */
     public void add(E element) {
         add(element.toString().getBytes(charset));
     }
    /**
      * Adds an array of bytes to the Bloom filter.
      *
      * @param bytes array of bytes to add to the Bloom filter.
      */
     public void add(byte[] bytes) {
         if (redisTemplate.opsForValue().get(RedisConsts.CRAWLER_BLOOMFILTER) == null) {
             redisTemplate.opsForValue().setBit(RedisConsts.CRAWLER_BLOOMFILTER, 0, false);
             redisTemplate.expire(RedisConsts.CRAWLER_BLOOMFILTER, expireDays, TimeUnit.DAYS);
         }
        int[] hashes = createHashes(bytes, numberOfHashFunctions);
         for (int hash : hashes) {
             redisTemplate.opsForValue().setBit(RedisConsts.CRAWLER_BLOOMFILTER, Math.abs(hash % sizeOfBloomFilter), true);
         }
     }
    /**
      * Adds all elements from a Collection to the Bloom filter.
      *
      * @param c Collection of elements.
      */
     public void addAll(Collection<? extends E> c) {
         for (E element : c) {
             add(element);
         }
     }
    /**
      * Returns true if the element could have been inserted into the Bloom filter.
      * Use getFalsePositiveProbability() to calculate the probability of this
      * being correct.
      *
      * @param element element to check.
      * @return true if the element could have been inserted into the Bloom filter.
      */
     public boolean contains(E element) {
         return contains(element.toString().getBytes(charset));
     }
    /**
      * Returns true if the array of bytes could have been inserted into the Bloom filter.
      * Use getFalsePositiveProbability() to calculate the probability of this
      * being correct.
      *
      * @param bytes array of bytes to check.
      * @return true if the array could have been inserted into the Bloom filter.
      */
     public boolean contains(byte[] bytes) {
         int[] hashes = createHashes(bytes, numberOfHashFunctions);
         for (int hash : hashes) {
             if (!redisTemplate.opsForValue().getBit(RedisConsts.CRAWLER_BLOOMFILTER, Math.abs(hash % sizeOfBloomFilter))) {
                 return false;
             }
         }
         return true;
     }
    /**
      * Returns true if all the elements of a Collection could have been inserted
      * into the Bloom filter. Use getFalsePositiveProbability() to calculate the
      * probability of this being correct.
      *
      * @param c elements to check.
      * @return true if all the elements in c could have been inserted into the Bloom filter.
      */
     public boolean containsAll(Collection<? extends E> c) {
         for (E element : c) {
             if (!contains(element)) {
                 return false;
             }
         }
         return true;
     }
    /**
      * Generates digests based on the contents of an array of bytes and splits the result into 4-byte int's and store them in an array. The
      * digest function is called until the required number of int's are produced. For each call to digest a salt
      * is prepended to the data. The salt is increased by 1 for each call.
      *
      * @param data   specifies input data.
      * @param hashes number of hashes/int's to produce.
      * @return array of int-sized hashes
      */
     public static int[] createHashes(byte[] data, int hashes) {
         int[] result = new int[hashes];
        int k = 0;
         byte salt = 0;
         while (k < hashes) {
             byte[] digest;
             synchronized (digestFunction) {
                 digestFunction.update(salt);
                 salt++;
                 digest = digestFunction.digest(data);
             }
            for (int i = 0; i < digest.length / 4 && k < hashes; i++) {
                 int h = 0;
                 for (int j = (i * 4); j < (i * 4) + 4; j++) {
                     h <<= 8;
                     h |= ((int) digest[j]) & 0xFF;
                 }
                 result[k] = h;
                 k++;
             }
         }
         return result;
     }
    public int getSizeOfBloomFilter() {
         return this.sizeOfBloomFilter;
     }
    public int getExpectedNumberOfElements() {
         return this.expectedNumberOfFilterElements;
     }
    public int getNumberOfHashFunctions() {
         return this.numberOfHashFunctions;
     }
    /**
      * Compares the contents of two instances to see if they are equal.
      *
      * @param obj is the object to compare to.
      * @return True if the contents of the objects are equal.
      */
     @Override
     public boolean equals(Object obj) {
         if (obj == null) {
             return false;
         }
         if (getClass() != obj.getClass()) {
             return false;
         }
         final BloomFilter<E> other = (BloomFilter<E>) obj;
         if (this.sizeOfBloomFilter != other.sizeOfBloomFilter) {
             return false;
         }
         if (this.expectedNumberOfFilterElements != other.expectedNumberOfFilterElements) {
             return false;
         }
         if (this.numberOfHashFunctions != other.numberOfHashFunctions) {
             return false;
         }
         return true;
     }
    /**
      * Calculates a hash code for this class.
      *
      * @return hash code representing the contents of an instance of this class.
      */
     @Override
     public int hashCode() {
         int hash = 7;
         hash = 61 * hash + this.sizeOfBloomFilter;
         hash = 61 * hash + this.expectedNumberOfFilterElements;
         hash = 61 * hash + this.numberOfHashFunctions;
         return hash;
     }
    public static void main(String[] args) {
         BloomFilter<String> bloomFilter = new BloomFilter<>(0.0001, 600000);
         System.out.println(bloomFilter.getSizeOfBloomFilter());
         System.out.println(bloomFilter.getNumberOfHashFunctions());
     }
 }
 --------------------- 
 作者:Joker_Coding 
 来源:CSDN 
 原文:https://blog.csdn.net/qq_18495465/article/details/78500472 
 版权声明:本文为博主原创文章,转载请附上博文链接!



京公网安备 11010502036488号