I also posted this reproduction walkthrough on my own blog, where it has received a fair number of views and likes.

6.1 Install CMake (>= 2.8):

$ sudo apt install cmake

6.2 Install Boost

$ apt-cache search boost
$ sudo apt-get install libboost-all-dev

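The first command lists the available Boost packages; the second installs the ones needed (here, all of them via libboost-all-dev).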

6.3 Clone the project

$ git clone https://github.com/jermp/dint

6.4 Building the code

$ cd dint
$ git submodule init
$ git submodule update
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release
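$ make -j[number of jobs]   # I used make -j2

Note: I first tried this step on a student-tier machine with 1 core and 2 GB of RAM. It could not get through the build, so I temporarily rented a server with 4 cores and 8 GB of RAM.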

6.5 Build the indexes

Usage ./create_freq_index:
      <index_type> <collection_basename> [output_filename] [--check]

$ ./create_freq_index single_rect_dint ../test/test_data/test_collection single_rect_dint.bin       # (1) rectangular dictionary
$ ./create_freq_index single_packed_dint ../test/test_data/test_collection single_packed_dint.bin   # (2) packed dictionary
$ ./create_freq_index multi_packed_dint ../test/test_data/test_collection multi_packed_dint.bin     # (3) multi packed dictionary
$ ./queries single_packed_dint and single_packed_dint.bin < ../test/test_data/queries   # runs the boolean AND queries contained in the data file queries over the index serialized to single_packed_dint.bin
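
The three build commands differ only in the index type, so they can be generated in a loop. The following is a minimal sketch, not part of the dint repository: `build_command` and `build_all` are hypothetical helper names, and the script assumes it is started from the `build` directory where `create_freq_index` was compiled. It only uses the arguments that appear in the commands above.

```python
import subprocess

# Index types and collection path as used in the commands above.
INDEX_TYPES = ["single_rect_dint", "single_packed_dint", "multi_packed_dint"]
COLLECTION = "../test/test_data/test_collection"


def build_command(index_type, collection=COLLECTION):
    """Return the argument list for one create_freq_index run (hypothetical helper)."""
    return ["./create_freq_index", index_type, collection, f"{index_type}.bin"]


def build_all(build_dir="."):
    """Run the three builds one after another, stopping at the first failure."""
    for index_type in INDEX_TYPES:
        subprocess.run(build_command(index_type), cwd=build_dir, check=True)


# Print the commands without running them; call build_all() to actually execute.
for index_type in INDEX_TYPES:
    print(" ".join(build_command(index_type)))
```

Printing first and running second makes it easy to compare the generated commands with the hand-typed ones before spending build time on them.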

6.6 Result analysis

(1) Rectangular dictionary

root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./create_freq_index single_rect_dint ../test/test_data/test_collection single_rect_dint.bin
2020-06-26 14:56:57: building or loading dictionary for docs...
2020-06-26 14:56:57: DONE
2020-06-26 14:56:57: building or loading dictionary for freqs...
2020-06-26 14:56:57: DONE
2020-06-26 14:56:57: Processing 10000 documents...

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

2020-06-26 14:56:57: Encoded 113306 sequences, 3327520 postings
2020-06-26 14:57:00: Usage distribution for docs:
rare: 2 (0.00305176%)
entries of size 1: 6539(9.97772%)
entries of size 2: 21422(32.6874%)
entries of size 4: 24429(37.2757%)
entries of size 8: 10441(15.9317%)
entries of size 16: 2698(4.11682%)
freq.: 5 (0.00762939%)
2020-06-26 14:57:00: Usage distribution for freqs:
rare: 2 (0.00305176%)
entries of size 1: 833(1.27106%)
entries of size 2: 8114(12.381%)
entries of size 4: 24011(36.6379%)
entries of size 8: 18225(27.8091%)
entries of size 16: 14346(21.8903%)
freq.: 5 (0.00762939%)
2020-06-26 14:57:00: single_rect_dint collection built in 2.59589 seconds
{"type": "single_rect_dint", "worker_threads": 4, "construction_time": 2.59589, "construction_user_time": 2.56776}
<TOP>: 12749754
    m_params: 5
    m_endpoints: 101096
        m_bits: 101088
    m_lists: 3735725
    m_docs_dict: 4456456
        m_table: 4456456
    m_freqs_dict: 4456456
        m_table: 4456456
2020-06-26 14:57:00: Documents: 2470345 bytes, 5.93919 bits per element
2020-06-26 14:57:00: Frequencies: 1265380 bytes, 3.04222 bits per element
2020-06-26 14:57:00: Index size: 0.0118741 [GiB]
{"type": "single_rect_dint", "size": 12749754, "docs_size": 2470345, "freqs_size": 1265380, "bits_per_doc": 5.93919, "bits_per_freq": 3.04222}
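
The "bits per element" figures in this log can be cross-checked by hand. The sketch below assumes (this is my reading of the log, not something the tool documents here) that bits per element is the payload size in bits divided by the number of postings, and that the index size in GiB is the total byte count divided by 2^30. The input is the JSON summary line printed above.

```python
import json

# JSON summary line copied verbatim from the single_rect_dint run above.
summary = json.loads(
    '{"type": "single_rect_dint", "size": 12749754, "docs_size": 2470345, '
    '"freqs_size": 1265380, "bits_per_doc": 5.93919, "bits_per_freq": 3.04222}'
)
POSTINGS = 3327520  # from "Encoded 113306 sequences, 3327520 postings"


def bits_per_element(num_bytes, num_elements):
    """Assumed definition: total bits divided by the number of encoded integers."""
    return num_bytes * 8 / num_elements


docs_bpe = bits_per_element(summary["docs_size"], POSTINGS)
freqs_bpe = bits_per_element(summary["freqs_size"], POSTINGS)
index_gib = summary["size"] / 2**30

print(f"docs:  {docs_bpe:.5f} bits per element (log: {summary['bits_per_doc']})")
print(f"freqs: {freqs_bpe:.5f} bits per element (log: {summary['bits_per_freq']})")
print(f"index: {index_gib:.7f} GiB")
```

Under these assumptions the recomputed values agree with the logged ones to the printed precision, which suggests the assumed definitions are the ones the tool uses.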

(2) Packed dictionary

root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./create_freq_index single_packed_dint ../test/test_data/test_collection single_packed_dint.bin
2020-06-26 15:12:54: building or loading dictionary for docs...
2020-06-26 15:12:54: DONE
2020-06-26 15:12:54: building or loading dictionary for freqs...
2020-06-26 15:12:54: DONE
2020-06-26 15:12:54: Processing 10000 documents...

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

2020-06-26 15:12:54: Encoded 113306 sequences, 3327520 postings
2020-06-26 15:12:56: Usage distribution for docs:
2020-06-26 15:12:56: Usage distribution for freqs:
2020-06-26 15:12:56: single_packed_dint collection built in 2.16824 seconds
{"type": "single_packed_dint", "worker_threads": 4, "construction_time": 2.16824, "construction_user_time": 2.13624}
<TOP>: 7154382
    m_params: 5
    m_endpoints: 101096
        m_bits: 101088
    m_lists: 3735725
    m_docs_dict: 1268264
        m_offsets: 262152
        m_table: 1006112
    m_freqs_dict: 2049276
        m_offsets: 262152
        m_table: 1787124
2020-06-26 15:12:56: Documents: 2470345 bytes, 5.93919 bits per element
2020-06-26 15:12:56: Frequencies: 1265380 bytes, 3.04222 bits per element
2020-06-26 15:12:56: Index size: 0.00666304 [GiB]
{"type": "single_packed_dint", "size": 7154382, "docs_size": 2470345, "freqs_size": 1265380, "bits_per_doc": 5.93919, "bits_per_freq": 3.04222}

(3) Multi packed dictionary

root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./create_freq_index multi_packed_dint ../test/test_data/test_collection multi_packed_dint.bin
2020-06-26 15:14:02: building or loading dictionary for docs...
2020-06-26 15:14:02: DONE
2020-06-26 15:14:02: building or loading dictionary for freqs...
2020-06-26 15:14:02: DONE
2020-06-26 15:14:02: Processing 10000 documents...

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

2020-06-26 15:14:02: Encoded 113306 sequences, 3327520 postings
2020-06-26 15:14:21: Usage distribution for docs:
2020-06-26 15:14:21: Usage distribution for freqs:
2020-06-26 15:14:21: multi_packed_dint collection built in 19.1152 seconds
{"type": "multi_packed_dint", "worker_threads": 4, "construction_time": 19.1152, "construction_user_time": 19.0864}
<TOP>: 13890400
    m_params: 5
    m_endpoints: 96248
        m_bits: 96240
    m_lists: 3006531
    m_docs_dict: 5804848
        m_start_offsets: 32
        m_offsets: 567252
        m_table: 5237564
    m_freqs_dict: 4982752
        m_start_offsets: 32
        m_offsets: 605364
        m_table: 4377356
2020-06-26 15:14:21: Documents: 1981821 bytes, 4.76468 bits per element
2020-06-26 15:14:21: Frequencies: 1024710 bytes, 2.4636 bits per element
2020-06-26 15:14:21: Index size: 0.0129364 [GiB]
{"type": "multi_packed_dint", "size": 13890400, "docs_size": 1981821, "freqs_size": 1024710, "bits_per_doc": 4.76468, "bits_per_freq": 2.4636}
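
The three size breakdowns above can be compared more easily once they are put side by side. The sketch below copies the byte counts from the logs and computes how much of each index is taken by the two dictionaries. The helper names are my own, and treating `top` minus the listed components as unexplained overhead is an assumption about how the breakdown is printed.

```python
# Byte counts copied from the three size breakdowns in the logs above.
BREAKDOWNS = {
    "single_rect_dint": {
        "top": 12749754, "m_params": 5, "m_endpoints": 101096,
        "m_lists": 3735725, "m_docs_dict": 4456456, "m_freqs_dict": 4456456,
    },
    "single_packed_dint": {
        "top": 7154382, "m_params": 5, "m_endpoints": 101096,
        "m_lists": 3735725, "m_docs_dict": 1268264, "m_freqs_dict": 2049276,
    },
    "multi_packed_dint": {
        "top": 13890400, "m_params": 5, "m_endpoints": 96248,
        "m_lists": 3006531, "m_docs_dict": 5804848, "m_freqs_dict": 4982752,
    },
}

# docs_size + freqs_size from each JSON summary line.
PAYLOAD = {
    "single_rect_dint": 2470345 + 1265380,
    "single_packed_dint": 2470345 + 1265380,
    "multi_packed_dint": 1981821 + 1024710,
}


def dictionary_share(b):
    """Fraction of the whole index occupied by the docs and freqs dictionaries."""
    return (b["m_docs_dict"] + b["m_freqs_dict"]) / b["top"]


def unaccounted_bytes(b):
    """Bytes in the total that are not covered by the listed top-level components."""
    return b["top"] - sum(v for k, v in b.items() if k != "top")


for name, b in BREAKDOWNS.items():
    lists_match = b["m_lists"] == PAYLOAD[name]
    print(f"{name}: dictionaries = {dictionary_share(b):.1%} of the index, "
          f"m_lists == docs + freqs: {lists_match}, "
          f"unaccounted: {unaccounted_bytes(b)} bytes")
```

Read this way, the numbers suggest that on this small test collection the dictionaries dominate the total index size: multi_packed_dint has the smallest encoded lists but the largest index overall, because its dictionaries are the largest of the three.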

(4) Vroom environment

The Vroom environment is mainly used to measure encoding and decoding speed. First, we encode all the sequences in the collection:

$ ./encode single_packed_dint ../test/test_data/test_collection.docs --dict dict.test_collection.docs.single_packed.DSF-65536-16 --out test.bin

This produces the following output:

root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./encode single_packed_dint ../test/test_data/test_collection.docs --dict dict.test_collection.docs.single_packed.DSF-65536-16 --out test.bin
2020-06-26 16:29:15: ./encode single_packed_dint ../test/test_data/test_collection.docs --dict dict.test_collection.docs.single_packed.DSF-65536-16 --out test.bin
2020-06-26 16:29:15: preparing for encoding...
2020-06-26 16:29:15: encoding docs...

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
2020-06-26 16:29:16: encoded 113306 lists
2020-06-26 16:29:16: encoded 3327520 integers
2020-06-26 16:29:16: 0.00227213 [GiB]
2020-06-26 16:29:16: bits x integer: 5.86547
{"filename": "../test/test_data/test_collection.docs", "num_sequences": "113306", "num_integers": "3327520", "type": "single_packed_dint", "GiB": "0.00227213", "bpi": "5.86547"}
2020-06-26 16:29:16: writing encoded data...
2020-06-26 16:29:16: DONE

Then we decode them:

$ ./decode single_packed_dint test.bin --dict dict.test_collection.docs.single_packed.DSF-65536-16

This gives the following output:

root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./decode single_packed_dint test.bin --dict dict.test_collection.docs.single_packed.DSF-65536-16
2020-06-26 21:33:59: ./decode single_packed_dint test.bin --dict dict.test_collection.docs.single_packed.DSF-65536-16
2020-06-26 21:33:59: Dictionary memory: 1.20951 [MiB]
2020-06-26 21:34:00: decoding...
2020-06-26 21:34:00: elapsed time 0.0188706 [sec]
2020-06-26 21:34:00: 5.67106 [ns] x int
2020-06-26 21:34:00: 176333826 ints x [sec]
{"filename": "test.bin", "num_sequences": "113306", "num_integers": "3327520", "type": "single_packed_dint", "tot_elapsed_time": "0.0188706", "ns_x_int": "5.67106", "ints_x_sec": "176333826"}
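
The speed and size figures reported by encode and decode follow from three raw numbers: the integer count, the encoded size, and the elapsed time. The sketch below recomputes them from the values in the logs. The formulas are my assumption of how the tools derive their output (bits per integer from the GiB figure, nanoseconds per integer and integers per second from the elapsed time); small differences are expected because the logged inputs are rounded.

```python
NUM_INTEGERS = 3327520       # "encoded 3327520 integers"
ENCODED_GIB = 0.00227213     # size reported by ./encode
ELAPSED_SEC = 0.0188706      # "elapsed time" reported by ./decode

# Encoding: bits per integer, assuming 1 GiB = 2**30 bytes.
encoded_bytes = ENCODED_GIB * 2**30
bits_per_int = encoded_bytes * 8 / NUM_INTEGERS

# Decoding: time per integer and throughput.
ns_per_int = ELAPSED_SEC * 1e9 / NUM_INTEGERS
ints_per_sec = NUM_INTEGERS / ELAPSED_SEC

print(f"bits x integer: {bits_per_int:.5f}  (log: 5.86547)")
print(f"ns x int:       {ns_per_int:.5f}  (log: 5.67106)")
print(f"ints x sec:     {ints_per_sec:.0f}  (log: 176333826)")
```

The recomputed values match the logged ones up to rounding, so the decode run corresponds to roughly 176 million integers per second on this machine.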

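(5) Comparison of results

Based on the experimental results above, we can draw up the following table (bpi = bits per integer):

Index           docs [bpi]   freqs [bpi]
single_rect     5.93919      3.04222
single_packed   5.93919      3.04222
multi_packed    4.76468      2.4636

The reproduced results are close to the results reported by the authors. The remaining deviations mainly come from differences in the runtime environment:

  • Authors' setup: Intel i7-7700, 3.6 GHz, Linux 4.4.0, 64-bit, personal computer
  • My setup: 4-core CPU, 8 GB RAM, Ubuntu 20.04, 64-bit, cloud server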
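
The comparison table in this subsection can be regenerated directly from the JSON summary lines that create_freq_index prints at the end of each run, which avoids copying numbers by hand. This is a minimal sketch: `format_table` and `saving` are hypothetical helpers of my own, and the three input strings are the summary lines from the runs in (1) to (3).

```python
import json

# JSON summary lines copied from the end of the three create_freq_index runs.
SUMMARY_LINES = [
    '{"type": "single_rect_dint", "size": 12749754, "docs_size": 2470345, '
    '"freqs_size": 1265380, "bits_per_doc": 5.93919, "bits_per_freq": 3.04222}',
    '{"type": "single_packed_dint", "size": 7154382, "docs_size": 2470345, '
    '"freqs_size": 1265380, "bits_per_doc": 5.93919, "bits_per_freq": 3.04222}',
    '{"type": "multi_packed_dint", "size": 13890400, "docs_size": 1981821, '
    '"freqs_size": 1024710, "bits_per_doc": 4.76468, "bits_per_freq": 2.4636}',
]

rows = [json.loads(line) for line in SUMMARY_LINES]


def format_table(rows):
    """Render index type, docs bpi and freqs bpi as aligned plain-text columns."""
    lines = [f"{'Index':<20}{'docs [bpi]':>12}{'freqs [bpi]':>13}"]
    for r in rows:
        lines.append(f"{r['type']:<20}{r['bits_per_doc']:>12}{r['bits_per_freq']:>13}")
    return "\n".join(lines)


def saving(row, baseline, key):
    """Relative reduction of `key` in `row` compared with `baseline`."""
    return 1 - row[key] / baseline[key]


print(format_table(rows))
multi, single = rows[2], rows[0]
print(f"multi vs single, docs:  {saving(multi, single, 'bits_per_doc'):.1%}")
print(f"multi vs single, freqs: {saving(multi, single, 'bits_per_freq'):.1%}")
```

On these numbers, multi_packed needs about 20% fewer bits per integer for docs and about 19% fewer for freqs than the single-dictionary variants.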