I have also posted this reproduction walkthrough on my own blog, where it has received a fair number of views and likes.
6.1 Install CMake (>= 2.8):
$ sudo apt install cmake
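6.2 Install Boost
$ apt-cache search boost
$ sudo apt-get install libboost-all-dev
The first command lists all available Boost packages; the second installs the Boost development libraries.
6.3 Clone the project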
$ git clone https://github.com/jermp/dint
6.4 Building the code
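$ git submodule init
$ git submodule update
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release
$ make -j[number of jobs]    // I used make -j2 here
Note: I originally ran this step on a single-core, 2 GB student machine. It could not cope with the build, so I temporarily rented a 4-core, 8 GB server.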
6.5 Build the index
Usage ./create_freq_index:
    <index_type> <collection_basename> [output_filename] [--check]
$ ./create_freq_index single_rect_dint ../test/test_data/test_collection single_rect_dint.bin    // rectangular dictionary (1)
$ ./create_freq_index single_packed_dint ../test/test_data/test_collection single_packed_dint.bin    // packed dictionary (2)
$ ./create_freq_index multi_packed_dint ../test/test_data/test_collection multi_packed_dint.bin    // multi packed dictionary (3)
$ ./queries single_packed_dint and single_packed_dint.bin < ../test/test_data/queries    // performs the boolean AND queries contained in the data file queries over the index serialized to single_packed_dint.bin (4)
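The three create_freq_index calls differ only in the index type, so they can be generated in a loop. Below is a minimal Python sketch, assuming it is run from the build directory. The helper name build_commands is hypothetical and not part of the DINT repository.

```python
import os
import subprocess

INDEX_TYPES = ["single_rect_dint", "single_packed_dint", "multi_packed_dint"]


def build_commands(collection="../test/test_data/test_collection"):
    """Return one create_freq_index argument list per index type.

    Hypothetical helper: it only mirrors the three commands shown above.
    """
    return [
        ["./create_freq_index", index_type, collection, index_type + ".bin"]
        for index_type in INDEX_TYPES
    ]


if __name__ == "__main__":
    for cmd in build_commands():
        print(" ".join(cmd))
        # Only run the binary if it has actually been built in this directory.
        if os.path.exists(cmd[0]):
            subprocess.run(cmd, check=True)
```

When the binary is missing, the script only prints the commands, so it is safe to try outside the build directory first.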
6.6 Result analysis
(1) Rectangular dictionary
root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./create_freq_index single_rect_dint ../test/test_data/test_collection single_rect_dint.bin
2020-06-26 14:56:57: building or loading dictionary for docs...
2020-06-26 14:56:57: DONE
2020-06-26 14:56:57: building or loading dictionary for freqs...
2020-06-26 14:56:57: DONE
2020-06-26 14:56:57: Processing 10000 documents...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
2020-06-26 14:56:57: Encoded 113306 sequences, 3327520 postings
2020-06-26 14:57:00: Usage distribution for docs:
rare: 2 (0.00305176%)
entries of size 1: 6539(9.97772%)
entries of size 2: 21422(32.6874%)
entries of size 4: 24429(37.2757%)
entries of size 8: 10441(15.9317%)
entries of size 16: 2698(4.11682%)
freq.: 5 (0.00762939%)
2020-06-26 14:57:00: Usage distribution for freqs:
rare: 2 (0.00305176%)
entries of size 1: 833(1.27106%)
entries of size 2: 8114(12.381%)
entries of size 4: 24011(36.6379%)
entries of size 8: 18225(27.8091%)
entries of size 16: 14346(21.8903%)
freq.: 5 (0.00762939%)
2020-06-26 14:57:00: single_rect_dint collection built in 2.59589 seconds
{"type": "single_rect_dint", "worker_threads": 4, "construction_time": 2.59589, "construction_user_time": 2.56776}
<TOP>: 12749754
    m_params: 5
    m_endpoints: 101096
        m_bits: 101088
    m_lists: 3735725
    m_docs_dict: 4456456
        m_table: 4456456
    m_freqs_dict: 4456456
        m_table: 4456456
2020-06-26 14:57:00: Documents: 2470345 bytes, 5.93919 bits per element
2020-06-26 14:57:00: Frequencies: 1265380 bytes, 3.04222 bits per element
2020-06-26 14:57:00: Index size: 0.0118741 [GiB]
{"type": "single_rect_dint", "size": 12749754, "docs_size": 2470345, "freqs_size": 1265380, "bits_per_doc": 5.93919, "bits_per_freq": 3.04222}
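The "bits per element" figures in this log are consistent with dividing the encoded size in bits by the number of postings. The sketch below re-derives them from the final JSON summary line; the function name bits_per_element is my own, and the formula is inferred from the printed numbers rather than taken from the DINT source.

```python
import json

# Summary line and posting count copied from the single_rect_dint log above.
summary = json.loads(
    '{"type": "single_rect_dint", "size": 12749754, "docs_size": 2470345, '
    '"freqs_size": 1265380, "bits_per_doc": 5.93919, "bits_per_freq": 3.04222}'
)
num_postings = 3327520


def bits_per_element(num_bytes, num_elements):
    """Average number of bits spent per encoded integer."""
    return num_bytes * 8 / num_elements


docs_bpe = bits_per_element(summary["docs_size"], num_postings)
freqs_bpe = bits_per_element(summary["freqs_size"], num_postings)
index_gib = summary["size"] / 2**30  # bytes to GiB

print(f"docs:  {docs_bpe:.5f} bits per element")
print(f"freqs: {freqs_bpe:.5f} bits per element")
print(f"index: {index_gib:.7f} GiB")
```

The same check can be applied to the other two index types by swapping in their summary lines.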
(2) Packed dictionary
root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./create_freq_index single_packed_dint ../test/test_data/test_collection single_packed_dint.bin
2020-06-26 15:12:54: building or loading dictionary for docs...
2020-06-26 15:12:54: DONE
2020-06-26 15:12:54: building or loading dictionary for freqs...
2020-06-26 15:12:54: DONE
2020-06-26 15:12:54: Processing 10000 documents...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
2020-06-26 15:12:54: Encoded 113306 sequences, 3327520 postings
2020-06-26 15:12:56: Usage distribution for docs:
2020-06-26 15:12:56: Usage distribution for freqs:
2020-06-26 15:12:56: single_packed_dint collection built in 2.16824 seconds
{"type": "single_packed_dint", "worker_threads": 4, "construction_time": 2.16824, "construction_user_time": 2.13624}
<TOP>: 7154382
    m_params: 5
    m_endpoints: 101096
        m_bits: 101088
    m_lists: 3735725
    m_docs_dict: 1268264
        m_offsets: 262152
        m_table: 1006112
    m_freqs_dict: 2049276
        m_offsets: 262152
        m_table: 1787124
2020-06-26 15:12:56: Documents: 2470345 bytes, 5.93919 bits per element
2020-06-26 15:12:56: Frequencies: 1265380 bytes, 3.04222 bits per element
2020-06-26 15:12:56: Index size: 0.00666304 [GiB]
{"type": "single_packed_dint", "size": 7154382, "docs_size": 2470345, "freqs_size": 1265380, "bits_per_doc": 5.93919, "bits_per_freq": 3.04222}
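The single_rect_dint and single_packed_dint logs report identical bits per element but different index sizes. Comparing the two memory breakdowns suggests why: the encoded lists (m_lists) have the same size, and the whole difference sits in the dictionaries. The sketch below checks this with the byte counts copied from the logs; the dictionary keys and the helper dict_bytes are my own naming.

```python
# Byte counts copied from the two memory breakdowns printed above.
breakdown = {
    "single_rect_dint": {
        "total": 12749754, "lists": 3735725,
        "docs_dict": 4456456, "freqs_dict": 4456456,
    },
    "single_packed_dint": {
        "total": 7154382, "lists": 3735725,
        "docs_dict": 1268264, "freqs_dict": 2049276,
    },
}


def dict_bytes(entry):
    """Bytes taken by the docs and freqs dictionaries together."""
    return entry["docs_dict"] + entry["freqs_dict"]


rect = breakdown["single_rect_dint"]
packed = breakdown["single_packed_dint"]

total_saving = rect["total"] - packed["total"]
dict_saving = dict_bytes(rect) - dict_bytes(packed)

for name, entry in breakdown.items():
    share = dict_bytes(entry) / entry["total"]
    print(f"{name}: dictionaries take {share:.1%} of the index")
print(f"total saving: {total_saving} bytes, dictionary saving: {dict_saving} bytes")
```

On a collection this small the dictionaries make up a large share of the file, so the total index size says more about dictionary layout than about how well the postings compress.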
(3) Multi packed dictionary
root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./create_freq_index multi_packed_dint ../test/test_data/test_collection multi_packed_dint.bin
2020-06-26 15:14:02: building or loading dictionary for docs...
2020-06-26 15:14:02: DONE
2020-06-26 15:14:02: building or loading dictionary for freqs...
2020-06-26 15:14:02: DONE
2020-06-26 15:14:02: Processing 10000 documents...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
2020-06-26 15:14:02: Encoded 113306 sequences, 3327520 postings
2020-06-26 15:14:21: Usage distribution for docs:
2020-06-26 15:14:21: Usage distribution for freqs:
2020-06-26 15:14:21: multi_packed_dint collection built in 19.1152 seconds
{"type": "multi_packed_dint", "worker_threads": 4, "construction_time": 19.1152, "construction_user_time": 19.0864}
<TOP>: 13890400
    m_params: 5
    m_endpoints: 96248
        m_bits: 96240
    m_lists: 3006531
    m_docs_dict: 5804848
        m_start_offsets: 32
        m_offsets: 567252
        m_table: 5237564
    m_freqs_dict: 4982752
        m_start_offsets: 32
        m_offsets: 605364
        m_table: 4377356
2020-06-26 15:14:21: Documents: 1981821 bytes, 4.76468 bits per element
2020-06-26 15:14:21: Frequencies: 1024710 bytes, 2.4636 bits per element
2020-06-26 15:14:21: Index size: 0.0129364 [GiB]
{"type": "multi_packed_dint", "size": 13890400, "docs_size": 1981821, "freqs_size": 1024710, "bits_per_doc": 4.76468, "bits_per_freq": 2.4636}
(4) Vroom environment
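Vroom is mainly used to measure encoding and decoding speed. First, we encode all the sequences in the collection:
$ ./encode single_packed_dint ../test/test_data/test_collection.docs --dict dict.test_collection.docs.single_packed.DSF-65536-16 --out test.bin
This gives the following output: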
root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./encode single_packed_dint ../test/test_data/test_collection.docs --dict dict.test_collection.docs.single_packed.DSF-65536-16 --out test.bin
2020-06-26 16:29:15: ./encode single_packed_dint ../test/test_data/test_collection.docs --dict dict.test_collection.docs.single_packed.DSF-65536-16 --out test.bin
2020-06-26 16:29:15: preparing for encoding...
2020-06-26 16:29:15: encoding docs...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
2020-06-26 16:29:16: encoded 113306 lists
2020-06-26 16:29:16: encoded 3327520 integers
2020-06-26 16:29:16: 0.00227213 [GiB]
2020-06-26 16:29:16: bits x integer: 5.86547
{"filename": "../test/test_data/test_collection.docs", "num_sequences": "113306", "num_integers": "3327520", "type": "single_packed_dint", "GiB": "0.00227213", "bpi": "5.86547"}
2020-06-26 16:29:16: writing encoded data...
2020-06-26 16:29:16: DONE
Next, we decode them:
$ ./decode single_packed_dint test.bin --dict dict.test_collection.docs.single_packed.DSF-65536-16
This finally gives the following output:
root@iZbp13o6yrkkck8a5gw2e4Z:~/dint/build# ./decode single_packed_dint test.bin --dict dict.test_collection.docs.single_packed.DSF-65536-16
2020-06-26 21:33:59: ./decode single_packed_dint test.bin --dict dict.test_collection.docs.single_packed.DSF-65536-16
2020-06-26 21:33:59: Dictionary memory: 1.20951 [MiB]
2020-06-26 21:34:00: decoding...
2020-06-26 21:34:00: elapsed time 0.0188706 [sec]
2020-06-26 21:34:00: 5.67106 [ns] x int
2020-06-26 21:34:00: 176333826 ints x [sec]
{"filename": "test.bin", "num_sequences": "113306", "num_integers": "3327520", "type": "single_packed_dint", "tot_elapsed_time": "0.0188706", "ns_x_int": "5.67106", "ints_x_sec": "176333826"}
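The two speed figures in the decode log are two views of the same measurement: elapsed time divided by the number of integers, and its inverse. A short sketch using the values printed above (the variable names are my own) reproduces them up to the rounding of the printed elapsed time.

```python
# Values copied from the decode log above.
num_integers = 3327520
elapsed_sec = 0.0188706

# Time per decoded integer, converted from seconds to nanoseconds.
ns_per_int = elapsed_sec / num_integers * 1e9
# Decoding throughput in integers per second.
ints_per_sec = num_integers / elapsed_sec

print(f"{ns_per_int:.5f} ns per integer")
print(f"{ints_per_sec:.0f} integers per second")
```

Because the whole run takes under 20 ms, a single measurement like this is sensitive to noise; repeating the decode several times would give a steadier number.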
(5) Comparison of results
Based on the experimental results above, we can draw up the following table:
Index | docs[bpi] | freqs[bpi] |
---|---|---|
single_rect | 5.93919 | 3.04222 |
single_packed | 5.93919 | 3.04222 |
multi_packed | 4.76468 | 2.4636 |
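The reproduced results turn out to be close to the results of the authors' own experiments. The remaining deviations mainly come from differences in the runtime environment:
- Authors' setup: Intel i7-7700, 3.6 GHz, Linux 4.4.0, 64-bit, personal computer
- My setup: 4-core CPU, 8 GB RAM, Ubuntu 20.04, 64-bit, cloud server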
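To put the table into relative terms, the sketch below computes how many bits per integer each index type saves compared with single_packed. The numbers are copied from the table; the helper relative_saving is my own and only does the arithmetic.

```python
# Bits per integer copied from the table above.
results = {
    "single_rect": {"docs": 5.93919, "freqs": 3.04222},
    "single_packed": {"docs": 5.93919, "freqs": 3.04222},
    "multi_packed": {"docs": 4.76468, "freqs": 2.4636},
}


def relative_saving(baseline, value):
    """Fraction of bits saved relative to the baseline."""
    return (baseline - value) / baseline


base = results["single_packed"]
for name, bpi in results.items():
    docs = relative_saving(base["docs"], bpi["docs"])
    freqs = relative_saving(base["freqs"], bpi["freqs"])
    print(f"{name:14s} docs: {docs:6.1%}  freqs: {freqs:6.1%}")
```

These savings apply to the encoded lists only. In the logs above, the multi_packed index file is the largest of the three (13890400 bytes), because its dictionaries are bigger on this small test collection.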