PEEK133 阅读理解

题目链接

PEEK133 阅读理解

题目描述

给定 $N$ 篇短文和 $M$ 次查询。每次查询给出一个单词，要求输出该单词出现过的所有短文的编号（按升序排列）。

解题思路

这是一个经典的信息检索问题：给定一组文档和一个查询词，返回包含该查询词的所有文档。解决此类问题的标准且高效的数据结构是倒排索引 (Inverted Index)。

倒排索引的核心思想是建立一个从“单词”到“文档列表”的映射。常规的索引（正向索引）是从文档指向其包含的单词，而倒排索引则反过来。

具体实现步骤如下：

建立倒排索引：
- 我们选择一个合适的数据结构来存储这个映射。在 C++ 中，可以使用 std::map<string, vector<int>>；在 Java 中，可以使用 HashMap<String, List<Integer>>；在 Python 中，可以使用字典 dict，其值为列表。
- 我们遍历每一篇短文，从编号 1 到 $N$ 。
- 对于第 i 篇短文中的每一个单词 word：
  - 我们在倒排索引中查找 word。
  - 如果 word 不在索引中，我们为它创建一个新的空列表，并将当前短文编号 i 添加进去。
  - 如果 word 已经在索引中，我们检查其对应的列表。为了避免同一个短文编号被重复添加（例如一个单词在一篇文章中出现多次），我们只在列表的最后一个元素不是 i 的情况下，才将 i 添加到列表末尾。因为我们是按短文编号顺序处理的，所以这样可以保证列表中的编号不重复且自动保持升序。
处理查询：
- 索引建立完成后，我们开始处理 $M$ 次查询。
- 对于每个查询单词 query_word：
  - 我们直接在倒排索引中查找 query_word。
  - 如果找到了，就获取其对应的短文编号列表。
  - 将列表中的所有编号按升序（它们已经是升序的）用空格隔开输出。
  - 如果未找到，说明该单词从未在任何短文中出现，按要求输出一个空行。

这种“预处理/查询”的模式非常适合本题。建立索引的开销被分摊到多次快速查询中，使得总体效率非常高。

代码

cpp
java
python

#include <iostream>
#include <vector>
#include <string>
#include <map>
#include <sstream>

using namespace std;

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);

    int n;
    cin >> n;
    cin.ignore(); 

    map<string, vector<int>> inverted_index;

    for (int i = 1; i <= n; ++i) {
        string line;
        getline(cin, line);
        stringstream ss(line);
        int word_count;
        ss >> word_count;
        string word;
        while (ss >> word) {
            if (inverted_index[word].empty() || inverted_index[word].back() != i) {
                inverted_index[word].push_back(i);
            }
        }
    }

    int m;
    cin >> m;
    for (int i = 0; i < m; ++i) {
        string query_word;
        cin >> query_word;
        if (inverted_index.count(query_word)) {
            const auto& doc_ids = inverted_index[query_word];
            for (size_t j = 0; j < doc_ids.size(); ++j) {
                cout << doc_ids[j] << (j == doc_ids.size() - 1 ? "" : " ");
            }
            cout << '\n';
        } else {
            cout << '\n';
        }
    }

    return 0;
}

import java.util.Scanner;
import java.util.Map;
import java.util.HashMap;
import java.util.List;
import java.util.ArrayList;
import java.util.StringJoiner;

public class Main {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);

        int n = sc.nextInt();
        sc.nextLine(); 

        Map<String, List<Integer>> invertedIndex = new HashMap<>();

        for (int i = 1; i <= n; i++) {
            int wordCount = sc.nextInt();
            for (int j = 0; j < wordCount; j++) {
                String word = sc.next();
                invertedIndex.putIfAbsent(word, new ArrayList<>());
                List<Integer> docIds = invertedIndex.get(word);
                if (docIds.isEmpty() || docIds.get(docIds.size() - 1) != i) {
                    docIds.add(i);
                }
            }
            if (sc.hasNextLine()) {
                 sc.nextLine();
            }
        }

        int m = sc.nextInt();
        for (int i = 0; i < m; i++) {
            String queryWord = sc.next();
            if (invertedIndex.containsKey(queryWord)) {
                List<Integer> docIds = invertedIndex.get(queryWord);
                StringJoiner sj = new StringJoiner(" ");
                for (int id : docIds) {
                    sj.add(String.valueOf(id));
                }
                System.out.println(sj.toString());
            } else {
                System.out.println();
            }
        }
    }
}

import sys
from collections import defaultdict

def main():
    n = int(sys.stdin.readline())
    
    inverted_index = defaultdict(list)
    
    for i in range(1, n + 1):
        line = sys.stdin.readline().split()
        # word_count = int(line[0]) # Not strictly needed
        words = line[1:]
        for word in words:
            if not inverted_index[word] or inverted_index[word][-1] != i:
                inverted_index[word].append(i)
                
    m = int(sys.stdin.readline())
    
    for _ in range(m):
        query_word = sys.stdin.readline().strip()
        if query_word in inverted_index:
            print(*inverted_index[query_word])
        else:
            print()

if __name__ == "__main__":
    main()

算法及复杂度

算法：倒排索引。
时间复杂度：
- 建立索引：设所有短文的总单词数为 $W$ ，每个单词的平均长度为 $L$ 。建立索引需要遍历所有单词，每次操作涉及到哈希表/平衡树的插入和查找。总时间复杂度约为 $O(W \cdot L \cdot \log(\text{VocabSize}))$ ，其中 VocabSize 是不同单词的总数。对于哈希表，期望时间是 $O(W \cdot L)$ 。
- 查询：每次查询的时间复杂度为 $O(Q \cdot L \cdot \log(\text{VocabSize}))$ ，其中 $Q$ 是查询次数。对于哈希表，期望时间是 $O(Q \cdot L)$ 。
空间复杂度： $O(W \cdot L)$ ，用于存储倒排索引本身。

题解 | #阅读理解#

PEEK133 阅读理解

题目链接

题目描述

解题思路

代码

算法及复杂度