题解 | #机器翻译#

题目链接

机器翻译

题目描述

你正在实现一个机器翻译软件的缓存功能。这个缓存有 m 个单元。软件需要依次翻译一篇文章中的 n 个单词（由整数编码表示）。

翻译规则如下：

当需要翻译一个单词时，首先检查它是否在缓存中。
缓存命中：如果单词在缓存中，直接使用，不做任何操作。
缓存未命中：如果单词不在缓存中，需要从外存词典中查找。这时，未命中次数加一，并将该单词存入缓存。
- 如果缓存未满，新单词直接存入一个空闲单元。
- 如果缓存已满，则需要先将最早进入缓存的单词清除，然后再存入新单词。

你需要计算整个翻译过程中，总共有多少次缓存未命中（即查词典的次数）。

输入描述:

第一行输入两个整数 m (缓存容量) 和 n (文章单词数)。
第二行输入 n 个整数，代表文章中的单词。

输出描述:

输出一个整数，表示总的未命中次数。

示例输入:

3 7
1 2 1 5 4 4 1

输出:

解题思路

这个问题本质上是在模拟一个具有固定大小和先进先出（FIFO）淘汰策略的缓存系统。我们可以使用一个队列来完美地模拟缓存的行为，同时用一个 哈希集合（或哈希表） 来快速检查单词是否存在于缓存中，以实现 $\mathcal{O}(1)$ 的查找效率。

算法流程如下：

初始化：
- 创建一个队列 cache_queue，用于存储当前在缓存中的单词，并维持它们的进入顺序。
- 创建一个哈希集合 cache_set，用于快速判断一个单词是否在缓存中。
- 初始化未命中次数 miss_count = 0。
处理单词流：遍历文章中的每一个单词 word。
检查缓存：
- 使用哈希集合 cache_set 检查 word 是否存在。
- 如果 word 在 cache_set 中，则为缓存命中，我们继续处理下一个单词。
- 如果 word 不在 cache_set 中，则为缓存未命中： a. 将 miss_count 加 1。 b. 检查当前缓存大小（即 cache_queue 的大小）是否已达到容量 m。 - 如果已满，需要执行 FIFO 淘汰策略：从队列头部弹出一个最老的单词 old_word，并同时从哈希集合 cache_set 中移除 old_word。 c. 将新单词 word 加入队列尾部，并同时添加到哈希集合 cache_set 中。
返回结果：遍历完所有单词后，返回 miss_count 的值。

这个方法结合了队列的 FIFO 特性和哈希集合的快速查找能力，能够高效地解决问题。

代码

cpp
java
python

#include <iostream>
#include <vector>
#include <queue>
#include <set>

using namespace std;

int main() {
    int m, n;
    cin >> m >> n;
    
    queue<int> cache_queue;
    set<int> cache_set;
    int miss_count = 0;
    
    for (int i = 0; i < n; ++i) {
        int word;
        cin >> word;
        
        if (cache_set.find(word) == cache_set.end()) {
            miss_count++;
            if (cache_queue.size() == m) {
                int old_word = cache_queue.front();
                cache_queue.pop();
                cache_set.erase(old_word);
            }
            cache_queue.push(word);
            cache_set.insert(word);
        }
    }
    
    cout << miss_count << endl;
    
    return 0;
}

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        int m = sc.nextInt();
        int n = sc.nextInt();
        
        Queue<Integer> cacheQueue = new LinkedList<>();
        HashSet<Integer> cacheSet = new HashSet<>();
        int missCount = 0;
        
        for (int i = 0; i < n; i++) {
            int word = sc.nextInt();
            
            if (!cacheSet.contains(word)) {
                missCount++;
                if (cacheQueue.size() == m) {
                    int oldWord = cacheQueue.poll();
                    cacheSet.remove(oldWord);
                }
                cacheQueue.add(word);
                cacheSet.add(word);
            }
        }
        
        System.out.println(missCount);
    }
}

from collections import deque

m, n = map(int, input().split())
words = list(map(int, input().split()))

cache_queue = deque()
cache_set = set()
miss_count = 0

for word in words:
    if word not in cache_set:
        miss_count += 1
        if len(cache_queue) == m:
            old_word = cache_queue.popleft()
            cache_set.remove(old_word)
        cache_queue.append(word)
        cache_set.add(word)
    
print(miss_count)

算法及复杂度

算法：队列与哈希集合模拟
时间复杂度： $\mathcal{O}(N)$ ，其中 N 是文章中的单词总数。对于每个单词，我们都在哈希集合中进行一次查找、可能的插入和删除操作，这些操作的平均时间复杂度都是 $\mathcal{O}(1)$ 。
空间复杂度： $\mathcal{O}(M)$ ，其中 M 是缓存的容量。队列和哈希集合最多存储 M 个单词。