LeetCode: 187. Repeated DNA Sequences

解题思路

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]

解题思路

四个核苷酸正好用二进制的两位表示，十个核苷酸序列用 20 位表示，即一个无符号整形(32位)。然后判断这些数字是否重复出现0。

AC 代码

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        // 对核苷酸标号
        map<char, unsigned int> nucleotide2Idx = {{'A', 0b00}, {'C', 0b01}, {'T', 0b10}, {'G', 0b11}};
        // 有效位的掩码：十个核苷酸，有效位是二十位
        unsigned int rightNineMask = 0x000fffff;

        size_t beg = 0, end = 0;
        unsigned int curSeq=0;
        set<unsigned int> occuredSeq;
        set<string> ans;

        for(size_t i = 0; i < s.size(); ++i)
        {
            curSeq = ((curSeq << 2) | nucleotide2Idx[s[i]]);
            curSeq = curSeq & rightNineMask;
            ++end;

            if(end - beg < 10)
            {
                continue;
            }
            if(end - beg > 10) ++beg;

            if(occuredSeq.find(curSeq) == occuredSeq.end())
            {
                occuredSeq.insert(curSeq);
            }
            else
            {
                ans.insert(s.substr(beg, 10));
            }
        }

        return vector<string>(ans.begin(), ans.end());
    }
};