LeetCode: 187. Repeated DNA Sequences
解题思路
All DNA is composed of a series of nucleotides abbreviated as A
, C
, G
, and T
, for example: "ACGAATTCCG"
. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
解题思路
四个核苷酸正好用二进制的两位表示,十个核苷酸序列用 20 位表示,即一个无符号整形(32位)。然后判断这些数字是否重复出现0。
AC 代码
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
// 对核苷酸标号
map<char, unsigned int> nucleotide2Idx = {{'A', 0b00}, {'C', 0b01}, {'T', 0b10}, {'G', 0b11}};
// 有效位的掩码:十个核苷酸,有效位是二十位
unsigned int rightNineMask = 0x000fffff;
size_t beg = 0, end = 0;
unsigned int curSeq=0;
set<unsigned int> occuredSeq;
set<string> ans;
for(size_t i = 0; i < s.size(); ++i)
{
curSeq = ((curSeq << 2) | nucleotide2Idx[s[i]]);
curSeq = curSeq & rightNineMask;
++end;
if(end - beg < 10)
{
continue;
}
if(end - beg > 10) ++beg;
if(occuredSeq.find(curSeq) == occuredSeq.end())
{
occuredSeq.insert(curSeq);
}
else
{
ans.insert(s.substr(beg, 10));
}
}
return vector<string>(ans.begin(), ans.end());
}
};