Python Crawler: Reading Call Forwarding (Part 2)

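In the previous post we successfully scraped one chapter of a novel from a web page. Naturally, the next step is to scrape the whole novel. First, the program has to change: instead of exiting after one chapter, it should be able to move on to the next chapter once the current one has been read.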

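Notice that every chapter page has a next-page link at the bottom. Looking at the page source and tidying it up a little (with the &nbsp; entities left out), this part of the HTML has the following format: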

 <div id="footlink">      
   <script type="text/javascript" charset="utf-8" src="/scripts/style5.js"></script>      
   <a href="http://www.quanben.com/xiaoshuo/0/910/59301.html">上一页</a>          
   <a href="http://www.quanben.com/xiaoshuo/0/910/">返回目录</a>          
   <a href="http://www.quanben.com/xiaoshuo/0/910/59303.html">下一页</a>      
 </div>     

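The 上一页 (previous page), 返回目录 (back to contents), and 下一页 (next page) links all sit inside a div whose id is footlink. Matching every link on the page directly would pick up a large number of unrelated links, but there is only one footlink div. So we match that div first, extract it, and then match the <a> links inside it, which leaves exactly three. The last of the three is the URL of the next page. We use it to replace the target URL, so each fetch advances to the following page. The reading logic is simple: after each chapter is shown, the program waits for user input. If the input is quit, the program exits; otherwise it shows the next chapter.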
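The two-step match described above can be sketched as follows. This is a minimal illustration, not the full spider: the `sample_html` string (including its unrelated link) and the helper name `find_next_url` are assumptions introduced here, and the snippet is written so that it should run unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample page, trimmed to the part that matters.
sample_html = u'''
<a href="http://www.quanben.com/">unrelated link</a>
<div id="footlink">
  <a href="http://www.quanben.com/xiaoshuo/0/910/59301.html">上一页</a>
  <a href="http://www.quanben.com/xiaoshuo/0/910/">返回目录</a>
  <a href="http://www.quanben.com/xiaoshuo/0/910/59303.html">下一页</a>
</div>
'''

def find_next_url(page):
    """Return the href of the last link inside the footlink div, or None."""
    # Step 1: isolate the single div with id="footlink".
    foot = re.search(r'<div[^>]*id="footlink"[^>]*>(.*?)</div>', page, re.S)
    if foot is None:
        return None
    # Step 2: collect only the links inside that div.
    links = re.findall(r'<a[^>]*href="(.*?)"', foot.group(1), re.S)
    # The next-page link is the last one in the div.
    return links[-1] if links else None

next_url = find_next_url(sample_html)
```

Returning None when the div or its links are missing lets the caller decide what to do at the end of the book, instead of failing with an exception.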

Background knowledge:

Everything covered in the previous post, plus Python's thread module.
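As a rough sketch of how a background thread can load chapters while the main thread reads them, consider the following. It is an illustration under stated assumptions rather than the spider itself: it uses the `threading` module (the higher-level interface built on top of `thread`, which was renamed `_thread` in Python 3), and `fetch_chapter` and `TOTAL` are hypothetical stand-ins that make up chapter text instead of downloading it.

```python
import threading
import time

pages = []          # chapters loaded so far, shared between both threads
running = True      # set to False to ask the loader to stop
TOTAL = 5           # hypothetical number of chapters available

def fetch_chapter(n):
    # Hypothetical stand-in for a real download.
    return {'title': 'Chapter %d' % n, 'content': 'text of chapter %d' % n}

def load_pages():
    # Background loader: append chapters until told to stop or none remain.
    n = 1
    while running and n <= TOTAL:
        pages.append(fetch_chapter(n))
        n += 1

loader = threading.Thread(target=load_pages)
loader.daemon = True      # do not keep the process alive on exit
loader.start()

# Reader: consume chapters in order as they become available.
shown = []
current = 1
while current <= TOTAL:
    if current <= len(pages):
        shown.append(pages[current - 1]['title'])
        current += 1
    else:
        time.sleep(0.01)  # wait briefly instead of spinning

running = False
loader.join()
```

The reader only ever indexes entries that are already in the list, so the two threads can share it without extra locking in this simple one-producer, one-consumer arrangement.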

Source code:

 # -*- coding: utf-8 -*-      
       
 import urllib2      
 import re      
 import thread      
 import chardet      
       
 class Book_Spider:      
       
     def __init__(self):      
         self.pages = []      
         self.page = 1      
         self.flag = True      
         self.url = "http://www.quanben.com/xiaoshuo/10/10412/2095096.html"      
       
     # Fetch one chapter
     def GetPage(self):      
         myUrl = self.url      
         user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'      
         headers = { 'User-Agent' : user_agent }      
         req = urllib2.Request(myUrl, headers = headers)      
         myResponse = urllib2.urlopen(req)      
         myPage = myResponse.read()      
       
         # Detect the encoding; chardet may return None, so normalize first
         charset = (chardet.detect(myPage)['encoding'] or '').lower()
         if charset == 'utf-8':
             unicodePage = myPage.decode('utf-8')
         else:
             # Anything that is not UTF-8 is treated as GB2312
             unicodePage = myPage.decode('gb2312', 'ignore')
       
         # Extract the chapter title
         my_title = re.search('<div.*?id="title"><h1>(.*?)</h1></div>',unicodePage,re.S)
         my_title = my_title.group(1)
         # Extract the chapter body
         my_content = re.search('<div.*?id="content">(.*?)</div>',unicodePage,re.S)
         my_content = my_content.group(1)
         my_content = my_content.replace("<br />","\n")
         my_content = my_content.replace("&nbsp;"," ")
       
         # Store the chapter title and body in a dict
         onePage = {'title': my_title, 'content': my_content}

         # Find the link area at the bottom of the page
         foot_link = re.search('<div.*?id="footlink">(.*?)</div>',unicodePage,re.S)
         foot_link = foot_link.group(1)
         # The next-page link is the third link in that area
         nextUrl = re.findall(u'<a.*?href="(.*?)">(.*?)</a>',foot_link,re.S)
         nextUrl = nextUrl[2][0]
         # Update the URL to fetch next
         self.url = nextUrl
       
         return onePage      
       
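     # Load chapters in the background until a few are buffered ahead of the reader
     def LoadPage(self):
         while self.flag:
             if len(self.pages) - self.page < 3:
                 try:
                     # Fetch the next chapter
                     myPage = self.GetPage()
                     self.pages.append(myPage)
                 except Exception:
                     # A failed fetch is retried on the next pass of the loop
                     print 'Unable to fetch the page!'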
       
     # Show one chapter
     def ShowPage(self,curPage):
         print curPage['title']
         print curPage['content']
         print "\n"
         user_input = raw_input("Chapter %d. Press Enter for the next chapter, or type quit to exit: " % self.page)
         if user_input == 'quit':
             self.flag = False
         print "\n"
       
     def Start(self):      
         print u'Start reading......\n'

         # Start the loader thread
         thread.start_new_thread(self.LoadPage,())

         # Show the current chapter once it has been loaded
         while self.flag:
             if self.page <= len(self.pages):
                 nowPage = self.pages[self.page-1]
                 self.ShowPage(nowPage)
                 self.page += 1

         print u"Reading session finished"
       
       
 #----------- Program entry point -----------
 print u"""
 ---------------------------------------
    Program:  Reading Call Forwarding
    Version:  0.2
    Author:   angryrookie
    Date:     2014-07-07
    Language: Python 2.7
    Function: press Enter to read the next chapter
 ---------------------------------------
 """

 print u'Press Enter:'
 raw_input(' ')
 myBook = Book_Spider()      
 myBook.Start()     
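The charset handling in `GetPage` relies on `chardet` guessing the encoding. As an alternative sketch that needs no third-party module, one can try a strict UTF-8 decode first and fall back to GBK (a superset of GB2312) when that fails. The helper name `decode_page` is an assumption introduced here, and this is a swapped-in technique, not what the program above does.

```python
# -*- coding: utf-8 -*-

def decode_page(raw):
    """Decode raw page bytes to text: strict UTF-8 first, then GBK."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # GBK covers GB2312; bytes that still cannot be decoded are dropped.
        return raw.decode('gbk', 'ignore')

utf8_text = decode_page(u'第一章'.encode('utf-8'))
gbk_text = decode_page(u'第一章'.encode('gbk'))
```

Trying UTF-8 first works because GBK-encoded Chinese text is very rarely valid UTF-8, so the strict decode fails quickly and the fallback takes over.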

