Scrapy Crawler Framework (Nowcoder Blog)

1. Basic Usage

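Create a crawler project:
- On the command line, run: scrapy startproject ProjectName
- This creates a crawler project named ProjectName in the current directory
Create a spider:
- On the command line, change into the spiders directory of ProjectName
- On the command line, run: scrapy genspider test "itcast.cn"
- This creates a spider named test, restricted to the domain itcast.cn, in the spiders directory (the file is spiders/test.py)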

Write the spider code:

Write the spider code in test.py

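  import scrapy

  class TestSpider(scrapy.Spider):
      # Spider name, used when launching the spider.
      name = 'test'
      # Domains the spider may crawl; keeps it from wandering onto other sites
      allowed_domains = ['itcast.cn']
      # URLs requested first
      start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

      # Handles the responses for the URLs in start_urls.
      # parse() is the default callback for those requests, so keep this name.
      def parse(self, response):
          # ret is a SelectorList; call ret.extract() to get the matched strings
          ret = response.xpath("//div[@class='tea_con']//h3/text()")
          print(ret)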

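Run the spider:
- On the command line, change into the ProjectName project directory (the one containing scrapy.cfg);
- Run the command: scrapy crawl test to start the test spider.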

2. Using pipelines.py

Change the spider code above as follows:

     import scrapy

     class TestSpider(scrapy.Spider):
         name = 'test'
         allowed_domains = ['itcast.cn']
         start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

         def parse(self, response):
             li_list = response.xpath("//div[@class='tea_con']//li")
             for li in li_list:
                 item = {}
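                 item["name"] = li.xpath(".//h3/text()").extract_first()
                 # extract_first() returns None instead of raising IndexError when nothing matches
                 item["title"] = li.xpath(".//h4/text()").extract_first()

                 # Hand the item to the pipelines for processing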
                 yield item

Open pipelines.py and write the processing code:

 class ProjectNamePipeline(object):
     # Implement storage logic in process_item(). Once the pipeline code is written,
     # the pipeline must be enabled in settings.py.
     def process_item(self, item, spider):
         item["hello"] = "world"
         print(item)
         return item
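
As a rough illustration of what "storage" in process_item() could look like, here is a minimal sketch of a pipeline that writes each item to a JSON Lines file. The class name JsonLinesPipeline, the path parameter, and the file name items.jl are assumptions made for this example; open_spider() and close_spider() are the optional hooks Scrapy calls on a pipeline when a spider starts and finishes.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: writes one JSON object per line."""

    def __init__(self, path="items.jl"):
        # The default path is an assumption; Scrapy builds the pipeline
        # without arguments, so the default is what would be used.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese names) readable.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item so that later pipelines still receive it.
        return item
```

Like any pipeline, this one only runs after it is listed in ITEM_PIPELINES in settings.py.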

Open settings.py and enable the pipeline:

     # Configure item pipelines
     # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
     ITEM_PIPELINES = {
        # The key is the import path of the pipeline class; 300 is its priority (lower values run first).
        'ProjectName.pipelines.ProjectNamePipeline': 300,
     }
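
To make the role of the number concrete, the sketch below models how the values in ITEM_PIPELINES decide the order in which pipelines see an item. It is a simplified illustration, not Scrapy's actual implementation; the second entry (JsonLinesPipeline) and the helper pipeline_order() are hypothetical names used only here.

```python
# Hypothetical settings with two pipelines; the second entry is illustrative.
ITEM_PIPELINES = {
    "ProjectName.pipelines.JsonLinesPipeline": 800,
    "ProjectName.pipelines.ProjectNamePipeline": 300,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in pipeline_order(ITEM_PIPELINES):
    print(ITEM_PIPELINES[path], path)
```

Under this model the pipeline with value 300 handles each item before the one with value 800, regardless of the order in which the entries appear in the dictionary.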

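Run the spider again to verify

Note: the code above was not re-tested after these modifications; if something fails, go by what actually happens when you run it. (Do not copy whole blocks, which is error-prone; just change the relevant parts of the spider file you generated yourself.)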