Crawler 16: Scrapy Crawler Framework 04

Images, Crawler, Spider, Dynamic, Scrapy

Spider 2019-01-22


The Scrapy crawler framework

Image pipeline

The pipeline subclasses the image pipeline class that Scrapy already provides and overrides its methods. The code in pipelines.py is commented.

Site:
http://image.so.com/z?ch=beauty
QueryString:
ch:beauty
sn:? # image number offset: 0 (0-30)...
listtype:new
temp:1
Request URL:
http://image.so.com/zj?ch=beauty&sn=30&listtype=new&temp=1
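
The relationship between the query string and the Request URL can be sketched as follows. This is a minimal illustration, not part of the project: the helper name `build_url` and the constant `PAGE_SIZE` are hypothetical, and the page size of 30 is an assumption read off the "0-30" note above.

```python
from urllib.parse import urlencode

BASE_URL = 'http://image.so.com/zj?'
PAGE_SIZE = 30  # assumption: one response covers 30 images


def build_url(page):
    """Return the listing URL for a zero-based page number."""
    params = {
        'ch': 'beauty',
        'sn': str(page * PAGE_SIZE),  # offset of the first image on this page
        'listtype': 'new',
        'temp': '1',
    }
    # urlencode keeps the dict's insertion order and joins pairs with '&'
    return BASE_URL + urlencode(params)


print(build_url(1))
# http://image.so.com/zj?ch=beauty&sn=30&listtype=new&temp=1
```

Page 0 gives sn=0, page 1 gives sn=30, and so on, which matches the Request URL captured above.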

step1

Terminal

cd /jent/project/spider/spider_14
scrapy startproject So
cd So
scrapy genspider so image.so.com

step2

items.py

import scrapy


class SoItem(scrapy.Item):
    # define the fields for your item here like:
    imgLink = scrapy.Field()

step3

so.py

import scrapy
import urllib.parse
from So.items import SoItem
import json


class SoSpider(scrapy.Spider):
    name = 'so'
    allowed_domains = ['image.so.com']
    # start_urls = ['http://image.so.com/']
    # Override Spider's start_requests() method

    def start_requests(self):
        # Build each URL and hand it to the scheduler
        baseurl = 'http://image.so.com/zj?'
        for page in range(2):
            params = {
                'ch': 'beauty',
                'sn': str(page * 30),
                'listtype': 'new',
                'temp': '1'
            }
            params = urllib.parse.urlencode(params)
            url = baseurl + params

            yield scrapy.Request(url, callback=self.parseImage)

    def parseImage(self, response):
        # response.text holds the response body (JSON here)
        html = response.text
        imgDict = json.loads(html)

        for img in imgDict['list']:
            # Create a fresh item per image so yielded items do not share state
            item = SoItem()
            item['imgLink'] = img['qhimg_url']
            yield item
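
The parsing step in parseImage can be tried without running the spider. The sketch below assumes a response shaped the way the spider expects: only the `list` and `qhimg_url` keys come from the code above, while the sample payload, its URLs, and the helper name `extract_links` are made up for illustration.

```python
import json

# Hypothetical payload: the real response carries more fields per image.
SAMPLE = (
    '{"list": ['
    '{"qhimg_url": "http://example.com/a.jpg"}, '
    '{"qhimg_url": "http://example.com/b.jpg"}'
    ']}'
)


def extract_links(text):
    """Collect every image URL from one JSON listing response."""
    data = json.loads(text)
    # .get() tolerates a response without 'list'; entries lacking a URL are skipped
    return [img['qhimg_url'] for img in data.get('list', []) if 'qhimg_url' in img]


links = extract_links(SAMPLE)
print(links)
# ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

Skipping entries without a URL is a defensive choice of this sketch; the spider itself indexes the keys directly and would raise KeyError on an unexpected response.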

step4

settings.py

# Changed
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36'
ROBOTSTXT_OBEY = False

# Added
LOG_LEVEL = 'WARNING'
# Directory where downloaded images are saved
IMAGES_STORE = '/jent/project/spider/spider_14/So'
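
The settings listed here do not show the pipeline being switched on. Scrapy only runs pipelines that appear in ITEM_PIPELINES, so an entry along these lines is presumably also needed in settings.py. The dotted path is an assumption: it supposes the project package is named So (as in the import in so.py) and the class is the SoPipeline defined in step5. The priority 300 is an arbitrary value in the customary 0-1000 range. ImagesPipeline also needs the Pillow library installed.

```python
# Enable the custom image pipeline (path assumes package "So", class "SoPipeline")
ITEM_PIPELINES = {
    'So.pipelines.SoPipeline': 300,  # lower numbers run earlier
}
```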

step5

pipelines.py

# Import the image pipeline class that Scrapy provides
from scrapy.pipelines.images import ImagesPipeline
import scrapy


# Subclass the image pipeline
class SoPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        '''Override: request the download of the item's image URL.
        '''
        yield scrapy.Request(item['imgLink'])
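
Since file_path is not overridden here, the downloaded files keep the pipeline's default names. To my knowledge the stock ImagesPipeline stores each image under a `full/` subdirectory of IMAGES_STORE, named after the SHA-1 hash of its URL with a `.jpg` extension. The sketch below only mimics that naming so a saved file can be traced back to its URL; `default_image_path` is a hypothetical helper, not a Scrapy function, and the example URL is invented.

```python
import hashlib

IMAGES_STORE = '/jent/project/spider/spider_14/So'  # value from settings.py


def default_image_path(url, store=IMAGES_STORE):
    """Mimic the default naming scheme: <store>/full/<sha1 of URL>.jpg."""
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return f'{store}/full/{image_guid}.jpg'


# Hypothetical URL, used only to illustrate the naming scheme.
path = default_image_path('http://example.com/a.jpg')
print(path)
```

The same URL always maps to the same file name, so re-running the crawl overwrites rather than duplicates an image.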