Crawlers 16: The Scrapy Crawler Framework 04



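The Scrapy Crawler Framework

Image Pipeline

This example subclasses the image pipeline class that Scrapy already provides and overrides one of its methods. The code in pipelines.py is commented.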

step1

Terminal

cd /jent/project/spider/spider_14
scrapy startproject So
cd So
scrapy genspider so image.so.com

step2

The items.py file

import scrapy


class SoItem(scrapy.Item):
    # define the fields for your item here like:
    imgLink = scrapy.Field()

step3

The so.py file

import scrapy
import urllib.parse
from So.items import SoItem
import json


class SoSpider(scrapy.Spider):
    name = 'so'
    allowed_domains = ['image.so.com']
    # start_urls = ['http://image.so.com/']
    # Override Spider.start_requests() instead of using start_urls

    def start_requests(self):
        # Build each page URL and hand the request to the scheduler
        baseurl = 'http://image.so.com/zj?'
        for page in range(2):
            params = {
                'ch': 'beauty',
                'sn': str(page * 30),
                'listtype': 'new',
                'temp': '1'
            }
            params = urllib.parse.urlencode(params)
            url = baseurl + params

            yield scrapy.Request(url, callback=self.parseImage)

    def parseImage(self, response):
        # response.text holds the response body (JSON here)
        imgDict = json.loads(response.text)

        for img in imgDict['list']:
            # Create a fresh item per image so yielded items do not share state
            item = SoItem()
            item['imgLink'] = img['qhimg_url']
            yield item
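
To check the spider's logic without running Scrapy, the standalone sketch below reproduces the URL construction from start_requests() and the JSON extraction from parseImage(). The helper names build_urls and extract_links are made up for illustration, and SAMPLE_BODY is an invented stand-in for the real response: only the 'list' and 'qhimg_url' keys come from the spider above, and the example.com URLs are placeholders.

```python
import json
import urllib.parse

BASE_URL = 'http://image.so.com/zj?'


def build_urls(pages=2, page_size=30):
    """Build the paged URLs the same way start_requests() does."""
    urls = []
    for page in range(pages):
        params = {
            'ch': 'beauty',
            'sn': str(page * page_size),  # offset of the first image on this page
            'listtype': 'new',
            'temp': '1',
        }
        urls.append(BASE_URL + urllib.parse.urlencode(params))
    return urls


def extract_links(body):
    """Collect every qhimg_url from a JSON body shaped like {'list': [...]}."""
    data = json.loads(body)
    return [img['qhimg_url'] for img in data['list']]


# Hypothetical payload for illustration only; not a captured response.
SAMPLE_BODY = json.dumps({
    'list': [
        {'qhimg_url': 'http://example.com/a.jpg'},
        {'qhimg_url': 'http://example.com/b.jpg'},
    ]
})

if __name__ == '__main__':
    for url in build_urls():
        print(url)
    print(extract_links(SAMPLE_BODY))
```

With range(2) and a step of 30, the spider requests the offsets 0 and 30, that is, the first two pages of results.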

step4

settings.py

# Change
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36'
ROBOTSTXT_OBEY = False

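# Add
LOG_LEVEL = 'WARNING'
# Directory where downloaded images are saved
IMAGES_STORE = '/jent/project/spider/spider_14/So'
# Register the pipeline from pipelines.py; Scrapy does not run it otherwise
ITEM_PIPELINES = {
    'So.pipelines.SoPipeline': 300,
}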
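
To find a downloaded file under IMAGES_STORE, it helps to know how the stock image pipeline names files. As far as I know, ImagesPipeline saves each image under a full/ subdirectory, named after the SHA-1 hash of the request URL with a .jpg extension. This may differ between Scrapy versions, so treat the sketch below as an approximation. default_image_path is a hypothetical helper, not a Scrapy function, and the example.com URL is a placeholder.

```python
import hashlib
import os

IMAGES_STORE = '/jent/project/spider/spider_14/So'


def default_image_path(url, store=IMAGES_STORE):
    """Approximate where the stock ImagesPipeline would save the image for url.

    Assumption: files are named full/<sha1 of the URL>.jpg under the store.
    """
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return os.path.join(store, 'full', image_guid + '.jpg')


if __name__ == '__main__':
    print(default_image_path('http://example.com/a.jpg'))
```

The image pipeline also needs the Pillow library to process images, so it has to be installed in the same environment as Scrapy.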

step5

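The pipelines.py file

# Import the image pipeline class that Scrapy provides
from scrapy.pipelines.images import ImagesPipeline
import scrapy


# Subclass the image pipeline
class SoPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        '''Override: yield a download request for the item's image URL.
        '''
        yield scrapy.Request(item['imgLink'])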
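
Once the requests yielded by get_media_requests() finish, Scrapy calls the pipeline's item_completed(results, item, info) method, where results is a list of (success, detail) tuples. The original project does not override that method. The sketch below only illustrates how such a results list could be filtered. successful_paths is a hypothetical helper and SAMPLE_RESULTS is invented data, including the placeholder path and checksum.

```python
def successful_paths(results):
    """Return the stored file paths of the downloads that succeeded.

    Each entry is (ok, detail): detail is a dict with a 'path' key when
    ok is True, and a failure object when ok is False.
    """
    return [detail['path'] for ok, detail in results if ok]


# Invented sample, shaped like the results argument of item_completed().
SAMPLE_RESULTS = [
    (True, {'url': 'http://example.com/a.jpg',
            'path': 'full/aaa.jpg',
            'checksum': 'placeholder'}),
    (False, Exception('download failed')),
]

if __name__ == '__main__':
    print(successful_paths(SAMPLE_RESULTS))
```

An item_completed() override could use a helper like this to discard items whose image failed to download, or to record the saved path on the item.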

step6

Start the crawler from the terminal

/jent/project/spider/spider_14/So >>>scrapy crawl so


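The author's abilities are limited, so mistakes are inevitable.
If you find an error, please do not hesitate to email the author so it can be corrected; thank you in advance.
Email: JentChang@163.com (please include the article title; a link is even more helpful)
You can also leave your comments on the message board below.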

