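The Scrapy Crawler Framework
Image pipeline
This example subclasses the image pipeline class that ships with Scrapy and overrides one of its methods. The code in pipelines.py is commented.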
Query string:
ch: beauty
sn: ?   # image index offset: 0 (images 0-30), ...
listtype: new
temp: 1
Request URL:
http://image.so.com/zj?ch=beauty&sn=30&listtype=new&temp=1
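The query string above can be assembled with the standard library alone. The following is a minimal sketch, assuming (as the request URL suggests) that sn advances in steps of 30 per page; the helper name build_page_url and the constant PAGE_SIZE are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'http://image.so.com/zj?'
PAGE_SIZE = 30  # assumption: each request returns 30 images


def build_page_url(page):
    """Return the listing URL for a zero-based page number (hypothetical helper)."""
    params = {
        'ch': 'beauty',
        'sn': str(page * PAGE_SIZE),
        'listtype': 'new',
        'temp': '1',
    }
    # urlencode joins the pairs as key=value&key=value in insertion order
    return BASE_URL + urlencode(params)


if __name__ == '__main__':
    for page in range(2):
        print(build_page_url(page))
```

Calling build_page_url(1) reproduces the request URL shown above.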
Step 1
Terminal
cd /jent/project/spider/spider_14
scrapy startproject So
cd So
scrapy genspider so image.so.com
Step 2
items.py
import scrapy


class SoItem(scrapy.Item):
    # URL of one image; the image pipeline downloads it later
    imgLink = scrapy.Field()
Step 3
so.py
import json
import urllib.parse

import scrapy

from So.items import SoItem


class SoSpider(scrapy.Spider):
    name = 'so'
    allowed_domains = ['image.so.com']
    # start_urls = ['http://image.so.com/']

    # Override Spider.start_requests()
    def start_requests(self):
        # Build each URL and hand it to the scheduler
        baseurl = 'http://image.so.com/zj?'
        for page in range(2):
            params = {
                'ch': 'beauty',
                'sn': str(page * 30),
                'listtype': 'new',
                'temp': '1'
            }
            params = urllib.parse.urlencode(params)
            url = baseurl + params
            yield scrapy.Request(url, callback=self.parseImage)

    def parseImage(self, response):
        # response.text holds the response body, which is JSON here
        html = response.text
        imgDict = json.loads(html)
        for img in imgDict['list']:
            # Create a fresh item per image so yielded items do not share state
            item = SoItem()
            item['imgLink'] = img['qhimg_url']
            yield item
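The parsing step can be tried outside Scrapy. Below is a small sketch of the same extraction as a plain function. Only the keys list and qhimg_url come from the spider code; the sample payload, the example.com URLs, and the function name extract_image_links are invented for illustration, and the real response likely carries more fields.

```python
import json


def extract_image_links(body):
    """Return every qhimg_url found in the JSON response text."""
    data = json.loads(body)
    # .get() skips a missing 'list' key and entries without an image URL
    return [img['qhimg_url'] for img in data.get('list', []) if img.get('qhimg_url')]


# Made-up payload for illustration only
sample = json.dumps({
    'list': [
        {'qhimg_url': 'http://example.com/a.jpg'},
        {'qhimg_url': 'http://example.com/b.jpg'},
        {'title': 'entry without an image URL'},
    ]
})

print(extract_image_links(sample))
```

Unlike the spider, this version tolerates entries without a qhimg_url instead of raising KeyError. Whether that leniency is wanted depends on how strictly the response format should be checked.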
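Step 4
settings.py
# Change
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36'
ROBOTSTXT_OBEY = False
# Add
LOG_LEVEL = 'WARNING'
# Directory where downloaded images are saved
IMAGES_STORE = '/jent/project/spider/spider_14/So'
# Enable the pipeline defined in pipelines.py; without this entry it never runs
ITEM_PIPELINES = {
    'So.pipelines.SoPipeline': 300,
}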
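Step 5
pipelines.py
# Import the image pipeline class that Scrapy provides
from scrapy.pipelines.images import ImagesPipeline
import scrapy


# Subclass the image pipeline
class SoPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        '''Override: request the image URL stored on the item so that
        ImagesPipeline downloads it into IMAGES_STORE.
        '''
        yield scrapy.Request(item['imgLink'])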
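It helps to know where the downloaded files end up. According to the Scrapy documentation, the stock ImagesPipeline saves each image under IMAGES_STORE as full/<SHA-1 hash of the image URL>.jpg. The sketch below approximates that naming with the standard library; it is not Scrapy's own code, and default_image_path and the example.com URL are hypothetical.

```python
import hashlib
import posixpath

IMAGES_STORE = '/jent/project/spider/spider_14/So'  # value from settings.py


def default_image_path(url):
    """Approximate where the stock ImagesPipeline saves the image for url."""
    # The file name is the hex SHA-1 digest of the URL plus a .jpg extension
    image_id = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return posixpath.join(IMAGES_STORE, 'full', image_id + '.jpg')


# Hypothetical image URL, used only to show the shape of the result
print(default_image_path('http://example.com/a.jpg'))
```

Because the name is derived from the URL, the same image URL always maps to the same file.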
Step 6
Start the spider from the terminal
/jent/project/spider/spider_14/So >>> scrapy crawl so
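Once the crawl finishes, a quick sanity check is to count the files that were written. This is a rough sketch, assuming the default layout in which images land in a full subdirectory of IMAGES_STORE; count_downloaded_images is a made-up helper name.

```python
from pathlib import Path


def count_downloaded_images(images_store):
    """Count .jpg files in the 'full' subdirectory of images_store."""
    full_dir = Path(images_store) / 'full'
    if not full_dir.is_dir():
        # Nothing was downloaded, or the pipeline was never enabled
        return 0
    return sum(1 for p in full_dir.iterdir() if p.suffix == '.jpg')


if __name__ == '__main__':
    print(count_downloaded_images('/jent/project/spider/spider_14/So'))
```

A count of 0 usually suggests that the pipeline is not enabled in settings.py or that the image requests failed.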
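My abilities are limited, so mistakes are inevitable.
If you find one, please do not hesitate to email me so that I can correct it. Thank you in advance.
Email: JentChang@163.com (please mention the article title; including a link is even more helpful)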