Crawler 14: The Scrapy Crawler Framework 02



The Scrapy Crawler Framework

yield

  • Lets a function be used as a generator
  • Pauses the function and waits for the next call

    It records where execution stopped; each resume continues from that point instead of starting over.

def fun_1():
    print('generator started')   # runs only when the first next() call arrives
    for i in range(3):
        yield i                  # pause here and hand i back to the caller
    print('*' * 10)              # runs just before StopIteration is raised

fun = fun_1()                    # calling the function only creates the generator
while True:
    try:
        print(next(fun))
    except StopIteration:        # raised once the generator is exhausted
        break
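Running the script gives the output below: the body of fun_1 does not start until the first next() call, each later call resumes at the yield, and the row of asterisks is printed right before StopIteration ends the loop.

generator started
0
1
2
**********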

Key points

  • extract()
    Gets the text content out of a selector object.
    response.xpath('').extract() returns a list (see the sketch below).
  • pipelines.py
    Every pipeline class must define a process_item method:
    def process_item(self, item, spider):
      return item

    If return item is missing, the method returns None, and the lower-priority pipelines that run afterwards receive that None instead of the item produced by the higher-priority ones.
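As a quick illustration of extract(), here is a minimal standalone sketch; the HTML fragment is made up for the example:

from scrapy.selector import Selector

# A tiny made-up fragment, just to show the behaviour of extract()
sel = Selector(text='<h1 class="title-article">Hello Scrapy</h1>')

# extract() always returns a list of strings, even for a single match
print(sel.xpath('//h1[@class="title-article"]/text()').extract())     # ['Hello Scrapy']
print(sel.xpath('//h1[@class="title-article"]/text()').extract()[0])  # 'Hello Scrapy'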

Example

step1

Terminal

cd /jent/project/spider/spider_14
scrapy startproject Csdn
cd Csdn
scrapy genspider csdn blog.csdn.net
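After these two commands the generated project should look roughly like this (the standard layout Scrapy produces):

Csdn/
├── scrapy.cfg
└── Csdn/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── csdn.py      # created by "scrapy genspider csdn blog.csdn.net"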

step2

items.py

import scrapy


class CsdnItem(scrapy.Item):
    # define the fields for your item here like:
    # title of the post
    title = scrapy.Field()
    # publication time
    time = scrapy.Field()
    # read count
    number = scrapy.Field()
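Item objects behave like dictionaries whose keys are limited to the declared fields. A small standalone sketch (DemoItem is a hypothetical class used only for illustration):

import scrapy

class DemoItem(scrapy.Item):
    title = scrapy.Field()

item = DemoItem()
item['title'] = 'hello'
print(dict(item))        # {'title': 'hello'}
# item['other'] = 'x'    # would raise KeyError: only declared fields may be set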

step3

csdn.py

import scrapy
from Csdn.items import CsdnItem


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    # domains the spider is allowed to crawl
    allowed_domains = ['blog.csdn.net']
    # start URL
    start_urls = ['https://blog.csdn.net/sanpang2288/article/details/86668926']

    def parse(self, response):
        item = CsdnItem()
        item['title'] = response.xpath('//h1[@class="title-article"]/text()').extract()[0]
        item['time'] = response.xpath('//span[@class="time"]/text()').extract()[0]
        item['number'] = response.xpath('//span[@class="read-count"]/text()').extract()[0]

        yield item
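Before running the full spider, the XPath expressions can be checked interactively with scrapy shell, run from inside the project directory so the settings from step4 apply (the selectors match the CSDN article page layout at the time of writing):

scrapy shell https://blog.csdn.net/sanpang2288/article/details/86668926
>>> response.xpath('//h1[@class="title-article"]/text()').extract()[0]
>>> response.xpath('//span[@class="time"]/text()').extract()[0]
>>> response.xpath('//span[@class="read-count"]/text()').extract()[0]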

step4

settings.py

# Changed
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36'
ROBOTSTXT_OBEY = False
## the number is the execution priority; lower values run first
ITEM_PIPELINES = {
   'Csdn.pipelines.CsdnPipeline': 300,
   'Csdn.pipelines.CsdnMongoPipeline': 200,
   'Csdn.pipelines.CsdnMysqlPipeline': 100
}

# Added
## covered in the next post
LOG_LEVEL = 'WARNING'

## MongoDB-related settings
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'csdndb'
MONGO_SET = 'csdnset'

## MySQL-related settings
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'csdndb'
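The pipelines in step5 read these values with from Csdn.settings import *. As an alternative sketch only (SettingsDemoPipeline is a hypothetical name), the same values can be read from the running settings object that every spider carries:

class SettingsDemoPipeline(object):
    def process_item(self, item, spider):
        # spider.settings exposes everything defined in settings.py
        host = spider.settings.get('MONGO_HOST')      # 'localhost'
        port = spider.settings.getint('MONGO_PORT')   # 27017
        return item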

step5

pipelines.py

import pymongo
import pymysql
from Csdn.settings import *


class CsdnPipeline(object):
    def process_item(self, item, spider):
        print(item['title'])
        print(item['time'])
        print(item['number'])
        return item


class CsdnMongoPipeline(object):
    '''Store items in MongoDB
    '''
    def __init__(self):
        self.conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
        self.db = self.conn[MONGO_DB]
        self.myset = self.db[MONGO_SET]

    def process_item(self, item, spider):
        d = dict(item)
        self.myset.insert_one(d)
        return item


class CsdnMysqlPipeline(object):
    '''Store items in MySQL
    '''
    def __init__(self):
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset='utf8'
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        ins = 'insert into csdntab values(%s, %s, %s)'
        L = [
            item['title'],
            item['time'],
            item['number']
        ]
        self.cursor.execute(ins, L)
        self.db.commit()
        return item

    def close_spider(self, spider):
        '''Called automatically when the spider finishes
        '''
        self.cursor.close()
        self.db.close()

    # SQL used to create the database and table:
    # create database csdndb charset=utf8;
    # use csdndb;
    # create table csdntab(
    #     title varchar(100),
    #     time varchar(100),
    #     number varchar(100)
    # )charset=utf8;
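One small note: CsdnMongoPipeline above never closes its MongoClient. A sketch of the same pipeline with a close_spider hook, mirroring what the MySQL pipeline does (not part of the original code):

import pymongo
from Csdn.settings import MONGO_HOST, MONGO_PORT, MONGO_DB, MONGO_SET


class CsdnMongoPipeline(object):
    def __init__(self):
        self.conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
        self.myset = self.conn[MONGO_DB][MONGO_SET]

    def process_item(self, item, spider):
        self.myset.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # called automatically when the crawl ends
        self.conn.close()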

step6

Start the spider from the terminal

/jent/project/spider/spider_14/Csdn >>>scrapy crawl csdn
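Assuming the crawl finishes without errors, the stored record can be checked from the database shells; the database, collection, and table names below follow the settings from step4:

mongo
> use csdndb
> db.csdnset.find().pretty()

mysql -u root -p
mysql> use csdndb;
mysql> select * from csdntab;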


The author's abilities are limited, so mistakes are hard to avoid.
If you spot an error, please don't hesitate to email the author with a correction; thanks in advance.
Email: JentChang@163.com (please mention the article title in your message; including a link makes it even easier)
You can also leave your feedback in the comment section below.

