The Scrapy Crawler Framework
yield
- Lets a single function be used as a generator
- Pauses the function and waits for the next call
The generator remembers where execution stopped; each call resumes from where the previous one left off instead of starting over from the beginning.
def fun_1():
    print('Generator started')
    for i in range(3):
        yield i           # pause here and hand back i
    print('*' * 10)       # runs only after the last value has been yielded

fun = fun_1()
while True:
    try:
        print(next(fun))
    except StopIteration:
        break
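Running this snippet prints the start banner once, then the three yielded values, then the line of asterisks before StopIteration ends the loop:

Generator started
0
1
2
**********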
Key points
- extract()
  Gets the text content out of a selector object.
  response.xpath('').extract() returns a list.
- pipelines.py
  Every pipeline class must define a process_item method:
  def process_item(self, item, spider):
      return item
  If process_item does not return item, its return value is None, and the lower-priority pipelines further down the chain receive whatever the higher-priority pipeline returned instead of the original item (see the sketch below).
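A minimal sketch of why the return matters (FirstPipeline and SecondPipeline are made-up names for illustration): Scrapy hands whatever one pipeline returns to the next pipeline in priority order, so dropping the return passes None downstream.

class FirstPipeline(object):
    '''Runs first (smaller number in ITEM_PIPELINES).'''
    def process_item(self, item, spider):
        item['title'] = item['title'].strip()
        return item               # the next pipeline receives this cleaned item

class SecondPipeline(object):
    '''Runs second; if FirstPipeline forgot its return, item would be None here.'''
    def process_item(self, item, spider):
        print(item['title'])
        return item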
Example
step1
Terminal
cd /jent/project/spider/spider_14
scrapy startproject Csdn
cd Csdn
scrapy genspider csdn blog.csdn.net
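After these two commands the project should look roughly like the standard Scrapy layout below (csdn.py is the spider generated by genspider; __init__.py files omitted for brevity):

Csdn/
├── scrapy.cfg
└── Csdn/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── csdn.py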
step2
The items.py file
import scrapy


class CsdnItem(scrapy.Item):
    # define the fields for your item here like:
    # article title
    title = scrapy.Field()
    # publication time
    time = scrapy.Field()
    # read count
    number = scrapy.Field()
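CsdnItem behaves like a dict, which is how the spider in the next step fills it in; a quick sketch of that behaviour:

item = CsdnItem()
item['title'] = 'some title'     # fine: title is a declared field
# item['author'] = 'someone'     # would raise KeyError: not a declared field
print(dict(item))                # {'title': 'some title'}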
step3
The csdn.py file
import scrapy
from Csdn.items import CsdnItem


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    # domains the spider is allowed to crawl
    allowed_domains = ['blog.csdn.net']
    # start URL(s)
    start_urls = ['https://blog.csdn.net/sanpang2288/article/details/86668926']

    def parse(self, response):
        item = CsdnItem()
        item['title'] = response.xpath('//h1[@class="title-article"]/text()').extract()[0]
        item['time'] = response.xpath('//span[@class="time"]/text()').extract()[0]
        item['number'] = response.xpath('//span[@class="read-count"]/text()').extract()[0]
        yield item
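Note that extract()[0] raises an IndexError whenever an XPath matches nothing. A slightly more defensive variant of the same parse method (same XPath expressions, only the accessor changes) uses extract_first() with a default value:

# drop-in replacement for CsdnSpider.parse above
def parse(self, response):
    item = CsdnItem()
    # extract_first('') returns '' instead of raising when nothing matches
    item['title'] = response.xpath('//h1[@class="title-article"]/text()').extract_first('')
    item['time'] = response.xpath('//span[@class="time"]/text()').extract_first('')
    item['number'] = response.xpath('//span[@class="read-count"]/text()').extract_first('')
    yield item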
step4
settings.py
# Changed settings
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36'
ROBOTSTXT_OBEY = False
## The number is the execution priority; lower values run first (here: MySQL → MongoDB → console output)
ITEM_PIPELINES = {
'Csdn.pipelines.CsdnPipeline': 300,
'Csdn.pipelines.CsdnMongoPipeline': 200,
'Csdn.pipelines.CsdnMysqlPipeline': 100
}
# Added settings
## explained in the next section
LOG_LEVEL = 'WARNING'
## MongoDB-related settings
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'csdndb'
MONGO_SET = 'csdnset'
## MySQL-related settings
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'csdndb'
step5
The pipelines.py file
import pymongo
import pymysql
from Csdn.settings import *


class CsdnPipeline(object):
    def process_item(self, item, spider):
        print(item['title'])
        print(item['time'])
        print(item['number'])
        return item


class CsdnMongoPipeline(object):
    '''Store items in MongoDB.
    '''
    def __init__(self):
        self.conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
        self.db = self.conn[MONGO_DB]
        self.myset = self.db[MONGO_SET]

    def process_item(self, item, spider):
        d = dict(item)
        self.myset.insert_one(d)
        return item


class CsdnMysqlPipeline(object):
    '''Store items in MySQL.
    '''
    def __init__(self):
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset='utf8'
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        ins = 'insert into csdntab values(%s, %s, %s)'
        L = [
            item['title'],
            item['time'],
            item['number']
        ]
        self.cursor.execute(ins, L)
        self.db.commit()
        return item

    def close_spider(self, spider):
        '''Called automatically when the spider closes.
        '''
        self.cursor.close()
        self.db.close()
# SQL to create the database and table:
# create database csdndb charset=utf8;
# use csdndb;
# create table csdntab(
#     title varchar(100),
#     time varchar(100),
#     number varchar(100)
# )charset=utf8;
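The wildcard import from Csdn.settings works here; an alternative that Scrapy also supports is reading settings through a from_crawler classmethod on the pipeline. A minimal sketch for the Mongo pipeline, shown only as a variation on the code above, not what this example uses:

import pymongo

class CsdnMongoPipeline(object):
    def __init__(self, host, port, db, collection):
        self.conn = pymongo.MongoClient(host, port)
        self.myset = self.conn[db][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # pull the custom variables defined in settings.py
        return cls(
            crawler.settings.get('MONGO_HOST'),
            crawler.settings.getint('MONGO_PORT'),
            crawler.settings.get('MONGO_DB'),
            crawler.settings.get('MONGO_SET'),
        )

    def process_item(self, item, spider):
        self.myset.insert_one(dict(item))
        return item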
step6
Start the spider from the terminal
/jent/project/spider/spider_14/Csdn >>>scrapy crawl csdn
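Once the crawl finishes, the stored data can be checked from the database clients (assuming the database, table, and collection created in the earlier steps):

# MySQL: check the stored rows
mysql> select * from csdndb.csdntab;

# MongoDB: check the stored documents (mongo shell)
> use csdndb
> db.csdnset.find().pretty()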
The author's ability is limited, so mistakes are hard to avoid entirely.
If you spot an error, please don't hesitate to email the author with a correction; thanks in advance.
Email: JentChang@163.com (please mention the article title in your message; including a link is even better)
You can also leave your valuable feedback in the comment section below.