Crawler 15: Scrapy Crawler Framework 03



The Scrapy Crawler Framework

Log level / log file

LOG_LEVEL = ''            # one of the five levels listed below
LOG_FILE = 'filename.log'

LOG_LEVEL values:

   Level     Meaning          Output
1  CRITICAL  critical errors  shows CRITICAL only
2  ERROR     ordinary errors  shows ERROR and above
3  WARNING   warnings         shows WARNING and above
4  INFO      general info     shows INFO and above
5  DEBUG     debug info       shows DEBUG and above (i.e. everything)
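As a quick illustration, every scrapy.Spider carries a built-in logger; a minimal sketch (inside any spider callback) of which messages survive LOG_LEVEL = 'WARNING':

# inside any spider callback method
self.logger.debug('dropped: DEBUG is below WARNING')
self.logger.warning('kept: at or above the configured level, also written to LOG_FILE')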

Saving CSV / JSON files

Set the export encoding in settings.py:
FEED_EXPORT_ENCODING = 'encoding'

For Excel on Windows to open the CSV correctly, use 'gb18030'.

scrapy crawl spider_name -o spider_name.json
scrapy crawl spider_name -o spider_name.csv

The exported CSV may contain blank lines; patch the Scrapy source to pass newline:
-> scrapy package folder
-> exporters.py
-> Ctrl+F "Csv" -> class CsvItemExporter
-> self.stream = io.TextIOWrapper(file, newline='',…)
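For reference, a sketch of what the patched line looks like inside CsvItemExporter.__init__ (the surrounding keyword arguments vary by Scrapy version, so treat this as illustrative rather than exact):

# scrapy/exporters.py, class CsvItemExporter.__init__
self.stream = io.TextIOWrapper(
    file,
    line_buffering=False,
    write_through=True,
    encoding=self.encoding,
    newline='',  # added: prevents the blank line between rows in Excel
)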

Tencent Recruitment Project

Target        Expression
base rows     //tr[@class="even"] | //tr[@class="odd"]
job title     ./td[1]/a/text()
job category  ./td[2]/text()
headcount     ./td[3]/text()
location      ./td[4]/text()
post date     ./td[5]/text()
job link      ./td[1]/a/@href
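These expressions can be sanity-checked in scrapy shell before writing the spider (the page layout may have changed since this was written, so treat the output as illustrative):

scrapy shell "https://hr.tencent.com/position.php?start=0"
>>> rows = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
>>> rows[0].xpath('./td[1]/a/text()').extract_first()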

step1

Terminal

cd /jent/project/spider/spider_14
scrapy startproject Tengxun
cd Tengxun
scrapy genspider tengxun hr.tencent.com
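After these commands, startproject/genspider leave roughly this layout (details vary slightly across Scrapy versions):

Tengxun/
├── scrapy.cfg
└── Tengxun/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── tengxun.py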

step2

items.py

import scrapy


class TengxunItem(scrapy.Item):
    # one field per column scraped from the job listing table
    zhName = scrapy.Field()     # job title
    zhLink = scrapy.Field()     # job link
    zhType = scrapy.Field()     # job category
    zhNum = scrapy.Field()      # headcount
    zhAddress = scrapy.Field()  # location
    zhTime = scrapy.Field()     # post date
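Items behave like dicts, which is what lets the Mongo pipeline in step 5 call dict(item); a quick illustration:

item = TengxunItem()
item['zhName'] = 'engineer'  # hypothetical value for illustration
print(dict(item))            # {'zhName': 'engineer'}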

step3

tengxun.py

import scrapy
from Tengxun.items import TengxunItem


class TengxunSpider(scrapy.Spider):
    name = 'tengxun'
    allowed_domains = ['hr.tencent.com']
    url = 'https://hr.tencent.com/position.php?start='
    # hand the first page's URL to the engine
    start_urls = ['https://hr.tencent.com/position.php?start=0']

    def parse(self, response):
        # the listing is paginated 10 jobs per page: start=0, 10, ..., 2850
        for page in range(0, 2860, 10):
            fullurl = self.url + str(page)
            yield scrapy.Request(fullurl, callback=self.parseHtml)

    def parseHtml(self, response):
        baseList = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for base in baseList:
            # build a fresh item per row; extract_first() returns the
            # default instead of raising when a cell is missing
            item = TengxunItem()
            item['zhName'] = base.xpath('./td[1]/a/text()').extract_first('None')
            item['zhType'] = base.xpath('./td[2]/text()').extract_first('None')
            item['zhNum'] = base.xpath('./td[3]/text()').extract_first('None')
            item['zhAddress'] = base.xpath('./td[4]/text()').extract_first('None')
            item['zhTime'] = base.xpath('./td[5]/text()').extract_first('None')
            item['zhLink'] = base.xpath('./td[1]/a/@href').extract_first('None')

            yield item

step4

settings.py

# changed
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DEFAULT_REQUEST_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}
# lower numbers run first: the Mongo pipeline (200) gets each item
# before the console pipeline (300)
ITEM_PIPELINES = {
   'Tengxun.pipelines.TengxunPipeline': 300,
   'Tengxun.pipelines.TengxunMongoPipeline': 200
}

# added
LOG_LEVEL = 'WARNING'
## export encoding: gb18030 so Excel on Windows opens the CSV correctly
FEED_EXPORT_ENCODING = 'gb18030'

MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'tengxundb'
MONGO_SET = 'tengxunset'

step5

pipelines.py

import pymongo
from Tengxun.settings import *


class TengxunPipeline(object):
    # console pipeline: print each item so crawl progress is visible
    def process_item(self, item, spider):
        print('*' * 20)
        print(item['zhName'])
        print(item['zhType'])
        print(item['zhNum'])
        print(item['zhAddress'])
        print(item['zhTime'])
        print(item['zhLink'])
        print('*' * 20)

        return item


class TengxunMongoPipeline(object):
    # connect once when the class is loaded, using the settings.py values
    conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
    db = conn[MONGO_DB]
    myset = db[MONGO_SET]

    def process_item(self, item, spider):
        # items are dict-like, so they convert directly to Mongo documents
        d = dict(item)
        self.myset.insert_one(d)
        return item

    def close_spider(self, spider):
        print('OVER')
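Once the crawl finishes, the inserts can be verified in the mongo shell (the database and collection names come from settings.py above):

mongo
> use tengxundb
> db.tengxunset.find().count()
> db.tengxunset.findOne()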

step6

Start the spider from the terminal

/jent/project/spider/spider_14/Tengxun >>> scrapy crawl tengxun -o tengxun.csv


The author's abilities are limited, and mistakes are hard to avoid.
If you spot an error, please don't hesitate to email the author with a correction; thanks in advance.
Email: JentChang@163.com (please mention the article title in your message; including a link is even more convenient)
You can also leave your valuable feedback in the comment box below.

