Crawler 15: Scrapy Crawler Framework 03



The Scrapy Crawler Framework

Log level / log file

LOG_LEVEL = ''            # one of the five levels listed below
LOG_FILE = 'filename.log'

LOG_LEVEL values:

   Level     Meaning          Output
1  CRITICAL  critical errors  shows CRITICAL only
2  ERROR     ordinary errors  shows ERROR and above
3  WARNING   warnings         shows WARNING and above
4  INFO      general info     shows INFO and above
5  DEBUG     debug info       shows DEBUG and above (i.e. everything)
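As a quick illustration, every scrapy.Spider carries a built-in logger; a minimal sketch (inside any spider callback) of which messages survive LOG_LEVEL = 'WARNING':

# inside any spider callback method
self.logger.debug('dropped: DEBUG is below WARNING')
self.logger.warning('kept: at or above the configured level, also written to LOG_FILE')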

Saving CSV / JSON files

Set the export encoding in settings.py:
FEED_EXPORT_ENCODING = 'encoding'

For Excel on Windows to open the CSV correctly, use 'gb18030'.

scrapy crawl spider_name -o spider_name.json
scrapy crawl spider_name -o spider_name.csv

The exported CSV may contain blank lines; patch the Scrapy source to pass newline:
-> scrapy package folder
-> exporters.py
-> Ctrl+F "Csv" -> class CsvItemExporter
-> self.stream = io.TextIOWrapper(file, newline='',…)
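For reference, a sketch of what the patched line looks like inside CsvItemExporter.__init__ (the surrounding keyword arguments vary by Scrapy version, so treat this as illustrative rather than exact):

# scrapy/exporters.py, class CsvItemExporter.__init__
self.stream = io.TextIOWrapper(
    file,
    line_buffering=False,
    write_through=True,
    encoding=self.encoding,
    newline='',  # added: prevents the blank line between rows in Excel
)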

Tencent Recruitment Project

Target        Expression
base rows     //tr[@class="even"] | //tr[@class="odd"]
job title     ./td[1]/a/text()
job category  ./td[2]/text()
headcount     ./td[3]/text()
location      ./td[4]/text()
post date     ./td[5]/text()
job link      ./td[1]/a/@href
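These expressions can be sanity-checked in scrapy shell before writing the spider (the page layout may have changed since this was written, so treat the output as illustrative):

scrapy shell "https://hr.tencent.com/position.php?start=0"
>>> rows = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
>>> rows[0].xpath('./td[1]/a/text()').extract_first()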

step1

Terminal

cd /jent/project/spider/spider_14
scrapy startproject Tengxun
cd Tengxun
scrapy genspider tengxun hr.tencent.com
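After these commands, startproject/genspider leave roughly this layout (details vary slightly across Scrapy versions):

Tengxun/
├── scrapy.cfg
└── Tengxun/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── tengxun.py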

step2

items.py

import scrapy


class TengxunItem(scrapy.Item):
    # one field per column scraped from the job listing table
    zhName = scrapy.Field()     # job title
    zhLink = scrapy.Field()     # job link
    zhType = scrapy.Field()     # job category
    zhNum = scrapy.Field()      # headcount
    zhAddress = scrapy.Field()  # location
    zhTime = scrapy.Field()     # post date
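Items behave like dicts, which is what lets the Mongo pipeline in step 5 call dict(item); a quick illustration:

item = TengxunItem()
item['zhName'] = 'engineer'  # hypothetical value for illustration
print(dict(item))            # {'zhName': 'engineer'}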

step3

tengxun.py

import scrapy
from Tengxun.items import TengxunItem


class TengxunSpider(scrapy.Spider):
    name = 'tengxun'
    allowed_domains = ['hr.tencent.com']
    url = 'https://hr.tencent.com/position.php?start='
    # hand the first page's URL to the engine
    start_urls = ['https://hr.tencent.com/position.php?start=0']

    def parse(self, response):
        # the listing is paginated 10 jobs per page: start=0, 10, ..., 2850
        for page in range(0, 2860, 10):
            fullurl = self.url + str(page)
            yield scrapy.Request(fullurl, callback=self.parseHtml)

    def parseHtml(self, response):
        baseList = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for base in baseList:
            # build a fresh item per row; extract_first() returns the
            # default instead of raising when a cell is missing
            item = TengxunItem()
            item['zhName'] = base.xpath('./td[1]/a/text()').extract_first('None')
            item['zhType'] = base.xpath('./td[2]/text()').extract_first('None')
            item['zhNum'] = base.xpath('./td[3]/text()').extract_first('None')
            item['zhAddress'] = base.xpath('./td[4]/text()').extract_first('None')
            item['zhTime'] = base.xpath('./td[5]/text()').extract_first('None')
            item['zhLink'] = base.xpath('./td[1]/a/@href').extract_first('None')

            yield item

step4

settings.py

# changed
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DEFAULT_REQUEST_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}
# lower numbers run first: the Mongo pipeline (200) gets each item
# before the console pipeline (300)
ITEM_PIPELINES = {
   'Tengxun.pipelines.TengxunPipeline': 300,
   'Tengxun.pipelines.TengxunMongoPipeline': 200
}

# added
LOG_LEVEL = 'WARNING'
## export encoding: gb18030 so Excel on Windows opens the CSV correctly
FEED_EXPORT_ENCODING = 'gb18030'

MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'tengxundb'
MONGO_SET = 'tengxunset'

step5

pipelines.py

import pymongo
from Tengxun.settings import *


class TengxunPipeline(object):
    # console pipeline: print each item so crawl progress is visible
    def process_item(self, item, spider):
        print('*' * 20)
        print(item['zhName'])
        print(item['zhType'])
        print(item['zhNum'])
        print(item['zhAddress'])
        print(item['zhTime'])
        print(item['zhLink'])
        print('*' * 20)

        return item


class TengxunMongoPipeline(object):
    # connect once when the class is loaded, using the settings.py values
    conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
    db = conn[MONGO_DB]
    myset = db[MONGO_SET]

    def process_item(self, item, spider):
        # items are dict-like, so they convert directly to Mongo documents
        d = dict(item)
        self.myset.insert_one(d)
        return item

    def close_spider(self, spider):
        print('OVER')
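Once the crawl finishes, the inserts can be verified in the mongo shell (the database and collection names come from settings.py above):

mongo
> use tengxundb
> db.tengxunset.find().count()
> db.tengxunset.findOne()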

step6

Start the spider from the terminal

/jent/project/spider/spider_14/Tengxun >>> scrapy crawl tengxun -o tengxun.csv


The author's abilities are limited, and mistakes are hard to avoid.
If you spot an error, please don't hesitate to email the author with a correction; thanks in advance.
Email: JentChang@163.com (please mention the article title in your message; including a link is even more convenient)
You can also leave your valuable feedback in the comment box below.

