Scrapy Crawler Framework
Log level / log file
```python
LOG_LEVEL = ''
LOG_FILE = 'filename.log'
```
LOG_LEVEL values:
Level | Value | Meaning | Effect |
---|---|---|---|
1 | CRITICAL | critical errors | shows CRITICAL only |
2 | ERROR | ordinary errors | shows ERROR and above |
3 | WARNING | warnings | shows WARNING and above |
4 | INFO | general information | shows INFO and above |
5 | DEBUG | debug information | shows DEBUG and above (everything) |
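Taken together, a minimal sketch in settings.py (the log filename here is just an example):
```python
# settings.py
LOG_LEVEL = 'WARNING'       # only WARNING, ERROR and CRITICAL get logged
LOG_FILE = 'tengxun.log'    # send log output to this file instead of the console
```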
Saving to CSV / JSON files
Set the export encoding in settings.py:
```python
FEED_EXPORT_ENCODING = '<encoding>'
```
For Excel on Windows to open the CSV correctly, use 'gb18030'.
```
scrapy crawl <spider_name> -o <spider_name>.json
scrapy crawl <spider_name> -o <spider_name>.csv
```
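For example, for a CSV that Excel on a Chinese Windows system opens cleanly (a common choice; 'utf-8-sig' also works with Excel):
```python
# settings.py
FEED_EXPORT_ENCODING = 'gb18030'
```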
The exported CSV may contain blank lines between rows; the fix in this tutorial edits Scrapy's source so the stream is opened with newline='':
-> scrapy package directory
-> exporters.py
-> Ctrl+F "Csv" -> class CsvItemExporter
-> self.stream = io.TextIOWrapper(file, newline='', ...)
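In context, the edited line inside CsvItemExporter looks roughly like this; the surrounding keyword arguments vary between Scrapy versions, and the only change is adding newline='' (newer Scrapy releases already include it):
```python
# scrapy/exporters.py, class CsvItemExporter.__init__ (after the edit)
self.stream = io.TextIOWrapper(
    file,
    newline='',    # keep csv's own '\r\n' from becoming '\r\r\n' on Windows
    line_buffering=False,
    write_through=True,
    encoding=self.encoding,
)
```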
Tencent Recruitment Project

Target | XPath |
---|---|
base rows | //tr[@class="even"] \| //tr[@class="odd"] |
job title | ./td[1]/a/text() |
job category | ./td[2]/text() |
headcount | ./td[3]/text() |
location | ./td[4]/text() |
post date | ./td[5]/text() |
job link | ./td[1]/a/@href |
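These expressions can be checked in the Scrapy shell before writing the spider:
```
scrapy shell "https://hr.tencent.com/position.php?start=0"
>>> rows = response.xpath('//tr[@class="even"]|//tr[@class="odd"]')
>>> rows[0].xpath('./td[1]/a/text()').extract_first()
```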
step1
Terminal:
```
cd /jent/project/spider/spider_14
scrapy startproject Tengxun
cd Tengxun
scrapy genspider tengxun hr.tencent.com
```
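After these commands the standard Scrapy scaffolding is in place:
```
Tengxun/
├── scrapy.cfg
└── Tengxun/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── tengxun.py
```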
step2
items.py:
```python
import scrapy


class TengxunItem(scrapy.Item):
    # define the fields for your item here
    zhName = scrapy.Field()      # job title
    zhLink = scrapy.Field()      # job link
    zhType = scrapy.Field()      # job category
    zhNum = scrapy.Field()       # headcount
    zhAddress = scrapy.Field()   # location
    zhTime = scrapy.Field()      # post date
```
step3
tengxun.py:
```python
import scrapy
from Tengxun.items import TengxunItem


class TengxunSpider(scrapy.Spider):
    name = 'tengxun'
    allowed_domains = ['hr.tencent.com']
    url = 'https://hr.tencent.com/position.php?start='
    # hand the first page's URL to the engine
    start_urls = ['https://hr.tencent.com/position.php?start=0']

    def parse(self, response):
        # the listing pages step through start=0, 10, ..., 2850
        for page in range(0, 2860, 10):
            fullurl = self.url + str(page)
            yield scrapy.Request(fullurl, callback=self.parseHtml)

    def parseHtml(self, response):
        baseList = response.xpath('//tr[@class="even"]|//tr[@class="odd"]')
        for base in baseList:
            # build a fresh item for every row
            item = TengxunItem()
            try:
                item['zhName'] = base.xpath('./td[1]/a/text()').extract()[0]
            except IndexError:
                item['zhName'] = 'None'
            try:
                item['zhType'] = base.xpath('./td[2]/text()').extract()[0]
            except IndexError:
                item['zhType'] = 'None'
            try:
                item['zhNum'] = base.xpath('./td[3]/text()').extract()[0]
            except IndexError:
                item['zhNum'] = 'None'
            try:
                item['zhAddress'] = base.xpath('./td[4]/text()').extract()[0]
            except IndexError:
                item['zhAddress'] = 'None'
            try:
                item['zhTime'] = base.xpath('./td[5]/text()').extract()[0]
            except IndexError:
                item['zhTime'] = 'None'
            try:
                item['zhLink'] = base.xpath('./td[1]/a/@href').extract()[0]
            except IndexError:
                item['zhLink'] = 'None'
            yield item
```
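A side note on idiom: each try/except pair above can be collapsed with extract_first(), which takes a default value for when the XPath matches nothing. One field as a sketch:
```python
# equivalent to the try/except for zhName above
item['zhName'] = base.xpath('./td[1]/a/text()').extract_first('None')
```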
step4
settings.py:
```python
# changed
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'Tengxun.pipelines.TengxunPipeline': 300,
    'Tengxun.pipelines.TengxunMongoPipeline': 200,
}

# added
LOG_LEVEL = 'WARNING'
# export encoding, so Excel on Windows reads the CSV
FEED_EXPORT_ENCODING = 'gb18030'
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'tengxundb'
MONGO_SET = 'tengxunset'
```
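Note the ITEM_PIPELINES numbers: lower values run first, so each item passes through TengxunMongoPipeline (200) before TengxunPipeline (300).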
step5
pipelines.py:
```python
import pymongo
from Tengxun.settings import *


class TengxunPipeline(object):
    # print each item to the terminal as a quick visual check
    def process_item(self, item, spider):
        print('*' * 20)
        print(item['zhName'])
        print(item['zhType'])
        print(item['zhNum'])
        print(item['zhAddress'])
        print(item['zhTime'])
        print(item['zhLink'])
        print('*' * 20)
        return item


class TengxunMongoPipeline(object):
    conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
    db = conn[MONGO_DB]
    myset = db[MONGO_SET]

    def process_item(self, item, spider):
        # items behave like dicts, so they can be inserted directly
        d = dict(item)
        self.myset.insert_one(d)
        return item

    def close_spider(self, spider):
        print('OVER')
```
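A side note: instead of `from Tengxun.settings import *`, the more idiomatic way to read settings in a pipeline is through the crawler object. A minimal sketch using the same setting names as above:
```python
import pymongo

class TengxunMongoPipeline(object):
    def __init__(self, host, port, db, setname):
        self.conn = pymongo.MongoClient(host, port)
        self.myset = self.conn[db][setname]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler; pull settings from it
        s = crawler.settings
        return cls(s.get('MONGO_HOST'), s.getint('MONGO_PORT'),
                   s.get('MONGO_DB'), s.get('MONGO_SET'))

    def process_item(self, item, spider):
        self.myset.insert_one(dict(item))
        return item
```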
step6
Run the spider from the terminal:
```
/jent/project/spider/spider_14/Tengxun >>> scrapy crawl tengxun -o tengxun.csv
```
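Note that in Scrapy 1.x, -o appends to an existing output file rather than overwriting it, so remove the old tengxun.csv before re-running.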