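When scrapy is run from crontab, the command may not execute at all, or it may run without scraping any data. This post summarizes the causes and the fixes.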
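- First, the problems encountered and their solutions:
  - The spider does not run: in crontab you must first `cd` into the project directory and then invoke the command, otherwise scrapy cannot find the spider.
  - The scrapy command is not found: cron runs jobs with a minimal `PATH`, so call scrapy by its absolute path, `/usr/local/bin/scrapy crawl spider_name`.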
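- How to schedule crawls with crontab. There are two approaches:
  - Run `scrapy crawl spider_name` directly in crontab, writing one entry per spider.
  - Add a `startup.py` that collects the spiders to run and calls them one after another through `subprocess`, then schedule that single script, for example `0 3 * * * cd /project_path/spider && /usr/bin/python3 startup.py >> /tmp/spider.log`. The script looks like this: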
```python
# Run all spiders sequentially
import subprocess
import time


def crawl_work():
    date_start = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
    crawl_name_list = ['spider_1', 'spider_2']
    record_time_list = {}
    for crawl_name in crawl_name_list:
        start_time = time.time()
        record_time_list[crawl_name] = {}
        record_time_list[crawl_name]['start_date'] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
        # Use the absolute path, because cron's PATH may not include scrapy;
        # wait() blocks until this spider finishes before starting the next one
        subprocess.Popen('/usr/local/bin/scrapy crawl ' + crawl_name, shell=True).wait()
        end_time = time.time()
        record_time_list[crawl_name]['end_date'] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
        record_time_list[crawl_name]['time_last'] = int((end_time - start_time) / 60)  # minutes, rounded down
    # Print a timing summary once all spiders have finished
    print('time_record-date_start: ', date_start)
    for crawl_name, record_time in record_time_list.items():
        print('time_record-' + crawl_name + ': ', record_time)
    date_end = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
    print('time_record-date_end: ', date_end)


if __name__ == '__main__':
    crawl_work()
```
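The two fixes listed at the top (run from the project directory, call scrapy by its absolute path) can also be applied inside the Python script itself, so the script no longer depends on the `cd` in the crontab entry. The following is a minimal sketch, not the original author's code. `SCRAPY_BIN` and `PROJECT_DIR` reuse the paths quoted in this post and are assumptions about your machine. The helper names `build_command` and `run_spider` are hypothetical, and returning 127 when the launch fails is a convention chosen here to mirror the shell's "command not found" code.

```python
import subprocess
from pathlib import Path

# Assumed locations taken from the examples in this post; adjust them to
# your machine (for instance, check the output of `which scrapy`).
SCRAPY_BIN = "/usr/local/bin/scrapy"
PROJECT_DIR = Path("/project_path/spider")


def build_command(spider_name, scrapy_bin=SCRAPY_BIN):
    """Return the argument list for one crawl, using an absolute scrapy path."""
    return [scrapy_bin, "crawl", spider_name]


def run_spider(spider_name, project_dir=PROJECT_DIR, scrapy_bin=SCRAPY_BIN):
    """Run one spider from inside the project directory and return its exit code.

    Passing cwd= plays the role of `cd /project_path/spider` in the crontab
    entry, so scrapy can find scrapy.cfg and the spider.
    """
    try:
        completed = subprocess.run(build_command(spider_name, scrapy_bin), cwd=project_dir)
    except OSError as exc:
        # Raised when the scrapy binary or the project directory is missing
        # or not accessible; the message ends up in the cron log.
        print("failed to start", spider_name, ":", exc)
        return 127  # hypothetical convention, mirrors the shell's code
    return completed.returncode
```

In `startup.py`, the `subprocess.Popen(...).wait()` line could then become `run_spider(crawl_name)`, and a non-zero return value can be recorded next to the timing data to make failed runs visible in the log.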