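Today we'll build a novel scraper for Biquge (笔趣阁).
Scraping novels from Biquge is fairly easy, as long as we leave the site's search feature out of it.
We'll use requests and XPath to build this small scraper.
First, of course, come the imports.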
```python
import requests
import time
from lxml import etree
```
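Next we write two helper functions,
one for requesting a page and one for XPath parsing.
Wrapping these steps in functions keeps the rest of the program shorter and easier to read.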
```python
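def get_tag(response, tag):
    # Parse the HTML text and return every node matching the XPath expression.
    html = etree.HTML(response)
    ret = html.xpath(tag)
    return ret


def parse_url(url):
    # Download the page and decode it as GBK before returning the text.
    response = requests.get(url)
    response.encoding = 'gbk'
    return response.text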
```
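To see why `parse_url` sets `response.encoding = 'gbk'`, here is a small standalone sketch that uses only the standard library. It assumes, as the helper above does, that the site sends GBK-encoded bytes; the sample string is made up for illustration and is not taken from the site.

```python
# Illustrative only: simulate the raw bytes a GBK-encoded page would send.
sample = '第一章 开始'
raw = sample.encode('gbk')

# Decoding with the matching codec recovers the original text.
decoded_ok = raw.decode('gbk')

# Decoding with the wrong codec (UTF-8 here) produces replacement characters.
decoded_bad = raw.decode('utf-8', errors='replace')

print(decoded_ok == sample)
print(decoded_bad == sample)
```

If the scraped text comes out garbled, the page encoding is the first thing to check.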
We start crawling from the table-of-contents page,
so the first step is to collect the URLs of all the chapters listed on it.
```python
def find_url(response):
    # Collect the relative chapter links from the table of contents.
    chapter = get_tag(response, '//*[@id="list"]/dl/dd/a/@href')
    for i in chapter:
        # url_list and start_url are module-level globals (see the full script).
        url_list.append(start_url + i)
```
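Plain string concatenation works when every `href` is relative and the table-of-contents URL ends with a slash. As a hedged alternative, the sketch below resolves links with `urllib.parse.urljoin` from the standard library, which also copes with root-relative and absolute links. The helper name `build_chapter_urls` and the sample `href` values are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin


def build_chapter_urls(base_url, hrefs):
    """Resolve each href against the table-of-contents URL."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical hrefs of the kinds a table of contents might contain.
base = 'https://www.52bqg.com/book_187/'
hrefs = [
    '1001.html',                                  # relative to the page
    '/book_187/1002.html',                        # relative to the site root
    'https://www.52bqg.com/book_187/1003.html',   # already absolute
]
chapter_urls = build_chapter_urls(base, hrefs)
for u in chapter_urls:
    print(u)
```

All three forms resolve to a full URL under `https://www.52bqg.com/book_187/`.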
Then, for each chapter URL,
we fetch the content of that chapter.
```python
def find_content(url):
    response = parse_url(url)
    # Chapter heading and the text nodes of the chapter body.
    chapter = get_tag(response, '//*[@id="box_con"]/div[2]/h1/text()')[0]
    content = get_tag(response, '//*[@id="content"]/text()')
    print('Fetching', chapter)
    # Append to one file named after the novel; the with block closes it for us.
    with open('{}.txt'.format(title), 'at', encoding='utf-8') as j:
        j.write(chapter)
        for i in content:
            # Skip text nodes that contain only a line break.
            if i == '\r\n':
                continue
            j.write(i)
    print(chapter, 'saved')
    time.sleep(2)  # pause between requests to go easy on the server
```
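The function above skips only text nodes that are exactly `'\r\n'` and uses the novel title directly as a file name. The sketch below shows one possible way to tidy both steps. It assumes that paragraphs may be padded with non-breaking spaces and that a title could contain characters Windows does not allow in file names; `clean_lines`, `safe_filename` and the sample data are hypothetical.

```python
import re


def clean_lines(text_nodes):
    """Strip padding (including non-breaking spaces) and drop empty nodes."""
    cleaned = []
    for node in text_nodes:
        line = node.replace('\xa0', ' ').strip()
        if line:
            cleaned.append(line)
    return cleaned


def safe_filename(name):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()


# Made-up text nodes shaped like the result of an XPath text() query.
nodes = ['\r\n', '\xa0\xa0\xa0\xa0First paragraph.', '\r\n',
         '\xa0\xa0\xa0\xa0Second paragraph.\r\n']
lines = clean_lines(nodes)
filename = safe_filename('What? A Title: Part 1/2')
print(lines)
print(filename)
```

Writing `'\n'.join(lines)` to the file would then put each paragraph on its own line.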
Finally, a main function ties everything together.
```python
def main():
    global title, start_url
    # Ask for the table-of-contents page of the novel to download.
    start_url = input("Enter the novel's table-of-contents URL, e.g. https://www.52bqg.com/book_187/ : ")
    response = parse_url(start_url)
    # The novel title doubles as the output file name.
    title = get_tag(response, '//*[@id="info"]/h1/text()')[0]
    find_url(response)
    for url in url_list:
        find_content(url)
```
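The loop in `main` stops at the first chapter whose request raises an exception. One possible safeguard, sketched here under assumptions, is a small retry wrapper. `fetch_with_retry` and `flaky_fetch` are hypothetical names; `flaky_fetch` merely stands in for `parse_url` so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on an exception, wait and retry (retries must be >= 1)."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # with requests, catch requests.RequestException
            last_error = exc
            print('attempt', attempt, 'failed:', exc)
            if attempt < retries:
                time.sleep(delay)
    raise last_error


# Stand-in for parse_url that fails twice, then succeeds.
calls = {'n': 0}


def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated network error')
    return '<html>ok</html>'


result = fetch_with_retry(flaky_fetch, 'https://www.52bqg.com/book_187/', delay=0)
print(result)
```

In the scraper itself, the equivalent call would be `fetch_with_retry(parse_url, url)` inside `find_content`.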
The complete code:
```python
import requests
import time
from lxml import etree

# Chapter URLs collected from the table of contents.
url_list = []


def get_tag(response, tag):
    # Parse the HTML text and return every node matching the XPath expression.
    html = etree.HTML(response)
    ret = html.xpath(tag)
    return ret


def parse_url(url):
    # Download the page and decode it as GBK before returning the text.
    response = requests.get(url)
    response.encoding = 'gbk'
    return response.text


def find_url(response):
    # Collect the relative chapter links from the table of contents.
    chapter = get_tag(response, '//*[@id="list"]/dl/dd/a/@href')
    for i in chapter:
        url_list.append(start_url + i)


def find_content(url):
    response = parse_url(url)
    # Chapter heading and the text nodes of the chapter body.
    chapter = get_tag(response, '//*[@id="box_con"]/div[2]/h1/text()')[0]
    content = get_tag(response, '//*[@id="content"]/text()')
    print('Fetching', chapter)
    # Append to one file named after the novel; the with block closes it for us.
    with open('{}.txt'.format(title), 'at', encoding='utf-8') as j:
        j.write(chapter)
        for i in content:
            # Skip text nodes that contain only a line break.
            if i == '\r\n':
                continue
            j.write(i)
    print(chapter, 'saved')
    time.sleep(2)  # pause between requests to go easy on the server


def main():
    global title, start_url
    # Ask for the table-of-contents page of the novel to download.
    start_url = input("Enter the novel's table-of-contents URL, e.g. https://www.52bqg.com/book_187/ : ")
    response = parse_url(start_url)
    # The novel title doubles as the output file name.
    title = get_tag(response, '//*[@id="info"]/h1/text()')[0]
    find_url(response)
    for url in url_list:
        find_content(url)


if __name__ == '__main__':
    main()
```
[Figure: screenshot of the scraper running]
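As you can see, a scraper for Biquge novels is fairly easy to write.
In the next post we'll add a GUI to this scraper.
Let's learn Python together: if you're a beginner looking for guidance or teaching material, feel free to message me.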