spider

HTTP原理

HTTP请求方式

常用的请求方式是GET和POST：

GET请求：是以实体的方式得到由请求URL所指定资源的信息，一般来说我们输入一个网址时，默认的请求就是GET请求。

POST请求：用来向目的服务器发出请求，要求它接受被附在请求后的实体，并把它当作请求队列中请求URL所指定资源的附加新子项。比如网页上的注册、登录等等都是POST请求。

GET与POST方法有以下区别：

在客户端，GET方式通过URL提交数据，数据在URL中可以看到；POST方式的数据放置在实体区内提交，不能直接看到。

GET方式提交的数据最多只能有1024个字节，而POST则没有此限制。

安全性问题，使用GET的时候，参数会显示在地址栏上，而POST不会，所以，如果这些数据是非敏感数据，那么使用GET；如果用户输入的数据包含敏感数据，那么还是使用POST提交比较靠谱。（其实POST也是不安全的）

HTTP状态码含义

200 ——请求成功

202 [已接受] 已接受请求，但尚未处理

204 成功处理了请求，但没有返回内容

206 服务器成功处理了部分 GET 请求
301 —— 资源（网页等）被永久转移到其他URL（永久重定向）

302 ——资源（网页等）被临时转移到其他的URL（临时重定向）

304 —— 资源（网页）没有更新

305 请求者只能使用代理访问请求的网页

400 错误请求，服务器不理解请求的语法

401 未授权，请求要求身份验证

403 —— 表示资源不可用。服务器理解客户的请求，但拒绝处理它，通常由于服务器上文件或目录的权限设置导致的WEB访问错误。

404 ——请求的资源（网页等）不存在

500 —— 服务器内部错误，主要是由于IWAM账号的密码错误造成的

502 作为网关或代理从上游收到无效响应

503 —— 由于临时的服务器维护或者过载，服务器当前无法处理请求

HTTP头部信息

Request Header

Accept请求报头域，用于指定客户端接收哪些类型的信息。

Accept-Encoding请求报头域，用于指定客户端可接受的内容编码。

Accept-Language请求报头域，类似Accept，但是它用于指定一种自然语言。

Connection报头域允许发送用于指定连接的选项。

Cookie辨别用户身份、进行 session 跟踪而储存在用户本地终端上的数据（通常经过加密）

Host头域，用于指定请求资源的intenet主机和端口号，必须表示请求URL的原始服务器或网关的位置。

User-Agent头域，里面包含发出请求的用户信息，其中有使用浏览器的型号，版本和操作系统的信息。这个头域经常用来作为反爬虫的措施。

Response Header

Cache-Contorl用于指定缓存指定，缓存指令是单向的，且是独立的。

Content-Type 实体报头域用于指明发送给接收者的实体正文的媒体类型。

Date表示消息产生的日期和时间

Expires-CT实体报头域给出响应过期日期和时间。（用来查看缓存过期时间）

Last-Modified实体报头域用于指示资源的最后修改日期和时间。

Transfer-Encoding：chunked表示输出的内容长度不能确定

urllib

参考链接：https://cloud.tencent.com/developer/news/207605
https://docs.python.org/3/library/urllib.request.html

简介

Python3中urllib是一个URL处理包，这个包中集合了一些处理URL的模块，包括了request模块、error模块、parse模块和robotparser模块。

urllib.request模块是用来打开和读取URL的；
urllib.error模块包含一些有urllib.request产生的错误，可以使用try进行捕捉处理；
urllib.parse模块包含了一些解析URLs的方法；
urllib.robotparser模块用来解析robots.txt文本文件.它提供了一个单独的RobotFileParser类，通过该类提供的can_fetch()方法测试爬虫是否可以下载一个页面。

request

Note

urllib.request.urlopen(url) 可以直接打开url地址，但是如果需要执行更复杂的操作，比如增加HTTP报头，则必须创建一个Request实例来作为urlopen()的参数，而需要访问的url地址则作为Request实例的参数。

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Request实例参数

data（默认为空）：是伴随url提交的数据（比如post的数据），同时HTTP请求将从“GET”方式改为“POST”方式。
headers（默认为空）：是一个字典，包含了需要发送的HTTP报头的键值对。
- User-Agent伪装成公认的浏览器
- 调用Request.add_header() 添加/修改一个特定的header
- 调用Request.get_header()来查看已有的header。

基本教程示例


from urllib import request

# chrome 的 User-Agent，包含在 header里
header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

# 构造Request请求 返回<urllib.request.Request object at 0x02C4C450>
request = urllib.request.Request("http://www.baidu.com/", headers = header)

# 调用Request.add_header() 添加/修改一个特定的header
request.add_header("Connection", "keep-alive")

# 向服务器发送请求 返回<http.client.HTTPResponse object at 0x09135850>
response = urllib.request.urlopen(request)

html = response.read().decode("utf-8")

print(html)

request类Request方法常用的内置方法

Request.add_data(data)  设置data参数，如果一开始创建的时候没有给data参数，那么可以使用该方法追加data 参数

Request.get_method() 返回HTTP请求方法，一般返回GET或是POST

Request.has_data() 查看是否设置了data参数

Request.get_data() 获取data参数的数据

Request.add_header(key, val) 添加头部信息，key为头域名，val为域值

Request.get_full_url() 获取请求的完整url

Request.get_host() 返回请求url的host（主域名）

Request.set_proxy(host, type) 设置代理，第一个参数是代理ip和端口，第二个参数是代理类型（http/https)

微博登录示例

from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

Beautiful Soup4

参考链接:

简介

Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.

对象的种类

得到BeautifulSoup对象并按标准格式输出
soup = BeautifulSoup(html, "html.parser")

Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种: BeautifulSoup,Tag, NavigableString, Comment

BeautifulSoup

BeautifulSoup 对象表示的是一个文档的全部内容.大部分时候,可以把它当作 Tag 对象

Tag

Tag子节点对象与XML或HTML原生文档中的tag相同:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
tag.name
# u'b'

tag['class']
# u'boldest'
tag.attrs
# {u'class': u'boldest'}
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None

节点操作

soup.contents 将子节点以列表的方式输出返回类型 list

soup.children 将子节点以可迭代列表的方式输出返回类型 list_iterator

可以被next()函数调用并不断返回下一个值的对象称为迭代器：Iterator

soup.descendants 对所有tag的子孙节点进行递归循环返回类型 generator

父节点 .parents 递归得到所有父辈节点返回类型 generator

.previous_sibling .next_sibling 查询兄弟节点

NavigableString

字符串常被包含在tag内.Beautiful Soup用 NavigableString 类来包装tag中的字符串:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

# 不能编辑，但可以直接替换
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

字符串操作

tag中包含多个字符串可以用.strings循环获取返回 generator
.stripped_strings获取删除空格和空行后的内容

Comment

Comment 对象是一个特殊类型的 NavigableString 对象,
html和xml中的注释部分

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

搜索文档树

find(self, name=None, attrs={}, recursive=True, text=None,**kwargs)
find_all(self, name=None, attrs={}, recursive=True, text=None,limit=None, **kwargs)

name

name 参数可以查找所有名字为 name 的tag,字符串对象会被自动忽略掉

1 2	soup.find_all("title") # [<title>The Dormouse's story</title>]

attrs

根据属性查找

其中，class 属于python中的保留字，通过class_ 搜索

string 搜索文档中字符串的内容

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# 在文档树中查找所有包含 id 属性的tag,无论 id 的值是什么
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(string="Elsie")
# [u'Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(string=re.compile("Dormouse"))
[u"The Dormouse's story", u"The Dormouse's story"]

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

limit

1
2
3

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

recursive

只搜索tag的子节点 recursive=False

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

直接调用tag

find_all 最常用简写方法

soup.find_all("a")
soup("a")

soup.title.find_all(string=True)
soup.title(string=True)

CSS选择器

 .select() 方法中传入字符串参数, 即可使用CSS选择器的语法找到tag:

soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
通过tag标签逐层查找:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]
找到某个tag标签下的直接子标签 [6] :

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []
找到兄弟节点标签:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie"id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
通过CSS的类名查找:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
通过tag的id查找:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
同时用多种CSS选择器查询元素:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
通过是否存在某个属性来查找:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
通过属性的值来查找:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
通过语言设置来查找:

multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]
返回查找到的元素的第一个

soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
对于熟悉CSS选择器语法的人来说这是个非常方便的方法.Beautiful Soup也支持CSS选择器API, 如果你仅仅需要CSS选择器的功能,那么直接使用 lxml 也可以, 而且速度更快,支持更多的CSS选择器语法,但Beautiful Soup整合了CSS选择器的语法和自身方便使用API.

pdfkit

使用pdfkit依赖 wkhtmltopdf 需要在本地path下配置

或本地path未配置

import pdfkit
# 配置path后 可以不写
path_wk = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'  
# wkhtmltopdf安装位置
config = pdfkit.configuration(wkhtmltopdf=path_wk)
pdfkit.from_string(html, file_name, options=options, configuration=config)

import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')
pdfkit.from_file('test.html', 'out.pdf')
pdfkit.from_string('Hello!', 'out.pdf')

cmd下执行wkhtmltopdf "www.baidu.com" "out.pdf"