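I recently came across an old thread on V2EX, "Newbie question: does Python not release memory after deleting an item from a dict?"
It got me thinking: if you add, delete, and modify entries on a dict object, how does its memory footprint change?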

Runtime environment

Operating system: Windows 10  
IDE: JetBrains PyCharm 2018.2.4 x64  
Language: Python 3.6.4

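Background

requests is a very basic Python library for writing crawlers. It is a friendlier wrapper built on top of urllib3 (not the standard-library urllib), which does the low-level HTTP work.
It makes crawlers for static pages much easier to write.
Here is a short summary with some very easy-to-follow code examples.

Basic usage

"""
If urllib is a pistol, it is time for an upgrade to an automatic assault rifle!
Writing a crawler with urllib is fairly tedious, and some parts of it are genuinely inconvenient.
The more powerful requests library exists to make development easier and to support more advanced operations.
Let's take it for a first spin!
"""
# Import requests first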
import requests

# Send the request
req = requests.get("https://sm.ms/")
# Check the type of the response object
print(type(req))
print("*"*50)
# Read the HTTP status code from an attribute
print(req.status_code)
print("*"*50)
# Check the type of the response's page content
print(type(req.text))
print("*"*50)
# Inspect the cookies returned with the response
print(type(req.cookies))
print(req.cookies)
for i in req.cookies:
    print(i.name+"="+i.value)
print("*"*50)
# Print the page content of the response
print(req.text)
print("*"*50)
# List all parameters of requests.get
import inspect
print(inspect.getfullargspec(requests.get))

"""
As you can see, inspecting the fetched page content is much more convenient with requests than with urllib.
There is no need for ".read()", ".decode()", or an opener to handle cookies.
requests does all of this in one step, and requests.get() makes it obvious at a glance that the request is sent with the GET method!
Besides GET, there are several other request methods:
req = requests.post("http://httpbin.org/post")
req = requests.put("http://httpbin.org/put")
req = requests.delete("http://httpbin.org/delete")
req = requests.head("http://httpbin.org/head")
req = requests.options("http://httpbin.org/options")
"""
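
The calls above assume every request succeeds. As a minimal sketch, a small wrapper can add a timeout and turn HTTP error statuses into exceptions. The helper name `fetch_text` and its injectable `get` parameter are my own illustration, not part of requests; `timeout` and `raise_for_status()` are standard requests features.

```python
import requests


def fetch_text(url, get=requests.get, timeout=10):
    """Return the decoded body of `url`, raising on HTTP errors.

    `get` is injectable only so the helper can be exercised without a
    network connection; by default it is requests.get.
    """
    # Without a timeout, a stalled server can block the crawler indefinitely
    resp = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    resp.raise_for_status()
    return resp.text
```

Calling `fetch_text("http://httpbin.org/get")` should then behave like `requests.get(...).text`, except that a 404 or 500 raises an exception instead of silently returning an error page.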

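The GET method

"""
The get() method of requests is not very different from urlopen():
both achieve the same result, but requests is much simpler.
requests.get(url, params=None, **kwargs)
"""
# Import requests first
import requests
# From simple to complex: the simplest possible GET crawler with requests
# req = requests.get(url="http://httpbin.org/get")
# print(req.text)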
'''
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.14.2"
  }, 
  "origin": "171.36.8.151", 
  "url": "http://httpbin.org/get"
}
'''


# Now tweak it a bit: pass parameters through the GET request
# req = requests.get(url="http://httpbin.org/get?name=666&value=888")
# print(req.text)
'''
Comparing the two outputs, the parameters now show up in the args field!
{
  "args": {
    "name": "666", 
    "value": "888"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.14.2"
  }, 
  "origin": "171.36.8.151", 
  "url": "http://httpbin.org/get?name=666&value=888"
}
'''

# A more user-friendly way: pass the parameters as a dict
data = {
    "name": "666",
    "value": "888",
}
req = requests.get(url="http://httpbin.org/get",params=data)
print(req.text)
print(type(req.text))
print(req.json())
print(type(req.json()))
'''
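The result is the same as before, but anyone who remembers doing this with urllib will appreciate the difference.
No more converting to and from bytes, and no more urllib.parse.urlencode()!
Just pass the dict and go. Convenient, isn't it?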
{
  "args": {
    "name": "666", 
    "value": "888"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.14.2"
  }, 
  "origin": "171.36.8.151", 
  "url": "http://httpbin.org/get?name=666&value=888"
}

<class 'str'>
{'args': {'name': '666', 'value': '888'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.14.2'}, 'origin': '171.36.8.151', 'url': 'http://httpbin.org/get?name=666&value=888'}
<class 'dict'>
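The output shows that the page content comes back as a str, but one in JSON format (that is, of the form {"XX": "XXXX"}),
so it can be parsed directly into a dictionary by calling json().
If the body is not valid JSON, json() does not work: it raises an exception (a ValueError subclass).
For a JSON object like this one, the result of json() is a dict.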
'''
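
The two conveniences described above can be checked without sending anything. The sketch below prepares the request offline to show how `params` ends up in the URL, then uses the standard `json` module on a hypothetical body (shaped like the httpbin output) as a rough stand-in for what `req.json()` does with the response text.

```python
import json

import requests

# Build the request without sending it, so nothing touches the network
data = {"name": "666", "value": "888"}
prepared = requests.Request("GET", "http://httpbin.org/get", params=data).prepare()
# The dict has been URL-encoded into the query string
print(prepared.url)

# Hypothetical response body, shaped like the httpbin output above
body = '{"args": {"name": "666", "value": "888"}}'
parsed = json.loads(body)
print(type(parsed), parsed["args"]["name"])

# A body that is not JSON is rejected with a ValueError subclass
try:
    json.loads("<html>not json</html>")
    invalid_rejected = False
except ValueError:
    invalid_rejected = True
print(invalid_rejected)
```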

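Data cleaning: extracting content

"""
requests crawler
Simple content extraction
This uses get() as the example; the same approach applies when scraping with post() and the other methods
"""
import requests
import re
# Add request headers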
myheaders1 = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Referer":"https://lnovel.cc/",
}
# req = requests.get("https://www.zhihu.com/explore",headers=myheaders1)
# pattern = re.compile('data-za-element-name="Title">(.*?)</a>',re.S)
# titles = re.findall(pattern,req.text)
# print(titles)
# print('*'*50)
'''
Pretty simple, isn't it?
'''
myheaders2 = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Referer":"https://lnovel.cc/",
}
# req = requests.get("https://lnovel.cc/",headers=myheaders2)
# print(req.text)
# print(req.content)
'''
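You can see that text is garbled, while content is a bytes literal starting with b; that is somewhat better, since you can at least tell it is encoded data.
text: returns a str, decoded from the body using req.encoding, which requests guesses from the response headers
content: returns the raw body as bytes

The site itself is encoded in UTF-8, so if requests guesses a different encoding (for example, it falls back to ISO-8859-1 when the Content-Type header declares no charset), text comes out garbled.
How to fix it? bytes is the source of all encodings, so just decode content as UTF-8 yourself and skip text here (setting req.encoding = "utf-8" before reading req.text also works).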
'''
# print(req.content.decode(encoding="utf-8"))
'''
Then extract the content you need!
'''
req = requests.get("https://lnovel.cc/",headers=myheaders2)
pattern = re.compile('<h2 class="mdl-card__title-text">(.*?)</h2>',re.S)
titles = re.findall(pattern,req.content.decode(encoding="utf-8"))
print(titles)
for i in titles:
    print(i)
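
To see why decoding `content` yourself fixes the garbling, here is an offline sketch that needs no network. The `html_bytes` sample is a made-up stand-in for `req.content`; the regular expression is the same one used above.

```python
import re

# Hypothetical sample standing in for req.content: UTF-8 bytes of a tiny page
html_bytes = '<h2 class="mdl-card__title-text">轻小说</h2>'.encode("utf-8")

# Decoding with the wrong codec "succeeds" but yields mojibake
garbled = html_bytes.decode("iso-8859-1")
# Decoding with the page's real encoding gives the readable text back
readable = html_bytes.decode("utf-8")

pattern = re.compile(r'<h2 class="mdl-card__title-text">(.*?)</h2>', re.S)
print(pattern.findall(garbled))
print(pattern.findall(readable))
```

The tags are plain ASCII, so the pattern matches in both cases; only the captured title differs, which is why a wrong encoding often goes unnoticed until the text is printed.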

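Saving the scraped content to a text file

"""
requests crawler
Once simple content extraction works, save the useful information. The same approach works for images, video, music, text, and so on; images and text are used as examples here
The requests use get() as the example; the same approach applies when scraping with post() and the other methods
"""
# Download an image
# import requests
#
# req = requests.get("http://www.zzuliacgn.cf/static/ZA_Show/img/background/QYMX-logo.png")
# with open("QYMX-logo.png","wb") as f:
#     f.write(req.content)
# print("Image download complete!")

# Save text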
import requests
import re
myheaders = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Referer":"https://lnovel.cc/",
}
req = requests.get("https://lnovel.cc/",headers=myheaders)
pattern = re.compile('<h2 class="mdl-card__title-text">(.*?)</h2>',re.S)
titles = re.findall(pattern,req.content.decode(encoding="utf-8"))
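print(str(titles))
# Set the encoding explicitly; the platform default (GBK on Chinese Windows) may not be able to encode every title
with open("titles.txt","w",encoding="utf-8") as f:
    f.write(str(titles))
print("Text saved!")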
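
Writing `str(titles)` stores the Python list representation, brackets and quotes included. As an alternative sketch, the titles can be written one per line and read back unchanged. The `titles` sample and the temporary path are placeholders of my own, used so the example runs without scraping anything.

```python
import os
import tempfile

# Hypothetical sample standing in for the scraped titles list
titles = ["第一卷", "Volume 2"]

path = os.path.join(tempfile.mkdtemp(), "titles.txt")
# Explicit UTF-8 keeps the result independent of the platform default
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(titles))

# Read the file back and split it into the original list
with open(path, encoding="utf-8") as f:
    restored = f.read().split("\n")
print(restored)
```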

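A simple upload example

"""
requests crawler
Courtesy demands reciprocity: since we can download and save, we can of course upload too
This time we use the post method instead of get!
The example uploads an image to the image host https://sm.ms
API documentation for the image host: https://sm.ms/doc/

A quick look at post():
requests.post(url,data=data,headers=headers,files=files)
- data sets the request body
- headers sets the request headers
- files sets the files to upload
"""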
import requests,re
myheader = {
'Host':'sm.ms',
}
# Open the file in a with block so it is closed once the upload finishes
with open('1.jpg','rb') as f:
    files = {'smfile':f}
    req = requests.post(url="https://sm.ms/api/upload",files=files)
print(req.json())
'''
On success, the output looks like this:
{'code': 'success', 'data': {'width': 150, 'height': 155, 'filename': '1.jpg', 'storename': '5bc9d5e084d19.jpg', 'size': 28902, 'path': '/2018/10/19/5bc9d5e084d19.jpg', 'hash': 'nVgKA5E8tLcofJx', 'timestamp': 1539954144, 'ip': '171.36.8.151', 'url': 'https://i.loli.net/2018/10/19/5bc9d5e084d19.jpg', 'delete': 'https://sm.ms/delete/nVgKA5E8tLcofJx'}}
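Any other result means the upload failed!
Since this is only a test, don't put unnecessary load on someone else's image host; that wastes resources, and this is a free, good-faith service.
Don't discourage the people who run it: remember to delete the image once the test is done! It's common courtesy.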
'''
req = requests.post(url=req.json()['data']['delete'])
pattern = re.compile(r'<div class="bs-callout bs-callout-warning" style="border-left-width: 2px;">([\s\S]*?)</div>',re.S)
titles = re.findall(pattern,req.text)
print(titles)
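
To see what `files=` actually puts on the wire, the request can be prepared without sending it. This is only an offline sketch: the in-memory bytes and the `image/jpeg` type are placeholders, and nothing is uploaded to sm.ms.

```python
import requests

# Fake in-memory "image"; the file name, bytes, and MIME type are placeholders
files = {"smfile": ("1.jpg", b"fake image bytes", "image/jpeg")}
prepared = requests.Request("POST", "https://sm.ms/api/upload", files=files).prepare()

# requests picks a multipart/form-data body and generates the boundary itself
print(prepared.headers["Content-Type"])
# The body carries the form field name, the file name, and the raw bytes
print(prepared.body[:120])
```

Because requests generates the multipart Content-Type header (including the boundary) on its own, there is no need to set it by hand in `headers`.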

File download example

import requests
# Download other kinds of files
# with open("Sublime_Build_203207.dmg", "wb") as code:
#     code.write(requests.get(url="https://download.sublimetext.com/Sublime%20Text%20Build%203207.dmg").content)

# Download image files
req = requests.get(url="https://www.seselah.com/uploads/2a/2a71e0d8fdf1c870596b2be33c27dc18.jpg")
print(req.content)
with open("1.jpg", "wb") as code:
    code.write(req.content)
req = requests.get(url="http://t1.aixinxi.net/o_1d6l2rdi21910gk01o22187c1sqja.jpg")
print(req.content)
with open("2.jpg", "wb") as code:
    code.write(req.content)
req = requests.get(url="http://img.wkcdn.com/image/0/2/2s.jpg")
print(req.content)
with open("3.jpg", "wb") as code:
    code.write(req.content)
req = requests.get(url="http://t1.aixinxi.net/o_1cnu5m83210v1oi41ca4n56r6la.jpg")
print(req.content)
with open("4.jpg", "wb") as code:
    code.write(req.content)

req = requests.get(url="http://t1.aixinxi.net/o_1d96s8ha36po1o2dumm10me1t8ga.jpg-j.jpg")
print(req.content)
with open("5.jpg", "wb") as code:
    code.write(req.content)
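
The five near-identical blocks above could be folded into one helper. The sketch below is one possible shape, not the original author's code: the `download` name and the injectable `get` parameter are assumptions of mine, while `stream=True` and `iter_content()` are standard requests features that avoid holding a large file in memory.

```python
import requests


def download(url, path, get=requests.get, chunk_size=8192):
    """Stream `url` to `path` and return the number of bytes written."""
    written = 0
    # stream=True defers the body so it can be read chunk by chunk
    resp = get(url, stream=True, timeout=30)
    try:
        resp.raise_for_status()
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                written += f.write(chunk)
    finally:
        # Release the connection even if writing fails
        resp.close()
    return written
```

A loop over `(url, filename)` pairs calling `download(url, filename)` would then replace the repeated get/open/write sequence.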

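Detecting a web page's encoding

import requests

# Method 1
url = 'http://www.langzi.fun'
r = requests.get(url)
# get_encodings_from_content() returns the charsets declared in the page's <meta> tags;
# the list is empty when the page declares none, so fall back to UTF-8
encodings = requests.utils.get_encodings_from_content(r.text)
encoding = encodings[0] if encodings else 'utf-8'
res = r.content.decode(encoding,'replace')
print(res)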

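# Method 2
# requests itself relies on chardet (charset_normalizer in newer releases) for r.apparent_encoding
import chardet
r = requests.get(url=url)

# Detect the page encoding and use it as the encoding for decoding r.text
r.encoding = chardet.detect(r.content)['encoding']
# GBK is a superset of GB2312, so it also decodes characters that GB2312 lacks
if r.encoding == "GB2312":
    r.encoding = "GBK"
print(r.text)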
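
The GB2312-to-GBK swap in Method 2 works because GBK is a superset of GB2312. A small offline check, using only Python's built-in codecs and a probe string of my own choosing:

```python
# "镕" exists in GBK but not in GB2312, so it makes a handy probe character
raw = "朱镕基".encode("gbk")

try:
    raw.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

print(gb2312_ok)
print(raw.decode("gbk"))

# Text that GB2312 can represent produces identical bytes in both codecs
same_bytes = "中文".encode("gb2312") == "中文".encode("gbk")
print(same_bytes)
```

So when a detector reports GB2312 for a page that contains rarer characters, decoding as GBK still handles the common text and also recovers the characters GB2312 cannot.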