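I'm a beginner and wrote a multithreaded crawler. All the threads have finished, but the process keeps holding about 1.5 GB of memory (the more tasks I run, the more memory stays unreleased). I don't know how to track down where the problem is, and I can't really tell from the pympler output what is going wrong. How should I go about diagnosing this properly?
The code that runs the multithreaded function: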
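import threadpool
from pympler import tracker

def mainfunc(tasknum, thread):
    tr = tracker.SummaryTracker()
    tr.print_diff()  # baseline summary before any work is queued
    list = []
    for i in range(tasknum):
        list.append(str(i))
    pool = threadpool.ThreadPool(thread)
    # childfunc is the per-task worker (defined elsewhere in the crawler)
    requests = threadpool.makeRequests(childfunc, list)
    for req in requests:
        pool.putRequest(req)
    pool.wait()  # block until every queued request has been processed
    tr.print_diff()  # summary of objects created since the baseline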
Output of tr.print_diff()
At initialization:
types | # objects | total size
========================== | =========== | ============
list | 3741 | 350.84 KB
str | 3739 | 260.01 KB
int | 673 | 18.40 KB
dict | 2 | 352 B
tuple | 4 | 256 B
code | 1 | 144 B
function (store_info) | 1 | 136 B
cell | 2 | 96 B
functools._lru_list_elem | 1 | 80 B
method | -1 | -64 B
After all threads have finished:
types | # objects | total size
===================================== | =========== | ============
dict | 202860 | 43.69 MB
list | 100169 | 8.47 MB
str | 102446 | 5.62 MB
threadpool.WorkRequest | 100000 | 5.34 MB
int | 100836 | 3.08 MB
_io.BufferedReader | 294 | 2.35 MB
tuple | 1480 | 93.30 KB
type | 76 | 85.98 KB
code | 572 | 80.57 KB
bytes | 1219 | 51.49 KB
set | 32 | 43.50 KB
socket.socket | 294 | 27.56 KB
pymysql.connections.Connection | 294 | 16.08 KB
socket.SocketIO | 294 | 16.08 KB
DBUtils.SteadyDB.SteadyDBConnection | 294 | 16.08 KB
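The summary above shows 100000 threadpool.WorkRequest objects still alive after pool.wait(), so one thing worth checking is what still refers to them. The following is only a rough sketch using the standard gc module, not a definitive diagnosis: FakeRequest is a hypothetical stand-in for threadpool.WorkRequest, and the helper names are made up for illustration. In the real script you would pass "WorkRequest" as the type name; one candidate holder is the local requests list in mainfunc, which is still in scope when the second print_diff() runs.

```python
import gc
from collections import Counter


def count_live_objects(type_name):
    """Count gc-tracked objects whose type name matches type_name."""
    return sum(1 for obj in gc.get_objects() if type(obj).__name__ == type_name)


def referrer_types(type_name, sample=5):
    """Tally the types of objects that refer to a few instances of type_name."""
    matches = [obj for obj in gc.get_objects() if type(obj).__name__ == type_name]
    tally = Counter()
    for obj in matches[:sample]:
        for ref in gc.get_referrers(obj):
            if ref is not matches:  # ignore the helper list built above
                tally[type(ref).__name__] += 1
    return tally


class FakeRequest:
    """Hypothetical stand-in for threadpool.WorkRequest (assumption)."""


if __name__ == "__main__":
    kept = [FakeRequest() for _ in range(1000)]  # something is holding these
    print(count_live_objects("FakeRequest"))
    print(referrer_types("FakeRequest").most_common(3))
```

If the tally is dominated by list or dict, the next step would be to find which list or dict that is (for example a local variable, or an attribute of the pool) and check whether it can be cleared or deleted once the work is done.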
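Here is a minimal example that reproduces the problem. After it prints done, htop shows python3 still holding that memory, and it is not released unless the process is killed. (I couldn't post links, so the URL is base64-encoded.)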
#!/usr/bin/python3
# -*- coding: UTF-8 -*-
import threadpool, time, requests, base64

s = requests.Session()  # one session shared by all worker threads

def childfunc(id):
    # The URL is base64-encoded because the forum would not allow links
    url = base64.b64decode('aHR0cHM6Ly91cGxvYWQud2lraW1lZGlhLm9yZy93aWtpcGVkaWEvY29tbW9ucy9mL2ZmL1BpemlnYW5pXzEzNjdfQ2hhcnRfMTBNQi5qcGc=')
    res = s.get(url, timeout=(5, 60))

def mainfunc(tasknum, thread):
    list = []
    for i in range(tasknum):
        list.append(str(i))
    pool = threadpool.ThreadPool(thread)
    requests = threadpool.makeRequests(childfunc, list)
    for req in requests:
        pool.putRequest(req)
    pool.wait()
    print('done')
    # Keep the process alive so its memory usage can be watched in htop
    while True:
        time.sleep(1)

if __name__ == '__main__':
    mainfunc(10000, 50)
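One check that might help narrow this down is comparing what Python itself still has allocated against what htop reports. The sketch below uses the standard tracemalloc module. It is only an illustration under assumptions: traced_usage and fake_download_batch are hypothetical names, and fake_download_batch stands in for the thread pool run by allocating and dropping roughly 10 MB buffers instead of downloading anything.

```python
import tracemalloc


def traced_usage(work):
    """Run work() and return (bytes still allocated afterwards, peak bytes)."""
    tracemalloc.start()
    try:
        work()
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current, peak


def fake_download_batch():
    # Hypothetical stand-in for the pool run: allocate ~10 MB bodies and drop them
    for _ in range(5):
        body = bytes(10 * 1024 * 1024)
        del body


if __name__ == "__main__":
    current, peak = traced_usage(fake_download_batch)
    print("still allocated: %.1f MB, peak: %.1f MB"
          % (current / 2**20, peak / 2**20))
```

If the "still allocated" figure falls back to a small number after the real workload while htop keeps showing high resident memory, the memory is probably not held by live Python objects, and the allocator or a C extension would be the next place to look. If it stays high, tracemalloc.take_snapshot() and its statistics("lineno") method can show which source lines the remaining allocations come from.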