玩命加载中 . . .

Python文件操作:大文件的读


概述

上一篇介绍了文件的open与 with open的区别,此篇介绍下如何读取大文件,避免MemoryError之类的错误(比如说VM内存4G,读取10GiB大小的文件)。

文件读取方式

method Description
read() 一次性读取整个文件内容。推荐使用read(size)方法,size越大运行时间越长
readline() 每次读取一行内容。内存不够时使用,一般不太用
readlines() 一次性读取整个文件内容,并按行返回到list,方便我们遍历

从介绍的几种method来看,read() 与 readlines() 都是读取文件的全部内容,遇到大文件不能按常规方法直接使用。

实践

读取大文件

准备了一个大小为2067MiB的大文件进行测试:

root@scaler80:/mnt/code# du -m ./big_file.txt 
2067	./big_file.txt

测试脚本如下:

#!/usr/bin/env python
# -*- coding:UTF-8 -*-

import os
import time
import psutil

# from memory_profiler import profile

def time_diff(func):
    def wrapper():
        start_time = time.time()
        func()
        end_time = time.time()
        time_diff= end_time - start_time
        print("Time diff : {}".format('%.4f s' % (time_diff)))

    return wrapper


def get_memory_usage(func):
    def wrapper():
        before_used = float(psutil.swap_memory().free) / 1024 / 1024 / 1024
        func()
        after_used = float(psutil.swap_memory().free) / 1024 / 1024 / 1024
        print("Memory diff : {}".format('%.4f GB' % (after_used - before_used)))
    return wrapper

@time_diff
@get_memory_usage
def open_file_test():
    file_name = r"D:\big_file.txt"
    with open(file_name, "r") as f:
        f.read()


if __name__ == "__main__":
    open_file_test()

输出结果:

root@scaler80:/mnt/code# python time_diff.py
Memory diff : 1.0151 GB
Time diff : 6.3894 s

root@scaler80:/mnt/code# 

从测试效果看,整个读取过程,文件占用了1.0151 GB的内存,耗时6.3894秒。

优化一下看看效果:

@time_diff
@get_memory_usage
def open_file_test():
    chunk_size = 1024 * 1024
    file_name = r"D:\big_file.txt"
    with open(file_name, "r") as f:
        f.read(chunk_size)

执行效果:

root@scaler80:/mnt/code# python time_diff.py
Memory diff : 0.0000 GB
Time diff : 0.1323 s

再次优化:

def chunked_file_reader(file, block_size=1024 * 1024):
    for chunk in iter(partial(file.read, block_size), ''):
        yield chunk


@time_diff
@get_memory_usage
def open_file_test():
    file_name = r"D:\big_file.txt"
    with open(file_name, "r") as f:
        chunked_file_reader(f)

效果如下:

root@scaler80:/mnt/code# python time_diff.py
Memory diff : 0.0000 GB
Time diff : 0.1168 s

root@scaler80:/mnt/code# 

最后两个优化基本上没差。


文章作者: Gavin Wang
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Gavin Wang !
  目录