This content originally appeared on Envato Tuts+ Tutorials and was authored by Monty Shokeen
If you have been using computers for some time, you have probably come across files with the .zip extension. They are special files that can hold the compressed content of many other files, folders, and subfolders. This makes them pretty useful for transferring files over the internet. Did you know that you can use Python to compress or extract files?
This tutorial will teach you how to use the zipfile module in Python to compress or extract individual files or multiple files at once.
Compressing Individual Files
This one is easy and requires very little code. We begin by importing the zipfile module and then create a ZipFile object in write mode by specifying the second parameter as 'w'. The first parameter is the path of the archive that we want to create. Here is the code that you need:
import zipfile

with zipfile.ZipFile('C:\\Stories\\Fantasy\\jungle.zip', 'w') as jungle_zip:
    jungle_zip.write('C:\\Stories\\Fantasy\\jungle.pdf', compress_type=zipfile.ZIP_DEFLATED)
Please note that I specify paths in a Windows-style format in all the code snippets; you will need to make appropriate changes if you are on Linux or Mac.
You can specify different compression methods to compress files. If you don't specify one, the default is ZIP_STORED, which stores files without compressing them. The newer methods BZIP2 and LZMA were added in Python version 3.3, and some other tools don't support these two compression methods. For this reason, the safest choice is the DEFLATED method. You should still try out the other methods to see the difference in the size of the compressed file.
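To see the size difference for yourself, here is a minimal sketch that writes the same data with each method and records the size of the resulting archive. The sample text and the entry name jungle.txt are made up for illustration, and the archives are built in memory with io.BytesIO so nothing is written to disk. The exact numbers will depend on your data.

```python
import io
import zipfile

# Made-up sample data: repetitive text compresses well, so differences are easy to see.
data = b"Once upon a time in the jungle. " * 2000

methods = {
    "STORED": zipfile.ZIP_STORED,
    "DEFLATED": zipfile.ZIP_DEFLATED,
    "BZIP2": zipfile.ZIP_BZIP2,
    "LZMA": zipfile.ZIP_LZMA,
}

sizes = {}
for name, method in methods.items():
    buffer = io.BytesIO()  # in-memory archive instead of a file on disk
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        archive.writestr("jungle.txt", data)
    sizes[name] = len(buffer.getvalue())

for name, size in sizes.items():
    print(f"{name:>8}: {size} bytes")
```

On real files such as PDFs, which are often compressed internally already, the gap between the methods may be much smaller than with this sample text.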
Compressing Multiple Files
This is slightly more complex, as you need to iterate over all the files. The code below should compress all files with the extension pdf in a given folder and its subfolders:
import os
import zipfile

# The with statement closes the archive even if an error occurs while writing.
with zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip', 'w') as fantasy_zip:
    for folder, subfolders, files in os.walk('C:\\Stories\\Fantasy'):
        for file in files:
            if file.endswith('.pdf'):
                # Store each file under its path relative to the source folder.
                fantasy_zip.write(os.path.join(folder, file),
                                  os.path.relpath(os.path.join(folder, file), 'C:\\Stories\\Fantasy'),
                                  compress_type=zipfile.ZIP_DEFLATED)
This time, we have imported the os module and used its walk() method to go over all files and subfolders inside our original folder. I am only compressing the pdf files in the directory. You can also create different archived files for each format using if statements.
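As a rough sketch of the one-archive-per-format idea, the function below creates a separate archive for each extension it is given. The function name zip_by_extension, the sample file names, and the temporary folders are all hypothetical; the demo uses tempfile so it can run on any system without touching your own files.

```python
import os
import tempfile
import zipfile


def zip_by_extension(source_dir, output_dir, extensions):
    """Create one archive per extension, e.g. pdf.zip and txt.zip (hypothetical helper)."""
    created = {}
    for ext in extensions:
        archive_path = os.path.join(output_dir, ext.lstrip('.') + '.zip')
        with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as archive:
            for folder, subfolders, files in os.walk(source_dir):
                for file in files:
                    if file.endswith(ext):
                        full_path = os.path.join(folder, file)
                        # Keep the path relative to the source folder inside the archive.
                        archive.write(full_path, os.path.relpath(full_path, source_dir))
            created[ext] = archive.namelist()
    return created


# Demo with throwaway folders and made-up file names.
with tempfile.TemporaryDirectory() as source, tempfile.TemporaryDirectory() as output:
    for name in ('a.pdf', 'b.txt', 'c.pdf'):
        with open(os.path.join(source, name), 'w') as handle:
            handle.write('sample')
    result = zip_by_extension(source, output, ['.pdf', '.txt'])

print(result)
```

This version walks the folder once per extension, which keeps the code short; for large directory trees, a single pass that keeps several archives open at once would likely be faster.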
If you don't want to preserve the directory structure, you can put all the files together by using the following line:
fantasy_zip.write(os.path.join(folder, file), file, compress_type = zipfile.ZIP_DEFLATED)
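The write() method accepts several parameters. The first parameter is the path of the file that we want to compress. The second parameter, arcname, is optional and allows you to specify a different name for the file inside the archive. If nothing is specified, the original path is used, without a drive letter and with leading path separators removed. The compress_type parameter overrides the archive's default compression method for that one file.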
Extracting All Files
You can use the extractall() method to extract all the files and folders from a zip file into the current working directory. You can also pass a folder name to extractall() to extract all files and folders into a specific directory. If the folder that you passed does not exist, this method will create one for you. Here is the code that you can use to extract files:
import zipfile

with zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip') as fantasy_zip:
    fantasy_zip.extractall('C:\\Library\\Stories\\Fantasy')
If you want to extract only some of the files, you can supply their names as a list through the members parameter of extractall().
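Here is a small sketch of extracting a subset of files with extractall(). The archive is built in memory first, and the entry names one.txt, two.txt, and three.txt are placeholders; with a real archive you would pass names taken from the archive itself.

```python
import io
import os
import tempfile
import zipfile

# Build a small throwaway archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    for name in ('one.txt', 'two.txt', 'three.txt'):
        archive.writestr(name, name.encode('utf-8'))

wanted = ['one.txt', 'three.txt']

with tempfile.TemporaryDirectory() as target:
    with zipfile.ZipFile(buffer) as archive:
        # Only the listed members are extracted; two.txt stays in the archive.
        archive.extractall(target, members=wanted)
    extracted = sorted(os.listdir(target))

print(extracted)
```

Every name in the list has to match an entry in the archive, so it is usually safest to build the list by filtering the archive's own list of names.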
Extracting Individual Files
This is similar to extracting multiple files. One difference is that this time you need to supply the name of the file first and the path to extract it to second. Also, you need to use the extract() method instead of extractall(). Here is a basic code snippet to extract individual files.
import zipfile

with zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip') as fantasy_zip:
    fantasy_zip.extract('Fantasy Jungle.pdf', 'C:\\Stories\\Fantasy')
Getting Information About Files
Consider a scenario where you need to see if a zip archive contains a specific file. With what we have covered so far, your only option would be to extract all the files in the archive. Similarly, you may need to extract only those files which are larger than a specific size. The zipfile module allows us to inquire about the contents of an archive without ever extracting it.
Using the namelist() method of the ZipFile object will return a list of all members of an archive by name. To get information on a specific file in the archive, you can use the getinfo() method of the ZipFile object. This will give you access to information specific to that file, like the compressed and uncompressed size of the file or its last modification time. We will come back to that later.
Calling the getinfo() method one by one on all files can be a tiresome process when there are a lot of files that need to be processed. In this case, you can use the infolist() method to return a list containing a ZipInfo object for every single member in the archive. The objects appear in the same order as the entries in the actual zip file.
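A quick sketch of both ideas: checking whether an archive contains a file with namelist(), and collecting details for every entry with infolist(). The archive contents here are invented and built in memory so the example runs anywhere.

```python
import io
import zipfile

# Throwaway archive with two made-up entries.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('The Thirsty Crow.txt', 'caw ' * 500)
    archive.writestr('notes/readme.txt', 'short')

with zipfile.ZipFile(buffer) as archive:
    # Membership test without extracting anything.
    has_crow = 'The Thirsty Crow.txt' in archive.namelist()

    # One ZipInfo object per entry, in archive order.
    report = [(info.filename, info.file_size, info.compress_size)
              for info in archive.infolist()]

print(has_crow)
for filename, file_size, compress_size in report:
    print(filename, file_size, compress_size)
```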
You can also directly read the contents of a specific file from the archive using the read(file) method, where file is the name of the file that you intend to read. To do this, the archive must be opened in read or append mode.
To get the compressed size of an individual file from the archive, you can use the compress_size attribute. Similarly, to know the uncompressed size, you can use the file_size attribute.
The following code uses the properties and methods we just discussed to extract only those files that have a size below 1MB.
import zipfile

with zipfile.ZipFile('C:\\Stories\\Funny\\archive.zip') as stories_zip:
    for file in stories_zip.namelist():
        if stories_zip.getinfo(file).file_size < 1024*1024:
            stories_zip.extract(file, 'C:\\Stories\\Short\\Funny')
To know the time and date when a specific file from the archive was last modified, you can use the date_time attribute. This will return a tuple of six values. The values will be the year, month, day of the month, hours, minutes, and seconds, in that specific order. The year will always be greater than or equal to 1980, and hours, minutes, and seconds are zero-based.
import zipfile

with zipfile.ZipFile('C:\\Stories\\Funny\\archive.zip') as stories_zip:
    thirsty_crow_info = stories_zip.getinfo('The Thirsty Crow.pdf')
    print(thirsty_crow_info.date_time)
    print(thirsty_crow_info.compress_size)
    print(thirsty_crow_info.file_size)
This information about the original file size and compressed file size can help you decide whether it is worth compressing a file. I am sure it can be used in some other situations as well.
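As one way to put these attributes to use, the sketch below turns the date_time tuple into a datetime object and works out what fraction of the original size was saved by compression. The entry name story.txt, its text, and the fixed timestamp are assumptions chosen only to make the example reproducible.

```python
import io
import zipfile
from datetime import datetime

text = b'The crow was thirsty. ' * 1000

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    # A fixed, made-up timestamp keeps the example reproducible.
    entry = zipfile.ZipInfo('story.txt', date_time=(2016, 7, 6, 9, 44, 48))
    entry.compress_type = zipfile.ZIP_DEFLATED
    archive.writestr(entry, text)

with zipfile.ZipFile(buffer) as archive:
    story = archive.getinfo('story.txt')

    # The six-value tuple maps directly onto the datetime constructor.
    modified = datetime(*story.date_time)

    # Fraction of the original size saved by compression (0 means no gain).
    saved = 1 - story.compress_size / story.file_size

print(modified)
print(f'{saved:.1%} saved')
```

A value close to zero suggests the file is already compressed, in which case storing it without compression may be just as good.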
Reading and Writing Content to Files
We were able to get a lot of important information about the files in our archive using their ZipInfo objects. Now, it is time to go a step further and get the actual content of those files. I have taken some text files from the Project Gutenberg website and created an archive with them. We will now read the contents of one of the files in the archive using the read() method. It will return the bytes of the given file as long as the archive containing the file is open for reading. Here is an example:
import zipfile

with zipfile.ZipFile('D:\\tutsplus-tests\\books.zip') as books:
    for file in books.namelist():
        if file == 'Frankenstein.txt':
            contents = books.read(file)

            # <class 'bytes'>
            print(type(contents))

            # b'\xef\xbb\xbfThe Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft
            print(contents)

            # 29
            print(contents.count(b'Frankenstein'))

            contents = contents.replace(b'Frankenstein', b'Crankenstein')

            # b'\xef\xbb\xbfThe Project Gutenberg eBook of Crankenstein, by Mary Wollstonecraft
            print(contents)
As you can see, the read() method returns a bytes object with all the content of the file we are reading. You can do a lot of operations on the contents of the file, like finding the position of any sub-sequence from either end of the data, or regular replacements like we did above. In our example, we are doing all our operations with simple byte strings because we are reading text files.
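If you would rather work with regular strings, you can decode the bytes. The sketch below assumes the file is UTF-8 encoded with a leading byte order mark (the \xef\xbb\xbf seen in the output above), which the 'utf-8-sig' codec removes. The archive here is a small in-memory stand-in, not the real Project Gutenberg file.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as books:
    # Stand-in for a text file that starts with a UTF-8 byte order mark.
    books.writestr('Frankenstein.txt',
                   b'\xef\xbb\xbfThe Project Gutenberg eBook of Frankenstein')

with zipfile.ZipFile(buffer) as books:
    raw = books.read('Frankenstein.txt')

    # Search from the start and from the end of the data.
    first_hit = raw.find(b'Frankenstein')
    last_hit = raw.rfind(b'Frankenstein')

    # 'utf-8-sig' decodes UTF-8 and drops the leading byte order mark.
    text = raw.decode('utf-8-sig')

print(first_hit, last_hit)
print(text)
```

If your files use a different encoding, pass that encoding to decode() instead.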
The ZipFile object also has a write() method, but it is used to add existing files to the archive and not to write content into those files themselves. One way to write content to a specific file in the archive is to open it in write mode using the ZipFile object's open() method (supported since Python 3.6) and then use write() on the returned file object to add content.
import zipfile

with zipfile.ZipFile('D:\\tutsplus-tests\\multiples.zip', 'w') as multiples_zip:
    for i in range(1, 101):
        with multiples_zip.open(str(i) + '.txt', 'w') as file:
            for j in range(1, 101):
                line = ' '.join(map(str, [i, 'x', j, '=', i*j])) + '\n'
                number = bytes(line, 'utf-8')
                file.write(number)
The above code will create 100 text files, named 1.txt through 100.txt, each storing the first 100 multiples of its number. We convert our string to bytes because write() expects a bytes-like object instead of a regular string.
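When the whole content of a file is available up front, the writestr() method of the ZipFile object is another option: it takes the name of the entry and its data, and it accepts a regular string, which it encodes as UTF-8. The sketch below is a smaller variation on the multiplication tables, written to an in-memory archive; the reduced ranges are just to keep the output short.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as multiples_zip:
    for i in range(1, 4):
        lines = [f'{i} x {j} = {i * j}' for j in range(1, 11)]
        # writestr() creates the entry and writes its content in one call.
        multiples_zip.writestr(f'{i}.txt', '\n'.join(lines) + '\n')

with zipfile.ZipFile(buffer) as multiples_zip:
    table_of_two = multiples_zip.read('2.txt').decode('utf-8').splitlines()

print(table_of_two)
```

Using open() in write mode is the better fit when content is produced piece by piece, because it avoids building the whole file in memory first.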
Final Thoughts
As evident from this tutorial, using the zipfile module to compress files gives you a lot of flexibility. You can compress different files in a directory to different archives based on their type, name, or size. You also get to decide whether you want to preserve the directory structure or not. Similarly, while extracting the files, you can extract them to the location you want, based on your own criteria like size, etc.
To be honest, it was also pretty exciting for me to compress and extract files by writing my own code. I hope you enjoyed the tutorial, and if you have any questions, please let me know in the comments.
Monty Shokeen | Sciencx (2016-07-06T09:44:49+00:00) Compressing and Extracting Files in Python. Retrieved from https://www.scien.cx/2016/07/06/compressing-and-extracting-files-in-python/