Aardvark female with cub. Photo: Scotto Bear, CC BY-SA 2.0

You write a data-processing program. It passes its tests perfectly on a small file, but crashes under a real load.
The problem is running out of memory. With 16 gigabytes of RAM, you cannot load a hundred-gigabyte file into it. At some point the OS will be unable to allocate more memory, and the program will crash.
What to do?
Well, you could deploy a Big Data cluster; all you need to do is:
- Find a cluster of computers.
- Set it up in a week.
- Learn the new API and rewrite your code.
It is expensive and unpleasant. Fortunately, it is often not necessary.
We need a simpler solution: processing the data on a single computer, with minimal setup and as much use of the libraries we already rely on as possible. Almost always this can be done with a handful of basic techniques, sometimes called out-of-core computation.
In this article we discuss:
- Why do we need RAM at all.
- The easiest way to process data that does not fit in memory is to spend a little money.
- The three main software techniques for handling data that is too big: compression, chunking, and indexing.
Future articles will show in practice how to apply these methods with specific libraries such as NumPy and Pandas. But first, the theory.
Why is RAM necessary at all?
Before discussing solutions, let's clarify why this problem exists at all. Data can be written to random access memory (RAM), but also to the hard disk, so why do you need RAM at all? Disk is cheaper and usually has no shortage of space, so why not limit yourself to reading from and writing to disk?
Theoretically this might work. But even modern, fast SSDs are much, much slower than RAM:
- Read from an SSD: ~16,000 nanoseconds
- Read from RAM: ~100 nanoseconds
For fast computation there is no choice: the data has to be in RAM, otherwise the code will run roughly 150 times slower.
The easiest solution: more RAM
The easiest solution to the problem of running out of RAM is to spend some money. You can buy a powerful computer or server, or rent a virtual machine with lots of memory. In November 2019, a quick search and a very brief price comparison gave the following options:
- Buy a Thinkpad M720 Tower with 6 cores and 64 GB of RAM for $1,074
- Rent a cloud virtual machine with 64 cores and 432 GB of RAM for $3.62/hour
These are just the numbers from a quick search; with a bit more research you can surely find better deals.
Spending a little money on hardware to fit data into RAM is often the cheapest solution. After all, our time is expensive. But sometimes this is not enough.
For example, if you run many data-processing jobs over a period of time, cloud computing may be a natural solution, but also an expensive one. On one of our projects, the compute costs would have eaten up all of the projected revenue from the product, including the revenue needed to pay my salary.
If buying or renting a large amount of RAM does not solve the problem or is not possible, the next step is to optimize the application itself so that it uses less memory.

Technique No. 1. Compression
Compression allows you to put the same data in less memory. There are two forms of compression:
- Lossless: after compression, exactly the same information is preserved as in the original data.
- Lossy: the stored data loses some detail, but ideally this does not greatly affect the results of the calculation.
To be clear, this is not about ZIP or gzip files, where the data is compressed on disk. To process data from a ZIP file you usually have to unpack it and then load the files into memory, so that does not help here.
What we need is to compress the representation of the data in memory.
Suppose your data contains only two possible values, and nothing else: "AVAILABLE" and "UNAVAILABLE". Instead of storing them as strings, at 10 bytes or more per record, you can store them as the Boolean values True and False, which can be encoded in a single byte. You can even go down to a single bit per value, reducing memory consumption by another factor of eight.
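As a rough sketch of both forms of compression in NumPy (the data here is invented for illustration; later articles cover NumPy and Pandas in more detail):

```python
import numpy as np

# Invented example data: one million status strings.
statuses = np.array(["AVAILABLE", "UNAVAILABLE"] * 500_000)
print(statuses.nbytes)     # ~44 MB: fixed-width strings, 44 bytes per record

# Lossless: re-encode each status as a single boolean (1 byte per record).
available = statuses == "AVAILABLE"
print(available.nbytes)    # ~1 MB

# Lossless, going further: pack eight booleans per byte (1 bit per record).
packed = np.packbits(available)
print(packed.nbytes)       # ~0.125 MB

# Lossy: store measurements as float32 instead of float64,
# halving memory at the cost of some precision.
measurements = np.random.rand(1_000_000)    # float64, ~8 MB
smaller = measurements.astype(np.float32)   # ~4 MB
```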
Technique No. 2. Splitting into blocks, loading data one block at a time
Chunking is useful when the data does not all have to be in memory at the same time. Instead, we can load it in pieces, processing one chunk at a time (or, as we discuss in the next article, several chunks in parallel).
Suppose you want to find the largest word in a book. You could load all the data into memory at once:
largest_word = "" for word in book.get_text().split(): if len(word) > len(largest_word): largest_word = word
But if the book does not fit in memory, you can load it page by page:
largest_word = "" for page in book.iterpages(): for word in page.get_text().split(): if len(word) > len(largest_word): largest_word = word
This greatly reduces memory use, because only one page of the book is in memory at a time, and the end result is the same answer.
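The same pattern applies to tabular data. Here is a minimal sketch with Pandas (covered properly in a later article), assuming a large CSV file named measurements.csv with a numeric value column, both invented for the example:

```python
import pandas as pd

total = 0.0
count = 0
# chunksize controls how many rows are held in memory at any one time.
for chunk in pd.read_csv("measurements.csv", chunksize=100_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean value:", total / count)
```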
Technique No. 3. Indexing when only a subset of data is required
Indexing is useful if you want to use only a subset of the data and you are going to load different subsets at different times.
In principle, in such a situation you could use chunking: load all the data and keep only the part you need. But that is slow and wasteful, because you first have to load a lot of irrelevant data into memory just to discard it.
If you only need part of the data, then instead of chunking it is better to use an index: a compact summary of the data that points to where it is actually stored.
Imagine that you want to read only the fragments of a book that mention aardvarks (the cute mammal in the photograph at the beginning of this article). If you check every page in turn, the whole book will be loaded, chunk by chunk, page by page, in search of aardvarks, and that will take quite a lot of time.
Or you can go straight to the alphabetical index at the back of the book and find the entry for "aardvark". It says the word is mentioned on pages 7, 19, and 120-123. Now you can read those pages, and only those pages, which is much faster.
This is an effective method because the index is much smaller than the entire book, so it is much easier to load only the index into memory to search for relevant data.
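As a minimal sketch in Python, reusing the hypothetical book API from the chunking example above (get_page() is an extra assumed method for loading a single page): the index is built once, page by page, and then reused for any number of lookups.

```python
from collections import defaultdict

# Build the index once: word -> list of page numbers where it appears.
# The book is read one page at a time, so it never has to fit in memory.
index = defaultdict(list)
for page_number, page in enumerate(book.iterpages()):
    for word in set(page.get_text().split()):
        index[word].append(page_number)

# Later, load only the pages that actually mention "aardvark".
for page_number in index["aardvark"]:
    page = book.get_page(page_number)  # assumed single-page loader
    print(page.get_text())
```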
The easiest indexing method
The easiest and most common way to index is naming files in a directory:
```
mydata/
    2019-Jan.csv
    2019-Feb.csv
    2019-Mar.csv
    2019-Apr.csv
    ...
```
If you need the data for March 2019, you just load the file 2019-Mar.csv; there is no need to load the data for February, July, or any other month.
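A sketch of how such an index gets used, assuming the directory layout above and a made-up amount column in each CSV; only the requested month's file is ever read from disk:

```python
import pandas as pd

def monthly_total(month):
    """Load only the file for the requested month, e.g. '2019-Mar'."""
    df = pd.read_csv(f"mydata/{month}.csv")
    return df["amount"].sum()  # 'amount' is an invented column name

print(monthly_total("2019-Mar"))
```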
Next: applying these methods
The problem of not having enough RAM is easiest to solve with money, by buying more RAM. But if that is not possible, or not enough, you will sooner or later end up using compression, chunking, or indexing.
The same techniques show up in many software packages and tools. Even high-performance Big Data systems are built on them: for example, the parallel processing of individual chunks of data.
In the following articles, we will look at how to apply these methods in specific libraries and tools, including NumPy and Pandas.