Prevent `CUDA error: out of memory` in just 1 line of code.

🐨 Koila

Koila painlessly solves `CUDA error: out of memory`. Fix it with just one line of code, and forget about it.

License: MIT

🚀 Features

  • 🙅 Prevents `CUDA error: out of memory` with a single line of code.

  • 🦄 Lazily evaluates PyTorch code to save computing power.

  • ✂️ Automatically splits along the batch dimension into more GPU-friendly sizes (powers of 2) to speed up execution.

  • 🤏 Minimal API (wrapping all inputs is enough).

🤔 Why Koila?

Ever encountered RuntimeError: CUDA error: out of memory? We all love PyTorch because of its speed, efficiency, and transparency, but that means it doesn't do extra things. Things like preventing a very common error that has been bothering many users since 2017.

This library aims to prevent that by being a lightweight wrapper over native PyTorch. When a tensor is wrapped, the library automatically computes the amount of remaining GPU memory and uses the right batch size, saving everyone from having to manually fine-tune the batch size whenever a model is used.

The library also picks a GPU-friendly batch size automatically. Did you know that using bigger batches doesn't always speed up processing? Koila handles that for you too.
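As a rough illustration of the batch-splitting idea, the sketch below breaks a batch into chunks whose sizes are powers of 2, none larger than a given limit. The function name `split_into_powers_of_two` and the `max_chunk` parameter are hypothetical and not part of Koila's API; the library's actual strategy may differ.

```python
def split_into_powers_of_two(total: int, max_chunk: int) -> list[int]:
    """Split `total` samples into power-of-2 chunk sizes, each <= max_chunk.

    Hypothetical helper for illustration only; not Koila's implementation.
    """
    if total < 0 or max_chunk < 1:
        raise ValueError("total must be >= 0 and max_chunk must be >= 1")

    chunks = []
    remaining = total
    while remaining > 0:
        limit = min(remaining, max_chunk)
        # Largest power of 2 that does not exceed `limit`.
        size = 1 << (limit.bit_length() - 1)
        chunks.append(size)
        remaining -= size
    return chunks


print(split_into_powers_of_two(20, 16))  # [16, 4]
print(split_into_powers_of_two(20, 6))   # [4, 4, 4, 4, 4]
```

Every sample ends up in exactly one chunk, so nothing is dropped; only the grouping changes.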

Because Koila runs PyTorch under the hood, Koila code is PyTorch code, and you can use both together without worrying about compatibility.

Oh, and all that in 1 line of code! 😊

⬇️ Installation

Koila is available on PyPI. To install, run the following command.

pip install koila

🏃 Getting started

Usage is dead simple. Suppose you have the following PyTorch code (copied from PyTorch's tutorial).

Define the input, label, and model:

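import torch
from torch.nn import CrossEntropyLoss, Flatten, Linear, Module, ReLU, Sequential

# A batch of MNIST images
input = torch.randn(8, 28, 28)

# A batch of labels: 8 integer class indices in [0, 10)
label = torch.randint(0, 10, [8])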

class NeuralNetwork(Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = Flatten()
        self.linear_relu_stack = Sequential(
            Linear(28 * 28, 512),
            ReLU(),
            Linear(512, 512),
            ReLU(),
            Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

Define the loss function, calculate output and losses.

# Instantiate the model and the loss function
nn = NeuralNetwork()
loss_fn = CrossEntropyLoss()

# Calculate losses
out = nn(input)
loss = loss_fn(out, label)

# Backward pass
nn.zero_grad()
loss.backward()

OK. How do you adapt the code to use Koila's features?

Change the line that defines the input:

from koila import lazy

# Wrap the input tensor.
# If a batch argument is provided, that dimension of the tensor is treated as the batch.
# In this case, the first dimension (dim=0) is used as the batch dimension.
input = lazy(torch.randn(8, 28, 28), batch=0)

Done. You will not run out of memory again.

See examples/getting-started.py for the full example.

🏋️ How does it work under the hood?

CUDA error: out of memory generally happens in the forward pass, because temporary variables need to be kept in memory.

Koila is a thin wrapper around PyTorch, inspired by TensorFlow's static/lazy evaluation. By building the graph first and running the model only when necessary, Koila has access to all the information needed to determine how many resources the computation really requires.
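To make the "build first, run later" idea concrete, here is a toy sketch in plain Python. The `LazyValue` class is hypothetical and works on lists of floats; it only illustrates deferred evaluation over slices of a batch and is not how Koila's LazyTensor is implemented.

```python
from typing import Callable, Sequence


class LazyValue:
    """Records an elementwise computation over a batch instead of running it.

    Hypothetical toy class for illustration; Koila's LazyTensor is far more general.
    """

    def __init__(self, data: Sequence[float], ops: tuple = ()):
        self.data = data
        self.ops = ops  # Deferred functions, applied in order when run() is called.

    def map(self, func: Callable[[float], float]) -> "LazyValue":
        # Nothing is computed here; the function is only appended to the graph.
        return LazyValue(self.data, self.ops + (func,))

    def run(self, start: int = 0, stop=None) -> list:
        # Evaluate the recorded operations on one slice of the batch.
        out = list(self.data[start:stop])
        for func in self.ops:
            out = [func(x) for x in out]
        return out


batch = LazyValue([1.0, 2.0, 3.0, 4.0])
result = batch.map(lambda x: x * 2).map(lambda x: x + 1)  # builds the graph only

# Evaluate in two mini-batches and stitch the pieces back together.
pieces = result.run(0, 2) + result.run(2, 4)
print(pieces)  # [3.0, 5.0, 7.0, 9.0]
```

Because the whole chain of operations is known before anything runs, the evaluator is free to decide how large each slice should be.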

In terms of memory usage, only the shapes of temporary variables are required to calculate how much memory those variables use. For example, + takes in two tensors of equal size and outputs a tensor of that same size, and log takes in one tensor and outputs another tensor with the same shape. Broadcasting makes it a little more complicated than that, but the general idea is the same. By tracking all these shapes, one can easily tell how much memory a forward pass uses and select the optimal batch size accordingly.
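The shape bookkeeping described above can be sketched with nothing but tuples. The helpers below (`broadcast_shape`, `linear_shape`, `num_bytes`) are hypothetical names, the 4 bytes per element assumes float32, and the 1 MiB budget is made up; parameters, gradients, and allocator overhead are ignored, so treat the numbers as illustrative only.

```python
from itertools import zip_longest
from math import prod


def broadcast_shape(a: tuple, b: tuple) -> tuple:
    """Result shape of an elementwise op such as `+` under NumPy/PyTorch-style broadcasting."""
    out = []
    # Align shapes from the trailing dimension; missing dimensions count as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"Shapes {a} and {b} cannot be broadcast.")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


def linear_shape(x: tuple, out_features: int) -> tuple:
    """A linear layer keeps the leading dimensions and replaces the last one."""
    return x[:-1] + (out_features,)


def num_bytes(shape: tuple, bytes_per_element: int = 4) -> int:
    """Memory for one tensor of this shape (4 bytes per element assumes float32)."""
    return prod(shape) * bytes_per_element


# Shapes flowing through the 784 -> 512 -> 512 -> 10 stack from the example,
# for a single flattened sample. ReLU keeps the shape of its input.
x = (1, 28 * 28)
h1 = linear_shape(x, 512)
h2 = linear_shape(h1, 512)
logits = linear_shape(h2, 10)

# Count each linear output and each ReLU output as one temporary tensor.
temporaries = [h1, h1, h2, h2, logits]
per_sample = sum(num_bytes(s) for s in temporaries)
print(per_sample)  # 8232

# With a hypothetical budget of 1 MiB for temporaries, this many samples fit at once.
budget = 1 << 20
print(budget // per_sample)  # 127
```

No tensor is ever allocated here: the estimate comes entirely from shapes, which is why the check is cheap.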

🐌 It sounds slow. Is it?

NO. Indeed, calculating shapes and computing sizes and memory usage sounds like a lot of work. However, keep in mind that even a gigantic model like GPT-3, which has 96 layers, has only a few hundred nodes in its computation graph. Because Koila's algorithms run in linear time, any modern computer can handle a graph like this instantly.

Most of the time is spent computing individual tensors and transferring tensors across devices. And bear in mind that those checks happen in vanilla PyTorch anyway. So no, not slow at all.

🔊 How to pronounce koila?

This project was originally named koala, after the laziest species in the world, since the project is about lazy evaluation of tensors. However, as that name is taken on PyPI, I had no choice but to use another name. Koila is a word I made up. It is pronounced like voila (a French word), so it sounds like koala.

⭐ Give me a star!

If you like what you see, please consider giving this a star (★)!

🏗️ Why did I build this?

Batch size search is not new. In fact, the mighty popular PyTorch Lightning has it. So why did I go through the trouble and build this project?

PyTorch Lightning's batch size search is deeply integrated into its own ecosystem. You have to use its DataLoader, subclass its model classes, and train your models accordingly. While it works well for supervised learning tasks, it's really painful to use in a reinforcement learning task, where interacting with the environment is a must.

In comparison, because Koila is a super lightweight PyTorch wrapper, it works when PyTorch works, thus providing maximum flexibility and minimal changes to existing code.

📝 Todos

  • 🧩 Provide an extensible API to write custom functions for the users.
  • 😌 Simplify internal workings even further. (Especially interaction between Tensors and LazyTensors).
  • Work with multiple GPUs.

🚧 Warning

The code works in many cases, but it's still a work in progress. Due to limited time, this is not (yet) a fully PyTorch-compatible library.

🥰 Contributing

We take openness and inclusiveness very seriously. We have adopted the following Code of Conduct.

Comments
  • Using Koila with Big Sleep?

    Hi, this project could be revolutionary, if only I knew how to use it :)

    You have surely heard of Big Sleep, right? Using CLIP and BIGGAN, it can generate amazing visuals and unique works of art from just a line of text, which is why it is getting more and more popular among an ever-growing number of artists and curious people who are deeply fascinated by the potential of these techniques...

    However, many of us have not been able to run these kinds of projects on our machines because of the low VRAM in consumer GPUs and crazy market prices, and ended up stumbling almost immediately on the infamous CUDA memory error... (Yes, Google Colab is nice and all, but running these projects locally makes for a totally different kind of "technological chill", if you know what I mean :) )

    So, I was thinking, would it be possible to apply Koila to Big Sleep to fix those errors? If so, that'd be a game changer! It would benefit a huge number of users and, at the same time, translate into massive traction for Koila!
    Looking at the README, I thought the whole process would be very simple, so I tried looking at it myself... but in the end I had to give up, because I've only just approached this field and I still lack much of the background needed to figure out these kinds of details.

    So yeah, would you consider providing a short example for this use case of Koila + Big Sleep, if feasible? In that case just a few lines of code could potentially mean the beginning of a little revolution :)

  • Stack overflow (endless loop) when gradients are disabled

    I've just installed and tried out koila. However there seems to be an endless loop when applying it to my backbone model. It uses Conv1d and gradients are disabled. Also it seems like koila does not handle the permute operation.

  • [BUG] pip can't find the package on Kaggle & Colab

    Hello, as someone who's suffering from "cuda out of memory" errors on Kaggle notebooks, I can't wait to use your package. However, I run into errors when I try to install koila on both Kaggle and Colab notebooks.

    Describe the bug

    On Kaggle and Colab notebooks, !pip install koila outputs the following error message:

    ERROR: Could not find a version that satisfies the requirement koila (from versions: none)
    ERROR: No matching distribution found for koila

    To Reproduce

    Steps to reproduce the behavior:

    1. Run !pip install koila on Kaggle / Colab

    I'd appreciate it if anyone provides me with an alternate solution until this error gets fixed.

  • Compatibility with PyTorch hooks.

    Hello, I find this project interesting. However, I found that the lazy tensor mechanism cannot work with PyTorch backward hooks, which makes it difficult to use in combination with PyTorch checkpointing (https://pytorch.org/docs/stable/checkpoint.html). Checkpointing is a common way to avoid OOM in training.

  • getting-started.py failed!

    I ran the following command with the input batch size set to 20 (PyTorch 1.10.0):

    python example/getting-started.py

    The error:

    Traceback (most recent call last):
      File "/home/user/codes/koila/examples/getting-started.py", line 97, in <module>
        lazy_loss.backward()
      File "/home/user/anaconda3/envs/torch/lib/python3.9/site-packages/koila/tensors.py", line 439, in backward
        mini_batch = self.run((total, total + mini_batch_size))
      File "/home/user/anaconda3/envs/torch/lib/python3.9/site-packages/koila/tensors.py", line 187, in run
        return data.run(partial)
      File "/home/user/anaconda3/envs/torch/lib/python3.9/site-packages/koila/tensors.py", line 94, in _run
        result = self.func(*real_args, **real_kwargs)
      File "/home/user/anaconda3/envs/torch/lib/python3.9/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
        return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
    ValueError: Expected input batch_size (16) to match target batch_size (20).

  • Can't install from pip (PyPi)

    I am unable to install this from PyPi using Pip. I'm not sure why, but I opened this issue in case anyone else was having this problem and was searching here.

    The output I get is this:

    pip install koila
    ERROR: Could not find a version that satisfies the requirement koila
    ERROR: No matching distribution found for koila
    
  • Issues with "No custom methods found. Evaluating eagerly."

    I tried this with a HuggingFace transformers model and set my batch size artificially large. Initially I saw the following before the OOM error.

    DEBUG    __getattr__ called for pin_memory. Automatically resolving function. 
    DEBUG    No custom methods found. Evaluating eagerly.  
    

    I changed the option of dataloader_pin_memory = False and got a little farther.

    DEBUG    __getattr__ called for to. Automatically resolving function.
    DEBUG    No custom methods found. Evaluating eagerly.
    

    This was resolved by moving the data to the GPU (calling .to('cuda:0')) in the collator (this is done in the model). The next error was:

    DEBUG    __getattr__ called for float. Automatically resolving function
    DEBUG    No custom methods found. Evaluating eagerly.
    

    This one I'm not sure how to resolve and I'm not certain that "Evaluating eagerly" is even the issue. However, after the first one of those debug statements I see the OOM error. Any advice?

  • cannot get "device" attribute from LazyTensor

    I have code that depends on getting the device on which the tensor is stored. The device is then used to initialize a new empty tensor that my model needs. Long story short, if tensor x is wrapped in LazyTensor then accessing x.device leads to an error.

    Maybe you need to consider transparently exposing most (if not all) attributes of the wrapped tensor?

  • Typo in README

    Just a typo for an incomplete sentence. Just wanted to let you know :) https://github.com/rentruewang/koila/blob/cca5830f24c46172947a0db29b55278585bfa912/README.md?plain=1#L150

  • This is fantastic, great work! Just to be clear...

    Just making sure, this lazy wrapper somehow divvies up the computations per GPU budget, right? it doesn't just... sub-sample a smaller batch and ignore the remainder, right?

  • Hi! We are building LazyTensor too!

    Hi! I'm with the PyTorch team, and it looks like we are also building something similar to koila here: https://github.com/pytorch/pytorch/tree/lazy_tensor_staging . We would love to connect and learn more about your work! If you are interested, could you please reply to this issue and drop me a line at k o r o v a i k o n AT gmail.com (no spaces obviously).

  • Maths domain error

    I am using Koila to solve an OOM error during my training, but the following error occurs:

    Traceback (most recent call last):
      File "/mnt/sdb2/Adama/configure_docker_for_transvw/pytorch/train.py", line 92, in <module>
        loss.backward()
      File "/home/nanaa/.local/lib/python3.10/site-packages/koila/lazy.py", line 435, in backward
        for mini_batch_size in gpus.split_batch(
      File "/home/nanaa/.local/lib/python3.10/site-packages/koila/gpus.py", line 100, in split_batch
        batch_size = 2 ** (math.floor(math.log2(max_batch)))
    ValueError: math domain error

    Probably due to the value of max_batch?
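    The last frame of the traceback calls math.log2(max_batch), and math.log2 raises ValueError for zero or negative arguments, so a max_batch of 0 or less would produce this kind of error. Whether that was the actual value here is an assumption. A minimal sketch of a guard around the same expression, using the hypothetical helper name `largest_power_of_two`:

```python
import math


def largest_power_of_two(max_batch: int) -> int:
    """Largest power of 2 that is <= max_batch.

    Hypothetical helper; it mirrors the expression in the traceback but
    rejects non-positive values with a clearer message.
    """
    if max_batch < 1:
        raise ValueError(f"max_batch must be at least 1, got {max_batch}")
    return 2 ** math.floor(math.log2(max_batch))


print(largest_power_of_two(20))  # 16

try:
    math.log2(0)  # the unguarded call from the traceback
except ValueError as error:
    print("log2 failed:", type(error).__name__)
```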

  • unet3d - koila.errors.UnsupportedError

    I am trying to apply koila lazy eval on a Unet3D.

    # defining the model
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    
    
    def conv3(in_channels, out_channels, stride, norm='BatchNorm3d', act='GELU'):
        return nn.Sequential(
                nn.Conv3d(in_channels, out_channels, 3, 1, 1),
                getattr(nn, norm)(out_channels),
                getattr(nn, act)())
    
    
    def double_conv3(in_channels, out_channels, stride):
        return nn.Sequential(conv3(in_channels, out_channels, 1),
                             conv3(out_channels, out_channels, stride))
    
    def merge_skip(x, skip):
        x = F.upsample(x, size=skip.shape[-3:], mode='trilinear', align_corners=True)
        return torch.cat((x,skip),dim=1)
    
    
    
    class Unet3D(nn.Module):
        def __init__(self, in_channels, out_channels, num_layers=4, base=16):  
            super().__init__()
    	
            enc_channels = [in_channels]+[base * 2**i for i in range(num_layers)]
            dec_channels = [base * 2**i for i in range(num_layers-1,-1,-1)]+[out_channels]
    
            self.encoders = nn.ModuleList()
            for i in range(len(enc_channels)-1):
                cin = enc_channels[i]
                cout = enc_channels[i+1]
                enc = double_conv3(cin, cout, 2)
                self.encoders.append(enc)
    
            self.decoders = nn.ModuleList()
            for i in range(len(dec_channels)-1):
                cin_skip = enc_channels[-i-2]
                cin_up = dec_channels[i]
                cin = cin_skip + cin_up 
                cout = dec_channels[i+1]
                dec = double_conv3(cin, cout, 1)	
                self.decoders.append(dec)
    
        def forward(self, x, return_all=False):
            out = [x]
            for encoder in self.encoders:
                x = encoder(x)
                out.append(x)
            n = len(out)
            for i, decoder in enumerate(self.decoders): 
                skip = out[n - 2 - i]
                x = merge_skip(out[-1], skip)
                x = decoder(x)
                out.append(x)
    
            if return_all:
                return out 
            else:
                return out[-1]
    
    # test of koila on unet
    def test_lazy():
        net = Unet3D(1,3)
        net.cuda()
        s = 64 
        b,c,d,h,w = 2,1,s,s,s
        x = torch.randn(b,c,d,h,w).cuda()
        t = torch.randint(0,3, (b,d,h,w)).cuda()
    
        loss_fn = nn.CrossEntropyLoss()
        net.zero_grad()
    
        lazy_x, lazy_t = lazy(x, t, batch=0)
        lazy_out = net(lazy_x)
        lazy_loss = loss_fn(lazy_out, lazy_t) 
        assert isinstance(lazy_loss, LazyTensor), type(lazy_loss)
        lazy_loss.backward()
    
    
    
    # This fails
    test_lazy()
    

    This fails and outputs:

    tensors = (tensor([[[[[-8.9936e-02, -7.9037e-02, -1.5048e-02,  ...,  2.9969e-01,
                 2.9774e-01, -1.0489e-01],
            ...]]], device='cuda:0',
           grad_fn=<UpsampleTrilinear3DBackward1>), <koila.lazy.LazyTensor object at 0x7fa21bf99880>)
    dim = 1, args = (), kwargs = {}, shapes = [torch.Size([2, 128, 64, 64, 64]), (2, 64, 64, 64, 64)]
    no_dim = [torch.Size([2, 64, 64, 64]), (2, 64, 64, 64)], result_size = torch.Size([2, 64, 64, 64])
    size = (2, 64, 64, 64)
    
        def cat(
            tensors: Sequence[TensorLike], dim: int = 0, *args: Any, **kwargs: Any
        ) -> PrePass:
            mute_unused_args(*args, **kwargs)
    
            if len(tensors) == 0:
                raise ValueError("Expected a sequence of tensors. Got empty sequence.")
    
            shapes = [t.size() for t in tensors]
            no_dim = [t[:dim] + t[dim + 1 :] for t in shapes]
    
            result_size = no_dim[0]
            for size in no_dim[1:]:
                if result_size != size:
                    raise ValueError(
                        f"Dimension should be equal outside dim {dim}. Got {shapes}."
                    )
    
            if len(set(interfaces.bat(t) for t in tensors)) != 1:
    >           raise UnsupportedError
    E           koila.errors.UnsupportedError
    
    ../miniconda3/envs/snakes/lib/python3.9/site-packages/koila/prepasses.py:423: UnsupportedError
    
  • KeyError: 0

    Thanks for your nice work! I wrapped my input and label:

    (feat, label) = lazy(feat, label, batch=0)

    Then I got the following error when running it:

    File "/home/victor/anaconda3/envs/py38_tab/lib/python3.8/site-packages/koila/lazy.py", line 504, in lazy_forward
      out = LazyTensor(LazyFunction(func, shape_func)(*args, **kwargs))
    File "/home/victor/anaconda3/envs/py38_tab/lib/python3.8/site-packages/koila/lazy.py", line 51, in __call__
      prepass = self.prepass_func(*args, **kwargs)
    File "/home/victor/anaconda3/envs/py38_tab/lib/python3.8/site-packages/koila/prepasses.py", line 286, in tranpose
      batch = b.map(lambda x: {dim0: dim1, dim1: dim0}[x])
    File "/home/victor/anaconda3/envs/py38_tab/lib/python3.8/site-packages/koila/interfaces.py", line 78, in map
      index = func(self.index)
    File "/home/victor/anaconda3/envs/py38_tab/lib/python3.8/site-packages/koila/prepasses.py", line 286, in <lambda>
      batch = b.map(lambda x: {dim0: dim1, dim1: dim0}[x])
    KeyError: 0

  • Major overhaul

    I'm planning on making a major overhaul, to simplify the code and make it more scalable.

    Currently, this project relies too much on checks to determine whether an object is a LazyTensor or a torch.Tensor. This is not only difficult to maintain but can also hurt performance.

    I'm in the process of creating a new wrapper for torch.Tensor that matches LazyTensor's API but executes immediately, for internal use.

    Also, I'm modifying the LazyTensor's API to match torch.Tensor's.

    I'll be using this issue to track my progress.

    Closes: #22 Closes: #25

  • wrong error in getting-started.py

    Hello, I noticed you fixed the lazy label bug and getting-started.py is now able to run. But it cannot pass the assertion. The grad diff is quite large!

    assert all( [print(torch.max(grad - lazy_grad)) for (grad, lazy_grad) in zip(grads, lazy_grads)] )

    tensor(0.0698) tensor(0.0227) tensor(0.0717) tensor(0.0415) tensor(0.5402) tensor(0.7869)
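    Independently of how large the gradient differences are, the assertion as written can never pass: print returns None, so all([...]) over its results is always False for a non-empty list. Below is a sketch of a tolerance-based check using plain Python floats; the values are hypothetical stand-ins for gradients. For tensors, torch.allclose serves the same purpose.

```python
import math

# Hypothetical gradient values standing in for tensors.
grads = [0.10, -0.25, 0.40]
lazy_grads = [0.10, -0.25, 0.40 + 1e-9]

# print() returns None, so this is False no matter what the values are.
always_false = all([print(abs(g - lg)) for g, lg in zip(grads, lazy_grads)])

# Compare with an explicit tolerance instead.
close = all(math.isclose(g, lg, abs_tol=1e-6) for g, lg in zip(grads, lazy_grads))
print(always_false, close)  # False True
```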
