Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training

An integrated large-scale model training system with efficient parallelization techniques

Installation

PyPI

pip install colossalai

Install From Source

git clone git@github.com:hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt

# install colossalai
pip install .

Install and enable CUDA kernel fusion (required when using the fused optimizer):

pip install -v --no-cache-dir --global-option="--cuda_ext" .
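
To verify that the extension actually built, try importing one of the compiled modules (the colossalai._C module path matches colossalai 0.1.x; treat it as an assumption for other versions):

python -c "import colossalai._C.fused_optim"  # raises ImportError if the CUDA extension was not built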

Documentation

Quick View

Start Distributed Training in a Few Lines

import colossalai
from colossalai.engine import Engine
from colossalai.trainer import Trainer
from colossalai.core import global_context as gpc

model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
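# initialize() builds the components declared in the Colossal-AI config file;
# the Engine below wraps them and drives forward/backward/step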
engine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    schedule=schedule
)

trainer = Trainer(engine=engine,
                  hooks_cfg=gpc.config.hooks,
                  verbose=True)
trainer.fit(
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    max_epochs=gpc.config.num_epochs,
    display_progress=True,
    test_interval=5
)
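
A script like this is launched with one process per GPU. A sketch for a single node with 4 GPUs, assuming the snippet above lives in train.py and the config file is picked up by colossalai.initialize (the launcher below is standard PyTorch, not Colossal-AI specific):

# each process receives its rank and world size from the launcher
python -m torch.distributed.launch --nproc_per_node 4 train.py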

Write a Simple 2D Parallel Model

Suppose we have a huge MLP model whose very large hidden size makes it difficult to fit into a single GPU. We can distribute the model weights across GPUs in a 2D mesh while still writing the model in a familiar way.

from colossalai.nn import Linear2D
import torch.nn as nn


class MLP_2D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
        self.linear_2 = Linear2D(in_features=16384, out_features=1024)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        return x
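
To place these layers on a device mesh, the parallel setting goes in the Colossal-AI config file. A minimal sketch for 4 GPUs arranged as a 2×2 mesh, following the documented parallel-config convention:

# config.py
parallel = dict(
    pipeline=1,
    tensor=dict(size=4, mode='2d'),  # 4 GPUs form a 2 x 2 mesh for Linear2D
)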

Features

ColossalAI provides a collection of parallel training components. We aim to let you write distributed deep learning models the same way you write single-GPU models, and we provide friendly tools to kickstart distributed training in a few lines.

Owner
HPC-AI Tech
We are a global team helping you train and deploy your AI models.
Comments
  • [BUG]: Memory consumption by fp16 is not normal

    ๐Ÿ› Describe the bug

    When I use PyTorch's native AMP, the GPU memory usage is much smaller than with Colossal-AI. Why? The config is:

    from colossalai.amp import AMP_TYPE
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.nn.optimizer import HybridAdam
    
    fp16 = dict(
        mode=AMP_TYPE.TORCH,
    )
    
    optimizer = dict(
        type=HybridAdam,
        lr=0.001,
        # weight_decay=1e-2,
    )
    

    model | dataset | machines | batch | grad. accumulate size | ZeRO | speed | GPU memory | OPT | setup
    -- | -- | -- | -- | -- | -- | -- | -- | -- | --
    ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 24%, 2089/8549 [02:51<08:39, 12.43it/s] | 8703M | HybridAdam | single machine + Engine
    ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 19%, 1599/8549 [02:24<10:21, 11.17it/s] | 5769M | HybridAdam | single machine, w/o Engine, PyTorch native fp16

    Environment

    No response

  • [BUG]: RuntimeError of "RANK" when running train.py of ResNet example on a single GPU

    ๐Ÿ› Describe the bug

    I ran into a problem today when running python train.py, as below:

    /home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python /home/user/***/***
    /ColossalAI-Examples/image/resnet/train.py
    Traceback (most recent call last):
      File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py", line 210, in launch_from_torch
        rank = int(os.environ['RANK'])
      File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/os.py", line 679, in __getitem__
        raise KeyError(key) from None
    KeyError: 'RANK'
    
    During handling of the above exception, another exception occurred:
    
    ...
    
    RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch
    

    Is this error due to the absence of the environment variable RANK in my Ubuntu environment?
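
    A likely fix: colossalai.launch_from_torch reads RANK (along with WORLD_SIZE, MASTER_ADDR, MASTER_PORT) from the environment, and only the PyTorch launcher sets those variables, so launching through torchrun even for a single GPU should work, e.g.:

    torchrun --standalone --nproc_per_node 1 train.py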

    Environment

    Python: 3.10

  • [BUG]: type object 'ChunkManager' has no attribute 'search_chunk_size'

    ๐Ÿ› Describe the bug

    When I was training the diffusion model, this happened:

    Setting up LambdaLR scheduler...
    Traceback (most recent call last):
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804
        trainer.fit(model, data)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 578, in fit
        call._call_and_handle_interrupt(
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
        return trainer_fn(*args, **kwargs)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 620, in _fit_impl
        self._run(model, ckpt_path=self.ckpt_path)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in _run
        self.strategy.setup(self)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 333, in setup
        self.setup_precision_plugin()
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 270, in setup_precision_plugin
        chunk_size = self.chunk_size or ChunkManager.search_chunk_size(
    AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'
    Setting up LambdaLR scheduler...
    /root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
      rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
    Summoning checkpoint.

    Traceback (most recent call last):
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804
        trainer.fit(model, data)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 578, in fit
        call._call_and_handle_interrupt(
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
        return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
        return function(*args, **kwargs)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 620, in _fit_impl
        self._run(model, ckpt_path=self.ckpt_path)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in _run
        self.strategy.setup(self)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 333, in setup
        self.setup_precision_plugin()
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 270, in setup_precision_plugin
        chunk_size = self.chunk_size or ChunkManager.search_chunk_size(
    AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 806
        melk()
      File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 789, in melk
        trainer.save_checkpoint(ckpt_path)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1900, in save_checkpoint
        self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 512, in save_checkpoint
        _checkpoint = self.dump_checkpoint(weights_only)
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 444, in dump_checkpoint
        "state_dict": self._get_lightning_module_state_dict(),
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 526, in _get_lightning_module_state_dict
        state_dict = self.trainer.strategy.lightning_module_state_dict()
      File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 383, in lightning_module_state_dict
        assert isinstance(self.model, ZeroDDP)
    AssertionError
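
    This looks like an API mismatch between the pytorch-lightning ColossalAI strategy and the installed colossalai: the strategy calls ChunkManager.search_chunk_size, which the installed colossalai evidently does not provide. A minimal check before re-pinning the two packages to the versions the example was tested with:

    python -c "import colossalai, pytorch_lightning; print(colossalai.__version__, pytorch_lightning.__version__)"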

    Environment

    I trained using the steps below; all the steps are the same as: https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion

  • [BUG]: colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.cu:5:10: fatal error: cub/cub.cuh: No such file or directory (update: now with more build errors!)

    ๐Ÿ› Describe the bug

    Trying to run a finetune torchrun script, I get this error. ColossalAI was built from source as directed, but it still fails.

    anon@linuxmint:/media/anon/bighdd/ai/toolbox/training$ ./finetune.bash 
    + export BATCH_SIZE=4
    + BATCH_SIZE=4
    + export MODEL=/media/anon/bighdd/ai/models/opt-350m
    + MODEL=/media/anon/bighdd/ai/models/opt-350m
    + export NUMBER_OF_GPUS=1
    + NUMBER_OF_GPUS=1
    + export OUTPUT_DIR=checkpoints
    + OUTPUT_DIR=checkpoints
    ++ date +%Y-%m-%d_%H-%M-%S
    + LOG_NAME=2022-12-22_14-15-45
    + export HF_DATASETS_OFFLINE=1
    + HF_DATASETS_OFFLINE=1
    + mkdir -p checkpoints/logs
    + mkdir -p checkpoints/runs
    + torchrun --nproc_per_node 1 --master_port 19198 ./colossalai/run_clm.py --train_file ./data/train.json --learning_rate 2e-5 --checkpointing_steps 64 --mem_cap 0 --model_name_or_path /media/anon/bighdd/ai/models/opt-350m --output_dir checkpoints --per_device_eval_batch_size 4 --per_device_train_batch_size 4
    + tee checkpoints/logs/2022-12-22_14-15-45.log
    2022-12-22 14:15:51.339450: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    [12/22/22 14:15:54] INFO     colossalai - colossalai - INFO:                                                                              
                                 /home/anon/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device          
                        INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                          
    [12/22/22 14:15:55] INFO     colossalai - colossalai - INFO:                                                                              
                                 /home/anon/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed            
                        INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,                
                                 ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.           
                        INFO     colossalai - colossalai - INFO: /home/anon/.local/lib/python3.8/site-packages/colossalai/initialize.py:117   
                                 launch                                                                                                       
                        INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline      
                                 parallel size: 1, tensor parallel size: 1                                                                    
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:309 main                                             
                        INFO     colossalai - colossalai - INFO: Start preparing dataset                                                      
    Using custom data configuration default-ced548c04fa8d0c8
    Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
    100%|██████████| 1/1 [00:00<00:00, 597.82it/s]
    Using custom data configuration default-ced548c04fa8d0c8
    Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
    Using custom data configuration default-ced548c04fa8d0c8
    Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:350 main                                             
                        INFO     colossalai - colossalai - INFO: Dataset is prepared                                                          
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:366 main                                             
                        INFO     colossalai - colossalai - INFO: Model config has been created                                                
    load model from /media/anon/bighdd/ai/models/opt-350m
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:373 main                                             
                        INFO     colossalai - colossalai - INFO: GPT2Tokenizer has been created                                               
                        INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:388 main                                             
                        INFO     colossalai - colossalai - INFO: Finetune a pre-trained model                                                 
    [12/22/22 14:16:04] INFO     colossalai - ProcessGroup - INFO:                                                                            
                                 /home/anon/.local/lib/python3.8/site-packages/colossalai/tensor/process_group.py:24 get                      
                        INFO     colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]                                        
    [12/22/22 14:16:07] INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:400 main                                             
                        INFO     colossalai - colossalai - INFO: using Colossal-AI version 0.1.13                                             
    searching chunk configuration is completed in 0.67 s.
    used number: 315.85 MB, wasted number: 3.01 MB
    total wasted percentage is 0.95%
    /home/anon/.local/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py:40: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      return tensor.storage().size() == 0
    /home/anon/.local/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py:45: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      tensor.storage().resize_(0)
    [12/22/22 14:16:09] INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:415 main                                             
                        INFO     colossalai - colossalai - INFO: GeminiDDP has been created                                                   
    Running tokenizer on dataset: 100%|██████████| 10/10 [00:23<00:00,  2.34s/ba]
    Running tokenizer on dataset: 100%|██████████| 1/1 [00:01<00:00,  1.18s/ba]
    [12/22/22 14:16:37] WARNING  colossalai - colossalai - WARNING: ./colossalai/run_clm.py:444 main                                          
                        WARNING  colossalai - colossalai - WARNING: The tokenizer picked seems to have a very large `model_max_length`        
                                 (1000000000000000019884624838656). Picking 1024 instead. You can change that default value by passing        
                                 --block_size xxx.                                                                                            
    Grouping texts in chunks of 1024: 100%|██████████| 10/10 [00:05<00:00,  1.92ba/s]
    Grouping texts in chunks of 1024: 100%|██████████| 1/1 [00:00<00:00,  3.61ba/s]
    [12/22/22 14:16:42] INFO     colossalai - colossalai - INFO: ./colossalai/run_clm.py:503 main                                             
                        INFO     colossalai - colossalai - INFO: Dataloaders have been created                                                
    /home/anon/.local/lib/python3.8/site-packages/colossalai/tensor/colo_tensor.py:182: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      ret = func(*args, **kwargs)
    /home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/nvme_optimizer.py:55: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
      numel += p.storage().size()
    Traceback (most recent call last):
      File "/home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 80, in __init__
        import colossalai._C.cpu_optim
    ModuleNotFoundError: No module named 'colossalai._C.cpu_optim'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/media/anon/bighdd/ai/toolbox/training/./colossalai/run_clm.py", line 643, in <module>
        main()
      File "/media/anon/bighdd/ai/toolbox/training/./colossalai/run_clm.py", line 519, in main
        optimizer = HybridAdam(optimizer_grouped_parameters, lr=args.learning_rate)
      File "/home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 83, in __init__
        raise ImportError('Please install colossalai from source code to use HybridAdam')
    ImportError: Please install colossalai from source code to use HybridAdam
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 206247) of binary: /usr/bin/python3
    Traceback (most recent call last):
      File "/home/anon/.local/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
        return f(*args, **kwargs)
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
        run(args)
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
        elastic_launch(
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    ./colossalai/run_clm.py FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2022-12-22_14:16:47
      host      : linuxmint
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 206247)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
    

    Environment

    Python 3.8.10
    torch: 2.0.0.dev20221215+cu117
    colossalai: 0.1.13
    Nvidia 3060 12GB
    NVIDIA-SMI 525.60.11, Driver Version: 525.60.11, CUDA Version: 12.0
    Cuda compilation tools, release 10.1, V10.1.243
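
    The environment above points at the likely cause: cub/cub.cuh ships with the CUDA Toolkit only from CUDA 11.0 onward, and the build here uses nvcc 10.1 even though the driver reports CUDA 12.0. Two quick checks:

    nvcc --version                                       # toolkit used to compile the extension; needs 11.x for cub
    python -c "import torch; print(torch.version.cuda)"  # toolkit PyTorch itself was built against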

  • [BUG]: ZeRO without using shard_param

    ๐Ÿ› Describe the bug

    ๐Ÿ› Describe the bug

    When I use ZeRO without shard_param, the following problem occurs:

    Traceback (most recent call last):
      File "train.py", line 175, in <module>
        main()
      File "train.py", line 39, in main
        with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
      File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 75, in __init__
        self.config = ZeroContextConfig(target_device=target_device, replicated=True, shard_param=shard_param)
      File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 37, in __init__
        assert target_device.type == 'cuda', "Replicated no-shard paramters should locate in cuda."
    AttributeError: 'int' object has no attribute 'type'
    
    

    My init code is:

    def main():
        parser = colossalai.get_default_parser()
        parser.add_argument('--use_trainer', action='store_true', help='whether to use trainer')
        args = parser.parse_args()
    
        colossalai.launch_from_torch(config='./config.py')
    
        logger = get_dist_logger()
    
        rank = int(os.environ['RANK'])
        # build resnet
        use_zero3 = hasattr(gpc.config, 'zero')
        if use_zero3:
            shard_strategy = TensorShardStrategy()
            with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
                model = resnet34(num_classes=10)
        else:
            model = resnet34(num_classes=10)
    

    my config is

    from colossalai.amp import AMP_TYPE
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.nn.optimizer import HybridAdam
    
    zero = dict(
        model_config=dict(
            tensor_placement_policy='cuda',
            shard_strategy=TensorShardStrategy(),
            reuse_fp16_shard=False
        ),
        optimizer_config=dict()
    )
    
    optimizer = dict(
        type=HybridAdam,
        lr=0.001,
        # weight_decay=1e-2,
    )
    
    BATCH_SIZE = 64
    NUM_EPOCHS = 20
    LOGGING_FREQUNCE = 20
    OUTPUT = './'
    
    gradient_clipping = 5.0
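
    A plausible fix, going by the assertion: torch.cuda.current_device() returns a plain int, while ZeroInitContext inspects target_device.type, so it expects a torch.device:

    with ZeroInitContext(target_device=torch.device('cuda', torch.cuda.current_device()),
                         shard_strategy=shard_strategy,
                         shard_param=False):
        model = resnet34(num_classes=10)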
    

    Environment

    pip install colossalai==0.1.5+torch1.10cu11.1 -f https://release.colossalai.org

    Ubuntu 18.04

  • [BUG]: Issue with Colossal-AI on Cuda 11.4 and Docker?

    ๐Ÿ› Describe the bug

    Followed the installation guide here: https://github.com/hpcaitech/ColossalAI

    2001  mkdir colossalai
    2002  cd colossalai/
    2003  ll
    2004  colossalai
    2005  git clone https://github.com/hpcaitech/ColossalAI.git
    2006  cd ColossalAI
    2007  # install dependency
    2008  pip install -r requirements/requirements.txt
    2009  # install colossalai
    2010  pip install .
    2014  docker build -t colossalai ./docker

    2015 docker run -ti --gpus all --rm --ipc=host colossalai bash

    [root@<container> workspace]# colossalai check -i
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    CUDA Version: 11.3
    PyTorch Version: 1.10.1
    CUDA Version in PyTorch Build: 11.3
    PyTorch CUDA Version Match: ✓
    CUDA Extension: x

    The CUDA extension ^^^ isn't present?

    [root@<container> workspace]# colossalai benchmark --gpus 8
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    === Benchmarking Parameters ===
    gpus: 8
    batch_size: 8
    seq_len: 512
    dimension: 1024
    warmup_steps: 10
    profile_steps: 50
    layers: 2
    model: mlp

    Colossalai should be built with cuda extension to use the FP16 optimizer If you want to activate cuda mode for MoE, please install with cuda_ext!

    === size: 8, mode: None ===
    Average forward time: 0.0004958677291870118
    Average backward time: 0.0010803651809692383
    Max allocated GPU memory: 0.26564550399780273
    Max cached GPU memory: 0.287109375

    === size: 8, mode: 1d ===
    Average forward time: 0.004022541046142578
    Average backward time: 0.0007260799407958985
    Max allocated GPU memory: 0.2382950782775879
    Max cached GPU memory: 0.287109375

    === size: 8, mode: 2.5d, depth: 2 ===
    Average forward time: 0.001216425895690918
    Average backward time: 0.002291984558105469
    Max allocated GPU memory: 0.17383670806884766
    Max cached GPU memory: 0.2734375

    === size: 8, mode: 3d ===
    Average forward time: 0.000978093147277832
    Average backward time: 0.0016768646240234374
    Max allocated GPU memory: 0.05128049850463867
    Max cached GPU memory: 0.185546875

    Colossalai should be built with cuda extension to use the FP16 optimizer

    What does this ^^^ really mean?
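
    The warning means the installed package was compiled without the fused CUDA kernels, so features that need them (e.g. the FP16 fused optimizer and CUDA MoE) are unavailable. Rebuilding from source with the flag shown in the Installation section enables them:

    pip install -v --no-cache-dir --global-option="--cuda_ext" .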

    This is an A100-based system:

    $ nvidia-smi
    Thu May 26 18:43:56 2022
    NVIDIA-SMI 470.103.01    Driver Version: 470.103.01    CUDA Version: 11.4
    GPUs 0-7: NVIDIA A100-SXM, 40536MiB each, 0% utilization, MIG disabled
    No running processes found

    Environment

    This is an A100-based system:

    $ nvidia-smi
    Thu May 26 18:43:56 2022
    NVIDIA-SMI 470.103.01    Driver Version: 470.103.01    CUDA Version: 11.4

  • [BUG]: Memory consumption by fp16 is not normal when using Engine.

    ๐Ÿ› Describe the bug

    When using colossalai.amp.convert_to_torch_amp to wrap the model, optimizer, and criterion,

    if not use_colossai_engine:
        model, optimizer, criterion =  colossalai.amp.convert_to_torch_amp(model, optimizer, criterion)
    

    and then training normally, it consumes only 4700M of memory.

    output, _ = model(img, label)
    train_loss = criterion(output, label)
    optimizer.backward(train_loss)
    optimizer.step()
    optimizer.zero_grad()
    

    But if I initialize with colossalai.initialize, it consumes 7700M of memory. Reading the initialization code, I can see that when the fp16 entry is present in the config, colossalai.initialize performs the very same colossalai.amp.convert_to_torch_amp conversion; yet when I then train through the Engine, it needs 7700M of memory. This is where I get confused.

    engine.zero_grad()
    output, _ = engine(img, label)
    train_loss = engine.criterion(output, label)
    engine.backward(train_loss)
    engine.step()   
    

    Environment

    No response

  • [BUG]: examples/images/diffusion run failed

    ๐Ÿ› Describe the bug

    I ran the diffusion example following https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion. Steps:

    conda env create -f environment.yaml
    conda activate ldm
    pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
    git clone https://github.com/Lightning-AI/lightning && cd lightning && git reset --hard b04a7aa
    pip install -r requirements.txt && pip install .

    dataset: laion-400m

    run: bash train.sh

    failed info:

    /opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:438: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
      rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
    Traceback (most recent call last):
      File "/home/code/ColossalAI/examples/images/diffusion/main.py", line 811
        trainer.fit(model, data)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
        call._call_and_handle_interrupt(
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
        return trainer_fn(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
        self._run(model, ckpt_path=self.ckpt_path)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
        results = self._run_stage()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
        self._run_train()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
        self.fit_loop.run()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
        self._outputs = self.epoch_loop.run(self._data_fetcher)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
        batch_output = self.batch_loop.run(kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
        outputs = self.optimizer_loop.run(optimizers, kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
        self.advance(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
        result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
        self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
        self.trainer._call_lightning_module_hook(
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
        output = fn(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
        optimizer.step(closure=optimizer_closure)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
        step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
        return self.precision_plugin.optimizer_step(
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 74, in optimizer_step
        closure_result = closure()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
        self._result = self.closure(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
        step_output = self._step_fn()
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
        training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1440, in _call_strategy_hook
        output = fn(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 352, in training_step
        return self.model(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 241, in forward
        outputs = self.module(*args, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
        output = self._forward_module.training_step(*inputs, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 411, in training_step
        loss, loss_dict = self.shared_step(batch)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 976, in shared_step
        loss = self(x, c)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 988, in forward
        return self.p_losses(x, c, t, *args, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1122, in p_losses
        model_output = self.apply_model(x_noisy, t, cond)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1094, in apply_model
        x_recon = self.model(x_noisy, t, **cond)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1519, in forward
        out = self.diffusion_model(x, t, context=cc)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/code/ColossalAI/examples/images/diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 927, in forward
        h = th.cat([h, hs.pop()], dim=1)
      File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
        ret = func(*args, **kwargs)
    RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

    Environment

    (Environment details were provided as a screenshot.)

  • add example of self-supervised SimCLR training - V2

    The previous version used Nvidia DALI to create the dataloader. I found that DALI's data augmentations differ from torchvision's; as a result, the desired performance could not be achieved. In this version, the dataloader is implemented with colossalai.nn.data and torchvision. The final linear evaluation accuracy reaches 85.4%.

  • [BUG]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.

    ๐Ÿ› Describe the bug

    After following the ResNet50 example in the tutorial, I got the error in the title. The same thing happened on my last attempt with HF's accelerate; I can't figure out this complex problem on my first usage. Of course I have tried my best to solve it, and a likely cause is this output of colossalai check -i:

    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!
    CUDA Version: N/A (CUDA_HOME is not set)
    PyTorch Version: 1.11.0+cu102
    CUDA Version in PyTorch Build: 10.2
    PyTorch CUDA Version Match: x
    CUDA Extension: x

    but I tried on a machine with CUDA 11.3 and got the same error.

    Below is part of my code:

    logger = get_dist_logger()
    	# args = colossalai.get_default_parser().parse_args()
    	colossalai.launch_from_torch(config='config.py')
    	config = Config()
    	tokenizer = JiebaTokenizer.from_pretrained('Lowin/chinese-bigbird-base-4096')
    	model = BB()
    	optimizer = optim.AdamW(params=model.parameters(),lr=1e-5,weight_decay=1e-2)
    	lossFunc = F.cross_entropy
    	rouge =   load_metric('rouge')
    
    	valida = json.load(open("dataset/dev.json"))
    	trains = json.load(open("dataset/train.json"))
    	dataSetTrain = DS(trains,tokenizer,config)
    	dataSetValid = DS(valida,tokenizer,config)
    	tDL = DataLoader(dataSetTrain,batch_size=config.batch_size_train,shuffle=True)
    	vDL = DataLoader(dataSetValid,batch_size=config.batch_size_valid)
    
    	engine,tDL,vDL,_ = colossalai.initialize(
    		model,
    		optimizer,
    		lossFunc,
    		tDL,
    		vDL
    	)
    
    	for epoch in range(gpc.config.NUM_EPOCH):
    		tDL = tqdm(tDL,leave=False)
    		engine.train()
    		for batch in tDL:
    			labels = batch.pop('labels').cuda()
    			batch = {key:value.cuda() for key,value in batch.items()}
    			logist = engine(batch)
    			loss_sum = engine.criterion(logist.view(-1,config.vocab_size),labels.view(-1))
    			title_length = labels.ne(0).sum().item()
    			loss = loss_sum/title_length
    			engine.backward(loss)
    			engine.step()
    			engine.zero_grad()
    			tDL.set_description(f'Epoch:{epoch}:')
    			tDL.set_postfix(loss=loss.item())
    

    Code of model construction

    class BB(torch.nn.Module):
    	def __init__(self):
    		super(BB,self).__init__()
    		self.transformer = BigBirdModel.from_pretrained('Lowin/chinese-bigbird-base-4096')
    		self.dropout = torch.nn.Dropout(0.2)
    		self.output = torch.nn.Linear(768,39999)
            
    
    	def forward(self,batch):
    		# batch = self._set_token_type_ids_(batch)
    		outputs = self.transformer(**batch).last_hidden_state  #bs token_num outputsize 
    		logits = self.output(self.dropout(outputs))  #bs token_num vocab_size
    		return logits
    

    Here is the error info:

    /home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/transformers/models/big_bird/modeling_big_bird.py:981: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
      torch.arange(indices.shape[0] * indices.shape[1] * num_indices_to_gather, device=indices.device)
    (the same warning is printed on each rank)
    Traceback (most recent call last):
      File "test3_v3.3.py", line 138
        logist = engine(batch)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
        return self.model(*args, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 947, in forward
        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
    Parameter indices which did not receive grad for rank 0: 197 198
    (the same traceback and RuntimeError are raised on rank 1, with "Parameter indices which did not receive grad for rank 1: 197 198")
    In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 44596) of binary: /home/guxj/anaconda3/envs/NLP_colossalai/bin/python
    Traceback (most recent call last):
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
        elastic_launch(
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/guxj/anaconda3/envs/NLP_colossalai/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    test3_v3.3.py FAILED
    ------------------------------------------------------------
    Failures:
    [1]:
      time      : 2022-05-18_01:27:08
      host      : dlp01
      rank      : 1 (local_rank: 1)
      exitcode  : 1 (pid: 44597)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2022-05-18_01:27:08
      host      : dlp01
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 44596)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
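
    One plausible culprit, given "Parameter indices which did not receive grad ...: 197 198": the forward only consumes last_hidden_state, so BigBirdModel's pooler weights never receive gradients under DDP. A hedged workaround, assuming the Hugging Face BigBird implementation accepts add_pooling_layer the way BERT does:

    self.transformer = BigBirdModel.from_pretrained(
        'Lowin/chinese-bigbird-base-4096',
        add_pooling_layer=False,  # drop the unused pooler so every parameter gets a gradient
    )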

    Environment

    CUDA: 10.2, PyTorch: 1.11.0, Python: 3.8.13 (miniconda)

  • [BUG]: CUDA extension build skipped when installing from source

    ๐Ÿ› Describe the bug

    Hi, I used the Install From Source option to install ColossalAI, but I encounter a problem like:

    /path/to/myconda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/autocast_mode.py:162: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
      warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
    Colossalai should be built with cuda extension to use the FP16 optimizer
    If you want to activate cuda mode for MoE, please install with cuda_ext!

    I have installed torch 1.11 + cu11.3 and am using CUDA 11.1. Any suggestions?

    Environment

    PyTorch 1.11 + cu11.3, CUDA 11.1
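
    The autocast warning in the report says CUDA was not visible to PyTorch at run time, which also prevents the CUDA extension from building. A quick sanity check before reinstalling:

    python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"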

  • [workflow] New version: Create workflow files for examples' auto check

    1. Added several .github/workflows YAML files to implement an auto-check function for the example files.
       • When a new PR is submitted, if the example folder has been changed, the corresponding sub-folder runs automatically to check whether it works correctly with the latest colossalai.
       • Every Sunday at 00:00 (Singapore time), all sub-folders in the example folder are run to check whether the latest colossalai APIs still work on them.
       • One can manually select several sub-folders of the example folder to run using the manual_check_example_file input in the example_dispatch.yml file.
    2. Fixed several small bugs in the /example/language/gpt folder, mainly in parameter initialization, requirement settings, and so on.
  • [BUG]: FusedScaleMaskSoftmax last dimension does not sum to 1

    ๐Ÿ› Describe the bug

    I use the following code to test the softmax, but the result does not sum to one:

    from colossalai import kernel
    import math
    import torch

    attention_head_size = 32

    softmax = kernel.FusedScaleMaskSoftmax(input_in_fp16=True,
                                           input_in_bf16=False,
                                           attn_mask_type=None,
                                           scaled_masked_softmax_fusion=True,
                                           mask_func=lambda x, mask: x.masked_fill(mask, -50000),
                                           softmax_in_fp32=True,
                                           scale=1 / math.sqrt(attention_head_size))

    length = 200
    b = 1
    h = 4
    hidden_states = torch.randn(b, h, length, length).half()
    mask = torch.rand(1, 1, length, length) > 0.5

    print(softmax.is_kernel_available(mask, b, h, length, length))
    output = softmax(hidden_states, mask)
    print(output[0, 0, 0].sum())  # prints something like tensor(1.1623e-05, dtype=torch.float16), not 1
    

    However, if I purposely change the head count so that the fused kernel is not used, the result does sum to one.
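
    A minimal reference computation to pin down whether the fused kernel or the inputs are at fault (assuming the same convention that True mask positions are filled with -50000):

    # plain PyTorch softmax with identical scale, mask and fp32 accumulation
    ref = torch.softmax((hidden_states.float() * (1 / math.sqrt(attention_head_size))).masked_fill(mask, -50000), dim=-1)
    print(ref[0, 0, 0].sum())  # a correct softmax row sums to ~1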

    Environment

    Colossal-AI version: 0.1.13

    PyTorch Version: 1.12.0
    PyTorch Version required by Colossal-AI: 1.12
    PyTorch version match: ✓

    System CUDA Version: 11.2
    CUDA Version required by PyTorch: 11.3
    CUDA Version required by Colossal-AI: 11.3
    CUDA Version Match: x

    CUDA Extension: ✓

  • [BUG] [DIFFUSION]: Sampling Fails to produce output

    [BUG] [DIFFUSION]: Sampling Fails to produce output

    ๐Ÿ› Describe the bug

    After getting training working in #2204, the loss value even went down during training, but after running the sampling script with the default parameters the output was only random noise.

    [Screenshot 2023-01-02 143009: the sampled image is pure noise]

    The only minor issue that even comes up is that I don't have a validation set, so using the default metrics throws a warning about the metrics not being passed. I did, though, train a new model with new metrics that threw no error and got the same result. If there are things you want me to test, I can do that.

    I don't know if you'd prefer I open another issue or just tack it on here, but is using Triton fully supported? I got a proper Triton install to stop a warning from being thrown, but I was wondering if I should have ignored it.

    Also, I couldn't get any models to work properly when trying to resume training with the -r call. Is there a specific model that is compatible with this new version? The error it throws is:

    RuntimeError: Error(s) in loading state_dict for GeminiDDP:
    Missing keys in state dict: _forward_modulelvlb_weight, _forward_module.cond_state_model.attn_mask
    Unexpected Keys in state dict: _forward_module.model_ema.decay _foward_model.model_ema.num_updates
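
    One possible workaround for the resume failure (a minimal sketch and only an assumption, not the repository's documented fix; the function and its arguments are hypothetical) is to drop the EMA bookkeeping keys and load the rest non-strictly:

    import torch
    import torch.nn as nn

    def load_resume_checkpoint(model: nn.Module, ckpt_path: str) -> nn.Module:
        ckpt = torch.load(ckpt_path, map_location="cpu")
        state_dict = ckpt.get("state_dict", ckpt)
        # Drop EMA entries such as model_ema.decay / model_ema.num_updates.
        state_dict = {k: v for k, v in state_dict.items() if ".model_ema." not in k}
        # strict=False tolerates the remaining missing/unexpected keys.
        missing, unexpected = model.load_state_dict(state_dict, strict=False)
        print("missing:", missing)
        print("unexpected:", unexpected)
        return model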

    Environment

    Using the Conda Environment as given in the repository. Cuda supported up to 11.8 using Ubuntu 20.04. Nvidia Driver 525 Proprietary.

    PIP freeze: absl-py==1.3.0 accelerate==0.15.0 aiohttp==3.8.3 aiosignal==1.3.1 albumentations==1.3.0 altair==4.2.0 antlr4-python3-runtime==4.8 async-timeout==4.0.2 attrs==22.2.0 bcrypt==4.0.1 blinker==1.5 braceexpand==0.1.7 brotlipy==0.7.0 cachetools==5.2.0 certifi @ file:///croot/certifi_1671487769961/work/certifi cffi @ file:///tmp/abs_98z5h56wf8/croots/recipe/cffi_1659598650955/work cfgv==3.3.1 charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work click==8.1.3 coloredlogs==15.0.1 colossalai==0.1.12+torch1.12cu11.3 commonmark==0.9.1 contexttimer==0.3.3 cryptography @ file:///croot/cryptography_1665612644927/work datasets==2.8.0 decorator==5.1.1 diffusers==0.11.1 dill==0.3.6 distlib==0.3.6 einops==0.3.0 entrypoints==0.4 fabric==2.7.1 filelock==3.8.2 flatbuffers==22.12.6 flit-core @ file:///opt/conda/conda-bld/flit-core_1644941570762/work/source/flit_core frozenlist==1.3.3 fsspec==2022.11.0 ftfy==6.1.1 future==0.18.2 gitdb==4.0.10 GitPython==3.1.29 google-auth==2.15.0 google-auth-oauthlib==0.4.6 grpcio==1.51.1 huggingface-hub==0.11.1 humanfriendly==10.0 identify==2.5.11 idna @ file:///croot/idna_1666125576474/work imageio==2.9.0 imageio-ffmpeg==0.4.2 importlib-metadata==5.2.0 invisible-watermark==0.1.5 invoke==1.7.3 Jinja2==3.1.2 joblib==1.2.0 jsonschema==4.17.3 kornia==0.6.0 latent-diffusion @ file:///media/thomas/108E73348E731208/Users/Thoma/Desktop/dndiffusion/ColossalAI/examples/images/diffusion lightning-utilities==0.5.0 Markdown==3.4.1 MarkupSafe==2.1.1 mkl-fft==1.3.1 mkl-random @ file:///tmp/build/80754af9/mkl_random_1626186066731/work mkl-service==2.4.0 modelcards==0.1.6 mpmath==1.2.1 multidict==6.0.4 multiprocess==0.70.14 networkx==2.8.8 nodeenv==1.7.0 numpy @ file:///tmp/abs_653_j00fmm/croots/recipe/numpy_and_numpy_base_1659432701727/work oauthlib==3.2.2 omegaconf==2.1.1 onnx==1.13.0 onnxruntime==1.13.1 open-clip-torch==2.0.2 opencv-python==4.6.0.66 opencv-python-headless==4.6.0.66 packaging==22.0 pandas==1.5.2 paramiko==2.12.0 pathlib2==2.3.7.post1 Pillow==9.3.0 platformdirs==2.6.0 pre-commit==2.21.0 prefetch-generator==1.0.3 protobuf==3.20.1 psutil==5.9.4 pyarrow==10.0.1 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work pydeck==0.8.0 pyDeprecate==0.3.2 Pygments==2.13.0 Pympler==1.0.1 PyNaCl==1.5.0 pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work pyrsistent==0.19.2 PySocks @ file:///tmp/build/80754af9/pysocks_1605305812635/work python-dateutil==2.8.2 pytorch-lightning @ file:///media/thomas/108E73348E731208/Users/Thoma/Desktop/dndiffusion/ColossalAI/examples/images/diffusion/lightning pytz==2022.7 pytz-deprecation-shim==0.1.0.post0 PyWavelets==1.4.1 PyYAML==6.0 qudida==0.0.4 regex==2022.10.31 requests @ file:///opt/conda/conda-bld/requests_1657734628632/work requests-oauthlib==1.3.1 responses==0.18.0 rich==12.6.0 rsa==4.9 scikit-image==0.19.3 scikit-learn==1.2.0 scipy==1.9.3 semver==2.13.0 six @ file:///tmp/build/80754af9/six_1644875935023/work smmap==5.0.0 streamlit==1.12.1 streamlit-drawable-canvas==0.8.0 sympy==1.11.1 tensorboard==2.11.0 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorboardX==2.5.1 test-tube==0.7.5 threadpoolctl==3.1.0 tifffile==2022.10.10 tokenizers==0.12.1 toml==0.10.2 toolz==0.12.0 torch==1.12.1 torchmetrics==0.7.0 torchvision==0.13.1 tornado==6.2 tqdm==4.64.1 transformers==4.25.1 triton==1.1.1 typing-extensions @ file:///croot/typing_extensions_1669924550328/work tzdata==2022.7 tzlocal==4.2 urllib3 @ 
file:///croot/urllib3_1670526988650/work validators==0.20.0 virtualenv==20.17.1 watchdog==2.2.0 wcwidth==0.2.5 webdataset==0.2.5 Werkzeug==2.2.2 xformers==0.0.15.dev395+git.7e05e2c xxhash==3.1.0 yarl==1.8.2 zipp==3.11.0

ManiSkill-Learn is a framework for training agents on SAPIEN Open-Source Manipulation Skill Challenge (ManiSkill Challenge), a large-scale learning-from-demonstrations benchmark for object manipulation.

ManiSkill-Learn ManiSkill-Learn is a framework for training agents on SAPIEN Open-Source Manipulation Skill Challenge, a large-scale learning-from-demonstrations benchmark for object manipulation.

Dec 30, 2022
DeepGNN is a framework for training machine learning models on large scale graph data.

DeepGNN Overview DeepGNN is a framework for training machine learning models on large scale graph data. DeepGNN contains all the necessary features in

Jan 1, 2023
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

Easy Parallel Library (EPL) is a general and efficient library for distributed model training.

Dec 21, 2022
Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Knover Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out eff

Dec 31, 2022
SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

The SLIDE package contains the source code for reproducing the main experiments in this paper. Dataset The Datasets can be downloaded in Amazon-

Dec 16, 2022
DeepLM: Large-scale Nonlinear Least Squares on Deep Learning Frameworks using Stochastic Domain Decomposition (CVPR 2021)

DeepLM DeepLM: Large-scale Nonlinear Least Squares on Deep Learning Frameworks using Stochastic Domain Decomposition (CVPR 2021) Run Please install th

Dec 2, 2022
Open-AI's DALL-E for large scale training in mesh-tensorflow.

DALL-E in Mesh-Tensorflow [WIP] Open-AI's DALL-E in Mesh-Tensorflow. If this is similarly efficient to GPT-Neo, this repo should be able to train mode

Dec 16, 2022
An Efficient Training Approach for Very Large Scale Face Recognition or F²C for simplicity.

Fast Face Classification (F²C) This is the code of our paper An Efficient Training Approach for Very Large Scale Face Recognition or F²C for simplicit

Jun 27, 2021
A large-scale video dataset for the training and evaluation of 3D human pose estimation models

ASPset-510 (Australian Sports Pose Dataset) is a large-scale video dataset for the training and evaluation of 3D human pose estimation models. It contains 17 different amateur subjects performing 30 sports-related actions each, for a total of 510 action clips.

Oct 30, 2022
Official repository for the paper, MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding.

MidiBERT-Piano Authors: Yi-Hui (Sophia) Chou, I-Chun (Bronwin) Chen Introduction This is the official repository for the paper, MidiBERT-Piano: Large-

Dec 15, 2022
Galileo library for large scale graph training by JD

่ฟ‘ๅนดๆฅ๏ผŒๅ›พ่ฎก็ฎ—ๅœจๆœ็ดขใ€ๆŽจ่ๅ’Œ้ฃŽๆŽง็ญ‰ๅœบๆ™ฏไธญ่Žทๅพ—ๆ˜พ่‘—็š„ๆ•ˆๆžœ๏ผŒไฝ†ไนŸ้ขไธด่ถ…ๅคง่ง„ๆจกๅผ‚ๆž„ๅ›พ่ฎญ็ปƒ๏ผŒไธŽ็Žฐๆœ‰็š„ๆทฑๅบฆๅญฆไน ๆก†ๆžถTensorflowๅ’ŒPyTorch็ป“ๅˆ็ญ‰้šพ้ข˜ใ€‚ Galileo๏ผˆไผฝๅˆฉ็•ฅ๏ผ‰ๆ˜ฏไธ€ไธชๅ›พๆทฑๅบฆๅญฆไน ๆก†ๆžถ๏ผŒๅ…ทๅค‡่ถ…ๅคง่ง„ๆจกใ€ๆ˜“ไฝฟ็”จใ€ๆ˜“ๆ‰ฉๅฑ•ใ€้ซ˜ๆ€ง่ƒฝใ€ๅŒๅŽ็ซฏ็ญ‰ไผ˜็‚น๏ผŒๆ—จๅœจ่งฃๅ†ณ่ถ…ๅคง่ง„ๆจกๅ›พ็ฎ—ๆณ•ๅœจๅทฅไธš็บงๅœบๆ™ฏ็š„่ฝๅœฐ้šพ้ข˜๏ผŒๆ

Nov 29, 2022
UniLM AI - Large-scale Self-supervised Pre-training across Tasks, Languages, and Modalities

Pre-trained (foundation) models across tasks (understanding, generation and translation), languages (100+ languages), and modalities (language, image, audio, vision + language, audio + language, etc.)

Jan 1, 2023
Large-Scale Pre-training for Person Re-identification with Noisy Labels (LUPerson-NL)

LUPerson-NL Large-Scale Pre-training for Person Re-identification with Noisy Labels (LUPerson-NL) The repository is for our CVPR2022 paper Large-Scale

Dec 26, 2022
BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training

BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training By Likun Cai, Zhi Zhang, Yi Zhu, Li Zhang, Mu Li, Xiangyang Xue. This

Dec 29, 2022
Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

UniSpeech The family of UniSpeech: UniSpeech (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR UniSpeech-

Jan 9, 2023
PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition, CVPR 2018

PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place

Dec 12, 2022
Official implementation of the CVPR 2021 paper "Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting"

RGBT Crowd Counting Lingbo Liu, Jiaqi Chen, Hefeng Wu, Guanbin Li, Chenglong Li, Liang Lin. "Cross-Modal Collaborative Representation Learning and a L

Dec 8, 2022