iBOT: Image BERT Pre-Training with Online Tokenizer

Official PyTorch implementation and pre-trained models for the paper iBOT: Image BERT Pre-Training with Online Tokenizer.

[arXiv] [BibTeX]

iBOT framework

iBOT is a novel self-supervised pre-training framework that performs masked image modeling with self-distillation. The iBOT pre-trained model exhibits local semantic features, which help it transfer well to downstream tasks at both global and local scales. For example, iBOT achieves strong performance on COCO object detection (51.4 box AP and 44.2 mask AP) and ADE20K semantic segmentation (50.0 mIoU) with a vanilla ViT-B/16. iBOT can also extract semantically meaningful local parts, such as a dog's ear 🐶 .
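
For intuition, the pre-training objective can be sketched as follows. This is a heavily simplified, conceptual sketch, not the repository's implementation: the names, temperatures, and centering values are illustrative, and the actual training symmetrizes the loss over augmented views and updates the centers with an EMA.

# Conceptual sketch of the iBOT objective; not the repository's API.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sharpen(logits, temp, center):
    # Teacher targets: centered, temperature-sharpened softmax, no gradient (self-distillation).
    return F.softmax((logits - center) / temp, dim=-1)

def ibot_losses(s_cls, s_patch, t_cls, t_patch, mask,
                student_temp=0.1, teacher_temp=0.07, cls_center=0.0, patch_center=0.0):
    # s_cls / t_cls:     [B, K]    projected [CLS] tokens (student on a masked view,
    #                              teacher on another, unmasked view of the same image)
    # s_patch / t_patch: [B, N, K] projected patch tokens of the same view
    #                              (student masked, teacher unmasked -> online tokenizer)
    # mask:              [B, N]    1.0 where a patch is masked for the student
    # [CLS] self-distillation across views (DINO-style).
    cls_targets = sharpen(t_cls, teacher_temp, cls_center)
    cls_loss = torch.sum(-cls_targets * F.log_softmax(s_cls / student_temp, dim=-1), dim=-1).mean()
    # Masked image modeling: match the teacher's patch distributions at masked positions only.
    patch_targets = sharpen(t_patch, teacher_temp, patch_center)
    patch_ce = torch.sum(-patch_targets * F.log_softmax(s_patch / student_temp, dim=-1), dim=-1)
    mim_loss = (patch_ce * mask).sum() / mask.sum().clamp(min=1)
    return cls_loss + mim_loss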

Update 🎉

  • December 2021 - Release the code and pre-trained models.
  • November 2021 - Release the pre-print on arXiv.

Installation

See installation instructions for details.

Training

For a glimpse at the full documentation of iBOT pre-training, please run:

python main_ibot.py --help

iBOT Pre-Training with ViTs

To start iBOT pre-training with a Vision Transformer (ViT), simply run the following command. JOB_NAME is a user-defined argument that distinguishes different experiments; checkpoints are automatically saved into separate folders per job.

./run.sh imagenet_pretrain $JOB_NAME vit_{small,base,large} teacher {16,24,64}
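
In this template, the trailing number ({16,24,64} above) corresponds to the total number of GPUs allotted to the job; for instance, 16 matches the two 8-GPU nodes used in the ViT-S/16 recipe below, and 40 matches the five 8-GPU nodes used for Swin-T/14.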

The exact arguments used to reproduce the models presented in our paper can be found via the args links in the pre-trained models tables below. We also provide the pre-training logs to help with reproducibility.

For example, the following command runs iBOT with the ViT-S/16 network on two nodes with 8 GPUs each for 800 epochs. The resulting checkpoint should reach 75.2% k-NN accuracy, 77.9% linear probing accuracy, and 82.3% fine-tuning accuracy.

./run.sh imagenet_pretrain $JOB_NAME vit_small teacher 16 \
  --teacher_temp 0.07 \
  --warmup_teacher_temp_epochs 30 \
  --norm_last_layer false \
  --epochs 800 \
  --batch_size_per_gpu 64 \
  --shared_head true \
  --out_dim 8192 \
  --local_crops_number 10 \
  --global_crops_scale 0.25 1 \
  --local_crops_scale 0.05 0.25 \
  --pred_ratio 0 0.3 \
  --pred_ratio_var 0 0.2
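
With this recipe, the effective batch size is 2 nodes × 8 GPUs × 64 images per GPU = 1024.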

iBOT Pre-Training with Swins

This code also works for training iBOT on Swin Transformer (Swin). In the paper, we only conduct experiments on Swin-T with different window sizes:

./run.sh imagenet_pretrain $JOB_NAME swin_tiny teacher {16,40} \
  --patch_size 4 \
  --window_size {7,14}

For example, the following command runs iBOT with the Swin-T/14 network on five nodes with 8 GPUs each for 300 epochs. The resulting checkpoint should reach 76.2% k-NN accuracy and 79.3% linear probing accuracy.

./run.sh imagenet_pretrain $JOB_NAME swin_tiny teacher 40 \
  --teacher_temp 0.07 \
  --warmup_teacher_temp_epochs 30 \
  --norm_last_layer false \
  --epochs 300 \
  --batch_size_per_gpu 26 \
  --shared_head true \
  --out_dim 8192 \
  --local_crops_number 10 \
  --global_crops_scale 0.25 1 \
  --local_crops_scale 0.05 0.25 \
  --pred_ratio 0 0.3 \
  --pred_ratio_var 0 0.2 \
  --pred_start_epoch 50 \
  --patch_size 4 \
  --window_size 14 
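
With this recipe, the effective batch size is 5 nodes × 8 GPUs × 26 images per GPU = 1040.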

Pre-Trained Models

You can choose to download either just the weights of the pre-trained backbone used for downstream tasks, or the full ckpt, which contains backbone and projection-head weights for both the student and teacher networks. For the backbone, s denotes that the student network is selected, while t denotes that the teacher network is selected.

| Arch. | Params | k-NN | Linear | Fine-tune | Download |
| --- | --- | --- | --- | --- | --- |
| ViT-S/16 | 21M | 74.5% | 77.0% | 82.3% | backbone (t), full ckpt, args, logs |
| Swin-T/7 | 28M | 75.3% | 78.6% | \ | backbone (t), full ckpt, args, logs |
| Swin-T/14 | 28M | 76.2% | 79.3% | \ | backbone (t), full ckpt, args, logs |
| ViT-B/16 | 85M | 77.1% | 79.5% | 83.8% | backbone (t), full ckpt, args, logs |

We also provide ViT-{B,L}/16 models pre-trained on the ImageNet-22K dataset.

| Arch. | Params | k-NN | Linear | Fine-tune | Download |
| --- | --- | --- | --- | --- | --- |
| ViT-B/16 | 85M | 71.1% | 79.0% | 84.4% | backbone (s), full ckpt, args, logs |
| ViT-L/16 | 307M | 70.6% | 81.7% | 86.3% | backbone (s), full ckpt, args, logs |
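
To sanity-check a downloaded backbone (for example, for quick feature extraction), a minimal sketch along the following lines should work. The timm model name, the file name, and the key handling are assumptions and may need adjusting to the actual checkpoint layout.

# Hypothetical loading sketch; the file name and key remapping are assumptions.
import torch
import timm

state_dict = torch.load("ibot_vits16_backbone.pth", map_location="cpu")
if "state_dict" in state_dict:                     # some checkpoints nest the weights
    state_dict = state_dict["state_dict"]
state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}

model = timm.create_model("vit_small_patch16_224", num_classes=0)  # headless ViT-S/16
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing, "unexpected:", unexpected)

with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))     # [1, 384] global feature
print(feats.shape)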

To extract the backbone from the full checkpoint yourself, please run the following command, where KEY is either student or teacher.

WEIGHT_FILE=$OUTPUT_DIR/checkpoint_$KEY.pth

python extract_backbone_weights.py \
  --checkpoint_key $KEY \
  $PRETRAINED \
  $WEIGHT_FILE
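
For reference, the extraction essentially boils down to the following simplified sketch. The checkpoint keys and the backbone. prefix are assumptions about the checkpoint layout and may not match the script exactly.

# Simplified sketch of what the extraction does; key names are assumptions.
import torch

ckpt = torch.load("checkpoint.pth", map_location="cpu")
state_dict = ckpt["teacher"]                        # or ckpt["student"]
backbone = {
    k.replace("module.", "").replace("backbone.", ""): v
    for k, v in state_dict.items()
    if "head" not in k                              # drop projection-head weights
}
torch.save(backbone, "backbone_teacher.pth")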

Downstream Evaluation

See Evaluating iBOT on Downstream Tasks for details.

Property Analysis

See Analyzing iBOT's Properties for details on robustness tests and on visualizing self-attention maps:

iBOT Global Pattern Layout

or on extracting sparse correspondence pairs between two images:

iBOT Global Pattern Layout

Extracting Semantic Patterns

We extract the top-k local classes based on patch tokens, together with their corresponding patches and contexts, by running the following command. We identify very diverse behaviors, such as shared low-level textures and high-level semantics.

python3 -m torch.distributed.launch --nproc_per_node=8 \
    --master_port=${MASTER_PORT:-29500} \
    analysis/extract_pattern/extract_topk_cluster.py \
    --pretrained_path $PRETRAINED \
    --checkpoint {student,teacher} \
    --type patch \
    --topk 36 \
    --patch_window 5 \
    --show_pics 20 \
    --arch vit_small \
    --save_path memory_bank_patch.pth \
    --data_path data/imagenet/val
iBOT Local Part-Level Pattern Layout
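
For intuition only, the idea behind this pattern mining can be sketched as follows (this is not the analysis script itself, and the patch features below are placeholders standing in for features extracted with the pre-trained backbone): cluster the patch-token features into a fixed number of "local classes" and keep the patches closest to each centroid.

# Toy sketch of patch-level pattern mining; the features below are placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
patch_feats = rng.standard_normal((10_000, 384)).astype("float32")  # [num_patches, dim]

kmeans = KMeans(n_clusters=36, n_init=10).fit(patch_feats)  # 36 "local classes" (cf. --topk 36)
dists = kmeans.transform(patch_feats)                       # distance of each patch to each centroid
closest = np.argsort(dists, axis=0)[:20]                    # 20 nearest patches per class (cf. --show_pics 20)
print(closest.shape)                                        # (20, 36)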

The script also supports extracting the pattern layout on the [CLS] token, which effectively performs clustering or unsupervised classification. This property is not induced by the MIM objective, since we also observe it with DINO.

python3 -m torch.distributed.launch --nproc_per_node=8 \
    --master_port=${MASTER_PORT:-29500} \
    analysis/extract_pattern/extract_topk_cluster.py \
    --pretrained_path $PRETRAINED \
    --checkpoint {student,teacher} \
    --type cls \
    --topk 36 \
    --show_pics 20 \
    --arch vit_small \
    --save_path memory_bank_cls.pth \
    --data_path data/imagenet/val
iBOT Global Pattern Layout

Acknowledgement

This repository is built using the DINO repository and the BEiT repository.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Citing iBOT

If you find this repository useful, please consider giving it a star and a citation:

@article{zhou2021ibot,
  title={iBOT: Image BERT Pre-Training with Online Tokenizer},
  author={Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
  journal={arXiv preprint arXiv:2111.07832},
  year={2021}
}
Comments
  • Reproducing the 100-epoch results

    Hello, first of all thanks for your great work. I want to reproduce the 100-epoch results, i.e., the 71.5 k-NN accuracy in Figure 8. Can you tell me the corresponding args?

  • Semantic Segmentation on ADE20K

    Thank you for your outstanding work. Have you tested semantic segmentation on ADE20K with this code? I have encountered many problems (such as mmcv version issues and model init problems), and I want to confirm that this code can be used to evaluate on ADE20K normally.

  • Loss goes to NaN after several epochs

    Hello,

    First of all, well done and thank you for this great work.

    I am trying to launch iBOT experiments but struggle with the loss going to NaN after a few epochs. The training loss increases (see below), which seems weird. Have you faced similar issues during your experiments, and if so, how did you solve them? I have limited resources to launch my experiments, so I can't play too much with the parameters, and any tip would be greatly appreciated.

    I know that using mixed precision, as is the case here, can lead to stability issues. Indeed, I do not get any NaN loss when setting use_fp16 to False, but I'd rather leave use_fp16 set to True to keep a reasonable training time.

    I tried to increase the eps of AdamW to 1e-6 and that of the batch-norm layers to 6.1e-5 (as proposed here), but this did not work. I see in #17 that you propose decreasing the second beta of AdamW; can you comment a bit more on this, and on any other techniques you know of?
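
    For reference, the kind of change I understand this to mean would be something like the sketch below (the values are illustrative and assume a plain torch.optim.AdamW, not your actual optimizer setup):

    # Illustrative stability tweak: raise eps and lower beta2 of AdamW; values are assumptions.
    import torch
    import torch.nn as nn

    model = nn.Linear(8, 8)            # stand-in for the actual ViT student
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=5e-4,
        betas=(0.9, 0.95),             # beta2 lowered from the default 0.999
        eps=1e-6,                      # eps raised from the default 1e-8
    )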

    Below is how I launched the training (I'm on a Slurm cluster, so I use run_with_submitit.py, which calls main_ibot.py). The dataset size is ~4M images, hence around 3 times larger than the ImageNet-1K training set. This is why I divided warmup_teacher_temp_epochs and warmup_epochs by 3. Not sure this makes sense though.

    python run_with_submitit.py --arch vit_small \
        --ngpus 4 \
        --nodes 4 \
        --num_workers 8 \
        --teacher_temp 0.07 \
        --warmup_teacher_temp_epochs 10 \
        --warmup_epochs 3 \
        --norm_last_layer false \
        --epochs 100 \
        --batch_size_per_gpu 112 \
        --shared_head true \
        --out_dim 8192 \
        --local_crops_number 10 \
        --global_crops_scale 0.25 1 \
        --local_crops_scale 0.05 0.25 \
        --pred_ratio 0 0.3 \
        --pred_ratio_var 0 0.2 \
        --timeout 1200 \
        --partition gpu_xxx \
        --data_path "xxx" \
        --saveckp_freq 1
    

    Below, you can find the recorded training metrics:

    {"train_loss": 6.99587931743066, "train_cls": 4.331486056964688, "train_patch": 2.6643932607626066, "train_lr": 0.0005831685964416835, "train_wd": 0.040029588545113536, "train_acc": 0.6155056208539762, "train_nmi": 0.1298466117645908, "train_ari": 0.006587329895082833, "train_fscore": 0.043441455577130375, "train_adjacc": -1, "epoch": 0}
    {"train_loss": 10.466352438147872, "train_cls": 6.67679833335828, "train_patch": 3.789554104414066, "train_lr": 0.0017500000000000005, "train_wd": 0.040207159993842605, "train_acc": 0.408415272718953, "train_nmi": 0.14113477636477684, "train_ari": 0.0071149456743873594, "train_fscore": 0.030568697062104223, "train_adjacc": -1, "epoch": 1}
    {"train_loss": 12.008997473409751, "train_cls": 8.066221890750954, "train_patch": 3.9427755813775858, "train_lr": 0.002916831403558317, "train_wd": 0.0405621652690049, "train_acc": 0.3335961760343232, "train_nmi": 0.13128954990810676, "train_ari": 0.004251988918558582, "train_fscore": 0.03003438171217746, "train_adjacc": -1, "epoch": 2}
    

    Below, you can find more information about the config used for the experiment:

    act_in_head: gelu
    arch: vit_small
    batch_size_per_gpu: 112
    clip_grad: 3.0
    comment:
    constraint: ""
    data_path: ""
    dist_url: env://
    drop_path: 0.1
    epochs: 100
    freeze_last_layer: 1
    global_crops_number: 2
    global_crops_scale: [0.25, 1.0]
    gpu: 0
    lambda1: 1.0
    lambda2: 1.0
    local_crops_number: 10
    local_crops_scale: [0.05, 0.25]
    local_rank: 0
    lr: 0.0005
    min_lr: 1e-06
    momentum_teacher: 0.996
    ngpus: 4
    nodes: 4
    norm_in_head: None
    norm_last_layer: False
    num_workers: 8
    optimizer: adamw
    out_dim: 8192
    output_dir: ""
    partition: gpu_xxx
    patch_out_dim: 8192
    patch_size: 16
    pred_ratio: [0.0, 0.3]
    pred_ratio_var: [0.0, 0.2]
    pred_shape: block
    pred_start_epoch: 0
    qos: qos_xxx
    rank: 0
    saveckp_freq: 1
    seed: 0
    shared_head: True
    shared_head_teacher: True
    teacher_patch_temp: 0.07
    teacher_temp: 0.07
    timeout: 1200
    use_fp16: True
    use_masked_im_modeling: True
    warmup_epochs: 3
    warmup_teacher_patch_temp: 0.04
    warmup_teacher_temp: 0.04
    warmup_teacher_temp_epochs: 10
    weight_decay: 0.04
    weight_decay_end: 0.4
    window_size: 7
    world_size: 16
    

    Best regards

  • checkpoint not saved by master

    As your code describes (https://github.com/bytedance/ibot/blob/3302b63fc7e287afc68601cb1dc2f0c311af8e3b/main_ibot.py#L358), in DDP training every process (GPU) saves a checkpoint to disk. This behavior may cause a duplicate-writing problem, and the saved checkpoint may then fail to load with torch.load.
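
    A minimal sketch of the usual fix (assuming a standard torch.distributed setup) would be to guard the save so that only the main process writes the file:

    # Sketch only: let rank 0 alone write the checkpoint in DDP training.
    import torch
    import torch.distributed as dist

    def save_on_master(state, path):
        rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
        if rank == 0:
            torch.save(state, path)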

  • Question about ImageNet 1% logistic regression

    Hello, I use eval_logistic_regression.py (lambd=0.1) to evaluate the provided ViT-S model but can only get 58.0 val top-1 accuracy with 1% data, versus 65.9 in your paper. Thanks for your help!

    Start the logistic regression.
    Matrix X, n=12811, p=384
    Switching to regular solver, problem is well conditioned
    Catalyst Accelerator
    MISO Solver
    Incremental Solver with uniform sampling
    Lipschitz constant: 0.25
    Multiclass logistic Loss is used
    L2 regularization
    Epoch: 10, primal objective: 6.90561, time: 211.912
    Best relative duality gap: 0.000430669
    Time elapsed : 212.632
    Logistic regression result: Acc: 0.58044

  • About license

    Thanks for the great work. As you said, this repository is released under the Apache 2.0 license. I want to know whether that means the pre-trained models are also released under the Apache 2.0 license. Thanks!

  • Linear segmentation evaluation on ADE20k

    Hi,

    I am trying to reproduce the linear segmentation results obtained with the ViT-B iBOT pre-trained model, which reaches 38.3 mIoU according to the paper.

    With this model, and the config file provided in:

    ibot/evaluation/semantic_segmentation/configs/linear/vit_base_512_ade20k_160k.py
    

    I only reach ~18 mIoU on ADE20K. I saw that the command in the README changes the learning rate and normalizes the output, so I tried:

    model.backbone.out_with_norm=true  optimizer.lr=8e-4
    

    and I got ~20 mIoU.

    The only difference is that I am not using apex and the custom distributed optimizer, so I basically comment out:

    runner = dict(type='IterBasedRunnerAmp')
    fp16 = None
    optimizer_config = dict(
        type="DistOptimizerHook",
        update_interval=1,
        grad_clip=None,
        coalesce=True,
        bucket_size_mb=-1,
        use_fp16=True,
    )
    

    in the config file.

    I run my experiment on a single node with 8 GPUs. I was wondering if the performance gap could come from the fact that I am not using DistOptimizerHook and apex, or if there is something else I am missing.

    Thanks for your help.

  • 100 or 300 epoch training

    Hello, have you trained iBOT with a shorter schedule, e.g. 100 or 300 epochs? Can you share the corresponding hyper-parameters? Following DINO, I set:

    python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 \
        main_ibot.py \
        --arch vit_small \
        --output_dir ibot_100epoch \
        --data_path imagenet/train \
        --batch_size_per_gpu 64 \
        --local_crops_number 8 \
        --saveckp_freq 10 \
        --shared_head true \
        --epochs 100 \
        --out_dim 8192

    I don't know if this is reasonable.

  • Description of DINO in [preliminaries section] is not accurate

    "The parameters of the student network θ are Exponentially Moving Averaged (EMA) to the parameters of teacher network θ'",

    should be the other way around.

  • Semantic Segmentation Error on ADE20K

    Thank you for your outstanding work. When I try to train ViT-S/16 with UperNet as the task layer, I get the error: KeyError: "EncoderDecoder: 'VisionTransformer is not in the backbone registry'". I found the issue "Semantic Segmentation on ADE20K":

    Solution --> Starting a new terminal window after the installation resolved the issue. This issue could also appear due to a GPU/CUDA version mismatch.

    But it didn't work. I also checked the documentation of mmsegmentation v.12.0; the VisionTransformer backbone is not yet supported there. Hope you can provide some help.

  • large batch size training

    When training DINO with a total batch size of 64*8*8 (8 nodes) on a large dataset (40M images), the model collapses after a few epochs (same issue as in the link). iBOT trains on ImageNet-22K with a batch size of 51*8*5, and the training process is very stable. Have you trained with a larger batch size (for example 51*8*8)?

  • Linear semantic segmentation with ViT-L models

    Hi,

    I was wondering if you had evaluated the ViT-L pre-trained on ImageNet-1K and the ViT-L pre-trained on ImageNet-22K on the linear semantic segmentation benchmark on ADE20K, similar to column 3 of the right table in Table 6 of the paper? If so, can you share the results and the corresponding log files?

    Thanks!

  • Debugging the use of torch.utils.checkpoint.checkpoint

    When I use torch.utils.checkpoint.checkpoint as follows and train the model with apex, I find that the loss is as small as 0.4, whereas the normal loss is around 2.x.

    Do you have any idea about this issue?

            for blk in self.blocks:
                # x = blk(x)
                x = torch.utils.checkpoint.checkpoint(blk, x)
    