GLIDE

This is the official codebase for running the small, filtered-data GLIDE model from GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.

For details on the pre-trained models in this repository, see the Model Card.

Usage

To install this package, clone this repository and then run:

pip install -e .

For detailed usage examples, see the notebooks directory.

  • The text2im notebook shows how to use GLIDE (filtered) with classifier-free guidance to produce images conditioned on text prompts.
  • The inpaint notebook shows how to use GLIDE (filtered) to fill in a masked region of an image, conditioned on a text prompt.
  • The clip_guided notebook shows how to use GLIDE (filtered) + a filtered noise-aware CLIP model to produce images conditioned on text prompts.
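The sample code quoted further down this page combines a text-conditioned and an unconditional noise prediction as uncond_eps + guidance_scale * (cond_eps - uncond_eps). A minimal sketch of that arithmetic on plain Python lists follows; the function name guide_eps and the list-based signature are hypothetical and not part of the package.

```python
from typing import List


def guide_eps(cond_eps: List[float], uncond_eps: List[float], guidance_scale: float) -> List[float]:
    """Combine conditional and unconditional noise predictions.

    Mirrors the expression uncond + scale * (cond - uncond). The name and the
    list-based signature are hypothetical, chosen only for illustration.
    """
    if len(cond_eps) != len(uncond_eps):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(cond_eps, uncond_eps)]


# A scale of 1.0 returns the conditional prediction unchanged;
# larger scales move further away from the unconditional prediction.
print(guide_eps([0.5, -0.25], [0.25, 0.0], 1.0))
print(guide_eps([0.5, -0.25], [0.25, 0.0], 3.0))
```

In the real sampler the same expression is applied to tensors holding both halves of a doubled batch; the lists here only stand in for those tensors.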
Comments
  • While running the clip_guided notebook in CPU mode I get: "RuntimeError - Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.FloatTensor instead"

    When I run the clip_guided notebook in CPU mode, I get the following error at the "Sample from the base model" cell:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    ~\AppData\Local\Temp/ipykernel_9272/4093479580.py in <module>
         20 # Sample from the base model.
         21 model.del_cache()
    ---> 22 samples = diffusion.p_sample_loop(
         23     model,
         24     (batch_size, 3, options["image_size"], options["image_size"]),
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in p_sample_loop(self, model, shape, noise, clip_denoised, denoised_fn, cond_fn, model_kwargs, device, progress)
        387         """
        388         final = None
    --> 389         for sample in self.p_sample_loop_progressive(
        390             model,
        391             shape,
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in p_sample_loop_progressive(self, model, shape, noise, clip_denoised, denoised_fn, cond_fn, model_kwargs, device, progress)
        439             t = th.tensor([i] * shape[0], device=device)
        440             with th.no_grad():
    --> 441                 out = self.p_sample(
        442                     model,
        443                     img,
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in p_sample(self, model, x, t, clip_denoised, denoised_fn, cond_fn, model_kwargs)
        351         )  # no noise when t == 0
        352         if cond_fn is not None:
    --> 353             out["mean"] = self.condition_mean(cond_fn, out, x, t, model_kwargs=model_kwargs)
        354         sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
        355         return {"sample": sample, "pred_xstart": out["pred_xstart"]}
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\respace.py in condition_mean(self, cond_fn, *args, **kwargs)
         95 
         96     def condition_mean(self, cond_fn, *args, **kwargs):
    ---> 97         return super().condition_mean(self._wrap_model(cond_fn), *args, **kwargs)
         98 
         99     def condition_score(self, cond_fn, *args, **kwargs):
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs)
        287         This uses the conditioning strategy from Sohl-Dickstein et al. (2015).
        288         """
    --> 289         gradient = cond_fn(x, t, **model_kwargs)
        290         new_mean = p_mean_var["mean"].float() + p_mean_var["variance"] * gradient.float()
        291         return new_mean
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\respace.py in __call__(self, x, ts, **kwargs)
        122         new_ts_2 = map_tensor[ts.ceil().long()]
        123         new_ts = th.lerp(new_ts_1, new_ts_2, frac)
    --> 124         return self.model(x, new_ts, **kwargs)
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\clip\model_creation.py in cond_fn(x, t, grad_scale, **kwargs)
         57             with torch.enable_grad():
         58                 x_var = x.detach().requires_grad_(True)
    ---> 59                 z_i = self.image_embeddings(x_var, t)
         60                 loss = torch.exp(self.logit_scale) * (z_t * z_i).sum()
         61                 grad = torch.autograd.grad(loss, x_var)[0].detach()
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\clip\model_creation.py in image_embeddings(self, images, t)
         47 
         48     def image_embeddings(self, images: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    ---> 49         z_i = self.image_encoder((images + 1) * 127.5, t)
         50         return z_i / (torch.linalg.norm(z_i, dim=-1, keepdim=True) + 1e-12)
         51 
    
    ~\.conda\envs\glide-text2im\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
        725             result = self._slow_forward(*input, **kwargs)
        726         else:
    --> 727             result = self.forward(*input, **kwargs)
        728         for hook in itertools.chain(
        729                 _global_forward_hooks.values(),
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\clip\encoders.py in forward(self, image, timesteps, return_probe_features)
        483     ) -> torch.Tensor:
        484         n_batch = image.shape[0]
    --> 485         h = self.blocks["input"](image, t=timesteps)
        486 
        487         for i in range(self.n_xf_blocks):
    
    ~\.conda\envs\glide-text2im\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
        725             result = self._slow_forward(*input, **kwargs)
        726         else:
    --> 727             result = self.forward(*input, **kwargs)
        728         for hook in itertools.chain(
        729                 _global_forward_hooks.values(),
    
    c:\users\alf\downloads\glide-text2im\glide_text2im\clip\encoders.py in forward(self, x, t)
        124             self.pred_state[None, None].expand(x.shape[0], -1, -1)
        125             if self.n_timestep == 0
    --> 126             else F.embedding(cast(torch.Tensor, t), self.w_t)[:, None]
        127         )
        128         x = torch.cat((sot, x), dim=1) + self.w_pos[None]
    
    ~\.conda\envs\glide-text2im\lib\site-packages\torch\nn\functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
       1850         # remove once script supports set_grad_enabled
       1851         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
    -> 1852     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
       1853 
       1854 
    
    RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.FloatTensor instead (while checking arguments for embedding)
    

    Can anyone help? Thanks!
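The traceback suggests the timestep passed to the CLIP encoder has already been interpolated (the th.lerp call in respace.py), so it arrives as a float, while an embedding lookup needs an integer index. The sketch below reproduces that mismatch with plain Python lists. The schedule, the table and both function names are hypothetical, and casting before the lookup is only one possible workaround, not a fix confirmed in this thread.

```python
import math
from typing import List


def respaced_timestep(timestep_map: List[float], ts: float) -> float:
    """Linearly interpolate a (possibly fractional) step into the original schedule."""
    lower = timestep_map[math.floor(ts)]
    upper = timestep_map[math.ceil(ts)]
    frac = ts - math.floor(ts)
    return lower + (upper - lower) * frac  # a float, even for whole steps


def embed_timestep(table: List[List[float]], t: float) -> List[float]:
    """Look up an embedding row; the index is cast to an integer first."""
    return table[int(round(t))]  # analogous to calling .long() on a tensor


timestep_map = [0.0, 10.0, 20.0, 30.0]    # hypothetical respaced schedule
table = [[float(i)] for i in range(31)]   # hypothetical 31-row embedding table

new_ts = respaced_timestep(timestep_map, 2)
try:
    table[new_ts]                         # float index: rejected by the lookup
except TypeError as exc:
    print("lookup failed:", exc)
print(embed_timestep(table, new_ts))      # works after the explicit cast
```

In PyTorch the analogous change would be to cast the timestep tensor with .long() before it reaches F.embedding; whether rounding is acceptable for fractional timesteps is a separate question this sketch does not settle.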

  • Better resolution images for inpainting

    Hello, thank you for this model!

    I have been wondering how to get better resolution on the outputs for inpainting. I believe the main issue is that the input image is downsized to 64x64, which loses a lot of detail. The upsampling can then only go up to 256x256 (going further creates artifacts).

    I tried replacing 64x64 with something like 128x128 (which would make it easier to upsample to 512x512) but got the error below.

    Is there a way to improve the output resolution of the inpainting model? In particular, I would like to test my hypothesis that the low resolution is caused by the downsample, fill-in, upsample pipeline operating at low resolutions.

    This is the error I get when replacing 64x64 with 128x128 on the Colab:

    error

    Thanks!

  • Higher Resolution

    Is there a way to upsize the outputs to something closer to 1024px? I've noticed a few people on Twitter who have been able to do so with this model, but after changing the image size to a higher value I get this error for anything over 256:

    /usr/local/lib/python3.7/dist-packages/glide_text2im/model_creation.py in create_model(image_size, num_channels, num_res_blocks, channel_mult, attention_resolutions, num_heads, num_head_channels, num_heads_upsample, use_scale_shift_norm, dropout, text_ctx, xf_width, xf_layers, xf_heads, xf_final_ln, xf_padding, resblock_updown, use_fp16, cache_text_emb, inpaint, super_res)
        140             channel_mult = (1, 2, 3, 4)
        141         else:
    --> 142             raise ValueError(f"unsupported image size: {image_size}")
        143     else:
        144         channel_mult = tuple(int(ch_mult) for ch_mult in channel_mult.split(","))

    ValueError: unsupported image size: 1024
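The traceback implies that create_model picks a default channel multiplier from the image size and raises for sizes it does not know, while an explicit comma-separated channel_mult string takes the other branch. The following is a hedged sketch of that selection logic only. The helper name resolve_channel_mult, the empty-string condition and every table entry except the (1, 2, 3, 4) tuple visible above are assumptions.

```python
from typing import Tuple

# Only the (1, 2, 3, 4) tuple appears in the traceback, and the size it belongs
# to is not shown; the keys and the other tuples are illustrative assumptions.
DEFAULT_CHANNEL_MULT = {
    64: (1, 2, 3, 4),
    128: (1, 1, 2, 3, 4),
    256: (1, 1, 2, 2, 4, 4),
}


def resolve_channel_mult(image_size: int, channel_mult: str = "") -> Tuple[int, ...]:
    """Return per-level channel multipliers, or raise for unknown sizes."""
    if channel_mult == "":
        try:
            return DEFAULT_CHANNEL_MULT[image_size]
        except KeyError:
            raise ValueError(f"unsupported image size: {image_size}") from None
    # An explicit string bypasses the size table entirely.
    return tuple(int(ch_mult) for ch_mult in channel_mult.split(","))


print(resolve_channel_mult(64))
print(resolve_channel_mult(1024, "1,2,4"))
```

Passing an explicit multiplier string only avoids the size check; nothing in this thread establishes that the released checkpoints would load or produce sensible output at other resolutions.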

  • Created non-rectangular masks for inpainting

    In the paper, I see masks that are non-rectangular (white blob in the sky in the image below):

    image

    but I think the 'mask' in the inpaint notebook is being applied on this line:

    source_mask_64[:, :, 20:] = 0

    which produces an image with a gray rectangle. Is there an example of how to create more complex masks?
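Since the notebook line above zeroes the part of the mask that should be filled in, a non-rectangular mask can be built the same way by zeroing an arbitrary set of pixels. Here is a minimal, dependency-free sketch that builds an elliptical "blob". The function name ellipse_mask and its parameters are hypothetical, and the convention (1.0 = keep, 0.0 = inpaint) is inferred from that single line.

```python
from typing import List


def ellipse_mask(size: int, cx: float, cy: float, rx: float, ry: float) -> List[List[float]]:
    """Return a size x size grid: 1.0 = keep the pixel, 0.0 = region to inpaint."""
    mask = []
    for y in range(size):
        row = []
        for x in range(size):
            # Standard ellipse test: points with normalised distance <= 1 are inside.
            inside = ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0
            row.append(0.0 if inside else 1.0)
        mask.append(row)
    return mask


mask = ellipse_mask(64, cx=40, cy=16, rx=12, ry=8)
print(sum(v == 0.0 for row in mask for v in row), "pixels masked out")
```

With PyTorch, such a grid could presumably be converted with th.tensor(mask)[None, None] and used in place of source_mask_64; the expected shape (1, 1, 64, 64) is an assumption that should be checked against the notebook.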

  • How could I load a mask generated by myself?

    First, I would like to say this is amazing work. But when I tried to change the code of 'inpaint.py' to use my own mask dataset, I found it difficult, because the mask is set by just three lines. Could you share code for using my own mask dataset, such as one generated by PCov? Thanks a lot.

  • some errors

    How do I get the result picture? I ran the sample code, but I can't get the result picture. My code is:

    from PIL import Image
    from IPython.display import display
    import torch as th
    
    from glide_text2im.download import load_checkpoint
    from glide_text2im.model_creation import (
        create_model_and_diffusion,
        model_and_diffusion_defaults,
        model_and_diffusion_defaults_upsampler
    )
    
    ### This notebook supports both CPU and GPU.
    ### On CPU, generating one sample may take on the order of 20 minutes.
    ### On a GPU, it should be under a minute.
    
    has_cuda = th.cuda.is_available()
    device = th.device('cpu' if not has_cuda else 'cuda')
    
    ### Create base model.
    options = model_and_diffusion_defaults()
    options['use_fp16'] = has_cuda
    options['timestep_respacing'] = '100' # use 100 diffusion steps for fast sampling
    model, diffusion = create_model_and_diffusion(**options)
    model.eval()
    if has_cuda:
        model.convert_to_fp16()
    model.to(device)
    model.load_state_dict(load_checkpoint('base', device))
    print('total base parameters', sum(x.numel() for x in model.parameters()))
    
    ### Create upsampler model.
    options_up = model_and_diffusion_defaults_upsampler()
    options_up['use_fp16'] = has_cuda
    options_up['timestep_respacing'] = 'fast27' # use 27 diffusion steps for very fast sampling
    model_up, diffusion_up = create_model_and_diffusion(**options_up)
    model_up.eval()
    if has_cuda:
        model_up.convert_to_fp16()
    model_up.to(device)
    model_up.load_state_dict(load_checkpoint('upsample', device))
    print('total upsampler parameters', sum(x.numel() for x in model_up.parameters()))
    
    def show_images(batch: th.Tensor):
        """ Display a batch of images inline. """
        scaled = ((batch + 1)*127.5).round().clamp(0,255).to(th.uint8).cpu()
        reshaped = scaled.permute(2, 0, 3, 1).reshape([batch.shape[2], -1, 3])
        display(Image.fromarray(reshaped.numpy()))
    
    ### Sampling parameters
    prompt = "an oil painting of a corgi"
    batch_size = 1
    guidance_scale = 3.0
    
    ### Tune this parameter to control the sharpness of 256x256 images.
    ### A value of 1.0 is sharper, but sometimes results in grainy artifacts.
    upsample_temp = 0.997
    
    ##############################
    ### Sample from the base model ###
    ##############################
    
    ### Create the text tokens to feed to the model.
    tokens = model.tokenizer.encode(prompt)
    tokens, mask = model.tokenizer.padded_tokens_and_mask(
        tokens, options['text_ctx']
    )
    
    ### Create the classifier-free guidance tokens (empty)
    full_batch_size = batch_size * 2
    uncond_tokens, uncond_mask = model.tokenizer.padded_tokens_and_mask(
        [], options['text_ctx']
    )
    
    ### Pack the tokens together into model kwargs.
    model_kwargs = dict(
        tokens=th.tensor(
            [tokens] * batch_size + [uncond_tokens] * batch_size, device=device
        ),
        mask=th.tensor(
            [mask] * batch_size + [uncond_mask] * batch_size,
            dtype=th.bool,
            device=device,
        ),
    )
    
    ### Create a classifier-free guidance sampling function
    def model_fn(x_t, ts, **kwargs):
        half = x_t[: len(x_t) // 2]
        combined = th.cat([half, half], dim=0)
        model_out = model(combined, ts, **kwargs)
        eps, rest = model_out[:, :3], model_out[:, 3:]
        cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
        half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
        eps = th.cat([half_eps, half_eps], dim=0)
        return th.cat([eps, rest], dim=1)
    
    ### Sample from the base model.
    model.del_cache()
    samples = diffusion.p_sample_loop(
        model_fn,
        (full_batch_size, 3, options["image_size"], options["image_size"]),
        device=device,
        clip_denoised=True,
        progress=True,
        model_kwargs=model_kwargs,
        cond_fn=None,
    )[:batch_size]
    model.del_cache()
    
    ### Show the output
    show_images(samples)
    
    ##############################
    ### Upsample the 64x64 samples ###
    ##############################
    
    tokens = model_up.tokenizer.encode(prompt)
    tokens, mask = model_up.tokenizer.padded_tokens_and_mask(
        tokens, options_up['text_ctx']
    )
    
    ### Create the model conditioning dict.
    model_kwargs = dict(
        ### Low-res image to upsample.
        low_res=((samples+1)*127.5).round()/127.5 - 1,
    
        ### Text tokens
        tokens=th.tensor(
            [tokens] * batch_size, device=device
        ),
        mask=th.tensor(
            [mask] * batch_size,
            dtype=th.bool,
            device=device,
        ),
    )
    
    ### Sample from the upsampler model.
    model_up.del_cache()
    up_shape = (batch_size, 3, options_up["image_size"], options_up["image_size"])
    up_samples = diffusion_up.ddim_sample_loop(
        model_up,
        up_shape,
        noise=th.randn(up_shape, device=device) * upsample_temp,
        device=device,
        clip_denoised=True,
        progress=True,
        model_kwargs=model_kwargs,
        cond_fn=None,
    )[:batch_size]
    model_up.del_cache()
    
    ### Show the output
    show_images(up_samples)
    
  • How to get the results closer to what is shown in the paper?

    Really inspirational work guys!

    But the results from the published code and models are not even remotely comparable to the results shown in the paper. Is there anything we can do to get closer to the original work?

    • E.g. could we train on a different (maybe bigger and more diverse) dataset?
    • Or do we need a bigger model?
    • Or maybe tweaking the params a bit could help?

    Image from the paper for: "a surrealist dream-like oil painting by salvador dalí of a cat playing checkers"

    image

    Image from the code for the same text prompt "a surrealist dream-like oil painting by salvador..."

    image

    It's almost like that meme: "You vs. the guy she told you not to worry about" :rofl:

    Anyway, if you can give us some advice on this matter it would be greatly appreciated! :+1:

  • The result of generating people is incredible

    Hi, I tried using your great work for some tests, but I found the results hard to believe when the prompt is about women or men.

    For example, when the prompt is "A woman with long hair and glasses", the result is: WeCom screenshot_16414535576761 (image)

    I think this model is not well suited to generating people, is it?

  • added missing required package to setup.py

    Several files import numpy, but it is not included in the setup requirements. In a fresh virtual environment, the project fails to run after installing with: pip install -e .

  • Check if file exists before sending the GET request

    The function fetch_file_cached calls requests.get before checking whether the file is already present, which causes it to raise an exception when used offline.

    This PR simply moves the check before the call to requests.
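The reordering described in this PR can be sketched as follows. This is not the repository's implementation: it uses urllib.request from the standard library instead of requests so the example has no dependencies, and the signature, the default cache_dir and the temporary-file step are assumptions made for illustration.

```python
import os
import urllib.request


def fetch_file_cached(url: str, cache_dir: str = "glide_model_cache") -> str:
    """Return a local path for url, downloading only when no cached copy exists."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, url.split("/")[-1])

    # Check the cache first, so a cached file never triggers a network request.
    if os.path.exists(local_path):
        return local_path

    # Download to a temporary name and rename only once the file is complete.
    tmp_path = local_path + ".tmp"
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as f:
        while True:
            chunk = response.read(4096)
            if not chunk:
                break
            f.write(chunk)
    os.replace(tmp_path, local_path)
    return local_path
```

With the existence check placed before any request, a previously downloaded checkpoint can be loaded without network access.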

  • No license

    Hi! Currently, there is no license applied to this repository. Unfortunately, that means that by default, e.g. copying, modifying and distributing the code is forbidden. If this is intentional, please add a mention to the README about this. Otherwise, I suggest adding an open source license, such as MIT.

  • About CLIP training on noised images

    Hey! I think GLIDE is wonderful work. But I have a question about CLIP training on noised images.

    I want to know why CLIP can be trained on noised images. I think that if t (ranging from 0 to 1000) is large (maybe close to 500 or more), the noised images hardly contain any semantic information. In that case, how does the CLIP model encode similar features from noised images and text? I also think this may prevent the model from converging (because it is hard to encode similar features between noised images and text).

  • Ways to reduce number of failed inpaints?

    When I use the model for inpainting, there is a large chance that an object will fail to inpaint, and instead GLIDE will simply guess at the background without inserting the object into the masked area. This is much more likely when the mask box is small, but is still common on large boxes too.

    Here is an example: Mask example_3_mask_0 Inpaint example_3_img_0

    My pipeline is to crop a 256x256 box with the mask as close to the center as possible. Then I downsample that, inpaint, run upsampler, and replace the 256x256 box.

    Is there any procedure I should use or parameter I should tune to reduce the number of misses? Thanks.

  • YouTube video walk-through of this codebase

    Hi @adityaramesh @unixpickle @prafullasd!

    Amazing work as always. :))

    I created a YouTube video where I do a deep dive/walk-through of this repo.

    I hope someone finds it useful: https://www.youtube.com/watch?v=c1GwVg3lt1c

  • Add image downloading in case colab env

    Hi. When the inpaint.ipynb notebook runs in Colab, the image grass.png is needed. So I added a cell that downloads the image when the notebook is running in Colab:

    import os

    # Download the sample image only when the COLAB_GPU environment variable is set.
    if 'COLAB_GPU' in os.environ:
      !wget https://raw.githubusercontent.com/openai/glide-text2im/main/notebooks/grass.png
    
  • Question about generating masks

    Hi, thanks for your great work. I have a question related to mask generation in "bpe.py".

    image

    As shown in the above figure, it seems that len(tokens) = text_ctx, and then padding = 0. Does this mean there is no padding mask?

    Best wishes,
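The screenshot referenced in this question is not reproduced here, so the following is only a sketch of how padding a token list to a fixed context length commonly works, not the code in bpe.py. The pad_token value and the exact truncation behaviour are assumptions. In such a scheme, padding is zero only when the prompt fills (or exceeds) the context; shorter prompts still get False entries in the mask.

```python
from typing import List, Tuple


def padded_tokens_and_mask(
    tokens: List[int], text_ctx: int, pad_token: int = 0
) -> Tuple[List[int], List[bool]]:
    """Pad (or truncate) tokens to text_ctx and mark which positions are real."""
    tokens = tokens[:text_ctx]            # truncate prompts longer than the context
    padding = text_ctx - len(tokens)      # 0 only when the prompt fills the context
    padded = tokens + [pad_token] * padding
    mask = [True] * len(tokens) + [False] * padding
    return padded, mask


print(padded_tokens_and_mask([5, 6], 4))           # short prompt: two padded slots
print(padded_tokens_and_mask([1, 2, 3, 4, 5], 4))  # long prompt: truncated, no padding
print(padded_tokens_and_mask([], 4))               # empty prompt: mask is all False
```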

  • Question about the CLIP model

    Hi, thanks for your great work!

    I found that you released several checkpoints, including CLIP ("clip/image-enc": "https://openaipublic.blob.core.windows.net/diffusion/dec-2021/clip_image_enc.pt", "clip/text-enc": "https://openaipublic.blob.core.windows.net/diffusion/dec-2021/clip_text_enc.pt").

    Are these checkpoints trained with noised images, or are they public CLIP models?

    Best wishes,

Minimal diffusion models - Minimal code and simple experiments to play with Denoising Diffusion Probabilistic Models (DDPMs)

Oct 6, 2022
Just playing with getting CLIP Guided Diffusion running locally, rather than having to use colab.

CLIP-Guided-Diffusion Just playing with getting CLIP Guided Diffusion running locally, rather than having to use colab. Original colab notebooks by Ka

Dec 4, 2022
This is the codebase for Diffusion Models Beat GANS on Image Synthesis.

Dec 5, 2022
Codebase for Diffusion Models Beat GANS on Image Synthesis.

Nov 5, 2022
Towards Implicit Text-Guided 3D Shape Generation (CVPR2022)

Towards Implicit Text-Guided 3D Shape Generation Towards Implicit Text-Guided 3D Shape Generation (CVPR2022) Code for the paper [Towards Implicit Text

Nov 4, 2022
Callable PyTrees and filtered JIT/grad transformations => neural networks in JAX.

Equinox Callable PyTrees and filtered JIT/grad transformations => neural networks in JAX Equinox brings more power to your model building in JAX. Repr

Nov 27, 2022
Pytorch-diffusion - A basic PyTorch implementation of 'Denoising Diffusion Probabilistic Models'

PyTorch implementation of 'Denoising Diffusion Probabilistic Models' This reposi

Dec 2, 2022
(ICCV 2021) Official code of "Dressing in Order: Recurrent Person Image Generation for Pose Transfer, Virtual Try-on and Outfit Editing."

Dressing in Order (DiOr) ?? [Paper] ?? [Webpage] ?? [Running this code] The official implementation of "Dressing in Order: Recurrent Person Image Gene

Nov 25, 2022
A weakly-supervised scene graph generation codebase. The implementation of our CVPR2021 paper ``Linguistic Structures as Weak Supervision for Visual Scene Graph Generation''

README.md shall be finished soon. WSSGG 0 Overview 1 Installation 1.1 Faster-RCNN 1.2 Language Parser 1.3 GloVe Embeddings 2 Settings 2.1 VG-GT-Graph

Nov 20, 2022
A 1.3B text-to-image generation model trained on 14 million image-text pairs

minDALL-E on Conceptual Captions minDALL-E, named after minGPT, is a 1.3B text-to-image generation model trained on 14 million image-text pairs for no

Nov 28, 2022
A Jupyter notebook to play with NVIDIA's StyleGAN3 and OpenAI's CLIP for a text-based guided image generation.

Nov 17, 2022
Official code for "Towards An End-to-End Framework for Flow-Guided Video Inpainting" (CVPR2022)

E2FGVI (CVPR 2022) English | 简体中文 This repository contains the official implementation of the following paper: Towards An End-to-End Framework for Flo

Dec 5, 2022
This repository contains several image-to-image translation models, which were tested for RGB to NIR image generation. The models are Pix2Pix, Pix2PixHD, CycleGAN and PointWise.

RGB2NIR_Experimental This repository contains several image-to-image translation models, which were tested for RGB to NIR image generation. The models

Oct 29, 2022
ICRA 2021 "Towards Precise and Efficient Image Guided Depth Completion"

PENet: Precise and Efficient Depth Completion This repo is the PyTorch implementation of our paper to appear in ICRA2021 on "Towards Precise and Effic

Nov 29, 2022
Pytorch implementation of our method for high-resolution (e.g. 2048x1024) photorealistic video-to-video translation.

vid2vid Project | YouTube(short) | YouTube(full) | arXiv | Paper(full) Pytorch implementation for high-resolution (e.g., 2048x1024) photorealistic vid

Dec 1, 2022
Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding

The Hypersim Dataset For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real i

Dec 2, 2022
Image-generation-baseline - MUGE Text To Image Generation Baseline

MUGE Text To Image Generation Baseline Requirements and Installation More detail

Oct 17, 2022
McGill Physics Hackathon 2021: Reaction-Diffusion Models for the Generation of Biological Patterns

DiffuseAnimals: Reaction-Diffusion Models for the Generation of Biological Patterns Introduction Reaction-diffusion equations can be utilized in order

Mar 7, 2022
A denoising diffusion probabilistic model (DDPM) tailored for conditional generation of protein distograms

Denoising Diffusion Probabilistic Model for Proteins Implementation of Denoising Diffusion Probabilistic Model in Pytorch. It is a new approach to gen

Nov 23, 2022