Shweta Mahajan

I am a Machine Learning Researcher at Qualcomm AI Research. I was a postdoctoral researcher and a Vector Postdoctoral Affiliate in the Vision Group at University of British Columbia advised by Prof. Leonid Sigal and Prof. Kwang Moo Yi. My research at UBC is focussed on diffusion models for high-level as well as low-level computer vision. I obtained my Ph.D. under the supervision of Prof. Stefan Roth, Ph.D. in the Visual Inference Group, Technische Universität Darmstadt. During my Ph.D., I researched deep generative algorithms for multimodal representation learning and the efficiency of exact inference deep generative models. I received my M.Sc. from the Saarland University where I was a part of the Machine Learning Group and the Max Planck Institute of Informatics.

Email / CV / Google Scholar / LinkedIn

News

02/2025: HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories accepted at CVPR2025!
02/2025: Distilling Multi-modal Large Language Models for Autonomous Driving accepted at CVPR2025!
11/2024: Recognized as a Top Reviewer at NeurIPS 2024!
02/2024: Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models accepted at CVPR 2024.
02/2024: ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models accepted at CVPR 2024.
02/2024: Unsupervised Keypoints from Pretrained Diffusion Models accepted at CVPR 2024.
11/2023: Recognized as a Top Reviewer at NeurIPS 2023!
09/2023: Talk on Multimodal Representation Learning with Deep Generative Models at NTU Singapore.
09/2023: Unsupervised Semantic Correspondence Using Stable Diffusion accepted at NeurIPS 2023.
08/2023: Nominated for the GI-Dissertationspreis!
07/2023: Awarded research grant by the Vector Institute of Artificial Intelligence!
06/2023: Invited to the doctoral consortium at CVPR 2023!
06/2023: Nominated for the Bertha Benz Best Thesis Award in Germany!
02/2023: Make-A-Story: Visual Memory Conditioned Consistent Story Generation accepted at CVPR 2023.
10/2022: I have joined UBC as a postdoctoral researcher.
06/2022: I have successfully defended my Ph.D. dissertation (summa cum laude)!

Research

I am interested in computer vision and machine learning, specifically in deep generative models (diffusion models, normalizing flows, variational methods, GANs) for multimodal representation learning.

	Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal CVPR, 2024 paper / arxiv Inverting the diffusion model to obtain interpretable language prompts directly based on the findings that different timesteps of the diffusion process cater to different levels of detail in an image.
	ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models Jeong-gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, Kwang Moo Yi CVPR, 2024 paper / arxiv / code / We utilize a pre-trained video diffusion model to solve consistency in zero-shot view synthesis.
	Unsupervised Keypoints from Pretrained Diffusion Models Eric Hedlin, Gopal Sharma, Shweta Mahajan, Xingzhe He, Hossam Isack, Abhishek Kar Helge Rhodin, Andrea Tagliasacchi, Kwang Moo Yi CVPR, 2024 paper / arxiv / code / One can leverage this semantic knowledge within diffusion models to find key points across images of similar kind.
	Unsupervised Semantic Correspondence Using Stable Diffusion Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi NeurIPS, 2023 paper / arxiv / code / One can leverage this semantic knowledge within diffusion models to find semantic correspondences with prompt optimization.

Make-A-Story: Visual Memory Conditioned Consistent Story Generation

Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan and Leonid Sigal
CVPR, 2023
paper / arxiv /

Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed.

	Multimodal Representation Learning for Diverse Synthesis with Deep Generative Models Shweta Mahajan Ph.D. Thesis, 2022 Thesis
	Diverse Image Captioning with Grounded Style Franz Klein, Shweta Mahajan and Stefan Roth GCPR, 2021 paper / arxiv / code A sequential variational framework encoding the style information grounded in images for stylized image captioning.
	PixelPyramids: Exact Inference Models from Lossless Image Pyramids Shweta Mahajan and Stefan Roth ICCV, 2021 paper / supp / arxiv / video / code A block-autoregressive exact inference model employing a lossless pyramid decomposition with scale-specific representations to encode the joint distribution of image pixels.
	Diverse Image Captioning with Context-Object Split Latent Spaces Shweta Mahajan and Stefan Roth NeurIPS, 2020 paper / supp / arxiv / video / code We introduce a novel factorization of the latent space to model diversity in contextual descriptions across images and texts within the dataset.
	Normalizing Flows with Multi-Scale Autoregressive Priors Shweta Mahajan, Apratim Bhattacharyya, Mario Fritz, Bernt Schiele and Stefan Roth CVPR, 2020 paper / supp / arxiv / video / code We improve the representational power of flow-based models by introducing channel-wise dependencies in their latent space through multi-scale autoregressive priors.
	Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings Shweta Mahajan, Iryna Gurevych and Stefan Roth ICLR, 2020 (Best Paper Award, Fraunhofer IGD) paper / arxiv / video / code Our model integrates normalizing flow-based priors for the domain-specific information, which allows us to learn diverse many-to-many mappings between the image and text domains.
	Joint Wasserstein Autoencoders for Aligning Multimodal Embeddings Shweta Mahajan, Teresa Botschen, Iryna Gurevych and Stefan Roth ICCV Workshops, 2019 (Oral presentation) paper / arxiv We propose joint Gaussian regularization of the latent representations to ensure coherent cross-modal semantics that generalize across datasets.

Shweta Mahajan

News

Research

Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models

ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models

Unsupervised Keypoints from Pretrained Diffusion Models

Unsupervised Semantic Correspondence Using Stable Diffusion

Make-A-Story: Visual Memory Conditioned Consistent Story Generation

Multimodal Representation Learning for Diverse Synthesis with Deep Generative Models

Diverse Image Captioning with Grounded Style

PixelPyramids: Exact Inference Models from Lossless Image Pyramids

Diverse Image Captioning with Context-Object Split Latent Spaces

Normalizing Flows with Multi-Scale Autoregressive Priors

Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings

Joint Wasserstein Autoencoders for Aligning Multimodal Embeddings