This page showcases my work on 3D object reconstruction from images. The goal is to learn category-level object class and shape information from a large dataset of shape models. I trained a multi-view autoencoder on common object categories from the ShapeNet dataset.

Multi-View Image Dataset

I used Blender scripts to automatically generate synthetic renders of models from the ShapeNet dataset. For each model I render ground-truth color and depth images along with camera poses, object masks, point clouds, and voxel grids. Below you can see some example renders.
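For reference, here is a minimal sketch of what such a render script can look like using Blender's Python API (bpy); it is not my exact pipeline. The model path, output directory, camera radius, and view count are hypothetical, the OBJ import operator assumes a Blender 3.x-style API, and the depth/mask/point-cloud passes are omitted for brevity.

```python
import math
import bpy

MODEL_PATH = "/data/shapenet/chair/model.obj"   # hypothetical path
OUT_DIR = "/data/renders/chair_0001"            # hypothetical output directory
NUM_VIEWS, RADIUS = 16, 1.5

# Import the ShapeNet model (Blender <= 3.x OBJ importer).
bpy.ops.import_scene.obj(filepath=MODEL_PATH)

# Keep the camera pointed at the object with a Track To constraint on an empty at the origin.
cam = bpy.data.objects["Camera"]
target = bpy.data.objects.new("Target", None)
bpy.context.collection.objects.link(target)
track = cam.constraints.new(type="TRACK_TO")
track.target = target
track.track_axis = "TRACK_NEGATIVE_Z"
track.up_axis = "UP_Y"

for i in range(NUM_VIEWS):
    # Place the camera on a circle around the object.
    angle = 2.0 * math.pi * i / NUM_VIEWS
    cam.location = (RADIUS * math.cos(angle), RADIUS * math.sin(angle), 0.8)

    # Render the color image; depth and masks can be saved via compositor passes.
    bpy.context.scene.render.filepath = f"{OUT_DIR}/color_{i:03d}.png"
    bpy.ops.render.render(write_still=True)
```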

Architecture

The multi-view autoencoder consists of a PointNet-like encoder network that fuses embedding vectors from multiple views. The intuition is that, regardless of the viewing angle or pose of the input image, every image of the same object should project to the same latent representation. The decoder network is composed of a series of linear and 2D deconvolutional layers that expand the shape embedding vector into a voxel grid of size 128x128x128. I use a cross-entropy loss for shape reconstruction and a contrastive loss on the embedding vectors to promote clustering in the latent space.
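The sketch below illustrates this idea in PyTorch; it is not the exact network. The per-view backbone, layer widths, and embedding size are placeholders, and the decoder uses 3D transposed convolutions up to the 128^3 grid rather than the 2D deconvolutions described above. The two losses are indicated at the end.

```python
import torch
import torch.nn as nn

class MultiViewEncoder(nn.Module):
    """Encode each RGB-D view independently, then fuse the per-view embeddings
    with a symmetric max-pool (PointNet-style) so the result is invariant to
    the order and number of input views."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),   # 4 channels: RGB + depth
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, views):                        # views: (B, V, 4, H, W)
        b, v, c, h, w = views.shape
        z = self.backbone(views.reshape(b * v, c, h, w))
        z = z.reshape(b, v, -1)
        return z.max(dim=1).values                   # fused embedding: (B, embed_dim)


class VoxelDecoder(nn.Module):
    """Expand the fused embedding to occupancy logits on a 128^3 voxel grid."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.fc = nn.Linear(embed_dim, 256 * 4 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 8^3
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16^3
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32^3
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),    # 64^3
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1),                # 128^3 logits
        )

    def forward(self, z):
        x = self.fc(z).reshape(-1, 256, 4, 4, 4)
        return self.deconv(x).squeeze(1)             # (B, 128, 128, 128)


def embedding_contrastive_loss(z, labels, margin=1.0):
    """Pull embeddings of the same class together, push different classes apart."""
    d = torch.cdist(z, z)                                        # (B, B) pairwise distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pull = same * d.pow(2)
    push = (1.0 - same) * (margin - d).clamp(min=0).pow(2)
    return (pull + push).mean()


# Example training step (shapes only):
# views: (B, V, 4, H, W) RGB-D stacks, gt_vox: (B, 128, 128, 128) floats in {0, 1},
# labels: (B,) category ids.
# z = MultiViewEncoder()(views)
# logits = VoxelDecoder()(z)
# loss = nn.functional.binary_cross_entropy_with_logits(logits, gt_vox) \
#        + 0.1 * embedding_contrastive_loss(z, labels)
```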

Results

I trained the network for 4 days on 6 RTX 2080 GPUs on the Euler compute cluster. The dataset consisted of 8 object categories with 50 object instances each and 16 input color/depth views per instance.

Quantitative

I evaluate reconstruction performance with cross-entropy loss, Chamfer distance, and intersection over union (IoU). Ablation studies show that increasing the number of input views improves reconstruction quality.
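As a reference for the IoU metric, here is a minimal sketch of how it can be computed on voxel grids; the 0.5 threshold is an assumption. Chamfer distance can be evaluated on points sampled from the predicted and ground-truth surfaces, e.g. with pytorch3d.loss.chamfer_distance.

```python
import torch

def voxel_iou(pred_logits, gt_occupancy, threshold=0.5):
    """IoU between thresholded predicted occupancy and ground-truth voxels.

    pred_logits: (B, D, H, W) raw decoder outputs.
    gt_occupancy: (B, D, H, W) floats in {0, 1}.
    """
    pred = (torch.sigmoid(pred_logits) > threshold).float()
    intersection = (pred * gt_occupancy).sum(dim=(1, 2, 3))
    union = torch.clamp(pred + gt_occupancy, max=1).sum(dim=(1, 2, 3))
    return (intersection / union.clamp(min=1)).mean()
```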

Qualitative

Example reconstruction figures: ground truth, 1-input-view prediction, 5-input-view prediction, and 1-5-input-view prediction.

The autoencoder embedding space is clustered by object class, as seen in the t-SNE embedding illustration below.
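A minimal sketch of how such a plot can be produced, assuming scikit-learn and matplotlib (which are not part of the tools listed below); the embedding and label files are hypothetical placeholders for values collected from the trained encoder.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (N, D) fused shape embeddings from the encoder,
# labels: (N,) integer category ids -- both assumed to be precomputed and saved.
embeddings = np.load("embeddings.npy")   # hypothetical file
labels = np.load("labels.npy")           # hypothetical file

# Project the latent vectors to 2D for visualization.
xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)

plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of multi-view autoencoder embeddings")
plt.savefig("tsne_embeddings.png", dpi=200)
```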

Tools Used

  • Python, Blender, and various bash scripts.
  • Main libraries: PyTorch, PyTorch3D, OpenCV, and Open3D.