R. Danhaive & D. Pinochet
LA: Deepankar Gupta
Industry Mentor: Mahima Pushkarna, Google

1 Introduction

We develop a system to generate 3D voxel-based chairs based on sketch input. Specifically, we focus on shapes—and sketches—of chairs. Our work relies on recent progress in 3D Generative Adversarial Networks (GANs). 3D GANs are conceptually similar to image GANs, which are capable of generating, among other things, realistic fakes of celebrities (Karras et al., 2018), but they generate shapes instead. Whereas image GANs generate 2D arrays/matrices, 3D GANs generate 3D arrays/matrices. The shape is described by density values (between 0 and 1), which can be interpreted as material densities, with 0 and 1 corresponding respectively to void and solid regions.

img 3D GAN architecture from Wu et al.

img Shapes sampled by the 3D GAN from Wu et al.

We use the 3D GAN model for chairs developed by Wu et al. (2016), whose architecture is shown above, to generate a large number of chairs by sampling the latent vector. Each 3D model is screenshotted, and each screenshot is converted into a sketch. With paired sketch/latent vector data, we build a model mapping a processed sketch onto a latent vector, which in turn is fed into the 3D GAN to generate a chair.

Finally, we develop a Unity sketching interface and a 3D visualization app, which communicate through a Flask server to send the human-drawn sketch, process it, predict the corresponding latent vector, and finally generate a chair shape. The finished physical setup is shown below.

img
System setup: the user draws on the iPad Pro (left), the sketch information is sent over and processed into a 3D shape and visualized on the laptop (right)

Throughout this blog post, we detail the strategies developed to generate data artificially, process real-world input, and predict corresponding shapes.

1.1 Motivation

3D modelling is hard and requires a significant amount of experience to get anything done. In contrast, sketching is a much more natural way to quickly explore ideas, but it does not provide ways to readily visualize them in 3D or quantify their merits through analyses requiring 3D information (e.g. structural analysis). 3D GANs allow us to avoid sketching or modelling altogether. For example, in Wu et al.’s 3D GAN, different chair shapes are generated by modifying a 200-dimensional input latent vector.

img 5 chairs generated by randomly sampling the latent vector of the chair GAN

The main issue with using a 3D GAN to explore ideas is that most dimensions of the latent vector are meaningless to a human designer. In addition, such a design space is hard to explore because of its large number of parameters. Our project seeks to provide a natural way to interact with 3D GANs, by means of interactive sketching.

1.2 Getting the 3D GAN Model

The 3D GAN model from Wu et al. is available online here, but it is built with Torch, a machine learning library for the Lua language. The first technical hurdle to overcome is to convert that model to a Python-readable model. In this case, we convert the model to a PyTorch model. The PyTorch library includes built-in functions to convert Lua models, but, because some of the layers in the 3D GAN model are not supported in PyTorch, we have to modify a model conversion routine to accommodate that. The modified routine can be found here, and the converted PyTorch model is available here. In this project, we use the 3D GAN model both to generate artificial data and as part of our prediction setup.
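For readers who want a feel for what the converted generator looks like, here is a minimal PyTorch sketch following the architecture described in the paper (200-dimensional latent vector mapped to a 64x64x64 density grid). The exact layer parameters of the converted checkpoint may differ, so treat this as an illustration rather than the actual model definition.

```python
import torch
import torch.nn as nn

class Generator3D(nn.Module):
    """Sketch of a Wu et al.-style 3D GAN generator: 200-d latent -> 64^3 densities."""
    def __init__(self, z_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            # 200 x 1 x 1 x 1 -> 512 x 4 x 4 x 4
            nn.ConvTranspose3d(z_dim, 512, kernel_size=4, stride=1, padding=0),
            nn.BatchNorm3d(512), nn.ReLU(True),
            # -> 256 x 8 x 8 x 8
            nn.ConvTranspose3d(512, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm3d(256), nn.ReLU(True),
            # -> 128 x 16 x 16 x 16
            nn.ConvTranspose3d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm3d(128), nn.ReLU(True),
            # -> 64 x 32 x 32 x 32
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(True),
            # -> 1 x 64 x 64 x 64, densities squashed into [0, 1]
            nn.ConvTranspose3d(64, 1, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, z):
        # z: (batch, 200) -> (batch, 1, 64, 64, 64)
        return self.net(z.view(z.size(0), -1, 1, 1, 1))

# Usage: sample a latent vector and threshold densities into a voxel grid
G = Generator3D()
z = torch.rand(1, 200) * 2 - 1            # uniform in [-1, 1]
voxels = (G(z) > 0.5).squeeze().numpy()   # boolean 64x64x64 occupancy grid
```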

2 Generating Data

We decided early on that we would steer away from trying to collect our own data. This choice was reinforced by the fact that we needed paired sketch/model data. Therefore, we opted for artificially generating all of the data required for this project. To do so, we sample 10,000 200-dimensional latent vectors, where each coordinate lies in the interval [-1, 1]. To obtain better coverage of the high-dimensional design space, we use Latin hypercube sampling. Each of the sampled latent vectors is fed into the 3D GAN to generate a voxelized shape. Each shape is automatically rendered in a 3D viewer and screenshotted.
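For illustration, here is how such a Latin hypercube sample could be drawn with scipy's qmc module (scipy >= 1.7); our original generation scripts may have used a different LHS implementation, so this is a sketch of the sampling step rather than the exact code.

```python
import numpy as np
from scipy.stats import qmc  # requires scipy >= 1.7

N_SAMPLES, Z_DIM = 10_000, 200

# Latin hypercube sample in [0, 1]^200, then rescaled to [-1, 1]^200
sampler = qmc.LatinHypercube(d=Z_DIM, seed=0)
unit_samples = sampler.random(n=N_SAMPLES)
latent_vectors = qmc.scale(unit_samples, -np.ones(Z_DIM), np.ones(Z_DIM))

print(latent_vectors.shape)  # (10000, 200)
```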
The image below shows a collection of screenshots obtained using the strategy outlined above.

img
Screenshots sampled as part of the data generation process

The next step consists of converting these screenshots into sketch-like images. The goal here is to obtain sketches that look like they could have been hand-drawn by a human. We identify contours in the original screenshot, simplify and smooth these contours using a B-spline approximation, and filter out the smallest contours (more details on this in the next section). The image below shows sample results obtained from screenshots of 3D models (these correspond to the image-based strategy detailed in 2.1.1).

img
Computer-generated sketches resulting from the image-based data generation process

We see that the sketches do not quite look human-drawn: they’re a little too wiggly and include too many details, despite our attempts to smooth them out. We’ll see below that by simplifying the sketch representation and by retaining only a few strokes per sketch, we achieve good results.
Once all screenshots (each corresponding to a specific latent vector) are turned into sketches, we have paired sketch/latent vector data, which we can use to learn a mapping from a sketch to its generating latent vector.

2.1 Converting Screenshots to Sketches

Turning model screenshots into sketches required a substantial amount of manual tuning and some neat tricks here and there. The first trick is to apply different shades of gray to the faces of the voxels defining a shape, as shown below.

img Different shades of grey are assigned to cube faces based on their orientation so that they can be easily retrieved

Through this, we can identify regions of the screenshot independently. For example, we know that one of the contours of the regions with a color value of [7, 7, 7] will represent the seat of the chair. Note that we do not even need RGB values here, as everything is grayscale. Once these regions are identified, we can extract meaningful contours and filter out those that amount to noise. We took two approaches to converting screenshots to sketches. The first one is image-based and corresponds to the training setup where we used a CNN embedding (described below), whereas the second one is vector-based and simplifies each screenshot further by extracting only 4 contours.
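As a minimal illustration of this trick: once each face orientation renders with its own gray value, region masks fall out of simple equality tests. The gray value 7 for the seat comes from the text above; the other values and the white-background threshold are hypothetical.

```python
import numpy as np
import cv2

# Load the rendered screenshot as a single-channel grayscale image
img = cv2.imread("screenshot_00042.png", cv2.IMREAD_GRAYSCALE)

# Hypothetical mapping from gray value to chair region (7 = seat, per the text)
REGION_GRAY_VALUES = {"seat": 7, "side": 38, "back": 64}

masks = {}
for name, value in REGION_GRAY_VALUES.items():
    # Binary mask of all pixels belonging to this region
    masks[name] = np.uint8(img == value) * 255

# The overall silhouette is everything that is not (assumed white) background
masks["outline"] = np.uint8(img < 250) * 255
```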

2.1.1 Image-based sketch representation

The first strategy consists of identifying screenshot contours and retaining only those whose bounding boxes are tall or wide enough. This filtering is useful to keep only the contours that are meaningful to the overall drawing. The contours are then drawn onto a white canvas with a 139x139 resolution, and the resulting image is the sketch.

img Step-by-step conversion of a screenshot into an image-based sketch representation
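A sketch of this filtering step, assuming OpenCV 4 and a binary silhouette mask like the one extracted above; the minimum bounding-box extent and the rescaling onto the canvas are illustrative choices, not the exact values we used.

```python
import cv2
import numpy as np

def image_sketch(mask, canvas_size=139, min_extent=20):
    """Keep contours with tall or wide enough bounding boxes and rasterize them."""
    contours, _ = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    kept = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w >= min_extent or h >= min_extent:  # discard small, noisy contours
            kept.append(c)

    # Rescale the retained contours and draw them in black on a white canvas
    scale = canvas_size / max(mask.shape)
    kept = [np.int32(c * scale) for c in kept]
    canvas = np.full((canvas_size, canvas_size), 255, dtype=np.uint8)
    cv2.drawContours(canvas, kept, -1, color=0, thickness=1)
    return canvas
```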

2.1.2 Vector-based sketch representation

As we will see below, the image-based representation did not yield the results we hoped for, and we decided to represent sketches with a more natural, simpler representation. Instead of filtering out the smallest contours, we keep only the biggest one for each region of the screenshot (front, side, top, entire shape). In addition, prior to identifying the contours, we blur each region of the image in order to suppress some of the artifacts due to the voxel-based representation of the shape (see image below).

img Step-by-step conversion of a screenshot into a vector-based sketch representation

Once again, each of these contours is approximated by a B-spline that we evaluate at 200 points. Each contour thus corresponds to 400 features, and since we consider 4 contours, we end up with a 1600-dimensional feature representation. Importantly, we normalize the coordinates of all contours by subtracting the minimum x-coordinate and dividing by the total width of the outline. As a result, the size of the sketch is irrelevant for the prediction system. Note that the aspect ratio of the sketch is preserved as we only normalize with respect to the x-direction. The image below shows the results for screenshot samples:

img Screenshots converted to sketches (left: original screenshot, center: identified regions, right: output contours; colors represent the type of region)

At this point, we’re pretty happy with the results above: the strategy extracts good sketch features, leaving out unnecessary information. The overall outline, in black, is particularly good, whereas the inner contours are sometimes a bit weird. To lend more importance to the black outline, its corresponding coordinates are multiplied by a factor of 15 in the final feature vector for a given sample (this factor can be seen as a hyperparameter of our model).
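A minimal sketch of how one contour becomes 400 of the 1600 features, assuming scipy's splprep/splev for the B-spline approximation; the smoothing factor and the handling of the y-translation are our illustrative choices rather than the exact values used in the project.

```python
import numpy as np
from scipy.interpolate import splprep, splev

N_POINTS = 200  # evaluation points per contour -> 400 features per contour

def contour_features(contour, x_min, width, weight=1.0):
    """Fit a smoothing B-spline to an (N, 2) pixel contour and sample 200 points."""
    x, y = contour[:, 0].astype(float), contour[:, 1].astype(float)
    # Periodic (closed) B-spline approximation; s controls the amount of smoothing
    tck, _ = splprep([x, y], s=len(x), per=True)
    xs, ys = splev(np.linspace(0.0, 1.0, N_POINTS), tck)

    # Both coordinates are divided by the outline width so the aspect ratio is
    # preserved; only the x-translation is removed (y handling is an assumption).
    xs = (np.asarray(xs) - x_min) / width
    ys = np.asarray(ys) / width
    return weight * np.concatenate([xs, ys])  # 400 features

# 4 contours -> 4 * 400 = 1600 features; the overall outline gets a weight of 15
# features = np.concatenate(
#     [contour_features(outline, x_min, width, weight=15.0)]
#     + [contour_features(c, x_min, width) for c in inner_contours])
```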

3 Learning the Mapping from Sketch to Latent Vector

We pursued two distinct approaches to create a model capable of predicting a latent vector for the 3D GAN based on a sketch input. The first strategy relies on the image-based sketch representation discussed above, while the second one is based on the vector representation.

3.1 Strategy 1 Failure: Embed bitmap sketches using a pretrained CNN (Inception v3)

Our first approach to learning the mapping from sketch to latent vector is similar to the one used by Google for Teachable Machine. Instead of building a CNN from scratch, we used a powerful model pretrained on the ImageNet dataset, Inception-v3 (Szegedy et al., 2015), to compute an embedding for each sketch. Specifically, we remove the last prediction layer of Inception-v3 to retain only the 2048-dimensional embedding—also called bottleneck features—used as features for each sketch. We then use a shallow (2- to 4-layer) neural network to learn a mapping between the embedded sketch features and the latent vector. The overall setup is summarized by the diagram below:

img Data generation and training setup for strategy 1: latent vectors are sampled to produce 3D shapes which are screenshotted and converted into image-based sketch features. These features are subsequently embedded in a 2048-dimensional vector using Inception-v3. Finally, a fully connected neural network is trained to learn the inverse relationship between the embedded features and the generating latent vector.
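A sketch of how the bottleneck features can be extracted with torchvision's pretrained Inception-v3 (torchvision >= 0.13); our original pipeline may have been wired differently, and the shallow regressor shown at the end is only one plausible architecture.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained Inception-v3 with its final classification layer replaced by an
# identity, so the forward pass returns the 2048-d bottleneck features.
inception = models.inception_v3(weights="DEFAULT")
inception.fc = torch.nn.Identity()
inception.eval()

preprocess = T.Compose([
    T.Resize((299, 299)),  # Inception-v3 expects 299x299 inputs
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

sketch = Image.open("sketch_00042.png").convert("RGB")
with torch.no_grad():
    embedding = inception(preprocess(sketch).unsqueeze(0))  # shape (1, 2048)

# A shallow fully connected network then maps the embedding to the 200-d latent
regressor = torch.nn.Sequential(
    torch.nn.Linear(2048, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 200), torch.nn.Tanh(),  # latent coordinates in [-1, 1]
)
```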

Unfortunately, this approach did not prove successful: after training, the shallow neural network collapses to the mean of the training latent vectors, effectively predicting the same value regardless of the sketch, despite our attempts to improve the network architecture and tune its hyperparameters. We identify two potential reasons for this poor performance:

  • The Inception-v3 embeddings result in a large loss of information compared to the raw original sketch: this motivated the use of a vector-based sketch representation, which allowed us to keep a fairly compressed representation with little to no loss of information.
  • Two similar shapes may have vastly different generating latent vectors, rendering a parametric model such as a neural network hard to train: this suggested using a non-parametric method—nearest neighbor—for our final system.

3.2 Strategy 2 Success: Use simplified vector representation for sketches and predict using nearest neighbor

In this case, we use the vector-based sketch representation detailed above and predict the latent vector of an artificial test sketch based on its closest neighbors in feature space. Given how we represent sketches, this essentially means that we identify drawings in our existing data whose curves are close (in terms of Euclidean distance) to the curves in the test sketch, which is intuitively a sensible approach. The training setup is diagrammed below:

img Data generation and training setup for strategy 2: latent vectors are sampled to produce 3D shapes which are screenshotted and converted into vector-based sketch features. Finally, a nearest neighbor model is trained on the generated data and used for predicting the latent vector corresponding to an input sketch.

We found that this model performed well. In particular, because nearest neighbor is non-parametric (up to the chosen number of nearest neighbors), mean collapse, which was observed with neural networks, is not an issue. The image below shows a test sketch and its corresponding test shape, as well as the nearest neighbor sketches and the predicted shape.

img Example of prediction with 5 nearest neighbors for an artificial test sketch

Our nearest neighbor model bases its prediction for each test sketch on its 5 nearest neighbors. The contribution of each neighbor is weighted in inverse proportion to its distance from the test sketch.
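A minimal sketch of such a model with scikit-learn, assuming the 1600-dimensional feature matrix and the 10,000 sampled latent vectors from the previous sections (the file names are hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# X: (10000, 1600) sketch features, Z: (10000, 200) generating latent vectors
X = np.load("sketch_features.npy")  # hypothetical file names
Z = np.load("latent_vectors.npy")

# 5 nearest neighbors, each weighted inversely to its distance from the query
knn = KNeighborsRegressor(n_neighbors=5, weights="distance", metric="euclidean")
knn.fit(X, Z)

def predict_latent(sketch_features):
    """Map a 1600-d sketch feature vector to a 200-d GAN latent vector."""
    return knn.predict(sketch_features.reshape(1, -1))[0]
```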

img Predictions for artificial test sketches based on the implemented model

3.3 Processing human-drawn sketches

We have detailed above how we generated our training data based on the GAN and custom image processing functions. However, humans do not draw sketches that can readily be processed in the same way. Nonetheless, we can follow a similar approach if we provide users with colors to define regions based on their function. Specifically, we ask the user to employ three colors to indicate the different functional regions of the sketch: green for the seat, red for the side, and blue for the backrest. Black allows the user to draw a general outline for the shape with no specific assignment. Each shape drawn by the user is identified by color and, for each color, the largest region is retained and its contour is extracted. This results in a format identical to the generated data, i.e. 200 points (2 coordinates each) per contour and 4 contours, or 1600 features.
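A sketch of this color-based region extraction, assuming the sketch arrives as an RGBA pixel array and using OpenCV; the color thresholds are illustrative, not the exact values from our interface.

```python
import cv2
import numpy as np

# Color assignments from the interface: green = seat, red = side, blue = back,
# black = overall outline (thresholds are illustrative).
COLOR_MASKS = {
    "seat":    lambda rgb: (rgb[..., 1] > 200) & (rgb[..., 0] < 100) & (rgb[..., 2] < 100),
    "side":    lambda rgb: (rgb[..., 0] > 200) & (rgb[..., 1] < 100) & (rgb[..., 2] < 100),
    "back":    lambda rgb: (rgb[..., 2] > 200) & (rgb[..., 0] < 100) & (rgb[..., 1] < 100),
    "outline": lambda rgb: rgb.max(axis=-1) < 100,
}

def largest_contour(mask):
    """Return the largest contour (by area) in a binary mask, or None."""
    contours, _ = cv2.findContours(mask.astype(np.uint8) * 255,
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    return max(contours, key=cv2.contourArea)[:, 0, :] if contours else None

def human_sketch_contours(rgba):
    """rgba: (H, W, 4) array from the drawing app -> one contour per region."""
    rgb = rgba[..., :3]
    return {name: largest_contour(test(rgb)) for name, test in COLOR_MASKS.items()}
```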

3.4 Final System

With the nearest neighbors model and a method to process human-drawn input, the full prediction system comes together as diagrammed below.

img Diagram of the shape prediction system based on human-drawn sketch

First, the image data is processed into the contour features described above. These features are used to predict a latent vector based on the 5-nearest-neighbors model. This latent vector is then fed into the 3D GAN, and the voxel data is visualized in 3D.

4 Implementation

In this section, we discuss the practical implementation of the user interfaces and how they communicate. The implementation consists of a client-server configuration with three modules: the server, the drawing application, and the visualization application. The server is built with Flask and Flask-SocketIO to provide fast, fluid communication between clients over sockets. We use a portable Wi-Fi router with a static IP address to communicate between the server and the clients. The drawing application is written in C# using the Unity 3D game engine and built for iOS. We used a 2018 iPad Pro as the main device for the demo. The drawing application implements a raycasting system over a rendered texture to generate the 2D pixel information. This is a somewhat hacky way of implementing a sketching interface, but it allows us to use the Unity game engine and compile the application to virtually any platform without further work. Once the user finishes their sketch and pushes the predict button, the app transforms the render texture into an array of RGBA values sent as a JSON object over a socket connection (Socket.IO). The server receives the JSON object and processes the pixel information with our system.
Finally, once the information has been received and processed on the server side, a 64x64x64 array is sent as a JSON object to the second client for visualizing the 3D model. The C# visualization application consists of a 3D navigator that renders a procedural watertight mesh from the array. We use a procedural mesh algorithm to achieve speed and fluidity in the visualization and to avoid rendering occluded geometry. The application is built for WebGL and can thus be hosted on any server, although we have only run it locally thus far.
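A minimal sketch of the server-side glue with Flask-SocketIO; the event names, payload fields, and the helper functions extract_features, predict_latent, and generate_voxels (standing in for the contour processing, nearest-neighbor prediction, and GAN forward pass described above) are hypothetical.

```python
import numpy as np
from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app)

@socketio.on("sketch")  # hypothetical event name sent by the drawing app
def handle_sketch(message):
    # Reshape the flat RGBA pixel list back into an (H, W, 4) image
    rgba = np.array(message["pixels"], dtype=np.uint8).reshape(
        message["height"], message["width"], 4)

    features = extract_features(rgba)  # color masks + contour features (see above)
    z = predict_latent(features)       # 5-nearest-neighbor prediction
    voxels = generate_voxels(z)        # 3D GAN forward pass, 64x64x64 grid

    # Forward the voxel grid to the visualization client
    emit("voxels", {"grid": voxels.astype(int).flatten().tolist()}, broadcast=True)

if __name__ == "__main__":
    socketio.run(app, host="0.0.0.0", port=5000)
```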

img User testing our application

5 Results

The video below shows our apps being used on real data (human-drawn sketches). It shows that the system is fairly successful. After processing the input sketch, our machine learning model is capable of predicting different chairs with distinct features, such as armrests, back inclination, and overall smoothness.

There is no real-world paired sketch-shape data—all of our data is artificial—and we cannot compute an error metric. However, we observe qualitatively that the system, despite being far from perfect, performs reasonably well and allows users to explore different chair shapes intuitively.

6 Conclusions

This project shows that machine learning can be used creatively to imagine novel, natural ways to interact with generative systems—a 3D GAN in this case. Our model performs decently well from a qualitative perspective, although there is ample room for improvement. We’re most likely going to keep working on this and see how far we can take it.

6.1 Applications

We started this project with no particular application in mind: we just thought it’d be fun! That said, it could be used for diverse purposes, such as:

  • Design ideation: sketching is much more intuitive than modelling, even for design professionals. If developed further, our system could provide a platform for design teams to quickly iterate on and brainstorm design ideas.
  • Online shopping: online merchants could create products easily customizable by customers through sketching (e.g. customized sneakers)
  • Any application for creating or selecting 3D objects that would benefit from an easier exploration interface.

Our project demonstrates that machine learning has the potential to act as a comprehension layer translating natural human input into structured machine output. A lot of work in that area has focused on speech recognition, but some ideas are easier to express with visual means, and we hope there will be more work in that area in the future.

6.2 Future Work

In this project, we took a first stab at what is arguably a complex problem. Through multiple iterations, we identified a few promising directions for future work:

  • Improve the sketch representation: by transitioning from an image-based to a vector-based representation for the sketches, we noticed a significant improvement in model performance. In future work, we will explore the power of a vector-based representation that is unrestrained in length (number of contours) and similar to the features developed by Ha & Eck (2017) for Sketch-RNN. With such a representation, our model would also need to be modified to accommodate sequential data, and we’d likely have to opt for a recurrent neural network architecture.
  • Increase the amount of generated data: for the model presented, we generated 10,000 artificial data points. Clearly, using additional data would yield a finer model. In particular, given the design space is 200-dimensional, 10,000 samples provide a very sparse coverage of all possible shapes.
  • Augment data: all the data generated was obtained by taking screenshots from specific viewpoints, making our model brittle to perspective transformations. We expect that taking screenshots from additional viewpoints would mitigate this problem.
  • Use other GAN models: we focused our attention solely on the chair model from Wu et al. because it provided a scope appropriate for a project with such a constrained timeline. In the future, we plan to work with richer models or models of different types of objects.

Acknowledgments

Thanks to the entire 6.S198 staff for organizing this fun class and specifically to Deep Gupta and Mahima Pushkarna for their help throughout the project!