MIT team invents cheap way to add 3D cameras to cellphones

A one-pixel camera is coaxed into capturing 3D data. Qualcomm is taking a close look.

[Editor’s Note: A video explaining the technology in more detail appears at the end of the article.]

Researchers at MIT are working on a small, cheap, and efficient way to allow cellphones and other handheld devices to become 3D cameras.

Like other sophisticated depth-sensing devices, the MIT system uses the “time of flight” of light particles to gauge depth. A pulse of infrared laser light is fired at a scene and the camera measures the time it takes the light to return from objects at different distances. But this system uses a single light detector; in essence it is a one-pixel camera, aided by clever algorithms and some basic truths about light.

The light emitted by the laser passes through a series of randomly generated patterns of light and dark squares, like irregular checkerboards. This provides enough information that algorithms can reconstruct a two-dimensional visual image from the light intensities measured by a single pixel.

In experiments, researchers found that the number of laser flashes—and, roughly, the number of checkerboard patterns—needed to build an adequate depth map was about 5% of the number of pixels in the final image. A LIDAR system, by contrast, would need to send out a separate laser pulse for every pixel.
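To make the numbers concrete, here is a minimal sketch of the measurement step, not the researchers' code. It assumes a 64-by-64 scene, random patterns of 0s and 1s, and a detector that simply sums the light each pattern lets through; the function name `single_pixel_measurements` and all parameters are hypothetical. It ignores timing entirely and leaves out the reconstruction algorithm, which the article does not describe in detail.

```python
import numpy as np

def single_pixel_measurements(scene, fraction=0.05, seed=0):
    """Simulate one-pixel readings of `scene` through random binary masks."""
    rng = np.random.default_rng(seed)
    n_pixels = scene.size
    # About 5% as many patterns as pixels, the ratio quoted in the article.
    n_patterns = int(np.ceil(fraction * n_pixels))
    # Each row is one random checkerboard: 1 = light square, 0 = dark square.
    patterns = rng.integers(0, 2, size=(n_patterns, n_pixels))
    # The lone detector sees only the total light passed by each mask.
    readings = patterns @ scene.ravel()
    return patterns, readings

# Assumed toy scene: a bright square on a dark background.
scene = np.zeros((64, 64))
scene[20:40, 20:40] = 1.0

patterns, readings = single_pixel_measurements(scene)
print(patterns.shape[0], "patterned flashes vs", scene.size, "per-pixel pulses")
# prints: 205 patterned flashes vs 4096 per-pixel pulses
```

Each reading is a single number, so the 205 readings alone underdetermine the 4,096 pixel values. Recovering an image from them depends on extra assumptions about the scene, which is where the algorithms come in.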

To add the crucial third dimension to the depth map, the researchers use a technique called parametric signal processing. Essentially, they assume that all of the surfaces in the scene, however oriented toward the camera, are flat planes. While that is not strictly true, the mathematics of light bouncing off flat planes is much simpler than that of light bouncing off curved surfaces. The researchers’ parametric algorithm finds the flat-plane model that best matches the information about returning light, creating an accurate depth map from a minimum of visual information.
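The idea of fitting data to a flat-plane model can be illustrated with an ordinary least-squares plane fit. This is a simplified stand-in, not the researchers' parametric algorithm, which works on the returning light signal rather than on ready-made depth points. The function `fit_plane`, the sample count, and the noise level are all assumptions made for illustration.

```python
import numpy as np

def fit_plane(x, y, z):
    """Least-squares fit of z = a*x + b*y + c; returns the array (a, b, c)."""
    design = np.column_stack([x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    return coeffs

# Assumed synthetic data: 50 noisy depth samples from a tilted plane.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = rng.uniform(-1, 1, 50)
true_plane = np.array([0.3, -0.2, 1.5])  # slope in x, slope in y, offset
z = true_plane[0] * x + true_plane[1] * y + true_plane[2]
z = z + rng.normal(0, 0.01, 50)          # small measurement noise

a, b, c = fit_plane(x, y, z)
print(f"fitted plane: z = {a:.2f}*x + {b:.2f}*y + {c:.2f}")
```

Three numbers describe the whole surface, which is why a plane model can be pinned down from far fewer measurements than a pixel-by-pixel depth map would need.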

The research is from Vivek Goyal, the Esther and Harold E. Edgerton Associate Professor of Electrical Engineering, and his group at MIT’s Research Laboratory of Electronics. “3D acquisition has become a really hot topic,” Goyal says. “In consumer electronics, people are very interested in 3D for immersive communication, but then they’re also interested in 3D for human-computer interaction.”

Andrea Colaco, a graduate student at MIT’s Media Lab and one of Goyal’s co-authors on a paper that will be presented at the IEEE’s International Conference on Acoustics, Speech, and Signal Processing in March, points out that gestural interfaces make it much easier for multiple people to interact with a computer at once—as in the dance games the Kinect has popularized.

“When you’re talking about a single person and a machine, we’ve sort of optimized the way we do it,” Colaco says. “But when it’s a group, there’s less flexibility.”

Ahmed Kirmani, a graduate student in the Department of Electrical Engineering and Computer Science and another of the paper’s authors, adds, “3D displays are way ahead in terms of technology as compared to 3D cameras. You have these very high-resolution 3D displays that are available that run at real-time frame rates.

“Sensing is always hard,” he says, “and rendering it is easy.”

The team’s algorithm lets them get away with using relatively crude hardware. The system measures the time of flight of photons using a cheap photo-detector and an ordinary analog-to-digital converter—an off-the-shelf component already found in all cellphones. The sensor takes about 0.7 nanoseconds to register a change to its input.

That’s enough time for light to travel 21 centimeters, Goyal says. “So for an interval of depth of 10 and a half centimeters—I’m dividing by two because light has to go back and forth—all the information is getting blurred together,” he says. Because of the parametric algorithm, however, the researchers’ system can distinguish objects that are only two millimeters apart in depth. “It doesn’t look like you could possibly get so much information out of this signal when it’s blurred together,” Goyal says.
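Goyal's arithmetic can be checked in a few lines. The sketch below uses the speed of light in a vacuum and the 0.7-nanosecond figure from the article; the helper name `round_trip_depth_m` is hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in a vacuum

def round_trip_depth_m(seconds):
    """Depth covered when light travels out and back in `seconds`."""
    return SPEED_OF_LIGHT_M_PER_S * seconds / 2

sensor_response_s = 0.7e-9  # detector response time quoted in the article

# Distance light travels in one direction during the response time.
travel_cm = SPEED_OF_LIGHT_M_PER_S * sensor_response_s * 100
# Halved, because the pulse has to go to the object and come back.
blur_cm = round_trip_depth_m(sensor_response_s) * 100

print(f"light travels {travel_cm:.1f} cm; depth blur is {blur_cm:.1f} cm")
# prints: light travels 21.0 cm; depth blur is 10.5 cm
```

By the same formula, two surfaces 2 millimeters apart in depth return light only about 13 picoseconds apart, roughly 50 times shorter than the sensor's response time.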

The researchers’ algorithm is also simple enough to run on the type of processor ordinarily found in a smartphone. To interpret the data provided by the Microsoft Kinect, by contrast, an Xbox requires the extra processing power of a graphics-processing unit (GPU).

“This is a brand-new way of acquiring depth information,” says Yue M. Lu, an assistant professor of electrical engineering at Harvard University. “It’s a very clever way of getting this information.” One obstacle to deployment of the system in a handheld device, Lu speculates, could be the difficulty of emitting light pulses of adequate intensity without draining the battery.

But the light intensity required to get accurate depth readings is proportional to the distance of the objects in the scene, Goyal explains, and the applications most likely to be useful on a portable device—such as gestural interfaces—deal with nearby objects. In addition, the researchers’ system makes an initial estimate of objects’ distance and adjusts the intensity of subsequent light pulses accordingly.

Telecom giant Qualcomm sees enough promise in the technology that it selected a team consisting of Kirmani and Colaco as one of eight winners—out of 146 applicants from a select group of universities—of a $100,000 grant through its 2011 Innovation Fellowship program.