Enhancing Stereo Depth Estimation with Deep Learning Techniques

Overview

Stereo depth estimation is essential for robotics, AR/VR, and industrial inspection, enabling accurate 3D perception for tasks like bin picking, autonomous navigation, and quality control. The Teledyne IIS Bumblebee X stereo camera delivers both high accuracy and real-time performance, producing detailed disparity maps at up to 38 FPS at 1024×768 resolution.

Built on the Semi-Global Block Matching (SGBM) algorithm, Bumblebee X performs reliably in well-textured scenes. However, like many classical stereo methods, it can struggle in regions with low texture or reflective surfaces—especially without a pattern projector—resulting in gaps or incomplete depth data.

Recent advances in deep learning (DL) offer promising ways to improve disparity accuracy and completeness. This document explores these methods through real-world tests and focuses on their strengths, tradeoffs, and suitability for embedded systems.

Before evaluating these methods, it is important to understand the real-world challenges classical stereo techniques face.

Stereo Depth Estimation: Challenges and Limitations

Classical stereo algorithms, such as onboard SGBM, provide fast and efficient disparity estimation, making them well-suited for embedded and real-time applications. These methods work reliably in scenes with good surface texture and do not require GPU acceleration or training data.

However, in more complex environments—especially those with reflective or low-texture surfaces—they can produce incomplete or inaccurate depth maps.

The warehouse scene below illustrates several of these challenges. Long, repetitive shelving reduces parallax cues, while glossy epoxy flooring reflects ambient light, and specular highlights from overhead luminaires introduce matching errors.

Blank regions along the left and right edges of the scene occur because the SGBM algorithm’s MinDisparity is set to 0, in combination with a fixed 256-level disparity range. As a result, the system cannot measure objects that fall outside the measurable depth window—specifically, anything closer than approximately 1.6 meters. To capture these near-field objects, users can either increase the minimum disparity value (Scan3D Coordinate Offset) or switch to quarter resolution mode.
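The ~1.6 m figure follows from the standard stereo relation Z = f·B/d: with MinDisparity at 0 and 256 disparity levels, the largest measurable disparity is 256 pixels. A minimal sketch of the arithmetic, where the focal length and baseline below are illustrative assumptions, not Bumblebee X's actual calibration:

```python
def min_measurable_depth(focal_px: float, baseline_m: float,
                         max_disparity_px: float) -> float:
    """Nearest depth a rectified stereo pair can measure: Z = f * B / d_max."""
    return focal_px * baseline_m / max_disparity_px

# Illustrative values only (assumed, not the camera's calibration):
f_px, baseline_m = 1700.0, 0.24
z_min = min_measurable_depth(f_px, baseline_m, 256)   # ~1.59 m
```

Raising the minimum disparity (Scan3D Coordinate Offset) shifts this window toward the camera, at the cost of losing the farthest depths.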

Image 1-SGBM.jpg

The disparity image above makes the shortcomings of the on-board SGBM disparity engine apparent.

To address these challenges, two complementary deep learning-based methods are commonly used in stereo vision applications:

Hybrid DL Method:

This method enhances the initial disparity maps generated by SGBM using lightweight neural models. Techniques like CVLAB-Unibo’s neural disparity refinement model improve depth completeness and reduce matching artifacts by leveraging spatial and color cues. As a hybrid method, it strikes a balance between improved accuracy and computational efficiency, making it well-suited for real-time or embedded systems.

End-to-End DL Method:

This method uses end-to-end deep learning models—such as Selective-Stereo and FoundationStereo—to compute disparity directly from stereo image pairs, without relying on on-board SGBM. These networks learn semantic and contextual features from large datasets, enabling them to deliver dense and accurate disparity maps even in complex scenes involving occlusions or reflective surfaces. The trade-off is higher GPU requirements, which may limit their use in real-time or resource-constrained environments.

The next sections dive deeper into each approach, evaluating their accuracy, runtime, and coverage performance across real-world scenes.

Hybrid DL Method (neural-disparity-refinement model)

Method Description

The neural disparity refinement approach by CVLAB-Unibo enhances existing disparity maps produced by classical methods (e.g., SGBM). It employs a deep convolutional neural network (CNN) with a VGG-13 backbone arranged in a U-Net architecture to:

  • Fill disparity gaps based on spatial and color consistency
  • Sharpen edges through learned spatial-contextual information
  • Reduce common stereo matching artifacts like streaking
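As a toy illustration of the "spatial and color cues" idea — a crude stand-in for intuition only, not the CVLAB-Unibo network — holes can be filled by propagating the last valid disparity along a scanline only while the color stays similar:

```python
import numpy as np

def fill_holes_color_guided(disp: np.ndarray, rgb: np.ndarray,
                            tol: float = 10.0) -> np.ndarray:
    """Toy left-to-right propagation: copy the last valid disparity into a
    hole only while the color stays similar. Illustrates the 'spatial and
    color cues' idea; NOT the CVLAB-Unibo refinement network."""
    out = disp.astype(float).copy()
    h, w = out.shape
    for y in range(h):
        last_d = None
        last_c = None
        for x in range(w):
            if out[y, x] > 0:                  # valid pixel (0 marks a hole)
                last_d = out[y, x]
                last_c = rgb[y, x].astype(float)
            elif last_d is not None and np.abs(rgb[y, x] - last_c).sum() < tol:
                out[y, x] = last_d             # fill hole under similar color
    return out
```

The learned network replaces this hand-crafted rule with spatial-contextual features, which is what also sharpens edges and suppresses streaking.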

Network Architecture

The neural refinement network processes two inputs:

  1. The left RGB image from the stereo camera
  2. The raw disparity map generated on-board by Bumblebee X

A U-Net architecture with skip connections effectively merges coarse disparity estimations with fine detail from the RGB input, significantly improving depth-map completeness.

Performance

Inference for neural disparity refinement runs at approximately 3 FPS on an NVIDIA RTX 3060 GPU, which is fast enough to enhance depth asynchronously alongside the camera's real-time stream.

In the same warehouse scene we refine the disparity by passing the output obtained from the on-board disparity engine, along with the left rectified image, into the neural-disparity-refinement model. The results are shown below:

Image 2-neural-disparity-refinement model.jpg

As seen in the disparity images, applying this network to the warehouse scene reduces holes and also fixes the floor mismatches. However, because the refinement relies on the SGBM prior, some holes remain in areas where SGBM produces no data (the left and right extremes).

To reproduce these results, go to the Deep Learning examples on GitHub.

End-to-End DL Method (Selective-Stereo)

Method Description

Selective-Stereo and FoundationStereo are examples of advanced deep learning frameworks that compute disparity maps directly from stereo image pairs, without relying on traditional matching algorithms like SGBM. Selective-Stereo, for example, applies adaptive frequency selection within its architecture, distinguishing high-frequency edges from low-frequency smooth regions so that each part of the image receives tailored processing.

Network Architecture

Selective-Stereo builds upon the IGEV-Stereo architecture, integrating gated recurrent units (GRUs) for iterative refinement. This method dynamically adjusts computational focus based on the detected image frequency characteristics:

  • High-frequency branch enhances edges and fine details
  • Low-frequency branch maintains smooth regions and avoids overfitting
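The frequency separation can be pictured with a simple low-pass/residual split. This is only an analogy — Selective-Stereo learns its frequency selection inside the GRU updates rather than applying a fixed blur:

```python
import numpy as np

def split_frequencies(img: np.ndarray, k: int = 5):
    """Toy frequency split with a k-by-k box blur: the low-pass band keeps
    smooth regions, the residual carries edges and fine detail. An analogy
    for the two-branch idea, not Selective-Stereo's actual mechanism."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    low = np.zeros(img.shape, dtype=float)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            low[y, x] = padded[y:y + k, x:x + k].mean()  # local average
    high = img - low                                     # edge/detail residual
    return low, high
```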

Performance

While achieving high accuracy and completeness, Selective-Stereo is computationally intensive, providing a frame rate of approximately 0.5 FPS on an NVIDIA RTX 3060 GPU.

Based on the results shown below, the end-to-end deep-learning approach delivers the widest disparity coverage and preserves fine structural detail: for example, it renders the ceiling light fixtures sharply while avoiding the speckle artifacts caused by their reflections.

Overall, the fully end-to-end disparity estimation network outperforms both the raw on-board SGBM output and the neural-refinement pipeline, though at the expense of longer runtimes or the need for a more powerful GPU.

Image 3-Selective-Stereo.jpg

To reproduce these results, go to the Deep Learning examples on GitHub.

Additional Considerations

As with the on-board disparity results, surfaces closer than 1.6 m (outside the 0–256 disparity window) are not handled accurately. The storage bin in the lower-right corner illustrates the issue: since it is very close to the camera, it should appear in the extreme red range, yet the network assigns it a smaller disparity, placing it farther away than it actually is. This local error corrupts the depth map and produces a poor point cloud in that region.

Some deep learning models expose a minimum-disparity setting to correctly capture closer objects, while others do not. If your chosen model does not, you can shift the right image by the desired minimum-disparity offset before inference, so that the network sees disparities reduced by that amount, and then add the offset back to each output disparity.
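A minimal numpy sketch of the shift-and-add-back trick, assuming the common convention d = x_left − x_right (under which translating the right image by d_min pixels makes a true disparity d appear as d − d_min to the network; `run_model` below is a placeholder, not a real API):

```python
import numpy as np

def shift_right_image(right: np.ndarray, d_min: int) -> np.ndarray:
    """Translate the right image so a true disparity d is seen as d - d_min
    (convention: d = x_left - x_right; vacated columns are zero-filled)."""
    shifted = np.zeros_like(right)
    shifted[:, d_min:] = right[:, :-d_min]
    return shifted

def restore_disparity(disp: np.ndarray, d_min: int) -> np.ndarray:
    """Add the offset back so disparities refer to the original image pair."""
    return disp + d_min

# usage sketch (run_model stands in for whichever stereo network you use):
#   disp = restore_disparity(run_model(left, shift_right_image(right, 64)), 64)
```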

Likewise, some deep learning models limit the disparity range they operate on. In such cases, the rectified input images need to be downscaled to cover the same measurable depth range, at the cost of depth precision.
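The bookkeeping for this resize route is a single scale factor: downscaling the rectified pair by s divides every true disparity by s, so a network with a fixed 256-level range effectively covers 256·s full-resolution disparities, and the predicted values must be multiplied by s afterwards (which also multiplies the quantization step, hence the precision loss). A sketch, with the image resizing itself left to your library of choice:

```python
import numpy as np

def to_full_res_disparity(disp_small: np.ndarray, scale: float) -> np.ndarray:
    """Convert disparities measured on a 1/scale-downscaled pair back to
    full-resolution pixel units (upsampling the map itself is a separate step)."""
    return disp_small * scale

def effective_range(d_max: int, scale: float) -> float:
    """Largest full-resolution disparity covered after downscaling by `scale`."""
    return d_max * scale
```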

Many models also require scene-specific fine-tuning (though advanced “foundation” stereo networks generalize zero-shot), whereas SGBM and SGBM-based hybrid models do not need any tuning and deliver reliable, out-of-the-box performance in any scene.

Comparative Experimental Analysis

An experimental benchmark was conducted using a random pattern placed at a known distance of 5 meters. The camera operated at a resolution of 1024×768 (Quarter mode). For the accuracy test, a region of interest (ROI) was defined entirely within the textured portion of the pattern, ensuring that only well-defined features contributed to the depth statistics. Coverage was evaluated on the same wall in two stages: first over the textured patch itself, and then over the adjacent textureless smooth white surface. The images presented below show the resulting disparity maps.
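The benchmark statistics reduce to straightforward ROI arithmetic; a sketch assuming invalid pixels are marked NaN in the depth map (the function name and 5 m ground truth mirror the setup described above):

```python
import numpy as np

def roi_stats(depth: np.ndarray, roi, gt: float = 5.0):
    """Coverage (%) and median depth/error over an ROI (y0, y1, x0, x1).
    Invalid pixels are assumed NaN, as in a masked depth map."""
    y0, y1, x0, x1 = roi
    patch = depth[y0:y1, x0:x1]
    valid = ~np.isnan(patch)
    coverage = 100.0 * valid.mean()
    median_depth = float(np.nanmedian(patch)) if valid.any() else float("nan")
    return coverage, median_depth, median_depth - gt
```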

Image 4-Comparative Experimental Analysis.jpg

Results from these tests are summarized below:

Method                     | Coverage on textured region (%) | Coverage on textureless regions (%) | Median Depth (m) | Median Error (m) | Median Error (%) | Frame Rate (FPS)
SGBM (On-board)            | 100.00                          | 18.48                               | 5.052            | 0.052            | 1.03             | 38
SGBM + Neural Refinement   | 100.00                          | 100.00                              | 5.058            | 0.058            | 1.17             | 3
Selective-Stereo           | 100.00                          | 100.00                              | 4.988            | -0.012           | -0.24            | 0.5

Observations:

  • Neural refinement substantially improves disparity completeness while only slightly increasing the median error.
  • Selective-Stereo provides superior completeness and minimal bias, making it effective for precision-demanding applications.

Practical Application Guidelines

Recommendations for specific application scenarios:

  • High-Speed Real-Time Applications (≥30 FPS): Utilize Bumblebee X’s onboard SGBM, optionally combined with a pattern projector, for improved completeness.
  • Balanced Coverage and Latency: Implement neural disparity refinement asynchronously with the onboard SGBM for enhanced coverage.
  • Highest Accuracy and Completeness: Opt for Selective-Stereo when a low frame rate is acceptable and high accuracy is essential.

Conclusion

Deep learning methods significantly enhance Bumblebee X’s on-board SGBM performance in complex environments. Lightweight refinement offers real-time improvements with modest hardware, while end-to-end networks deliver superior quality when speed is less critical. Unlike many stereo cameras constrained by fixed pipelines or lack of on-board processing, Bumblebee X supports both methods, giving users the flexibility to optimize for accuracy, speed, and compute across diverse applications.