AADS: Augmented autonomous driving simulation using data-driven algorithms

Data preprocessing

AADS used scanned real images for simulation. Our goal was to simulate new vehicles and pedestrians, with new trajectories, in the scanned scene. To achieve this goal, before simulating data, our AADS needed to remove moving objects, e.g., vehicles and pedestrians, from the scanned RGB images and point clouds. Automatic detection and removal of moving objects is a research topic in its own right; fortunately, most recent datasets provide semantic labels for RGB images as well as point clouds. Using the semantic information in the ApolloScape dataset, we removed objects of specific types, e.g., cars, bicycles, trucks, and pedestrians. After removing moving objects, numerous holes remain in both the RGB images and the point clouds, and they must be carefully filled to generate a complete and clean background for AADS. We used a recent RGB image inpainting method (36) to close the holes in the images; this method uses semantic labels to guide a learning-based inpainting technique and achieved acceptable quality. Point cloud completion is introduced as part of the depth processing for novel background synthesis (see the next section).
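As an illustration of this preprocessing step, the sketch below masks out dynamic classes from a per-pixel semantic label map and hands the holes to an inpainting routine. The class IDs, the mask dilation, and the use of OpenCV's classical inpainting as a stand-in for the learning-based method in (36) are assumptions made for illustration only.

```python
import cv2
import numpy as np

# Hypothetical IDs for dynamic classes (cars, bicycles, trucks, pedestrians, ...);
# the real mapping must be taken from the ApolloScape label definition.
DYNAMIC_CLASS_IDS = {33, 34, 35, 36, 37, 38}

def build_moving_object_mask(label_map: np.ndarray, dilate_px: int = 5) -> np.ndarray:
    """Return a binary mask (255 = moving object) from a per-pixel label map."""
    mask = np.isin(label_map, list(DYNAMIC_CLASS_IDS)).astype(np.uint8) * 255
    # Dilate slightly so soft object boundaries and contact shadows are also removed.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * dilate_px + 1,) * 2)
    return cv2.dilate(mask, kernel)

def remove_moving_objects(rgb: np.ndarray, label_map: np.ndarray):
    """Mask dynamic objects and fill the holes.
    cv2.inpaint is only a placeholder for the semantic-guided, learning-based
    inpainting method used in the paper."""
    mask = build_moving_object_mask(label_map)
    filled = cv2.inpaint(rgb, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
    return filled, mask
```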

Given synthesized background images, we could place any 3D CG model on the ground and then render it into the image space to create a new, composite simulation image. However, to make the composite image photorealistic (i.e., close in appearance to a real image), we must first estimate the illumination in the background images. This enables our AADS to render 3D CG models with consistent shadows on the ground and on vehicle bodies. We solved this outdoor lighting estimation problem with the method in (37). In addition, to further improve the realism of composite images, our AADS also provided an optional feature to enhance the appearance of 3D CG models by transferring textures from real images. Specifically, given an RGB image with unremoved vehicles, we retrieved the corresponding 3D vehicle models and aligned those models to the input image using the method in (38). Similar to (39), we then used symmetry priors to transfer and complete the appearance of the 3D CG models from the aligned real images.

Augmented background synthesis

Given a dense point cloud and image sequence produced from automatic scanning, the most straightforward way to build virtual assets for an AD simulator is to reconstruct the full environment. This line of work focuses on using geometry and texture reconstruction methods to produce complete large-scale 3D models from the captured real scene. However, these methods cannot avoid hand editing while modeling, which is expensive in terms of time, computation, and storage.

Here, we directly synthesized an augmented background in a specific view as needed during simulation. Our method avoided modeling the full scene ahead of running the simulation. Technically, our method created such a scan-and-simulate system by using the view synthesis technique.

To synthesize a target view, we first needed dense depth maps for the input reference images. Ideally, these depth maps would be extracted from the scanned point cloud. Unfortunately, such depth maps are incomplete and unreliable. In our case, these problems came with scanning: (i) The baseline of our stereo camera was too small compared with the scale of street-view scenes, so there were too few data points for objects far from the camera. (ii) The scenes were full of moving vehicles and pedestrians that needed to be removed, and their removal produced holes in the scanned point clouds. (iii) The scenes were often complicated (e.g., many buildings are fully covered with glass), which caused the scanning sensor to fail and thus left the scanned point clouds incomplete. We introduced a two-step procedure, depth filtering and depth completion, to address the unreliability and incompleteness of the depth maps.

For depth filtering, we applied a judiciously selected combination of pruning filters. The first is a median filter: A pixel was pruned if its depth value differed sufficiently from the median-filtered value. To avoid removing thin structures, the kernel size of the median filter was kept small (5 by 5 in our implementation). Then, a guided filter (40) was applied to preserve thin structures and to enhance edge alignment between the depth map and the color image. After obtaining a much more reliable depth map, we completed it by propagating the existing depth values into the pixels of the holes, solving a first-order Poisson equation similar to the one used in colorization algorithms (41).
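The following is a minimal sketch of this two-step depth processing, assuming a float32 depth map (0 = missing) aligned with the RGB image; the thresholds and filter parameters are illustrative, the guided filter comes from the opencv-contrib ximgproc module rather than the authors' implementation, and hole filling is posed as harmonic (first-order Poisson) interpolation with known depths as hard constraints.

```python
import cv2
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def filter_depth(depth, rgb, median_ksize=5, prune_thresh=2.0,
                 guide_radius=8, guide_eps=1e-3):
    """Prune unreliable depth with a small median filter, then align depth
    edges to the color image with a guided filter (requires opencv-contrib)."""
    depth = depth.astype(np.float32)
    med = cv2.medianBlur(depth, median_ksize)
    pruned = depth.copy()
    pruned[np.abs(depth - med) > prune_thresh] = 0.0            # 0 marks "missing"
    guide = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    return cv2.ximgproc.guidedFilter(guide, pruned, guide_radius, guide_eps)

def complete_depth(depth):
    """Fill holes (depth == 0) by solving a first-order Poisson system:
    unknown pixels take the average of their 4-neighbors, known pixels are
    hard constraints. Written for clarity, not speed."""
    h, w = depth.shape
    idx = np.arange(h * w).reshape(h, w)
    known = depth > 0
    rows, cols, vals, b = [], [], [], np.zeros(h * w)
    for y in range(h):
        for x in range(w):
            i = idx[y, x]
            if known[y, x]:
                rows.append(i); cols.append(i); vals.append(1.0)
                b[i] = depth[y, x]
                continue
            nbrs = [(y + dy, x + dx) for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= y + dy < h and 0 <= x + dx < w]
            rows.append(i); cols.append(i); vals.append(float(len(nbrs)))
            for ny, nx in nbrs:
                rows.append(i); cols.append(idx[ny, nx]); vals.append(-1.0)
    A = sp.csr_matrix((vals, (rows, cols)), shape=(h * w, h * w))
    return spsolve(A, b).reshape(h, w)
```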

Depth filtering and completion produced reliable dense depth maps that provided enough geometric information to render an image into virtual views. Similar to (23), given a target virtual view, we selected the four nearest reference views to synthesize it. For each reference view, we first used forward mapping to produce a depth map under the camera parameters of the virtual view and then performed depth inpainting to close small holes. A backward mapping step with an occlusion test was then used to warp the reference color image into the target view.
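A rough sketch of the forward depth mapping and backward color warping with an occlusion test is given below. It assumes pinhole intrinsics K and 4-by-4 rigid transforms between the cameras; the splatting loop and the occlusion threshold are simplifications, not the paper's exact procedure.

```python
import cv2
import numpy as np

def forward_warp_depth(depth_ref, K_ref, T_ref2tgt, K_tgt, tgt_shape):
    """Splat the reference depth map into the target view with a z-buffer.
    T_ref2tgt: 4x4 transform from reference-camera to target-camera coordinates."""
    h, w = depth_ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    valid = depth_ref > 0
    pix = np.stack([xs[valid], ys[valid], np.ones(int(valid.sum()))])   # 3 x N pixels
    pts_ref = np.linalg.inv(K_ref) @ pix * depth_ref[valid]             # back-project to 3D
    pts_tgt = (T_ref2tgt @ np.vstack([pts_ref, np.ones((1, pts_ref.shape[1]))]))[:3]
    front = pts_tgt[2] > 0                                              # keep points in front of the camera
    proj = K_tgt @ pts_tgt[:, front]
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    z = pts_tgt[2, front]
    depth_tgt = np.full(tgt_shape, np.inf)
    inside = (u >= 0) & (u < tgt_shape[1]) & (v >= 0) & (v < tgt_shape[0])
    for x, y, d in zip(u[inside], v[inside], z[inside]):                # nearest surface wins
        if d < depth_tgt[y, x]:
            depth_tgt[y, x] = d
    depth_tgt[np.isinf(depth_tgt)] = 0                                  # 0 marks holes to inpaint
    return depth_tgt

def backward_warp_color(rgb_ref, depth_ref, depth_tgt, K_ref, T_tgt2ref, K_tgt,
                        occ_thresh=0.3):
    """Sample reference colors for every target pixel; a depth test against
    the reference depth map flags occluded pixels."""
    h, w = depth_tgt.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    pts_tgt = np.linalg.inv(K_tgt) @ pix * depth_tgt.ravel()
    pts_ref = (T_tgt2ref @ np.vstack([pts_tgt, np.ones((1, h * w))]))[:3]
    proj = K_ref @ pts_ref
    u = (proj[0] / np.maximum(proj[2], 1e-9)).reshape(h, w).astype(np.float32)
    v = (proj[1] / np.maximum(proj[2], 1e-9)).reshape(h, w).astype(np.float32)
    u[depth_tgt == 0] = -1                                              # invalid pixels map outside
    v[depth_tgt == 0] = -1
    warped = cv2.remap(rgb_ref, u, v, cv2.INTER_LINEAR)
    sampled = cv2.remap(depth_ref.astype(np.float32), u, v, cv2.INTER_NEAREST)
    occluded = pts_ref[2].reshape(h, w) > sampled + occ_thresh          # a closer surface blocks the ray
    occluded[depth_tgt == 0] = True
    return warped, occluded
```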

A naïve way to synthesize the target image is to blend all the warped images together. However, when we blended the warped images using the view angle penalty following previous work (23), obvious artifacts always remained. Thus, we treated this as an image stitching problem rather than direct blending. Technically, each pixel $x_i$ of the synthesized image in the target virtual view is optimized to choose a color from one of the warped images. This can be formulated as a discrete pixel labeling energy function

$$\arg\min_{\{x_i\}} \sum_i \big[\lambda_1 E_1(x_i) + \lambda_2 E_2(x_i)\big] + \sum_{(i,j)\in N} \big[\lambda_3 E_3(x_i, x_j) + \lambda_4 E_4(x_i, x_j) + \lambda_5 E_5(x_i, x_j)\big] \tag{1}$$

Here, $x_i$ is the $i$th pixel of the target image, and $N$ is the set of $x_i$'s one-ring neighbors. $E_1(x_i)$ is the pixel-wise data term, defined by extending the view angle penalty in (42). In contrast to the scenarios in (42), depth maps of street-view scenes always contain pixels with large depth values, which make the view angle penalty too small. To address this problem, when the penalty is close to zero, we added another factor that exploits camera position information to help choose the appropriate image. Specifically, $E_1(x_i)$ is defined as $E_1(x_i) = E_{\mathrm{angle}}(x_i)\, W_{\mathrm{label}}(x_i)$. Here, $E_{\mathrm{angle}}(x_i)$ is the view angle penalty in (42); when it is too small, it is clamped (to 0.01 in our implementation) to balance the two factors and keep $W_{\mathrm{label}}(x_i)$ effective. $W_{\mathrm{label}}(x_i)$ is defined as $W_{\mathrm{label}}(x_i) = D_{\mathrm{pos}}(C_{x_i}, C_{\mathrm{syn}})\, D_{\mathrm{dir}}(C_{x_i}, C_{\mathrm{syn}})$, which evaluates the difference between the reference view and the target view. Here, $C_{x_i}$ and $C_{\mathrm{syn}}$ denote the camera of the reference view chosen for pixel $x_i$ and the camera of the target view, respectively; $D_{\mathrm{pos}}$ is the distance between the camera centers, and $D_{\mathrm{dir}}$ is the angle between the optical axes of the two cameras.
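A possible per-pixel implementation of this data term is sketched below; the geometric quantities (back-projected 3D points, camera centers, and optical axes) are assumed inputs, and the exact view angle penalty of (42) is approximated here by the angle between the two viewing rays.

```python
import numpy as np

def data_term(points_world, cam_ref_center, cam_ref_axis,
              cam_tgt_center, cam_tgt_axis, angle_floor=0.01):
    """Per-pixel data cost E1 for one candidate reference view (label).
    points_world: (H, W, 3) back-projected 3D point of each target pixel."""
    # View angle penalty: angle between the rays from the reference and the
    # target camera centers to the same 3D point (approximating (42)).
    ray_ref = points_world - cam_ref_center
    ray_tgt = points_world - cam_tgt_center
    cosang = np.sum(ray_ref * ray_tgt, axis=-1) / (
        np.linalg.norm(ray_ref, axis=-1) * np.linalg.norm(ray_tgt, axis=-1) + 1e-9)
    e_angle = np.arccos(np.clip(cosang, -1.0, 1.0))
    # Distant street-view points make the angle vanish; clamp it at a floor
    # so the camera-difference weight below still takes effect.
    e_angle = np.maximum(e_angle, angle_floor)
    # W_label: distance between the camera centers times the angle between
    # the optical axes of the reference and target cameras.
    d_pos = np.linalg.norm(cam_ref_center - cam_tgt_center)
    d_dir = np.arccos(np.clip(np.dot(cam_ref_axis, cam_tgt_axis), -1.0, 1.0))
    return e_angle * d_pos * d_dir
```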

$E_2(x_i)$ is the occlusion term, used to exclude occluded areas while minimizing the pixel labeling energy. Most occlusions appear near depth edges. Thus, when using the backward mapping method to render the warped images, we detected occlusions by depth testing: all pixels in the reference view with depth larger than the source depth form an occlusion mask, which is then used to define $E_2(x_i)$. Specifically, when the occlusion mask is not set, i.e., the pixel is not occluded, we set $E_2(x_i) = 0$ to add no penalty to the energy function. When a pixel is occluded, we set $E_2(x_i) = \infty$ to exclude it completely.

The remaining terms in Eq. 1 are smoothness terms: the color term $E_3(x_i, x_j)$, the depth term $E_4(x_i, x_j)$, and the color gradient term $E_5(x_i, x_j)$. Similar to (43), the color term is defined by a truncated seam-hiding pairwise cost first introduced in (44): $E_3(x_i, x_j) = \min(\lVert c_i^{x_i} - c_i^{x_j}\rVert^2, \tau_c) + \min(\lVert c_j^{x_i} - c_j^{x_j}\rVert^2, \tau_c)$, where $c_i^{x_i}$ is the RGB value of pixel $x_i$ in warped image $i$. Similarly, the depth term is defined as $E_4(x_i, x_j) = \min(\lvert d_i^{x_i} - d_i^{x_j}\rvert, \tau_d) + \min(\lvert d_j^{x_i} - d_j^{x_j}\rvert, \tau_d)$, where $d_i^{x_i}$ is the depth of pixel $x_i$ in warped depth map $i$; the RGB and depth truncation thresholds are set to $\tau_c = 0.5$ and $\tau_d = 5\ \mathrm{m}$ in our implementation, respectively. Because illumination may differ between reference images, the color difference alone is not sufficient to ensure a good stitch, so an additional gradient difference term $E_5(x_i, x_j)$ is used. Assuming that the gradient vector should be similar on both sides of the seam, we defined $E_5(x_i, x_j) = \lvert g_i^{x} - g_j^{x}\rvert + \lvert g_i^{y} - g_j^{y}\rvert$, where $g_i^{x}$ and $g_i^{y}$ are the horizontal and vertical color-space gradients taken from warped image $i$ at the seam pixel.
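The pairwise costs for one pair of neighboring pixels might be computed as follows; the warped color, depth, and gradient images per label are assumed inputs, and colors are assumed to lie in [0, 1] so that τc = 0.5 is meaningful.

```python
import numpy as np

def pairwise_cost(c_i, c_j, d_i, d_j, g_i, g_j, p, q, tau_c=0.5, tau_d=5.0):
    """Smoothness costs between neighboring pixels p and q when they take
    labels i and j (i.e., colors from warped images i and j).
    c_k: warped color image (H, W, 3) in [0, 1]; d_k: warped depth (H, W);
    g_k: color-space gradient (H, W, 2) holding the x and y components."""
    # E3: truncated seam-hiding color cost, evaluated in both source images.
    e3 = min(np.sum((c_i[p] - c_i[q]) ** 2), tau_c) + \
         min(np.sum((c_j[p] - c_j[q]) ** 2), tau_c)
    # E4: truncated depth difference across the seam.
    e4 = min(abs(d_i[p] - d_i[q]), tau_d) + min(abs(d_j[p] - d_j[q]), tau_d)
    # E5: gradient difference between the two candidate images at the seam,
    # which is robust to illumination changes between references.
    e5 = np.sum(np.abs(g_i[p] - g_j[p]))
    return e3, e4, e5
```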

The term weights in Eq. 1 were set to $\lambda_1 = 200$, $\lambda_2 = 1$, $\lambda_3 = 200$, $\lambda_4 = 100$, and $\lambda_5 = 50$. The labeling problem was solved using the sequential tree-reweighted message passing (TRW-S) method (45). Figure 8 shows the pipeline and results of augmented background synthesis. Note that, in Fig. 8C, a color difference may remain near the stitching seams after image stitching. To obtain consistent results, a modified Poisson image blending method (46) was applied. Specifically, we selected the nearest reference image as the source domain and fixed its edges to propagate color brightness across the stitching seams. After solving the Poisson equation, we obtained the fusion result shown in Fig. 8D. Note that, when the novel view is far from the input views, e.g., under large view extrapolation, artifacts appear because disoccluded regions cannot be filled by stitching together pieces of the input images. We marked those regions as holes and set their gradient values to zero, so the holes were filled with plausible blurred color when solving the Poisson equation.
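As a sketch of this post-processing, the snippet below assembles the guidance gradient field for the Poisson blend from the stitching labels and zeroes the gradient in disoccluded holes; `poisson_blend` is a hypothetical solver standing in for the modified Poisson blending of (46), and images are assumed to be float arrays.

```python
import numpy as np

def build_blending_gradients(warped, labels, hole_mask):
    """Assemble the guidance gradients for the modified Poisson blend.
    warped: list of float warped reference images (H, W, 3);
    labels: (H, W) chosen reference index per pixel from the stitching step;
    hole_mask: (H, W) bool, True where no reference provided content."""
    h, w, _ = warped[0].shape
    gx = np.zeros((h, w, 3))
    gy = np.zeros((h, w, 3))
    for k, img in enumerate(warped):
        sel = labels == k
        # Forward differences taken inside the image that owns the pixel,
        # so seams do not leak their color steps into the gradient field.
        gx[sel] = (np.roll(img, -1, axis=1) - img)[sel]
        gy[sel] = (np.roll(img, -1, axis=0) - img)[sel]
    # Disoccluded holes get zero gradient so the solver fills them with a
    # smooth, plausible (blurred) color.
    gx[hole_mask] = 0.0
    gy[hole_mask] = 0.0
    return gx, gy

# The gradients are then fed to a Poisson solver with the nearest reference
# image fixed as the Dirichlet boundary, e.g. (hypothetical helper):
# result = poisson_blend(warped[nearest], gx, gy, fixed_mask=(labels == nearest))
```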

<a href="https://robotics.sciencemag.org/content/robotics/4/28/eaaw0863/F8.large.jpg?width=800&height=600&carousel=1" title="Novel view synthesis pipeline. (A) The four nearest reference images were used to synthesize the target view in (D). (B) The four reference images were warped into the target view via depth proxy. (C) A stitching method was used to yield a complete image. (D) Final results in the novel view were synthesized after post-processing, e.g., hole filling and color blending." class="fragment-images colorbox-load" rel="gallery-fragment-images-1208680064" data-figure-caption="

Fig. 8 Novel view synthesis pipeline.

(A) The four nearest reference images were used to synthesize the target view in (D). (B) The four reference images were warped into the target view via depth proxy. (C) A stitching method was used to yield a complete image. (D) Final results in the novel view were synthesized after post-processing, e.g., hole filling and color blending.

” data-icon-position data-hide-link-title=”0″>

Fig. 8 Novel view synthesis pipeline.

(A) The four nearest reference images were used to synthesize the target view in (D). (B) The four reference images were warped into the target view via depth proxy. (C) A stitching method was used to yield a complete image. (D) Final results in the novel view were synthesized after post-processing, e.g., hole filling and color blending.

Moving objects’ synthesis and data augmentation

With a synthesized background image in the target view, a complete simulator should be able to synthesize realistic traffic with diverse moving objects (e.g., vehicles, bicycles, and pedestrians) and produce the corresponding semantic labels and bounding boxes in the simulated images and LiDAR point clouds.

We used the data-driven method described in (26) to address the challenges of traffic generation and moving-object placement. Specifically, given the localization information, we first extracted lane information from an associated high-definition (HD) map. Then, we randomly initialized the moving objects' positions within lanes and ensured that their directions were consistent with the lanes. We used agents to simulate the objects' movements under constraints such as avoiding collisions and yielding to pedestrians. The multiagent system was iteratively updated and optimized using previously captured traffic, following a data-driven method. Specifically, we estimated motion states from our real-world trajectory dataset ApolloScape-TRAJ; these motion states included the position, velocity, and control direction of cars, cyclists, and pedestrians. Note that this processing of real data was performed before simulation and only needed to be done once. During simulation runtime, we used an interactive optimization algorithm to make a decision for each agent at each frame of the simulation. In particular, we solved this optimization problem by choosing, from the dataset, a velocity that tends to minimize our energy function. The energy function was defined on the basis of the locomotion or dynamics rules of heterogeneous agents, including continuity of velocity, collision avoidance, attraction, direction control, and other user-defined constraints.
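A simplified version of the per-agent, per-frame decision step could look like the following, where candidate velocities come from the real trajectory data and the energy combines velocity continuity, collision avoidance, and direction control; the weights and safe distance are illustrative, not the values used in (26).

```python
import numpy as np

def choose_velocity(agent_pos, agent_vel, lane_dir, neighbor_pos, velocity_samples,
                    w_cont=1.0, w_coll=5.0, w_dir=2.0, dt=0.1, safe_dist=2.0):
    """Pick, for one agent and one frame, the velocity sample (drawn from the
    real-trajectory dataset) that minimizes a simple energy."""
    best_v, best_e = agent_vel, np.inf
    for v in velocity_samples:                      # candidate velocities from the dataset
        e_cont = np.linalg.norm(v - agent_vel)      # continuity of velocity
        next_pos = agent_pos + v * dt
        if len(neighbor_pos):
            d = np.linalg.norm(neighbor_pos - next_pos, axis=1).min()
            e_coll = max(0.0, safe_dist - d)        # penalize closing below the safe distance
        else:
            e_coll = 0.0
        speed = np.linalg.norm(v) + 1e-9
        e_dir = 1.0 - np.dot(v / speed, lane_dir)   # deviation from the lane direction
        e = w_cont * e_cont + w_coll * e_coll + w_dir * e_dir
        if e < best_e:
            best_e, best_v = e, v
    return best_v
```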

With the generated traffic, i.e., the object placement in each simulation frame, we rendered the 3D models into the RGB image space and generated annotated data using the physically based rendering engine PBRT (47). Meanwhile, we generated a corresponding LiDAR point cloud with annotations using the method introduced in the next section.

LiDAR synthesis

Given 3D models and their placement, it is relatively straightforward to synthesize LiDAR point clouds with popular simulators such as CARLA (4). Nevertheless, there is an opportunity to take advantage of the characteristics of a specific LiDAR sensor (e.g., the Velodyne HDL-64E S3). We proposed a realistic point cloud synthesis method that models the specific LiDAR sensor in a data-driven fashion. A real LiDAR sensor captures the surrounding scene by measuring the time of flight of the pulses of each laser beam (48): a laser beam is emitted from the LiDAR and reflected from target surfaces, and a 3D point is generated if the returned pulse energy is large enough. We modeled the behavior of the laser beams to simulate this physical process. Specifically, an emitted laser beam is modeled by its vertical and azimuth angles and their angular noises, as well as the distance measurement noise. For example, the Velodyne HDL-64E S3 LiDAR sensor emits 64 laser beams at different vertical angles ranging from −24.33° to 2°. During data acquisition, the HDL-64E S3 rotates around its own upright direction and shoots laser beams at a predefined rate to accomplish 360° coverage of the scene. Ideally, such a model can be adapted to other types of LiDAR sensors, with parameters that depend on the specific sensor type. However, we found experimentally that the parameters vary considerably even among devices of the same type. To stay as close as possible to reality, we fitted the model to real point clouds to statistically derive those parameters.
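The beam model can be sketched as below; the evenly spaced vertical angles and the azimuth step are placeholders, since in practice each beam's vertical angle and its noise are fitted from real scans as described next.

```python
import numpy as np

def lidar_beam_directions(n_beams=64, vert_min_deg=-24.33, vert_max_deg=2.0,
                          azimuth_step_deg=0.18, sigma_azimuth_deg=0.05, rng=None):
    """Generate noisy unit beam directions for one 360° sweep of an
    HDL-64E-style sensor. Evenly spaced vertical angles are used here for
    brevity; the fitted per-beam angles should replace them."""
    rng = rng or np.random.default_rng()
    vert = np.deg2rad(np.linspace(vert_min_deg, vert_max_deg, n_beams))
    azim = np.deg2rad(np.arange(0.0, 360.0, azimuth_step_deg))
    # Per-shot azimuth angular noise; a per-beam vertical noise could be
    # added the same way using the fitted variances.
    a = azim[None, :] + np.deg2rad(
        rng.normal(0.0, sigma_azimuth_deg, (n_beams, azim.size)))
    v = vert[:, None]
    dirs = np.stack([np.cos(v) * np.cos(a),
                     np.cos(v) * np.sin(a),
                     np.sin(v) * np.ones_like(a)], axis=-1)   # (n_beams, n_azimuth, 3)
    return dirs
```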

Specifically, we collected real point clouds from HDL-64E S3 sensors mounted on parked vehicles, which guaranteed smooth point curves from the different laser beams. The points of each laser beam were then marked manually and fitted by a cone whose apex is located at the LiDAR center. The half-angle of the cone minus π/2 gives the real vertical angle, whereas the noise variance was determined from the deviation of the lines constructed by the cone apex and the points from the cone surface. The real vertical angles usually differed from the ideal angles by 1° to 3°. In our implementation, we modeled the noise with zero-mean Gaussian distributions, setting the distance noise variance to 0.5 cm and the azimuth angular noise variance to 0.05°.
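For one manually labeled beam, the fitted vertical angle and its angular noise can be estimated directly from the point elevations, as in this short sketch (sensor assumed at the origin):

```python
import numpy as np

def fit_beam_vertical_angle(points):
    """Estimate one beam's vertical angle and its angular noise from the
    manually labeled points of that beam; each point's elevation angle is a
    sample of the beam's vertical angle."""
    horiz = np.linalg.norm(points[:, :2], axis=1)   # distance in the ground plane
    elev = np.arctan2(points[:, 2], horiz)          # per-point elevation angle
    return elev.mean(), elev.std()                  # fitted angle and noise std
```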

To generate a point cloud, we computed the intersections between the laser beams and the scene. Specifically, we proposed a cube map-based method to combine the scene background, represented as points, with the meshes of the 3D CG models. Instead of computing intersections between the beams and this mixed data, we computed intersections with projected maps (e.g., depth maps) of the scene, which offer equivalent information in a much simpler form. Note that our LiDAR simulation method can easily be extended to arbitrary LiDAR sensors and to sensor configurations with different numbers and poses of sensors. Figure S1 shows visual results of our LiDAR simulation.
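A simplified ray lookup against a panoramic depth map rendered from the sensor center (standing in for the cube-map representation described above) is sketched below; the 0.5-cm range noise from the fitted sensor model is added to each return.

```python
import numpy as np

def beams_to_points(dirs, pano_depth, sigma_range=0.005, rng=None):
    """Turn beam directions into a point cloud by looking up the range in an
    equirectangular depth map rendered from the LiDAR center (a simplification
    of the cube-map lookup). pano_depth[v, u] stores the distance to the
    nearest surface along that direction; 0 means no return."""
    rng = rng or np.random.default_rng()
    d = dirs.reshape(-1, 3)
    az = np.arctan2(d[:, 1], d[:, 0])                # azimuth in [-pi, pi)
    el = np.arcsin(np.clip(d[:, 2], -1.0, 1.0))      # elevation in [-pi/2, pi/2]
    h, w = pano_depth.shape
    u = ((az + np.pi) / (2 * np.pi) * (w - 1)).astype(int)
    v = ((np.pi / 2 - el) / np.pi * (h - 1)).astype(int)
    dist = pano_depth[v, u]
    hit = dist > 0                                    # drop beams with no return
    noisy = dist[hit] + rng.normal(0.0, sigma_range, int(hit.sum()))  # ~0.5 cm range noise
    return d[hit] * noisy[:, None]                    # 3D points in sensor coordinates
```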