Here’s how to use the ROS 2 Navigation stack to get around and check the health of your data! My workspace has the repos linked from openrover-demo.repos, and everything else (including the Navigation2 stack) is installed on Ubuntu via apt.

Which version of ROS to use?

Every version of ROS 2 has its issues right now. RViz on the Dashing release may fail to launch under some middlewares. RViz on Eloquent does not display maps properly. And frequent API changes on master keep breaking things in the Nav2 stack.

Additionally, some packages we use are only available in the release testing repos at this time (diagnostics, rmw_cyclonedds). But let’s face it: at this point everyone using ROS 2 is already a beta tester anyway :-). As of this writing, I recommend using Dashing, with the ros2-testing package list instead of the normal release packages. Here’s how to set that up on Ubuntu:

echo "deb [arch=amd64,arm64] http://packages.ros.org/ros2-testing/ubuntu bionic main" | sudo tee /etc/apt/sources.list.d/ros2-testing.list
sudo apt update
sudo apt install -y ros-dashing-desktop ros-dashing-rmw-cyclonedds-cpp
rosdep install --from-paths ~/ros2_ws/src -iyr

Middleware build tips

When building your workspace after installing a new RMW implementation, you likely need to pass --cmake-clean-cache to colcon build or you will get weird build errors about missing symbols.

Make sure you don’t set RMW_IMPLEMENTATION=rmw_cyclonedds_cpp until after you’ve built your workspace or it will fail to build.
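Putting those two tips together, a safe ordering looks roughly like this (a sketch: the workspace path is the `~/ros2_ws` from the install step above, and the `colcon build` line is shown as a comment since it must run inside your own workspace):

```shell
# Safe ordering when switching RMW implementations (sketch):
# 1) make sure RMW_IMPLEMENTATION is NOT set while building,
# 2) rebuild with a clean CMake cache after installing the new RMW,
# 3) only then select the new middleware for runtime.
unset RMW_IMPLEMENTATION
# colcon build --cmake-clean-cache   # run this inside your workspace (~/ros2_ws)
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
echo "runtime middleware: ${RMW_IMPLEMENTATION}"
```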

What about middleware?

Glad you asked. I’ve tried eight RMW implementations, all on ROS 2 Dashing with the ros2-testing repositories, installed from binary. Long story short, I recommend CycloneDDS. Here’s what I found:

Which middleware is fastest? See for yourself!
  • Eclipse CycloneDDS: The new kid on the block seems to work the best right now. This comes up fast (~10s) and most predictably right now.
  • ADLINK OpenSplice (Commercial Edition): Bringup was just as fast (~10s), though some setup was required: ignore/click through installer warnings about the Java version; edit or add your ospl.xml file to have OpenSplice/Domain/SingleProcess=false and OpenSplice/Domain/Database/Size=32M; make sure you’ve sourced the OpenSplice script before building or running your ROS2 workspace; and run ospl start before launching ROS2 nodes. This requires a paid license and separate installation, so it’s not what we recommend for most users.
  • RTI Connext: This one fared well, but not great. Bringup took almost 2 minutes. Requires a paid license for commercial use.
  • eProsima Fast RTPS: As of the most recent update (Fast RTPS 1.8.2), performance is now on par with CycloneDDS (~10s)! Before that update, bringup would sometimes complete quickly (<2 min) but would sometimes hang on startup. I usually found that if nav2 hung bringing up AMCL, or if RViz failed to display laser scan data, restarting RViz would fix the issue and un-hang the nav2 bringup.
  • ADLINK OpenSplice (Community Edition): Sometimes I was able to make this work quickly (bringup < 2 min), but often not (> 10 min). When this RMW failed, it tended to fail hard: while it was struggling to bring up the nav2 stack, the network quality, including pings between unrelated computers (sorry, coworkers), was negatively affected. Even with a recent fix to reduce WiFi multicast traffic, performance seemed problematic. On the plus side, OpenSplice was the only middleware to give warning messages when things were going badly.
  • TwinOaks CoreDX: Failed to build on Dashing, so I wasn’t able to try it out.
  • Intel DPS: Bringup hung and did not finish.
  • Gurum CoreDDS: Bringup crashed.
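For reference, the two OpenSplice settings mentioned above are paths into its ospl.xml configuration file. A minimal sketch of that fragment, assuming the standard OpenSplice config layout (keep the rest of your existing ospl.xml intact):

```xml
<OpenSplice>
  <Domain>
    <!-- run a standalone daemon rather than single-process mode -->
    <SingleProcess>false</SingleProcess>
    <Database>
      <!-- enlarge the shared-memory database -->
      <Size>32M</Size>
    </Database>
  </Domain>
</OpenSplice>
```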

My testing was not done in a controlled environment and some of the issues may be in the RMW itself, the ROS2 compatibility layer, or in RViz and/or the Nav2 stack. But I’ll keep an eye out in the future, since things are sure to change as ROS 2 matures.

The bringup

Here’s the sequence of steps to get the rover to drive somewhere:

  • On your workstation: ros2 launch openrover_demo
  • On the rover: ros2 launch openrover_demo
  • On either the workstation or the rover: ros2 launch openrover_demo
  • In RViz, use the 2D Pose Estimate tool to designate the rover’s approximate location and heading.
  • In RViz, use the 2D Nav Goal tool to designate the rover’s target location and heading.

You probably want to launch RViz first, so you can see a pretty overview of what’s going on. If you use my launch files, with map_server publishing a map file and the fixed frame set to ‘map’, the first thing you’ll see is the map.

When you start up the Navigation 2 stack, you will see lots of messages like the following in its window. This is because the AMCL node can’t fully start up until it knows an initial guess of the robot’s position. Don’t worry about these yet:

[world_model-4] [INFO] [global_costmap.world_model]: Timed out waiting for transform from base_link to map to become available, tf error: . canTransform returned after 1.00495 timeout was 1.
[world_model-4] [INFO] [global_costmap.world_model_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567646510.784 for reason(0)

This initial guess must be provided in the map frame (RViz must have Fixed Frame = map). Click “2D Pose Estimate” in the top toolbar and then click on the map approximately where the robot is in real life and drag the mouse in the direction the robot is facing. Your initial guess can be pretty sloppy – it will get better as the robot drives around. Also, the RViz log may incorrectly display “setting goal” – this is just a bug in logging; you are indeed setting the pose estimate, not the goal.

Before giving the robot commands, all your sensor data should be working: the robot model, lidar data, and AMCL particle cloud should show up in RViz. Note the particles will not move around until your robot starts going somewhere – this is a feature (not a bug), since additional data from the same exact perspective won’t generally improve the location estimate. In a few seconds the local costmap should show up as well.

Finally, use the “2D Nav Goal” tool from the top toolbar. It works similarly to the 2D Pose Estimate tool. If you have it enabled, the planned path will show up as a curve on the map.

Here’s a video of the whole bringup and driving around the office a bit. You can see the part where the robot runs into an obstacle below its line of sight and spins around until it is free of the obstacle and can continue on its way! Have fun driving around!

RViz useful displays

Here are the visualizations I like to enable. They’re all saved in a file called default.rviz so RViz can easily load them all:

  • TF: this data is useful for reconciling all the different sensor data. Mainly I keep this activated for the little warning icon it gives me if something is wrong.
  • RobotModel: Where the robot is and what it looks like
  • LaserScan: What the lidar is bouncing off of, including walls, chairs, and my coworkers’ shins
  • Odometry: Where the robot is and where it is pointed.
  • AMCL Samples: The hypothetical locations AMCL considers when guessing the robot’s location. The tighter these are clustered, the more confident we are of the robot’s position.
  • Map: Obstacles the robot should expect to see with its lidar.
  • Costmap (global): Obstacles taken into account for planning navigation routes over long distances.
  • Costmap (local): Obstacles taken into account for planning navigation routes over short distances.
  • Planned Path: Trajectory the robot is currently following.
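Each of these corresponds to an entry under Displays in default.rviz. A trimmed sketch of such a file, assuming the standard rviz_default_plugins display classes (the /scan, /map, and /plan topic names are my assumptions and may differ in your setup):

```yaml
Visualization Manager:
  Displays:
    - Class: rviz_default_plugins/TF
      Enabled: true
    - Class: rviz_default_plugins/RobotModel
      Enabled: true
    - Class: rviz_default_plugins/LaserScan
      Enabled: true
      Topic: /scan
    - Class: rviz_default_plugins/Map
      Enabled: true
      Topic: /map
    - Class: rviz_default_plugins/Path
      Enabled: true
      Topic: /plan
```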

NOTE: under Dashing and Fast RTPS, RViz had trouble receiving TF data. In the attached screenshot nav2 was able to incorporate lidar data into the costmap, but RViz was not able to visualize the robot model nor the lidar data due to tf issues.

Network Quality and OpenSplice

I found that network quality strongly affects how well everything works. Make sure the computer and the rover are both connected over 5 GHz WiFi. We use a Ubiquiti Unifi router, which shows an “experience score” for each device. A low experience score is usually the cause of bad performance, though in the case of OpenSplice I found it was often the other way around – ROS2 could cause the network to be negatively impacted.

The computer Rhea is going to have trouble. As seen in the Unifi management console, even though Rhea has good signal strength, there’s lots of network interference on the 2.4 GHz WiFi band. This is solved by switching to the 5GHz band like Gauss.

Nav2 leans heavily on the tf2 library, which is used for reconciling data in different coordinate frames. The most frequent issue we see here is the world model being unable to make sense of lidar data due to TFs getting delayed. That looks like the following in the Navigation 2 logs:

[world_model-3] [INFO] [global_costmap.world_model_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567192709.161 for reason(0)
[dwb_controller-4] [INFO] [local_costmap.dwb_controller_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567192709.233 for reason(0)
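When a log is long, it helps to count these drops per node rather than eyeball them. A quick sketch using the two sample lines above (the /tmp path is arbitrary):

```shell
# Write the two sample log lines to a scratch file, then count drops per node.
cat > /tmp/nav2_sample.log <<'EOF'
[world_model-3] [INFO] [global_costmap.world_model_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567192709.161 for reason(0)
[dwb_controller-4] [INFO] [local_costmap.dwb_controller_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567192709.233 for reason(0)
EOF
# One count per node prefix, e.g. one drop each for world_model and dwb_controller.
grep 'Drop message' /tmp/nav2_sample.log | awk '{print $1}' | sort | uniq -c
```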

This is a symptom of a mismatch between the QoS (Quality of Service) settings of the laser data and the transform data.

Lidar data uses QoS reliability = “best effort”, which means that it will send data as fast as it can, regardless of whether receivers can keep up.

Transform data uses QoS reliability = “reliable”, which means that if a receiver does not confirm to the sender that data arrived (whether it was lost in transit or the receiver was too busy), the sender will attempt to retransmit it. If the sender produces data faster, on average, than it can successfully transmit it, its transmit queue fills up and it falls further and further behind, as it can no longer keep up.

So you see the problem – transform data can get backlogged whereas the sensor data to be transformed won’t be. In the best case, this means the sensor data is slightly delayed. Worst case, the sensor data gets dropped because the TransformBuffer isn’t set up to wait long enough.
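A toy model of that backlog, assuming a sender that produces 10 reliable messages per second but can only get 8 per second across a congested link (the rates are made up for illustration):

```shell
# Each second the queue grows by (produced - transmitted) messages;
# the delay subscribers see is roughly backlog / production rate.
produced=10; transmitted=8; backlog=0
for second in 1 2 3 4 5; do
  backlog=$(( backlog + produced - transmitted ))
  echo "after ${second}s: backlog=${backlog} msgs (~$(( backlog * 1000 / produced )) ms behind)"
done
```

After five seconds the sender is already a full second behind, which is exactly the regime where lidar scans outrun their transforms.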

Some things tend to increase this backlog:

  • When nodes spin up or shut down, it tends to cause delays.
  • When nodes crash, the delays are even bigger, as the sender has no indication that the listening node is gone.
  • With WiFi multicast (which is the default for OpenSplice on Dashing), messages take *much* longer to send.
  • When there is lots of unrelated WiFi traffic, messages can get lost or take longer to send (even if large file transfers or video streaming are working fine).

You can tell the TFs are healthy with the green checkmark next to TF status in RViz. You can also run ros2 run tf2_ros tf2_monitor, which will give you an idea of how much backlog there is. The number after “Average Delay” is the one to watch – it should be no bigger than 1.
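You can also pull that number out of tf2_monitor’s output programmatically. A sketch against a fabricated sample (the sample text below only approximates tf2_monitor’s real formatting, so adjust the field separator to match what your version prints):

```shell
# Fabricated, tf2_monitor-style output; the real tool's formatting may differ.
cat > /tmp/tf2_monitor_sample.txt <<'EOF'
Frame: base_link, published by odom_node, Average Delay: 0.012, Max Delay: 0.045
Frame: lidar_link, published by robot_state_publisher, Average Delay: 1.310, Max Delay: 2.800
EOF
# Flag any frame whose average delay exceeds the ~1 second rule of thumb above.
awk -F'Average Delay: ' '{ split($2, a, ","); if (a[1] + 0 > 1) print "LAGGING:", $0 }' /tmp/tf2_monitor_sample.txt
```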

Get In Touch

Need help using ROS to build an autonomous solution? Contact us to discuss our consulting capabilities.

Contact Us