Let's consider the following figure:
As we can see here, the camera captures the real-world video, which serves as the frame of reference. The graphics system generates the virtual objects that need to be overlaid on top of that video. The video-merging block is where the two streams are combined: it has to work out where each virtual object belongs in the scene and overlay it so that it looks like part of the real world.
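To make the merging step a little more concrete, here is a minimal sketch of what compositing a single frame might look like. It assumes the camera frame and the rendered virtual object are already available as NumPy arrays, and that we already know where the object should go. The function name `merge_frame`, the `mask` argument, and the fixed `top_left` position are illustrative assumptions, not part of any particular library. A real system would estimate the position from the scene instead of hardcoding it.

```python
import numpy as np


def merge_frame(frame, overlay, mask, top_left):
    """Blend `overlay` into a copy of `frame` at `top_left` (row, col).

    `mask` is a single-channel uint8 image the same size as `overlay`:
    255 means "show the virtual object", 0 means "keep the camera pixel".
    All names here are hypothetical and chosen for illustration only.
    """
    y, x = top_left
    h, w = overlay.shape[:2]

    merged = frame.copy()              # leave the original camera frame intact
    roi = merged[y:y + h, x:x + w]     # region of the frame the object covers

    # Scale the mask to [0, 1] and add a channel axis so it broadcasts over BGR.
    alpha = (mask.astype(np.float32) / 255.0)[..., None]

    # Per-pixel blend: virtual object where alpha is 1, camera frame where it is 0.
    blended = alpha * overlay + (1.0 - alpha) * roi
    merged[y:y + h, x:x + w] = blended.astype(frame.dtype)
    return merged


# Stand-in data: a flat gray "camera frame" and a brighter 2x2 "virtual object".
frame = np.full((4, 4, 3), 100, dtype=np.uint8)
overlay = np.full((2, 2, 3), 200, dtype=np.uint8)
mask = np.array([[255, 0],
                 [0, 255]], dtype=np.uint8)

merged = merge_frame(frame, overlay, mask, top_left=(1, 1))
```

In this sketch, the hard part of the video-merging block is hidden inside `top_left` and `mask`: deciding where the object goes and which of its pixels are visible is exactly the work that the merging block has to do for every frame.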