Let's see how we can address the problem by introducing a more optimized architecture:
First, we will divide the execution into two independent parts. The main thread, or video thread, simply keeps reading frames, pushes each one onto a stack, and displays it on the screen together with the bounding boxes, if any exist. The second thread, the YOLO thread, is independent of the main thread: it reads frames from the stack and produces bounding-box predictions. Since the YOLO thread, which is the slowest part, no longer runs inside the main thread, it isn't slowing down the ...
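To make the two-thread idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the actual implementation: frames are plain integers instead of camera images, `fake_detect` is a hypothetical stand-in for the slow YOLO forward pass, and the names `FrameStack`, `yolo_worker`, and `run_pipeline` are made up for this example. The stack is bounded and is cleared after each pop, so the YOLO thread always works on the most recent frame.

```python
import threading
import time
from collections import deque


class FrameStack:
    """Thread-safe bounded stack (hypothetical helper): newest frame comes out first."""

    def __init__(self, maxlen=5):
        self._frames = deque(maxlen=maxlen)  # oldest frames fall off when full
        self._lock = threading.Lock()

    def push(self, frame):
        with self._lock:
            self._frames.append(frame)

    def pop_latest(self):
        """Return the newest frame and drop the stale ones, or None if empty."""
        with self._lock:
            if not self._frames:
                return None
            frame = self._frames.pop()
            self._frames.clear()
            return frame


def fake_detect(frame):
    """Assumed stand-in for a slow YOLO prediction on one frame."""
    time.sleep(0.05)
    return [(frame, "box")]


def yolo_worker(stack, shared, stop_event):
    """YOLO thread: take the latest frame, predict, publish the boxes."""
    while not stop_event.is_set():
        frame = stack.pop_latest()
        if frame is None:
            time.sleep(0.001)  # nothing to do yet, avoid busy-waiting
            continue
        boxes = fake_detect(frame)
        with shared["lock"]:
            shared["boxes"] = boxes


def run_pipeline(num_frames=100, frame_interval=0.002):
    """Main (video) thread: read frames, push them, show them with the latest boxes."""
    stack = FrameStack()
    shared = {"lock": threading.Lock(), "boxes": None}
    stop_event = threading.Event()
    worker = threading.Thread(
        target=yolo_worker, args=(stack, shared, stop_event), daemon=True
    )
    worker.start()

    shown = []
    for frame in range(num_frames):  # stands in for reading from the camera
        stack.push(frame)
        with shared["lock"]:
            boxes = shared["boxes"]  # may be None or lag behind the frame
        shown.append((frame, boxes))  # stands in for drawing and displaying
        time.sleep(frame_interval)

    stop_event.set()
    worker.join()
    return shown


if __name__ == "__main__":
    result = run_pipeline()
    with_boxes = sum(1 for _, boxes in result if boxes is not None)
    print(f"displayed {len(result)} frames, {with_boxes} of them with boxes")
```

Notice that the main loop never waits for a prediction: it shows every frame with whatever boxes were published last, so the boxes can lag a few frames behind the video. That lag is the trade-off this design accepts in exchange for smooth playback.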