The RoI pooling layer is a form of max pooling in which the pooling window size depends on the size of the input region, so the output always has the same fixed size. This is needed because the fully connected (FC) layers expect a fixed-size input, while the regions fed to them can differ in size.
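To make the idea concrete, here is a minimal NumPy sketch of RoI max pooling on a single-channel feature map. The function name `roi_max_pool`, the `(x1, y1, x2, y2)` box format (in feature-map coordinates, end-exclusive), and the floor/ceil binning are assumptions chosen for illustration; real implementations may quantize the bins differently.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of a 2-D feature map to a fixed output size.

    feature_map: array of shape (H, W).
    roi: (x1, y1, x2, y2) in feature-map coordinates, end-exclusive (assumed format).
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    out_h, out_w = output_size

    # Bin edges scale with the region size, so the bin (pool) size adapts
    # to the input while the number of bins stays fixed.
    row_edges = np.linspace(0, h, out_h + 1)
    col_edges = np.linspace(0, w, out_w + 1)

    out = np.empty(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        r0 = int(np.floor(row_edges[i]))
        r1 = int(np.ceil(row_edges[i + 1]))
        for j in range(out_w):
            c0 = int(np.floor(col_edges[j]))
            c1 = int(np.ceil(col_edges[j + 1]))
            out[i, j] = region[r0:r1, c0:c1].max()
    return out

# Two proposals of different sizes on the same 6x6 feature map.
feature_map = np.arange(36).reshape(6, 6)
pooled_a = roi_max_pool(feature_map, (0, 0, 4, 4))  # 4x4 region
pooled_b = roi_max_pool(feature_map, (1, 2, 6, 5))  # 3x5 region
print(pooled_a.shape, pooled_b.shape)  # both (2, 2)
```

Both regions end up with the same output shape, which is what lets the FC layers accept them despite their different input sizes.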
The RoI layer takes two inputs: the region proposals and the activations of the last convolutional layer. For example, consider the following input image and its proposals:
The following table summarizes the differences between the methods:
| | R-CNN | Fast R-CNN | Faster R-CNN |
|---|---|---|---|
| Test time per image | 50 seconds | 2 seconds | 0.2 seconds ... |