ip: apply dual loop unrolling in ip4_rewrite
Too many prefetches within loop unrollings induce bottleneck and
performance degradation on some CPUs which have less cache line fill
buffers, e.g, Arm Cortex-A72.
Apply dual loop unrolling and tune prefetches manually to remove
hot-spot with prefetch instructions, to get throughput improvement.
It brings about 7% throughput improvement and saves 28% clocks with
ip4_rewrite nodes on Cortex-A72 CPUs.
Type: feature
Change-Id: I0d35ef19faccbd7a5a4647f50bc369bfcb01a20d
Signed-off-by: Lijian Zhang <Lijian.Zhang@arm.com>