Starting with sm_70, a fundamental change in the architecture occurred, called "Independent Thread Scheduling". It means the threads of a warp are no longer guaranteed to execute in lock-step. We could try to exploit this in the port. For instance, is it still necessary to emit a warp sync after a diverging branch?
Hmm, reading about it a bit more, it's more about enabling algorithms that were not possible before than about performance improvements. So we should aim at having test-cases, both OpenACC and OpenMP, that hang on previous architectures but pass with sm_70+.
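
As a starting point, here is a rough sketch of what such an OpenMP test-case might look like: a naive spin lock taken inside an offloaded loop. The function name `locked_increments` is made up for illustration. Whether this really hangs before sm_70 is an assumption: it requires that contending iterations end up on lanes of the same warp and that the lock holder gets parked at the reconvergence point while the other lanes keep spinning, and both depend on how the port maps the construct and lays out the branches.

```c
/* Hypothetical test helper; the name is illustrative and not taken from
   the testsuite.  Returns the final counter, which should equal n.  */
int
locked_increments (int n)
{
  int lock = 0;     /* 0 = free, 1 = held.  */
  int counter = 0;  /* Only modified while the lock is held.  */

#pragma omp target parallel for map(tofrom: lock, counter)
  for (int i = 0; i < n; i++)
    {
      int old;

      /* Acquire: atomically swap 1 into the lock and spin until the
         previous value was 0.  If the holder and a spinning thread share
         a warp (assumption), forward progress of the holder is what
         Independent Thread Scheduling is supposed to provide.  */
      do
        {
#pragma omp atomic capture seq_cst
          { old = lock; lock = 1; }
        }
      while (old != 0);

      counter++;  /* Critical section.  */

      /* Release the lock.  */
#pragma omp atomic write seq_cst
      lock = 0;
    }

  return counter;
}
```

A test would call this from `main` with some `n` and abort if the result differs from `n`. On the host, or compiled without OpenMP, it simply terminates with the right count, so the interesting outcome is only whether the offloaded version terminates.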