It's working. That's a huge, huge satisfaction. Considering that it was only yesterday that I was down in the dumps, the recovery, some 20 hours later, seems miraculous.
However, as always, the devil lies in the details. A blind run on production-sized samples led to a 3x slowdown.
Yup. That's correct. All this effort. All this pain. All the sweating, thinking, toiling, fretting, and praying, for a 3x slowdown. I didn't sign up for this.
On deeper inspection, the searches are at least 2x faster, and the PRNG is 5x faster. Then what the hell is up with it?
Turns out, the overhead of the Python/C++ transition is too much. It's negligible if you amortize it over large runs, but if the underlying call doesn't do much work, you're dead. So some automation is in order to batch those calls. Further, I need to add prefetching hints: I think at least some of the memory-access latency can be safely hidden behind the vector operations. And, having run it, I just realized there are fundamental limits to the speedup obtainable from my cache-aware optimizations.
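The amortization point boils down to crossing the boundary once per array instead of once per element. A minimal sketch of that idea, with a made-up function name and `extern "C"` linkage (the post doesn't show its actual API, so everything here is illustrative):

```cpp
#include <cstddef>

// Hypothetical batched entry point. Instead of paying the Python/C++
// transition cost once per element, the caller hands over the whole
// buffer and pays it once per call, amortizing the overhead over n
// elements of real work. The name and signature are assumptions.
extern "C" void scale_all(double* data, std::size_t n, double factor) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
}
```

On the Python side this would be called once via ctypes or a binding layer with the full array, rather than in a per-element loop, which is where the 3x slowdown presumably came from.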
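As for the prefetching hints, one common shape is a software prefetch issued a fixed distance ahead of the current element, so the fetch overlaps with the arithmetic already in flight. A sketch using the GCC/Clang `__builtin_prefetch` builtin; the distance is a guess that would need tuning on real hardware, and the loop stands in for whatever the actual kernel does:

```cpp
#include <cstddef>

// How far ahead to prefetch, in elements. 64 doubles = 512 bytes,
// i.e. several cache lines ahead; the right value is hardware- and
// workload-dependent, so treat this as a placeholder.
constexpr std::size_t PF_DIST = 64;

double sum_with_prefetch(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Ask the memory system to start pulling in a[i + PF_DIST]
        // while the current additions execute. rw=0: read,
        // locality=1: low temporal reuse expected.
        if (i + PF_DIST < n)
            __builtin_prefetch(a + i + PF_DIST, 0, 1);
        s += a[i];
    }
    return s;
}
```

The hint is purely advisory: it changes no results, only (hopefully) how much memory latency hides behind the compute.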
I am gonna go out on a limb and say this: I think we are approaching the fundamental limits of this method. But then, I have said before that I was out of optimizations for this. And of course, as always,
THERE AIN'T NO SUCH THING AS THE FASTEST CODE.
In short, the real bottleneck is between the keyboard and the chair, not between the motherboard and the cooler.
Then why say so? It's a gut feeling. I'd love to be proven wrong; the bigger the margin, the better.