What is an example program that realizes a performance gain by calling _mm_stream_si64x()?
The MSDN article on _mm_stream_si64x: http://msdn.microsoft.com/en-us/library/35b8kssy.aspx
Here's an example, assuming source and destination each point to a buffer of at least 100 MB:
const char *source;
char *destination;
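// Write one 8-byte word into each (typically 64-byte) cache line of a 100 MB range.
for (size_t offset = 0; offset < 100 * 1024 * 1024; offset += 64)
{
    // Ordinary store: every destination line touched here ends up in the cache.
    *(__int64 *)(destination + offset) = *(const __int64 *)(source + offset);
}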
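If you do this manually with ordinary stores instead of using _mm_stream_si64x, you effectively flush the cache: every destination line you write is pulled into the cache and evicts whatever was there before.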
Like the reference says, the _mm_stream_si64x intrinsic writes the data directly to the memory location pointed to by Dest without polluting the cache. So if you want to copy data to the Dest pointer, but do not plan on accessing it through Dest until much later, then this intrinsic would 'realize a performance gain' over an ordinary (cached) store.
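For comparison, here is a rough sketch of the same loop using the streaming store. It is not taken from the MSDN article: the helper name stream_copy_per_line and the bytes parameter are made up for illustration. It assumes an x64 target, uses _mm_stream_si64x under MSVC and the equivalent _mm_stream_si64 under GCC/Clang, and falls back to a plain store elsewhere. The trailing _mm_sfence is an assumption about what the caller wants, namely that the streamed data is globally visible before anything else reads it.

```c
#include <stddef.h>
#include <string.h>

#if defined(_M_X64) || defined(__x86_64__)
#  if defined(_MSC_VER)
#    include <intrin.h>     /* _mm_stream_si64x, _mm_sfence */
#  else
#    include <emmintrin.h>  /* _mm_stream_si64, _mm_sfence */
#  endif
#  define HAVE_STREAM64 1
#else
#  define HAVE_STREAM64 0   /* assumption: no 64-bit streaming store available */
#endif

/* Hypothetical helper: writes the first 8 bytes of every 64-byte block of
   source into the same offset of destination, bypassing the cache where
   the target supports it. destination should be 8-byte aligned. */
static void stream_copy_per_line(char *destination, const char *source, size_t bytes)
{
    for (size_t offset = 0; offset + sizeof(long long) <= bytes; offset += 64)
    {
        long long value;
        memcpy(&value, source + offset, sizeof value);  /* ordinary load */
#if HAVE_STREAM64 && defined(_MSC_VER)
        _mm_stream_si64x((__int64 *)(destination + offset), value);
#elif HAVE_STREAM64
        _mm_stream_si64((long long *)(destination + offset), value);
#else
        memcpy(destination + offset, &value, sizeof value);  /* plain-store fallback */
#endif
    }
#if HAVE_STREAM64
    _mm_sfence();  /* order the non-temporal stores before later stores */
#endif
}
```

Calling stream_copy_per_line(destination, source, 100 * 1024 * 1024) corresponds to the loop above. Whether it is actually faster depends on the hardware and on what else is competing for the cache, so it is worth timing both versions. To copy every byte rather than one word per line, the stride would be sizeof(long long) instead of 64.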