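Kernel launches are inherently asynchronous w.r.t. the host: the launch call returns before the kernel has finished. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 (e.g. export CUDA_LAUNCH_BLOCKING=1) forces kernel launches to block the host until they complete, which is useful for debugging (e.g. tracking down race conditions or locating the launch that caused an error).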
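A minimal sketch of what asynchronous launching means for error handling. The kernel name fill and the sizes are made up for illustration. The launch returns immediately, so launch-time errors and execution errors are checked at two different points:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main() {
    const int n = 1 << 20;
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return 1;

    // Returns to the host right away; the kernel may still be running.
    fill<<<(n + 255) / 256, 256>>>(d, n, 7);

    // Reports launch-time errors (e.g. an invalid launch configuration).
    cudaError_t launchErr = cudaGetLastError();

    // Blocks the host until the kernel finishes; execution errors surface here.
    cudaError_t syncErr = cudaDeviceSynchronize();

    printf("launch: %s, sync: %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));

    cudaFree(d);
    return (launchErr == cudaSuccess && syncErr == cudaSuccess) ? 0 : 1;
}
```

With CUDA_LAUNCH_BLOCKING=1 set, the launch itself would block, so an execution error is attributed to the launch that caused it rather than to a later call.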

Operations within the same stream are ordered (FIFO) and cannot overlap; operations in different streams are unordered relative to each other and can overlap.
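The ordering rule can be sketched as follows. The kernel scale and the stream names s1 and s2 are hypothetical. Whether the work in s2 actually overlaps with s1 depends on the device and on free resources:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same stream: the second launch starts only after the first has finished.
    scale<<<blocks, threads, 0, s1>>>(a, n, 2.0f);
    scale<<<blocks, threads, 0, s1>>>(a, n, 0.5f);

    // Different stream, different buffer: may run concurrently with the s1 work.
    scale<<<blocks, threads, 0, s2>>>(b, n, 3.0f);

    // Wait for both streams before cleaning up.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return err == cudaSuccess ? 0 : 1;
}
```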

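cudaStreamDestroy does not block the host: if work is still pending in the stream, the call returns immediately and the stream's resources are released automatically once the device has completed that work. Call cudaStreamSynchronize beforehand if the host must wait for the stream to drain.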

The legacy default stream (stream 0) has special synchronization rules: it synchronizes with all other blocking streams, i.e., operations in stream 0 cannot overlap with operations in those streams. An operation in the legacy stream will not start until all previously issued operations in other streams have completed, and operations issued afterwards in other streams will not start until the legacy-stream operation has completed. The exception is non-blocking streams, created with cudaStreamCreateWithFlags and the cudaStreamNonBlocking flag, which do not synchronize with the legacy stream.
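A small sketch of these rules, assuming the program is compiled with the default (legacy) default-stream behavior rather than nvcc's --default-stream per-thread option. The kernel busy and the stream names are hypothetical:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that does some work and writes one result.
__global__ void busy(int *out, int iters) {
    int acc = 0;
    for (int i = 0; i < iters; ++i) acc += i % 7;
    if (threadIdx.x == 0) *out = acc;
}

int main() {
    int *d = nullptr;
    cudaMalloc(&d, 4 * sizeof(int));

    cudaStream_t blocking, nonBlocking;
    cudaStreamCreate(&blocking);                                     // synchronizes with stream 0
    cudaStreamCreateWithFlags(&nonBlocking, cudaStreamNonBlocking);  // does not

    busy<<<1, 32, 0, blocking>>>(d + 0, 1 << 20);     // A
    busy<<<1, 32>>>(d + 1, 1 << 20);                  // B: legacy stream, waits for A
    busy<<<1, 32, 0, blocking>>>(d + 2, 1 << 20);     // C: waits for B
    busy<<<1, 32, 0, nonBlocking>>>(d + 3, 1 << 20);  // D: not ordered against B

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(blocking);
    cudaStreamDestroy(nonBlocking);
    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Each launch writes to its own element so that D, which may run concurrently with the others, never touches the same memory.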

Memory Copies

Transfers to or from pinned (page-locked) host memory use the GPU's DMA copy engine over PCIe, while copies involving unpinned (pageable) memory, whose pages could be swapped out, are first staged through a pinned buffer by the CPU.

A memory copy can execute concurrently with other operations (kernels or other copies) if and only if:

  • The memory copy is issued in a different non-default stream
  • The copy uses pinned memory on the host
  • The asynchronous API is called
  • There isn’t another memory copy occurring in the same direction at the same time
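The conditions above can be sketched as follows, assuming device 0 and hypothetical buffer and stream names. One host-to-device and one device-to-host copy are issued with the asynchronous API from pinned memory in two different streams, so they may overlap on a device with more than one copy engine:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 22;
    const size_t bytes = n * sizeof(float);

    // asyncEngineCount reports how many copy engines the device has.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("copy engines: %d\n", prop.asyncEngineCount);

    // Pinned host buffers: required for truly asynchronous copies.
    float *hIn = nullptr, *hOut = nullptr;
    cudaMallocHost(&hIn, bytes);
    cudaMallocHost(&hOut, bytes);
    for (size_t i = 0; i < n; ++i) hIn[i] = 1.0f;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemset(dOut, 0, bytes);
    cudaDeviceSynchronize();  // make sure the memset is done before the copies

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // Opposite directions, different non-default streams, asynchronous API.
    cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, down);

    cudaStreamSynchronize(up);
    cudaError_t err = cudaStreamSynchronize(down);
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return err == cudaSuccess ? 0 : 1;
}
```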



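Explicit synchronization can be done with:

  • cudaDeviceSynchronize: blocks the host until all work on the device has completed
  • cudaStreamSynchronize: blocks the host until all work in the given stream has completed
  • CUDA events (create them with the cudaEventDisableTiming flag to increase performance): cudaEventRecord + cudaStreamWaitEvent make one stream wait for a point in another stream without blocking the host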
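A minimal sketch of event-based ordering between two streams. The kernels produce and consume and the stream names are hypothetical. Both streams are non-blocking here so that the event is the only thing ordering them:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer: writes i into buf[i].
__global__ void produce(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

// Hypothetical consumer: increments what the producer wrote.
__global__ void consume(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1;
}

int main() {
    const int n = 1 << 16;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));

    cudaStream_t producer, consumer;
    cudaStreamCreateWithFlags(&producer, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&consumer, cudaStreamNonBlocking);

    // Timing is not needed for pure ordering, so disable it.
    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

    produce<<<blocks, threads, 0, producer>>>(d, n);
    cudaEventRecord(ready, producer);         // marks the point after produce
    cudaStreamWaitEvent(consumer, ready, 0);  // consumer waits on the device, host does not block
    consume<<<blocks, threads, 0, consumer>>>(d, n);

    // Host waits only for the consumer stream.
    cudaStreamSynchronize(consumer);

    int last = -1;
    cudaMemcpy(&last, d + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    printf("last = %d (expected %d)\n", last, n);

    cudaEventDestroy(ready);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(d);
    return last == n ? 0 : 1;
}
```

Without the cudaEventRecord / cudaStreamWaitEvent pair, consume could start before produce has finished, since the two streams are otherwise unordered.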

APIs such as memory allocation and free, and stream and event creation and destruction, are implicitly synchronous w.r.t. the host.