Kokkos Node API and Local Linear Algebra Kernels Version of the Day
Kokkos Node API

class  Kokkos::CUDANodeMemoryModel
 A default implementation of the Node memory architecture for Node with a distinct device memory space allocated by the CUDA runtime. More...
class  Kokkos::DefaultNode
 Class to specify Kokkos default node type and instantiate the default node. More...
class  Kokkos::ReadyBufferHelper< Node >
 A class to assist in readying buffers via the Node::readyBuffer() method. More...
class  Kokkos::SerialNode
 Kokkos node interface for a serial, CPU node. More...
class  Kokkos::StandardNodeMemoryModel
 A default implementation of the Node memory architecture for a single memory space allocated by standard library calls. More...
class  Kokkos::TBBNode
 Kokkos node interface to the Intel Threading Building Blocks threading library. More...
class  Kokkos::ThrustGPUNode
 Kokkos node interface to the Thrust library for NVIDIA CUDA-capable GPUs. More...
class  Kokkos::TPINode
 Kokkos node interface to the ThreadPool threading library. More...

Detailed Description

Introduction to the Node API

The Kokkos Node API is intended to provide a generic interface for programming shared-memory parallel nodes. Generic programming of an arbitrary shared-memory node is difficult for a number of reasons, particularly the diversity of memory architectures (host-accessible memory versus distinct device memory) and the diversity of parallel programming models across node types.

Along these lines, the Kokkos Node API is broken into two orthogonal components:

  1. a memory model, describing how parallel compute buffers are allocated, copied, and viewed; and
  2. a compute model, describing the parallel primitives provided by a node and the interface by which user kernels are submitted for execution.

Node API Memory Model

The following describes the methods on a node object required to support the node API memory model, for some sample node SomeNode:

   class SomeNode {
     // isHostNode == true indicates that memory allocated by the node may be safely accessed on the host thread
     static const bool isHostNode = ...;

     // Allocate a segment of memory for use in parallel kernels (a parallel compute buffer)
     template <class T>
     ArrayRCP<T> allocBuffer(size_t size);

     // Copy data to the host from a parallel compute buffer.
     template <class T>
     void copyFromBuffer(size_t size, const ArrayRCP<const T> &buffSrc, const ArrayView<T> &hostDest);

     // Copy data from the host to a parallel compute buffer.
     template <class T>
     void copyToBuffer(size_t size, const ArrayView<const T> &hostSrc, const ArrayRCP<T> &buffDest);

     // Copy data from one parallel compute buffer to another.
     template <class T>
     void copyBuffers(size_t size, const ArrayRCP<const T> &buffSrc, const ArrayRCP<T> &buffDest);

     // Return a read-only view of a parallel compute buffer for use on the host thread.
     template <class T>
     ArrayRCP<const T> viewBuffer(size_t size, ArrayRCP<const T> buff);

     // Return a modifiable view of a parallel compute buffer for use on the host thread.
     template <class T>
     ArrayRCP<T> viewBufferNonConst(ReadWriteOption rw, size_t size, const ArrayRCP<T> &buff);

     // Prepare a set of parallel compute buffers for use in a parallel kernel.
     void readyBuffers(ArrayView<ArrayRCP<const char> > buffers, ArrayView<ArrayRCP<char> > ncBuffers);
   };

The distinction made between views and copies is significant. On a GPU-based parallel platform, parallel compute buffers may not be accessible from the host thread; on a CPU-based platform, they are. It is preferable to devise an access model that enables generic code to run on the GPU platform without incurring unnecessary memory costs on the CPU platform. By requiring the user to explicitly distinguish between a copy to/from parallel memory and an arbitrary host access of the parallel memory, we allow optimal efficiency in either case. Furthermore, by providing the compile-time Node::isHostNode flag for each Node class, we allow users to forgo generic code in cases where greater efficiency can be achieved otherwise.

The readyBuffers method is technically required to be called for a set of buffers before their use in a parallel kernel. However, the current node implementations in Kokkos only perform debugging in this method, and therefore neglecting to call it (especially in a non-debug mode) does not affect the correctness of the code. Future node implementations, however, may require that this method is called (for example, a device-based node which swaps out parallel buffers due to limited memory). The class ReadyBufferHelper assists in calling this method.

For example, consider the sample code below, templated on the generic node SomeNode:

   template <class SomeNode>
   void initializeNodeMemory(RCP<SomeNode> node) {
     // alloc a buffer of ten integers "on the node"
     ArrayRCP<int> buffA  = node->allocBuffer<int>( 10 );

     // get a view to the allocated buffer, that we can access on the host
     // * on a CPU node, this will result in viewA == buffA
     // * on a GPU node, this is not the case; however, the WriteOnly flag will
     //   not initiate a copy to the host; the initial contents of viewA are undefined.
     ArrayRCP<int> viewA = node->viewBufferNonConst<int>( Kokkos::WriteOnly, 10, buffA );
     for (int i=0; i < 10; ++i) {
       viewA[i] = someFunction(i); // initialize the view on the host
     }
     // free the view. the changes to the view are realized in the buffer only after all references
     // to the view have been freed and its deallocator has been called.
     // the contents of buffA are undefined while the view is active.
     viewA = Teuchos::null;
     // at this point, the entries of buffA are initialized as follows:
     //   buffA[i] = someFunction(i)

     // allocate another buffer
     ArrayRCP<int> buffB = node->allocBuffer<int>( 3 );
     // copy the last three entries of buffA into buffB
     // after this is finished, buffB[i] = buffA[7+i], for i=0:2
     node->copyBuffers<int>( 3, buffA+7, buffB );

     // get a view of buffB, verify that its entries were properly set
     // this view is const and cannot change buffB. the availability of buffB
     // for computation is not affected by the existence of viewB.
     // * on a GPU-based node, this requires a copy of these entries to the host
     // * on a CPU-based node, there is no copy, and viewB == buffB == buffA+7 (except that the first of these is const)
     ArrayRCP<const int> viewB = node->viewBuffer<int>( 3, buffB );
     for (int i=0; i<3; ++i) {
       assert( viewB[i] == someFunction(i+7) );
     }
     // free the view. there were no changes, so even for a device platform with distinct
     // memory, there will be no copy back to device memory.
     viewB = Teuchos::null;
   }

For a host-based compute node (i.e., one where SomeNode::isHostNode == true), the above code will:

  1. allocate ten integers (buffA) and initialize them
  2. allocate three integers (buffB)
  3. copy the last three of buffA to buffB
  4. verify the contents of buffB

However, on a device-based compute node with distinct memory (like many GPU platforms), the above code will:

  1. allocate ten integers (buffA) on the device
  2. allocate ten integers (viewA) on the host and initialize them
  3. copy these from the host (viewA) to the device (buffA) and free the host allocation (viewA)
  4. allocate three integers on the device (buffB)
  5. copy three integers from the device to the device (buffA to buffB)
  6. allocate three integers on the host (viewB)
  7. copy three integers from the device (buffB) to the host (viewB)
  8. verify the contents of viewB
  9. release the host allocation (viewB)

The distinction between these is entirely due to the specific implementation of the methods of the specific node class.

Node API Compute Model

The Node API compute model specifies the parallel primitives available for a generic Kokkos node, as well as the interface by which user kernels are submitted for execution. Currently, the compute model specifies two parallel primitives:

  1. parallel_for, a parallel for loop over a contiguous range of iterations; and
  2. parallel_reduce, a parallel reduction over values generated from a contiguous range of elements.

The Node API specifies one additional routine. Each node is required to provide a sync() method, which blocks until any outstanding memory or computational operations are complete. This is only significant for nodes that support asynchronous computation or memory transfer (e.g., CUDA-based nodes). For other nodes, this will typically be a no-op.

Below is the interface for these methods, for some sample node SomeNode:

  class SomeNode {
    template <class ForLoopBody>
    void parallel_for(int begin, int end, ForLoopBody body);

    template <class ReductionKernel>
    typename ReductionKernel::ReductionType
    parallel_reduce(int begin, int end, ReductionKernel rd);
  };

Parallel For

The ForLoopBody object is required to provide the following:

  class ForLoopBody {
    KERNEL_PREFIX void execute(int iter);
  };

This method executes iteration number iter of the for loop. Typically, the data members of the class contain the data necessary to execute the loop body.

The KERNEL_PREFIX macro is required to be inserted before all methods that are called by a parallel kernel. This is typically an empty macro, but it may contain helpful or necessary function specifiers on some platforms (for example, on CUDA platforms, it resolves to "__global__ __device__").

This class is combined with the parallel_for method as follows:

  ForLoopBody body(...);
  node->parallel_for( beg, end, body );

The semantics of this call imply that:

  1. body.execute(iter) is invoked exactly once for each iter in [beg, end); and
  2. the iterations may be executed concurrently and in any order, so they must be independent of one another.

Parallel Reduce

The ReductionKernel object is a little larger, required to provide the following, in addition to any user-specified data members necessary to implement the reduction:

  class ReductionKernel {
    typedef ... ReductionType;
    KERNEL_PREFIX ReductionType identity() const;
    KERNEL_PREFIX ReductionType generate(int element);
    KERNEL_PREFIX ReductionType reduce(ReductionType a, ReductionType b) const;
  };

This class is combined with the parallel_reduce method as follows:

  ReductionKernel kern(...);
  ReductionKernel::ReductionType result = node->parallel_reduce( beg, end, kern );

The semantics of this call imply that:

  1. the result is the reduction, via reduce(), of the values generate(i) for each i in [beg, end);
  2. identity() returns the identity element of reduce(), i.e., reduce(identity(), a) == a; and
  3. the elements may be generated and reduced concurrently and in any order, so reduce() must be associative (and, in general, commutative).

Simple Node API Example

The following example illustrates the parallel initialization of a vector and a reduction of its data. This example is located at kokkos/NodeAPI/examples/SimpleNodeExample.cpp.

#include "Kokkos_NodeExampleKernels.hpp"
#include <Kokkos_DefaultNode.hpp>
#include <Kokkos_NodeHelpers.hpp>

int main() {
  typedef Kokkos::DefaultNode::DefaultNodeType NODE;
  const int VEC_LENGTH = 100;

  Teuchos::RCP<NODE> node = Kokkos::DefaultNode::getDefaultNode();
  Teuchos::ArrayRCP<int> x = node->allocBuffer<int>( VEC_LENGTH );

  KokkosExamples::initVec( node, x );
  int ret = KokkosExamples::reduceVec( node, x );
  std::cout << "Result is " << ret << std::endl;
  if (ret == (VEC_LENGTH-1)*VEC_LENGTH/2) std::cout << "End Result: TEST PASSED" << std::endl;

  return 0;
}

This example makes use of the kernels and methods from the following code, located at kokkos/NodeAPI/examples/Kokkos_NodeExampleKernels.hpp.


#include <Teuchos_RCP.hpp>
#include <Teuchos_ArrayRCP.hpp>
#include <Kokkos_NodeHelpers.hpp>


namespace KokkosExamples {

  struct VecInit {
    int * x;
    KERNEL_PREFIX inline void execute(int i) {x[i] = i;}
  };

  template <class NODE>
  void initVec(Teuchos::RCP<NODE> node, Teuchos::ArrayRCP<int> data) {
    Kokkos::ReadyBufferHelper<NODE> rbh(node);
    VecInit init;
    // ready the buffer and encapsulate the kernel arguments
    init.x = rbh.addNonConstBuffer(data);
    rbh.end();  // this calls node->readyBuffers()
    // execute the kernel
    node->parallel_for(0,data.size(),init);
  }

  struct VecReduce {
    const int * x;
    typedef int ReductionType;
    KERNEL_PREFIX static inline int identity()            { return 0;    }
    KERNEL_PREFIX        inline int generate(int i)       { return x[i]; }
    KERNEL_PREFIX        inline int reduce  (int a, int b){ return a+b;  }
  };

  template <class NODE>
  int reduceVec(Teuchos::RCP<NODE> node, Teuchos::ArrayRCP<const int> data) {
    Kokkos::ReadyBufferHelper<NODE> rbh(node);
    VecReduce reduce;
    // ready the buffer and encapsulate the kernel arguments
    reduce.x = rbh.addConstBuffer(data);
    rbh.end();  // this calls node->readyBuffers()
    int ret = node->parallel_reduce(0,data.size(),reduce);
    return ret;
  }

} // end of namespace KokkosExamples
