Kokkos Node API and Local Linear Algebra Kernels Version of the Day

Parallel Tall Skinny QR (TSQR) factorization. More...
#include <Tsqr.hpp>
Public Types  
typedef dist_tsqr_type::rank_type  rank_type 
"Rank" here means MPI rank, not linear algebra rank.  
typedef std::pair< NodeOutput, DistOutput >  FactorOutput 
Return value of factor() .  
Public Member Functions  
Tsqr (const node_tsqr_ptr &nodeTsqr, const dist_tsqr_ptr &distTsqr)  
Constructor.  
Teuchos::RCP< node_tsqr_type >  getNodeTsqr () 
Get the intranode part of TSQR.  
size_t  cache_size_hint () const 
Cache size hint in bytes used by the intranode part of TSQR.  
size_t TEUCHOS_DEPRECATED  cache_block_size () const 
Cache size hint in bytes used by the intranode part of TSQR.  
bool  QR_produces_R_factor_with_nonnegative_diagonal () const 
Does the R factor have a nonnegative diagonal?  
template<class NodeType >  
void  factorExplicit (Kokkos::MultiVector< Scalar, NodeType > &A, Kokkos::MultiVector< Scalar, NodeType > &Q, Teuchos::SerialDenseMatrix< LocalOrdinal, Scalar > &R, const bool contiguousCacheBlocks, const bool forceNonnegativeDiagonal=false) 
Compute QR factorization with explicit Q factor.  
FactorOutput  factor (const LocalOrdinal nrows_local, const LocalOrdinal ncols, Scalar A_local[], const LocalOrdinal lda_local, Scalar R[], const LocalOrdinal ldr, const bool contiguousCacheBlocks=false) 
Compute QR factorization of the global dense matrix A.  
void  apply (const std::string &op, const LocalOrdinal nrows_local, const LocalOrdinal ncols_Q, const Scalar Q_local[], const LocalOrdinal ldq_local, const FactorOutput &factor_output, const LocalOrdinal ncols_C, Scalar C_local[], const LocalOrdinal ldc_local, const bool contiguousCacheBlocks=false) 
Apply Q factor to the global dense matrix C.  
void  explicit_Q (const LocalOrdinal nrows_local, const LocalOrdinal ncols_Q_in, const Scalar Q_local_in[], const LocalOrdinal ldq_local_in, const FactorOutput &factorOutput, const LocalOrdinal ncols_Q_out, Scalar Q_local_out[], const LocalOrdinal ldq_local_out, const bool contiguousCacheBlocks=false) 
Compute the explicit Q factor from factor()  
void  Q_times_B (const LocalOrdinal nrows, const LocalOrdinal ncols, Scalar Q[], const LocalOrdinal ldq, const Scalar B[], const LocalOrdinal ldb, const bool contiguousCacheBlocks=false) const 
Compute Q*B.  
LocalOrdinal  reveal_R_rank (const LocalOrdinal ncols, Scalar R[], const LocalOrdinal ldr, Scalar U[], const LocalOrdinal ldu, const magnitude_type &tol) const 
Reveal the rank of the R factor, using the SVD.  
template<class NodeType >  
LocalOrdinal  revealRank (Kokkos::MultiVector< Scalar, NodeType > &Q, Teuchos::SerialDenseMatrix< LocalOrdinal, Scalar > &R, const magnitude_type &tol, const bool contiguousCacheBlocks=false) const 
Rankrevealing decomposition.  
void  cache_block (const LocalOrdinal nrows_local, const LocalOrdinal ncols, Scalar A_local_out[], const Scalar A_local_in[], const LocalOrdinal lda_local_in) const 
Cacheblock A_in into A_out.  
void  un_cache_block (const LocalOrdinal nrows_local, const LocalOrdinal ncols, Scalar A_local_out[], const LocalOrdinal lda_local_out, const Scalar A_local_in[]) const 
Uncacheblock A_in into A_out. 
Parallel Tall Skinny QR (TSQR) factorization.
This class computes the parallel Tall Skinny QR (TSQR) factorization of a matrix distributed in block rows across one or more MPI processes. The parallel critical path length for TSQR is independent of the number of columns in the matrix, unlike ScaLAPACK's comparable QR factorization (P_GEQR2), Modified GramSchmidt, or Classical GramSchmidt.
LocalOrdinal  Index type that can address all elements of a matrix, when treated as a 1D array. That is, for A[i + LDA*j], the index i + LDA*j must fit in a LocalOrdinal. 
Scalar  The type of the matrix entries. 
NodeTsqrType  The intranode (singlenode) part of TSQR. Defaults to SequentialTsqr , which provides a sequential cacheblocked implementation. Any class implementing the same compiletime interface is valid. We provide NodeTsqr as an archetype of the "NodeTsqrType" concept, but it is not necessary that NodeTsqrType derive from that abstract base class. Inheriting from NodeTsqr is useful, though, because it provides default implementations of some routines that are not performancecritical. 
TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::rank_type 
TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::FactorOutput 
TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::Tsqr  (  const node_tsqr_ptr &  nodeTsqr, 
const dist_tsqr_ptr &  distTsqr  
)  [inline] 
Teuchos::RCP<node_tsqr_type> TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::getNodeTsqr  (  )  [inline] 
size_t TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::cache_size_hint  (  )  const [inline] 
size_t TEUCHOS_DEPRECATED TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::cache_block_size  (  )  const [inline] 
Cache size hint in bytes used by the intranode part of TSQR.
This method is deprecated; please call cache_size_hint()
instead.
bool TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::QR_produces_R_factor_with_nonnegative_diagonal  (  )  const [inline] 
Does the R factor have a nonnegative diagonal?
Tsqr implements a QR factorization (of a distributed matrix). Some, but not all, QR factorizations produce an R factor whose diagonal may include negative entries. This Boolean tells you whether Tsqr promises to compute an R factor whose diagonal entries are all nonnegative.
void TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::factorExplicit  (  Kokkos::MultiVector< Scalar, NodeType > &  A, 
Kokkos::MultiVector< Scalar, NodeType > &  Q,  
Teuchos::SerialDenseMatrix< LocalOrdinal, Scalar > &  R,  
const bool  contiguousCacheBlocks,  
const bool  forceNonnegativeDiagonal = false 

)  [inline] 
Compute QR factorization with explicit Q factor.
This method computes the "thin" QR factorization, like Matlab's [Q,R] = qr(A,0), of a matrix A with more rows than columns. Like Matlab, it computes the Q factor "explicitly," that is, as a matrix represented in the same format as in the input matrix A (rather than the implicit representation returned by factor()
).
Calling this method may be faster than calling factor()
and explicit_Q()
in sequence, if you know that you only want the explicit version of the Q factor. This method is especially intended for orthogonalizing the columns of a Tpetra::MultiVector
. It can also be used for an Epetra_MultiVector
, if you put each node's data in a Kokkos::MultiVector
first. (This does not require copying.)
A  [in/out] On input: my node's part of the matrix to factor; the matrix is distributed over the participating processors. If contiguousCacheBlocks is false, my node's part of the matrix is stored in columnmajor order. Otherwise, it is stored in the contiguously cacheblocked format that would be computed by cache_block() . On output: overwritten with garbage (part of the implicit Q factor, which is not useful in this case). 
Q  [out] On output: my node's part of the explicit Q factor, stored in the same format as the input matrix A. 
R  [out] On output: the R factor, which is square with the same number of rows and columns as the number of columns in A. The R factor is replicated on all nodes. 
contiguousCacheBlocks  [in] Whether cache blocks of A (on input) and Q (on output) are stored contiguously. If you don't know what this means, set it to false. 
forceNonnegativeDiagonal  [in] If true, then (if necessary) do extra work (modifying both the Q and R factors) in order to force the R factor to have a nonnegative diagonal. 
FactorOutput TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::factor  (  const LocalOrdinal  nrows_local, 
const LocalOrdinal  ncols,  
Scalar  A_local[],  
const LocalOrdinal  lda_local,  
Scalar  R[],  
const LocalOrdinal  ldr,  
const bool  contiguousCacheBlocks = false 

)  [inline] 
Compute QR factorization of the global dense matrix A.
Compute the QR factorization of the tall and skinny dense matrix A. The matrix A is distributed in a row block layout over all the MPI processes. A_local contains the matrix data for this process.
nrows_local  [in] Number of rows of this node's local component (A_local) of the matrix. May differ on different nodes. Precondition: nrows_local >= ncols. 
ncols  [in] Number of columns in the matrix to factor. Should be the same on all nodes. Precondition: nrows_local >= ncols. 
A_local  [in,out] On input, this node's local component of the matrix, stored as a general dense matrix in columnmajor order. On output, overwritten with an implicit representation of the Q factor. 
lda_local  [in] Leading dimension of A_local. Precondition: lda_local >= nrows_local. 
R  [out] The final R factor of the QR factorization of the global matrix A. An ncols by ncols upper triangular matrix with leading dimension ldr. 
ldr  [in] Leading dimension of the matrix R. 
contiguousCacheBlocks  [in] Whether or not cache blocks of A_local are stored contiguously. The default value of false means that A_local uses ordinary columnmajor (Fortranstyle) order. Otherwise, the details of the format depend on the specific NodeTsqrType. Tsqr's cache_block() and un_cache_block() methods may be used to convert between cacheblocked and noncacheblocked (columnmajor order) formats. 
void TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::apply  (  const std::string &  op, 
const LocalOrdinal  nrows_local,  
const LocalOrdinal  ncols_Q,  
const Scalar  Q_local[],  
const LocalOrdinal  ldq_local,  
const FactorOutput &  factor_output,  
const LocalOrdinal  ncols_C,  
Scalar  C_local[],  
const LocalOrdinal  ldc_local,  
const bool  contiguousCacheBlocks = false 

)  [inline] 
Apply Q factor to the global dense matrix C.
Apply the Q factor (computed by factor() and represented implicitly) to the global dense matrix C, consisting of all nodes' C_local matrices stacked on top of each other.
[in]  If  "N", compute Q*C. If "T", compute Q^T * C. If "H" or "C", compute Q^H * C. (The last option may not be implemented in all cases.) 
nrows_local  [in] Number of rows of this node's local component (C_local) of the matrix C. Should be the same on this node as the nrows_local argument with which factor() was called Precondition: nrows_local >= ncols.  
ncols_Q  [in] Number of columns in Q. Should be the same on all nodes. Precondition: nrows_local >= ncols_Q.  
Q_local  [in] Same as A_local output of factor()  
ldq_local  [in] Same as lda_local of factor()  
factor_output  [in] Return value of factor()  
ncols_C  [in] Number of columns in C. Should be the same on all nodes. Precondition: nrows_local >= ncols_C.  
C_local  [in,out] On input, this node's local component of the matrix C, stored as a general dense matrix in columnmajor order. On output, overwritten with this node's component of op(Q)*C, where op(Q) = Q, Q^T, or Q^H.  
ldc_local  [in] Leading dimension of C_local. Precondition: ldc_local >= nrows_local.  
contiguousCacheBlocks  [in] Whether or not the cache blocks of Q and C are stored contiguously. 
void TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::explicit_Q  (  const LocalOrdinal  nrows_local, 
const LocalOrdinal  ncols_Q_in,  
const Scalar  Q_local_in[],  
const LocalOrdinal  ldq_local_in,  
const FactorOutput &  factorOutput,  
const LocalOrdinal  ncols_Q_out,  
Scalar  Q_local_out[],  
const LocalOrdinal  ldq_local_out,  
const bool  contiguousCacheBlocks = false 

)  [inline] 
Compute the explicit Q factor from factor()
Compute the explicit version of the Q factor computed by factor() and represented implicitly (via Q_local_in and factor_output).
nrows_local  [in] Number of rows of this node's local component (Q_local_in) of the matrix Q_local_in. Also, the number of rows of this node's local component (Q_local_out) of the output matrix. Should be the same on this node as the nrows_local argument with which factor() was called. Precondition: nrows_local >= ncols_Q_in. 
ncols_Q_in  [in] Number of columns in the original matrix A, whose explicit Q factor we are computing. Should be the same on all nodes. Precondition: nrows_local >= ncols_Q_in. 
Q_local_in  [in] Same as A_local output of factor(). 
ldq_local_in  [in] Same as lda_local of factor() 
factorOutput  [in] Return value of factor(). 
ncols_Q_out  [in] Number of columns of the explicit Q factor to compute. Should be the same on all nodes. 
Q_local_out  [out] This node's component of the Q factor (in explicit form). 
ldq_local_out  [in] Leading dimension of Q_local_out. 
contiguousCacheBlocks  [in] Whether or not cache blocks in Q_local_in and Q_local_out are stored contiguously. 
void TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::Q_times_B  (  const LocalOrdinal  nrows, 
const LocalOrdinal  ncols,  
Scalar  Q[],  
const LocalOrdinal  ldq,  
const Scalar  B[],  
const LocalOrdinal  ldb,  
const bool  contiguousCacheBlocks = false 

)  const [inline] 
LocalOrdinal TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::reveal_R_rank  (  const LocalOrdinal  ncols, 
Scalar  R[],  
const LocalOrdinal  ldr,  
Scalar  U[],  
const LocalOrdinal  ldu,  
const magnitude_type &  tol  
)  const [inline] 
Reveal the rank of the R factor, using the SVD.
Compute the singular value decomposition (SVD) of the R factor: , not in place. Use the resulting singular values to compute the numerical rank of R, with respect to the relative tolerance tol. If R is full rank, return without modifying R. If R is not full rank, overwrite R with .
LocalOrdinal TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::revealRank  (  Kokkos::MultiVector< Scalar, NodeType > &  Q, 
Teuchos::SerialDenseMatrix< LocalOrdinal, Scalar > &  R,  
const magnitude_type &  tol,  
const bool  contiguousCacheBlocks = false 

)  const [inline] 
Rankrevealing decomposition.
Using the R factor and explicit Q factor from factorExplicit(), compute the singular value decomposition (SVD) of R ( ). If R is full rank (with respect to the given relative tolerance tol), don't change Q or R. Otherwise, compute and in place (the latter may be no longer upper triangular).
Q  [in/out] On input: explicit Q factor computed by factorExplicit(). (Must be an orthogonal resp. unitary matrix.) On output: If R is of full numerical rank with respect to the tolerance tol, Q is unmodified. Otherwise, Q is updated so that the first rank columns of Q are a basis for the column space of A (the original matrix whose QR factorization was computed by factorExplicit()). The remaining columns of Q are a basis for the null space of A. 
R  [in/out] On input: ncols by ncols upper triangular matrix with leading dimension ldr >= ncols. On output: if input is full rank, R is unchanged on output. Otherwise, if is the SVD of R, on output R is overwritten with $ V^*$. This is also an ncols by ncols matrix, but may not necessarily be upper triangular. 
tol  [in] Relative tolerance for computing the numerical rank of the matrix R. 
contiguousCacheBlocks  [in] Whether or not the cache blocks of Q are stored contiguously. Defaults to false, which means that Q uses the ordinary columnmajor layout on each MPI process. 
void TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::cache_block  (  const LocalOrdinal  nrows_local, 
const LocalOrdinal  ncols,  
Scalar  A_local_out[],  
const Scalar  A_local_in[],  
const LocalOrdinal  lda_local_in  
)  const [inline] 
void TSQR::Tsqr< LocalOrdinal, Scalar, NodeTsqrType >::un_cache_block  (  const LocalOrdinal  nrows_local, 
const LocalOrdinal  ncols,  
Scalar  A_local_out[],  
const LocalOrdinal  lda_local_out,  
const Scalar  A_local_in[]  
)  const [inline] 