#                                                                             #
# Trilinos Release 11.4 Release Notes                                         #
#                                                                             #


The Trilinos Project is an effort to develop algorithms and enabling
technologies within an object-oriented software framework for the solution of
large-scale, complex multi-physics engineering and scientific problems.


The Trilinos 11.4 general release contains 54 packages: Amesos, Amesos2,
Anasazi, AztecOO, Belos, CTrilinos, Didasko, Epetra, EpetraExt, FEI,
ForTrilinos, Galeri, GlobiPack, Ifpack, Ifpack2, Intrepid, Isorropia, Kokkos,
Komplex, LOCA, Mesquite, ML, Moertel, MOOCHO, NOX, Optika, OptiPack, Pamgen,
Phalanx, Piro, Pliris, PyTrilinos, RTOp, Rythmos, Sacado, SEACAS, Shards,
ShyLU, STK, Stokhos, Stratimikos, Sundance, Teko, Teuchos, ThreadPool, Thyra,
Tpetra, TriKota, TrilinosCouplings, Trios, Triutils, Xpetra, Zoltan, Zoltan2.

Framework Release Notes:

  - The following packages have been switched to BSD-compatible licenses:
    Didasko, Ifpack, Ifpack2, Moertel, Stokhos, Stratimikos


 - This release includes 11 modules or classes of the Epetra package.

 - This package is still in its experimental stage and is only supported on AIX.

 - Sample configure scripts are provided in
   Trilinos/sampleScripts/aix-fortrilinos-serial and
   Trilinos/sampleScripts/aix-fortrilinos-mpif90 for serial and MPI builds

 - Because of the object-oriented features used, it requires an XL Fortran
   compiler v13.1. The source code can be compiled using the xlf compiler

 - Required compiler flags for Fortran include:

     -qfixed=72 -qxlines:   deals with older Fortran source code in other
                            Trilinos packages. These flags are used for mpi
                            builds and must be specified in the configure

     -qxlf2003=polymorphic: allows for the use of polymorphism in the source

     -qxlf2003=autorealloc: allows the compiler to automatically reallocate the
                            left hand side with the shape of the right hand side
                            when using allocatable variables in an assignment.

     -qfree=f90:            informs the compiler that the source code is free
                            form and conforms to Fortran 90.

     These flags (-qfree=f90 -qxlf2003=polymorphic -qxlf2003=autorealloc) are
     hardcoded in Trilinos/packages/ForTrilinos/CMakeLists.txt

 - Required compiler flags for xlc++ include:

     -qrtti=all:            this flag should be included in the configure

 - The project is primarily user-driven, so new interfaces are developed at the
   request of Trilinos users.


  - Relaxation: Use precomputed offsets to extract diagonal

    As of this release, Tpetra::CrsMatrix has the ability to precompute
    offsets of diagonal entries, and use them to accelerate extracting a
    copy of the diagonal. Relaxation now exploits this feature to speed up
    compute() (which extracts a copy of the diagonal of the input matrix).
    The optimization only occurs if the input matrix is a CrsMatrix (not
    just a RowMatrix) and if it has a const ("static") graph. The latter
    is necessary so that we know that the structure can't change between
    calls to compute(). (Otherwise we would have to recompute the offsets
    each time, which would be no more efficient than what it was doing


  - Non-backwards compatible change: Default Kokkos/Tpetra Node type is now
    Kokkos::SerialNode. Users appear to expect Tpetra's default behavior to be
    MPI-only. These users were therefore seeing unexpected performance when the
    default node was threaded, as was the case if any of the threading libraries
    (pthreads, TBB, OpenMP) was enabled. Therefore, after some discussion among
    Kokkos/Tpetra developers, it was decided to change the default Kokkos node
    (and therefore the default node used by Tpetra objects) to
    Kokkos::SerialNode. This can be overridden at
    configure time by specifying the following option to CMake when configuring

      -D KokkosClassic_DefaultNode:STRING="node_type"

    where node_type is one of the official Kokkos nodes:

      Kokkos::SerialNode    (current default)


  - Added polygon support to allow reading and writing of vtk files containing
    polygons and smoothing of meshes containing polygons using the Laplacian 

  - Rewrote the ShapeImprover wrapper to determine whether the mesh to be
    optimized is tangled. If it is tangled, the wrapper now uses a non-barrier
    metric; otherwise it uses a barrier metric.

  - Created a new directory structure underneath meshFiles/3D/vtk and 
    meshFiles/2D/vtk that arranges the mesh files into subdirectories 
    based on element type and whether they are tangled or untangled. 

  - Created new class MeshDomainAssoc to formally associate a Mesh instance
    with a Domain instance to verify that the mesh and domain are compatible.

  - Productionized the NonGradient solver.

  - Added new classes TMetricBarrier and TMetricNonBarrier to TMetric class to
    provide a clear division between the barrier and non-barrier target metric

  - Added new classes AWMetricBarrier and AWMetricNonBarrier to AWMetric class
    for the same reason as the TMetric classes.

  - Added a new error code "BARRIER_VIOLATED" to the MsgError class that is 
    issued when a barrier violation is encountered when using a barrier target
    metric class.

  - Added warning when MaxTemplate is used with any solver other than

  - Made a number of changes to the Quality Summary output to improve 
    readability and provide additional information.


  - Updated the NumPy interface to properly deal with deprecated
    code.  If PyTrilinos is compiled against an older NumPy, it still works,
    but if compiled against newer versions of NumPy, the deprecated
    code is avoided, as are the warnings.


  - Added optional automatic global reductions of pass/fail to Teuchos Unit
    Test Harness: Prior to this feature addition, only the result on the root
    process of a parallel unit test would determine pass/fail, even if tests on
    other processes failed.  This makes it easier to write parallel unit tests
    and results in more robust test code.  For a discussion, see Trilinos issue
    #5909. An example can be found in
    teuchos/comm/test/UnitTesting/UnitTestHarness_Parallel_UnitTests.cpp (see
    the CMakeLists.txt file for how that test is run).  NOTE: By default, no
    global reductions of pass/fail are done as to maintain perfect backward

  - Added new feature to TimeMonitor: You may now enable or disable a timer
    (instance of Time) by name.  Disabled timers ignore start() and stop()
    calls; calling these methods on a disabled timer does not change its elapsed
    time or call count.  Thus, TimeMonitor's constructor and destructor have no
    effect on disabled timers. However, the disabled timers still exist, and
    TimeMonitor's summarize() and report() class methods will print statistics
    for disabled timers (using their elapsed times and call counts while
    enabled).  Enabling a timer does not reset its elapsed time or call count. 
    This feature is useful if you want to time only certain invocations of a
    particular function that has an internal timer, without modifying the
    function's source code.  For an example, see
    packages/teuchos/comm/test/Time/TimeMonitor_UnitTests.cpp, line 175
    ("TimeMonitor, enableTimer" unit test).
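
    The enable/disable semantics described above can be illustrated with a
    small self-contained model.  This is only a sketch, not Teuchos code:
    SketchTimer and all of its members are hypothetical names, and time is
    passed in as integer ticks (an assumption made so that the behavior is
    deterministic).  Disabling a timer while it is running is outside the
    scope of the sketch.

```cpp
// Hypothetical stand-in for a named timer; this is NOT Teuchos::Time.
class SketchTimer {
public:
  void enable()  { enabled_ = true; }   // does not reset elapsed time or count
  void disable() { enabled_ = false; }  // later start()/stop() are ignored
  bool isEnabled() const { return enabled_; }

  // Begin a timed interval at the given tick, unless the timer is disabled.
  void start(long nowTicks) {
    if (!enabled_) return;
    running_ = true;
    startTicks_ = nowTicks;
  }

  // End the current interval, accumulating elapsed ticks and the call count.
  // A disabled timer (or one that was never started) is left unchanged.
  void stop(long nowTicks) {
    if (!enabled_ || !running_) return;
    running_ = false;
    elapsedTicks_ += nowTicks - startTicks_;
    ++numCalls_;
  }

  long elapsedTicks() const { return elapsedTicks_; }
  int numCalls() const { return numCalls_; }

private:
  bool enabled_ = true;
  bool running_ = false;
  long startTicks_ = 0;
  long elapsedTicks_ = 0;
  int numCalls_ = 0;
};
```

    In this model, a timer that is disabled around some invocations of a
    function and re-enabled around others reports only the enabled
    invocations, and its earlier totals are preserved across the switch.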


  - Fixed explicit template instantiation system in the generation of
    Thyra_XXX.hpp files to *not* include Thyra_XXX_def.hpp when explicit
    instantiation is turned on.  The refactoring of Thyra to use subpackages some
    time ago broke the generation of Thyra_XXX.hpp files in that they were
    always including Thyra_XXX_def.hpp files.  That was bad because it increased
    compile time for client code and allowed other includes to get pulled in
    silently. Now client code that includes Thyra_XXX.hpp when explicit
    instantiation is turned on will *not* get the include of
    Thyra_XXX_def.hpp.  This might break some downstream client code that was
    not properly including the necessary header files and was accidentally
    getting them from the Thyra_XXX_def.hpp files that were being silently
    included.  However, this technically does not break backward compatibility
    since client code should have been including the right headers all along. 
    For example, when GCC cleaned up their standard C++ header files this
    required existing C++ code to add a bunch of missing includes that should
    have been there the whole time.


  - Performance improvements to fillComplete (CrsGraph and CrsMatrix)

  - Performance improvements to Map's global-to-local index conversions

  - MPI performance optimizations

    Methods that perform communication between (MPI) processes do less
    communication than before.  This should improve performance,
    especially for large process counts, of the following operations:

      - Creating a Map
      - Creating an Import or Export communication plan
      - Executing an Import or Export (e.g., in a distributed sparse
        matrix-vector multiply, or in global finite element assembly)
      - Calling fillComplete() on a CrsGraph or CrsMatrix

  - Restrict a Map's communicator to processes with nonzero elements,
    and apply the result to a distributed object

    Map now has two new methods.  The first, removeEmptyProcesses(),
    returns a new Map with a new communicator, which contains only those
    processes which have a nonzero number of entries in the original Map.
    The second method, replaceCommWithSubset(), returns a new Map whose
    communicator is an arbitrary subset of processes of the original Map's
    communicator.  Distributed objects (subclasses of DistObject) also
    have a new removeEmptyProcessesInPlace() method, for applying in place
    the new Map created by calling removeEmptyProcesses() on the original
    Map over which the object was distributed.

    These methods are especially useful for algebraic multigrid.  At
    coarser levels of the multigrid hierarchy, it is helpful for
    performance to "rebalance" the matrices at those levels, so that a
    subset of processes share the elements.  This leaves the remaining
    processes without any elements.  Excluding them from the communicator
    reduces the cost of all-reduces and other communication operations
    necessary for creating the coarser levels of the hierarchy.
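
    The effect of excluding empty processes can be sketched without MPI.
    The function below is a hypothetical illustration, not part of Tpetra:
    given the number of Map entries on each process of the original
    communicator, it computes each process's rank in the reduced
    communicator, or -1 if the process is excluded.  Keeping the surviving
    processes in their original relative order is an assumption of the
    sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: models only the rank renumbering that results from
// dropping processes with zero entries.  It does not create a communicator.
std::vector<int> ranksAfterRemovingEmpty(
    const std::vector<std::size_t>& entriesPerRank) {
  std::vector<int> newRank(entriesPerRank.size(), -1);  // -1 means excluded
  int next = 0;
  for (std::size_t r = 0; r < entriesPerRank.size(); ++r) {
    if (entriesPerRank[r] > 0) {
      newRank[r] = next++;  // surviving processes are numbered consecutively
    }
  }
  return newRank;
}
```

    For example, with entry counts {4, 0, 3, 0}, the processes that held 4
    and 3 entries become ranks 0 and 1 of the reduced communicator, and the
    other two are excluded.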

  - CrsMatrix: Native SOR and Gauss-Seidel kernels

    These kernels improve the performance of Ifpack2 and MueLu.
    Gauss-Seidel is a special case of SOR (Successive Over-Relaxation).
    See the documentation of Ifpack2::Relaxation for details on the
    algorithm, which is actually a "hybrid" of Jacobi between MPI
    processes, and SOR (or Gauss-Seidel) within an MPI process.  The
    kernels also include the "symmetric" variant (forward and backward
    sweeps) of SOR and Gauss-Seidel.
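
    The within-process part of such a sweep can be sketched as follows.
    This is a minimal illustration, not the Tpetra kernel: CsrMatrix,
    sorRow, and sorSweep are hypothetical names, the matrix is a plain
    compressed sparse row structure on a single process, and every row is
    assumed to store exactly one nonzero diagonal entry.  A relaxation
    factor (omega) of 1 gives Gauss-Seidel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical local sparse matrix in compressed sparse row (CSR) format.
struct CsrMatrix {
  std::vector<std::size_t> rowPtr;  // row i occupies [rowPtr[i], rowPtr[i+1])
  std::vector<std::size_t> colInd;  // column index of each stored entry
  std::vector<double> values;       // value of each stored entry
};

// Relax row i in place, using the most recent values of x.
void sorRow(const CsrMatrix& A, const std::vector<double>& b,
            std::vector<double>& x, double omega, std::size_t i) {
  double diag = 0.0;  // assumption: the row stores a nonzero diagonal entry
  double rhs = b[i];
  for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
    const std::size_t j = A.colInd[k];
    if (j == i) {
      diag = A.values[k];
    } else {
      rhs -= A.values[k] * x[j];
    }
  }
  x[i] = (1.0 - omega) * x[i] + omega * rhs / diag;
}

// One forward sweep; if symmetric is true, follow it with a backward sweep.
void sorSweep(const CsrMatrix& A, const std::vector<double>& b,
              std::vector<double>& x, double omega, bool symmetric) {
  const std::size_t n = b.size();
  for (std::size_t i = 0; i < n; ++i) {
    sorRow(A, b, x, omega, i);
  }
  if (symmetric) {
    for (std::size_t i = n; i-- > 0;) {
      sorRow(A, b, x, omega, i);
    }
  }
}
```

    In the hybrid scheme described above, entries of x owned by other
    processes would be held fixed during the sweep (the Jacobi part); the
    sketch omits this because it models only one process.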

  - CrsMatrix: Precompute and reuse offsets of diagonal entries

    The (existing) one-argument version of CrsMatrix's getLocalDiagCopy()
    method requires the following operations per row:

      1. Convert current local row index to global, using the row Map
      2. Convert global index to local column index, using the column Map
      3. Search the row for that local column index

    Precomputing the offsets of diagonal entries and reusing them skips
    all these steps.  CrsMatrix has a new method getLocalDiagOffsets() to
    precompute the offsets, and a two-argument version of
    getLocalDiagCopy() that uses the precomputed offsets.  The precomputed
    offsets are not meant to be used in any way other than to be given to
    the two-argument version of getLocalDiagCopy().  They must be
    recomputed whenever the structure of the sparse matrix changes (by
    calling insertGlobalValues() or insertLocalValues()) or is optimized
    (e.g., by calling fillComplete() for the first time).
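
    The idea of precomputing and reusing diagonal offsets can be sketched
    on a plain compressed sparse row matrix.  This is an illustration
    only, not the Tpetra implementation: LocalCsr, computeDiagOffsets, and
    diagFromOffsets are hypothetical names, and the sketch assumes that
    local row and column indices coincide, so the index conversions listed
    above reduce to a search of each row.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical local sparse matrix in compressed sparse row (CSR) format.
struct LocalCsr {
  std::vector<std::size_t> rowPtr;  // row i occupies [rowPtr[i], rowPtr[i+1])
  std::vector<std::size_t> colInd;  // column index of each stored entry
  std::vector<double> values;       // value of each stored entry
};

// Sentinel offset for a row that stores no diagonal entry.
const std::size_t kNoDiag = static_cast<std::size_t>(-1);

// Search each row once and record where its diagonal entry sits in the row.
std::vector<std::size_t> computeDiagOffsets(const LocalCsr& A) {
  const std::size_t n = A.rowPtr.empty() ? 0 : A.rowPtr.size() - 1;
  std::vector<std::size_t> offsets(n, kNoDiag);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      if (A.colInd[k] == i) {
        offsets[i] = k - A.rowPtr[i];
        break;
      }
    }
  }
  return offsets;
}

// Extract the diagonal with one direct lookup per row; no searching.
// The offsets are valid only while the sparsity structure is unchanged.
std::vector<double> diagFromOffsets(const LocalCsr& A,
                                    const std::vector<std::size_t>& offsets) {
  std::vector<double> diag(offsets.size(), 0.0);
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    if (offsets[i] != kNoDiag) {
      diag[i] = A.values[A.rowPtr[i] + offsets[i]];
    }
  }
  return diag;
}
```

    The offsets are computed once and can then be reused each time the
    values change, as long as the structure stays the same.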

  - CrsGraph,CrsMatrix: Added "No Nonlocal Changes" parameter to

    The fillComplete() method accepts an optional ParameterList which
    controls the behavior of fillComplete(), as opposed to behavior of the
    object in general.  "No Nonlocal Changes" is a bool parameter which is
    false by default.  Its value must be the same on all processes in the
    graph or matrix's communicator.  If the parameter is true, the caller
    asserts that no entries were inserted in nonowned rows.  This lets
    fillComplete() skip the global communication that checks whether any
    processes inserted any entries in nonowned rows.

  - Default Kokkos/Tpetra Node type is now Kokkos::SerialNode

    NOTE: This change breaks backwards compatibility.

    Users expect that Tpetra by default uses "MPI only" for parallelism,
    rather than "MPI plus threads."  These users were therefore
    experiencing unexpected performance issues when the default Kokkos
    Node type was threaded, as was the case if Trilinos' support for any of
    the threading libraries (Pthreads, TBB, OpenMP) was enabled.  Trilinos
    detects and enables support for Pthreads automatically on many
    platforms.  Therefore, after some discussion among Kokkos and Tpetra
    developers, we decided to change the default Kokkos Node type (and
    therefore, the default Node used by Tpetra objects) to
    Kokkos::SerialNode. This can be overridden at configure time by
    specifying the following option to CMake when configuring Trilinos:

    -D KokkosClassic_DefaultNode:STRING="<node_type>"

    where <node_type> is any of the official Kokkos Node types, such as:
    - Kokkos::SerialNode (current default)
    - Kokkos::TBBNode
    - Kokkos::TPINode
    - Kokkos::OpenMPNode