spral_ssids
- Sparse Symmetric Indefinite Direct Solver
Purpose
This package solves one or more sets of \(n\times n\) sparse symmetric equations \(AX = B\) using a multifrontal method. The following cases are covered:
1. \(A\) is indefinite. SSIDS computes the sparse factorization
where \(P\) is a permutation matrix, \(L\) is unit lower triangular, and \(D\) is block diagonal with blocks of size \(1 \times 1\) and \(2 \times 2\).
2. \(A\) is positive definite. SSIDS computes the sparse Cholesky factorization
where \(P\) is a permutation matrix and \(L\) is lower triangular. However, as SSIDS is designed primarily for indefinite systems, this may be slower than a dedicated Cholesky solver.
The code optionally supports hybrid computation using one or more NVIDIA GPUs.
SSIDS returns bit-compatible results.
An option exists to scale the matrix. In this case, the factorization of the scaled matrix \(\overline{A} = S A S\) is computed, where \(S\) is a diagonal scaling matrix.
Usage overview
Solving \(AX=B\) using SSIDS is a four stage process.
Call
ssids_analyse()
to perform a symbolic factorization, stored in akeep.Call
ssids_factor()
to perform a numeric factorization, stored in fkeep. More than one numeric factorization can refer to the same akeep.Call
ssids_solve()
to perform a solve with the factors. More than one solve can be performed with the same fkeep.Once all desired solutions have been performed, free memory with
ssids_free()
.
In addition, advanced users may use the following routines:
ssids_enquire_posdef()
andssids_enquire_indef()
return the diagonal entries of the factors and the pivot sequence.ssids_alter()
allows altering the diagonal entries of the factors.
Note
Bit-compatibility: If used with bit-compatible BLAS and compiler options, this routine will return bit compatible results. That is, consecutive runs with the same data on the same machine produces exactly the same solution.
Basic Subroutines
In the below, all reals are double precision unless otherwise indicated (as described in Data types and conventions).
Note
For the most efficient use of the package, CSC format should be used without checking.
- subroutine ssids_analyse(check, n, ptr, row, akeep, options, inform[, order, val, topology])
Perform the analyse (symbolic) phase of the factorization for a matrix supplied in CSC format. The resulting symbolic factors stored in akeep should be passed unaltered in the subsequent calls to ssids_factor().
- Parameters:
check [logical ,in] :: if true, matrix data is checked. Out-of-range entries are dropped and duplicate entries are summed.
n [integer ,in] :: number of columns in \(A\).
ptr (n+1) [integer(long),in] :: column pointers for \(A\) (see CSC format).
row (ptr(n+1)-1) [integer ,in] :: row indices for \(A\) (see CSC format).
akeep [ssids_akeep ,out] :: returns symbolic factorization, to be passed unchanged to subsequent routines.
options [ssids_options ,in] :: specifies algorithm options to be used (see
ssids_options
).inform [ssids_inform ,out] :: returns information about the execution of the routine (see
ssids_inform
]).
- Options:
order (n) [integer ,inout] :: on entry a user-supplied ordering (options%ordering=0). On return, the actual ordering used (if present).
val (ptr(n+1)-1) [real ,in] :: non-zero values for \(A\) (see CSC format). Only used if a matching-based ordering is requested.
topology (*) [numa_region ,in] :: If present, specifies the machine topology to be exploited. The size of the machine is the number of independent NUMA regions. Region i will use topology(i)%nproc threads and is associated with the GPUs in the array topology(i)%gpus, which may have length 0. If not present, these parameters are auto-detected using the hwloc library (if detected at compiled time) or the environment variable OMP_NUM_THREADS if the hwloc library is not available. See the method section for details of how work is divided.
Note
If a user-supplied ordering is used, it may be altered by this routine, with the altered version returned in order(:). This version will be equivalent to the original ordering, except that some supernodes may have been amalgamated, a topographic ordering may have been applied to the assembly tree and the order of columns within a supernode may have been adjusted to improve cache locality.
Note
A version where ptr is of kind default integer is also provided for backwards compatibility.
- subroutine ssids_analyse_coord(n, ne, row, col, akeep, options, inform[, order, val, topology])
As
ssids_analyse()
, but for coordinate data. The variant parameters are:- Parameters:
ne [integer(long),in] :: number of non-zero entries in \(A\).
row (ne) [integer ,in] :: row indices for \(A\) (see Coordinate format).
col (ne) [integer ,in] :: column indices for \(A\) (see Coordinate format).
- subroutine ssids_factor(posdef, val, akeep, fkeep, options, inform[, scale, ptr, row])
- Parameters:
posdef [logical ,in] :: true if matrix is positive-definite
val (*) [real ,in] :: non-zero values for \(A\) in same format as for the call to
ssids_analyse()
orssids_analyse_coord()
.akeep [ssids_akeep ,in] :: symbolic factorization returned by preceding call to
ssids_analyse()
orssids_analyse_coord()
.fkeep [ssids_fkeep ,inout] :: returns numeric factorization, to be passed unchanged to subsequent routines.
options [ssids_options ,in] :: specifies algorithm options to be used (see
ssids_options
).inform [ssids_inform ,out] :: returns information about the execution of the routine (see
ssids_inform
).
- Options:
scale (n) [real ,inout] :: diagonal scaling. scale(i) contains entry \(S_{ii}\) of \(S\). Must be supplied by user if options%scaling=0 (user-supplied scaling). On exit, return scaling used.
ptr (n+1) [integer(long),in] :: column pointers for \(A\), only required if
akeep
was obtained by runningssids_analyse()
withcheck
=.true., in which case it must be unchanged since that call.row (ptr(n+1)-1) [integer ,in] :: row indices for \(A\), only required if
akeep
was obtained by runningssids_analyse()
withcheck
=.true., in which case it must be unchanged since that call.
Note
A version where ptr is of kind default integer is also provided for backwards compatibility.
- subroutine ssids_solve(x, akeep, fkeep, options, inform[, job])
Solve (for a single right-hand side) one of the following equations:
job
Equation solved
0 (or absent)
\(Ax=b\)
1
\(PLx=Sb\)
2
\(Dx=b\)
3
\((PL)^TS^{-1}x=b\)
4
\(D(PL)^TS^{-1}x=b\)
Recall \(A\) has been factorized as either:
\(SAS = (PL)(PL)^T~\) (positive-definite case); or
\(SAS = (PL)D(PL)^T\) (indefinite case).
- Parameters:
x (n) [real ,in] :: right-hand side \(b\) on entry, solution \(x\) on exit.
akeep [ssids_akeep ,in] :: symbolic factorization returned by preceding call to
ssids_analyse()
orssids_analyse_coord()
.fkeep [ssids_fkeep ,in] :: numeric factorization returned by preceding call to
ssids_factor()
.options [ssids_options ,in] :: specifies algorithm options to be used (see
ssids_options
).inform [ssids_inform ,out] :: returns information about the execution of the routine (see
ssids_inform
).
- Options:
job [integer ,in] :: specifies equation to solve, as per above table.
- subroutine ssids_solve(nrhs, x, ldx, akeep, fkeep, options, inform[, job])
Solve (for multiple right-hand sides) one of the following equations:
job
Equation solved
0 (or absent)
\(AX=B\)
1
\(PLX=SB\)
2
\(DX=B\)
3
\((PL)^TS^{-1}X=B\)
4
\(D(PL)^TS^{-1}X=B\)
Recall \(A\) has been factorized as either:
\(SAS = (PL)(PL)^T~\) (positive-definite case); or
\(SAS = (PL)D(PL)^T\) (indefinite case).
- Parameters:
nrhs [integer ,in] :: number of right-hand sides.
x (ldx,nrhs) [real ,inout] :: right-hand sides \(B\) on entry, solutions \(X\) on exit.
ldx [integer ,in] :: leading dimension of
x
.akeep [ssids_akeep ,in] :: symbolic factorization returned by preceding call to
ssids_analyse()
orssids_analyse_coord()
.fkeep [ssids_fkeep ,in] :: numeric factorization returned by preceding call to
ssids_factor()
.options [ssids_options ,in] :: specifies algorithm options to be used (see
ssids_options
).inform [ssids_inform ,out] :: returns information about the execution of the routine (see
ssids_inform
).
- Options:
job [integer ,in] :: specifies equation to solve, as per above table.
- subroutine ssids_free([akeep, fkeep, ]cuda_error)
Frees memory and resources associated with
akeep
and/orfkeep
, at least one of which must be present.- Parameters:
akeep [ssids_akeep ,inout] :: symbolic factors to be freed.
fkeep [ssids_fkeep ,inout] :: numeric factors to be freed.
cuda_error [integer ,out] :: 0 on success, or a CUDA error code on failure.
Warning
ssids_free()
must be called by the user. Merely deallocatingakeep
orfkeep
, or allowing them to go out of scope will result in memory leaks due to underlying memory allocation performed using C++ and CUDA mechanisms.akeep
should only be deallocated after all associated numeric factorizationsfkeep
have been freed.
Advanced subroutines
- subroutine ssids_enquire_posdef(akeep, fkeep, options, inform, d)
Return the diagonal entries of the Cholesky factor.
- Parameters:
akeep [ssids_akeep ,in] :: symbolic factorization returned by preceding call to
ssids_analyse()
orssids_analyse_coord()
.fkeep [ssids_fkeep ,in] :: numeric factorization returned by preceding call to
ssids_factor()
.options [ssids_options ,in] :: specifies algorithm options to be used (see
ssids_options
).inform [ssids_inform ,out] :: returns information about the execution of the routine (see
ssids_inform
).d (n) [real ,out] :: returns the diagonal of \(L\). d(i) stores the entry \(L_{ii}\).
- subroutine ssids_enquire_indef(akeep, fkeep, options, inform[, piv_order, d])
Return the pivot order and/or values of \(D\) of the Symmetric Indefinite Factorization.
- Parameters:
akeep [ssids_akeep ,in] :: symbolic factorization returned by preceding call to
ssids_analyse()
orssids_analyse_coord()
.fkeep [ssids_fkeep ,in] :: numeric factorization returned by preceding call to
ssids_factor()
.options [ssids_options ,in] :: specifies algorithm options to be used (see
ssids_options
).inform [ssids_inform ,out] :: returns information about the execution of the routine (see
ssids_inform
).
- Options:
piv_order (n) [integer ,out] :: returns the pivot order. \(|\,\texttt{piv_order(i)}|\) gives the position of variable \(i\) in the pivot order. The sign will be positive if \(i\) is a \(1\times1\) pivot, and negative if \(i\) is part of a \(2 \times 2\) pivot.
d (2,n) [real ,out] :: returns the \(2\times2\) block diagonal of \(D\). d(1,i) stores \(D_{ii}\) and d(2,i) stores \(D_{(i+1)i}\).
- subroutine ssids_alter(d, akeep, fkeep, options, inform)
Alter the entries of the diagonal factor \(D\) for a symmetric indefinite factorization. The pivot order remains the same.
- Parameters:
d (2,n) [real ,in] :: New entries of \(D\). d(1,i) stores \(D_{ii}\) and d(2,i) stores \(D_{(i+1)i}\).
akeep [ssids_akeep ,in] :: symbolic factorization returned by preceding call to
ssids_analyse()
orssids_analyse_coord()
.fkeep [ssids_fkeep ,in] :: numeric factorization returned by preceding call to
ssids_factor()
.options [ssids_options ,in] :: specifies algorithm options to be used (see
ssids_options
).inform [ssids_inform ,out] :: returns information about the execution of the routine (see
ssids_inform
).
Derived types
- type ssids_options
The derived data type ssids_options is used to specify the options used within
SSIDS
. The components, that are automatically given default values in the definition of the type, are:- Type fields:
% print_level [integer ,default=0] ::
the level of printing. The different levels are:
< 0
No printing.
= 0
Error and warning messages only.
= 1
As 0, plus basic diagnostic printing.
> 1
As 1, plus some additional diagnostic printing.
% unit_diagnostics [integer ,default=6] :: Fortran unit number for diagnostics printing. Printing is suppressed if <0.
% unit_error [integer ,default=6] :: Fortran unit number for printing of error messages. Printing is suppressed if <0.
% unit_warning [integer ,default=6] :: Fortran unit number for printing of warning messages. Printing is suppressed if <0.
% ordering [integer ,default=1] ::
Ordering method to use in analyse phase:
0
User-supplied ordering is used (order argument to
ssids_analyse()
orssids_analyse_coord()
).1 (default)
METIS ordering with default settings.
2
Matching-based elimination ordering is computed (the Hungarian algorithm is used to identify large off-diagonal entries. A restricted METIS ordering is then used that forces these on to the subdiagonal).
Note: This option should only be chosen for indefinite systems. A scaling is also computed that may be used in
ssids_factor()
(see %scaling below).% nemin [integer ,default=32] :: supernode amalgamation threshold. Two neighbours in the elimination tree are merged if they both involve fewer than nemin eliminations. The default is used if nemin<1.
% ignore_numa [logical ,default=true] :: If true, all CPUs and GPUs are treated as belonging to a single NUMA region.
% use_gpu [logical ,default=true] :: Use an NVIDIA GPU if present.
% min_gpu_work [integer(long),default=5e9] :: Minimum number of flops in subtree before scheduling on GPU.
% max_load_inbalance [real ,default=1.2] :: Maximum permissible load inbalance for leaf subtree allocations. Values less than 1.0 are treated as 1.0.
% gpu_perf_coeff [real ,default=1.0] :: GPU performance coefficient. How many times faster a GPU is than CPU at factoring a subtree.
% scaling [integer ,default=0] ::
scaling algorithm to use:
<=0 (default)
No scaling (if
scale(:)
is not present on call tossids_factor()
, or user-supplied scaling (ifscale(:)
is present).=1
Compute using weighted bipartite matching via the Hungarian Algorithm (
MC64
algorithm).=2
Compute using a weighted bipartite matching via the Auction Algorithm (may be lower quality than that computed using the Hungarian Algorithm, but can be considerably faster).
=3
Use matching-based ordering generated during the analyse phase using options%ordering=2. The scaling will be the same as that generated with options%scaling= 1 if the matrix values have not changed. This option will generate an error if a matching-based ordering was not used during analysis.
>=4
Compute using the norm-equilibration algorithm of Ruiz (see spral_scaling - Sparse matrix scalings).
% small_subtree_threshold [integer(long),default=4e6] :: Maximum number of flops in a subtree treated as a single task. See method section.
% cpu_block_size [integer ,default=256] :: Block size to use for parallelization of large nodes on CPU resources.
% action [logical ,default=.true.] :: continue factorization of singular matrix on discovery of zero pivot if true (a warning is issued), or abort if false.
% pivot_method [integer ,default=2] ::
Pivot method to be used on CPU, one of:
1
Aggressive a posteori pivoting. Cholesky-like communication pattern is used, but a single failed pivot requires restart of node factorization and potential recalculation of all uneliminated entries.
2 (default)
Block a posteori pivoting. A failed pivot only requires recalculation of entries within its own block column.
3
Threshold partial pivoting. Not parallel.
% small [real ,default=1d-20] :: threshold below which an entry is treated as equivalent to 0.0.
% u [real ,default=0.01] :: relative pivot threshold used in symmetric indefinite case. Values outside of the range \([0,0.5]\) are treated as the closest value in that range.
- type ssids_inform
Used to return information about the progress and needs of the algorithm.
- Type fields:
% cpu_flops [integer(long)] :: number of flops performed on CPU
% cublas_error [integer ] :: CUBLAS error code in the event of a CUBLAS error (0 otherwise).
% cuda_error [integer ] :: CUDA error code in the event of a CUDA error (0 otherwise). Note that due to asynchronous execution, CUDA errors may not be reported by the call that caused them.
% flag [integer ] :: exit status of the algorithm (see table below).
% gpu_flops [integer(long)] :: number of flops performed on GPU
% matrix_dup [integer ] :: number of duplicate entries encountered (if
ssids_analyse()
called with check=true, or any call tossids_analyse_coord()
).% matrix_missing_diag [integer ] :: number of diagonal entries without an explicit value (if
ssids_analyse()
called with check=true, or any call tossids_analyse_coord()
).% matrix_outrange [integer ] :: number of out-of-range entries encountered (if
ssids_analyse()
called with check=true, or any call tossids_analyse_coord()
).% matrix_rank [integer ] :: (estimated) rank (structural after analyse phase, numerical after factorize phase).
% maxdepth [integer ] :: maximum depth of the assembly tree.
% maxfront [integer ] :: maximum front size (without pivoting after analyse phase, with pivoting after factorize phase).
% maxsupernode [integer ] :: maximum supernode size (without pivoting after analyse phase, with pivoting after factorize phase).
% num_delay [integer ] :: number of delayed pivots. That is, the total number of fully-summed variables that were passed to the father node because of stability considerations. If a variable is passed further up the tree, it will be counted again.
% num_factor [integer(long)] :: number of entries in \(L\) (without pivoting after analyse phase, with pivoting after factorize phase).
% num_flops [integer(long)] :: number of floating-point operations for Cholesky factorization (indefinte needs slightly more). Without pivoting after analyse phase, with pivoting after factorize phase.
% num_neg [integer ] :: number of negative eigenvalues of the matrix \(D\) after factorize phase.
% num_sup [integer ] :: number of supernodes in assembly tree.
% num_two [integer ] :: number of \(2 \times 2\) pivots used by the factorization (i.e. in the matrix \(D\)).
% stat [integer ] :: Fortran allocation status parameter in event of allocation error (0 otherwise).
inform%flag
Return status
0
Success.
-1
Error in sequence of calls (may be caused by failure of a preceding call).
-2
n<0 or ne<1.
-3
Error in ptr(:).
-4
CSC format: All variable indices in one or more columns are out-of-range.
Coordinate format: All entries are out-of-range.
-5
Matrix is singular and options%action=.false.
-6
Matrix found not to be positive definite but posdef=true.
-7
ptr(:) and/or row(:) not present, but required as
ssids_analyse()
was called with check=.false,.-8
options%ordering out of range, or options%ordering=0 and order parameter not provided or not a valid permutation.
-9
options%ordering=-2 but val(:) was not supplied.
-10
ldx<n or nrhs<1.
-11
job is out-of-range.
-13
Called
ssids_enquire_posdef()
on indefinite factorization.-14
Called
ssids_enquire_indef()
on positive-definite factorization.-15
options%scaling=3 but a matching-based ordering was not performed during analyse phase.
-50
Allocation error. If available, the stat parameter is returned in inform%stat.
-51
CUDA error. The CUDA error return value is returned in inform%cuda_error.
-52
CUBLAS error. The CUBLAS error return value is returned in inform%cublas_error.
-53
OpenMP cancellation is disabled. Please set the environment variable OMP_CANCELLATION=true.
+1
Out-of-range variable indices found and ignored in input data. inform%matrix_outrange is set to the number of such entries.
+2
Duplicate entries found and summed in input data. inform%matrix_dup is set to the number of such entries.
+3
Combination of +1 and +2.
+4
One or more diagonal entries of \(A\) are missing.
+5
Combination of +4 and +1 or +2.
+6
Matrix is found be (structurally) singular during analyse phase. This will overwrite any of the above warning flags.
+7
Matrix is found to be singular during factorize phase.
+8
Matching-based scaling found as side-effect of matching-based ordering ignored (consider setting options%scaling=3).
+50
OpenMP processor binding is disabled. Consider setting the environment variable OMP_PROC_BIND=true (this may affect performance on NUMA systems).
Example
Suppose we wish to factorize the matrix
and then solve for the right-hand side
The following code may be used.
! examples/Fortran/ssids.f90 - Example code for SPRAL_SSIDS package
program ssids_example
use spral_ssids
implicit none
integer, parameter :: long = selected_int_kind(18)
! Derived types
type (ssids_akeep) :: akeep
type (ssids_fkeep) :: fkeep
type (ssids_options) :: options
type (ssids_inform) :: inform
! Parameters
integer, parameter :: wp = kind(0.0d0)
! Matrix data
logical :: posdef
integer :: n, row(9)
integer(long) :: ptr(6)
real(wp) :: val(9)
! Other variables
integer :: piv_order(5), cuda_error
real(wp) :: x(5)
logical :: check
! Data for matrix:
! ( 2 1 )
! ( 1 4 1 1 )
! ( 1 3 2 )
! ( 2 -1 )
! ( 1 2 )
posdef = .false.
n = 5
ptr(1:n+1) = (/ 1, 3, 6, 8, 9, 10 /)
row(1:ptr(n+1)-1) = (/ 1, 2, 2, 3, 5, 3, 4, 4, 5 /)
val(1:ptr(n+1)-1) = (/ 2.0, 1.0, 4.0, 1.0, 1.0, 3.0, 2.0, -1.0, 2.0 /)
! The right-hand side with solution (1.0, 2.0, 3.0, 4.0, 5.0)
x(1:n) = (/ 4.0, 17.0, 19.0, 2.0, 12.0 /)
! Perform analyse and factorise with data checking
check = .true.
call ssids_analyse(check, n, ptr, row, akeep, options, inform)
if(inform%flag<0) go to 100
call ssids_factor(posdef, val, akeep, fkeep, options, inform)
if(inform%flag<0) go to 100
! Solve
call ssids_solve(x,akeep,fkeep,options,inform)
if(inform%flag<0) go to 100
write(*,'(a,/,(3es18.10))') ' The computed solution is:', x(1:n)
! Determine and print the pivot order
call ssids_enquire_indef(akeep, fkeep, options, inform, piv_order=piv_order)
write(*,"(a,10i5)") ' Pivot order:', piv_order(1:n)
100 continue
call ssids_free(akeep, fkeep, cuda_error)
end program ssids_example
This produces the following output:
The computed solution is:
1.0000000000E+00 2.0000000000E+00 3.0000000000E+00
4.0000000000E+00 5.0000000000E+00
Pivot order: -3 4 -1 0 -2
Driver Program
SSIDS ships with a driver program spral_ssids
that allows reading a
matrix in Rutherford-Boeing format specified as a command-line argument and
factorizing it. There are a number of other command-line arguments that
configure the factorization.
- program spral_ssids
SSIDS driver program.
- Parameters:
filename ::
Rutherford-Boeing matrix filename (default if not specified is matrix.rb).
- --scale=none
use no scaling (the default).
- --scale=mc64
use the Hungarian scaling algorithm (as in MC64).
- --scale=auction
use the Auction scaling algorithm.
- --scale=mc77
use the norm-equilibration scaling algorithm (as in MC77).
- --ordering=mc64-metis
use matching-based ordering and scaling (scale is overwritten).
- --force-posdef
force the matrix to be positive definite
- --posdef
assume the matrix is positive definite.
- --time-scaling
time the scaling routine.
- --nrhs
set the number of right-hand sides [integer,default=1].
- --nemin
set the supernode amalgamation threshold [integer,default=32].
- --u
set the relative pivot threshold used in the symmetric indefinite case [real,default=0.01].
- --max-load-inbalance
set the maximum permissible load inbalance for leaf subtree allocations [real,default=1.2].
- --pivot-method=app-aggressive
use aggressive a posteori pivoting.
- --pivot-method=app-block
use block a posteori pivoting (the default).
- --pivot-method=tpp
use threshold partial pivoting.
- --flat-topology
force a flat machine topology (the default).
- --no-flat-topology
use the actual machine topology.
- --disable-gpu
don’t use an NVIDIA GPU if present.
- --min-gpu-work
set the minimum number of flops in a subtree before scheduling on GPU [integer(long),default=5e9].
- --gpu-perf-coeff
set the GPU performance coefficient (how many times faster a GPU is than CPU at factoring a subtree) [real,default=1.0].
- --small-subtree-threshold
set the maximum number of flops in a subtree treated as a single task [integer(long),default=4e6].
- --cpu-block-size
set the block size to use for parallelization of large nodes on CPU resources [integer ,default=256].
- --no-ignore-numa
don’t treat all CPUs and GPUs as belonging to a single NUMA region (which is the default).
- --ngpus
set the number of NVIDIA GPUs to use [integer,default=0].
For example, to use auction scaling with two right-hand sides on the linverse.rb matrix:
./spral_ssids linverse.rb --scale=auction --nrhs 2
This produces output similar to the following:
The computed solution is:
Set scaling to Auction
solving for 2 right-hand sides
Reading 'linverse.rb'...
ok
Number of CUDA devices: 0
Forcing topology to 32
Using 0 GPUs
Used order 1
ok
Analyse took 5.20000011E-02
Predict nfact = 3.03E+05
Predict nflop = 9.25E+06
nparts 1
cpu_fl 9.25E+06
gpu_fl 0.00E+00
Factorize...
ok
Factor took 1.20000001E-02
Solve...
ok
Solve took 1.00000005E-03
number bad cmp = 0
fwd error || ||_inf = 3.8014036363165360E-013
bwd error scaled = 4.2549737582555113E-015 4.2549737582555113E-015
cmp: SMFCT
anal: 0.05
fact: 0.01
afact: 3.03E+05
aflop: 9.25E+06
nfact: 3.03E+05
nflop: 9.25E+06
delay: 0
inerti 2838 0 9161
2x2piv 5502
maxfro 55
maxsup 52
not_fi 0
not_se 0
Method
Partition of work across available resources
Once the ordering has been determined and the assembly tree determined in the analyse phase, the tree is broken into a number of leaf subtrees rooted at a single node, leaving a root subtree/forest consisting of all remaining nodes above those. Each leaf subtree is pre-assigned to a particular NUMA region or GPU for the factorization phase. Details of the algorithm used for finding these subtrees and their assignment can be found in the paper [2].
The factorization phase has two steps. In the first, leaf subtrees are factorized in parallel on their assigned resources. Once all leaf subtrees are factored, the second phase begins where all CPU resources cooperate to factorize the root subtree.
At present the solve phase is performed in serial.
Data checking
If check is set to .true. on the call to ssids_analyse()
or if
ssids_analyse_coord()
is called, the user-supplied matrix data is
checked for errors. The cleaned integer matrix data (duplicates are
summed and out-of-range indices discarded) is stored in akeep. The use
of checking is optional on a call to ssids_analyse()
as it incurs both
time and memory overheads. However, it is recommended since the
behaviour of the other routines in the package is unpredictable if
duplicates and/or out-of-range variable indices are entered.
If the user has supplied an elimination order it is checked for errors.
Otherwise, an elimination order is generated by the package. The
elimination order is used to construct an assembly tree. On exit from
ssids_analyse()
(and ssids_analyse_coord()
), order(:) is
set so that order(i) holds the position of variable \(i\) in the
elimination order. If an ordering was supplied by the user, this order may
differ, but will be equivalent in terms of fill-in.
Factorization performed
The factorization performed depends on the value of posdef.
If posdef=true, a Cholesky factorization is performed:
Pivoting is not performed, so \(P\) is the permutation determined in the analysis phase. \(S\) is a diagonal scaling matrix.
If posdef=false, a symmetric indefinite factorization is performed:
Pivoting is performed, so \(P\) may differ from the permutation determined in the analysis phase, though it is kept as close as possible to minimize fill. The exact pivoting algorithm varies depending on the particular kernel the algorithm chooses to employ.
Full details of the algorithms used are provided in [1] for GPUs and [2] for CPUs.
Small Leaf Subtrees
For subtrees allocated to run on the CPU, the factorization of small nodes near the leaves of the tree can be amalgamated into a single parallel task (normally each would be treated as its own OpenMP task to be scheduled). This can reduce scheduling overheads, especially on small problems. If the total number of operations for a subtree root at a given node is less than options.small_subtree_threshold, that subtree is treated as a single task.