Global Settings

In block2, we try to minimize the use of global variables. Two global variables (frame_() and threading_()) are used for controlling global settings such as stack memory, scratch folder and threading schemes.

Note that in block2 the distributed parallelization scheme is handled locally.

Threading

enum class block2::ThreadingTypes : uint8_t

An indicator for where the openMP shared-memory threading should be activated. In the case of nested openMP, the total number of nested threading layers is determined from this enumeration.

For each enumerator, the number in brackets is the total number of threading layers.

Values:

enumerator SequentialGEMM: [0] seq mkl

enumerator BatchedGEMM: [1] parallel mkl

enumerator Quanta: [1] openmp quanta + seq mkl

enumerator QuantaBatchedGEMM: [2] openmp quanta + parallel mkl

enumerator Operator: [1] openmp operator

enumerator OperatorBatchedGEMM: [2] openmp operator + parallel mkl

enumerator OperatorQuanta: [2] openmp operator + openmp quanta

enumerator OperatorQuantaBatchedGEMM: [3] openmp operator + openmp quanta + parallel mkl

enumerator Global: [1] openmp for general non-core-algorithm tasks

enum class block2::SeqTypes : uint8_t

Method of GEMM (dense matrix multiplication) parallelism. For CSR matrix multiplication, the only possible case is SeqTypes::None, but one can still use SeqTypes::Simple and it will only parallelize dense matrix multiplication.

Values:

enumerator None: GEMM are not parallelized. Parallelism may happen inside each GEMM, if a threaded version of MKL is linked.

enumerator Simple: GEMM written to the different outputs are parallelized, otherwise they are executed in sequential. With this mode, the code will sort and divide GEMM to several groups (batches). Inside each batch, the output addresses are guarenteed to be different. The cblas_dgemm_batch is invoked to compute each batch.

enumerator Auto: DGEMM automatically divided into several batches only when there are data dependency. Conflicts of output are automatically resolved by introducing temporary arrays. The cblas_dgemm_batch is invoked to compute each batch. This option normally requires a large amount of time for preprocessing and it will introduce a large number of temporary arrays, which is not memory friendly.

enumerator Tasked: GEMM will be evenly divided into n_threads groups, Different groups are executed in different threads. Since different threads may write into the same output array, there is an additional reduction step after all GEMM finishes. This mode is mainly implemented for Davidson matrix-vector step (tensor_product_multiply), where the size of the output array (wavefunction) is small compared to that of all input arrays. For blocking/rotation step, SeqTypes::Tasked has no effect and it is equivalent to SeqTypes::None. The cblas_dgemm_batch is not used in this mode.

enumerator SimpleTasked: This is the same as SeqTypes::Tasked for the Davidson matrix-vector step, and the same as SeqTypes::Simple for other steps.

struct Threading

Global information for threading schemes.

Public Functions

inline bool openmp_available() const: Whether openmp compiler option is set.

inline bool tbb_available() const: Whether tbb memory allocator is used.

inline bool mkl_available() const: Whether MKL math library is used.

inline bool blis_available() const: Whether BLIS math library is used.

inline bool complex_available() const: Whether complex number extension is used.

inline bool single_precision_available() const: Whether single precision extension is used.

inline bool ksymm_available() const: Whether K symmetry extension is used.

inline string get_mkl_version() const

Check version of the linked MKL library.

Returns:: A version string of the linked MKL library if MKL is linked, or an empty string otherwise.

inline string get_mkl_threading_type() const: Return a string indicating which threaded MKL library is linked.

inline string get_seq_type() const: Return a string indicating which SeqTypes is used.

inline string get_align_type() const: Return a string indicating which AlignTypes is used.

inline int get_thread_id() const: If inside an openMP parallel region, return the id of the current thread.

inline int activate_global() const

Set number of threads for a general task. Parallelism inside MKL will be deactivated for a general task.

Returns:: Number of threads for general tasks. Returns 1 if openMP should not be used for a general task.

inline int activate_global_mkl() const

Set number of threads for a general task with parallelism inside MKL. Parallelism outside MKL will be deactivated.

Returns:: Number of threads for general tasks. Returns 1 if MKL is not supported.

inline int activate_normal() const

Set number of threads for a normal (parallelism over renormalized operators) task.

Returns:: Number of threads for parallelism over renormalized operators.

inline int activate_operator() const

Set number of threads for parallelism over renormalized operators.

Returns:: Number of threads for parallelism over renormalized operators.

inline int activate_quanta() const

Set number of threads for parallelism over symmetry sectors.

Returns:: Number of threads for parallelism over symmetry sectors.

inline Threading(): Default constructor. Uses ThreadingTypes::Global | ThreadingTypes::BatchedGEMM with maximal available number of threads, and SeqTypes::None for dense matrix multiplication.

inline Threading(ThreadingTypes type, int nta = -1, int ntb = -1, int ntc = -1, int ntd = -1)

Constructor.

Parameters:

type – Type of the threading scheme.
nta – Number of threads for a general task (if ThreadingTypes::Global is set) or number of threads in the first threading layer.
ntb – Number of threads in the first threading layer for a non-general threaded task (if ThreadingTypes::Global is set) or number of threads in the second threading layer.
ntc – Number of threads in the second threading layer for a non-general threaded task (if ThreadingTypes::Global is set) or number of threads in the third threading layer.
ntd – Number of threads in the third threading layer for a non-general threaded task (if ThreadingTypes::Global is set).
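The meaning of nta to ntd shifts by one position depending on whether ThreadingTypes::Global is set. The sketch below restates the parameter list above as code. DemoThreadCounts and demo_assign are hypothetical names, and the sketch deliberately does not show how the layers map onto n_threads_op, n_threads_quanta and n_threads_mkl, since that depends on the chosen scheme.

```cpp
#include <array>
#include <cassert>

// Hypothetical result type: thread counts as described in the parameter
// list above. A value of -1 means "not specified".
struct DemoThreadCounts {
    int global = -1;                         // threads for general tasks
    std::array<int, 3> layers{{-1, -1, -1}}; // first, second, third layer
};

// global_set: whether ThreadingTypes::Global is part of the scheme.
inline DemoThreadCounts demo_assign(bool global_set, int nta = -1,
                                    int ntb = -1, int ntc = -1,
                                    int ntd = -1) {
    DemoThreadCounts r;
    if (global_set) {
        // nta is consumed by general tasks; the layers start at ntb.
        r.global = nta;
        r.layers = {{ntb, ntc, ntd}};
    } else {
        // The layers start at nta; ntd is unused in this case.
        r.layers = {{nta, ntb, ntc}};
    }
    return r;
}
```

For example, demo_assign(true, 8, 4, 2, 1) yields 8 threads for general tasks and 4, 2 and 1 threads for the three layers, while demo_assign(false, 4, 2) yields 4 and 2 threads for the first two layers.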

Public Members

ThreadingTypes type: Type of the threading scheme.

SeqTypes seq_type = SeqTypes::None : Method of dense matrix multiplication parallelism.

AlignTypes align_type = AlignTypes::None: Memory alignment mode.

int n_threads_op = 0: Number of threads for parallelism over renormalized operators.

int n_threads_quanta = 0: Number of threads for parallelism over symmetry sectors.

int n_threads_mkl = 0: Number of threads for parallelism within dense matrix multiplications.

int n_threads_global = 0: Number of threads for general tasks.

int n_levels = 0: Number of nested threading layers.

Friends

inline friend ostream &operator<<(ostream &os, const Threading &th): Print threading information.

inline shared_ptr<Threading> &block2::threading_(): Implementation of the threading global variable.

threading: Global variable containing information for shared-memory parallelism schemes and number of threads used for each threading layer.

Allocators

template<typename T> struct Allocator

Abstract memory allocator.

Template Parameters:: T – The type of the element in the array.

Subclassed by block2::StackAllocator< T >, block2::VectorAllocator< T >

Public Functions

inline Allocator(): Default constructor.

virtual ~Allocator() = default: Default destructor.

inline virtual T *allocate(size_t n)

Allocate a length n array.

Parameters:: n – Number of elements in the array.
Returns:: The allocated pointer.

inline virtual complex<T> *complex_allocate(size_t n)

Allocate a length n complex array.

Parameters:: n – Number of elements in the array.
Returns:: The allocated pointer.

inline virtual void deallocate(void *ptr, size_t n)

Deallocate a length n array.

Parameters:

ptr – The pointer to be deallocated.
n – Number of elements in the array.

inline virtual void complex_deallocate(void *ptr, size_t n)

Deallocate a length n complex array.

Parameters:

ptr – The pointer to be deallocated.
n – Number of elements in the array.

inline virtual T *reallocate(T *ptr, size_t n, size_t new_n)

Adjust the size of an allocated pointer. No data copying will happen.

Parameters:

ptr – The allocated pointer.
n – Number of elements in original allocation.
new_n – Number of elements in the new allocation.

Returns:

The new pointer.

inline virtual shared_ptr<Allocator<T>> copy() const

Return a copy of the allocator.

Returns:: The copy of this allocator.

template<typename T> struct StackAllocator : public block2::Allocator<T>

Stack memory allocator.

Template Parameters:: T – The type of the element in the array.

Subclassed by block2::TemporaryAllocator< T >

Public Functions

inline StackAllocator(T *ptr, size_t max_size)

Constructor.

Parameters:

ptr – Pointer to the first element in the stack. The stack should be pre-allocated.
max_size – Total size of the stack (in number of elements).

inline StackAllocator(): Default constructor.

inline virtual T *allocate(size_t n) override

Allocate a length n array.

Parameters:: n – Number of elements in the array.
Returns:: The allocated pointer.

inline virtual void deallocate(void *ptr, size_t n) override

Deallocate a length n array. Must be invoked in the reverse order of allocation.

Parameters:

ptr – The pointer to be deallocated.
n – Number of elements in the array.

inline virtual T *reallocate(T *ptr, size_t n, size_t new_n) override

Change the allocated size in the middle of the stack memory and introduce a shift for moving the memory after it.

Parameters:

ptr – The allocated pointer.
n – Number of elements in original allocation.
new_n – Number of elements in the new allocation.

Returns:

The new pointer.

Public Members

size_t size: Total size of the stack (in number of elements).

size_t used: Occupied size of the stack (in number of elements).

size_t shift: Temporary shift introduced due to deallocation in the middle of the stack.

T *data: Pointer to the first elemenet in the stack.

Friends

inline friend ostream &operator<<(ostream &os, const StackAllocator &c)

Print the status of the allocator.

Parameters:

os – The output stream.
c – The object to be printed.

Returns:

The output stream.
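The reverse-order rule for deallocate follows from how a stack allocator hands out memory: each allocation is carved from the top of a pre-allocated array, so only the most recent allocation can be released. A minimal sketch of this behaviour is given below. DemoStackAllocator is a hypothetical, simplified type that omits reallocate and shift, and the error handling by exception is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical, simplified stack allocator over a pre-allocated array.
template <typename T> struct DemoStackAllocator {
    std::size_t size; // total number of elements in the stack
    std::size_t used; // number of occupied elements
    T *data;          // pointer to the first element in the stack
    DemoStackAllocator(T *ptr, std::size_t max_size)
        : size(max_size), used(0), data(ptr) {}
    // Hand out the next n elements from the top of the stack.
    T *allocate(std::size_t n) {
        if (used + n > size)
            throw std::runtime_error("stack memory exceeded");
        T *p = data + used;
        used += n;
        return p;
    }
    // Only the most recent allocation may be released.
    void deallocate(void *ptr, std::size_t n) {
        if (n > used || static_cast<T *>(ptr) != data + used - n)
            throw std::runtime_error("deallocation not in reverse order");
        used -= n;
    }
};
```

After allocating p and then q from the same allocator, q must be deallocated before p. Attempting to release p first raises the exception in this sketch.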

template<typename T> struct VectorAllocator : public block2::Allocator<T>

Vector memory allocator.

Template Parameters:: T – The type of the element in the array.

Public Functions

inline VectorAllocator(): Default constructor.

inline virtual T *allocate(size_t n) override

Allocate a length n array.

Parameters:: n – Number of elements in the array.
Returns:: The allocated pointer.

inline virtual void deallocate(void *ptr, size_t n) override

Deallocate a length n array. Note that explicit deallocation is not required for vector allocator. Can be invoked in arbitrary order.

Parameters:

ptr – The pointer to be deallocated.
n – Number of elements in the array.

inline virtual T *reallocate(T *ptr, size_t n, size_t new_n) override

Change the allocated size for one allocated block.

Parameters:

ptr – The allocated pointer.
n – Number of elements in original allocation.
new_n – Number of elements in the new allocation.

Returns:

The new pointer.

inline virtual shared_ptr<Allocator<T>> copy() const override

Return a copy of the allocator. When deep-copying objects using VectorAllocator, the other object should have an independent allocator, since VectorAllocator is not global.

Returns:: The copy of this allocator.

Public Members

vector<vector<T>> data: The allocated data blocks.

Friends

inline friend ostream &operator<<(ostream &os, const VectorAllocator &c)

Print the status of the allocator.

Parameters:

os – The output stream.
c – The object to be printed.

Returns:

The output stream.
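In contrast to the stack allocator, an allocator backed by vector<vector<T>> owns each block separately, so blocks can be released in any order and are freed automatically when the allocator is destroyed. The sketch below illustrates this with a hypothetical DemoVectorAllocator that omits reallocate and copy. It assumes every allocation has n > 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified vector-backed allocator.
template <typename T> struct DemoVectorAllocator {
    std::vector<std::vector<T>> data; // one inner vector per allocation
    // Create a new block of n elements and return its first element.
    // Moving the inner vectors keeps previously returned pointers valid.
    T *allocate(std::size_t n) {
        data.emplace_back(n);
        return data.back().data();
    }
    // Release the block that starts at ptr. Any order is allowed.
    void deallocate(void *ptr, std::size_t n) {
        (void)n; // only used for the consistency check below
        for (std::size_t i = 0; i < data.size(); i++)
            if (data[i].data() == static_cast<T *>(ptr)) {
                assert(data[i].size() == n);
                data.erase(data.begin() + i);
                return;
            }
    }
};
```

Here the first block can be released while a later block is still in use, which the stack allocator does not permit.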

inline shared_ptr<StackAllocator<uint32_t>> &block2::ialloc_(): Implementation of the ialloc global variable.

template<typename FL> inline shared_ptr<StackAllocator<FL>> &block2::dalloc_(): Implementation of the dalloc global variable.

ialloc: Global variable for the integer stack memory allocator.

Data Frame

template<typename FL> struct DataFrame

DataFrame includes several (n_frames = 2) frames. Each frame includes one integer stack memory and one double stack memory. The two frames are used alternately to avoid data copying.

Public Functions

inline DataFrame(size_t isize = 1 << 28, size_t dsize = 1 << 30, const string &save_dir = "node0", double dmain_ratio = 0.7, double imain_ratio = 0.7, int n_frames = 2)

Constructor.

Parameters:

isize – Max size (in bytes) of all integer stacks.
dsize – Max size (in bytes) of all double stacks.
save_dir – Scratch folder for renormalized operators.
dmain_ratio – The fraction of stack space occupied by the main double stacks.
imain_ratio – The fraction of stack space occupied by the main integer stacks.
n_frames – Number of data frames.
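The constructor takes isize and dsize in bytes, whereas the members of the same names are documented as element counts. The arithmetic can be sketched as follows. The helper names are hypothetical, and applying the main ratio directly to the element count is an assumption. How the remaining space is divided among the frames is not described here, so it is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in n_bytes.
template <typename T>
inline std::size_t demo_n_elements(std::size_t n_bytes) {
    return n_bytes / sizeof(T);
}

// Elements reserved for the main stack, assuming that the ratio is applied
// to the element count and the result is rounded down.
inline std::size_t demo_main_size(std::size_t n_elements, double main_ratio) {
    return static_cast<std::size_t>(main_ratio *
                                    static_cast<double>(n_elements));
}
```

With the default arguments, 1 << 28 bytes of integer stack correspond to 1 << 26 elements of uint32_t, and 1 << 30 bytes of double stack correspond to 1 << 27 elements when FL is double.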

inline virtual ~DataFrame(): Destructor.

inline void activate(int i)

Activate one data frame.

Parameters:: i – The index of the data frame to be activated.

inline void reset(int i)

Reset one data frame, marking all stack memory as unused.

Parameters:: i – The index of the data frame to be reset.
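One way to see why two frames avoid data copying: inputs can stay in one frame while outputs are built in the other, after which the input frame is reset and reused. The sketch below shows this pattern with a hypothetical DemoFrames type that holds only double stacks. It is an illustration of activate and reset, not the block2 implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified set of frames with one double stack each.
struct DemoFrames {
    std::vector<std::vector<double>> stacks; // one stack per frame
    std::vector<std::size_t> used;           // occupied elements per frame
    int i_frame = 0;                         // currently active frame
    DemoFrames(int n_frames, std::size_t n)
        : stacks(n_frames, std::vector<double>(n)),
          used(static_cast<std::size_t>(n_frames), 0) {}
    // Later allocations are served from frame i.
    void activate(int i) { i_frame = i; }
    // Mark frame i as unused. No memory is freed or copied.
    void reset(int i) { used[i] = 0; }
    // Take n elements from the top of the active frame.
    double *allocate(std::size_t n) {
        assert(used[i_frame] + n <= stacks[i_frame].size());
        double *p = stacks[i_frame].data() + used[i_frame];
        used[i_frame] += n;
        return p;
    }
};
```

A typical sequence is: allocate inputs in frame 0, activate frame 1 and allocate outputs there, then reset frame 0. The outputs remain valid in frame 1 and frame 0 is free for the next step.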

inline void reset_buffer(int i)

Reset saving and loading buffers for one data frame. Contents in the loading buffer will be deleted. Unsaved contents in the saving buffer will be saved to disk.

Parameters:: i – The index of the data frame.

inline void rename_data(const string &old_filename, const string &new_filename) const

Rename one scratch file.

Parameters:

old_filename – The original filename.
new_filename – The new filename.

inline void load_data_from(int i, istream &ifs) const

Load one data frame from input stream.

Parameters:

i – The index of the data frame.
ifs – The input stream.

inline void load_data(int i, const string &filename) const

Load one data frame from disk.

Parameters:

i – The index of the data frame.
filename – The filename for the data frame.

inline void save_data_to(int i, ostream &ofs) const

Save one data frame into output stream.

Parameters:

i – The index of the data frame.
ofs – The output stream.

inline void save_data(int i, const string &filename) const

Save one data frame to disk.

Parameters:

i – The index of the data frame.
filename – The filename for the data frame.

inline void deallocate(): Deallocate the memory allocated for all stacks. Note that this method is automatically invoked at deconstruction.

inline size_t memory_used() const

Return the current used memory in all stacks.

Returns:: The current used memory in bytes.

inline void update_peak_used_memory() const: Update peak used memory statistics.

inline void reset_peak_used_memory() const: Reset peak used memory statistics to zero.

Public Members

string save_dir: Scartch folder for renormalized operators.

string mps_dir: Scartch folder for MPS (default is the same as save_dir).

string mpo_dir: Scartch folder for MPO (default is the same as save_dir, only used when minimal_memory_usage is true).

string restart_dir = "": If not empty, save MPS to this dir after each sweep.

string restart_dir_per_sweep = "": If not empty, save MPS to this dir with sweep index as suffix, so that MPS from all sweeps will be kept in individual dirs.

string restart_dir_optimal_mps = "": If not empty, save MPS to this dir whenever an optimal solution is reached in one sweep. For DMRG, this is the MPS with the lowest energy. Note that if the best solution from the current sweep is worse than the best solution from the previous sweep (for example in a reverse schedule), the best solution from the current sweep is saved.

string restart_dir_optimal_mps_per_sweep = "": If not empty, save the optimal MPS from each sweep to this dir with sweep index as suffix.

size_t save_dir_quota = 0: Disk quota for save_dir (in bytes).

string alt_save_dir = "": Alternative scratch folder.

string prefix = "F": Filename prefix for common scratch files (such as MPS tensors).

string prefix_distri = "F0": Filename prefix for distributed scratch files (such as renormalized operators). When distributed parallelization is used, different procs will have different values for this data.

bool prefix_can_write = true: Whether this proc should be able to write common scratch files (such as MPS tensors).

bool partition_can_write = true: Whether this proc should be able to write renormalized operators.

size_t isize: Max number of elements in all integer stacks.

size_t dsize: Max number of elements in all double stacks.

int n_frames: Total number of data frames.

int i_frame: The index of Current activated data frame.

mutable double tread = 0: I/O time cost for reading scratch files.

double twrite = 0: I/O time cost for writing scratch files.

double tasync = 0: I/O time cost for async writing of scratch files.

mutable double fpread = 0: I/O time cost for reading scratch files with floating-point decompression.

double fpwrite = 0: I/O time cost for writing scratch files with floating-point compression.

mutable Timer _t: Temporary timer.

Timer _t2: Auxiliary temporary timer.

vector<shared_ptr<StackAllocator<uint32_t>>> iallocs: Integer stacks allocators.

vector<shared_ptr<StackAllocator<FL>>> dallocs: Double stacks allocators.

mutable vector<size_t> peak_used_memory: Peak used memory by stacks (in bytes). Even indices are for double stacks. Odd indices are for integer stacks.

mutable vector<string> present_filenames: The filename for the current stack memory content for each data frame. Used for tracking loading and saving buffering to avoid loading the same data into memory.

mutable vector<pair<string, shared_ptr<stringstream>>> load_buffers: Buffers for loading. Reading a file is skipped if the contents of the file with that filename are already in the loading buffer.

mutable vector<pair<string, shared_ptr<stringstream>>> save_buffers: Buffers for async saving.

mutable vector<shared_future<void>> save_futures: Async saving files.

bool load_buffering = false: Whether load buffering should be used. If true, memory usage will increase.

bool save_buffering = false: Whether async saving and saving buffering should be used. If true, memory usage will increase.

bool use_main_stack = true: Whether main stack should be used for storing blocked operators in enlarged blocks. If false, these blocked operators will be stored in dynamically allocated memory.

bool minimal_disk_usage = false: Whether temporary renormalized operator files should be deleted as soon as possible. If true, will save roughly half of required storage for renormalized operators.

bool minimal_memory_usage = false: Whether MPO should be built in minimal memory mode by saving intermediates to disk. In this mode, MPO should have different tags.

bool compressed_sparse_tensor_storage = false: Whether block-sparse tensor should be stored in compressed form to save storage (mainly for MPS).

shared_ptr<FPCodec<FL>> fp_codec = nullptr: Floating-point compression codec. If nullptr, floating-point compression will not be used.
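The load_buffers member pairs a filename with an in-memory stream, so that a file whose contents are already buffered does not need to be read again. A minimal sketch of such a lookup is shown below. DemoBuffer, demo_load and the read_file callback are hypothetical, and the sketch searches all entries by filename, whereas how block2 indexes its buffers is not described in this section.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Same element type as the load_buffers member described above.
using DemoBuffer = std::pair<std::string, std::shared_ptr<std::stringstream>>;

// Return the buffered stream for filename, reading it only on a miss.
// read_file is a hypothetical callback that returns the file contents.
inline std::shared_ptr<std::stringstream>
demo_load(std::vector<DemoBuffer> &load_buffers, const std::string &filename,
          const std::function<std::string(const std::string &)> &read_file) {
    for (const DemoBuffer &b : load_buffers)
        if (b.first == filename)
            return b.second; // hit: skip reading the file
    // Miss: read once and remember the contents under this filename.
    auto ss = std::make_shared<std::stringstream>(read_file(filename));
    load_buffers.emplace_back(filename, ss);
    return ss;
}
```

The trade-off matches the note on load_buffering: repeated loads become cheaper, at the cost of keeping file contents in memory.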

Public Static Functions

static inline void buffer_save_data(const string &filename, const shared_ptr<stringstream> &ss, double *tasync)

Save the data in the buffer stream to disk.

Parameters:

filename – The filename for saving data.
ss – The buffer stream.
tasync – Pointer to the time recorder for async saving.
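A function with this signature can be sketched as writing the stream contents to a file and adding the elapsed time to the value behind tasync. The version below is an illustration only: demo_buffer_save is a hypothetical name, and the binary mode, the use of steady_clock and the error handling are assumptions of the sketch.

```cpp
#include <cassert>
#include <chrono>
#include <fstream>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>

// Write the buffered contents to filename and accumulate the elapsed
// wall time (in seconds) into *tasync, if a time recorder is given.
inline void demo_buffer_save(const std::string &filename,
                             const std::shared_ptr<std::stringstream> &ss,
                             double *tasync) {
    auto t0 = std::chrono::steady_clock::now();
    std::ofstream ofs(filename.c_str(), std::ios::binary);
    if (!ofs.good())
        throw std::runtime_error("cannot open " + filename);
    ofs << ss->rdbuf(); // copy the whole buffer into the file
    ofs.close();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    if (tasync != nullptr)
        *tasync += dt.count();
}
```

Because the function only needs the filename, the stream and the time recorder, it is suitable for running in the background, for example through std::async, while the caller continues with other work.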

Friends

inline friend ostream &operator<<(ostream &os, const DataFrame &df)

Print the status of the data frame.

Parameters:

os – The output stream.
df – The object to be printed.

Returns:

The output stream.

template<typename FL> inline shared_ptr<DataFrame<FL>> &block2::frame_(): Global variable for accessing global stack memory and file I/O in scratch space.

Miscellanies

inline auto block2::check_signal_() -> void (*&)(): Function pointer for signal checking.
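The return type void (*&)() is a reference to a function pointer, so the caller can both invoke the current handler and install a new one through the same accessor. The sketch below demonstrates the syntax with hypothetical names (demo_check_signal_ and demo_poll). The null check and the call site are assumptions, not block2 behaviour.

```cpp
#include <cassert>

// Accessor returning a reference to a function-local static function
// pointer, mirroring the return type void (*&)() shown above.
inline auto demo_check_signal_() -> void (*&)() {
    static void (*handler)() = nullptr;
    return handler;
}

// Hypothetical call site: run the handler if one has been installed.
inline void demo_poll() {
    if (demo_check_signal_() != nullptr)
        demo_check_signal_()();
}
```

A handler is installed by assignment, for example demo_check_signal_() = my_handler, where my_handler is any function taking no arguments and returning void.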

inline void block2::print_trace(): Print the call stack when an error happens. Does not work on non-Unix systems.