Global Settings

In block2, we try to minimize the use of global variables. Two global variables (frame_() and threading_()) are used to control global settings such as stack memory, the scratch folder, and threading schemes.

Note that in block2 the distributed parallelization scheme is handled locally.

Threading

enum class block2::ThreadingTypes : uint8_t

An indicator for where the openMP shared-memory threading should be activated. In the case of nested openMP, the total number of nested threading layers is determined from this enumeration.

For each enumerator, the number in brackets is the total number of threading layers.

Values:

enumerator SequentialGEMM

[0] seq mkl

enumerator BatchedGEMM

[1] parallel mkl

enumerator Quanta

[1] openmp quanta + seq mkl

enumerator QuantaBatchedGEMM

[2] openmp quanta + parallel mkl

enumerator Operator

[1] openmp operator

enumerator OperatorBatchedGEMM

[2] openmp operator + parallel mkl

enumerator OperatorQuanta

[2] openmp operator + openmp quanta

enumerator OperatorQuantaBatchedGEMM

[3] openmp operator + openmp quanta + parallel mkl

enumerator Global

[1] openmp for general non-core-algorithm tasks
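The bracketed layer counts above can be reproduced with a small sketch. The enum below is a hypothetical stand-in (SketchThreadingTypes), and its bit values are an assumption chosen only so that each kind of parallelism occupies one bit; the actual values in block2 may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ThreadingTypes. The bit values are assumed for
// illustration only: one bit per kind of parallelism.
enum class SketchThreadingTypes : std::uint8_t {
    SequentialGEMM = 0,
    BatchedGEMM = 1,
    Quanta = 2,
    QuantaBatchedGEMM = 2 | 1,
    Operator = 4,
    OperatorBatchedGEMM = 4 | 1,
    OperatorQuanta = 4 | 2,
    OperatorQuantaBatchedGEMM = 4 | 2 | 1,
    Global = 8
};

// Count the threading layers of a single enumerator: one layer per set bit.
inline int count_layers(SketchThreadingTypes t) {
    int n = 0;
    for (unsigned x = static_cast<std::uint8_t>(t); x != 0; x >>= 1)
        n += static_cast<int>(x & 1u);
    return n;
}
```

With these assumed values, count_layers returns the number shown in brackets for each enumerator listed above.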

enum class block2::SeqTypes : uint8_t

Method of GEMM (dense matrix multiplication) parallelism. For CSR matrix multiplication, the only possible case is SeqTypes::None, but one can still use SeqTypes::Simple, in which case only dense matrix multiplication is parallelized.

Values:

enumerator None

GEMM calls are not parallelized. Parallelism may happen inside each GEMM, if a threaded version of MKL is linked.

enumerator Simple

GEMM calls that write to different outputs are parallelized; otherwise they are executed sequentially. In this mode, the code sorts and divides the GEMM calls into several groups (batches). Inside each batch, the output addresses are guaranteed to be different. The cblas_dgemm_batch is invoked to compute each batch.
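The batching rule for this mode can be illustrated with a minimal sketch. It is not the block2 implementation: GemmTask and split_into_batches are hypothetical names, and the grouping strategy (the k-th task writing to a given output goes into batch k) is only one simple way to guarantee distinct output addresses inside each batch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for one GEMM call; only the output address matters.
struct GemmTask {
    int id;          // label of the task
    const void *out; // address of the output matrix
};

// Put the k-th task that writes to a given output into batch k, so that no
// batch contains two tasks with the same output address.
inline std::vector<std::vector<GemmTask>>
split_into_batches(const std::vector<GemmTask> &tasks) {
    std::map<const void *, std::size_t> seen; // tasks seen so far per output
    std::vector<std::vector<GemmTask>> batches;
    for (const GemmTask &t : tasks) {
        const std::size_t k = seen[t.out]++;
        if (k >= batches.size())
            batches.resize(k + 1);
        batches[k].push_back(t);
    }
    return batches;
}
```

Each resulting batch could then be passed to a batched GEMM routine, since tasks within one batch never write to the same output.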

enumerator Auto

DGEMM calls are automatically divided into several batches only when there is a data dependency. Conflicts of output are automatically resolved by introducing temporary arrays. The cblas_dgemm_batch is invoked to compute each batch. This option normally requires a large amount of time for preprocessing and introduces a large number of temporary arrays, which is not memory friendly.

enumerator Tasked

GEMM calls will be evenly divided into n_threads groups. Different groups are executed in different threads. Since different threads may write into the same output array, there is an additional reduction step after all GEMM calls finish. This mode is mainly implemented for the Davidson matrix-vector step (tensor_product_multiply), where the size of the output array (wavefunction) is small compared to that of all input arrays. For the blocking/rotation step, SeqTypes::Tasked has no effect and is equivalent to SeqTypes::None. The cblas_dgemm_batch is not used in this mode.
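The divide-then-reduce idea can be sketched as follows. This is a simplified illustration under stated assumptions: ScaledTask and run_tasked are hypothetical names, each "GEMM" is replaced by a scalar product accumulated into one output element, and the groups are processed in a sequential loop where a threaded code would run one group per thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a GEMM that accumulates into a shared output:
// out[idx] += a * b.
struct ScaledTask {
    std::size_t idx;
    double a, b;
};

// Tasks are split evenly into n_groups groups. Each group accumulates into
// its own private copy of the output, and the copies are summed at the end
// (the reduction step). The group loop is sequential here; with threading,
// each group would be handled by a different thread.
inline std::vector<double> run_tasked(const std::vector<ScaledTask> &tasks,
                                      std::size_t n_out,
                                      std::size_t n_groups) {
    assert(n_groups > 0);
    std::vector<std::vector<double>> partial(
        n_groups, std::vector<double>(n_out, 0.0));
    for (std::size_t g = 0; g < n_groups; g++)
        for (std::size_t i = g; i < tasks.size(); i += n_groups)
            partial[g][tasks[i].idx] += tasks[i].a * tasks[i].b;
    std::vector<double> out(n_out, 0.0);
    for (std::size_t g = 0; g < n_groups; g++)
        for (std::size_t j = 0; j < n_out; j++)
            out[j] += partial[g][j];
    return out;
}
```

The private copies cost n_groups times the output size in memory, which is why this approach suits cases where the output is small compared to the inputs.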

enumerator SimpleTasked

This is the same as SeqTypes::Tasked for the Davidson matrix-vector step, and the same as SeqTypes::Simple for other steps.

struct Threading

Global information for threading schemes.

Public Functions

inline bool openmp_available() const

Whether openmp compiler option is set.

inline bool tbb_available() const

Whether tbb memory allocator is used.

inline bool mkl_available() const

Whether MKL math library is used.

inline bool blis_available() const

Whether BLIS math library is used.

inline bool complex_available() const

Whether complex number extension is used.

inline bool single_precision_available() const

Whether single precision extension is used.

inline bool ksymm_available() const

Whether K symmetry extension is used.

inline string get_mkl_version() const

Check version of the linked MKL library.

Returns:

A version string of the linked MKL library if MKL is linked, or an empty string otherwise.

inline string get_mkl_threading_type() const

Return a string indicating which threaded MKL library is linked.

inline string get_seq_type() const

Return a string indicating which SeqTypes is used.

inline int get_thread_id() const

If inside an openMP parallel region, return the id of the current thread.

inline int activate_global() const

Set number of threads for a general task. Parallelism inside MKL will be deactivated for a general task.

Returns:

Number of threads for general tasks. Returns 1 if openMP should not be used for a general task.

inline int activate_global_mkl() const

Set number of threads for a general task with parallelism inside MKL. Parallelism outside MKL will be deactivated.

Returns:

Number of threads for general tasks. Returns 1 if MKL is not supported.

inline int activate_normal() const

Set number of threads for a normal (parallelism over renormalized operators) task.

Returns:

Number of threads for parallelism over renormalized operators.

inline int activate_operator() const

Set number of threads for parallelism over renormalized operators.

Returns:

Number of threads for parallelism over renormalized operators.

inline int activate_quanta() const

Set number of threads for parallelism over symmetry sectors.

Returns:

Number of threads for parallelism over symmetry sectors.

inline Threading()

Default constructor. Uses ThreadingTypes::Global | ThreadingTypes::BatchedGEMM with maximal available number of threads, and SeqTypes::None for dense matrix multiplication.

inline Threading(ThreadingTypes type, int nta = -1, int ntb = -1, int ntc = -1, int ntd = -1)

Constructor.

Parameters:
  • type – Type of the threading scheme.

  • nta – Number of threads for a general task (if ThreadingTypes::Global is set) or number of threads in the first threading layer.

  • ntb – Number of threads in the first threading layer for a non-general threaded task (if ThreadingTypes::Global is set) or number of threads in the second threading layer.

  • ntc – Number of threads in the second threading layer for a non-general threaded task (if ThreadingTypes::Global is set) or number of threads in the third threading layer.

  • ntd – Number of threads in the third threading layer for a non-general threaded task (if ThreadingTypes::Global is set).

Public Members

ThreadingTypes type

Type of the threading scheme.

SeqTypes seq_type = SeqTypes::None

Method of dense matrix multiplication parallelism.

int n_threads_op = 0

Number of threads for parallelism over renormalized operators.

int n_threads_quanta = 0

Number of threads for parallelism over symmetry sectors.

int n_threads_mkl = 0

Number of threads for parallelism within dense matrix multiplications.

int n_threads_global = 0

Number of threads for general tasks.

int n_levels = 0

Number of nested threading layers.

Friends

inline friend ostream &operator<<(ostream &os, const Threading &th)

Print threading information.

inline shared_ptr<Threading> &block2::threading_()

Implementation of the threading global variable.

threading

Global variable containing information for shared-memory parallelism schemes and number of threads used for each threading layer.
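An accessor with the signature shape of threading_() is commonly implemented as a function returning a reference to a function-local static shared_ptr. The sketch below shows that pattern with hypothetical names (SketchThreading, sketch_threading_); whether block2 implements threading_() in exactly this way is an assumption.

```cpp
#include <cassert>
#include <memory>

// Hypothetical settings object standing in for Threading.
struct SketchThreading {
    int n_threads_global = 1;
};

// Accessor returning a reference to a function-local static shared_ptr.
// Callers can read the settings or replace the whole object through it.
inline std::shared_ptr<SketchThreading> &sketch_threading_() {
    static std::shared_ptr<SketchThreading> instance =
        std::make_shared<SketchThreading>();
    return instance;
}
```

Because the accessor returns a reference, assigning to sketch_threading_() swaps the settings for every later caller, which gives global-variable behavior without a namespace-scope object.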

Allocators

template<typename T>
struct Allocator

Abstract memory allocator.

Template Parameters:

T – The type of the element in the array.

Subclassed by block2::StackAllocator< T >, block2::VectorAllocator< T >

Public Functions

inline Allocator()

Default constructor.

virtual ~Allocator() = default

Default destructor.

inline virtual T *allocate(size_t n)

Allocate a length n array.

Parameters:

n – Number of elements in the array.

Returns:

The allocated pointer.

inline virtual complex<T> *complex_allocate(size_t n)

Allocate a length n complex array.

Parameters:

n – Number of elements in the array.

Returns:

The allocated pointer.

inline virtual void deallocate(void *ptr, size_t n)

Deallocate a length n array.

Parameters:
  • ptr – The pointer to be deallocated.

  • n – Number of elements in the array.

inline virtual void complex_deallocate(void *ptr, size_t n)

Deallocate a length n complex array.

Parameters:
  • ptr – The pointer to be deallocated.

  • n – Number of elements in the array.

inline virtual T *reallocate(T *ptr, size_t n, size_t new_n)

Adjust the size of an allocated pointer. No data copying will happen.

Parameters:
  • ptr – The allocated pointer.

  • n – Number of elements in original allocation.

  • new_n – Number of elements in the new allocation.

Returns:

The new pointer.

inline virtual shared_ptr<Allocator<T>> copy() const

Return a copy of the allocator.

Returns:

The copy of this allocator.

template<typename T>
struct StackAllocator : public block2::Allocator<T>

Stack memory allocator.

Template Parameters:

T – The type of the element in the array.

Subclassed by block2::TemporaryAllocator< T >

Public Functions

inline StackAllocator(T *ptr, size_t max_size)

Constructor.

Parameters:
  • ptr – Pointer to the first element in the stack. The stack should be pre-allocated.

  • max_size – Total size of the stack (in number of elements).

inline StackAllocator()

Default constructor.

inline virtual T *allocate(size_t n) override

Allocate a length n array.

Parameters:

n – Number of elements in the array.

Returns:

The allocated pointer.

inline virtual void deallocate(void *ptr, size_t n) override

Deallocate a length n array. Must be invoked in the reverse order of allocation.

Parameters:
  • ptr – The pointer to be deallocated.

  • n – Number of elements in the array.

inline virtual T *reallocate(T *ptr, size_t n, size_t new_n) override

Change the allocated size in middle of stack memory and introduce a shift for moving memory after it.

Parameters:
  • ptr – The allocated pointer.

  • n – Number of elements in original allocation.

  • new_n – Number of elements in the new allocation.

Returns:

The new pointer.

Public Members

size_t size

Total size of the stack (in number of elements).

size_t used

Occupied size of the stack (in number of elements).

size_t shift

Temporary shift introduced due to deallocation in the middle of the stack.

T *data

Pointer to the first element in the stack.

Friends

inline friend ostream &operator<<(ostream &os, const StackAllocator &c)

Print the status of the allocator.

Parameters:
  • os – The output stream.

  • c – The object to be printed.

Returns:

The output stream.
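The reverse-order deallocation rule of a stack allocator can be illustrated with a minimal sketch. MiniStack is a hypothetical type, not the block2 StackAllocator: it omits the shift member and reallocation, and it reports errors through return values, which is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal stack allocator over a pre-allocated buffer.
template <typename T> struct MiniStack {
    T *data;          // pointer to the first element in the stack
    std::size_t size; // total size of the stack (in number of elements)
    std::size_t used; // occupied size of the stack (in number of elements)
    MiniStack(T *ptr, std::size_t max_size)
        : data(ptr), size(max_size), used(0) {}
    // Hand out the next n elements, or nullptr if the stack would overflow.
    T *allocate(std::size_t n) {
        if (n > size - used)
            return nullptr;
        T *p = data + used;
        used += n;
        return p;
    }
    // Release the most recent allocation. Returns false (and does nothing)
    // if ptr is not at the top of the stack, i.e. the order is not reversed.
    bool deallocate(T *ptr, std::size_t n) {
        if (n > used || ptr != data + (used - n))
            return false;
        used -= n;
        return true;
    }
};
```

Allocation and deallocation only move the used counter, which is what makes a stack allocator cheap, and also why blocks must be released in the reverse order of allocation.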

template<typename T>
struct VectorAllocator : public block2::Allocator<T>

Vector memory allocator.

Template Parameters:

T – The type of the element in the array.

Public Functions

inline VectorAllocator()

Default constructor.

inline virtual T *allocate(size_t n) override

Allocate a length n array.

Parameters:

n – Number of elements in the array.

Returns:

The allocated pointer.

inline virtual void deallocate(void *ptr, size_t n) override

Deallocate a length n array. Note that explicit deallocation is not required for vector allocator. Can be invoked in arbitrary order.

Parameters:
  • ptr – The pointer to be deallocated.

  • n – Number of elements in the array.

inline virtual T *reallocate(T *ptr, size_t n, size_t new_n) override

Change the allocated size for one allocated block.

Parameters:
  • ptr – The allocated pointer.

  • n – Number of elements in original allocation.

  • new_n – Number of elements in the new allocation.

Returns:

The new pointer.

inline virtual shared_ptr<Allocator<T>> copy() const override

Return a copy of the allocator. When deep-copying objects using VectorAllocator, the other object should have an independent allocator, since VectorAllocator is not global.

Returns:

The copy of this allocator.

Public Members

vector<vector<T>> data

The allocated data blocks.

Friends

inline friend ostream &operator<<(ostream &os, const VectorAllocator &c)

Print the status of the allocator.

Parameters:
  • os – The output stream.

  • c – The object to be printed.

Returns:

The output stream.

inline shared_ptr<StackAllocator<uint32_t>> &block2::ialloc_()

Implementation of the ialloc global variable.

template<typename FL>
inline shared_ptr<StackAllocator<FL>> &block2::dalloc_()

Implementation of the dalloc global variable.

ialloc

Global variable for the integer stack memory allocator.

Data Frame

template<typename FL>
struct DataFrame

DataFrame includes several (n_frames = 2) frames. Each frame includes one integer stack memory and one double stack memory. The two frames are used alternately to avoid data copying.

Public Functions

inline DataFrame(size_t isize = 1 << 28, size_t dsize = 1 << 30, const string &save_dir = "node0", double dmain_ratio = 0.7, double imain_ratio = 0.7, int n_frames = 2)

Constructor.

Parameters:
  • isize – Max size (in bytes) of all integer stacks.

  • dsize – Max size (in bytes) of all double stacks.

  • save_dir – Scratch folder for renormalized operators.

  • dmain_ratio – The fraction of stack space occupied by the main double stacks.

  • imain_ratio – The fraction of stack space occupied by the main integer stacks.

  • n_frames – Number of data frames.
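The constructor takes isize and dsize in bytes, while the isize and dsize members are documented in numbers of elements. The sketch below shows the arithmetic relating the two for the default arguments. The helper bytes_to_elements is hypothetical, and the assumption that the conversion is a plain division by the element size (with uint32_t integer elements and double floating-point elements) is not confirmed by the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of elements of type T that fit in n_bytes.
template <typename T>
constexpr std::size_t bytes_to_elements(std::size_t n_bytes) {
    return n_bytes / sizeof(T);
}

// Default constructor arguments from the signature above, in bytes.
constexpr std::size_t default_isize_bytes = std::size_t(1) << 28;
constexpr std::size_t default_dsize_bytes = std::size_t(1) << 30;
```

Under these assumptions, the defaults correspond to 2^26 integer elements (256 MiB) and, for 8-byte doubles, 2^27 floating-point elements (1 GiB).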

inline virtual ~DataFrame()

Destructor.

inline void activate(int i)

Activate one data frame.

Parameters:

i – The index of the data frame to be activated.

inline void reset(int i)

Reset one data frame, marking all stack memory as unused.

Parameters:

i – The index of the data frame to be reset.

inline void reset_buffer(int i)

Reset saving and loading buffers for one data frame. Contents in the loading buffer will be deleted. Unsaved contents in the saving buffer will be saved to disk.

Parameters:

i – The index of the data frame.

inline void rename_data(const string &old_filename, const string &new_filename) const

Rename one scratch file.

Parameters:
  • old_filename – original filename.

  • new_filename – new filename.

inline void load_data_from(int i, istream &ifs) const

Load one data frame from input stream.

Parameters:
  • i – The index of the data frame.

  • ifs – The input stream.

inline void load_data(int i, const string &filename) const

Load one data frame from disk.

Parameters:
  • i – The index of the data frame.

  • filename – The filename for the data frame.

inline void save_data_to(int i, ostream &ofs) const

Save one data frame into output stream.

Parameters:
  • i – The index of the data frame.

  • ofs – The output stream.

inline void save_data(int i, const string &filename) const

Save one data frame to disk.

Parameters:
  • i – The index of the data frame.

  • filename – The filename for the data frame.

inline void deallocate()

Deallocate the memory allocated for all stacks. Note that this method is automatically invoked at destruction.

inline size_t memory_used() const

Return the current used memory in all stacks.

Returns:

The current used memory in Bytes.

inline void update_peak_used_memory() const

Update peak used memory statistics.

inline void reset_peak_used_memory() const

Reset peak used memory statistics to zero.

Public Members

string save_dir

Scratch folder for renormalized operators.

string mps_dir

Scratch folder for MPS (default is the same as save_dir).

string mpo_dir

Scratch folder for MPO (default is the same as save_dir, only used when minimal_memory_usage is true).

string restart_dir = ""

If not empty, save MPS to this dir after each sweep.

string restart_dir_per_sweep = ""

If not empty, save MPS to this dir with sweep index as suffix, so that MPS from all sweeps will be kept in individual dirs.

string restart_dir_optimal_mps = ""

If not empty, save MPS to this dir whenever an optimal solution is reached in one sweep. For DMRG, this is the MPS with the lowest energy. Note that if the best solution from the current sweep is worse than the best solution from the previous sweep (for example in a reverse schedule), the best solution from the current sweep is saved.

string restart_dir_optimal_mps_per_sweep = ""

If not empty, save the optimal MPS from each sweep to this dir with sweep index as suffix.

string prefix = "F"

Filename prefix for common scratch files (such as MPS tensors).

string prefix_distri = "F0"

Filename prefix for distributed scratch files (such as renormalized operators). When distributed parallelization is used, different procs will have different values for this data.

bool prefix_can_write = true

Whether this proc should be able to write common scratch files (such as MPS tensors).

bool partition_can_write = true

Whether this proc should be able to write renormalized operators.

size_t isize

Max number of elements in all integer stacks.

size_t dsize

Max number of elements in all double stacks.

int n_frames

Total number of data frames.

int i_frame

The index of the currently activated data frame.

mutable double tread = 0

IO Time cost for reading scratch files.

double twrite = 0

IO Time cost for writing scratch files.

double tasync = 0

IO Time cost for async writing scratch files.

mutable double fpread = 0

IO Time cost for reading scratch files with floating-point decompression.

double fpwrite = 0

IO Time cost for writing scratch files with floating-point compression.

mutable Timer _t

Temporary timer.

Timer _t2

Auxiliary temporary timer.

vector<shared_ptr<StackAllocator<uint32_t>>> iallocs

Integer stacks allocators.

vector<shared_ptr<StackAllocator<FL>>> dallocs

Double stacks allocators.

mutable vector<size_t> peak_used_memory

Peak used memory by stacks (in Bytes). Even indices are for double stacks. Odd indices are for integer stacks.

mutable vector<string> present_filenames

The filename for the current stack memory content for each data frame. Used for tracking loading and saving buffering to avoid loading the same data into memory.

mutable vector<pair<string, shared_ptr<stringstream>>> load_buffers

Buffers for loading. Reading a file with a given filename is skipped if the contents of that file are already in the loading buffer.

mutable vector<pair<string, shared_ptr<stringstream>>> save_buffers

Buffers for async saving.

mutable vector<shared_future<void>> save_futures

Async saving files.

bool load_buffering = false

Whether load buffering should be used. If true, memory usage will increase.

bool save_buffering = false

Whether async saving and saving buffering should be used. If true, memory usage will increase.

bool use_main_stack = true

Whether main stack should be used for storing blocked operators in enlarged blocks. If false, these blocked operators will be stored in dynamically allocated memory.

bool minimal_disk_usage = false

Whether temporary renormalized operator files should be deleted as soon as possible. If true, will save roughly half of required storage for renormalized operators.

bool minimal_memory_usage = false

Whether MPO should be built in minimal memory mode by saving intermediates to disk. In this mode, MPO should have different tags.

shared_ptr<FPCodec<FL>> fp_codec = nullptr

Floating-point compression codec. If nullptr, floating-point compression will not be used.

Public Static Functions

static inline void buffer_save_data(const string &filename, const shared_ptr<stringstream> &ss, double *tasync)

Save the data in the buffer stream to disk.

Parameters:
  • filename – The filename for saving data.

  • ss – The buffer stream.

  • tasync – Pointer to the time recorder for async saving.

Friends

inline friend ostream &operator<<(ostream &os, const DataFrame &df)

Print the status of the data frame.

Parameters:
  • os – The output stream.

  • df – The object to be printed.

Returns:

The output stream.

template<typename FL>
inline shared_ptr<DataFrame<FL>> &block2::frame_()

Global variable for accessing global stack memory and file I/O in scratch space.

Miscellanies

inline auto block2::check_signal_() -> void (*&)()

Function pointer for signal checking.
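The return type void (*&)() is a reference to a function pointer, so a caller can install a handler by assigning through the accessor. The sketch below illustrates this with hypothetical names (sketch_check_signal_, sketch_handler, sketch_poll); how block2 actually stores and invokes the pointer is an assumption.

```cpp
#include <cassert>

// Hypothetical accessor with the same signature shape: it returns a
// reference to a function pointer stored in a function-local static.
inline auto sketch_check_signal_() -> void (*&)() {
    static void (*handler)() = nullptr;
    return handler;
}

// Counter used only to observe that the handler has been called.
inline int &sketch_counter() {
    static int count = 0;
    return count;
}

// Example handler: records that it was invoked.
inline void sketch_handler() { sketch_counter()++; }

// Invoke the installed handler, if any.
inline void sketch_poll() {
    if (sketch_check_signal_() != nullptr)
        sketch_check_signal_()();
}
```

Installing a handler is then a plain assignment, sketch_check_signal_() = sketch_handler, after which every call to sketch_poll() invokes it.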

inline void block2::print_trace()

Print the call stack when an error happens. Does not work on non-Unix systems.