DBHT, OWA Weights, Gerber Statistic, CPP and Auxiliary Functions

The DBHT module has functions that allow us to use the Direct Bubble Hierarchical Tree (DBHT) [D2], a new linkage method, and the j-LoGo [D3] covariance estimation method.

The OwaWeights module has functions that allow us to build the weights of some special cases of the OWA Portfolio optimization model [D4].

The GerberStatistic module has functions that allow us to use the Gerber Statistic [D5].

The cppfunctions module has functions that allow us to build some special matrices defined in [D6] and [D7].

The AuxFunctions module has some auxiliary functions that are used in other modules.

DBHT Methods

DBHT.DBHTs(D, S, leaf_order=True)[source]

Perform Direct Bubble Hierarchical Tree (DBHT) clustering, a deterministic technique which only requires a similarity matrix S and a related dissimilarity matrix D. For more information see “Hierarchical information clustering by means of topologically embedded graphs.” [D2]. This version makes extensive use of a graph-theoretic filtering technique called Triangulated Maximally Filtered Graph (TMFG).

Parameters:

D (nd-array) – N x N dissimilarity matrix - e.g. a distance: D=pdist(data,’euclidean’) and then D=squareform(D).
S (nd-array) – N x N similarity matrix (non-negative)- e.g. correlation coefficient+1: S = 2-D**2/2 or another possible choice can be S = exp(-D).

Returns:

T8 (DataFrame) – N x 1 cluster membership vector.
Rpm (nd-array) – N x N adjacency matrix of Planar Maximally Filtered Graph (PMFG).
Adjv (nd-array) – Bubble cluster membership matrix from BubbleCluster8.
Dpm (nd-array) – N x N shortest path length matrix of PMFG
Mv (nd-array) – N x Nb bubble membership matrix. Nb(n,bi)=1 indicates vertex n is a vertex of bubble bi.
Z (nd-array) – Linkage matrix using DBHT hierarchy.
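As a minimal sketch of how the two inputs might be prepared, the snippet below builds D and S from a correlation matrix using NumPy only. The synthetic returns, the correlation distance D = sqrt(2(1 - corr)) and all variable names are assumptions chosen for illustration; they are one common choice, not a requirement of DBHTs.

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(size=(250, 5))          # synthetic returns (assumption)

corr = np.corrcoef(returns, rowvar=False)    # N x N correlation matrix
# Correlation distance; clip guards against tiny negative round-off values.
D = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
# Similarity from the parameter description: S = 2 - D**2 / 2 (= corr + 1).
S = 2.0 - D**2 / 2.0
```

With this choice S equals the correlation matrix plus one, so it is non-negative as the parameter description requires.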

DBHT.j_LoGo(S, separators, cliques)[source]

Computes the sparse inverse covariance matrix, J, from a clique tree made of cliques and separators. For more information see [D3].

Parameters:

S (ndarray) – It is the complete covariance matrix.
separators (nd-array) – It is the list of separators.
cliques (nd-array) – It is the list of cliques.

Returns:

JLogo – Inverse covariance.

Return type:

nd-array

Notes

separators and cliques can be the outputs of the TMFG function.
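The idea behind a clique-tree estimate can be sketched as follows: add the inverse of each clique's covariance block and subtract the inverse of each separator's block. The helper name logo_precision and the list-of-index-lists input format are hypothetical and may differ from what j_LoGo expects; the chain graph example is an assumption chosen because the result can be checked exactly.

```python
import numpy as np

def logo_precision(S, cliques, separators):
    """Sum block inverses over cliques, subtract block inverses over separators."""
    J = np.zeros_like(S, dtype=float)
    for c in cliques:
        idx = np.ix_(c, c)
        J[idx] += np.linalg.inv(S[idx])
    for s in separators:
        idx = np.ix_(s, s)
        J[idx] -= np.linalg.inv(S[idx])
    return J

# Chain 0 - 1 - 2: cliques {0, 1} and {1, 2}, separator {1} (illustrative).
J_true = np.array([[2.0, -1.0, 0.0],
                   [-1.0, 2.0, -1.0],
                   [0.0, -1.0, 2.0]])
S = np.linalg.inv(J_true)
J = logo_precision(S, cliques=[[0, 1], [1, 2]], separators=[[1]])
```

Because the true inverse covariance in this example already has the sparsity of the chain, the sketch recovers it exactly, including the zero between vertices 0 and 2.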

DBHT.PMFG_T2s(W, nargout=3)[source]

Computes a Triangulated Maximally Filtered Graph (TMFG) [D8] starting from a tetrahedron and inserting recursively vertices inside existing triangles (T2 move) in order to approximate a maximal planar graph with the largest total weight - non negative weights.

Parameters:

W (nd-array) – An N x N matrix of non-negative weights.
nargout (int, optional) – Number of results. Possible values are 3, 4 and 5.

Returns:

A (nd-array) – Adjacency matrix of the PMFG (with weights)
tri (nd-array) – Matrix of triangles (triangular faces) of size 2N - 4 x 3
separators (nd-array) – Matrix of 3-cliques that are not triangular faces (all 3-cliques are given by: [tri;separators]).
clique4 (nd-array, optional) – List of all 4-cliques.
cliqueTree (nd-array, optional) – 4-cliques tree structure (adjacency matrix).

DBHT.distance_wei(L)[source]

The distance matrix contains lengths of shortest paths between all pairs of nodes. An entry (u,v) represents the length of shortest path from node u to node v. The average shortest path length is the characteristic path length of the network.

Parameters:

L (nd-array) – Directed/undirected connection-length matrix.

Returns:

D (nd-array) – Distance (shortest weighted path) matrix
B (nd-array) – Number of edges in shortest weighted path matrix

Notes

The input matrix must be a connection-length matrix, typically obtained via a mapping from weight to length. For instance, in a weighted correlation network higher correlations are more naturally interpreted as shorter distances and the input matrix should consequently be some inverse of the connectivity matrix. The number of edges in shortest weighted paths may in general exceed the number of edges in shortest binary paths (i.e. shortest paths computed on the binarized connectivity matrix), because shortest weighted paths have the minimal weighted distance, but not necessarily the minimal number of edges.

Lengths between disconnected nodes are set to Inf. Lengths on the main diagonal are set to 0.

Algorithm: Dijkstra’s algorithm.

Mika Rubinov, UNSW/U Cambridge, 2007-2012. Rick Betzel and Andrea Avena, IU, 2012.

Modification history:
- 2007: original (MR).
- 2009-08-04: min() function vectorized (MR).
- 2012: added number of edges in shortest path as additional output (RB/AA).
- 2013: variable names changed for consistency with other functions (MR).
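The behaviour described in the notes can be illustrated with a small Dijkstra sketch that returns both the distance matrix and the number of edges on each shortest path. The function name shortest_paths and the convention that a zero entry means "no connection" are assumptions for this example, not the library's code.

```python
import heapq
import numpy as np

def shortest_paths(L):
    """All-pairs Dijkstra on a connection-length matrix (0 = no edge)."""
    n = L.shape[0]
    D = np.full((n, n), np.inf)          # shortest weighted path lengths
    B = np.zeros((n, n), dtype=int)      # number of edges on those paths
    for s in range(n):
        D[s, s] = 0.0
        heap = [(0.0, 0, s)]
        done = np.zeros(n, dtype=bool)
        while heap:
            d, hops, u = heapq.heappop(heap)
            if done[u]:
                continue                 # stale heap entry
            done[u] = True
            for v in np.nonzero(L[u])[0]:
                nd = d + float(L[u, v])
                if nd < D[s, v]:
                    D[s, v] = nd
                    B[s, v] = hops + 1
                    heapq.heappush(heap, (nd, hops + 1, int(v)))
    return D, B

# Node 3 is disconnected; the direct edge 0-2 is longer than the path via 1.
L = np.array([[0, 1, 5, 0],
              [1, 0, 1, 0],
              [5, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
D, B = shortest_paths(L)
```

Here the shortest weighted path from node 0 to node 2 uses two edges even though a one-edge path exists, which is the situation the notes describe.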

DBHT.CliqHierarchyTree2s(Apm, method1)[source]

CliqHierarchyTree2s looks for 3-cliques of a maximal planar graph, then constructs a hierarchy of the cliques, where ‘inside’ a clique is defined as the smaller of the subgraphs obtained when the entire graph is made disjoint by removing the clique [D9].

Parameters:

Apm (nd-array) – N x N adjacency matrix of a maximal planar graph.
method1 (str) – Choose between ‘uniqueroot’ and ‘equalroot’. Assigns connections between final root cliques. Uses Voronoi tessellation between tiling triangles.

Returns:

H1 (nd-array) – Nc x Nc adjacency matrix for 3-clique hierarchical tree where Nc is the number of 3-cliques.
H2 (nd-array) – Nb x Nb adjacency matrix for bubble hierarchical tree where Nb is the number of bubbles.
Mb (nd-array) – Nc x Nb bubble membership matrix. Mb(n,bi)=1 indicates that 3-clique n belongs to bi bubble.
CliqList (nd-array) – Nc x 3 matrix. Each row vector lists the three vertices forming a 3-clique in the maximal planar graph.
Sb (nd-array) – Nc x 1 vector. Sb(n)=1 indicates nth 3-clique is separating.

DBHT.clique3(A)[source]

Computes the list of 3-cliques.

Parameters:: A (nd-array) – N x N sparse adjacency matrix.
Returns:: clique – Nc x 3 matrix. Each row vector contains the list of vertices for a 3-clique.
Return type:: nd-array
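To make the output format concrete, here is a brute-force sketch that lists every 3-clique (triangle) of a small graph as an Nc x 3 array. The name list_triangles is hypothetical, and the O(N^3) enumeration is chosen for clarity; it is not claimed to be how clique3 works internally.

```python
from itertools import combinations
import numpy as np

def list_triangles(A):
    """Return an Nc x 3 array; each row holds the vertices of one 3-clique."""
    A = np.asarray(A) != 0
    n = A.shape[0]
    tri = [(i, j, k) for i, j, k in combinations(range(n), 3)
           if A[i, j] and A[j, k] and A[i, k]]
    return np.array(tri, dtype=int).reshape(-1, 3)

# Complete graph on 4 vertices (a tetrahedron) has exactly 4 triangles.
A = np.ones((4, 4)) - np.eye(4)
cliques = list_triangles(A)
```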

DBHT.breadth(CIJ, source)[source]

Implementation of breadth-first search.

Parameters:

CIJ (nd-array) – Binary (directed/undirected) connection matrix
source (nd-array) – Source vertex

Returns:

distance (nd-array) – Distance between ‘source’ and i’th vertex (0 for source vertex).
branch (nd-array) – Vertex that precedes i in the breadth-first search tree (-1 for source vertex)

Notes

Breadth-first search tree does not contain all paths (or all shortest paths), but allows the determination of at least one path with minimum distance. The entire graph is explored, starting from source vertex ‘source’.

Olaf Sporns, Indiana University, 2002/2007/2008
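A minimal breadth-first search sketch that produces the two outputs described above is shown below. The function name bfs is hypothetical, and marking unreachable vertices with an infinite distance and a branch of -1 is an assumption made for this example.

```python
from collections import deque
import numpy as np

def bfs(CIJ, source):
    """Breadth-first search on a binary connection matrix."""
    n = CIJ.shape[0]
    distance = np.full(n, np.inf)        # inf marks vertices not reached
    branch = np.full(n, -1, dtype=int)   # predecessor in the BFS tree
    distance[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in np.nonzero(CIJ[u])[0]:
            if np.isinf(distance[v]):    # first visit gives minimum distance
                distance[v] = distance[u] + 1
                branch[v] = u
                queue.append(int(v))
    return distance, branch

# Directed chain 0 -> 1 -> 2, with vertex 3 isolated.
CIJ = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])
distance, branch = bfs(CIJ, 0)
```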

DBHT.BubbleCluster8s(Rpm, Dpm, Hb, Mb, Mv, CliqList)[source]

Obtains non-discrete and discrete clusterings from the bubble topology of PMFG.

Parameters:

Rpm (nd-array) – N x N sparse weighted adjacency matrix of PMFG.
Dpm (nd-array) – N x N shortest path lengths matrix of PMFG
Hb (nd-array) – Undirected bubble tree of PMFG
Mb (nd-array) – Nc x Nb bubble membership matrix for 3-cliques. Mb(n,bi)=1 indicates that 3-clique n belongs to bi bubble.
Mv (nd-array) – N x Nb bubble membership matrix for vertices.
CliqList (nd-array) – Nc x 3 matrix of list of 3-cliques. Each row vector contains the list of vertices for a particular 3-clique.

Returns:

Adjv (nd-array) – N x Nk cluster membership matrix for vertices for non-discrete clustering via the bubble topology. Adjv(n,k)=1 indicates cluster membership of vertex n to kth non-discrete cluster.
Tc (nd-array) – N x 1 cluster membership vector. Tc(n)=k indicates cluster membership of vertex n to kth discrete cluster.

DBHT.DirectHb(Rpm, Hb, Mb, Mv, CliqList)[source]

Computes directions on each separating 3-clique of a maximal planar graph, hence computes Directed Bubble Hierarchical Tree (DBHT).

Parameters:

Rpm (nd-array) – N x N sparse weighted adjacency matrix of PMFG
Hb (nd-array) – Undirected bubble tree of PMFG
Mb (nd-array) – Nc x Nb bubble membership matrix for 3-cliques. Mb(n,bi)=1 indicates that 3-clique n belongs to bi bubble.
Mv (nd-array) – N x Nb bubble membership matrix for vertices.
CliqList (nd-array) – Nc x 3 matrix of list of 3-cliques. Each row vector contains the list of vertices for a particular 3-clique.

Returns:

Hc – Nb x Nb unweighted directed adjacency matrix of DBHT. Hc(i,j)=1 indicates a directed edge from bubble i to bubble j.

Return type:

nd-array

DBHT.HierarchyConstruct4s(Rpm, Dpm, Tc, Adjv, Mv)[source]

Constructs intra- and inter-cluster hierarchy by utilizing Bubble hierarchy structure of a maximal planar graph, namely Planar Maximally Filtered Graph (PMFG).

Parameters:

Rpm (nd-array) – N x N Weighted adjacency matrix of PMFG.
Dpm (nd-array) – N x N shortest path length matrix of PMFG.
Tc (nd-array) – N x 1 cluster membership vector from DBHT clustering. Tc(n)=z_i indicates the cluster of the nth vertex.
Adjv (nd-array) – Bubble cluster membership matrix from BubbleCluster8s.
Mv (nd-array) – Bubble membership of vertices from BubbleCluster8s.

Returns:

Z – (N-1) x 4 linkage matrix, in the same format as the output of the MATLAB function ‘linkage’.

Return type:

nd-array

OWA Weights Functions

OwaWeights.owa_l_moment(T, k=2)[source]

Calculate the OWA weights to calculate the kth linear moment (l-moment) of a returns series as shown in [D10].

Parameters:

T (int) – Number of observations of the returns series.
k (int) – Order of the l-moment. Must be an integer greater than or equal to 1.

Returns:

value – An OWA weights vector of size Tx1.

Return type:

1d-array

OwaWeights.owa_gmd(T)[source]

Calculate the OWA weights to calculate the Gini mean difference (GMD) of a returns series as shown in [D4].

Parameters:: T (int) – Number of observations of the returns series.
Returns:: value – An OWA weights vector of size Tx1.
Return type:: 1d-array
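The following sketch shows how an OWA weights vector can reproduce the Gini mean difference when applied to returns sorted in ascending order. The closed form 2(2i - 1 - T) / (T(T - 1)), the helper name gmd_weights and the 1/(T(T - 1)) normalisation of the mean absolute difference are assumptions for illustration; the library's scaling may differ.

```python
import numpy as np

def gmd_weights(T):
    """Illustrative OWA weights for the Gini mean difference (T x 1)."""
    i = np.arange(1, T + 1)
    return (2.0 * (2 * i - 1 - T) / (T * (T - 1))).reshape(-1, 1)

rng = np.random.default_rng(1)
y = rng.normal(size=50)                  # synthetic returns (assumption)
w = gmd_weights(50)

# OWA operator: weights applied to the returns sorted in ascending order.
gmd_owa = float(np.sort(y) @ w.ravel())
# Direct definition: mean absolute difference over all ordered pairs.
gmd_direct = np.abs(y[:, None] - y[None, :]).sum() / (50 * 49)
```

Both quantities agree, which is what makes the risk measure expressible as a linear function of the ordered returns.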

OwaWeights.owa_cvar(T, alpha=0.05)[source]

Calculate the OWA weights to calculate the Conditional Value at Risk (CVaR) of a returns series as shown in [D4].

Parameters:

T (int) – Number of observations of the returns series.
alpha (float, optional) – Significance level of CVaR. The default is 0.05.

Returns:

value – An OWA weights vector of size Tx1.

Return type:

1d-array
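As a hedged illustration, the sketch below builds CVaR weights for returns sorted in ascending order and checks them against the usual VaR-plus-excess-loss formula. The helper name cvar_weights and the weight construction are assumptions made for this example and are not taken from the library source.

```python
import numpy as np

def cvar_weights(T, alpha=0.05):
    """Illustrative OWA weights for CVaR at significance level alpha (T x 1)."""
    k = int(np.ceil(T * alpha))
    w = np.zeros((T, 1))
    w[: k - 1] = -1.0 / (T * alpha)      # full weight on the k - 1 worst returns
    w[k - 1] = -1.0 - w[: k - 1].sum()   # remainder on the k-th worst return
    return w

y = np.array([-0.08, 0.01, -0.03, 0.02, 0.04, -0.01, 0.03, 0.00, -0.05, 0.02])
alpha = 0.25
w = cvar_weights(len(y), alpha)
cvar_owa = float(np.sort(y) @ w.ravel())

# Direct computation: VaR plus the average excess loss beyond VaR.
var = -np.sort(y)[int(np.ceil(len(y) * alpha)) - 1]
cvar_direct = var + np.maximum(-y - var, 0.0).sum() / (alpha * len(y))
```

The weights sum to -1, so the result is a loss expressed as a positive number when the worst returns are negative.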

OwaWeights.owa_wcvar(T, alphas, weights)[source]

Calculate the OWA weights to calculate the Weighted Conditional Value at Risk (WCVaR) of a returns series as shown in [D4].

Parameters:

T (int) – Number of observations of the returns series.
alphas (list) – List of significance levels of each CVaR model.
weights (list) – List of weights of each CVaR model.

Returns:

value – An OWA weights vector of size Tx1.

Return type:

1d-array

OwaWeights.owa_tg(T, alpha=0.05, a_sim=100)[source]

Calculate the OWA weights to calculate the Tail Gini of a returns series as shown in [D4].

Parameters:

T (int) – Number of observations of the returns series.
alpha (float, optional) – Significance level of Tail Gini. The default is 0.05.
a_sim (float, optional) – Number of CVaRs used to approximate the Tail Gini. The default is 100.

Returns:

value – An OWA weights vector of size Tx1.

Return type:

1d-array

OwaWeights.owa_wr(T)[source]

Calculate the OWA weights to calculate the Worst realization (minimum) of a returns series as shown in [D4].

Parameters:: T (int) – Number of observations of the returns series.
Returns:: value – An OWA weights vector of size Tx1.
Return type:: 1d-array

OwaWeights.owa_rg(T)[source]

Calculate the OWA weights to calculate the range of a returns series as shown in [D4].

Parameters:: T (int) – Number of observations of the returns series.
Returns:: value – An OWA weights vector of size Tx1.
Return type:: 1d-array

OwaWeights.owa_cvrg(T, alpha=0.05, beta=None)[source]

Calculate the OWA weights to calculate the CVaR range of a returns series as shown in [D4].

Parameters:

T (int) – Number of observations of the returns series.
alpha (float, optional) – Significance level of CVaR of losses. The default is 0.05.
beta (float, optional) – Significance level of CVaR of gains. If None it duplicates alpha. The default is None.

Returns:

value – An OWA weights vector of size Tx1.

Return type:

1d-array

OwaWeights.owa_wcvrg(T, alphas, weights_a, betas=None, weights_b=None)[source]

Calculate the OWA weights to calculate the WCVaR range of a returns series as shown in [D4].

Parameters:

T (int) – Number of observations of the returns series.
alphas (list) – List of significance levels of each CVaR of losses model.
weights_a (list) – List of weights of each CVaR of losses model.
betas (list, optional) – List of significance levels of each CVaR of gains model. If None it duplicates alphas. The default is None.
weights_b (list, optional) – List of weights of each CVaR of gains model. If None it duplicates weights_a. The default is None.

Returns:

value – An OWA weights vector of size Tx1.

Return type:

1d-array

OwaWeights.owa_tgrg(T, alpha=0.05, a_sim=100, beta=None, b_sim=None)[source]

Calculate the OWA weights to calculate the Tail Gini range of a returns series as shown in [D4].

Parameters:

T (int) – Number of observations of the returns series.
alpha (float, optional) – Significance level of Tail Gini of losses. The default is 0.05.
a_sim (float, optional) – Number of CVaRs used to approximate Tail Gini of losses. The default is 100.
beta (float, optional) – Significance level of Tail Gini of gains. If None it duplicates alpha value. The default is None.
b_sim (float, optional) – Number of CVaRs used to approximate Tail Gini of gains. If None it duplicates a_sim value. The default is None.

Returns:

value – An OWA weights vector of size Tx1.

Return type:

1d-array

OwaWeights.owa_l_moment_crm(T, k=4, method='MSD', g=0.5, max_phi=0.5, solver='CLARABEL')[source]

Calculate the OWA weights to calculate a convex risk measure that considers higher linear moments or L-moments as shown in [D10].

Parameters:

T (int) – Number of observations of the returns series.
k (int) – Order of the l-moment. Must be an integer greater than or equal to 2.
method (str, optional) –
Method to calculate the weights used to combine the l-moments with order higher than 2. The default value is ‘MSD’. Possible values are:
- ’CRRA’: Normalized Constant Relative Risk Aversion coefficients.
- ’ME’: Maximum Entropy.
- ’MSS’: Minimum Sum Squares.
- ’MSD’: Minimum Square Distance.
g (float, optional) – Risk aversion coefficient of CRRA utility function. The default is 0.5.
max_phi (float, optional) – Maximum weight constraint of L-moments. The default is 0.5.
solver (str, optional) – Solver available for CVXPY. Used to calculate ‘ME’, ‘MSS’ and ‘MSD’ weights. The default value is ‘CLARABEL’.

Returns:

value – An OWA weights vector of size Tx1.

Return type:

1d-array

Gerber Statistic Functions

GerberStatistic.gerber_cov_stat0(X, threshold=0.5)[source]

Compute the Gerber covariance statistic 0, the original Gerber statistic [D5]. The original statistic is not always positive semidefinite (PSD), so this function fixes the result by finding the nearest covariance matrix that is positive semidefinite.

Parameters:

X (ndarray) – Returns series of shape n_sample x n_features.
threshold (float) – Threshold, a value between 0 and 1.

Returns:

value – Gerber covariance matrix of shape (n_features, n_features), where n_features is the number of features.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.
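A rough sketch of the counting idea behind the original Gerber statistic is given below: co-movements that exceed a threshold are counted as concordant or discordant, and the statistic is their normalised difference. The function name gerber_sketch, the use of threshold times each column's sample standard deviation as the cut-off, and the omission of the positive semidefinite repair step are all assumptions of this example.

```python
import numpy as np

def gerber_sketch(X, threshold=0.5):
    """Illustrative Gerber matrix and covariance, without any PSD repair."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    U = (X >= threshold * std).astype(float)    # upward moves beyond cut-off
    D = (X <= -threshold * std).astype(float)   # downward moves beyond cut-off
    conc = U.T @ U + D.T @ D                    # concordant pair counts
    disc = U.T @ D + D.T @ U                    # discordant pair counts
    total = conc + disc
    G = np.divide(conc - disc, total, out=np.zeros_like(total), where=total > 0)
    cov = G * np.outer(std, std)                # scale to covariance units
    return G, cov

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))                   # synthetic returns (assumption)
G, cov = gerber_sketch(X, threshold=0.5)
```

Observations that stay inside the threshold band in either series contribute to neither count, which is the source of the statistic's robustness to small noisy moves.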

GerberStatistic.gerber_cov_stat1(X, threshold=0.5)[source]

Compute the Gerber covariance statistic 1 [D5].

Parameters:

X (ndarray) – Returns series of shape n_sample x n_features.
threshold (float) – Threshold, a value between 0 and 1.

Returns:

value – Gerber covariance matrix of shape (n_features, n_features), where n_features is the number of features.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.

GerberStatistic.gerber_cov_stat2(X, threshold=0.5)[source]

Compute the Gerber covariance statistic 2 [D5].

Parameters:

X (ndarray) – Returns series of shape n_sample x n_features.
threshold (float) – Threshold, a value between 0 and 1.

Returns:

value – Gerber covariance matrix of shape (n_features, n_features), where n_features is the number of features.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.

CPP Functions

cppfunctions.duplication_matrix(n: int)[source]

Calculate duplication matrix of size “n” as shown in [D6].

Parameters:: n (int) – Number of assets.
Returns:: D – Duplication matrix
Return type:: np.ndarray
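The defining property of a duplication matrix, D vech(A) = vec(A) for a symmetric A, can be checked with the sketch below. The function name duplication_matrix_sketch and the column-major vec and vech orderings are assumptions of this example; the library's ordering may differ.

```python
import numpy as np

def duplication_matrix_sketch(n):
    """Illustrative n^2 x n(n+1)/2 matrix with D @ vech(A) == vec(A)."""
    D = np.zeros((n * n, n * (n + 1) // 2))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[i + j * n, col] = 1.0      # position of a_ij in column-major vec
            D[j + i * n, col] = 1.0      # position of the mirrored a_ji
            col += 1
    return D

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [3.0, 5.0, 6.0]])
vec_A = A.flatten(order="F")
# vech stacks the lower triangle column by column.
vech_A = np.array([A[i, j] for j in range(3) for i in range(j, 3)])
D3 = duplication_matrix_sketch(3)
```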

cppfunctions.duplication_elimination_matrix(n: int)[source]

Calculate duplication elimination matrix of size “n” as shown in [D6].

Parameters:: n (int) – Number of assets.
Returns:: L – Duplication elimination matrix.
Return type:: np.ndarray

cppfunctions.duplication_summation_matrix(n: int)[source]

Calculate duplication summation matrix of size “n” as shown in [D7].

Parameters:: n (int) – Number of assets.
Returns:: S – Duplication summation matrix.
Return type:: np.ndarray

cppfunctions.commutation_matrix(T: int, n: int)[source]

Calculate commutation matrix of size T x n.

Parameters:

T (int) – Number of rows.
n (int) – Number of columns.

Returns:

K – Commutation matrix.

Return type:

np.ndarray
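A commutation matrix maps vec(A) to vec(A transposed). The sketch below builds one under the assumption of column-major vectorisation; the function name commutation_matrix_sketch is hypothetical and the library's convention may differ.

```python
import numpy as np

def commutation_matrix_sketch(T, n):
    """Illustrative Tn x Tn matrix with K @ vec(A) == vec(A.T) for a T x n A."""
    K = np.zeros((T * n, T * n))
    for i in range(T):
        for j in range(n):
            # a_ij sits at i + j*T in vec(A) and at j + i*n in vec(A.T).
            K[j + i * n, i + j * T] = 1.0
    return K

A = np.arange(6.0).reshape(2, 3)
K = commutation_matrix_sketch(2, 3)
vec_A = A.flatten(order="F")
vec_At = A.T.flatten(order="F")
```

Since K is a permutation matrix, its transpose is also its inverse.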

cppfunctions.coskewness_matrix(Y: ndarray)[source]

Calculates coskewness rectangular matrix as shown in [D7].

Parameters:: Y (ndarray or dataframe) – Returns series of shape n_sample x n_features.
Returns:: M3 – The coskewness rectangular matrix.
Return type:: ndarray
Raises:: ValueError – When the value cannot be calculated.
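For orientation, a coskewness matrix of n assets can be arranged as an n x n^2 rectangle whose entry (i, j*n + k) is the third central co-moment of assets i, j and k. The sketch below assumes that layout and a 1/T normalisation; both, along with the name coskewness_sketch, are assumptions and may not match the library's conventions.

```python
import numpy as np

def coskewness_sketch(Y):
    """Illustrative n x n^2 coskewness matrix from a T x n returns array."""
    Y = np.asarray(Y, dtype=float)
    T, n = Y.shape
    Yc = Y - Y.mean(axis=0)                       # centre each column
    M3 = np.einsum("ti,tj,tk->ijk", Yc, Yc, Yc) / T
    return M3.reshape(n, n * n)

rng = np.random.default_rng(2)
Y = rng.normal(size=(100, 3))                     # synthetic returns (assumption)
M3 = coskewness_sketch(Y)

# Spot check of one entry against its definition.
Yc = Y - Y.mean(axis=0)
m_012 = float(np.mean(Yc[:, 0] * Yc[:, 1] * Yc[:, 2]))
```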

cppfunctions.semi_coskewness_matrix(Y: ndarray)[source]

Calculates lower semi coskewness rectangular matrix as shown in [D7].

Parameters:: Y (ndarray or dataframe) – Returns series of shape n_samples x n_features.
Returns:: s_M3 – The lower semi coskewness rectangular matrix.
Return type:: ndarray
Raises:: ValueError – When the value cannot be calculated.

cppfunctions.cokurtosis_matrix(Y: ndarray)[source]

Calculates cokurtosis square matrix as shown in [D7].

Parameters:: Y (ndarray or dataframe) – Returns series of shape n_samples x n_features.
Returns:: S4 – The cokurtosis square matrix.
Return type:: ndarray
Raises:: ValueError – When the value cannot be calculated.

cppfunctions.semi_cokurtosis_matrix(Y)[source]

Calculates lower semi cokurtosis square matrix as shown in [D7].

Parameters:: Y (ndarray or dataframe) – Returns series of shape n_sample x n_features.
Returns:: s_S4 – The lower semi cokurtosis square matrix.
Return type:: ndarray
Raises:: ValueError – When the value cannot be calculated.

cppfunctions.k_eigh(Y, k)[source]

Calculates lower semi cokurtosis square matrix as shown in [D7].

Parameters:: Y (ndarray or dataframe) – Returns series of shape n_sample x n_features.
Returns:: s_S4 – The lower semi cokurtosis square matrix.
Return type:: ndarray
Raises:: ValueError – When the value cannot be calculated.

cppfunctions.d_corr(X, Y)[source]

Calculates the distance correlation of X and Y.

Parameters:

X (ndarray or dataframe) – Returns series of shape n_sample x n_features.
Y (ndarray or dataframe) – Returns series of shape n_sample x n_features.

Returns:

value – Distance correlation.

Return type:

float

Raises:

ValueError – When the value cannot be calculated.

cppfunctions.d_corr_matrix(Y)[source]

Calculates the distance correlation matrix of the matrix of variables Y.

Parameters:: Y (ndarray or dataframe) – Returns series of shape n_sample x n_features.
Returns:: value – Distance correlation matrix.
Return type:: ndarray
Raises:: ValueError – When the value cannot be calculated.

Auxiliary Functions

AuxFunctions.is_pos_def(cov, threshold=1e-08)[source]

Indicate if a matrix is positive (semi)definite.

Parameters:: cov (ndarray) – Covariance matrix of shape (n_features, n_features), where n_features is the number of features.
Returns:: value – True if matrix is positive (semi)definite.
Return type:: bool
Raises:: ValueError – When the value cannot be calculated.

AuxFunctions.cov2corr(cov)[source]

Generate a correlation matrix from a covariance matrix cov.

Parameters:: cov (ndarray) – Covariance matrix of shape n_features x n_features, where n_features is the number of features.
Returns:: corr – A correlation matrix.
Return type:: ndarray
Raises:: ValueError – When the value cannot be calculated.

AuxFunctions.corr2cov(corr, std)[source]

Generate a covariance matrix from a correlation matrix corr and a standard deviation vector std.

Parameters:

corr (ndarray) – Assets correlation matrix of shape n_features x n_features, where n_features is the number of features.
std (1darray) – Assets standard deviation vector of size n_features, where n_features is the number of features.

Returns:

cov – A covariance matrix.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.
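The relationship between the two conversions can be sketched with plain NumPy: dividing a covariance matrix by the outer product of the standard deviations gives the correlation matrix, and multiplying back recovers the covariance. The numbers below are made up for illustration and the snippet does not call the library.

```python
import numpy as np

cov = np.array([[0.04, 0.006],
                [0.006, 0.09]])

std = np.sqrt(np.diag(cov))              # standard deviations: 0.2 and 0.3
corr = cov / np.outer(std, std)          # covariance -> correlation
cov_back = corr * np.outer(std, std)     # correlation + std -> covariance
```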

AuxFunctions.cov_fix(cov, method='clipped', threshold=1e-08)[source]

Fix a covariance matrix to a positive definite matrix.

Parameters:

cov (ndarray) – Covariance matrix of shape n_features x n_features, where n_features is the number of features.
method (str) – The default value is ‘clipped’, see more in cov_nearest.
**kwargs –
Other parameters from cov_nearest.

Returns:

cov_ – A positive definite covariance matrix.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.cov_returns(cov, seed=0)[source]

Generate a matrix of returns that have a covariance matrix cov.

Parameters:: cov (ndarray) – Covariance matrix of shape n_features x n_features, where n_features is the number of features.
Returns:: a – A matrix of returns that have a covariance matrix cov.
Return type:: ndarray
Raises:: ValueError – When the value cannot be calculated.

AuxFunctions.block_vec_pq(A, p, q)[source]

Calculates block vectorization operator as shown in [D11] and [D12].

Parameters:

A (ndarray) – Matrix that will be block vectorized.
p (int) – Order p of block vectorization operator.
q (int) – Order q of block vectorization operator.

Returns:

bvec_A – The block vectorized matrix.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.dcorr(X, Y)[source]

Calculate the distance correlation between two variables [D13].

Parameters:

X (1d-array) – Returns series, must have shape n_sample x 1.
Y (1d-array) – Returns series, must have shape n_sample x 1.

Returns:

value – The distance correlation between variables X and Y.

Return type:

float

Raises:

ValueError – When the value cannot be calculated.
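A compact sketch of the sample distance correlation is shown below: pairwise distance matrices are double-centred, and the statistic is the normalised mean of their element-wise product. The function name distance_correlation is hypothetical and this is an illustration of the standard definition, not the library's implementation.

```python
import numpy as np

def distance_correlation(x, y):
    """Illustrative sample distance correlation of two 1-D series."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    a = np.abs(x - x.T)                           # pairwise distances in x
    b = np.abs(y - y.T)                           # pairwise distances in y
    # Double centring: remove row means and column means, add grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = max((A * B).mean(), 0.0)              # guard against round-off
    dvar2 = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(dcov2 / dvar2))

rng = np.random.default_rng(4)
x = rng.normal(size=100)
d_linear = distance_correlation(x, 2.0 * x + 1.0)   # exact linear relation
d_noise = distance_correlation(x, rng.normal(size=100))
```

Unlike Pearson correlation, the statistic lies between 0 and 1 and can pick up non-linear dependence.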

AuxFunctions.dcorr_matrix(X)[source]

Calculate the distance correlation matrix of n variables.

Parameters:: X (ndarray) – Returns series of shape n_sample x n_features.
Returns:: corr – The distance correlation matrix of shape n_features x n_features.
Return type:: ndarray
Raises:: ValueError – When the value cannot be calculated.

AuxFunctions.numBins(n_samples, corr=None)[source]

Calculate the optimal number of bins for discretization of mutual information and variation of information.

Parameters:

n_samples (integer) – Number of samples.
corr (float, optional) – Correlation coefficient of variables. The default value is None.

Returns:

bins – The optimal number of bins.

Return type:

int

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.mutual_info_matrix(X, bins_info='KN', normalize=True)[source]

Calculate the mutual information matrix of n variables.

Parameters:

X (ndarray) – Returns series of shape n_sample x n_features.
bins_info (int or str) –
Number of bins used to calculate mutual information. The default value is ‘KN’. Possible values are:
- ’KN’: Knuth’s choice method. See more in knuth_bin_width.
- ’FD’: Freedman–Diaconis’ choice method. See more in freedman_bin_width.
- ’SC’: Scott’s choice method. See more in scott_bin_width.
- ’HGR’: Hacine-Gharbi and Ravier’s choice method.
- int: integer value chosen by the user.
normalize (bool) – Whether to normalize the mutual information. The default value is True.

Returns:

corr – The mutual information matrix of shape n_features x n_features.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.var_info_matrix(X, bins_info='KN', normalize=True)[source]

Calculate the variation of information matrix of n variables.

Parameters:

X (ndarray) – Returns series of shape n_sample x n_features.
bins_info (int or str) –
Number of bins used to calculate variation of information. The default value is ‘KN’. Possible values are:
- ’KN’: Knuth’s choice method. See more in knuth_bin_width.
- ’FD’: Freedman–Diaconis’ choice method. See more in freedman_bin_width.
- ’SC’: Scott’s choice method. See more in scott_bin_width.
- ’HGR’: Hacine-Gharbi and Ravier’s choice method.
- int: integer value chosen by the user.
normalize (bool) – Whether to normalize the variation of information. The default value is True.

Returns:

corr – The variation of information matrix of shape n_features x n_features.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.
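The two information measures are closely related, as the sketch below illustrates for a single pair of series using a 2-D histogram. The fixed bin count, the normalisation by the joint entropy and the function names are assumptions for this example; the library's bin selection methods (’KN’, ’FD’, ’SC’, ’HGR’) are not reproduced here.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def info_measures(x, y, bins=10, normalize=True):
    """Illustrative variation of information and mutual information."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    hx = entropy(pxy.sum(axis=1))        # marginal entropy of x
    hy = entropy(pxy.sum(axis=0))        # marginal entropy of y
    hxy = entropy(pxy.ravel())           # joint entropy
    mi = hx + hy - hxy                   # mutual information
    vi = hx + hy - 2.0 * mi              # variation of information
    if normalize:
        vi = vi / hxy                    # assumed normalisation: joint entropy
    return vi, mi

rng = np.random.default_rng(5)
x = rng.normal(size=500)
vi_same, mi_same = info_measures(x, x)
vi_indep, mi_indep = info_measures(x, rng.normal(size=500))
```

A series compared with itself has zero variation of information, while two independent series give a normalised value close to one.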

AuxFunctions.ltdi_matrix(X, alpha=0.05)[source]

Calculate the lower tail dependence index matrix using the empirical approach.

Parameters:

X (ndarray) – Returns series of shape n_sample x n_features.
alpha (float, optional) – Significance level for lower tail dependence index. The default is 0.05.

Returns:

corr – The lower tail dependence index matrix of shape n_features x n_features.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.two_diff_gap_stat(dist, clusters, max_k=10)[source]

Calculate the optimal number of clusters based on the two difference gap statistic [D14].

Parameters:

codep (DataFrame) – A codependence matrix.
dist (DataFrame) – A distance matrix based on the codependence matrix.
clusters (ndarray) – The hierarchical clustering encoded as a linkage matrix, see linkage for more details.
max_k (int, optional) – Max number of clusters used by the two difference gap statistic to find the optimal number of clusters. The default is 10.

Returns:

k – The optimal number of clusters based on the two difference gap statistic.

Return type:

int

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.fitKDE(obs, bWidth=0.01, kernel='gaussian', x=None)[source]

Fit a kernel to a series of observations obs and derive the probability of the observations. x is the array of values on which the fitted KDE will be evaluated. The result is the empirical Probability Density Function (PDF). For more information see chapter 2 of [D1].

Parameters:

obs (ndarray) – Observations to fit. Commonly the diagonal of the eigenvalue matrix.
bWidth (float, optional) – The bandwidth of the kernel. The default value is 0.01.
kernel (string, optional) –
The kernel to use. The default value is ‘gaussian’. For more information see: kernel-density. Possible values are:
- ’gaussian’: gaussian kernel.
- ’tophat’: tophat kernel.
- ’epanechnikov’: epanechnikov kernel.
- ’exponential’: exponential kernel.
- ’linear’: linear kernel.
- ’cosine’: cosine kernel.
x (ndarray, optional) – It is the array of values on which the fit KDE will be evaluated.

Returns:

pdf – Empirical PDF.

Return type:

pd.Series

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.mpPDF(var, q, pts)[source]

Creates a Marchenko-Pastur Probability Density Function (PDF). For more information see chapter 2 of [D1].

Parameters:

var (float) – Variance.
q (float) – T/N where T is the number of rows and N the number of columns
pts (int) – Number of points used to construct the PDF.

Returns:

pdf – Marchenko-Pastur PDF.

Return type:

pd.Series

Raises:

ValueError – When the value cannot be calculated.
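For reference, the Marchenko-Pastur density can be sketched directly from its closed form, with support between var (1 - sqrt(1/q))^2 and var (1 + sqrt(1/q))^2. The function name marchenko_pastur_pdf and the choice to return plain NumPy arrays instead of a pd.Series are assumptions of this example.

```python
import numpy as np

def marchenko_pastur_pdf(var, q, pts):
    """Illustrative Marchenko-Pastur density on its support (q = T/N)."""
    e_min = var * (1.0 - np.sqrt(1.0 / q)) ** 2
    e_max = var * (1.0 + np.sqrt(1.0 / q)) ** 2
    e_val = np.linspace(e_min, e_max, pts)
    # clip avoids tiny negative values under the square root at the edges.
    inside = np.clip((e_max - e_val) * (e_val - e_min), 0.0, None)
    pdf = q / (2.0 * np.pi * var * e_val) * np.sqrt(inside)
    return e_val, pdf

e_val, pdf = marchenko_pastur_pdf(var=1.0, q=10.0, pts=10000)
# Trapezoidal rule: the density should integrate to about one when q > 1.
area = float(np.sum((pdf[1:] + pdf[:-1]) / 2.0 * np.diff(e_val)))
```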

AuxFunctions.errPDFs(var, eVal, q, bWidth=0.01, pts=1000)[source]

Fit error of Empirical PDF (uses Marchenko-Pastur PDF). For more information see chapter 2 of [D1].

Parameters:

var (float) – Variance.
eVal (ndarray) – Eigenvalues to fit.
q (float) – T/N where T is the number of rows and N the number of columns.
bWidth (float, optional) – The bandwidth of the kernel. The default value is 0.01.
pts (int) – Number of points used to construct the PDF. The default value is 1000.

Returns:

pdf – Sum squared error.

Return type:

float

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.findMaxEval(eVal, q, bWidth=0.01)[source]

Find the maximum random eigenvalue by fitting the Marchenko-Pastur distribution; every eigenvalue larger than this one is a signal eigenvalue. For more information see chapter 2 of [D1].

Parameters:

eVal (ndarray) – Eigenvalues to fit.
q (float) – T/N where T is the number of rows and N the number of columns.
bWidth (float, optional) – The bandwidth of the kernel.

Returns:

pdf – The first value is the maximum random eigenvalue and the second is the variance attributed to noise; one minus this variance is one way to measure signal-to-noise.

Return type:

tuple (float, float)

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.getPCA(matrix)[source]

Gets the eigenvalues and eigenvectors of a Hermitian matrix. For more information see chapter 2 of [D1].

Parameters:: matrix (ndarray or pd.DataFrame) – Correlation matrix.
Returns:: pdf – The first value is the eigenvalues of the correlation matrix and the second is the eigenvectors of the correlation matrix.
Return type:: tuple (ndarray, ndarray)
Raises:: ValueError – When the value cannot be calculated.
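A minimal sketch of this kind of decomposition with NumPy is shown below. Sorting the eigenvalues in descending order and returning them as a diagonal matrix are assumptions of this example, as is the name eig_sorted; the library may arrange its outputs differently.

```python
import numpy as np

def eig_sorted(matrix):
    """Illustrative eigendecomposition of a symmetric matrix, largest first."""
    e_val, e_vec = np.linalg.eigh(matrix)     # eigh returns ascending order
    order = np.argsort(e_val)[::-1]
    return np.diag(e_val[order]), e_vec[:, order]

corr = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.3],
                 [0.2, 0.3, 1.0]])
e_val, e_vec = eig_sorted(corr)
# The decomposition reconstructs the original matrix.
corr_back = e_vec @ e_val @ e_vec.T
```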

AuxFunctions.denoisedCorr(eVal, eVec, nFacts, kind='fixed')[source]

Remove noise from the correlation matrix by fixing random eigenvalues or by using the spectral method. For more information see chapter 2 of [D1].

Parameters:

eVal (ndarray) – Eigenvalues.
eVec (ndarray) – Eigenvectors.
nFacts (float) – The number of factors.
kind (str, optional) –
The denoise method. The default value is ‘fixed’. Possible values are:
- ’fixed’: takes the average of eigenvalues above the max Marchenko-Pastur limit.
- ’spectral’: sets to zero the eigenvalues above the max Marchenko-Pastur limit.

Returns:

corr – Denoised correlation matrix.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.shrinkCorr(eVal, eVec, nFacts, alpha=0)[source]

Remove noise from correlation using target shrinkage. For more information see chapter 2 of [D1].

Parameters:

eVal (ndarray) – Eigenvalues.
eVec (ndarray) – Eigenvectors.
nFacts (float) – The number of factors.
alpha (float, optional) – Shrinkage factor.

Returns:

corr – Denoised correlation matrix.

Return type:

ndarray

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.denoiseCov(cov, q, kind='fixed', bWidth=0.01, detone=False, mkt_comp=1, alpha=0)[source]

Remove noise from cov by fixing random eigenvalues of their correlation matrix. For more information see chapter 2 of [D1].

Parameters:

cov (ndarray or pd.DataFrame) – Covariance matrix of shape n_features x n_features, where n_features is the number of features.
q (float) – T/N where T is the number of rows and N the number of columns.
bWidth (float) – The bandwidth of the kernel.
kind (str, optional) –
The denoise method. The default value is ‘fixed’. Possible values are:
- ’fixed’: takes the average of eigenvalues above the max Marchenko-Pastur limit.
- ’spectral’: sets to zero the eigenvalues above the max Marchenko-Pastur limit.
- ’shrink’: uses target shrinkage method.
detone (bool, optional) – Whether to remove the first mkt_comp components of the correlation matrix. The detoned correlation matrix is singular, so it cannot be inverted.
mkt_comp (float, optional) – Number of first components that will be removed using the detone method.
alpha (float, optional) – Shrinkage factor.

Returns:

cov_ – Denoised covariance matrix.

Return type:

ndarray or pd.DataFrame

Raises:

ValueError – When the value cannot be calculated.

AuxFunctions.round_values(data, decimals=4, wider=False)[source]

This function helps us round values either toward zero or away from zero.

Parameters:

data (np.ndarray, pd.Series or pd.DataFrame) – Data that are going to be rounded.
decimals (int, optional) – Number of decimals to round to. The default value is 4.
wider (bool, optional) – If False, rounds toward zero; if True, rounds away from zero. The default value is False.

Returns:

value – Data rounded using selected method.

Return type:

np.ndarray, pd.Series or pd.DataFrame

Raises:

ValueError – When the value cannot be calculated.
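
One way to read the two rounding modes is sketched below with NumPy. The function name round_values_sketch is hypothetical, and the sketch always returns a NumPy array, whereas the documented function returns the same kind of object it receives.

```python
import numpy as np


def round_values_sketch(data, decimals=4, wider=False):
    """Hypothetical sketch of rounding toward or away from zero."""
    scale = 10.0 ** decimals
    scaled = np.asarray(data, dtype=float) * scale
    if wider:
        # Away from zero: round the magnitude up, then restore the sign.
        rounded = np.sign(scaled) * np.ceil(np.abs(scaled))
    else:
        # Toward zero: drop everything after the requested decimals.
        rounded = np.trunc(scaled)
    return rounded / scale


values = np.array([1.25, -1.25])
toward_zero = round_values_sketch(values, decimals=1, wider=False)
away_from_zero = round_values_sketch(values, decimals=1, wider=True)
```

Rounding away from zero can be useful when a bound must not be tightened by rounding, while rounding toward zero never increases the magnitude of a value.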

AuxFunctions.weights_discretizetion(weights, prices, capital=1000000, w_decimal=6, ascending=False)[source]

This function finds the number of shares that must be bought or sold to achieve the portfolio weights, given the asset prices and the invested capital.

Parameters:

weights (pd.Series or pd.DataFrame) – Vector of weights of size n_assets x 1.
prices (pd.Series or pd.DataFrame) – Vector of prices of size n_assets x 1.
capital (float, optional) – Capital invested. The default value is 1000000.
w_decimal (int, optional) – Number of decimals used to round the portfolio weights. The default value is 6.
ascending (bool, optional) – If True assigns excess capital to assets with lower weights, else, to assets with higher weights. The default value is False.

Returns:

n_shares – Number of shares that must be bought or sold to achieve portfolio weights.

Return type:

pd.DataFrame

Raises:

ValueError – When the value cannot be calculated.
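
To make the idea concrete, here is a simple greedy sketch of weight discretization for a long-only portfolio. It is an assumption about how such a procedure could work, not the library's algorithm: the function name discretize_weights_sketch is hypothetical, the w_decimal rounding step is omitted, and the leftover cash is returned alongside the share counts for inspection.

```python
import numpy as np
import pandas as pd


def discretize_weights_sketch(weights, prices, capital=1_000_000, ascending=False):
    """Hypothetical greedy discretization; assumes non-negative weights."""
    w = pd.Series(weights, dtype=float)
    p = pd.Series(prices, dtype=float).reindex(w.index)
    # Whole shares affordable with each asset's target allocation.
    shares = np.floor(w * capital / p)
    leftover = capital - float((shares * p).sum())
    # Visit assets in weight order and buy one extra share where cash allows.
    for asset in w.sort_values(ascending=ascending).index:
        if p[asset] <= leftover:
            shares[asset] += 1
            leftover -= p[asset]
    return shares.astype(int).to_frame("n_shares"), leftover


weights = pd.Series({"A": 0.5, "B": 0.3, "C": 0.2})
prices = pd.Series({"A": 30.0, "B": 7.0, "C": 11.0})
n_shares, cash_left = discretize_weights_sketch(weights, prices, capital=1000)
```

With ascending=False the extra shares go first to the assets with higher weights, which mirrors the role of the ascending parameter described above.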

AuxFunctions.color_list(k)[source]

This function creates a list of colors.

Parameters:

k (int) – Number of colors.

Returns:

colors – A list of colors.

Return type:

list

Bibliography

[D1]

Marcos M. López de Prado. Machine Learning for Asset Managers. Elements in Quantitative Finance. Cambridge University Press, 2020. doi:10.1017/9781108883658.

[D2]

Won-Min Song, T. Di Matteo, and Tomaso Aste. Hierarchical information clustering by means of topologically embedded graphs. PLOS ONE, 7(3):1–14, 03 2012. URL: https://doi.org/10.1371/journal.pone.0031929, doi:10.1371/journal.pone.0031929.

[D3]

Wolfram Barfuss, Guido Previde Massara, T. Di Matteo, and Tomaso Aste. Parsimonious modeling with information filtering networks. Physical Review E, Dec 2016. URL: http://dx.doi.org/10.1103/PhysRevE.94.062306, doi:10.1103/physreve.94.062306.

[D4]

Dany Cajas. OWA portfolio optimization: a disciplined convex programming framework. SSRN Electronic Journal, 2021. URL: https://doi.org/10.2139/ssrn.3988927, doi:10.2139/ssrn.3988927.

[D5]

Sander Gerber, Harry Markowitz, Philip Ernst, Yinsen Miao, Babak Javid, and Paul Sargen. The Gerber statistic: a robust co-movement measure for portfolio optimization. SSRN Electronic Journal, 2021. URL: https://doi.org/10.2139/ssrn.3880054, doi:10.2139/ssrn.3880054.

[D6]

Jan R. Magnus and H. Neudecker. The elimination matrix: some lemmas and applications. SIAM Journal on Algebraic Discrete Methods, 1(4):422–449, 1980. URL: https://doi.org/10.1137/0601049, doi:10.1137/0601049.

[D7]

Dany Cajas. Convex optimization of portfolio kurtosis. SSRN Electronic Journal, 2022. URL: https://doi.org/10.2139/ssrn.4202967, doi:10.2139/ssrn.4202967.

[D8]

Guido Previde Massara, T. Di Matteo, and Tomaso Aste. Network filtering for big data: triangulated maximally filtered graph. Journal of Complex Networks, 5:161–178, 2017.

[D9]

Won-Min Song, T. Di Matteo, and Tomaso Aste. Nested hierarchies in planar graphs. Discrete Applied Mathematics, 159(17):2135–2146, 2011. URL: https://www.sciencedirect.com/science/article/pii/S0166218X11002794, doi:10.1016/j.dam.2011.07.018.

[D10]

Dany Cajas. Higher order moment portfolio optimization with L-moments. SSRN Electronic Journal, 2023. URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4393155.

[D11]

missing booktitle in Loan1992

[D12]

Ignacio Ojeda. Kronecker square roots and the block vec matrix. The American Mathematical Monthly, 122(1):60, 2015. URL: https://doi.org/10.4169/amer.math.monthly.122.01.60, doi:10.4169/amer.math.monthly.122.01.60.

[D13]

Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007. URL: https://doi.org/10.1214/009053607000000505, doi:10.1214/009053607000000505.

[D14]

Shihong Yue, Xiuxiu Wang, and Miaomiao Wei. Application of two-order difference to gap statistic. Transactions of Tianjin University, 14:217–221, 06 2008. doi:10.1007/s12209-008-0039-1.