To understand the algorithm's implementation, one must first grasp some fundamental topological concepts, specifically those of base, subbase, and their relationship with an object's topology.
Let \(X\) be an arbitrary set. A topology in \(X\) is any system \(\tau\) of subsets \(G\) of \(X\) that verifies two conditions (Kolmogórov & Fomin, 1978, p. 90):
The set \(X\) itself and the empty set \(\emptyset\) belong to \(\tau\).
The union \(\bigcup_{\alpha}G_{\alpha}\) of any number (finite or infinite) of sets in \(\tau\), and the intersection \(\bigcap_{k}G_{k}\) of a finite number of sets in \(\tau\), belong to \(\tau\). Another way to define it, following Kelley (1955, p. 50), is that the intersection of any two members of \(\tau\) is a member of \(\tau\) (\(\tau\) is closed under finite intersections) and that the union of the members of any subfamily of \(\tau\) is a member of \(\tau\) (\(\tau\) is closed under arbitrary unions).
Thus, the set \(X\) is called the space of the topology \(\tau\), \(\tau\) is called the topology of \(X\), and the pair \((X, \tau)\) is called a topological space, that is, \(T = (X, \tau)\). The sets belonging to the system \(\tau\) are called open (Kolmogórov & Fomin, 1978, p. 90).
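For a small finite example, these axioms can be checked mechanically. The following base-R sketch tests whether a family of subsets of a finite set satisfies them; the helper `is_topology` is illustrative and not part of any package, and for a finite family closure under pairwise unions and intersections suffices:

```r
# Sketch: test whether a family `tau` of subsets of a finite set X satisfies
# the topology axioms. `is_topology` is an illustrative helper, not part of
# the topologyR package.
is_topology <- function(X, tau) {
  contains <- function(S) any(vapply(tau, function(A) setequal(A, S), logical(1)))
  # Axiom 1: X itself and the empty set belong to tau
  if (!contains(X) || !contains(integer(0))) return(FALSE)
  # Axiom 2: closure under (pairwise) unions and intersections
  for (A in tau) for (B in tau) {
    if (!contains(union(A, B)) || !contains(intersect(A, B))) return(FALSE)
  }
  TRUE
}

X <- c(1, 2, 3)
tau_good <- list(integer(0), 1, c(1, 2, 3))
tau_bad  <- list(integer(0), 1, 2, c(1, 2, 3))  # {1} U {2} = {1, 2} is missing

is_topology(X, tau_good)  # TRUE
is_topology(X, tau_bad)   # FALSE
```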
As Kolmogórov & Fomin (1978, p. 90) point out, a metric space consists of a set of points and a metric introduced in that set; similarly, a topological space consists of a set of points and a topology introduced in it. Consequently, defining a topological space means defining a set \(X\) and a topology \(\tau\) on it, that is, indicating which subsets are considered open in \(X\) (Kolmogórov & Fomin, 1978, p. 90).
It is clear that different topologies can be introduced in the same set \(X\), thus transforming it into different topological spaces (Kolmogórov & Fomin, 1978, p. 90). However, the topological space, that is, the pair \((X, \tau)\), will be denoted by \(T\), and the elements of the topological space will be called “points”.
The sets \(T \setminus G\), that is, the elements of \(T\) that do not belong to \(G\), which are complementary to the open sets, are called closed sets of the topological space \(T\) (Kolmogórov & Fomin, 1978, p. 90).
The principle of duality consists of the fact that from any theorem concerning a system of subsets of a fixed set \(S\), another dual theorem can be automatically deduced by replacing the original sets with their complements, unions (sums) of sets with intersections, and intersections with unions (Kolmogórov & Fomin, 1978, p. 16). By virtue of the duality relations 1 (the complement of a union is the intersection of the complements) and 2 (the complement of an intersection is the union of the complements), it follows that (Kolmogórov & Fomin, 1978, p. 90):
The empty set \(\emptyset\) and the entire space \(T\) are closed.
The intersection of any number (finite or infinite) and the union of a finite number of closed sets are closed.
In every topological space, the concepts of neighborhood, adherence point, adherence of a set, etc., are naturally introduced from these definitions.
As Kolmogórov & Fomin (1978, p. 91) point out, a neighborhood of the point \(x\in T\) is any open set \(G \subset T\) that contains the point \(x\); a point \(x\in T\) is called an adherence point of the set \(M \subset T\), when every neighborhood of point \(x\) contains at least one point of \(M\); a point \(x\) is called an accumulation point of the set \(M\), when every neighborhood of the point \(x\) contains an infinite number of points of \(M\). The totality of adherence points of the set \(M\) is called the adherence of the set \(M\) and is denoted by the symbol \([M]\) (Kolmogórov & Fomin, 1978, p. 91).
As Kolmogórov & Fomin (1978, p. 93) note, a collection \(\zeta\) of open subsets is called a base of the topological space \(T\) when every open subset of \(T\) can be represented as a union of sets from \(\zeta\).
For example, the collection of all open balls (of all possible radii and centers) constitutes a base in a metric space. In particular, the system of all open intervals is a base on the real line. From the above, it follows that the topology \(\tau\) of the space \(T\) is defined once a base \(\zeta\) is indicated in this space. (Kolmogórov & Fomin, 1978, p. 93)
As Kolmogórov & Fomin (1978, p. 93) point out, for this method of introducing topology to have practical value, it is necessary to indicate those conditions that a system \(\zeta\) of subsets of the given set \(T\) must satisfy so that the collection of all possible sums of sets from \(\zeta\) can be considered as the collection of open sets in \(T\) (that is, so that these sums verify axioms 1° and 2° of topological space). Such conditions are given by the following theorem:
THEOREM Suppose that in a set \(T\), a system \(\zeta\) of subsets \(G_\alpha\) has been chosen that verifies the following conditions:
Every point \(x\in T\) is contained in at least one subset \(G_\alpha \in \zeta\).
If \(x \in G_\alpha \cap G_\beta\), there exists a \(G_\gamma \in \zeta\) such that \(x\in G_\gamma \subseteq G_\alpha \cap G_\beta\).
Thus, if we declare open in \(T\) the empty set and all sets that can be represented as sums (unions) of certain \(G_\alpha \in \zeta\), then \(T\) becomes a topological space (that is, these sums verify axioms 1° and 2°) and the system \(\zeta\) is a base of it (Kolmogórov & Fomin, 1978, p. 93).
To prove whether a given collection of open sets is a base or not, the following criterion is often useful (Kolmogórov & Fomin, 1978, p. 94):
THEOREM
For a system \(G_\alpha\) of open sets to be a base of the topological space \(T\), it is necessary and sufficient that for every open set \(G\) and every point \(x\in G\), there exists a set \(G_\alpha\) from this system such that \(x\in G_\alpha \subseteq G\).
Finally, Kelley (1955, p. 61) notes that a family \(S\) of sets is a sub-base of a topology \(\tau\) if the family of all finite intersections of members of \(S\) is a base of \(\tau\) (or, equivalently, every member of \(\tau\) is a union of finite intersections of members of \(S\)). More intuitively, a subbase is a collection \(S\) of subsets of a topological space that can be completed to a base by adjoining all finite intersections of members of \(S\).
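A small sketch can make the subbase-to-base construction concrete. Assuming a finite underlying set, the base is obtained by closing the subbase under pairwise intersections (the name `base_from_subbase` is illustrative, not a topologyR function; by convention the empty intersection contributes the whole space \(X\)):

```r
# Sketch: close a finite subbase S under pairwise intersections to obtain a
# base. Illustrative code, not part of the topologyR package.
base_from_subbase <- function(X, S) {
  base <- c(list(X), S)
  has <- function(fam, A) any(vapply(fam, function(B) setequal(A, B), logical(1)))
  repeat {
    new_sets <- list()
    for (A in base) for (B in base) {
      AB <- intersect(A, B)
      if (!has(c(base, new_sets), AB)) new_sets <- c(new_sets, list(AB))
    }
    if (length(new_sets) == 0) break   # stable: closed under intersections
    base <- c(base, new_sets)
  }
  base
}

X <- 1:4
S <- list(c(1, 2), c(2, 3))            # the subbase
B <- base_from_subbase(X, S)
# B now also contains the intersection {2}; every member of the topology
# generated by S is a union of members of B.
```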
Kelley (1955, p. 67) notes that a topological space \(T=(X, \tau)\) is connected if \(X\) is not the union of two separated non-empty subsets. In other words, it is a subset of a topological space that cannot be described as a disjoint union of two non-empty open sets of the topological space in question.
The above means that the following two things cannot hold simultaneously: on one hand, that the set is split into two non-empty open pieces whose union recovers the original set and, on the other hand, that those two pieces are disjoint (have no elements in common).
This cannot happen simultaneously because, if the set is split into two open pieces, it is possible to make them disjoint, but only at the cost of omitting some point of the original set; conversely, it is possible to recover the original set without omitting any point, but only at the cost of letting the pieces share at least one point, in which case their intersection is not empty.
The above can also be formally expressed by saying that a connected set is a subset \(C\subseteq X\) of a topological space \((X, \tau)\) that cannot be described as a disjoint union of two non-empty open sets of the topology, where \(\tau\) is the collection of open sets of the topological space; this subset, as the definition itself reveals, may coincide with the complete topological space. In other words, it is formed by a single piece and is not divisible, because all its parts are connected in some way, so analyzing parts in isolation will affect the global properties of the set.
By contrast, the situation is different in a non-connected or disconnected set. A disconnected set can be separated into two or more disjoint non-empty open subsets, and its maximal connected pieces are called connected components.
This means that in disconnected sets, it is possible to analyze a part (i.e., a connected component) of the set independently, and in many cases, this will not affect the global properties of the set. However, there are important nuances.
First, local properties can be completely analyzed in a connected component without affecting the others. Second, some global properties will be maintained when analyzing a single component, while others will require considering all components; for example, the compactness of a component does not imply the compactness of the total set.
Some topological (i.e., essential) properties that can be analyzed locally (by components) are the connectivity of each component, the compactness of each component, or other more general local topological properties (for example, being locally Euclidean). In contrast, global properties that will require considering all components, even when the set is disconnected, include the total number of connected components, the compactness of the total set (which requires all components to be compact and that there be a finite number of them), different metric or measure properties (to see if they are generalizable to the entire space), among others. Thus, although it is possible to analyze components separately, it is crucial to remember that some global properties can only be understood by considering all components together.
The main difference between connected and non-connected sets in terms of local properties is that in a connected set, local properties can “propagate” or have global implications in ways that do not occur in non-connected sets. For example, consider a continuous function \(f\) on a closed interval \([a,b]\), which is connected. It is possible to analyze local properties such as differentiability at each point; however, the Intermediate Value Theorem, which is a global property, is derived from local continuity together with the connectedness of the interval. Consequently, what is not possible in a connected set is to completely isolate a part of the set from the rest in topological terms, since there will always be some form of “connection” between the parts.
Once the topology (complete or approximated by optimization) is revealed, its connectivity can be determined by constructing a directed graph or an undirected graph.
A graph is a set of objects called vertices or nodes connected by links called edges or arcs, which allow representing binary relations between elements of a set (Trudeau, 1993, pp. 19-20).
An undirected graph is justified when it is valid to assume that the connection relationship between the topology elements is symmetric. In other words, if element A is connected to element B, then B is also connected to A.
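In this symmetric case, connectivity can be decided with a breadth-first search over the adjacency matrix. The following base-R sketch is illustrative (`is_adj_connected` is not a topologyR function):

```r
# Sketch: decide connectivity of a symmetric (undirected) adjacency matrix
# with a breadth-first search in base R.
is_adj_connected <- function(adj) {
  n <- nrow(adj)
  visited <- rep(FALSE, n)
  queue <- 1L                          # start the search at vertex 1
  visited[1] <- TRUE
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    neighbors <- which(adj[v, ] != 0 & !visited)
    visited[neighbors] <- TRUE
    queue <- c(queue, neighbors)
  }
  all(visited)                         # connected iff every vertex was reached
}

path_graph <- matrix(c(0, 1, 0,
                       1, 0, 1,
                       0, 1, 0), nrow = 3, byrow = TRUE)
is_adj_connected(path_graph)           # TRUE
is_adj_connected(diag(0, 3))           # FALSE: three isolated vertices
```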
Imagine we have a set \(X\) representing the Cartesian plane. Within this plane, we draw a circle with its corresponding circumference. The circumference is the line that delimits the circle, while the circle is the area enclosed by the circumference.
Now, let’s introduce a topology \(\tau\) in this set \(X\). The topology \(\tau\) is a collection of subsets of \(X\) that meets certain conditions; the Cartesian plane would be the underlying space from which the topology is constructed, while the plane plus the topology would be the topological space. In our example, we can take as open sets the interior of the circle (excluding the circumference), the area outside the circle (excluding the circumference), and their union (the plane minus the circumference). These open sets satisfy the mentioned conditions: the set \(X\) (the entire plane) and the empty set belong to \(\tau\), and arbitrary unions and finite intersections of open sets are also open.
On the other hand, the closed sets in this topology would be the complements of the open sets: the circumference together with the closed exterior, the closed disk (circle plus circumference), the circumference itself, the empty set, and the complete set \(X\). These closed sets satisfy the dual conditions: arbitrary intersections and finite unions of closed sets are also closed.
Now, imagine a point \(x\) inside the circle. A neighborhood of \(x\) would be any open set that contains \(x\), that is, any area within the circle that includes \(x\). A point \(y\) on the circumference would be an adherence point of the circle, since any neighborhood of \(y\) (any open area containing \(y\)) will always include at least one point of the circle. The adherence of the circle would be the circle along with its circumference, as it includes all adherence points.
A base \(\zeta\) of the topology \(\tau\) would be a collection of open sets such that any open set in \(\tau\) can be expressed as a union of elements of \(\zeta\). In our example, a base could be the collection of all open disks within the circle.
Finally, a sub-base \(S\) of the topology \(\tau\) would be a collection of sets such that all finite intersections of elements of \(S\) form a base of \(\tau\). In our example, a sub-base could be the collection of all open regions that include the circle’s center and extend to the circumference, along with all open regions outside the circle that extend to the circumference.
Any region of the plane (that is, any set of points in space) with the following property is a connected set: if it is separated into two subsets, either the subsets can be made disjoint only at the cost of omitting some point of the original set, or they can recover the original set only at the cost of sharing at least one point, so that their intersection is not empty. Such a set is a single piece, an indivisible totality.
ALGORITHM
Input: A number of vertices of a graph G, Adjacency matrix of V(G).
Output: A topology of G (TG).
1. Insert the number of vertices of G.
2. for i \in n
Enter the name of the vertices of a graph G.
end
3. for i \in n
for j \in n
Enter the adjacency matrix of V(G).
end
end
4. for i \in n
for j \in n
Calculate the degree of V(G).
end
end
5. for i \in n
for j \in n
if (degree != 0)
class(i, j) = x(j).
R = (degree(x(i))x(t), degree(x(j))x(f)).
end
end
end
6. for i \in n
for j \in n
if (class(i, j) != 0)
subbase = class(j).
end
end
end
7. for vi \in n
for vj \in n
base = subbase + intersection(vi, vj).
end
end
8. for vi \in n
for vj \in n
if (union(vi, vj) != \emptyset)
vi, vj \in union.
union(vi, vj) \in union.
end
topology = base + union(vi, vj).
end
end
Any code (in R or another language) constructed to operationalize the above algorithm in its simplest form must contain the following steps:
Generate the relevant numerical data with the given parameters.
Create a completely connected graph using the data indices as vertices.
Assign numerical values as attributes to the graph vertices.
Define the relationship R according to Definition 2.1 of Nada et al. (2018, p. 2), considering the numerical values.
Obtain the subbase by taking the post-classes (images of the relationship) of each vertex.
Generate the base by including the total space X, the empty set, the vertices, and their neighborhoods.
Construct the topology by taking unions of base elements, according to the laws of topological spaces.
Print the resulting R relationship, subbase, base, and topology.
NOTE: The R relationship is defined considering the numerical values associated with the vertices. The subbase, base, and topology are obtained by following the mathematical laws of topological spaces.
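The steps above can be sketched in base R as follows. Since Definition 2.1 of Nada et al. (2018) is not reproduced in this text, the relation R below (vertex \(i\) relates to vertex \(j\) when its value is less than or equal to the value at \(j\)) is only a stand-in, and all names are illustrative:

```r
# Base-R sketch of the eight steps; the relation R is a placeholder for
# Definition 2.1 of Nada et al. (2018), and all names are illustrative.
set.seed(1)
values <- round(runif(4, 0, 10), 1)       # step 1: numerical data
vertices <- seq_len(length(values))       # steps 2-3: indices as vertices

R <- outer(values, values, `<=`)          # step 4: relation R (placeholder)

# step 5: subbase = post-classes (the image of each vertex under R)
subbase <- lapply(vertices, function(i) vertices[R[i, ]])

add_unique <- function(fam, S) {
  if (any(vapply(fam, function(A) setequal(A, S), logical(1)))) fam else c(fam, list(S))
}

# step 6: base = X, the empty set, the post-classes and their intersections
base <- list(vertices, integer(0))
for (A in subbase) base <- add_unique(base, A)
for (A in subbase) for (B in subbase) base <- add_unique(base, intersect(A, B))

# step 7: topology = base closed under (one pass of) pairwise unions
topology <- base
for (A in base) for (B in base) topology <- add_unique(topology, union(A, B))

# step 8: print the results
str(subbase)
length(topology)
```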
Obviously, the above code could be modified to consider neighborhoods of vertices and allow open sets containing multiple vertices, which will generate a richer topology, as constructed by Nada et al. (2018, p. 9).
However, although this treatment of neighborhoods and multiple vertices is correct, the resulting functions produce topologies that grow exponentially as the data set grows even slightly, which makes this approach alone (without any optimization) unfeasible for sufficiently large sets. For example, for a set of 129 observations, the calculation becomes unfeasible, both computationally and in terms of the time the researcher needs to extract valuable conclusions about the discovered topology[^2].
This problem arises from the way the n-dimensional open balls[^3] (that is, the neighborhoods of each vertex) are constructed from the adjacency matrix.
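The blow-up is easy to quantify: a base of \(k\) sets admits up to \(2^k\) distinct unions, so the number of candidate open sets can grow exponentially with the size of the data:

```r
# The number of candidate open sets doubles with every base element: a base
# of k sets yields up to 2^k unions, which is why 129 observations already
# make the exhaustive construction intractable.
base_sizes <- c(10, 20, 30, 129)
data.frame(base_size = base_sizes, max_open_sets = 2^base_sizes)
```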
To optimize this process and make it viable for large sets, the following modifications are made:
Define a metric and a threshold for neighborhood creation. Thus, instead of connecting each vertex with all others, we can define a maximum distance to consider two points as neighbors.
Limit the size of the base. Specifically, a maximum number of elements in the base will be established, selecting the most representative ones.
Use a topology approximation. This will be achieved by using approximation methods that capture the essential structure of the topology without considering all possible unions (i.e., without generating all possible sets).
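The first two modifications can be sketched for a one-dimensional variable as follows (`neighborhoods` and `max_base_size` are illustrative names, not topologyR objects):

```r
# Sketch: a distance threshold defines the neighborhoods, and the base is
# truncated to the most representative (largest) sets. Illustrative code.
data <- c(1.0, 1.1, 1.2, 5.0, 5.1)
threshold <- IQR(data) / 4                 # any of the heuristics below works

# connect i and j only when |x_i - x_j| <= threshold
neighborhoods <- lapply(seq_along(data), function(i)
  which(abs(data - data[i]) <= threshold))

max_base_size <- 10
base <- unique(neighborhoods)              # drop duplicated neighborhoods
base <- base[order(lengths(base), decreasing = TRUE)]
base <- base[seq_len(min(max_base_size, length(base)))]
base                                       # here: {1, 2, 3} and {4, 5}
```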
These criteria are mathematically valid and can be considered a robust form of approximating the original topology. However, it is important to note that this approximation may not capture all topological properties of the original space. Therefore, it might be recommended to use the syntax of the second code block (the immediately preceding block) for small sets (30 or fewer observations).
Since the definitions of distance and threshold are crucial to optimizing the topology search, the methodological and mathematical robustness of the optimization rests on the robustness of those two definitions.
Since we are working with a single real number variable, the distance to be defined is a one-dimensional distance.
Although defining the distance poses no complication, it is different when defining the radius of open balls (i.e., the threshold). For this, there are different possible heuristics, each with its advantages and disadvantages:
a. Average Distance between Adjacent Points
threshold <- mean(abs(diff(sort(data))))
This has the advantage of capturing the average scale of local variations in the data, but it is sensitive to outliers and may not be suitable when the mean is not the data's natural location parameter (as it is for the Normal distribution).
b. Median of Adjacent Points Distances
threshold <- median(abs(diff(sort(data))))
The advantage of this estimation is that it is more robust to outliers than the average, but it has the disadvantage that it might not capture the structure well if there are many repeated values[^4].
c. Standard Deviation of the Data
threshold <- sd(data)
This heuristic has the advantage of capturing the global dispersion of the data; however, it can be too large if there are extreme outliers and, consequently, make the computation infeasible both in machine time and in researcher time (which is precisely what we want to avoid).
d. Interquartile Range Divided by a Factor
threshold <- IQR(data) / factor
This heuristic has the advantage of being very robust to outliers, but could be too conservative if the data is very grouped[^5].
The division factor, for example, 4, is not a fixed rule. It is commonly used in statistics to identify outliers (1.5 * IQR is a common rule), but for our purpose of defining a threshold, we could adjust it. The general logic is that dividing by a larger number gives a smaller threshold, resulting in a finer topology.
In reality, the choice of factor depends on the level of detail desired in the topology.
A smaller factor (for example, dividing by 2 instead of 4) yields a larger threshold, producing larger neighborhoods and a coarser topology. A larger factor (for example, dividing by 8) yields a smaller threshold, producing finer, smaller neighborhoods.
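The trade-off can be checked directly with plain base R:

```r
# The effect of the division factor on the threshold, computed directly:
# a larger factor gives a smaller threshold.
data <- c(1, 2, 4, 7, 11, 16)
for (f in c(2, 4, 8)) {
  cat("factor", f, "-> threshold", IQR(data) / f, "\n")
}
```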
The choice of factor should be based on experiments with the specific data and the level of detail needed for the analysis. There is no universal rule for this factor, as there is not for any heuristic. For this, the “analyze_topology_factors” function of this library should be used.
# IF LOCAL INSTALLATION IS DESIRED, THE FOLLOWING STEPS MUST BE FOLLOWED:
# Step 1:
setwd("C:/Users/ROG/Documents/topologyR")
# Step 2:
library(devtools)
## Loading required package: usethis
## ℹ Updating topologyR documentation
## ℹ Setting RoxygenNote to "7.3.3"
## ℹ Loading topologyR
## ── R CMD build ─────────────────────────────────────────────────────────────────
## ✔ checking for file 'C:\Users\ROG\Documents\topologyR/DESCRIPTION'
## ─ preparing 'topologyR': (1s)
## checking DESCRIPTION meta-information ... ✔ checking DESCRIPTION meta-information
## ─ checking for LF line-endings in source and make files and shell scripts
## ─ checking for empty or unneeded directories
## ─ building 'topologyR_0.1.0.tar.gz'
##
## Running "C:/PROGRA~1/R/R-44~1.2/bin/x64/Rcmd.exe" INSTALL \
## "C:\Users\ROG\AppData\Local\Temp\Rtmpc36nDk/topologyR_0.1.0.tar.gz" \
## --install-tests
## * installing to library 'C:/Users/ROG/AppData/Local/R/win-library/4.4'
## * installing *source* package 'topologyR' ...
## ** using staged installation
## ** R
## ** byte-compile and prepare package for lazy loading
## ** help
## *** installing help indices
## ** building package indices
## ** testing if installed package can be loaded from temporary location
## ** testing if installed package can be loaded from final location
## ** testing if installed package keeps a record of temporary installation path
## * DONE (topologyR)
# ONCE THE ABOVE STEPS ARE COMPLETED, FROM HERE ON YOU CAN PROCEED TO USE THE LIBRARY WITH LOCAL INSTALLATION
topology <- list(
c(1, 2, 3),
c(3, 4, 5)
)
# For the specific case of this function, check the structure of test_topology
str(topology)
## List of 2
##  $ : num [1:3] 1 2 3
##  $ : num [1:3] 3 4 5
# If it's a list, try converting to a numeric vector
# For example:
test_topology_numeric <- unlist(topology)
# Then try the function
analyze_topology_factors(test_topology_numeric)
## factor threshold base_size max_set_size min_set_size
## 1 1 1.50000 12 6 0
## 2 2 0.75000 7 6 0
## 3 4 0.37500 7 6 0
## 4 8 0.18750 7 6 0
## 5 16 0.09375 7 6 0
The obtained graph will reveal what happens to the base size (the number of sets in the base) as the IQR factor increases. It is expected that the size will decrease, since a larger IQR factor implies a lower threshold for considering two elements as connected. Thus, with a lower threshold, fewer elements will be joined into the same neighborhood, and the resulting base contains fewer distinct sets.
Regarding the maximum and minimum set sizes, it is natural that they do not change. The maximum size corresponds to the complete set of original elements, which will always be present in the topology, regardless of the IQR factor. The minimum size corresponds to the empty set, which is also part of the topology by definition. These extreme sets are not affected by the IQR factor.
Therefore, the expected effect of the IQR factor on the topology is that a larger IQR factor will generate a simpler topology, while the extreme sets (complete and empty) remain unchanged.
e. Density-Based Method (similar to that used in DBSCAN[^6])
k <- ceiling(log(length(data)))
sorted_distances <- sort(dist(matrix(data, ncol = 1)))
# guard against indexing past the end of the distance vector for small samples
threshold <- sorted_distances[min(k * length(data), length(sorted_distances))]
This method has the advantage of adapting the threshold to the local density of the data, but its disadvantage is that it can be computationally more expensive for large data sets (which we are trying to avoid).
Without preliminary tests, among the recommended options, it would seem more appropriate to use either the median of distances between adjacent points or the density-based method. The reason is that these methods are robust to outliers and adapt well to the local structure of the data.
However, the most recommendable approach is to perform the pertinent tests to determine which of the heuristics in question is appropriate for each specific data set that the researcher has. Ultimately, the choice of heuristic will depend on the specific global topology properties to be studied. In this case, the desired property is connectivity.
Purpose: Verifies if a topology is completely connected using an undirected graph approach. This function is particularly useful for analyzing the interconnection of data sets.
Functionality:
Usage Example:
## [1] TRUE
Special Cases:
Purpose: Similar to is_topology_connected, but uses a directed graph approach, allowing for more specific analyses of directional relationships.
Functionality:
Usage Example:
## [1] TRUE
Key Difference from is_topology_connected:
Purpose: Provides a manual and direct method for verifying topology connectivity.
Functionality:
Usage Example:
## [1] TRUE
Special Use Case: Ideal for topologies with specific completeness requirements.
Purpose: Analyzes how different Interquartile Range (IQR) factors affect topology characteristics.
Functionality:
Usage Example:
## factor threshold base_size max_set_size min_set_size
## 1 1 1.03663910 685 100 0
## 2 2 0.51831955 476 100 0
## 3 4 0.25915978 398 100 0
## 4 8 0.12957989 221 100 0
## 5 16 0.06478994 138 100 0
Key Features:
Purpose: Calculates multiple thresholds for topological analysis using different statistical methods.
Calculation Methods:
Usage Example:
## $mean_diff
## [1] 1
##
## $median_diff
## [1] 1
##
## $sd
## [1] 1.581139
##
## $iqr
## [1] 0.5
##
## $dbscan
## [1] 4
Utility: Provides multiple perspectives for defining thresholds in topological analysis.
Purpose: Visualizes and compares calculated thresholds and their effects on topological structure.
Functionality:
Usage Example:
## method threshold base_size
## mean_diff mean_diff 0.04333029 95
## median_diff median_diff 0.02726101 85
## sd sd 0.94541300 868
## iqr iqr 0.38496153 501
## dbscan dbscan 0.18391863 311
Key Features:
The topologyR package provides specialized tools for analyzing economic time series data through topological methods. This section demonstrates practical applications using real GDP data.
When analyzing economic time series data, particularly GDP changes, the choice of distribution is crucial for topology construction on large datasets when computational resources are restricted.
However, before determining the optimal distribution for our data, we must determine the optimal method for fitting the empirical data to theoretical distributions or, more precisely, to empirical distributions that follow a theoretical distribution exactly.
MLE (Maximum Likelihood Estimation)
When to Use:
Large samples with correct parametric model specification.
Prioritize statistical efficiency (minimum asymptotic variance).
Data is well-behaved (no heavy tails/outliers; e.g., normal distribution).
Known distributional form (parametric assumptions hold).
Avoid: Small samples, misspecified models, or heavy-tailed distributions.
Example: Fitting a normal distribution to clean, symmetric data.
MME (Moment Matching Estimation)
When to Use:
Small samples or computational simplicity required.
Distributional form uncertain, but moments are calculable/stable.
Real-time applications prioritizing speed over precision.
Avoid for distributions with undefined moments (e.g., Cauchy).
Example: Quick parameter estimation for gamma/beta distributions.
QME (Quantile Matching Estimation)
When to Use:
Focus on tail behavior/quantiles (e.g., VaR in finance).
Robustness to outliers/censored data (quantiles are less sensitive).
Asymmetric/heavy-tailed distributions (e.g., Pareto for extreme losses).
Example: Modeling insurance claim extremes using the 95th percentile.
MGE (Maximum Goodness-of-Fit Estimation)
When to Use:
Validating distributional hypotheses (e.g., KS/AD tests).
Comparing multiple distributions globally.
Non-standard distributions where MLE/MME fail.
Example: Testing if data follows logistic vs. Gumbel distributions.
MSE (Maximum Spacing Estimation)
When to Use:
Data with gaps/rounding (e.g., irregular measurements).
Heavy-tailed distributions or small continuous samples.
Avoids overreliance on extreme values (spacing-based robustness).
Example: Fitting a Weibull distribution to irregular failure-time data.
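As a concrete example of two of these methods, the sketch below fits a normal distribution by MLE with `MASS::fitdistr` and compares it with moment matching done by hand; for the normal distribution the two coincide, since the MLE of the standard deviation also uses the \(n\) denominator:

```r
# Sketch: MLE via MASS::fitdistr versus hand-rolled moment matching (MME)
# for a normal sample. MASS ships with standard R installations.
library(MASS)

set.seed(42)
x <- rnorm(200, mean = 1, sd = 2)

fit_mle <- fitdistr(x, "normal")            # MLE
mme_mean <- mean(x)                         # MME: match the first moment
mme_sd   <- sqrt(mean((x - mme_mean)^2))    # MME: match the second moment

fit_mle$estimate
c(mean = mme_mean, sd = mme_sd)
```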
estimation_methods <- data.frame(
Method = c("MLE", "MME", "QME", "MGE", "MSE"),
When_To_Use = c(
"Large samples, correct model, efficiency priority",
"Small samples, speed priority, moments exist",
"Tail/quantile focus, robustness to outliers",
"Distribution validation/comparison, non-standard cases",
"Gapped data, heavy tails, small samples"
),
Key_Advantages = c(
"Asymptotic efficiency, parametric precision",
"Computational simplicity, no likelihood needed",
"Robust quantile alignment, tail accuracy",
"Global fit assessment, hypothesis testing",
"Robust to gaps/outliers, spacing consistency"
),
R_Functions = c(
"`fitdistr` (MASS), `mle` (stats4)",
"`optim` + manual moment equations",
"`qme` (custom), `quantreg::rq`",
"`goftest`, `ks.test`, `ad.test`",
"`mps` (POT), `MPS.est` packages"
)
)
colnames(estimation_methods) <- gsub("_", " ", colnames(estimation_methods))
kable(
  estimation_methods,
  format = "pipe",
  caption = "**ESTIMATION METHOD BENCHMARK**",
  align = c("c", "c", "c", "c")  # center all columns
)

| Method | When To Use | Key Advantages | R Functions |
|---|---|---|---|
| MLE | Large samples, correct model, efficiency priority | Asymptotic efficiency, parametric precision | fitdistr (MASS), mle (stats4) |
| MME | Small samples, speed priority, moments exist | Computational simplicity, no likelihood needed | optim + manual moment equations |
| QME | Tail/quantile focus, robustness to outliers | Robust quantile alignment, tail accuracy | qme (custom), quantreg::rq |
| MGE | Distribution validation/comparison, non-standard cases | Global fit assessment, hypothesis testing | goftest, ks.test, ad.test |
| MSE | Gapped data, heavy tails, small samples | Robust to gaps/outliers, spacing consistency | mps (POT), MPS.est packages |
We provide users of this library with a custom function to fit multiple types of distributions simultaneously. To illustrate it, we supply data on US quarterly GDP growth from the first quarter of 1992 to the first quarter of 2024, after applying a Yeo-Johnson transformation with hyperparameter \(\lambda = 0.5\).
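The Yeo-Johnson transformation mentioned above can be written directly in base R. A minimal sketch; the helper `yeo_johnson` is hypothetical, not part of the library, with \(\lambda = 0.5\) as in the text:

```r
# Yeo-Johnson transformation (hypothetical helper; lambda = 0.5 as in the text)
yeo_johnson <- function(y, lambda = 0.5) {
  sapply(y, function(v) {
    if (v >= 0) {
      # Non-negative branch: power transform of (v + 1)
      if (lambda != 0) ((v + 1)^lambda - 1) / lambda else log1p(v)
    } else {
      # Negative branch: mirrored transform with exponent 2 - lambda
      if (lambda != 2) -(((1 - v)^(2 - lambda)) - 1) / (2 - lambda) else -log1p(-v)
    }
  })
}

# Example: quarterly growth rates in decimal form
yeo_johnson(c(-0.02, 0, 0.01, 0.076))
```

The transform is monotone and maps zero to zero, so the sign of each growth observation is preserved.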
X <- c(
0.0119368777374702, 0.0108137656182881, 0.00985681082011425, 0.0104015519293652,
0.00166870385685947, 0.00581325152667178, 0.00476582173579576, 0.0135555616868386,
0.00968017355996231, 0.0135061956696334, 0.005837281536067, 0.0114230783204214,
0.00354465884841204, 0.00298097844188216, 0.00848928301845797, 0.00677931023817369,
0.00747652539201571, 0.0166104234581352, 0.00895027315262364, 0.0103568837398003,
0.00644401865589073, 0.0165834473187565, 0.0124469682453747, 0.00852045048090089,
0.0100124377724629, 0.00923557603383074, 0.0125500242230006, 0.0160277775864102,
0.00937233981161389, 0.00833005255610297, 0.0132151400185725, 0.016339951496275,
0.00362461554054549, 0.0181307192548257, 0.00101904038917144, 0.00596151508447429,
-0.00328068485517437, 0.00623528031983334, -0.00401041013477036, 0.00274591498772025,
0.00834618529774378, 0.00611774330421611, 0.00406097711621545, 0.00123571825010149,
0.0052594844558147, 0.0088366782792475, 0.0165634133346764, 0.011573712295923,
0.00565849535757224, 0.00773454420647957, 0.00946351049229044, 0.0101762111814976,
0.0110630024939544, 0.00492014803582652, 0.00782280094633858, 0.00554750629348089,
0.0134088506808547, 0.00258582837290655, 0.00149903822110309, 0.00857611257328283,
0.00300614077940331, 0.00610827225252253, 0.00575202854191303, 0.00627256373604412,
-0.00427214986914152, 0.00594546286782194, -0.00525949142529332, -0.0220096627946815,
-0.0113808376042674, -0.00178729765814796, 0.00350882204196923, 0.0107800476432023,
0.0048391456673027, 0.00965260679551294, 0.0076959929232312, 0.00524422452727702,
-0.00237340604041976, 0.00675439453860394, -0.000223112442939784, 0.0111991447889985,
0.00836839250173416, 0.00445833082157154, 0.001439781757123, 0.00115606587792128,
0.00984138677657853, 0.0026746116131795, 0.0084958551114811, 0.0086984840936184,
-0.00345397563998976, 0.0128769460650098, 0.0121171934059903, 0.00505092204661972,
0.00898461915466164, 0.00618453787282514, 0.00399850299345283, 0.00184394996213433,
0.00578682815497622, 0.0032086261795099, 0.00708136357248845, 0.00553534000276334,
0.00486298783732364, 0.00559088549983189, 0.00786971688902183, 0.01123892165998,
0.00811762603688138, 0.00530127412316794, 0.00622730516758763, 0.00141589880764181,
0.00542364601597312, 0.00827956221239301, 0.0112899343456179, 0.00640295055604412,
-0.0136749297466007, -0.0804470145766672, 0.0761427696572317, 0.0103255457761064,
0.0128141493938281, 0.01514019363418, 0.00812718720702543, 0.0168963285206307,
-0.0049829870079526, -0.0014131988129115, 0.00657539105810834, 0.00634413797832822,
0.00555608248684969, 0.00510468554636834, 0.011903277993254, 0.00836650041769005,
0.00394730469640825
) # Real GDP Growth
library(fitdistrplus)
## Loading required package: MASS
## Loading required package: survival
# Example of distribution fitting
x <- X # Vector of GDP changes
distributions <- c("norm", "cauchy", "logis", "unif", "t")
fits <- list()
set.seed(100000)
for(dist in distributions) {
if(dist == "t") {
fits[[dist]] <- fitdist(x, dist, start=list(df=length(x)-1), method="mge")
} else {
fits[[dist]] <- fitdist(x, dist, method="mge")
}
}
## Warning in fitdist(x, dist, method = "mge"): maximum GOF estimation has a
## default 'gof' argument set to 'CvM'
## Warning in fitdist(x, dist, start = list(df = length(x) - 1), method = "mge"):
## maximum GOF estimation has a default 'gof' argument set to 'CvM'
The printed parameter estimates and fit statistics follow, one block per distribution in the order of the distributions vector (normal, Cauchy, logistic, uniform, Student's t):
## mean sd
## 0.006940808 0.004746252
## [1] -381.9854
## location scale
## 0.006903160 0.002882685
## [1] -922.4761
## location scale
## 0.006937262 0.002859244
## [1] -861.6542
## min max
## 0.0000480909 0.0138884505
## [1] Inf
## df
## 128
## [1] 242.4721
For datasets with 30 or fewer observations:
## [1] 100
## [1] 10
## [1] 57
## [1] 190
# Check connectivity using different methods
is_connected_undirected <- is_topology_connected(topology$topology)
is_connected_directed <- is_topology_connected2(topology$topology)
is_connected_manual <- is_topology_connected_manual(topology$topology)
is_connected_undirected
## [1] TRUE
## [1] TRUE
## [1] TRUE
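The package's connectivity checkers are used above as black boxes. Conceptually, they test whether the base sets overlap into a single component. A minimal base-R sketch, under the assumption that a topology is represented as a list of index sets; this is an illustrative reimplementation, not the library's actual code:

```r
# Illustrative connectivity check: two sets are linked when they intersect;
# the topology is "connected" if the intersection graph of its nonempty
# sets has a single component.
is_connected_sketch <- function(sets) {
  sets <- Filter(function(s) length(s) > 0, sets)  # drop empty sets
  n <- length(sets)
  if (n <= 1) return(TRUE)
  visited <- rep(FALSE, n)
  queue <- 1L
  visited[1] <- TRUE
  # Breadth-first search over the intersection graph
  while (length(queue) > 0) {
    i <- queue[1]
    queue <- queue[-1]
    for (j in which(!visited)) {
      if (length(intersect(sets[[i]], sets[[j]])) > 0) {
        visited[j] <- TRUE
        queue <- c(queue, j)
      }
    }
  }
  all(visited)
}

is_connected_sketch(list(1:3, 3:5, 5:7))  # overlapping chain
is_connected_sketch(list(1:2, 4:5))       # two disjoint components
```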
For larger datasets (\(n > 30\)):
## $mean_diff
## [1] 0.001223358
##
## $median_diff
## [1] 0.0001172427
##
## $sd
## [1] 0.01148499
##
## $iqr
## [1] 0.001503484
##
## $dbscan
## [1] 0.0006469434
# Analyze topology characteristics with different IQR factors
results <- analyze_topology_factors(X, factors = c(1, 2, 4, 8, 16))
## factor threshold base_size max_set_size min_set_size
## 1 1 0.0060139348 957 129 0
## 2 2 0.0030069674 1217 129 0
## 3 4 0.0015034837 836 129 0
## 4 8 0.0007517418 382 129 0
## 5 16 0.0003758709 255 129 0
## method threshold base_size
## mean_diff mean_diff 0.0012233577 554
## median_diff median_diff 0.0001172427 119
## sd sd 0.0114849870 480
## iqr iqr 0.0015034837 836
## dbscan dbscan 0.0006469434 424
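The thresholds reported above come from different summaries of the data's spacing and spread. A hedged sketch of plausible definitions; the library's exact formulas are not shown in this section, the DBSCAN-based estimate is omitted, and `iqr` is assumed to be IQR(x)/4 as used elsewhere in the document:

```r
# Candidate thresholds (illustrative definitions; the library's own may differ)
threshold_candidates <- function(x) {
  d <- diff(sort(x))          # gaps between consecutive ordered observations
  list(
    mean_diff   = mean(d),    # average gap
    median_diff = median(d),  # typical gap, robust to outliers
    sd          = sd(x),      # global spread
    iqr         = IQR(x) / 4  # conservative interquartile-based threshold
  )
}

threshold_candidates(1:10)
```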
Dataset Size Recommendations
\(n \leq 30\): Use
complete_topology().
\(30 < n \leq 100\): Use threshold-based methods with default parameters.
\(n > 100\): Use optimized thresholds:
threshold <- IQR(x)/4 # Default conservative threshold
results <- analyze_topology_factors(x, factors = c(2, 4))
1. Connectivity Analysis
Connected topology: Stable economic period.
Disconnected topology: Potential structural breaks.
2. Base Size Interpretation
relative_size < 0.3: Simple structure. Based on Kolmogorov complexity theory, a topology with fewer sets than 30% of the possible combinations indicates high data compressibility. This corresponds to strong regularity in the underlying economic dynamics and aligns with the minimum description length principle, suggesting a parsimonious representation.
0.3 \(\leq\) relative_size \(\leq\) 0.7: Moderate complexity. Derived from entropy considerations in dynamical systems, this range reflects a balance between order and disorder typical of stable economic systems. It corresponds to the phase-transition region in complex networks and is supported by empirical studies showing that most stable economic periods exhibit topological bases in this range.
relative_size > 0.7: Complex structure. Drawing on random graph theory, a high relative size indicates a near-random structure. This suggests potential market inefficiencies or structural instabilities, corresponds to high algorithmic complexity in the sense of Chaitin, and is typically observed during periods of economic turbulence or systemic change.
The above considerations must not be confused with complexity in the sense of complex systems/chaos theory.
Topological complexity measures the richness of the neighborhood structure and connectedness patterns. A high relative_size indicates many distinct open sets and intricate local relationships. Chaos-theory complexity measures the system’s sensitivity to initial conditions and long-term predictability. A chaotic system can have simple topology (few open sets) but complex dynamics.
For example, the logistic map can exhibit chaos with a relatively simple topological structure (relative_size < 0.3), while a stable periodic system might require complex topology (relative_size > 0.7) to capture its neighborhood relationships.
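The logistic-map example can be made concrete: a one-line recurrence whose trajectories from nearly identical starting points diverge, illustrating chaotic dynamics despite the map's simple form. A minimal sketch (r = 4 lies in the chaotic regime):

```r
# Logistic map: x_{t+1} = r * x_t * (1 - x_t)
logistic_map <- function(x0, r = 4, n = 60) {
  x <- numeric(n)
  x[1] <- x0
  for (t in 2:n) x[t] <- r * x[t - 1] * (1 - x[t - 1])
  x
}

# Two trajectories starting 1e-7 apart separate to order one:
# sensitivity to initial conditions, the hallmark of chaos
a <- logistic_map(0.2)
b <- logistic_map(0.2 + 1e-7)
max(abs(a - b))
```

The map's state space is a single interval with a simple neighborhood structure, yet long-term prediction is impossible, which is exactly the distinction from topological complexity drawn above.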
# Real GDP Growth quarterly data from 1960-04-01 to 2024-01-01 (FRED Data)
x <- c(
2.2, -0.5, 0.5, -1.3, 0.7, 1.7, 1.9, 2.0, 1.8, 0.9, 1.2, 0.3, 1.1, 1.1, 2.2,
0.7, 2.1, 1.1, 1.6, 0.3, 2.4, 1.3, 2.2, 2.3, 2.4, 0.3, 0.8, 0.8, 0.9, 0.1, 0.9,
0.8, 2.0, 1.7, 0.8, 0.4, 1.6, 0.3, 0.7, -0.5, -0.1, 0.1, 0.9, -1.1, 2.7, 0.5, 0.8,
0.2, 1.8, 2.3, 0.9, 1.7, 2.5, 1.1, -0.5, 0.9, -0.9, 0.2, -0.9, -0.4, -1.2, 0.7,
1.7, 1.3, 2.2, 0.7, 0.5, 0.7, 1.2, 1.9, 1.8, 0.0, 0.3, 3.9, 1.0, 1.3, 0.2, 0.1,
0.7, 0.3, 0.3, -2.1, -0.1, 1.9, 2.0, -0.7, 1.2, -1.1, -1.6, 0.5, -0.4, 0.0, 1.3,
2.3, 2.0, 2.1, 2.0, 1.7, 1.0, 0.8, 1.0, 0.9, 1.5, 0.7, 0.9, 0.5, 1.0, 0.5, 0.7,
1.1, 0.9, 1.7, 0.5, 1.3, 0.6, 1.3, 1.0, 0.8, 0.7, 0.2, 1.1, 0.4, 0.1, -0.9, -0.5,
0.8, 0.5, 0.3, 1.2, 1.1, 1.0, 1.0, 0.2, 0.6, 0.5, 1.4, 1.0, 1.4, 0.6, 1.1, 0.4,
0.3, 0.9, 0.7, 0.7, 1.7, 0.9, 1.0, 0.6, 1.7, 1.2, 0.9, 1.0, 0.9, 1.3, 1.6, 0.9,
0.8, 1.3, 1.6, 0.4, 1.8, 0.1, 0.6, -0.3, 0.6, -0.4, 0.3, 0.8, 0.6, 0.4, 0.1, 0.5,
0.9, 1.7, 1.2, 0.6, 0.8, 0.9, 1.0, 1.1, 0.5, 0.8, 0.6, 1.3, 0.3, 0.1, 0.9, 0.3,
0.6, 0.6, 0.6, -0.4, 0.6, -0.5, -2.2, -1.1, -0.2, 0.4, 1.1, 0.5, 1.0, 0.8, 0.5,
-0.2, 0.7, 0.0, 1.1, 0.8, 0.4, 0.1, 0.1, 1.0, 0.3, 0.9, 0.9, -0.3, 1.3, 1.2, 0.5,
0.9, 0.6, 0.4, 0.2, 0.6, 0.3, 0.7, 0.6, 0.5, 0.6, 0.8, 1.1, 0.8, 0.5, 0.6, 0.1,
0.5, 0.8, 1.1, 0.6, -1.4, -7.9, 7.8, 1.0, 1.3, 1.5, 0.8, 1.7, -0.5, -0.1, 0.7,
0.6, 0.6, 0.5, 1.2, 0.8, 0.4
)
# Compare multiple time periods
periods <- list(
pre_crisis = x[1:39],
various_crisis = x[40:125],
post_crisis = x[126:192],
all_period = x
)
# Data quality check
for(name in names(periods)) {
missing <- sum(is.na(periods[[name]]))
if(missing > 0) {
warning(sprintf("Period %s has %d missing values", name, missing))
}
}
results <- lapply(periods, function(period_data) {
if(length(period_data) <= 30) {
topology <- complete_topology(period_data)
} else {
# Use threshold-based approach
threshold <- IQR(period_data)/4
topology <- calculate_topology(period_data, threshold)
}
list(
connectivity = is_topology_connected(topology),
complexity = length(topology)/length(period_data)
)
})
results
## $pre_crisis
## $pre_crisis$connectivity
## [1] TRUE
##
## $pre_crisis$complexity
## [1] 0.02564103
##
##
## $various_crisis
## $various_crisis$connectivity
## [1] TRUE
##
## $various_crisis$complexity
## [1] 0.01162791
##
##
## $post_crisis
## $post_crisis$connectivity
## [1] TRUE
##
## $post_crisis$complexity
## [1] 0.01492537
##
##
## $all_period
## $all_period$connectivity
## [1] TRUE
##
## $all_period$complexity
## [1] 0.003891051
# Create a toy numeric vector with some NA values
dummy_var <- c(1.5, 2.3, NA, 4.7, 5.1, NA, 3.2, 6.4, NA, 7.8)
prepare_data <- function(x) {
# Remove missing values
x <- na.omit(x)
# Check minimum sample size before transforming
if(length(x) < 10) warning("Sample size may be too small")
# Standardize if the spread is large (note: scale() returns a matrix)
if(sd(x) > 1) x <- scale(x)
x
}
prepared_data <- prepare_data(dummy_var)
## Warning in prepare_data(dummy_var): Sample size may be too small
## [,1]
## [1,] -1.3011743
## [2,] -0.9457316
## [3,] 0.1205966
## [4,] 0.2983180
## [5,] -0.5458585
## [6,] 0.8759125
## [7,] 1.4979372
## attr(,"scaled:center")
## [1] 4.428571
## attr(,"scaled:scale")
## [1] 2.250714
safe_topology_analysis <- function(x) {
tryCatch({
threshold <- IQR(x)/4
topology <- calculate_topology(x, threshold)
list(
topology = topology,
connectivity = is_topology_connected(topology),
base_size = length(topology)
)
}, error = function(e) {
message("Error in topology analysis: ", e$message)
NULL
})
}
result <- safe_topology_analysis(X3)
print(str(result))
## List of 3
## $ topology : int 8
## $ connectivity: logi TRUE
## $ base_size : int 1
## NULL
# For n > 100
analyze_large_dataset <- function(x, sample_size = 100) {
if(length(x) > sample_size) {
warning("Using sampling for large dataset")
x <- sample(x, sample_size) # call set.seed() beforehand for reproducible subsampling
}
analyze_topology_factors(x)
}
result <- analyze_large_dataset(x)
## Warning in analyze_large_dataset(x): Using sampling for large dataset
## factor threshold base_size max_set_size min_set_size
## 1 1 0.6000 90 100 0
## 2 2 0.3000 69 100 0
## 3 4 0.1500 63 100 0
## 4 8 0.0750 33 100 0
## 5 16 0.0375 33 100 0
For large datasets, recommended approaches include:
is_topology_connected_manual() for an initial quick evaluation.
is_topology_connected2() for more rigorous analyses when necessary.