Dan’s Data Notes — Topological Data Analysis Intro
A great way to simplify data analytics is to describe it as finding patterns or shapes in the data. An analytics model is usually an equation that assumes the data is “shaped” in a particular way, so that predicting or classifying a point within this shape is possible with said equation. One of the biggest challenges of data analysis is making sure you are comparing apples to apples. If you apply assumptions built on a sub-segment of your data across the entire dataset, you may end up with less than desirable results. In previous posts, I have briefly talked about techniques that allow for identifying anomalies in the data. Topological data analysis (TDA) is gaining popularity in this space. But what is it?
What is TDA?
In mathematics, according to Encyclopedia Britannica, topology is described as “a branch sometimes referred to as ‘rubber-sheet geometry,’ in which two objects are considered equivalent if they can be continuously deformed into one another through such motions in space as bending, twisting, stretching, and shrinking while disallowing tearing apart or gluing together parts.” That is both very abstract and a mouthful. For someone unfamiliar with the space, like me, this raised more questions than answers. I remember from 9th-grade earth and space science class that topographic maps visualize physical landscapes by marking out the sections or areas of the map that are at the same elevation.
If you think about it, data analysis needs a similar representation. First, we need to understand where any point in our dataset belongs in the context of the entire data universe, so that we can then analyze that point against similar points.
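To make the map analogy a little more concrete, here is a minimal sketch of my own (not taken from any TDA library). It assumes each data point has a single numeric "elevation" value and groups points into fixed-width bands, so that a point can be compared with the others in its band. The function name `contour_bands`, the sample values, and the band width of 50 are all made up for illustration.

```python
from collections import defaultdict

def contour_bands(values, band_width):
    """Group point indices by the 'elevation' band their value falls in.

    Band k holds values in the half-open range
    [k * band_width, (k + 1) * band_width).
    """
    bands = defaultdict(list)
    for idx, value in enumerate(values):
        bands[int(value // band_width)].append(idx)
    return dict(bands)

# Hypothetical elevation-like measurements for seven points.
elevations = [12.0, 48.0, 51.0, 55.0, 103.0, 108.0, 58.0]
bands = contour_bands(elevations, band_width=50)

# Peers of point 6 are the other points that share its band.
band_of_point_6 = int(elevations[6] // 50)
peers_of_point_6 = [i for i in bands[band_of_point_6] if i != 6]
print(bands)
print(peers_of_point_6)
```

One thing this toy version gets wrong on purpose: the values 48.0 and 51.0 are close together but land in different bands because the cut-offs are hard. That is a limitation of simple binning and one reason a single fixed slicing of the data can mislead.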
Why use TDA?
TDA is handy when trying to extract meaning from unusual data patterns. Typically this can mean:
- A large number of patterns to look for in the data
- A large number of dependencies of said patterns
- A large dataset in general
This technique can help identify “local” patterns in the data. As mentioned before, when we analyze data mathematically we can derive shapes from datasets and, taking it further, assume that the shape or pattern has meaning. When we talk about patterns we can refer to them as features or, more generically, columns. Beyond this, there are specific properties of the data that can be analyzed.
What are some positives of TDA?
- There is no preconceived notion of what the patterns should be
- Can help to efficiently perform localization or segmentation of data
- Can be complemented with dimensionality reduction techniques to make the data less complex
- Can extract information from data that is incomplete or noisy
How can TDA be applied?
According to the scikit-tda project, some shapes of data that can be analyzed include:
- non-linearity and linearity
- clusters
- flares
- loops
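As a rough illustration of two of the shapes in the list above (clusters and loops), here is a minimal standard-library sketch. It is not the scikit-tda API and not persistent homology. It connects every pair of points closer than a distance `eps` that I picked by hand, counts connected components as clusters, and uses the cycle rank of the resulting graph (edges - vertices + components) as a crude stand-in for loops. All names and sample points are made up for illustration.

```python
import math
from itertools import combinations

def neighborhood_graph(points, eps):
    """Connect every pair of points whose Euclidean distance is at most eps."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if math.dist(points[i], points[j]) <= eps]

def count_components(n_points, edges):
    """Count connected components with a small union-find."""
    parent = list(range(n_points))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n_points)})

def shape_summary(points, eps):
    """Return (clusters, loops) for the eps-neighborhood graph."""
    edges = neighborhood_graph(points, eps)
    components = count_components(len(points), edges)
    # Cycle rank of a graph: E - V + C. This is only a crude proxy for
    # loops, because a small triangle of mutually close points also counts.
    loops = len(edges) - len(points) + components
    return components, loops

# Hypothetical data: 12 points on a unit circle plus two short strings of points.
circle = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
          for k in range(12)]
clusters = [(10.0, 0.0), (10.5, 0.0), (11.0, 0.0),
            (20.0, 0.0), (20.5, 0.0), (21.0, 0.0)]
dataset = circle + clusters

print(shape_summary(dataset, eps=0.6))
```

With `eps=0.6` this reports three clusters and one loop, and the result depends heavily on that choice: with `eps=0.1` every circle point becomes its own cluster and the loop disappears. A single hand-picked scale is the weak point of this sketch.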
What are the current shortcomings?
- Performance: Some algorithms may require very large-scale computing. Algorithms are more performant on smaller data sets.
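To get a feel for why size matters, here is a back-of-the-envelope sketch. It assumes an approach that starts from all pairwise distances between points and may also consider every triple of points. That assumption is mine, for illustration only, and is not a statement about how any particular library works. The function names are hypothetical.

```python
from math import comb

def pairwise_distance_count(n_points):
    """Number of distinct point pairs, i.e. distances to compute."""
    return comb(n_points, 2)

def candidate_triangle_count(n_points):
    """Number of distinct point triples that could be examined."""
    return comb(n_points, 3)

for n in (100, 1_000, 10_000):
    print(n, pairwise_distance_count(n), candidate_triangle_count(n))
```

Going from 100 to 1,000 points multiplies the number of pairs by roughly 100 and the number of triples by roughly 1,000, which is why sampling or dimensionality reduction before the analysis can be attractive.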
What is next?
I’ll continue to research this topic, including some additional reading on persistent homology, and update this post accordingly. If you are an expert in the space and have comments or feedback, let me know!