| Measure | Type | Description | Typical use |
|---|---|---|---|
| Dot Product | Similarity | Measures angle and magnitude, not just direction. | Neural networks, embedding comparison |
| Euclidean Distance | Distance | Straight-line distance between two vectors. | Clustering, KNN |
| Manhattan Distance (L1) | Distance | Sum of absolute differences. | High-dimensional, sparse data |
| Jaccard Similarity | Similarity | Intersection over union (for sets). | Binary vectors, tag/category overlap |
| Pearson Correlation | Similarity | Measures linear correlation, values from -1 to 1. | Feature correlation, time series |
| Hamming Distance | Distance | Number of bit positions where two vectors differ. | Binary strings, DNA, hashing |
| Mahalanobis Distance | Distance | Takes covariance into account. | Multivariate anomaly detection |
| Bray-Curtis Dissimilarity | Distance | Emphasizes proportional differences. | Ecology, composition vectors |
| Tanimoto Coefficient | Similarity | Generalization of Jaccard for real-valued vectors. | Chemical compound comparison |
| Soft Cosine Similarity | Similarity | Like cosine, but considers similarity between features (e.g., synonyms). | NLP with semantic overlap |
The dot product is a similarity measure that reflects both the angle between two vectors and their magnitudes. Unlike purely directional measures such as cosine similarity, it captures the full geometric relationship: a · b = ‖a‖ ‖b‖ cos θ, the product of the two magnitudes times the cosine of the angle between them. This makes it particularly valuable in neural networks and embedding comparison tasks.
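A minimal NumPy sketch of that identity, using illustrative vectors rather than anything from the text: the algebraic sum-of-products form and the geometric ‖a‖ ‖b‖ cos θ form produce the same value.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Algebraic form: sum of elementwise products
dot = np.dot(a, b)

# Geometric form: |a| * |b| * cos(theta)
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
geometric = np.linalg.norm(a) * np.linalg.norm(b) * cos_theta

print(dot, geometric)  # both approximately 32.0
```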
Two fundamental distance measures are Euclidean and Manhattan distance. Euclidean distance is the straight-line distance between two points and is commonly used in clustering and k-nearest-neighbors algorithms. Manhattan distance sums the absolute differences along each dimension, like walking a city grid rather than cutting diagonally. That per-dimension accumulation makes Manhattan distance particularly effective for high-dimensional, sparse data, where Euclidean distance can be dominated by a few large coordinate differences.
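A short sketch comparing the two, assuming NumPy and illustrative vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean (L2): straight-line distance
euclidean = np.linalg.norm(a - b)      # sqrt(3^2 + 4^2 + 0^2) = 5.0

# Manhattan (L1): sum of absolute per-dimension differences
manhattan = np.sum(np.abs(a - b))      # 3 + 4 + 0 = 7.0

print(euclidean, manhattan)
```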
Jaccard similarity measures the intersection over union of two sets, making it well suited to binary vectors and to analyzing tag or category overlap: the size of the intersection is divided by the size of the union. Pearson correlation measures the linear relationship between two variables, with values ranging from -1 to +1. It is widely used for feature correlation analysis and time series data, helping identify how strongly two variables move together.
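A sketch of both measures; the tag sets and feature vectors are toy values chosen only for illustration, and NumPy's corrcoef is used for Pearson.

```python
import numpy as np

# Jaccard similarity on sets: |intersection| / |union|
tags_a = {"python", "ml", "vectors"}
tags_b = {"python", "vectors", "nlp"}
jaccard = len(tags_a & tags_b) / len(tags_a | tags_b)   # 2 / 4 = 0.5

# Pearson correlation between two feature vectors, in [-1, 1]
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.1])
pearson = np.corrcoef(x, y)[0, 1]                       # close to 1.0

print(jaccard, pearson)
```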
Hamming distance counts the number of positions where two binary vectors differ, making it essential for binary string comparison, DNA sequence analysis, and hash functions. Mahalanobis distance is more sophisticated, taking into account the covariance structure of the data. Unlike Euclidean distance, it considers how variables correlate with each other, making it particularly powerful for multivariate anomaly detection where the shape of normal data distribution matters.
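The sketch below illustrates both; the small synthetic 2-D dataset stands in for "normal" data when estimating the covariance for Mahalanobis, and both the dataset and the query point are assumptions made for the example.

```python
import numpy as np

# Hamming distance: count of positions where two binary vectors differ
a = np.array([1, 0, 1, 1, 0, 1])
b = np.array([1, 1, 1, 0, 0, 1])
hamming = int(np.sum(a != b))            # 2

# Mahalanobis distance: distance from the sample mean, whitened by the
# inverse covariance of a reference dataset (synthetic, correlated 2-D data)
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

point = np.array([3.0, 1.0])
diff = point - data.mean(axis=0)
mahalanobis = float(np.sqrt(diff @ cov_inv @ diff))

print(hamming, mahalanobis)
```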
Advanced similarity measures address specific domain needs. Bray-Curtis dissimilarity emphasizes proportional differences in composition vectors, making it valuable for ecological studies and species distribution analysis. The Tanimoto coefficient generalizes Jaccard similarity for real-valued vectors, particularly useful in chemical compound comparison. Soft cosine similarity extends traditional cosine similarity by considering relationships between features, such as semantic similarity between words in natural language processing applications.
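A rough sketch of all three, implemented directly from their standard definitions; the composition vectors and the feature-similarity matrix S used by soft cosine are hypothetical values chosen only for illustration (with S equal to the identity, soft cosine reduces to ordinary cosine similarity).

```python
import numpy as np

a = np.array([0.2, 0.5, 0.3, 0.0])
b = np.array([0.1, 0.4, 0.4, 0.1])

# Bray-Curtis dissimilarity: sum |a_i - b_i| / sum (a_i + b_i)
bray_curtis = np.sum(np.abs(a - b)) / np.sum(a + b)

# Tanimoto coefficient for real-valued vectors:
# a.b / (|a|^2 + |b|^2 - a.b)
dot = a @ b
tanimoto = dot / (a @ a + b @ b - dot)

# Soft cosine similarity: cosine computed through a feature-similarity
# matrix S, where S[i, j] says how related feature i is to feature j
# (e.g., synonym overlap between word features). S here is hypothetical.
S = np.array([
    [1.0, 0.3, 0.0, 0.0],
    [0.3, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])
soft_cosine = (a @ S @ b) / np.sqrt((a @ S @ a) * (b @ S @ b))

print(bray_curtis, tanimoto, soft_cosine)
```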