Choosing Metrics
Selecting the right association metric is crucial for meaningful results. This guide helps you choose metrics based on your research goals and data characteristics.
Quick Selection Guide
using TextAssociations, DataFrames
# Quick reference for metric selection
metric_guide = DataFrame(
Goal = [
"Find rare but meaningful associations",
"Validate known collocations",
"Compare across different corpora",
"Identify fixed expressions",
"Statistical significance testing",
"Symmetric word similarity"
],
RecommendedMetrics = [
"PMI, PPMI",
"LLR, Chi-square",
"LogDice, PPMI",
"Dice, MI",
"LLR, Chi-square, T-score",
"Dice, Jaccard, Cosine"
],
Reason = [
"High PMI for rare co-occurrences",
"Statistical tests for reliability",
"Stable across corpus sizes",
"High scores for fixed phrases",
"P-values for hypothesis testing",
"Symmetric similarity measures"
]
)
println("Metric Selection Guide:")
for row in eachrow(metric_guide)
println("\n$(row.Goal):")
println(" Use: $(row.RecommendedMetrics)")
println(" Why: $(row.Reason)")
end
Metric Selection Guide:
Find rare but meaningful associations:
Use: PMI, PPMI
Why: High PMI for rare co-occurrences
Validate known collocations:
Use: LLR, Chi-square
Why: Statistical tests for reliability
Compare across different corpora:
Use: LogDice, PPMI
Why: Stable across corpus sizes
Identify fixed expressions:
Use: Dice, MI
Why: High scores for fixed phrases
Statistical significance testing:
Use: LLR, Chi-square, T-score
Why: P-values for hypothesis testing
Symmetric word similarity:
Use: Dice, Jaccard, Cosine
Why: Symmetric similarity measures
Metric Properties Comparison
Information-Theoretic Metrics
using TextAssociations, DataFrames
text = """
Quantum computing revolutionizes computational power.
Classical computing cannot match quantum supremacy.
Quantum algorithms solve complex problems efficiently.
"""
ct = ContingencyTable(text, "quantum"; windowsize=3, minfreq=1)
results = assoc_score([PMI, PMI², PMI³, PPMI], ct)
println("PMI Family Comparison:")
for row in eachrow(results)
println("\n$(row.Collocate):")
println(" PMI: $(round(row.PMI, digits=2)) - Standard measure")
println(" PMI²: $(round(row.PMI², digits=2)) - Emphasizes frequency")
println(" PMI³: $(round(row.PMI³, digits=2)) - Strong frequency bias")
println(" PPMI: $(round(row.PPMI, digits=2)) - No negative values")
end
PMI Family Comparison:
algorithms:
PMI: -0.69 - Standard measure
PMI²: -0.69 - Emphasizes frequency
PMI³: -0.69 - Strong frequency bias
PPMI: 0.0 - No negative values
cannot:
PMI: -1.1 - Standard measure
PMI²: -1.1 - Emphasizes frequency
PMI³: -1.1 - Strong frequency bias
PPMI: 0.0 - No negative values
complex:
PMI: -1.1 - Standard measure
PMI²: -1.1 - Emphasizes frequency
PMI³: -1.1 - Strong frequency bias
PPMI: 0.0 - No negative values
computational:
PMI: -1.1 - Standard measure
PMI²: -1.1 - Emphasizes frequency
PMI³: -1.1 - Strong frequency bias
PPMI: 0.0 - No negative values
computing:
PMI: -1.1 - Standard measure
PMI²: -0.41 - Emphasizes frequency
PMI³: 0.29 - Strong frequency bias
PPMI: 0.0 - No negative values
match:
PMI: -0.69 - Standard measure
PMI²: -0.69 - Emphasizes frequency
PMI³: -0.69 - Strong frequency bias
PPMI: 0.0 - No negative values
quantum:
PMI: -1.1 - Standard measure
PMI²: -0.41 - Emphasizes frequency
PMI³: 0.29 - Strong frequency bias
PPMI: 0.0 - No negative values
revolutionizes:
PMI: -1.1 - Standard measure
PMI²: -1.1 - Emphasizes frequency
PMI³: -1.1 - Strong frequency bias
PPMI: 0.0 - No negative values
solve:
PMI: -1.1 - Standard measure
PMI²: -1.1 - Emphasizes frequency
PMI³: -1.1 - Strong frequency bias
PPMI: 0.0 - No negative values
supremacy:
PMI: -0.69 - Standard measure
PMI²: -0.69 - Emphasizes frequency
PMI³: -0.69 - Strong frequency bias
PPMI: 0.0 - No negative values
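The pattern above is easy to decode: every collocate whose joint count is 1 receives identical PMI, PMI², and PMI³ scores, while "quantum" and "computing" (apparently joint count 2) gain log(2) per step up the family. The sketch below captures that relationship as observed in these numbers; it is grounded in the output, not in TextAssociations' internal code, and assumes natural logarithms.
# Each step up the PMI family adds another log(joint count) term,
# which is why frequent pairs climb while count-1 pairs stay put.
pmi_k(pmi_score, n_xy, k) = pmi_score + (k - 1) * log(n_xy)
pmi_k(-1.1, 2, 2)   # ≈ -0.41, the PMI² of "computing" above
pmi_k(-1.1, 2, 3)   # ≈  0.29, its PMI³
# PPMI simply clamps negatives, explaining the uniform 0.0 column:
ppmi(pmi_score) = max(pmi_score, 0.0)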
Statistical Significance Metrics
using TextAssociations, DataFrames
# Text with clear patterns
text = """
Statistical analysis requires statistical methods and statistical tools.
Random words appear randomly without random patterns.
Data analysis needs careful analysis of data patterns.
"""
ct = ContingencyTable(text, "statistical"; windowsize=4, minfreq=1)
results = assoc_score([LLR, ChiSquare, Tscore, Zscore], ct)
println("Statistical Tests Comparison:")
for row in eachrow(results)
llr_sig = row.LLR > 10.83 ? "p<0.001" : row.LLR > 6.63 ? "p<0.01" : row.LLR > 3.84 ? "p<0.05" : "n.s."
chi_sig = row.ChiSquare > 10.83 ? "p<0.001" : row.ChiSquare > 6.63 ? "p<0.01" : row.ChiSquare > 3.84 ? "p<0.05" : "n.s."
println("\n$(row.Collocate):")
println(" LLR: $(round(row.LLR, digits=2)) ($llr_sig)")
println(" χ²: $(round(row.ChiSquare, digits=2)) ($chi_sig)")
println(" t: $(round(row.Tscore, digits=2))")
println(" z: $(round(row.Zscore, digits=2))")
end
Statistical Tests Comparison:
analysis:
LLR: 0.58 (n.s.)
χ²: 0.64 (n.s.)
t: 0.45
z: 0.8
and:
LLR: 1.38 (n.s.)
χ²: 1.66 (n.s.)
t: 0.64
z: 1.29
appear:
LLR: 3.06 (n.s.)
χ²: 3.27 (n.s.)
t: 0.75
z: 1.81
methods:
LLR: 6.5 (p<0.05)
χ²: 10.0 (p<0.01)
t: 0.9
z: 3.16
random:
LLR: 3.06 (n.s.)
χ²: 3.27 (n.s.)
t: 0.75
z: 1.81
requires:
LLR: 6.5 (p<0.05)
χ²: 10.0 (p<0.01)
t: 0.9
z: 3.16
statistical:
LLR: 13.5 (p<0.001)
χ²: 12.0 (p<0.001)
t: 1.3
z: 3.46
tools:
LLR: 3.93 (p<0.05)
χ²: 4.95 (p<0.05)
t: 0.82
z: 2.22
words:
LLR: 3.06 (n.s.)
χ²: 3.27 (n.s.)
t: 0.75
z: 1.81
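The cutoffs used above (3.84, 6.63, 10.83) are the critical values of the χ² distribution with one degree of freedom, which both LLR and the chi-square statistic follow asymptotically. With Distributions.jl available (an extra dependency, not something TextAssociations requires), you can derive them directly:
using Distributions
# Chi-square critical values, df = 1, for the usual significance levels.
for p in (0.05, 0.01, 0.001)
    println("p < $p  =>  score > $(round(quantile(Chisq(1), 1 - p), digits=2))")
end
# p < 0.05   =>  score > 3.84
# p < 0.01   =>  score > 6.63
# p < 0.001  =>  score > 10.83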
Similarity Metrics
using TextAssociations, DataFrames
text = """
Machine learning and deep learning share similar foundations.
Neural networks enable deep learning applications.
Learning algorithms power machine learning systems.
"""
ct = ContingencyTable(text, "learning"; windowsize=3, minfreq=1)
results = assoc_score([Dice, LogDice, JaccardIdx, CosineSim], ct)
println("Similarity Metrics Comparison:")
for row in eachrow(results)
println("\n$(row.Collocate):")
println(" Dice: $(round(row.Dice, digits=3)) ∈ [0,1]")
println(" LogDice: $(round(row.LogDice, digits=2)) ∈ [0,14]")
println(" Jaccard: $(round(row.JaccardIdx, digits=3)) ∈ [0,1]")
println(" Cosine: $(round(row.CosineSim, digits=3)) ∈ [0,1]")
end
Similarity Metrics Comparison:
algorithms:
Dice: 0.5 ∈ [0,1]
LogDice: 13.0 ∈ [0,14]
Jaccard: 0.333 ∈ [0,1]
Cosine: 0.577 ∈ [0,1]
and:
Dice: 0.4 ∈ [0,1]
LogDice: 12.68 ∈ [0,14]
Jaccard: 0.25 ∈ [0,1]
Cosine: 0.5 ∈ [0,1]
applications:
Dice: 0.4 ∈ [0,1]
LogDice: 12.68 ∈ [0,14]
Jaccard: 0.25 ∈ [0,1]
Cosine: 0.5 ∈ [0,1]
deep:
Dice: 0.8 ∈ [0,1]
LogDice: 13.68 ∈ [0,14]
Jaccard: 0.667 ∈ [0,1]
Cosine: 0.816 ∈ [0,1]
enable:
Dice: 0.333 ∈ [0,1]
LogDice: 12.42 ∈ [0,14]
Jaccard: 0.2 ∈ [0,1]
Cosine: 0.447 ∈ [0,1]
foundations:
Dice: 0.333 ∈ [0,1]
LogDice: 12.42 ∈ [0,14]
Jaccard: 0.2 ∈ [0,1]
Cosine: 0.447 ∈ [0,1]
learning:
Dice: 0.889 ∈ [0,1]
LogDice: 13.83 ∈ [0,14]
Jaccard: 0.8 ∈ [0,1]
Cosine: 0.894 ∈ [0,1]
machine:
Dice: 0.667 ∈ [0,1]
LogDice: 13.42 ∈ [0,14]
Jaccard: 0.5 ∈ [0,1]
Cosine: 0.707 ∈ [0,1]
networks:
Dice: 0.333 ∈ [0,1]
LogDice: 12.42 ∈ [0,14]
Jaccard: 0.2 ∈ [0,1]
Cosine: 0.447 ∈ [0,1]
power:
Dice: 0.4 ∈ [0,1]
LogDice: 12.68 ∈ [0,14]
Jaccard: 0.25 ∈ [0,1]
Cosine: 0.5 ∈ [0,1]
share:
Dice: 0.333 ∈ [0,1]
LogDice: 12.42 ∈ [0,14]
Jaccard: 0.2 ∈ [0,1]
Cosine: 0.447 ∈ [0,1]
similar:
Dice: 0.333 ∈ [0,1]
LogDice: 12.42 ∈ [0,14]
Jaccard: 0.2 ∈ [0,1]
Cosine: 0.447 ∈ [0,1]
systems:
Dice: 0.333 ∈ [0,1]
LogDice: 12.42 ∈ [0,14]
Jaccard: 0.2 ∈ [0,1]
Cosine: 0.447 ∈ [0,1]
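Note that Dice and Jaccard are monotonic transforms of each other (J = D / (2 - D)), so they always rank collocates identically and differ only in scale. A quick check against the values above:
# Jaccard from Dice: the two measures always agree on ranking.
dice_to_jaccard(d) = d / (2 - d)
round(dice_to_jaccard(0.5), digits=3)  # 0.333, matches "algorithms"
round(dice_to_jaccard(0.8), digits=3)  # 0.667, matches "deep"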
Metric Behavior Analysis
Frequency Sensitivity
using TextAssociations, DataFrames
# Create texts with different frequency patterns
high_freq = "the the the word the the the"
low_freq = "rare unique word special unusual"
function analyze_frequency_sensitivity(text::String, node::String)
ct = ContingencyTable(text, node; windowsize=2, minfreq=1)
metrics = [PMI, LogDice, LLR, Dice]
results = assoc_score(metrics, ct)
return results
end
println("High frequency context:")
high_results = analyze_frequency_sensitivity(high_freq, "the")
for row in eachrow(high_results)
println(" $(row.Collocate): PMI=$(round(row.PMI, digits=2)), LogDice=$(round(row.LogDice, digits=2))")
end
println("\nLow frequency context:")
low_results = analyze_frequency_sensitivity(low_freq, "word")
for row in eachrow(low_results)
println(" $(row.Collocate): PMI=$(round(row.PMI, digits=2)), LogDice=$(round(row.LogDice, digits=2))")
end
High frequency context:
the: PMI=-1.79, LogDice=14.0
word: PMI=-1.1, LogDice=13.0
Low frequency context:
rare: PMI=0.0, LogDice=14.0
special: PMI=0.0, LogDice=14.0
unique: PMI=0.0, LogDice=14.0
unusual: PMI=0.0, LogDice=14.0
In the high-frequency text PMI goes negative (the node saturates its contexts, so co-occurrence carries little information), while in the low-frequency text it sits at its independence baseline of 0; LogDice stays near its maximum of 14 in both. PMI is frequency-sensitive; LogDice is not.
Corpus Size Stability
using TextAssociations, DataFrames
# Simulate different corpus sizes
small_corpus = "machine learning uses algorithms"
medium_corpus = repeat("machine learning uses algorithms and data ", 10)
large_corpus = repeat("machine learning uses algorithms and data for predictions ", 100)
function compare_corpus_sizes(node::String)
sizes = [
("Small", small_corpus),
("Medium", medium_corpus),
("Large", large_corpus)
]
println("\nAnalyzing '$node' across corpus sizes:")
for (size_name, corpus) in sizes
ct = ContingencyTable(corpus, node; windowsize=3, minfreq=1)
results = assoc_score([PMI, LogDice, LLR], ct)
if nrow(results) > 0
row = first(results) # Look at first collocate
println(" $size_name: PMI=$(round(row.PMI, digits=2)), " *
"LogDice=$(round(row.LogDice, digits=2)), " *
"LLR=$(round(row.LLR, digits=2))")
end
end
end
compare_corpus_sizes("learning")
Analyzing 'learning' across corpus sizes:
Small: PMI=0.0, LogDice=14.0, LLR=0.0
Medium: PMI=-2.4, LogDice=13.93, LLR=70.18
Large: PMI=-4.62, LogDice=13.99, LLR=266.04
LogDice barely moves as the corpus grows, while PMI drifts and LLR climbs with the sheer amount of evidence. This stability is exactly why LogDice is recommended for cross-corpus comparison.
Decision Trees
For Research Goals
# Decision tree for metric selection
function recommend_metrics(goal::Symbol)
recommendations = Dict(
:discovery => ["Use PMI/PPMI for finding new, surprising associations",
"High PMI (>5) indicates strong association",
"PPMI removes negative associations"],
:validation => ["Use LLR for statistical significance",
"LLR > 10.83 means p < 0.001",
"Combine with effect size (PMI) for importance"],
:comparison => ["Use LogDice for cross-corpus stability",
"LogDice range [0,14] is interpretable",
"Less affected by corpus size than PMI"],
:similarity => ["Use Dice/Jaccard for word similarity",
"Both are symmetric measures",
"Dice gives more weight to co-occurrences"]
)
return get(recommendations, goal, ["Unknown goal"])
end
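Querying the tree for a discovery-oriented study, for example:
for tip in recommend_metrics(:discovery)
    println(" • $tip")
end
#  • Use PMI/PPMI for finding new, surprising associations
#  • High PMI (>5) indicates strong association
#  • PPMI removes negative associations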
For Data Characteristics
using TextAssociations, DataFrames
function recommend_by_data(corpus_size::Symbol, frequency::Symbol, goal::Symbol)
# Rule-based recommendations
recommendations = String[]
# Corpus size considerations
if corpus_size == :small
push!(recommendations, "LogDice (stable for small corpora)")
push!(recommendations, "Dice (less affected by sparse data)")
elseif corpus_size == :large
push!(recommendations, "LLR (better with more data)")
push!(recommendations, "PMI (meaningful with sufficient data)")
end
# Frequency considerations
if frequency == :rare
push!(recommendations, "PMI/PPMI (highlights rare associations)")
elseif frequency == :common
push!(recommendations, "LogDice (handles high frequency well)")
push!(recommendations, "LLR (good for common words)")
end
# Goal considerations
if goal == :exploratory
push!(recommendations, "Multiple metrics for validation")
elseif goal == :confirmatory
push!(recommendations, "LLR with significance threshold")
end
return unique(recommendations)
end
# Example recommendation
recs = recommend_by_data(:small, :rare, :exploratory)
println("Recommendations for small corpus with rare words (exploratory):")
for rec in recs
println(" • $rec")
end
Recommendations for small corpus with rare words (exploratory):
• LogDice (stable for small corpora)
• Dice (less affected by sparse data)
• PMI/PPMI (highlights rare associations)
• Multiple metrics for validation
Metric Interpretation Guide
Score Ranges and Thresholds
using TextAssociations, DataFrames
# Interpretation thresholds
thresholds = DataFrame(
Metric = ["PMI", "LogDice", "LLR", "Dice", "Jaccard"],
WeakAssociation = ["< 2", "< 5", "< 3.84", "< 0.1", "< 0.05"],
ModerateAssociation = ["2-5", "5-8", "3.84-10.83", "0.1-0.3", "0.05-0.2"],
StrongAssociation = ["> 5", "> 8", "> 10.83", "> 0.3", "> 0.2"],
Interpretation = [
"Higher = stronger",
"Max 14, stable",
"Statistical significance",
"0-1 scale",
"0-1 scale, stricter"
]
)
println("Metric Interpretation Thresholds:")
for row in eachrow(thresholds)
println("\n$(row.Metric):")
println(" Weak: $(row.WeakAssociation)")
println(" Moderate: $(row.ModerateAssociation)")
println(" Strong: $(row.StrongAssociation)")
println(" Note: $(row.Interpretation)")
end
Metric Interpretation Thresholds:
PMI:
Weak: < 2
Moderate: 2-5
Strong: > 5
Note: Higher = stronger
LogDice:
Weak: < 5
Moderate: 5-8
Strong: > 8
Note: Max 14, stable
LLR:
Weak: < 3.84
Moderate: 3.84-10.83
Strong: > 10.83
Note: Statistical significance
Dice:
Weak: < 0.1
Moderate: 0.1-0.3
Strong: > 0.3
Note: 0-1 scale
Jaccard:
Weak: < 0.05
Moderate: 0.05-0.2
Strong: > 0.2
Note: 0-1 scale, stricter
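These bands are straightforward to operationalize. A minimal sketch (the helper and its Dict are illustrative names; the cutoffs simply restate the table above):
# Illustrative: map a raw score to the weak/moderate/strong bands above.
# Each entry is (upper bound of "weak", lower bound of "strong").
const STRENGTH_BANDS = Dict(
    :PMI        => (2.0, 5.0),
    :LogDice    => (5.0, 8.0),
    :LLR        => (3.84, 10.83),
    :Dice       => (0.1, 0.3),
    :JaccardIdx => (0.05, 0.2)
)
function strength(metric::Symbol, score::Real)
    weak_hi, strong_lo = STRENGTH_BANDS[metric]
    return score > strong_lo ? "strong" : score > weak_hi ? "moderate" : "weak"
end
strength(:LLR, 12.0)  # "strong" (p < 0.001)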
Practical Examples
using TextAssociations, DataFrames
# Different types of word relationships
texts = Dict(
"Fixed expression" => "by and large the results were positive",
"Technical term" => "machine learning algorithm performs classification",
"Semantic relation" => "doctor treats patient in hospital",
"Syntactic relation" => "very important extremely significant quite notable"
)
function analyze_relationship_type(text::String, node::String, collocate::String)
ct = ContingencyTable(text, node; windowsize=3, minfreq=1)
results = assoc_score([PMI, LogDice, Dice, LLR], ct)
# Find specific collocate
row = filter(r -> String(r.Collocate) == collocate, results)
if !isempty(row)
r = first(row)
println("$node + $collocate:")
println(" PMI: $(round(r.PMI, digits=2))")
println(" LogDice: $(round(r.LogDice, digits=2))")
println(" Dice: $(round(r.Dice, digits=3))")
println(" LLR: $(round(r.LLR, digits=2))")
end
end
# Analyze different relationship types
println("Fixed Expression:")
analyze_relationship_type(texts["Fixed expression"], "by", "and")
println("\nTechnical Term:")
analyze_relationship_type(texts["Technical term"], "machine", "learning")
println("\nSemantic Relation:")
analyze_relationship_type(texts["Semantic relation"], "doctor", "patient")
Fixed Expression:
by + and:
PMI: 0.0
LogDice: 14.0
Dice: 1.0
LLR: 4.5
Technical Term:
machine + learning:
PMI: 0.0
LogDice: 14.0
Dice: 1.0
LLR: 2.77
Semantic Relation:
doctor + patient:
PMI: 0.0
LogDice: 14.0
Dice: 1.0
LLR: 2.77
In texts this short every pair co-occurs perfectly, so PMI, LogDice, and Dice saturate at identical values; the small LLR differences reflect text length rather than association strength. Realistic corpora are needed before the metrics can separate relationship types.
Advanced Metric Selection
Combining Multiple Metrics
using TextAssociations, DataFrames, Statistics
function combined_score_analysis(ct::ContingencyTable)
# Calculate multiple metrics
results = assoc_score([PMI, LogDice, LLR, Dice], ct)
# Normalize scores (0-1 range)
for col in [:PMI, :LogDice, :LLR, :Dice]
if hasproperty(results, col)
values = results[!, col]
min_val, max_val = extrema(values)
if max_val > min_val
results[!, Symbol(col, :_norm)] = (values .- min_val) ./ (max_val - min_val)
else
results[!, Symbol(col, :_norm)] = zeros(length(values))
end
end
end
# Combined score (weighted average; the weights below are illustrative, not canonical)
results.CombinedScore = (
0.3 * results.PMI_norm +
0.3 * results.LogDice_norm +
0.2 * results.LLR_norm +
0.2 * results.Dice_norm
)
# Rank by combined score
sort!(results, :CombinedScore, rev=true)
return results
end
text = "Data science requires data analysis and data visualization"
ct = ContingencyTable(text, "data"; windowsize=3, minfreq=1)
combined = combined_score_analysis(ct)
println("Combined Metric Analysis:")
for row in eachrow(first(combined, 3))
println("$(row.Collocate): Combined=$(round(row.CombinedScore, digits=3))")
end
Combined Metric Analysis:
data: Combined=0.5
analysis: Combined=0.491
and: Combined=0.491
Metric Stability Analysis
using TextAssociations, DataFrames, Statistics
function metric_stability_test(base_text::String, node::String, iterations::Int=10)
metric_scores = Dict{Symbol,Vector{Float64}}()
for i in 1:iterations
# Add noise to simulate variation
noisy_text = base_text * " " * join(rand(split(base_text), 5), " ")
ct = ContingencyTable(noisy_text, node; windowsize=3, minfreq=1)
results = assoc_score([PMI, LogDice, LLR], ct)
if nrow(results) > 0
for metric in [:PMI, :LogDice, :LLR]
push!(get!(metric_scores, metric, Float64[]), results[1, metric])
end
end
end
# Calculate stability (lower std = more stable)
println("Metric Stability Analysis ($iterations iterations):")
for (metric, scores) in metric_scores
stability = abs(std(scores) / mean(scores)) # Coefficient of variation (absolute value, since PMI can be negative)
println(" $metric: CV = $(round(stability, digits=3)) ($(stability < 0.1 ? "stable" : "unstable"))")
end
end
base = "artificial intelligence and machine learning are related fields"
metric_stability_test(base, "intelligence", 10)
Metric Stability Analysis (10 iterations):
PMI: CV = 0.517 (unstable)
LogDice: CV = 0.029 (stable)
LLR: CV = 1.749 (unstable)
Best Practices
1. Use Multiple Metrics
# Always compare multiple perspectives
const COMPREHENSIVE_METRICS = [
PMI, # Informativeness
LogDice, # Stability
LLR, # Significance
Dice # Similarity
]
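The list can be passed to assoc_score exactly like any metric vector:
using TextAssociations
# One call scores every collocate from all four perspectives.
ct = ContingencyTable("data drives data science and data tools", "data";
                      windowsize=3, minfreq=1)
results = assoc_score(COMPREHENSIVE_METRICS, ct)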
2. Set Appropriate Thresholds
# Domain-specific thresholds
const THRESHOLDS = Dict(
:academic => (pmi=3.0, logdice=7.0, llr=10.83),
:social_media => (pmi=2.0, logdice=5.0, llr=6.63),
:technical => (pmi=4.0, logdice=8.0, llr=15.13)
)
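One way to apply these thresholds is a filter over the result DataFrame; the sketch below assumes it carries the PMI, LogDice, and LLR columns shown in earlier examples:
using DataFrames
# Keep only collocates that clear every threshold for the chosen domain.
function filter_by_domain(results::DataFrame, domain::Symbol)
    t = THRESHOLDS[domain]
    return filter(r -> r.PMI >= t.pmi && r.LogDice >= t.logdice && r.LLR >= t.llr,
                  results)
end
# strong_collocates = filter_by_domain(results, :academic)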
3. Validate with Domain Knowledge
Always verify that high-scoring collocations make sense in your domain:
function validate_results(results::DataFrame, known_good::Vector{String})
found = intersect(String.(results.Collocate), known_good)
coverage = length(found) / length(known_good)
println("Validation: Found $(length(found))/$(length(known_good)) known collocations")
println("Coverage: $(round(coverage * 100, digits=1))%")
return coverage > 0.7 # 70% coverage threshold
end
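Usage is a one-liner; the gold list below is hypothetical and should come from a domain lexicon or expert annotation:
# Hypothetical gold-standard collocates for the node under study.
known_collocations = ["learning", "network", "algorithm"]
# validate_results(results, known_collocations)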
Next Steps
- Apply metrics in Working with Corpora
- See Advanced Features for specialized analyses
- Review API Reference for all available metrics