Information-Theoretic Metrics
Information-theoretic metrics measure the amount of information shared between words based on their co-occurrence patterns. Let's create a ContingencyTable
that will be reused by the examples throughout this page.
using TextAssociations
text = """
Natural language processing uses computational linguistics.
Computational methods analyze natural language data.
Language models process natural text efficiently.
"""
ct = ContingencyTable(text, "language"; windowsize=3, minfreq=1)
println("Contingency table created:")
println(" node : ", ct.node)
println(" window: 3, minfreq: 1")
Contingency table created:
node : language
window: 3, minfreq: 1
Pointwise Mutual Information (PMI)
Theory
PMI measures how much more likely two words are to co-occur than would be expected by chance:
\[PMI(x,y) = \log_2 \frac{P(x,y)}{P(x)P(y)}\]
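Before turning to the package call, the formula can be checked by hand. The probabilities below are invented for illustration; they are not taken from the corpus above.
# Illustrative joint and marginal probabilities (made-up values)
p_xy = 0.01   # P(x, y): the pair co-occurs in 1% of windows
p_x  = 0.05   # P(x)
p_y  = 0.04   # P(y)
# PMI(x, y) = log2(P(x, y) / (P(x) P(y)))
pmi = log2(p_xy / (p_x * p_y))
println("PMI = ", round(pmi, digits=2))  # the pair co-occurs 5× more often than expected by chance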
Implementation
results = assoc_score(PMI, ct)
println("PMI scores for 'language':")
for row in eachrow(results)
interpretation = if row.PMI > 5
"very strong"
elseif row.PMI > 3
"strong"
elseif row.PMI > 0
"positive"
else
"negative/no association"
end
println(" $(row.Collocate): $(round(row.PMI, digits=2)) ($interpretation)")
end
PMI scores for 'language':
analyze: -1.79 (negative/no association)
computational: -1.1 (negative/no association)
data: -0.69 (negative/no association)
language: -1.1 (negative/no association)
methods: -1.1 (negative/no association)
models: -0.69 (negative/no association)
natural: -1.1 (negative/no association)
process: -1.1 (negative/no association)
processing: -1.1 (negative/no association)
uses: -1.1 (negative/no association)
PMI Variants
using TextAssociations
# Compare PMI variants
variants = assoc_score([PMI, PMI², PMI³, PPMI], ct)
println("\nPMI Variants Comparison:")
println("Standard PMI: Balanced information measure")
println("PMI²: Emphasizes frequency (f²/expected)")
println("PMI³: Strong frequency bias (f³/expected)")
println("PPMI: Positive PMI (negative values → 0)")
for row in eachrow(first(variants, 3))
println("\n$(row.Collocate):")
println(" PMI: $(round(row.PMI, digits=2))")
println(" PMI²: $(round(row.PMI², digits=2))")
println(" PMI³: $(round(row.PMI³, digits=2))")
println(" PPMI: $(round(row.PPMI, digits=2))")
end
PMI Variants Comparison:
Standard PMI: Balanced information measure
PMI²: Emphasizes frequency (f²/expected)
PMI³: Strong frequency bias (f³/expected)
PPMI: Positive PMI (negative values → 0)
analyze:
PMI: -1.79
PMI²: -1.79
PMI³: -1.79
PPMI: 0.0
computational:
PMI: -1.1
PMI²: -1.1
PMI³: -1.1
PPMI: 0.0
data:
PMI: -0.69
PMI²: -0.69
PMI³: -0.69
PPMI: 0.0
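As a rough sketch of how the variants relate, the higher-order variants raise the observed co-occurrence count to a power before comparing it with the expected count, matching the f²/expected description above. The counts below are hypothetical, and the package's exact implementation may differ in details such as the logarithm base.
# Hypothetical observed and expected co-occurrence counts for one word pair
f        = 6.0    # observed co-occurrence count f(x, y)
expected = 1.5    # expected count under independence, f(x)·f(y)/N
pmi  = log2(f   / expected)   # standard PMI on counts
pmi2 = log2(f^2 / expected)   # PMI²: f²/expected, boosts frequent pairs
pmi3 = log2(f^3 / expected)   # PMI³: f³/expected, even stronger frequency bias
ppmi = max(pmi, 0.0)          # PPMI: negative scores clamped to zero
println((PMI=pmi, PMI2=pmi2, PMI3=pmi3, PPMI=ppmi))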
Log-Likelihood Ratio (LLR)
Theory
LLR compares observed frequencies with expected frequencies under independence:
\[LLR = 2 \sum_{ij} O_{ij} \log \frac{O_{ij}}{E_{ij}}\]
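The sum runs over the four cells of the node/collocate contingency table. A minimal hand calculation on a hypothetical 2×2 table (not the one built above) looks like this:
# Hypothetical 2×2 table of observed counts:
#               collocate   other words
#   node            8            42
#   other words    12           938
O = [8.0 42.0; 12.0 938.0]
N = sum(O)
# Expected counts under independence: E_ij = row_total · column_total / N
E = [sum(O[i, :]) * sum(O[:, j]) / N for i in 1:2, j in 1:2]
# LLR = 2 Σ O_ij · log(O_ij / E_ij); zero cells contribute nothing
llr = 2 * sum(o > 0 ? o * log(o / e) : 0.0 for (o, e) in zip(O, E))
println("LLR ≈ ", round(llr, digits=2))  # well above 15.13, so p < 0.0001 by the cutoffs used below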
Implementation
using TextAssociations
# LLR for statistical significance
llr_results = assoc_score([LLR, LLR²], ct)
println("\nLog-Likelihood Ratio:")
for row in eachrow(llr_results)
# Critical values for significance
significance = if row.LLR > 15.13
"p < 0.0001 (****)"
elseif row.LLR > 10.83
"p < 0.001 (***)"
elseif row.LLR > 6.63
"p < 0.01 (**)"
elseif row.LLR > 3.84
"p < 0.05 (*)"
else
"not significant"
end
println(" $(row.Collocate): LLR=$(round(row.LLR, digits=2)) $significance")
println(" LLR²=$(round(row.LLR², digits=2)) (squared variant)")
end
Log-Likelihood Ratio:
analyze: LLR=0.0 not significant
LLR²=0.0 (squared variant)
computational: LLR=1.59 not significant
LLR²=2.52 (squared variant)
data: LLR=2.23 not significant
LLR²=4.98 (squared variant)
language: LLR=3.82 not significant
LLR²=14.59 (squared variant)
methods: LLR=1.59 not significant
LLR²=2.52 (squared variant)
models: LLR=2.23 not significant
LLR²=4.98 (squared variant)
natural: LLR=8.32 p < 0.01 (**)
LLR²=69.19 (squared variant)
process: LLR=1.59 not significant
LLR²=2.52 (squared variant)
processing: LLR=1.59 not significant
LLR²=2.52 (squared variant)
uses: LLR=1.59 not significant
LLR²=2.52 (squared variant)
Mutual Information
Theory
Mutual Information measures the total information shared between two variables:
\[MI(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}\]
While PMI is the pointwise version, MI is the expected value of PMI over all possible outcomes.
Relationship to PMI
using TextAssociations
# PMI is the pointwise (local) version
# MI would be the weighted sum over all contexts
function explain_mi_pmi_relationship()
println("Relationship between MI and PMI:")
println(" PMI: Information for specific word pair")
println(" MI: Average information over all pairs")
println("\nMI = Σ P(x,y) × PMI(x,y)")
println("\nPMI tells us about specific associations")
println("MI tells us about overall dependency")
end
explain_mi_pmi_relationship()
Relationship between MI and PMI:
PMI: Information for specific word pair
MI: Average information over all pairs
MI = Σ P(x,y) × PMI(x,y)
PMI tells us about specific associations
MI tells us about overall dependency
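The identity MI = Σ P(x,y) × PMI(x,y) can be verified numerically on a toy joint distribution; the probabilities below are invented for illustration and are unrelated to the corpus above.
# Toy joint distribution P(X, Y) over two binary variables (rows: x, columns: y)
P  = [0.30 0.10;
      0.20 0.40]
Px = sum(P, dims=2)   # marginal P(x)
Py = sum(P, dims=1)   # marginal P(y)
# Pointwise MI for every cell, then MI as its probability-weighted sum
pmi = [log2(P[i, j] / (Px[i] * Py[j])) for i in 1:2, j in 1:2]
mi  = sum(P .* pmi)
println("MI = ", round(mi, digits=3), " bits")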
Information Gain
Information gain measures how much knowing one word reduces uncertainty about another:
using TextAssociations
# Demonstrate information gain concept
function information_gain_example(ct::ContingencyTable)
results = assoc_score([PMI, PPMI], ct)
# High PMI indicates high information gain
high_info = filter(row -> row.PMI > 3, results)
println("\nHigh Information Gain pairs (PMI > 3):")
for row in eachrow(high_info)
println(" Knowing '$(ct.node)' gives $(round(row.PMI, digits=2)) bits about '$(row.Collocate)'")
end
end
information_gain_example(ct)
High Information Gain pairs (PMI > 3):
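Information gain can equivalently be written as an entropy difference, IG = H(Y) − H(Y|X), which for a word pair equals their mutual information. The sketch below checks this on a made-up joint distribution of two presence/absence indicators, unrelated to the corpus above.
# Toy joint distribution: rows = node present/absent, columns = collocate present/absent
P  = [0.08 0.02;
      0.12 0.78]
Px = vec(sum(P, dims=2))   # P(node present), P(node absent)
Py = vec(sum(P, dims=1))   # P(collocate present), P(collocate absent)
entropy(p) = -sum(x -> x > 0 ? x * log2(x) : 0.0, p)             # entropy in bits
H_Y        = entropy(Py)                                          # uncertainty about the collocate
H_YgivenX  = sum(Px[i] * entropy(P[i, :] ./ Px[i]) for i in 1:2)  # conditional entropy H(Y|X)
println("Information gain = ", round(H_Y - H_YgivenX, digits=3), " bits")  # equals MI(X; Y)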
Positive PMI (PPMI)
Motivation
PPMI addresses the issue that PMI can be negative for words that co-occur less than expected:
using TextAssociations
# Compare PMI and PPMI
both = assoc_score([PMI, PPMI], ct)
println("\nPMI vs PPMI:")
for row in eachrow(both)
if row.PMI < 0
println(" $(row.Collocate): PMI=$(round(row.PMI, digits=2)) → PPMI=0")
println(" (Negative association set to zero)")
else
println(" $(row.Collocate): PMI=$(round(row.PMI, digits=2)) = PPMI=$(round(row.PPMI, digits=2))")
end
end
PMI vs PPMI:
analyze: PMI=-1.79 → PPMI=0
(Negative association set to zero)
computational: PMI=-1.1 → PPMI=0
(Negative association set to zero)
data: PMI=-0.69 → PPMI=0
(Negative association set to zero)
language: PMI=-1.1 → PPMI=0
(Negative association set to zero)
methods: PMI=-1.1 → PPMI=0
(Negative association set to zero)
models: PMI=-0.69 → PPMI=0
(Negative association set to zero)
natural: PMI=-1.1 → PPMI=0
(Negative association set to zero)
process: PMI=-1.1 → PPMI=0
(Negative association set to zero)
processing: PMI=-1.1 → PPMI=0
(Negative association set to zero)
uses: PMI=-1.1 → PPMI=0
(Negative association set to zero)
Normalized Variants
Normalized PMI
Some applications benefit from normalized PMI variants:
# Normalized PMI (NPMI) - scales to [-1, 1]
# NPMI = PMI / -log(P(x,y))
# This makes PMI values more comparable across different frequency ranges
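NPMI is not computed by any of the calls shown on this page, but if you have a PMI score and the joint probability at hand, the normalization is a one-liner. This is a sketch using base-2 logs, consistent with the PMI examples above; the input values are illustrative.
# NPMI = PMI / (-log2 P(x, y)); lies between -1 and +1
npmi(pmi, p_xy) = pmi / -log2(p_xy)
# Illustrative values: a pair with P(x, y) = 0.01 and PMI = 2.32 bits
println("NPMI = ", round(npmi(2.32, 0.01), digits=2))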
Practical Considerations
Frequency Effects
using TextAssociations
# Create examples with different frequencies
high_freq_text = repeat("the word the word the word ", 10)
low_freq_text = "rare unique special extraordinary unusual"
println("Frequency effects on PMI:")
# High frequency
ct_high = ContingencyTable(high_freq_text, "the"; windowsize=2, minfreq=1)
pmi_high = assoc_score(PMI, ct_high)
if !isempty(pmi_high)
println("\nHigh frequency word 'the':")
println(" Max PMI: $(round(maximum(pmi_high.PMI), digits=2))")
end
# Low frequency
ct_low = ContingencyTable(low_freq_text, "rare"; windowsize=3, minfreq=1)
pmi_low = assoc_score(PMI, ct_low)
if !isempty(pmi_low)
println("\nLow frequency word 'rare':")
println(" Max PMI: $(round(maximum(pmi_low.PMI), digits=2))")
end
println("\n→ PMI tends to favor low-frequency pairs")
Frequency effects on PMI:
High frequency word 'the':
Max PMI: -3.4
Low frequency word 'rare':
Max PMI: 0.0
→ PMI tends to favor low-frequency pairs
Sparse Data Problem
using TextAssociations
# Small corpus - sparse data
small_text = "word1 word2 word3"
ct_small = ContingencyTable(small_text, "word1"; windowsize=2, minfreq=1)
# Large corpus - more reliable
large_text = repeat("word1 word2 word3 word4 word5 ", 100)
ct_large = ContingencyTable(large_text, "word1"; windowsize=2, minfreq=1)
println("Sparse data effects:")
println(" Small corpus: $(length(split(small_text))) tokens")
println(" Large corpus: $(length(split(large_text))) tokens")
println("\n→ Information-theoretic metrics need sufficient data")
Sparse data effects:
Small corpus: 3 tokens
Large corpus: 500 tokens
→ Information-theoretic metrics need sufficient data
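One practical mitigation is to raise the minfreq threshold on larger corpora so that scores are only reported for pairs observed often enough to be reliable. The cutoff of 5 below is only an example; the right value depends on the corpus.
using TextAssociations, DataFrames
# Reuse the large corpus from above, but drop collocates seen fewer than 5 times
ct_filtered = ContingencyTable(large_text, "word1"; windowsize=2, minfreq=5)
filtered = assoc_score(PMI, ct_filtered)
println("Collocates surviving minfreq=5: ", nrow(filtered))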
Choosing Information-Theoretic Metrics
Decision Guide
| Use Case | Recommended Metric | Reason |
|---|---|---|
| Finding rare associations | PMI | Highlights low-frequency patterns |
| Statistical validation | LLR | Provides p-values |
| Dimensionality reduction | PPMI | No negative values, works well with SVD |
| Frequency-weighted | PMI² or PMI³ | Emphasizes common patterns |
| Cross-corpus comparison | Normalized PMI | Comparable across corpora |
Threshold Guidelines
using TextAssociations, DataFrames
thresholds = DataFrame(
Metric = ["PMI", "PPMI", "LLR", "LLR²"],
Weak = ["0-2", "0-2", "0-3.84", "0-15"],
Moderate = ["2-4", "2-4", "3.84-10.83", "15-50"],
Strong = ["4-7", "4-7", "10.83-15.13", "50-100"],
VeryStrong = [">7", ">7", ">15.13", ">100"]
)
println("Information-Theoretic Metric Thresholds:")
for row in eachrow(thresholds)
println("\n$(row.Metric):")
println(" Weak: $(row.Weak)")
println(" Moderate: $(row.Moderate)")
println(" Strong: $(row.Strong)")
println(" Very Strong: $(row.VeryStrong)")
end
Information-Theoretic Metric Thresholds:
PMI:
Weak: 0-2
Moderate: 2-4
Strong: 4-7
Very Strong: >7
PPMI:
Weak: 0-2
Moderate: 2-4
Strong: 4-7
Very Strong: >7
LLR:
Weak: 0-3.84
Moderate: 3.84-10.83
Strong: 10.83-15.13
Very Strong: >15.13
LLR²:
Weak: 0-15
Moderate: 15-50
Strong: 50-100
Very Strong: >100
Advanced Applications
Semantic Similarity
using TextAssociations
# Use PMI for semantic similarity
function semantic_similarity(corpus_text::String, word1::String, word2::String)
# Get PMI profiles for both words
ct1 = ContingencyTable(corpus_text, word1; windowsize=5, minfreq=1)
ct2 = ContingencyTable(corpus_text, word2; windowsize=5, minfreq=1)
pmi1 = assoc_score(PPMI, ct1)
pmi2 = assoc_score(PPMI, ct2)
# Find common collocates
if !isempty(pmi1) && !isempty(pmi2)
common = intersect(pmi1.Collocate, pmi2.Collocate)
println("Common contexts for '$word1' and '$word2': ", length(common))
if length(common) > 0
# Could compute cosine similarity of the PPMI vectors here (see the sketch below)
println("Shared collocates: ", first(common, min(5, length(common))))
end
end
end
text = """
Dogs are loyal pets. Cats are independent pets.
Dogs need walks. Cats need litter boxes.
Both dogs and cats make great companions.
"""
semantic_similarity(text, "dogs", "cats")
Common contexts for 'dogs' and 'cats': 13
Shared collocates: ["and", "are", "both", "boxes", "companions"]
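The cosine-similarity step left as a comment in semantic_similarity can be filled in along the following lines. This is a sketch rather than part of the package API; it only assumes that assoc_score returns a table with Collocate and PPMI columns, as in the examples above.
using TextAssociations, LinearAlgebra
# Cosine similarity of two PPMI profiles over their combined vocabulary (sketch)
function ppmi_cosine(corpus_text::AbstractString, word1::AbstractString, word2::AbstractString)
    p1 = assoc_score(PPMI, ContingencyTable(corpus_text, word1; windowsize=5, minfreq=1))
    p2 = assoc_score(PPMI, ContingencyTable(corpus_text, word2; windowsize=5, minfreq=1))
    vocab = union(p1.Collocate, p2.Collocate)
    d1 = Dict(zip(p1.Collocate, p1.PPMI))
    d2 = Dict(zip(p2.Collocate, p2.PPMI))
    v1 = [get(d1, w, 0.0) for w in vocab]
    v2 = [get(d2, w, 0.0) for w in vocab]
    denom = norm(v1) * norm(v2)
    return denom == 0 ? 0.0 : dot(v1, v2) / denom
end
println("PPMI cosine similarity: ", round(ppmi_cosine(text, "dogs", "cats"), digits=3))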
Feature Extraction
using TextAssociations
using DataFrames
# Use PPMI for feature extraction
function extract_features(corpus_text::String, target_words::Vector{String}, top_n::Int=10)
features = Dict{String, Vector{String}}()
for word in target_words
ct = ContingencyTable(corpus_text, word; windowsize=5, minfreq=1)
ppmi = assoc_score(PPMI, ct)
if !isempty(ppmi)
# Top PPMI scores as features
sorted = sort(ppmi, :PPMI, rev=true)
features[word] = first(sorted.Collocate, min(top_n, nrow(sorted)))
else
features[word] = String[]
end
end
return features
end
sample_text = """
Machine learning uses algorithms. Deep learning uses neural networks.
Statistics uses probability. Mathematics uses logic.
"""
features = extract_features(sample_text, ["learning", "uses"], 5)
println("\nExtracted features:")
for (word, feat) in features
println(" $word: $feat")
end
Extracted features:
uses: ["algorithms", "deep", "learning", "logic", "machine"]
learning: ["algorithms", "deep", "learning", "machine", "networks"]
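The decision guide above notes that PPMI combines well with SVD. As a sketch of that combination (again assuming the Collocate and PPMI columns used throughout this page), the PPMI profiles of several target words can be stacked into a matrix and factorized:
using TextAssociations, LinearAlgebra
# Stack PPMI profiles into a words × contexts matrix and reduce it with SVD (sketch)
function ppmi_matrix(corpus_text::AbstractString, targets::Vector{String}; windowsize=5)
    profiles = [assoc_score(PPMI, ContingencyTable(corpus_text, w; windowsize=windowsize, minfreq=1))
                for w in targets]
    vocab = sort(unique(reduce(vcat, [df.Collocate for df in profiles])))
    M = zeros(length(targets), length(vocab))
    for (i, df) in enumerate(profiles), (j, c) in enumerate(vocab)
        idx = findfirst(==(c), df.Collocate)
        M[i, j] = idx === nothing ? 0.0 : df.PPMI[idx]
    end
    return M, vocab
end
M, vocab = ppmi_matrix(sample_text, ["learning", "uses"])
U, S, V = svd(M)   # row embeddings are the rows of U .* S'
println("PPMI matrix size: ", size(M), "; singular values: ", round.(S, digits=2))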
References
- Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography.
- Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction.
- Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence.
Next Steps
- Explore Statistical Metrics for hypothesis testing
- Learn about Similarity Metrics for symmetric measures
- See Choosing Metrics for practical guidance