Information-Theoretic Metrics

Information-theoretic metrics measure how much information words share, based on their co-occurrence patterns. Let's start by creating a ContingencyTable that will be reused in the examples throughout this page.

using TextAssociations

text = """
Natural language processing uses computational linguistics.
Computational methods analyze natural language data.
Language models process natural text efficiently.
"""

ct = ContingencyTable(text, "language"; windowsize=3, minfreq=1)

println("Contingency table created:")
println("  node  : ", ct.node)
println("  window: 3, minfreq: 1")
Contingency table created:
  node  : language
  window: 3, minfreq: 1

Pointwise Mutual Information (PMI)

Theory

PMI measures how much more likely two words are to co-occur than would be expected by chance:

\[PMI(x,y) = \log_2 \frac{P(x,y)}{P(x)P(y)}\]
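
As a quick numeric sanity check (illustrative probabilities only, not drawn from the corpus below): if P(x,y) = 0.02, P(x) = 0.1, and P(y) = 0.05, then

\[PMI(x,y) = \log_2 \frac{0.02}{0.1 \times 0.05} = \log_2 4 = 2\]

so the pair carries 2 bits more information than chance co-occurrence would suggest.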

Implementation

results = assoc_score(PMI, ct)

println("PMI scores for 'language':")
for row in eachrow(results)
    interpretation = if row.PMI > 5
        "very strong"
    elseif row.PMI > 3
        "strong"
    elseif row.PMI > 0
        "positive"
    else
        "negative/no association"
    end

    println("  $(row.Collocate): $(round(row.PMI, digits=2)) ($interpretation)")
end
PMI scores for 'language':
  analyze: -1.79 (negative/no association)
  computational: -1.1 (negative/no association)
  data: -0.69 (negative/no association)
  language: -1.1 (negative/no association)
  methods: -1.1 (negative/no association)
  models: -0.69 (negative/no association)
  natural: -1.1 (negative/no association)
  process: -1.1 (negative/no association)
  processing: -1.1 (negative/no association)
  uses: -1.1 (negative/no association)

PMI Variants

using TextAssociations

# Compare PMI variants
variants = assoc_score([PMI, PMI², PMI³, PPMI], ct)

println("\nPMI Variants Comparison:")
println("Standard PMI: Balanced information measure")
println("PMI²: Emphasizes frequency (f²/expected)")
println("PMI³: Strong frequency bias (f³/expected)")
println("PPMI: Positive PMI (negative values → 0)")

for row in eachrow(first(variants, 3))
    println("\n$(row.Collocate):")
    println("  PMI:  $(round(row.PMI, digits=2))")
    println("  PMI²: $(round(row.PMI², digits=2))")
    println("  PMI³: $(round(row.PMI³, digits=2))")
    println("  PPMI: $(round(row.PPMI, digits=2))")
end

PMI Variants Comparison:
Standard PMI: Balanced information measure
PMI²: Emphasizes frequency (f²/expected)
PMI³: Strong frequency bias (f³/expected)
PPMI: Positive PMI (negative values → 0)

analyze:
  PMI:  -1.79
  PMI²: -1.79
  PMI³: -1.79
  PPMI: 0.0

computational:
  PMI:  -1.1
  PMI²: -1.1
  PMI³: -1.1
  PPMI: 0.0

data:
  PMI:  -0.69
  PMI²: -0.69
  PMI³: -0.69
  PPMI: 0.0
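
For reference, the PMI^k family is commonly defined by raising the joint probability to the k-th power before taking the ratio (the package's exact formulation may differ in detail; this is the usual textbook form):

\[PMI^k(x,y) = \log_2 \frac{P(x,y)^k}{P(x)P(y)}\]

With k = 2 (PMI²) or k = 3 (PMI³) the joint probability dominates, so frequent pairs are rewarded, counteracting plain PMI's bias toward rare events.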

Log-Likelihood Ratio (LLR)

Theory

LLR compares observed frequencies with expected frequencies under independence:

\[LLR = 2 \sum_{ij} O_{ij} \log \frac{O_{ij}}{E_{ij}}\]

Under the null hypothesis of independence, LLR is asymptotically χ²-distributed with one degree of freedom, which is where the critical values used below (3.84, 6.63, 10.83, 15.13) come from.

Implementation

using TextAssociations

# LLR for statistical significance
llr_results = assoc_score([LLR, LLR²], ct)

println("\nLog-Likelihood Ratio:")
for row in eachrow(llr_results)
    # Critical values for significance
    significance = if row.LLR > 15.13
        "p < 0.0001 (****)"
    elseif row.LLR > 10.83
        "p < 0.001 (***)"
    elseif row.LLR > 6.63
        "p < 0.01 (**)"
    elseif row.LLR > 3.84
        "p < 0.05 (*)"
    else
        "not significant"
    end

    println("  $(row.Collocate): LLR=$(round(row.LLR, digits=2)) $significance")
    println("    LLR²=$(round(row.LLR², digits=2)) (squared variant)")
end

Log-Likelihood Ratio:
  analyze: LLR=0.0 not significant
    LLR²=0.0 (squared variant)
  computational: LLR=1.59 not significant
    LLR²=2.52 (squared variant)
  data: LLR=2.23 not significant
    LLR²=4.98 (squared variant)
  language: LLR=3.82 not significant
    LLR²=14.59 (squared variant)
  methods: LLR=1.59 not significant
    LLR²=2.52 (squared variant)
  models: LLR=2.23 not significant
    LLR²=4.98 (squared variant)
  natural: LLR=8.32 p < 0.01 (**)
    LLR²=69.19 (squared variant)
  process: LLR=1.59 not significant
    LLR²=2.52 (squared variant)
  processing: LLR=1.59 not significant
    LLR²=2.52 (squared variant)
  uses: LLR=1.59 not significant
    LLR²=2.52 (squared variant)
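
In practice you typically keep only the statistically significant rows. A short sketch using the llr_results DataFrame from above and the 5% critical value:

using TextAssociations

# Keep only collocates significant at p < 0.05 (χ² critical value, 1 df)
significant = filter(row -> row.LLR > 3.84, llr_results)
println("Significant collocates: ", significant.Collocate)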

Mutual Information

Theory

Mutual Information measures the total information shared between two variables:

\[MI(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}\]

While PMI is the pointwise version, MI is the expected value of PMI over all possible outcomes.

Relationship to PMI

using TextAssociations

# PMI is the pointwise (local) version
# MI would be the weighted sum over all contexts

function explain_mi_pmi_relationship()
    println("Relationship between MI and PMI:")
    println("  PMI: Information for specific word pair")
    println("  MI: Average information over all pairs")
    println("\nMI = Σ P(x,y) × PMI(x,y)")
    println("\nPMI tells us about specific associations")
    println("MI tells us about overall dependency")
end

explain_mi_pmi_relationship()
Relationship between MI and PMI:
  PMI: Information for specific word pair
  MI: Average information over all pairs

MI = Σ P(x,y) × PMI(x,y)

PMI tells us about specific associations
MI tells us about overall dependency
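
To make the relationship concrete, here is a self-contained sketch (plain Julia, independent of TextAssociations) that computes MI as the probability-weighted sum of pointwise PMI values for a toy 2×2 joint distribution:

# Toy joint distribution P(x, y) over two binary variables
P = [0.4 0.1;
     0.1 0.4]

px = sum(P, dims=2)    # marginal P(x), a 2×1 matrix
py = sum(P, dims=1)    # marginal P(y), a 1×2 matrix

# MI = Σ P(x,y) × PMI(x,y), skipping any zero-probability cells
mi = sum(P[i, j] * log2(P[i, j] / (px[i] * py[j]))
         for i in 1:2, j in 1:2 if P[i, j] > 0)

println("MI of the toy distribution: ", round(mi, digits=3), " bits")

For this distribution the sum works out to roughly 0.28 bits: the average amount of information either variable carries about the other.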

Information Gain

Information gain measures how much knowing one word reduces uncertainty about another:

using TextAssociations

# Demonstrate information gain concept
function information_gain_example(ct::ContingencyTable)
    results = assoc_score([PMI, PPMI], ct)

    # High PMI indicates high information gain
    high_info = filter(row -> row.PMI > 3, results)

    println("\nHigh Information Gain pairs (PMI > 3):")
    for row in eachrow(high_info)
        println("  Knowing '$(ct.node)' gives $(round(row.PMI, digits=2)) bits about '$(row.Collocate)'")
    end
end

information_gain_example(ct)

High Information Gain pairs (PMI > 3):

In this tiny example corpus every PMI score is negative, so no pair clears the threshold and the list is empty; on a larger corpus, genuinely associated pairs would appear here.

Positive PMI (PPMI)

Motivation

PPMI addresses the issue that PMI can be negative for words that co-occur less often than expected by chance.
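
Formally, PPMI simply clamps negative scores at zero:

\[PPMI(x,y) = \max(PMI(x,y), 0)\]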

using TextAssociations

# Compare PMI and PPMI
both = assoc_score([PMI, PPMI], ct)

println("\nPMI vs PPMI:")
for row in eachrow(both)
    if row.PMI < 0
        println("  $(row.Collocate): PMI=$(round(row.PMI, digits=2)) → PPMI=0")
        println("    (Negative association set to zero)")
    else
        println("  $(row.Collocate): PMI=$(round(row.PMI, digits=2)) = PPMI=$(round(row.PPMI, digits=2))")
    end
end

PMI vs PPMI:
  analyze: PMI=-1.79 → PPMI=0
    (Negative association set to zero)
  computational: PMI=-1.1 → PPMI=0
    (Negative association set to zero)
  data: PMI=-0.69 → PPMI=0
    (Negative association set to zero)
  language: PMI=-1.1 → PPMI=0
    (Negative association set to zero)
  methods: PMI=-1.1 → PPMI=0
    (Negative association set to zero)
  models: PMI=-0.69 → PPMI=0
    (Negative association set to zero)
  natural: PMI=-1.1 → PPMI=0
    (Negative association set to zero)
  process: PMI=-1.1 → PPMI=0
    (Negative association set to zero)
  processing: PMI=-1.1 → PPMI=0
    (Negative association set to zero)
  uses: PMI=-1.1 → PPMI=0
    (Negative association set to zero)

Normalized Variants

Normalized PMI

Some applications benefit from normalized PMI variants:

# Normalized PMI (NPMI) - scales to [-1, 1]
# NPMI = PMI / -log(P(x,y))

# This makes PMI values more comparable across different frequency ranges
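
TextAssociations may or may not expose an NPMI metric type; the sketch below therefore computes NPMI directly from probabilities. The helper function npmi and its example probabilities are purely illustrative:

# NPMI = PMI / (-log2(P(x,y))), which rescales PMI to the range [-1, 1]
function npmi(pxy::Float64, px::Float64, py::Float64)
    pmi = log2(pxy / (px * py))
    return pmi / (-log2(pxy))
end

# Perfectly associated pair: P(x,y) equals both marginals → NPMI = 1
println(npmi(0.05, 0.05, 0.05))

# Independent pair: P(x,y) = P(x)P(y) → NPMI = 0
println(npmi(0.01, 0.1, 0.1))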

Practical Considerations

Frequency Effects

using TextAssociations

# Create examples with different frequencies
high_freq_text = repeat("the word the word the word ", 10)
low_freq_text = "rare unique special extraordinary unusual"

println("Frequency effects on PMI:")

# High frequency
ct_high = ContingencyTable(high_freq_text, "the"; windowsize=2, minfreq=1)
pmi_high = assoc_score(PMI, ct_high)
if !isempty(pmi_high)
    println("\nHigh frequency word 'the':")
    println("  Max PMI: $(round(maximum(pmi_high.PMI), digits=2))")
end

# Low frequency
ct_low = ContingencyTable(low_freq_text, "rare"; windowsize=3, minfreq=1)
pmi_low = assoc_score(PMI, ct_low)
if !isempty(pmi_low)
    println("\nLow frequency word 'rare':")
    println("  Max PMI: $(round(maximum(pmi_low.PMI), digits=2))")
end

println("\n→ PMI tends to favor low-frequency pairs")
Frequency effects on PMI:

High frequency word 'the':
  Max PMI: -3.4

Low frequency word 'rare':
  Max PMI: 0.0

→ PMI tends to favor low-frequency pairs
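
The bias is easy to see from the definition: if two words each occur only once in a corpus of roughly N tokens and happen to co-occur, then P(x,y) ≈ 1/N while P(x)P(y) ≈ 1/N², so

\[PMI \approx \log_2 \frac{1/N}{1/N^2} = \log_2 N\]

A single accidental co-occurrence of two rare words therefore receives close to the maximum attainable score.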

Sparse Data Problem

using TextAssociations

# Small corpus - sparse data
small_text = "word1 word2 word3"
ct_small = ContingencyTable(small_text, "word1"; windowsize=2, minfreq=1)

# Large corpus - more reliable
large_text = repeat("word1 word2 word3 word4 word5 ", 100)
ct_large = ContingencyTable(large_text, "word1"; windowsize=2, minfreq=1)

println("Sparse data effects:")
println("  Small corpus: $(length(split(small_text))) tokens")
println("  Large corpus: $(length(split(large_text))) tokens")
println("\n→ Information-theoretic metrics need sufficient data")
Sparse data effects:
  Small corpus: 3 tokens
  Large corpus: 500 tokens

→ Information-theoretic metrics need sufficient data
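
One practical mitigation, assuming minfreq filters out collocates whose co-occurrence count falls below the given value (as its name suggests), is to raise it so that pairs observed only once or twice never reach the scoring stage:

using TextAssociations, DataFrames

# Require at least 3 co-occurrences before a collocate is scored
ct_filtered = ContingencyTable(large_text, "word1"; windowsize=2, minfreq=3)
reliable = assoc_score(PMI, ct_filtered)
println("Collocates surviving minfreq=3: ", nrow(reliable))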

Choosing Information-Theoretic Metrics

Decision Guide

| Use Case                  | Recommended Metric | Reason                                   |
|---------------------------|--------------------|------------------------------------------|
| Finding rare associations | PMI                | Highlights low-frequency patterns        |
| Statistical validation    | LLR                | Provides p-values                        |
| Dimensionality reduction  | PPMI               | No negative values, works well with SVD  |
| Frequency-weighted        | PMI² or PMI³       | Emphasizes common patterns               |
| Cross-corpus comparison   | Normalized PMI     | Comparable across corpora                |

Threshold Guidelines

using TextAssociations, DataFrames

thresholds = DataFrame(
    Metric = ["PMI", "PPMI", "LLR", "LLR²"],
    Weak = ["0-2", "0-2", "0-3.84", "0-15"],
    Moderate = ["2-4", "2-4", "3.84-10.83", "15-50"],
    Strong = ["4-7", "4-7", "10.83-15.13", "50-100"],
    VeryStrong = [">7", ">7", ">15.13", ">100"]
)

println("Information-Theoretic Metric Thresholds:")
for row in eachrow(thresholds)
    println("\n$(row.Metric):")
    println("  Weak: $(row.Weak)")
    println("  Moderate: $(row.Moderate)")
    println("  Strong: $(row.Strong)")
    println("  Very Strong: $(row.VeryStrong)")
end
Information-Theoretic Metric Thresholds:

PMI:
  Weak: 0-2
  Moderate: 2-4
  Strong: 4-7
  Very Strong: >7

PPMI:
  Weak: 0-2
  Moderate: 2-4
  Strong: 4-7
  Very Strong: >7

LLR:
  Weak: 0-3.84
  Moderate: 3.84-10.83
  Strong: 10.83-15.13
  Very Strong: >15.13

LLR²:
  Weak: 0-15
  Moderate: 15-50
  Strong: 50-100
  Very Strong: >100

Advanced Applications

Semantic Similarity

using TextAssociations

# Use PMI for semantic similarity
function semantic_similarity(corpus_text::String, word1::String, word2::String)
    # Get PMI profiles for both words
    ct1 = ContingencyTable(corpus_text, word1; windowsize=5, minfreq=1)
    ct2 = ContingencyTable(corpus_text, word2; windowsize=5, minfreq=1)

    pmi1 = assoc_score(PPMI, ct1)
    pmi2 = assoc_score(PPMI, ct2)

    # Find common collocates
    if !isempty(pmi1) && !isempty(pmi2)
        common = intersect(pmi1.Collocate, pmi2.Collocate)
        println("Common contexts for '$word1' and '$word2': ", length(common))

        if length(common) > 0
            # Could compute cosine similarity of PMI vectors here
            println("Shared collocates: ", first(common, min(5, length(common))))
        end
    end
end

text = """
Dogs are loyal pets. Cats are independent pets.
Dogs need walks. Cats need litter boxes.
Both dogs and cats make great companions.
"""

semantic_similarity(text, "dogs", "cats")
Common contexts for 'dogs' and 'cats': 13
Shared collocates: ["and", "are", "both", "boxes", "companions"]
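
The comment in the function above hints at the missing step. Here is one way to finish it, assuming the Collocate and PPMI columns shown in the earlier outputs; the helper ppmi_cosine is illustrative, not part of the package:

using TextAssociations, DataFrames
using LinearAlgebra: dot, norm

# Cosine similarity of two PPMI profiles over their shared collocates
function ppmi_cosine(pmi1::DataFrame, pmi2::DataFrame)
    common = intersect(pmi1.Collocate, pmi2.Collocate)
    isempty(common) && return 0.0

    # Align both PPMI vectors on the shared vocabulary
    v1 = [only(pmi1.PPMI[pmi1.Collocate .== c]) for c in common]
    v2 = [only(pmi2.PPMI[pmi2.Collocate .== c]) for c in common]

    denom = norm(v1) * norm(v2)
    return denom == 0 ? 0.0 : dot(v1, v2) / denom
end

PPMI rather than raw PMI is used here because its non-negative values behave well in a dot product; on very small corpora many PPMI scores are zero, so the measure only becomes informative with more data.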

Feature Extraction

using TextAssociations
using DataFrames

# Use PPMI for feature extraction
function extract_features(corpus_text::String, target_words::Vector{String}, top_n::Int=10)
    features = Dict{String, Vector{String}}()

    for word in target_words
        ct = ContingencyTable(corpus_text, word; windowsize=5, minfreq=1)
        ppmi = assoc_score(PPMI, ct)

        if !isempty(ppmi)
            # Top PPMI scores as features
            sorted = sort(ppmi, :PPMI, rev=true)
            features[word] = first(sorted.Collocate, min(top_n, nrow(sorted)))
        else
            features[word] = String[]
        end
    end

    return features
end

sample_text = """
Machine learning uses algorithms. Deep learning uses neural networks.
Statistics uses probability. Mathematics uses logic.
"""

features = extract_features(sample_text, ["learning", "uses"], 5)
println("\nExtracted features:")
for (word, feat) in features
    println("  $word: $feat")
end

Extracted features:
  uses: ["algorithms", "deep", "learning", "logic", "machine"]
  learning: ["algorithms", "deep", "learning", "machine", "networks"]

References

  1. Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography.
  2. Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction.
  3. Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence.

Next Steps