Core Concepts

Understanding the fundamental concepts behind word association analysis is crucial for effective use of TextAssociations.jl.

Word Co-occurrence

Definition

Word co-occurrence is the foundation of collocation analysis. Two words co-occur when they appear near each other in text, within a defined window.

using TextAssociations

text = "The data scientist analyzed the data carefully."

# Visualize co-occurrence windows
function show_cooccurrences(text::String, node::String, windowsize::Int)
    words = split(lowercase(text))
    node_positions = findall(==(lowercase(node)), words)

    println("Text: $text")
    println("Node word: '$node' at positions $node_positions")
    println("Window size: $windowsize")

    for pos in node_positions
        left_window = max(1, pos - windowsize)
        right_window = min(length(words), pos + windowsize)

        context = words[left_window:right_window]
        println("\nWindow around position $pos:")
        println("  ", join(context, " "))
    end
end

show_cooccurrences(text, "data", 2)
Text: The data scientist analyzed the data carefully.
Node word: 'data' at positions [2, 6]
Window size: 2

Window around position 2:
  the data scientist analyzed

Window around position 6:
  analyzed the data carefully.

Context Windows

The window size determines how far from the node word we look for collocates; the sketch after this list compares the settings:

  • Small windows (1-3): Capture syntactic relations (adjective-noun, verb-object)
  • Medium windows (4-7): Balance syntactic and semantic relations
  • Large windows (8+): Capture semantic/topical associations
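
The effect is easy to see by scoring the same node at several window sizes. A minimal sketch (the short text and node word are purely illustrative):

using TextAssociations
using DataFrames: nrow

sample = """
The central bank raised interest rates sharply.
Analysts expect the bank to raise rates again next quarter.
"""

# Count how many collocates of "bank" fall inside each window
for ws in (2, 5, 10)
    ct = ContingencyTable(sample, "bank"; windowsize=ws, minfreq=1)
    results = assoc_score(PMI, ct)
    println("windowsize=$ws: $(nrow(results)) collocates")
end

Larger windows admit more collocates but also more noise, which is why the minfreq threshold (discussed below) matters more at larger window sizes.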

Contingency Tables

The 2×2 Table

Association metrics are calculated from contingency tables that count co-occurrences:

using TextAssociations, DataFrames

# Create a simple example
text = "big data and data science require data analysis"
ct = ContingencyTable(text, "data"; windowsize=2, minfreq=1)

# Access the internal table
internal = cached_data(ct.con_tbl)
if !isempty(internal)
    println("Contingency table for 'data':")
    for row in eachrow(internal)
        println("\nCollocate: $(row.Collocate)")
        println("  a (both occur): $(row.a)")
        println("  b (only node): $(row.b)")
        println("  c (only collocate): $(row.c)")
        println("  d (neither): $(row.d)")
        println("  Total (N): $(row.N)")
    end
end
Contingency table for 'data':

Collocate: analysis
  a (both occur): 1
  b (only node): 2
  c (only collocate): 0
  d (neither): 0
  Total (N): 3

Collocate: and
  a (both occur): 1
  b (only node): 1
  c (only collocate): 0
  d (neither): 0
  Total (N): 2

Collocate: big
  a (both occur): 1
  b (only node): 2
  c (only collocate): 0
  d (neither): 0
  Total (N): 3

Collocate: data
  a (both occur): 2
  b (only node): 1
  c (only collocate): 0
  d (neither): 0
  Total (N): 3

Collocate: require
  a (both occur): 1
  b (only node): 1
  c (only collocate): 0
  d (neither): 0
  Total (N): 2

Collocate: science
  a (both occur): 1
  b (only node): 1
  c (only collocate): 0
  d (neither): 0
  Total (N): 2

Understanding the Cells

For each word pair (node, collocate):

Cell   Meaning                   Interpretation
a      Co-occurrence frequency   How often they appear together
b      Node without collocate    Node appears alone
c      Collocate without node    Collocate appears alone
d      Neither appears           Rest of the corpus
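
The marginals of the table follow directly from these cells: the node frequency is a + b, the collocate frequency is a + c, and the number of co-occurrences expected by chance is (a + b)(a + c) / N. A quick hand computation with illustrative counts (not drawn from a real corpus):

# Illustrative cell counts
a, b, c, d = 8, 42, 15, 935
N = a + b + c + d

node_freq      = a + b                             # node occurrences
collocate_freq = a + c                             # collocate occurrences
expected_a     = node_freq * collocate_freq / N    # co-occurrences expected under independence

println("Observed co-occurrences: $a")
println("Expected under independence: $(round(expected_a, digits=2))")

Association metrics quantify, in different ways, how far the observed cell a departs from this expectation.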

Association Metrics

Metric Categories

Different metrics capture different aspects of word association:

using TextAssociations

# Demonstrate different metric properties
text = """
The bank provides financial services.
The river bank was steep and muddy.
Financial analysis requires careful consideration.
The bank offers investment opportunities.
"""

ct = ContingencyTable(text, "bank"; windowsize=3, minfreq=1)

# Calculate different metric types
info_metrics = assoc_score([PMI, PPMI], ct)
stat_metrics = assoc_score([LLR, ChiSquare], ct)
sim_metrics = assoc_score([Dice, JaccardIdx], ct)

println("Information-theoretic metrics (PMI, PPMI):")
println("  Focus: Surprise/informativeness")
println("  High when: Words occur together more than chance")

println("\nStatistical metrics (LLR, ChiSquare):")
println("  Focus: Significance/reliability")
println("  High when: Association is statistically significant")

println("\nSimilarity metrics (Dice, Jaccard):")
println("  Focus: Overlap/similarity")
println("  High when: Words share contexts")
Information-theoretic metrics (PMI, PPMI):
  Focus: Surprise/informativeness
  High when: Words occur together more than chance

Statistical metrics (LLR, ChiSquare):
  Focus: Significance/reliability
  High when: Association is statistically significant

Similarity metrics (Dice, Jaccard):
  Focus: Overlap/similarity
  High when: Words share contexts
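
In terms of the contingency-table cells, the two metrics used most often in this documentation have simple closed forms: PMI = log2(aN / ((a + b)(a + c))) and LogDice = 14 + log2(2a / ((a + b) + (a + c))), with a maximum value of 14. A hand computation with the same illustrative counts as above (the package's internal computation may differ in detail):

# PMI and LogDice from illustrative cell counts
a, b, c, d = 8, 42, 15, 935
N = a + b + c + d

pmi     = log2(a * N / ((a + b) * (a + c)))     # pointwise mutual information
logdice = 14 + log2(2a / ((a + b) + (a + c)))   # logDice, maximum value 14

println("PMI = $(round(pmi, digits=2))")
println("LogDice = $(round(logdice, digits=2))")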

Interpreting Scores

using TextAssociations
using DataFrames

# Score interpretation guidelines
function interpret_scores(results::DataFrame)
    for row in eachrow(results)
        collocate = row.Collocate

        # PMI interpretation
        pmi_strength = if row.PMI > 5
            "very strong"
        elseif row.PMI > 3
            "strong"
        elseif row.PMI > 0
            "positive"
        else
            "negative"
        end

        # LogDice interpretation (max 14)
        dice_reliability = if row.LogDice > 10
            "highly reliable"
        elseif row.LogDice > 7
            "reliable"
        else
            "weak"
        end

        println("$collocate:")
        println("  PMI: $(round(row.PMI, digits=2)) ($pmi_strength association)")
        println("  LogDice: $(round(row.LogDice, digits=2)) ($dice_reliability)")
    end
end

# Example
ct = ContingencyTable("machine learning uses learning algorithms", "learning"; windowsize=2, minfreq=1)
results = assoc_score([PMI, LogDice], ct)
interpret_scores(results)
algorithms:
  PMI: -0.69 (weak or negative association)
  LogDice: 13.42 (highly reliable)
learning:
  PMI: -0.69 (weak or negative association)
  LogDice: 14.0 (highly reliable)
machine:
  PMI: -0.69 (weak or negative association)
  LogDice: 13.42 (highly reliable)
uses:
  PMI: 0.0 (weak or negative association)
  LogDice: 14.0 (highly reliable)

Text Normalization

The TextNorm Configuration

Text preprocessing is controlled by the TextNorm struct:

using TextAssociations
using TextAnalysis: text

# Different normalization strategies
configs = [
    (name="Minimal",
     config=TextNorm(strip_case=false, strip_punctuation=false)),
    (name="Standard",
     config=TextNorm(strip_case=true, strip_punctuation=true)),
    (name="Aggressive",
     config=TextNorm(strip_case=true, strip_punctuation=true,
                    strip_accents=true, normalize_whitespace=true))
]

test_text = "Hello, WORLD! Café résumé... Multiple   spaces."

for (name, config) in configs
    doc = prep_string(test_text, config)
    println("$name: '$(text(doc))'")
end
Minimal: 'Hello, WORLD! Café résumé... Multiple spaces.'
Standard: 'hello world café résumé multiple spaces '
Aggressive: 'hello world cafe resume multiple spaces '

Unicode Normalization

Important for multilingual text:

using TextAssociations
using Unicode

# Different Unicode forms can affect matching
text1 = "café"  # é as single character
text2 = "café"  # e + combining accent

println("Visually identical: ", text1 == text2)
println("After NFC normalization: ",
    Unicode.normalize(text1, :NFC) == Unicode.normalize(text2, :NFC))

# TextNorm handles this automatically
config = TextNorm(unicode_form=:NFC)
TextNorm(true, false, :NFC, true, true, true, false, false)

Frequency Thresholds

Minimum Frequency Parameter

The minfreq parameter filters noise:

using TextAssociations
using DataFrames: nrow

text = """
The main hypothesis was confirmed.
Preliminary results support the hypothesis.
The xyzabc appeared only once.
"""

# Compare different thresholds
for minfreq in [1, 2, 3]
    ct = ContingencyTable(text, "the"; windowsize=3, minfreq=minfreq)
    results = assoc_score(PMI, ct)
    println("minfreq=$minfreq: $(nrow(results)) collocates")
end
minfreq=1: 10 collocates
minfreq=2: 2 collocates
minfreq=3: 0 collocates

Choosing Appropriate Thresholds

Guidelines for setting minfreq:

Corpus Size       Recommended minfreq   Rationale
< 1,000 words     1-2                   Preserve all data
1,000-10,000      3-5                   Filter hapax legomena
10,000-100,000    5-10                  Remove noise
> 100,000         10-20                 Focus on patterns
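
These guidelines can be wrapped in a small helper. The function below is a hypothetical convenience, not part of the TextAssociations.jl API:

# Hypothetical helper: suggest a minfreq from the corpus size in tokens
function suggested_minfreq(ntokens::Int)
    ntokens < 1_000   && return 1
    ntokens < 10_000  && return 3
    ntokens < 100_000 && return 5
    return 10
end

suggested_minfreq(2_500)    # 3
suggested_minfreq(250_000)  # 10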

Lazy Evaluation

How LazyProcess Works

TextAssociations.jl uses lazy evaluation for efficiency:

using TextAssociations

# Contingency tables are computed lazily
println("Creating ContingencyTable...")
ct = ContingencyTable("sample text here", "text"; windowsize=3, minfreq=1)
println("Created (not computed yet)")

# Computation happens on first use
println("\nFirst access (triggers computation):")
@time results = assoc_score(PMI, ct)

println("\nSecond access (uses cache):")
@time results2 = assoc_score(LogDice, ct)
2×4 DataFrame
 Row │ Node    Collocate  Frequency  LogDice
     │ String  String     Int64      Float64
─────┼───────────────────────────────────────
   1 │ text    here               1     14.0
   2 │ text    sample             1     14.0

Benefits

  1. Memory efficiency: Data computed only when needed
  2. Performance: Cached results for multiple metrics
  3. Flexibility: Chain operations without intermediate computation
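
The idea behind LazyProcess is the familiar compute-on-first-access cache. A minimal sketch of that pattern, independent of the package's actual implementation:

# Minimal compute-once cache illustrating lazy evaluation
mutable struct Lazy{T}
    compute::Function
    value::Union{Nothing,T}
end
Lazy{T}(f::Function) where {T} = Lazy{T}(f, nothing)

function force!(l::Lazy)
    if l.value === nothing
        l.value = l.compute()   # the expensive work happens here, exactly once
    end
    return l.value
end

table = Lazy{Vector{Int}}(() -> (println("computing..."); collect(1:5)))
force!(table)  # prints "computing..." and returns the data
force!(table)  # returns the cached data without recomputation

In TextAssociations.jl the cached object is the contingency table itself, which is why calling assoc_score with a second metric on the same ContingencyTable is cheap.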

Statistical Significance

Understanding P-values

Some metrics can be interpreted as statistical significance tests:

using TextAssociations

text = """
Statistical analysis requires careful statistical methods.
The statistical approach yields statistical significance.
Random words appear randomly without pattern.
"""

ct = ContingencyTable(text, "statistical"; windowsize=3, minfreq=1)
results = assoc_score([LLR, ChiSquare], ct)

# Interpret statistical significance
for row in eachrow(results)
    llr = row.LLR
    chi2 = row.ChiSquare

    # LLR critical values
    p_value = if llr > 10.83
        "p < 0.001"
    elseif llr > 6.63
        "p < 0.01"
    elseif llr > 3.84
        "p < 0.05"
    else
        "not significant"
    end

    println("$(row.Collocate): LLR=$(round(llr, digits=2)) ($p_value)")
end
analysis: LLR=1.92 (not significant)
approach: LLR=1.92 (not significant)
careful: LLR=1.92 (not significant)
methods: LLR=1.92 (not significant)
random: LLR=1.53 (not significant)
requires: LLR=1.92 (not significant)
significance: LLR=1.53 (not significant)
statistical: LLR=6.09 (p < 0.05)
the: LLR=1.92 (not significant)
words: LLR=1.53 (not significant)
yields: LLR=1.92 (not significant)
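
The cutoffs used above are the critical values of a chi-square distribution with one degree of freedom. When an exact p-value is needed, it can be computed from that distribution directly; this sketch assumes the Distributions.jl package, which is not a dependency of TextAssociations.jl:

using Distributions

# p-value for an LLR (or chi-square) statistic, 1 degree of freedom
llr_pvalue(stat) = ccdf(Chisq(1), stat)

llr_pvalue(6.09)   # ≈ 0.014, i.e. p < 0.05
llr_pvalue(1.92)   # ≈ 0.17, not significant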

Best Practices

1. Metric Selection

# For discovery
discovery_metrics = [PMI, PPMI]

# For validation
validation_metrics = [LLR, ChiSquare]

# For comparison across corpora
stable_metrics = [LogDice, PPMI]

2. Parameter Guidelines

# Default parameters for different purposes
const SYNTAX_PARAMS = (windowsize=2, minfreq=5)
const SEMANTIC_PARAMS = (windowsize=5, minfreq=5)
const TOPIC_PARAMS = (windowsize=10, minfreq=10)
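
Because these presets are named tuples, they can be splatted directly into the keyword arguments of ContingencyTable. A usage sketch (the tiny example text and the lowered minfreq are only for illustration):

using TextAssociations

params = (windowsize=5, minfreq=1)   # e.g. SEMANTIC_PARAMS, with minfreq lowered for a tiny text
doc = "machine learning models learn patterns and machine learning needs data"
ct = ContingencyTable(doc, "learning"; params...)
results = assoc_score(LogDice, ct)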

3. Validation Strategy

Always validate findings with multiple approaches (a combined sketch follows this list):

  1. Use multiple metrics
  2. Check different window sizes
  3. Examine concordance lines
  4. Compare with domain knowledge
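
A compact way to combine the first two checks is to require a collocate to score positively on PMI and to rank highly by LogDice at more than one window size. A sketch under those assumptions (the helper, thresholds, and example text are illustrative, not part of the package):

using TextAssociations
using DataFrames

# Collocates with positive PMI that rank in the top `k` by LogDice at every window size
function stable_collocates(text::AbstractString, node::AbstractString;
                           windows=(2, 5), k=10)
    tops = map(windows) do ws
        ct = ContingencyTable(text, node; windowsize=ws, minfreq=1)
        scored = sort(assoc_score([PMI, LogDice], ct), :LogDice, rev=true)
        keep = filter(r -> r.PMI > 0, scored)
        Set(first(keep, min(k, nrow(keep))).Collocate)
    end
    return reduce(intersect, tops)
end

doc = "the central bank raised rates while the river bank flooded near the old bank building"
stable_collocates(doc, "bank"; windows=(2, 4), k=5)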

Next Steps