Core Concepts
Understanding the fundamental concepts behind word association analysis is crucial for effective use of TextAssociations.jl.
Word Co-occurrence
Definition
Word co-occurrence is the foundation of collocation analysis. Two words co-occur when they appear near each other in text, within a defined window.
using TextAssociations
text = "The data scientist analyzed the data carefully."
# Visualize co-occurrence windows
function show_cooccurrences(text::String, node::String, windowsize::Int)
    words = split(lowercase(text))
    node_positions = findall(==(lowercase(node)), words)
    println("Text: $text")
    println("Node word: '$node' at positions $node_positions")
    println("Window size: $windowsize")
    for pos in node_positions
        left_window = max(1, pos - windowsize)
        right_window = min(length(words), pos + windowsize)
        context = words[left_window:right_window]
        println("\nWindow around position $pos:")
        println("  ", join(context, " "))
    end
end
show_cooccurrences(text, "data", 2)
Text: The data scientist analyzed the data carefully.
Node word: 'data' at positions [2, 6]
Window size: 2
Window around position 2:
the data scientist analyzed
Window around position 6:
analyzed the data carefully.
Context Windows
The window size determines how far from the node word we look for collocates:
- Small windows (1-3): Capture syntactic relations (adjective-noun, verb-object)
- Medium windows (4-7): Balance syntactic and semantic relations
- Large windows (8+): Capture semantic/topical associations
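To see the effect in practice, the short loop below compares how many collocates survive at different window sizes, reusing the ContingencyTable and assoc_score calls introduced later on this page. The sample text is arbitrary and the exact counts will depend on your data.

```julia
using TextAssociations
using DataFrames: nrow

sample = "The data scientist analyzed the data carefully and archived the data analysis report."

# Wider windows admit more candidate collocates for the same node word
for ws in (2, 5, 10)
    ct = ContingencyTable(sample, "data"; windowsize=ws, minfreq=1)
    println("windowsize=$ws: $(nrow(assoc_score(PMI, ct))) collocates")
end
```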
Contingency Tables
The 2×2 Table
Association metrics are calculated from contingency tables that count co-occurrences:
using TextAssociations, DataFrames
# Create a simple example
text = "big data and data science require data analysis"
ct = ContingencyTable(text, "data"; windowsize=2, minfreq=1)
# Access the internal table
internal = cached_data(ct.con_tbl)
if !isempty(internal)
    println("Contingency table for 'data':")
    for row in eachrow(internal)
        println("\nCollocate: $(row.Collocate)")
        println("  a (both occur): $(row.a)")
        println("  b (only node): $(row.b)")
        println("  c (only collocate): $(row.c)")
        println("  d (neither): $(row.d)")
        println("  Total (N): $(row.N)")
    end
end
Contingency table for 'data':
Collocate: analysis
a (both occur): 1
b (only node): 2
c (only collocate): 0
d (neither): 0
Total (N): 3
Collocate: and
a (both occur): 1
b (only node): 1
c (only collocate): 0
d (neither): 0
Total (N): 2
Collocate: big
a (both occur): 1
b (only node): 2
c (only collocate): 0
d (neither): 0
Total (N): 3
Collocate: data
a (both occur): 2
b (only node): 1
c (only collocate): 0
d (neither): 0
Total (N): 3
Collocate: require
a (both occur): 1
b (only node): 1
c (only collocate): 0
d (neither): 0
Total (N): 2
Collocate: science
a (both occur): 1
b (only node): 1
c (only collocate): 0
d (neither): 0
Total (N): 2
Understanding the Cells
For each word pair (node, collocate):
Cell | Meaning | Interpretation |
---|---|---|
a | Co-occurrence frequency | How often they appear together |
b | Node without collocate | Node appears alone |
c | Collocate without node | Collocate appears alone |
d | Neither appears | Rest of the corpus |
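To see how these cells feed a metric, the snippet below computes pointwise mutual information directly from a, b, c, and d. This is an illustrative calculation of the standard formula (written here with the natural logarithm; the base used by a given implementation may differ), not the package's internal code.

```julia
# PMI compares the observed joint probability of node and collocate
# with the probability expected if the two words were independent.
function pmi_from_cells(a, b, c, d)
    N = a + b + c + d            # total observations
    p_joint     = a / N          # P(node, collocate)
    p_node      = (a + b) / N    # P(node)
    p_collocate = (a + c) / N    # P(collocate)
    log(p_joint / (p_node * p_collocate))
end

pmi_from_cells(10, 40, 30, 920)  # > 0: the pair co-occurs more often than chance predicts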
Association Metrics
Metric Categories
Different metrics capture different aspects of word association:
using TextAssociations
# Demonstrate different metric properties
text = """
The bank provides financial services.
The river bank was steep and muddy.
Financial analysis requires careful consideration.
The bank offers investment opportunities.
"""
ct = ContingencyTable(text, "bank"; windowsize=3, minfreq=1)
# Calculate different metric types
info_metrics = assoc_score([PMI, PPMI], ct)
stat_metrics = assoc_score([LLR, ChiSquare], ct)
sim_metrics = assoc_score([Dice, JaccardIdx], ct)
println("Information-theoretic metrics (PMI, PPMI):")
println(" Focus: Surprise/informativeness")
println(" High when: Words occur together more than chance")
println("\nStatistical metrics (LLR, ChiSquare):")
println(" Focus: Significance/reliability")
println(" High when: Association is statistically significant")
println("\nSimilarity metrics (Dice, Jaccard):")
println(" Focus: Overlap/similarity")
println(" High when: Words share contexts")
Information-theoretic metrics (PMI, PPMI):
Focus: Surprise/informativeness
High when: Words occur together more than chance
Statistical metrics (LLR, ChiSquare):
Focus: Significance/reliability
High when: Association is statistically significant
Similarity metrics (Dice, Jaccard):
Focus: Overlap/similarity
High when: Words share contexts
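The three calls above can also be merged into a single request: as the other examples on this page show, assoc_score accepts a vector of metric types and returns one DataFrame with a column per metric, which makes side-by-side comparison straightforward. Continuing from the ct defined above:

```julia
# One call, one DataFrame: a column for each requested metric
combined = assoc_score([PMI, PPMI, LLR, ChiSquare, Dice, JaccardIdx], ct)
first(combined, 5)
```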
Interpreting Scores
using TextAssociations
using DataFrames
# Score interpretation guidelines
function interpret_scores(results::DataFrame)
    for row in eachrow(results)
        collocate = row.Collocate
        # PMI interpretation
        pmi_strength = if row.PMI > 5
            "very strong"
        elseif row.PMI > 3
            "strong"
        elseif row.PMI > 0
            "positive"
        else
            "negative"
        end
        # LogDice interpretation (max 14)
        dice_reliability = if row.LogDice > 10
            "highly reliable"
        elseif row.LogDice > 7
            "reliable"
        else
            "weak"
        end
        println("$collocate:")
        println("  PMI: $(round(row.PMI, digits=2)) ($pmi_strength association)")
        println("  LogDice: $(round(row.LogDice, digits=2)) ($dice_reliability)")
    end
end
# Example
ct = ContingencyTable("machine learning uses learning algorithms", "learning"; windowsize=2, minfreq=1)
results = assoc_score([PMI, LogDice], ct)
interpret_scores(results)
algorithms:
PMI: -0.69 (negative association)
LogDice: 13.42 (highly reliable)
learning:
PMI: -0.69 (negative association)
LogDice: 14.0 (highly reliable)
machine:
PMI: -0.69 (negative association)
LogDice: 13.42 (highly reliable)
uses:
PMI: 0.0 (negative association)
LogDice: 14.0 (highly reliable)
Text Normalization
The TextNorm Configuration
Text preprocessing is controlled by the TextNorm struct:
using TextAssociations
using TextAnalysis: text
# Different normalization strategies
configs = [
    (name="Minimal",
     config=TextNorm(strip_case=false, strip_punctuation=false)),
    (name="Standard",
     config=TextNorm(strip_case=true, strip_punctuation=true)),
    (name="Aggressive",
     config=TextNorm(strip_case=true, strip_punctuation=true,
                     strip_accents=true, normalize_whitespace=true))
]
test_text = "Hello, WORLD! Café résumé... Multiple spaces."
for (name, config) in configs
    doc = prep_string(test_text, config)
    println("$name: '$(text(doc))'")
end
Minimal: 'Hello, WORLD! Café résumé... Multiple spaces.'
Standard: 'hello world café résumé multiple spaces '
Aggressive: 'hello world cafe resume multiple spaces '
Unicode Normalization
Unicode normalization is important for multilingual text, because visually identical strings can differ in their underlying code points:
using TextAssociations
using Unicode
# Different Unicode forms can affect matching
text1 = "café" # é as single character
text2 = "café" # e + combining accent
println("Visually identical: ", text1 == text2)
println("After NFC normalization: ",
Unicode.normalize(text1, :NFC) == Unicode.normalize(text2, :NFC))
# TextNorm handles this automatically
config = TextNorm(unicode_form=:NFC)
TextNorm(true, false, :NFC, true, true, true, false, false)
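To check that claim, the sketch below runs both spellings through prep_string with the NFC configuration. It assumes prep_string applies the configured unicode_form before tokenization, as described above.

```julia
using TextAssociations
using TextAnalysis: text

config = TextNorm(unicode_form=:NFC)

precomposed = "caf\u00e9"   # é as a single code point
decomposed  = "cafe\u0301"  # e followed by a combining acute accent

# If prep_string applies the configured Unicode form, both spellings normalize
# to the same string and will be counted as the same token
doc1 = prep_string(precomposed, config)
doc2 = prep_string(decomposed, config)
println(text(doc1) == text(doc2))
```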
Frequency Thresholds
Minimum Frequency Parameter
The minfreq parameter filters noise:
using TextAssociations
using DataFrames: nrow
text = """
The main hypothesis was confirmed.
Preliminary results support the hypothesis.
The xyzabc appeared only once.
"""
# Compare different thresholds
for minfreq in [1, 2, 3]
    ct = ContingencyTable(text, "the"; windowsize=3, minfreq=minfreq)
    results = assoc_score(PMI, ct)
    println("minfreq=$minfreq: $(nrow(results)) collocates")
end
minfreq=1: 10 collocates
minfreq=2: 2 collocates
minfreq=3: 0 collocates
Choosing Appropriate Thresholds
Guidelines for setting minfreq:
Corpus Size | Recommended minfreq | Rationale |
---|---|---|
< 1,000 words | 1-2 | Preserve all data |
1,000-10,000 | 3-5 | Filter hapax legomena |
10,000-100,000 | 5-10 | Remove noise |
> 100,000 | 10-20 | Focus on patterns |
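The table can be condensed into a rule of thumb. The helper below is a hypothetical convenience function (not part of TextAssociations.jl) that maps a corpus size in words to a starting minfreq; tune the cut-offs for your own data.

```julia
# Hypothetical helper mirroring the guideline table above
function suggested_minfreq(corpus_words::Int)
    corpus_words < 1_000   && return 2
    corpus_words < 10_000  && return 5
    corpus_words < 100_000 && return 10
    return 20
end

suggested_minfreq(50_000)  # -> 10
```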
Lazy Evaluation
How LazyProcess Works
TextAssociations.jl uses lazy evaluation for efficiency:
using TextAssociations
# Contingency tables are computed lazily
println("Creating ContingencyTable...")
ct = ContingencyTable("sample text here", "text"; windowsize=3, minfreq=1)
println("Created (not computed yet)")
# Computation happens on first use
println("\nFirst access (triggers computation):")
@time results = assoc_score(PMI, ct)
println("\nSecond access (uses cache):")
@time results2 = assoc_score(LogDice, ct)
Row | Node | Collocate | Frequency | LogDice |
---|---|---|---|---|
 | String | String | Int64 | Float64 |
1 | text | here | 1 | 14.0 |
2 | text | sample | 1 | 14.0 |
Benefits
- Memory efficiency: Data computed only when needed
- Performance: Cached results for multiple metrics
- Flexibility: Chain operations without intermediate computation
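The idea behind this behavior can be sketched in a few lines of plain Julia: a value is stored as an unevaluated thunk and computed at most once. This is a simplified illustration of the caching pattern, not the package's actual LazyProcess implementation.

```julia
# A minimal lazy cache: holds a zero-argument function and runs it at most once
mutable struct Lazy
    thunk::Function
    value::Any
    computed::Bool
end
Lazy(f::Function) = Lazy(f, nothing, false)

function force!(l::Lazy)
    if !l.computed
        l.value = l.thunk()   # first access: run the expensive computation
        l.computed = true
    end
    return l.value            # later accesses: reuse the cached result
end

expensive = Lazy(() -> sum(rand(10_000_000)))
@time force!(expensive)  # pays the computation cost
@time force!(expensive)  # returns the cached value almost instantly
```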
Statistical Significance
Understanding P-values
Some metrics can be compared against critical values to assess statistical significance:
using TextAssociations
text = """
Statistical analysis requires careful statistical methods.
The statistical approach yields statistical significance.
Random words appear randomly without pattern.
"""
ct = ContingencyTable(text, "statistical"; windowsize=3, minfreq=1)
results = assoc_score([LLR, ChiSquare], ct)
# Interpret statistical significance
for row in eachrow(results)
    llr = row.LLR
    chi2 = row.ChiSquare
    # LLR critical values
    p_value = if llr > 10.83
        "p < 0.001"
    elseif llr > 6.63
        "p < 0.01"
    elseif llr > 3.84
        "p < 0.05"
    else
        "not significant"
    end
    println("$(row.Collocate): LLR=$(round(llr, digits=2)) ($p_value)")
end
analysis: LLR=1.92 (not significant)
approach: LLR=1.92 (not significant)
careful: LLR=1.92 (not significant)
methods: LLR=1.92 (not significant)
random: LLR=1.53 (not significant)
requires: LLR=1.92 (not significant)
significance: LLR=1.53 (not significant)
statistical: LLR=6.09 (p < 0.05)
the: LLR=1.92 (not significant)
words: LLR=1.53 (not significant)
yields: LLR=1.92 (not significant)
Best Practices
1. Metric Selection
# For discovery
discovery_metrics = [PMI, PPMI]
# For validation
validation_metrics = [LLR, ChiSquare]
# For comparison across corpora
stable_metrics = [LogDice, PPMI]
2. Parameter Guidelines
# Default parameters for different purposes
const SYNTAX_PARAMS = (windowsize=2, minfreq=5)
const SEMANTIC_PARAMS = (windowsize=5, minfreq=5)
const TOPIC_PARAMS = (windowsize=10, minfreq=10)
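Because these presets are named tuples, they can be splatted straight into the keyword arguments. A small usage sketch (the sample text and node word are arbitrary):

```julia
using TextAssociations

const SEMANTIC_PARAMS = (windowsize=5, minfreq=5)

sample = "Preliminary results support the hypothesis; later results confirm the hypothesis."
ct = ContingencyTable(sample, "hypothesis"; SEMANTIC_PARAMS...)
```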
3. Validation Strategy
Always validate findings with multiple approaches:
- Use multiple metrics
- Check different window sizes
- Examine concordance lines
- Compare with domain knowledge
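A sketch of the first two checks (multiple metrics, multiple window sizes), reusing the API from earlier sections; the sample text, node word, and metric choices are arbitrary.

```julia
using TextAssociations
using DataFrames

sample = """
Careful data analysis supports the main hypothesis.
The data analysis pipeline was validated on new data.
Exploratory analysis revealed unexpected data patterns.
"""

# Collocates that rank highly under several metrics and window sizes are the most trustworthy
for ws in (2, 5, 10)
    ct = ContingencyTable(sample, "data"; windowsize=ws, minfreq=1)
    results = assoc_score([PMI, LogDice, LLR], ct)
    top = first(sort(results, :LogDice, rev=true), 3)
    println("windowsize=$ws: top collocates: ", join(top.Collocate, ", "))
end
```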
Next Steps
- Learn about Text Preprocessing options
- Understand Choosing Metrics for your task
- Explore Working with Corpora