Basic Examples
This page provides practical examples for common use cases in word association analysis.
Example 1: Academic Paper Analysis
Analyze abstracts to find domain-specific terminology:
using TextAssociations, DataFrames
# Sample academic abstracts
abstracts = """
Machine learning algorithms have revolutionized data analysis by enabling
automated pattern recognition. Deep learning, a subset of machine learning,
uses neural networks to process complex data structures.
Recent advances in artificial intelligence have led to breakthroughs in
natural language processing. Transformer models have become the foundation
for modern language understanding systems.
Computer vision applications leverage convolutional neural networks to
extract features from images. Object detection and image segmentation
are key tasks in computer vision research.
"""
# Find technical terminology
ct = ContingencyTable(abstracts, "learning"; windowsize=5, minfreq=2,
                      norm_config=TextNorm(strip_case=true, strip_punctuation=true))
# Calculate multiple metrics for validation
results = assoc_score([PMI, LogDice, LLR], ct)
# Filter for domain-specific terms (high scores across metrics)
technical_terms = filter(row ->
        row.PMI > 3.0 &&
        row.LogDice > 7.0 &&
        row.LLR > 10.83,  # p < 0.001
    results
)
println("Domain-specific collocates of 'learning':")
for row in eachrow(sort(technical_terms, :PMI, rev=true))
    println(" $(row.Collocate): PMI=$(round(row.PMI, digits=2))")
end
Domain-specific collocates of 'learning':
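The LLR cutoff of 10.83 is the chi-square critical value at one degree of freedom for p < 0.001, which is strict for a sample this small; that is why no collocate survives the filter above. If you would rather derive cutoffs than hard-code them, a minimal sketch using Distributions.jl (an extra dependency, not part of TextAssociations):
using Distributions
# LLR is asymptotically chi-square distributed with df = 1 under independence,
# so significance cutoffs come straight from the chi-square quantile function.
for p in (0.05, 0.01, 0.001)
    println("p < $p  =>  LLR > $(round(quantile(Chisq(1), 1 - p), digits=2))")
end
Relaxing the thresholds (or lowering minfreq) will surface candidates on small samples.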
Example 2: Social Media Trend Detection
Identify trending word combinations:
using TextAssociations
tweets = """
Breaking news: breakthrough in quantum computing announced today!
Quantum computing will transform cryptography and security.
Major tech companies investing billions in quantum research.
Scientists achieve quantum supremacy with new processor design.
Quantum algorithms solve problems classical computers cannot handle.
"""
# Analyze with larger window for social media
ct = ContingencyTable(tweets, "quantum"; windowsize=7, minfreq=1)
# Use LogDice for stable results across different sample sizes
results = assoc_score(LogDice, ct)
println("Trending with 'quantum' (LogDice scores):")
for row in eachrow(first(sort(results, :LogDice, rev=true), 5))
    println(" $(row.Collocate): $(round(row.LogDice, digits=2))")
end
Trending with 'quantum' (LogDice scores):
quantum: 14.0
in: 13.68
computing: 13.19
achieve: 13.0
new: 13.0
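If you only need the raw values (e.g., to feed a plot), assoc_score also accepts scores_only=true, as used in the benchmarking example later on this page; a sketch reusing the table above:
# Presumably returns just the score values, without the collocate metadata columns
scores = assoc_score(LogDice, ct; scores_only=true)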
Example 3: Comparative Analysis
Compare collocations across different genres:
using TextAssociations, DataFrames
# Two different text genres
technical = """
The algorithm optimizes performance through parallel processing.
System architecture supports distributed computing paradigms.
Database queries are optimized using indexing strategies.
"""
narrative = """
The story unfolds through multiple perspectives and timelines.
Character development drives the narrative forward compellingly.
Plot twists keep readers engaged throughout the journey.
"""
# Analyze same word in different contexts
function compare_genres(word::String)
    # Technical context
    ct_tech = ContingencyTable(technical, word; windowsize=3, minfreq=1)
    tech_results = assoc_score(PMI, ct_tech; scores_only=false)
    tech_results[!, :Genre] .= "Technical"

    # Narrative context
    ct_narr = ContingencyTable(narrative, word; windowsize=3, minfreq=1)
    narr_results = assoc_score(PMI, ct_narr; scores_only=false)
    narr_results[!, :Genre] .= "Narrative"

    # Combine results
    return vcat(tech_results, narr_results; cols=:union)
end
# Compare "the" in both genres
comparison = compare_genres("the")
grouped = groupby(comparison, :Genre)
println("Word associations by genre:")
for group in grouped
    genre = first(group.Genre)
    println("\n$genre context:")
    for row in eachrow(first(sort(group, :PMI, rev=true), 3))
        println(" $(row.Collocate): PMI=$(round(row.PMI, digits=2))")
    end
end
Word associations by genre:
Technical context:
algorithm: PMI=0.0
optimizes: PMI=0.0
performance: PMI=0.0
Narrative context:
character: PMI=-1.1
compellingly: PMI=-1.1
development: PMI=-1.1
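For a side-by-side view rather than grouped printing, the long-format frame can be pivoted with DataFrames' unstack; a sketch assuming the combined frame keeps the :Collocate, :Genre, and :PMI columns shown above:
# One row per collocate, one PMI column per genre
# (missing where a collocate occurs in only one genre)
wide = unstack(comparison[:, [:Collocate, :Genre, :PMI]], :Collocate, :Genre, :PMI)
println(first(wide, 5))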
Example 4: Multi-word Expression Detection
Find fixed phrases and idioms:
using TextAssociations, DataFrames
text = """
The project was completed on time and under budget.
We need to think outside the box for this solution.
Let's touch base next week to discuss progress.
The new approach is a game changer for our industry.
It's important to keep an eye on market trends.
The results speak for themselves in this case.
"""
# Identify components of multi-word expressions
function find_expressions(text::String)
    # Common function words that start expressions
    starters = ["on", "outside", "touch", "game", "keep", "speak"]
    expressions = DataFrame()

    for starter in starters
        ct = ContingencyTable(text, starter; windowsize=2, minfreq=1)
        results = assoc_score([PMI, Dice], ct)
        # High PMI + high Dice = likely fixed expression
        fixed = filter(row -> row.PMI > 2.0 && row.Dice > 0.3, results)
        if nrow(fixed) > 0
            fixed[!, :Starter] .= starter
            expressions = vcat(expressions, fixed; cols=:union)
        end
    end

    return expressions
end
expressions = find_expressions(text)
println("Potential multi-word expressions:")
for row in eachrow(expressions)
    println(" $(row.Starter) + $(row.Collocate)")
end
Potential multi-word expressions:
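The empty output above reflects the strict joint threshold on such a short text. For strictly adjacent pairs (true bigrams such as "touch base"), a narrower window is a natural probe; a sketch assuming windowsize=1 is accepted by the package:
# With a window of 1, only immediate neighbours count as co-occurrences
ct = ContingencyTable(text, "touch"; windowsize=1, minfreq=1)
bigrams = assoc_score([PMI, Dice], ct)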
Example 5: Time-sensitive Analysis
Track how a word's associations change between time periods:
using TextAssociations, DataFrames
# Documents with temporal progression
early_docs = """
Early computers used vacuum tubes for processing.
Punch cards were the primary input method.
Memory was measured in kilobytes.
"""
modern_docs = """
Modern computers use multi-core processors.
Cloud computing provides unlimited storage.
Memory is measured in terabytes.
"""
function temporal_comparison(word::String)
    # Early period
    ct_early = ContingencyTable(early_docs, word; windowsize=4, minfreq=1)
    early = assoc_score(PMI, ct_early)
    early[!, :Period] .= "Early"

    # Modern period
    ct_modern = ContingencyTable(modern_docs, word; windowsize=4, minfreq=1)
    modern = assoc_score(PMI, ct_modern)
    modern[!, :Period] .= "Modern"

    return vcat(early, modern; cols=:union)
end
temporal = temporal_comparison("computers")
println("\nEvolution of 'computers' associations:")
for period in ["Early", "Modern"]
    period_data = filter(row -> row.Period == period, temporal)
    if nrow(period_data) > 0
        println("\n$period period:")
        for row in eachrow(first(sort(period_data, :PMI, rev=true), 2))
            println(" $(row.Collocate): PMI=$(round(row.PMI, digits=2))")
        end
    end
end
Evolution of 'computers' associations:
Early period:
early: PMI=0.0
for: PMI=0.0
Modern period:
core: PMI=0.0
modern: PMI=0.0
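Beyond ranking, a simple set comparison highlights which collocates appear or disappear between periods; a sketch assuming the combined frame retains the :Period and :Collocate columns:
early_terms = Set(filter(r -> r.Period == "Early", temporal).Collocate)
modern_terms = Set(filter(r -> r.Period == "Modern", temporal).Collocate)
println("Gone after the early period: ", setdiff(early_terms, modern_terms))
println("New in the modern period: ", setdiff(modern_terms, early_terms))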
Example 6: Cross-linguistic Analysis
Apply the same workflow to non-English text. With strip_accents=true, forms are compared and reported without accents (e.g., τεχνητης rather than τεχνητής):
using TextAssociations
# Greek text example
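# Gloss: "Artificial intelligence is changing the world. Machine learning is
# part of artificial intelligence. Neural networks are powerful tools."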
greek_text = """
Η τεχνητή νοημοσύνη αλλάζει τον κόσμο.
Η μηχανική μάθηση είναι μέρος της τεχνητής νοημοσύνης.
Τα νευρωνικά δίκτυα είναι ισχυρά εργαλεία.
"""
# Configure for Greek
greek_config = TextNorm(
    strip_case=true,
    strip_accents=true,  # Remove tonos marks
    unicode_form=:NFD,
    strip_punctuation=true
)
# Analyze Greek text
ct = ContingencyTable(greek_text, "τεχνητής"; windowsize=3, minfreq=1,
                      norm_config=greek_config)
results = assoc_score(PMI, ct)
println("Greek text collocations:")
for row in eachrow(results)
    println(" $(row.Node) + $(row.Collocate): PMI=$(round(row.PMI, digits=2))")
end
Greek text collocations:
τεχνητης + ειναι: PMI=-0.69
τεχνητης + μερος: PMI=0.0
τεχνητης + νευρωνικα: PMI=0.0
τεχνητης + νοημοσυνης: PMI=0.0
τεχνητης + τα: PMI=0.0
τεχνητης + της: PMI=0.0
Example 7: Building a Collocation Dictionary
Create a reference resource of strong collocations:
using TextAssociations, DataFrames
function build_collocation_dict(text::String, min_llr::Float64=3.0)
    # Key words to analyze
    keywords = ["data", "analysis", "model", "system", "process"]
    dict = DataFrame()

    for keyword in keywords
        # Skip if word not in text
        if !occursin(lowercase(keyword), lowercase(text))
            continue
        end

        ct = ContingencyTable(text, keyword; windowsize=5, minfreq=2)
        results = assoc_score([PMI, LogDice, LLR], ct)
        # Strong collocations only
        strong = filter(row -> row.LLR >= min_llr, results)
        if nrow(strong) > 0
            dict = vcat(dict, strong; cols=:union)
        end
    end

    # Sort by node, then PMI (descending)
    sort!(dict, [:Node, order(:PMI, rev=true)])
    return dict
end
sample_text = """
Data analysis requires careful data preparation and data validation.
Statistical models help analyze complex data patterns.
System design influences system performance and system reliability.
Process optimization improves process efficiency significantly.
Model validation ensures model accuracy and model robustness.
"""
dictionary = build_collocation_dict(sample_text, 2.0)
println("\nCollocation Dictionary:")
let current_node = ""
    for row in eachrow(dictionary)
        if row.Node != current_node
            current_node = row.Node
            println("\n$current_node:")
        end
        println(" → $(row.Collocate) (PMI: $(round(row.PMI, digits=2)))")
    end
end
Collocation Dictionary:
analysis:
→ data (PMI: -1.39)
data:
→ system (PMI: -1.61)
→ data (PMI: -1.67)
model:
→ model (PMI: -1.1)
process:
→ system (PMI: -0.69)
→ process (PMI: -1.1)
system:
→ process (PMI: -1.1)
→ system (PMI: -1.1)
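A dictionary like this is typically saved for reuse; a minimal sketch assuming CSV.jl is installed (not a TextAssociations dependency):
using CSV
# Persist the collocation dictionary as a flat file
CSV.write("collocation_dictionary.csv", dictionary)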
Example 8: Performance Benchmarking
Compare the runtime of different configurations and metrics:
using TextAssociations
using BenchmarkTools
text = repeat("The quick brown fox jumps over the lazy dog. ", 100)
# Benchmark different configurations
function benchmark_configs()
    configs = [
        (window=3, minfreq=1, desc="Small window, low threshold"),
        (window=5, minfreq=5, desc="Medium window, medium threshold"),
        (window=10, minfreq=10, desc="Large window, high threshold")
    ]

    println("Configuration benchmarks:")
    for config in configs
        time = @elapsed begin
            ct = ContingencyTable(text, "the"; windowsize=config.window, minfreq=config.minfreq)
            results = assoc_score(PMI, ct; scores_only=true)
        end
        println(" $(config.desc):")
        println(" Time: $(round(time*1000, digits=2))ms")
    end
end
# Benchmark metrics
function benchmark_metrics()
    ct = ContingencyTable(text, "quick"; windowsize=5, minfreq=1)
    metrics = [PMI, LogDice, LLR, Dice]

    println("\nMetric benchmarks:")
    for metric in metrics
        time = @elapsed assoc_score(metric, ct; scores_only=true)
        println(" $metric: $(round(time*1000, digits=3))ms")
    end
end
benchmark_configs()
benchmark_metrics()
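Note that @elapsed times a single call, so the first run includes JIT compilation. BenchmarkTools (already loaded above) gives steadier numbers; a sketch using @belapsed, which runs many samples and reports the minimum time in seconds:
ct = ContingencyTable(text, "quick"; windowsize=5, minfreq=1)
for metric in (PMI, LogDice, LLR, Dice)
    # Interpolate with $ so setup cost is excluded from the measurement
    t = @belapsed assoc_score($metric, $ct; scores_only=true)
    println(" $metric: $(round(t*1000, digits=3))ms")
end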
Next Steps
- For corpus-level analysis, see Working with Corpora
- To understand metric selection, see Choosing Metrics
- For advanced features, see Temporal Analysis