Quick Tutorial
This tutorial provides a comprehensive introduction to TextAssociations.jl. By the end, you'll understand how to analyze word associations in text and interpret the results.
Prerequisites
This tutorial assumes you have:
- Julia 1.9 or later installed
- TextAssociations.jl installed (using Pkg; Pkg.add("TextAssociations"))
- Basic familiarity with Julia
Your First Analysis
Let's start with a simple example analyzing word associations in text about technology.
Step 1: Load the Package
using TextAssociations
using DataFrames
Step 2: Prepare Your Text
docs = [
"Machine learning algorithms can learn from data without explicit programming.",
"Deep learning is a subset of machine learning that uses neural networks.",
"Artificial intelligence includes machine learning and deep learning techniques.",
"Neural networks are the foundation of modern deep learning systems."
]
# Combine documents into one text
s = join(docs, " ")
println("Text length: ", length(s), " characters")
Text length: 298 characters
Step 3: Create a Contingency Table
The contingency table captures co-occurrence patterns between your target word and its context.
# Analyze the word "learning"
ct = ContingencyTable(
s,
"learning";
windowsize=3, # Consider 3 words on each side
minfreq=1 # Include words appearing at least once
)
println("Contingency table created for 'learning'")
Contingency table created for 'learning'
Parameters explained:
- windowsize=3: Looks 3 words to the left and right of "learning"
- minfreq=1: Only includes words that appear at least once as collocates
Step 4: Calculate Association Scores
Now let's calculate PMI (Pointwise Mutual Information) scores to identify strong collocates.
# Calculate PMI scores
results = assoc_score(PMI, ct)
println("\nTop 5 collocates of 'learning':")
println(first(sort(results, :PMI, rev=true), 5))
Top 5 collocates of 'learning':
5×4 DataFrame
Row │ Node Collocate Frequency PMI
│ String String Int64 Float64
─────┼───────────────────────────────────────────
1 │ learning and 1 -1.60944
2 │ learning deep 3 -1.60944
3 │ learning subset 1 -1.60944
4 │ learning machine 3 -1.79176
5 │ learning algorithms 1 -1.79176
Step 5: Try Multiple Metrics
Different metrics reveal different aspects of associations.
# Calculate multiple metrics at once
multi_results = assoc_score([PMI, LogDice, LLR], ct)
println("\nTop 3 collocates with multiple metrics:")
println(first(sort(multi_results, :PMI, rev=true), 3))
Top 3 collocates with multiple metrics:
3×6 DataFrame
Row │ Node Collocate Frequency PMI LogDice LLR
│ String String Int64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────
1 │ learning and 1 -1.60944 12.415 1.88004
2 │ learning deep 3 -1.60944 13.585 6.76593
3 │ learning subset 1 -1.60944 12.415 1.88004
Working with Corpora
For analyzing multiple documents, use the Corpus functionality.
using TextAnalysis: StringDocument
# Create a simple corpus directly
doc_objects = [StringDocument(d) for d in docs]
corpus = Corpus(doc_objects)
# Analyze "learning" across the corpus
corpus_results = analyze_corpus(
corpus,
"learning", # Node word
PMI, # Metric
windowsize=3, # Context window
minfreq=2 # Min frequency across corpus
)
println("Top collocates of 'learning' in corpus:")
println(first(corpus_results, 5))
Top collocates of 'learning' in corpus:
4×5 DataFrame
Row │ Node Collocate Score Frequency DocFrequency
│ String String Float64 Int64 Int64
─────┼─────────────────────────────────────────────────────────
1 │ learning learning -0.693147 2 1
2 │ learning of -1.09861 2 2
3 │ learning deep -1.38629 3 3
4 │ learning machine -1.60944 3 3
Understanding the Results
Interpreting PMI Scores
PMI (Pointwise Mutual Information) measures how much more likely two words co-occur than by chance:
- PMI > 0: Words co-occur more than expected (positive association)
- PMI = 0: Co-occurrence matches random expectation
- PMI < 0: Words co-occur less than expected (negative association)
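To make the interpretation concrete, you can compute a textbook PMI value by hand. The sketch below uses the standard definition PMI = log(P(node, collocate) / (P(node) P(collocate))) with natural logarithms and made-up counts; the exact counting and normalization inside assoc_score may differ, so treat this as an illustration rather than a reproduction of the package's internals.
# Hand-computed PMI from hypothetical counts (illustration only)
f_xy = 3          # co-occurrences of node and collocate
f_x, f_y = 3, 3   # individual frequencies
N = 15            # total tokens considered
pmi = log((f_xy / N) / ((f_x / N) * (f_y / N)))
println("Hand-computed PMI: ", round(pmi, digits=3))  # log(5) ≈ 1.609, a positive association
On a text as small as the tutorial example, most observed PMI values come out negative, as seen in the output above.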
# Filter for strong associations
strong_assoc = filter(row -> row.PMI > 3.0, results)
println("\nStrong associations (PMI > 3):")
println(strong_assoc)
Strong associations (PMI > 3):
0×4 DataFrame
Row │ Node Collocate Frequency PMI
│ String String Int64 Float64
─────┴───────────────────────────────────────
Comparing Metrics
Different metrics highlight different aspects:
# Create comparison
comparison = assoc_score([PMI, LogDice, Dice], ct)
println("\nMetric comparison for top collocate:")
if nrow(comparison) > 0
top_row = first(sort(comparison, :PMI, rev=true))
println(" Collocate: ", top_row.Collocate)
println(" PMI: ", round(top_row.PMI, digits=2))
println(" LogDice: ", round(top_row.LogDice, digits=2))
println(" Dice: ", round(top_row.Dice, digits=3))
end
Metric comparison for top collocate:
Collocate: and
PMI: -1.61
LogDice: 12.42
Dice: 0.333
Text Preprocessing
Control how text is normalized before analysis:
using TextAnalysis: text
# Example with case-sensitive analysis
text_mixed = "Machine Learning and machine learning are related. Machine learning is powerful."
# Default: case normalization ON
config_lower = TextNorm(strip_case=true)
ct_lower = ContingencyTable(text_mixed, "learning";
windowsize=3, minfreq=1, norm_config=config_lower)
# Case-sensitive: case normalization OFF
config_case = TextNorm(strip_case=false)
ct_case = ContingencyTable(text_mixed, "learning";
windowsize=3, minfreq=1, norm_config=config_case)
println("Lowercase normalization: ", nrow(assoc_score(PMI, ct_lower)), " collocates")
println("Case-sensitive: ", nrow(assoc_score(PMI, ct_case)), " collocates")
Lowercase normalization: 7 collocates
Case-sensitive: 8 collocates
Preprocessing Options
# Full preprocessing configuration
full_config = TextNorm(
strip_case=true, # Convert to lowercase
strip_punctuation=true, # Remove punctuation
strip_accents=false, # Keep diacritics
normalize_whitespace=true, # Collapse multiple spaces
unicode_form=:NFC # Unicode normalization
)
# Apply preprocessing
preprocessed = prep_string(s, full_config)
println("Preprocessed text (first 100 chars):")
println(first(text(preprocessed), 100), "...")
Preprocessed text (first 100 chars):
machine learning algorithms can learn from data without explicit programming deep learning is a subs...
Common Workflows
1. Find Strong Collocations
function find_strong_collocations(text, word, threshold=3.0)
ct = ContingencyTable(text, word; windowsize=5, minfreq=2)
results = assoc_score([PMI, LogDice], ct)
# Filter for strong associations
strong = filter(row -> row.PMI > threshold, results)
sort!(strong, :PMI, rev=true)
return strong
end
collocations = find_strong_collocations(s, "learning")
println("\nStrong collocations found: ", nrow(collocations))
Strong collocations found: 0
2. Compare Multiple Words
function compare_words(text, words, metric=PMI)
all_results = DataFrame()
for word in words
ct = ContingencyTable(text, word; windowsize=5, minfreq=1)
word_results = assoc_score(metric, ct)
word_results.Node .= word
append!(all_results, word_results)
end
return all_results
end
comparison_results = compare_words(s, ["learning", "neural", "deep"])
println("\nComparison across words:")
println(first(sort(comparison_results, :PMI, rev=true), 10))
Comparison across words:
10×4 DataFrame
Row │ Node Collocate Frequency PMI
│ String String Int64 Float64
─────┼────────────────────────────────────────────
1 │ neural networks 2 -0.693147
2 │ neural of 2 -0.693147
3 │ neural deep 1 -0.693147
4 │ neural foundation 1 -0.693147
5 │ neural includes 1 -0.693147
6 │ neural intelligence 1 -0.693147
7 │ neural techniques 1 -0.693147
8 │ neural that 1 -0.693147
9 │ neural the 1 -0.693147
10 │ neural uses 1 -0.693147
3. Parameter Tuning
function tune_parameters(text, word)
configs = [
(ws=2, mf=1, name="Narrow"),
(ws=5, mf=2, name="Balanced"),
(ws=10, mf=3, name="Wide")
]
for config in configs
ct = ContingencyTable(text, word;
windowsize=config.ws, minfreq=config.mf)
tune_results = assoc_score(PMI, ct)
println("$(config.name): $(nrow(tune_results)) collocates")
end
end
println("\nParameter tuning for 'learning':")
tune_parameters(s, "learning")
Parameter tuning for 'learning':
Narrow: 16 collocates
Balanced: 6 collocates
Wide: 3 collocates
Next Steps
Now that you understand the basics, explore:
- Metrics Guide: Learn about all available metrics
- Corpus Analysis: Advanced corpus techniques
- Preprocessing: Detailed text normalization
- API Reference: Complete function documentation
Quick Reference
Basic Analysis Pattern
# 1. Load package
using TextAssociations
# 2. Create contingency table
ct = ContingencyTable(s, "word"; windowsize=5, minfreq=2)
# 3. Calculate scores
results = assoc_score(PMI, ct)
# 4. Examine results
sort!(results, :PMI, rev=true)
Common Parameters
Parameter | Typical Range | Description
---|---|---
windowsize | 2-10 | Context window size
minfreq | 1-5 | Minimum co-occurrence frequency
strip_case | true/false | Convert to lowercase
strip_punctuation | true/false | Remove punctuation
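As a reminder of where each parameter is supplied, the sketch below (reusing the tutorial's s and node word) passes windowsize and minfreq to ContingencyTable and the strip_* options through a TextNorm config:
# Window and frequency settings go to ContingencyTable;
# normalization settings go to TextNorm
ct_ref = ContingencyTable(s, "learning";
    windowsize=5, minfreq=2,
    norm_config=TextNorm(strip_case=true, strip_punctuation=true))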
Recommended Metrics
- PMI: General-purpose, interpretable
- LogDice: Balanced, less affected by frequency
- LLR: Statistical significance testing
- Dice: Simple similarity measure
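All four can be requested in a single call. This sketch assumes the ct table from Step 3 is still in scope and sorts by LogDice, which is less sensitive to raw frequency:
# Score one contingency table with all recommended metrics at once
recommended = assoc_score([PMI, LogDice, LLR, Dice], ct)
sort!(recommended, :LogDice, rev=true)
println(first(recommended, 5))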
Troubleshooting
No Results Found
# Check if the word exists in the preprocessed text
import TextAnalysis  # needed to call TextAnalysis.tokens
doc = prep_string(s, TextNorm(strip_case=true))
toks = TextAnalysis.tokens(doc)
word_count = count(==("yourword"), toks)
println("Word appears $word_count times")
Empty DataFrame
Possible causes:
- minfreq too high; try minfreq=1
- windowsize too small; try windowsize=10
- Word not in text; check spelling and case
- Text too short; need more context
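If the thresholds are the likely culprit, a minimal sketch is to rebuild the table with the most permissive settings and check whether any collocates appear at all (using the tutorial's s as an example):
# Rerun with a wide window and no frequency filtering
ct_relaxed = ContingencyTable(s, "learning"; windowsize=10, minfreq=1)
relaxed = assoc_score(PMI, ct_relaxed)
println("Collocates found with relaxed settings: ", nrow(relaxed))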
Memory Issues
# Use scores_only for large analyses
scores = assoc_score(PMI, ct, scores_only=true) # Returns Vector{Float64}
Practice Exercises
- Analyze your own text data
- Compare different window sizes
- Try all available metrics
- Build a collocate extraction pipeline
- Analyze a corpus of documents
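For the pipeline exercise, here is a minimal starting sketch that chains the steps covered above; collocate_pipeline is just an illustrative name, and on the tiny tutorial text it may return no rows because all PMI scores there are negative.
# Minimal end-to-end sketch: preprocess, score, and keep the strongest collocates
function collocate_pipeline(raw_text, node; windowsize=5, minfreq=2, threshold=0.0)
    config = TextNorm(strip_case=true, strip_punctuation=true)
    ct = ContingencyTable(raw_text, node; windowsize=windowsize,
                          minfreq=minfreq, norm_config=config)
    results = assoc_score([PMI, LogDice], ct)
    sort!(filter(row -> row.PMI > threshold, results), :PMI, rev=true)
end
pipeline_results = collocate_pipeline(s, "learning")
println(first(pipeline_results, 5))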
Happy analyzing!