Main Functions
This section provides comprehensive documentation for all main API functions in TextAssociations.jl.
Function Categories
Core Evaluation Functions
assoc_score - Primary Evaluation Function
TextAssociations.assoc_score — Function

assoc_score(metricType::Type{<:AssociationMetric}, x::AssociationDataFormat;
            scores_only::Bool=false,
            tokens::Union{Nothing,Vector{String}}=nothing,
            kwargs...)

Evaluate a metric on any association data format (CT or CCT).

- If the metric requires tokens (e.g., LexicalGravity), pass tokens=... or implement assoc_tokens(::YourType) to supply them automatically (see the sketch below).
- Returns a DataFrame by default: [:Node, :Collocate, :Frequency, :<MetricName>].
- If scores_only=true, returns only the scores Vector.
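As an alternative to passing tokens on every call, a custom data format can implement the assoc_tokens hook mentioned above. A minimal sketch for a hypothetical custom type; the struct name, its field, and the subtyping are illustrative assumptions:
using TextAssociations
# Hypothetical wrapper that carries its own token stream.
struct MyWindowedData <: TextAssociations.AssociationDataFormat
    tokens::Vector{String}
    # ... plus whatever co-occurrence data the format needs ...
end
# Token-requiring metrics (e.g., LexicalGravity) can now obtain tokens
# without an explicit `tokens=` keyword.
TextAssociations.assoc_tokens(x::MyWindowedData) = x.tokens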
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
            x::AssociationDataFormat;
            scores_only::Bool=false,
            tokens::Union{Nothing,Vector{String}}=nothing,
            kwargs...)

Evaluate multiple metrics on CT or CCT.

- Returns a DataFrame with one column per metric by default.
- If scores_only=true, returns Dict{String,Vector{Float64}}.
assoc_score(metricType::Type{<:AssociationMetric},
            inputstring::AbstractString,
            node::AbstractString;
            windowsize::Int,
            minfreq::Int=5,
            scores_only::Bool=false,
            norm_config::TextNorm=TextNorm(),
            tokens::Union{Nothing,Vector{String}}=nothing,
            kwargs...)

Convenience overload to compute a metric directly from a raw string.
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
            inputstring::AbstractString,
            node::AbstractString;
            windowsize::Int,
            minfreq::Int=5,
            scores_only::Bool=false,
            norm_config::TextNorm=TextNorm(),
            tokens::Union{Nothing,Vector{String}}=nothing,
            kwargs...)

Convenience overload to compute multiple metrics directly from raw text.
assoc_score(metricType::Type{<:AssociationMetric}, corpus::Corpus, node::AbstractString;
windowsize::Int=5, minfreq::Int=5, kwargs...)
Evaluate a metric on a corpus; a convenience method that delegates to analyze_corpus.
assoc_score(metricType::Type{<:AssociationMetric}, corpus::Corpus,
nodes::Vector{String}; windowsize::Int=5, minfreq::Int=5,
top_n::Int=100, kwargs...)
Evaluate a metric on multiple nodes in a corpus. Returns a Dict{String,DataFrame} with results for each node.
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
corpus::Corpus, nodes::Vector{String};
windowsize::Int=5, minfreq::Int=5, top_n::Int=100, kwargs...)
Evaluate multiple metrics on multiple nodes in a corpus. Returns a Dict{String,DataFrame} with combined metric results for each node.
assoc_score(metric::Type{<:AssociationMetric}, corpus::Corpus;
nodes::Vector{String}, windowsize::Int=5, minfreq::Int=5,
top_n::Int=100, kwargs...)
Alternative syntax with nodes as keyword argument.
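A sketch of the corpus-level forms; it assumes corpus is an already-built Corpus (construction is covered in Corpus Functions):
using TextAssociations
using DataFrames
# corpus = ...  (load or build a Corpus first; see Corpus Functions)
# A single metric over several nodes returns a Dict{String,DataFrame}.
per_node = assoc_score(PMI, corpus, ["data", "learning"];
                       windowsize=5, minfreq=5, top_n=100)
for (node, df) in per_node
    println(node, ": ", nrow(df), " collocates")
end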
The assoc_score function is the primary interface for computing association metrics. It supports multiple signatures for different use cases.
Method Signatures
# Single metric on prepared data
assoc_score(metric::Type{<:AssociationMetric},
data::AssociationDataFormat;
scores_only::Bool=false,
tokens::Union{Nothing,Vector{String}}=nothing) -> Union{DataFrame, Vector}
# Multiple metrics on prepared data
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
data::AssociationDataFormat;
scores_only::Bool=false) -> Union{DataFrame, Dict}
# Direct from text (single metric)
assoc_score(metric::Type{<:AssociationMetric},
s::AbstractString,
node::AbstractString;
windowsize::Int,
minfreq::Int=5,
scores_only::Bool=false) -> Union{DataFrame, Vector}
Parameters
Parameter | Type | Description | Default |
---|---|---|---|
metric/metrics | Type or Vector{DataType} | Association metric(s) to compute | Required |
data | AssociationDataFormat | ContingencyTable or CorpusContingencyTable | Required |
scores_only | Bool | Return only numeric scores | false |
tokens | Vector{String} | Token list for metrics that need it (see the sketch below) | nothing |
text | AbstractString | Raw text for direct evaluation | - |
node | AbstractString | Target word | - |
windowsize | Int | Context window size | - |
minfreq | Int | Minimum frequency threshold | 5 |
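For metrics that require the raw token stream, the tokens keyword supplies it explicitly. A short sketch; it assumes, per the docstring above, that LexicalGravity is such a metric:
using TextAssociations
using TextAnalysis
s = "machine learning and deep learning methods learn from data"
ct = ContingencyTable(s, "learning"; windowsize=3, minfreq=1)
# Supply the token stream of the same preprocessed text.
toks = TextAnalysis.tokens(prep_string(s))
gravity = assoc_score(LexicalGravity, ct; tokens=toks)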
Return Values
Default (scores_only=false): returns a DataFrame with columns:
- Node: Target word
- Collocate: Co-occurring word
- Frequency: Co-occurrence frequency
- [MetricName]: Score column(s) named after metric(s)
Performance mode (scores_only=true):
- Single metric: Vector{Float64} of scores
- Multiple metrics: Dict{String, Vector{Float64}}
Examples
Basic Usage
using TextAssociations
using TextAnalysis
s = """
Data science combines mathematics, statistics, and computer science.
Machine learning is a crucial part of data science.
Data analysis helps extract insights from data.
"""
# Create contingency table
ct = ContingencyTable(s, "data"; windowsize=3, minfreq=1)
# Single metric evaluation
pmi_results = assoc_score(PMI, ct)
println("PMI Results:")
println(pmi_results)
PMI Results:
12×4 DataFrame
Row │ Node Collocate Frequency PMI
│ String String Int64 Float64
─────┼──────────────────────────────────────────
1 │ data analysis 1 -1.79176
2 │ data combines 1 -1.38629
3 │ data crucial 1 -1.38629
4 │ data data 2 -1.38629
5 │ data extract 1 -1.09861
6 │ data from 1 -1.38629
7 │ data helps 1 -1.38629
8 │ data insights 1 -1.38629
9 │ data mathematics 1 -1.38629
10 │ data of 1 -1.09861
11 │ data part 1 -1.38629
12 │ data science 2 -1.09861
On a text this small nearly every word co-occurs with the node, so the PMI values are negative; the output here mainly illustrates the result format.
Multiple Metrics
# Evaluate multiple metrics simultaneously
metrics = [PMI, LogDice, LLR, Dice]
multi_results = assoc_score(metrics, ct)
println("\nColumns in results: ", names(multi_results))
println("Top result by PMI:")
println(first(sort(multi_results, :PMI, rev=true), 1))
Columns in results: ["Node", "Collocate", "Frequency", "PMI", "LogDice", "LLR", "Dice"]
Top result by PMI:
1×7 DataFrame
Row │ Node Collocate Frequency PMI LogDice LLR Dice
│ String String Int64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────
1 │ data science 2 -1.09861 13.6781 6.18896 0.8
Direct from Text
# Skip contingency table creation
results = assoc_score(PMI, s, "science", windowsize=4, minfreq=1)
println("\nDirect evaluation results:")
println(results)
Direct evaluation results:
16×4 DataFrame
Row │ Node Collocate Frequency PMI
│ String String Int64 Float64
─────┼────────────────────────────────────────────
1 │ science a 1 -1.79176
2 │ science analysis 1 -1.09861
3 │ science and 1 -0.693147
4 │ science combines 1 -1.09861
5 │ science computer 1 -1.09861
6 │ science crucial 1 -1.09861
7 │ science data 3 -1.38629
8 │ science extract 1 -1.09861
9 │ science helps 1 -1.09861
10 │ science is 1 -1.09861
11 │ science learning 1 -1.09861
12 │ science machine 1 -1.09861
13 │ science mathematics 1 -0.693147
14 │ science of 1 -1.09861
15 │ science part 1 -1.09861
16 │ science statistics 1 -0.693147
Performance Mode
# Get only scores for better performance
scores = assoc_score(PMI, ct, scores_only=true)
println("\nScore vector: ", scores)
println("Length: ", length(scores))
# Multiple metrics with scores_only
score_dict = assoc_score([PMI, LogDice], ct, scores_only=true)
println("\nScore dictionary keys: ", keys(score_dict))
Score vector: [-1.791759469228055, -1.3862943611198906, -1.3862943611198906, -1.3862943611198904, -1.0986122886681098, -1.3862943611198906, -1.3862943611198906, -1.3862943611198906, -1.3862943611198906, -1.0986122886681098, -1.3862943611198906, -1.0986122886681096]
Length: 12
Score dictionary keys: ["LogDice", "PMI"]
Advanced Usage
Custom Filtering Pipeline
using TextAssociations
using TextAnalysis
using DataFrames
# Evaluate and filter in one pipeline
function analyze_with_thresholds(text, word, thresholds)
ct = ContingencyTable(text, word; windowsize=5, minfreq=2)
results = assoc_score([PMI, LogDice, LLR], ct)
# Apply multiple thresholds
filtered = filter(row ->
row.PMI >= thresholds[:pmi] &&
row.LogDice >= thresholds[:logdice] &&
row.LLR >= thresholds[:llr],
results
)
return sort(filtered, :PMI, rev=true)
end
thresholds = Dict(:pmi => 2.0, :logdice => 5.0, :llr => 3.84)
filtered = analyze_with_thresholds(s, "data", thresholds)
println("Filtered results: ", nrow(filtered), " collocates")
Filtered results: 0 collocates
No pair in this toy text clears all three thresholds at once; on realistic corpora these cutoffs (PMI ≥ 2, LogDice ≥ 5, LLR ≥ 3.84) are reasonable starting points.
Text Processing Functions
prep_string - Text Preprocessing
TextAssociations.prep_string — Function

prep_string(input_path::AbstractString, config::TextNorm) -> StringDocument

Prepare and preprocess text from various sources into a StringDocument.

Arguments

- input_path: File path, directory path, or raw text string.
- config: A TextNorm configuration controlling all preprocessing options.

Returns

A preprocessed StringDocument object suitable for downstream corpus analysis.
Preprocesses text with extensive customization options for different languages and domains.
Parameters
All options except input_path are fields of the TextNorm configuration:
Parameter | Type | Description | Default |
---|---|---|---|
input_path | AbstractString | File path, directory, or raw text | Required |
strip_punctuation | Bool | Remove punctuation | true |
punctuation_to_space | Bool | Replace punctuation with spaces | true |
strip_whitespace | Bool | Remove all whitespace | false |
normalize_whitespace | Bool | Collapse multiple spaces | true |
strip_case | Bool | Convert to lowercase | true |
strip_accents | Bool | Remove diacritical marks | false |
unicode_form | Symbol | Unicode normalization form | :NFC |
use_prepare | Bool | Apply TextAnalysis pipeline | false |
Examples
Basic Preprocessing
using TextAssociations
using TextAnalysis
# Default preprocessing
s = "Hello, WORLD!!! Multiple spaces..."
doc = prep_string(s)
println("Default: '", text(doc), "'")
# Custom preprocessing
doc_custom = prep_string(s,
TextNorm(strip_case=false, # Keep original case
strip_punctuation=false, # Keep punctuation
normalize_whitespace=true # Fix spacing only
))
println("Custom: '", text(doc_custom), "'")
Default: 'hello world multiple spaces '
Custom: 'Hello, WORLD!!! Multiple spaces...'
Multilingual Text
using TextAssociations
using TextAnalysis
# Greek text with diacritics
greek = "Καλημέρα! Η ανάλυση κειμένου είναι σημαντική."
# Keep diacritics (default)
doc_with = prep_string(greek, TextNorm(strip_accents=false))
println("With accents: '", text(doc_with), "'")
# Remove diacritics
doc_without = prep_string(greek, TextNorm(strip_accents=true))
println("Without accents: '", text(doc_without), "'")
With accents: 'καλημέρα η ανάλυση κειμένου είναι σημαντική '
Without accents: 'καλημερα η αναλυση κειμενου ειναι σημαντικη '
Processing Files and Directories
using TextAssociations
using TextAnalysis
# From file
# doc = prep_string("document.txt")
# From directory (concatenates all .txt files)
# doc = prep_string("corpus/")
# Example with a temporary file
temp_file = tempname() * ".txt"
write(temp_file, "Sample text from file.")
doc = prep_string(temp_file)
println("From file: '", text(doc), "'")
rm(temp_file)
From file: 'sample text from file '
build_vocab - Vocabulary Creation
TextAssociations.build_vocab — Function

build_vocab(input::Union{StringDocument,Vector{String}}) -> OrderedDict

Create a vocabulary from text input: an ordered dictionary mapping each unique word to an index.

Parameters

- input: Either a StringDocument or a Vector{String}

Returns

- OrderedDict{String,Int}: Word-to-index mapping
Examples
using TextAssociations
# From document
doc = prep_string("The quick brown fox jumps over the lazy dog")
vocab = build_vocab(doc)
println("Vocabulary size: ", length(vocab))
println("First 5 words:")
for (word, idx) in Iterators.take(vocab, 5)
println(" $idx: '$word'")
end
# From word vector
words = ["apple", "banana", "cherry", "apple"] # Duplicates removed
vocab2 = build_vocab(words)
println("\nUnique words: ", length(vocab2))
Vocabulary size: 8
First 5 words:
1: 'the'
2: 'quick'
3: 'brown'
4: 'fox'
5: 'jumps'
Unique words: 3
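Because the result is an OrderedDict{String,Int}, it supports ordinary dictionary operations; continuing from the example above:
println(vocab["fox"])         # 4, per the ordering shown above
println(haskey(vocab, "cat")) # false: "cat" never occurs in the text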
Utility Functions
available_metrics - List Available Metrics
TextAssociations.available_metrics — Function

available_metrics() -> Vector{Symbol}

Returns a vector of all supported association metrics as symbols.
Example
using TextAssociations
metrics = available_metrics()
println("Total available metrics: ", length(metrics))
println("\nInformation-theoretic metrics:")
info_metrics = filter(m -> occursin("PMI", String(m)) || m == :PPMI, metrics)
println(info_metrics)
println("\nStatistical metrics:")
stat_metrics = filter(m -> m in [:LLR, :ChiSquare, :Tscore, :Zscore], metrics)
println(stat_metrics)
Total available metrics: 47
Information-theoretic metrics:
[:PMI, :PMI², :PMI³, :PPMI]
Statistical metrics:
[:LLR, :Tscore, :Zscore, :ChiSquare]
cached_data - Access Lazy Data
TextAssociations.cached_data — Function

cached_data(z::LazyProcess{T,R}) -> R

Extract data from a LazyProcess, computing it on first access and returning the cached result thereafter.

Parameters

- z::LazyProcess: Lazy process wrapper

Returns

- The computed/cached result
Example
using TextAssociations
# ContingencyTable uses lazy evaluation internally
ct = ContingencyTable("sample text", "text"; windowsize=3, minfreq=1)
# First access computes the table
println("First access...")
data1 = cached_data(ct.con_tbl)
# Second access uses cache (no computation)
println("Second access...")
data2 = cached_data(ct.con_tbl)
println("Same object? ", data1 === data2) # true - same cached object
First access...
Second access...
Same object? true
document - Access Document
TextAssociations.document — Function

document(input::LazyInput) -> StringDocument

Extract the document from a LazyInput wrapper.

Parameters

- input::LazyInput: Lazy input wrapper

Returns

- StringDocument: The stored document
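Example
A minimal sketch; the field name input_ref on ContingencyTable is an assumption for illustration (see Core Types for the actual layout):
using TextAssociations
using TextAnalysis
ct = ContingencyTable("sample text about text analysis", "text";
                      windowsize=3, minfreq=1)
# `input_ref` is a hypothetical field holding the LazyInput wrapper.
doc = document(ct.input_ref)
println(text(doc))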
Batch Processing Functions
Processing Multiple Nodes
using TextAssociations
using DataFrames
s = """
Artificial intelligence and machine learning are transforming technology.
Deep learning, a subset of machine learning, uses neural networks.
Machine learning algorithms can learn from data without explicit programming.
"""
# Analyze multiple words
nodes = ["learning", "machine", "neural", "data"]
results = Dict{String, DataFrame}()
for node in nodes
ct = ContingencyTable(s, node; windowsize=3, minfreq=1)
results[node] = assoc_score(PMI, ct)
end
println("Results per node:")
for (node, df) in results
println(" $node: $(nrow(df)) collocates, top PMI = $(round(maximum(df.PMI), digits=2))")
end
Results per node:
neural: 4 collocates, top PMI = 0.0
data: 6 collocates, top PMI = 0.0
machine: 14 collocates, top PMI = -0.69
learning: 16 collocates, top PMI = -1.1
Comparative Analysis
using TextAssociations
using DataFrames
using Statistics
# Compare different window sizes
function compare_parameters(text, word)
params = [
(window=2, minfreq=1),
(window=5, minfreq=1),
(window=10, minfreq=1)
]
comparison = DataFrame()
for p in params
ct = ContingencyTable(text, word; windowsize=p.window, minfreq=p.minfreq)
df = assoc_score(PMI, ct)
df.WindowSize .= p.window
append!(comparison, df)
end
return comparison
end
comparison = compare_parameters(s, "learning")
grouped = groupby(comparison, :WindowSize)
summary = combine(grouped,
nrow => :NumCollocates,
:PMI => mean => :AvgPMI,
:PMI => maximum => :MaxPMI
)
println("\nWindow size comparison:")
println(summary)
Window size comparison:
3×4 DataFrame
Row │ WindowSize NumCollocates AvgPMI MaxPMI
│ Int64 Int64 Float64 Float64
─────┼───────────────────────────────────────────────
1 │ 2 4 -0.274653 0.0
2 │ 5 10 -0.109861 0.0
3 │ 10 15 -0.192691 0.0
Performance Optimization
Memory-Efficient Processing
using TextAssociations
# Use scores_only for large-scale processing
function process_many_nodes(text, nodes)
scores = Dict{String, Vector{Float64}}()
for node in nodes
ct = ContingencyTable(text, node; windowsize=5, minfreq=1)
# Get only scores to save memory
scores[node] = assoc_score(PMI, ct, scores_only=true)
end
return scores
end
nodes = ["intelligence", "artificial", "learning"]
score_dict = process_many_nodes(s, nodes)
println("\nScore vectors per node:")
for (node, scores) in score_dict
if isempty(scores)
println(" $node: no collocates found")
else
println(" $node: $(length(scores)) scores, max = $(round(maximum(scores), digits=2))")
end
end
Score vectors per node:
artificial: no collocates found
learning: 10 scores, max = 0.0
intelligence: no collocates found
Parallel Evaluation
using TextAssociations
# Function for parallel processing (conceptual)
function parallel_evaluate(strings, word, metrics)
results = []
# In practice, use @distributed or Threads.@threads
for s in strings
ct = ContingencyTable(s, word; windowsize=5, minfreq=2)
push!(results, assoc_score(metrics, ct))
end
return results
end
# Example with multiple text segments
strings = [
"Machine learning is powerful.",
"Deep learning uses neural networks.",
"Artificial intelligence includes machine learning."
]
results = parallel_evaluate(strings, "learning", [PMI, LogDice])
println("\nResults from $(length(results)) text segments processed")
Results from 3 text segments processed
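A concrete threaded variant of the same loop, as the comment suggests; this sketch assumes the segments can be processed independently:
using TextAssociations
function threaded_evaluate(strings, word, metrics)
    results = Vector{Any}(undef, length(strings))
    # Each iteration builds its own contingency table, so no state is shared.
    Threads.@threads for i in eachindex(strings)
        ct = ContingencyTable(strings[i], word; windowsize=5, minfreq=1)
        results[i] = assoc_score(metrics, ct)
    end
    return results
end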
Error Handling and Validation
Input Validation
using TextAssociations
using DataFrames
# Handle empty or invalid inputs
function safe_evaluate(s, word, metric)
try
# Validate inputs
isempty(s) && throw(ArgumentError("Text cannot be empty"))
isempty(word) && throw(ArgumentError("Word cannot be empty"))
ct = ContingencyTable(s, word; windowsize=5, minfreq=1)
results = assoc_score(metric, ct)
if isempty(results)
println("Warning: No collocates found for '$word'")
return DataFrame()
end
return results
catch e
println("Error: ", e)
return DataFrame()
end
end
# Test with various inputs
println("Valid input:")
valid = safe_evaluate(s, "learning", PMI)
println(" Found $(nrow(valid)) collocates")
println("\nEmpty word:")
empty_word = safe_evaluate(s, "", PMI)
println("\nWord not in text:")
missing = safe_evaluate(s, "quantum", PMI)
Parameter Validation
# Validate parameters before processing
function validated_analysis(s, word, windowsize, minfreq)
# Check window size
if windowsize < 1
throw(ArgumentError("Window size must be positive"))
elseif windowsize > 50
@warn "Large window size may include noise" windowsize
end
# Check minimum frequency
if minfreq < 1
throw(ArgumentError("Minimum frequency must be at least 1"))
elseif minfreq > 100
@warn "High minimum frequency may exclude valid collocates" minfreq
end
ct = ContingencyTable(s, word; windowsize, minfreq)
return assoc_score(PMI, ct)
end
# Test validation
try
validated_analysis(s, "learning", -1, 5)
catch e
println("Caught error: ", e)
end
results = validated_analysis(s, "learning", 3, 1)
println("Valid analysis: $(nrow(results)) results")
Caught error: ArgumentError("Window size must be positive")
Valid analysis: 6 results
Integration Examples
Complete Analysis Pipeline
using TextAssociations
using DataFrames
using TextAnalysis
function comprehensive_analysis(s, target_word)
# Step 1: Preprocess
doc = prep_string(s,
TextNorm(strip_punctuation=true,
strip_case=true,
normalize_whitespace=true)
)
# Step 2: Create contingency table
ct = ContingencyTable(text(doc), target_word; windowsize=5, minfreq=1)
# Step 3: Evaluate multiple metrics
metrics = [PMI, LogDice, LLR, Dice, JaccardIdx]
results = assoc_score(metrics, ct)
# Step 4: Add composite score
results.CompositeScore = (
results.PMI / maximum(results.PMI) * 0.3 +
results.LogDice / 14 * 0.3 +
results.LLR / maximum(results.LLR) * 0.2 +
results.Dice * 0.1 +
results.JaccardIdx * 0.1
)
# Step 5: Sort by composite score
sort!(results, :CompositeScore, rev=true)
return results
end
analysis = comprehensive_analysis(s, "learning")
println("\nTop 3 collocates by composite score:")
for row in eachrow(first(analysis, 3))
println(" $(row.Collocate): Score = $(round(row.CompositeScore, digits=3))")
end
Top 3 collocates by composite score:
and: Score = NaN
computer: Score = NaN
crucial: Score = NaN
The NaN scores arise because maximum(results.PMI) is 0.0 for this toy text, so the PMI term divides zero by zero. On real data, guard each normalization against a zero maximum before combining.
Export Functions
using TextAssociations
using CSV
using Dates
# Prepare results for export
ct = ContingencyTable(s, "intelligence"; windowsize=5, minfreq=1)
results = assoc_score([PMI, LogDice, LLR], ct)
# Add metadata
metadata!(results, "node", "intelligence", style=:note)
metadata!(results, "window_size", 5, style=:note)
metadata!(results, "min_freq", 1, style=:note)
metadata!(results, "timestamp", Dates.now(), style=:note)
# Export to CSV
output_file = tempname() * ".csv"
CSV.write(output_file, results)
println("Results exported to: ", output_file)
# Clean up
rm(output_file)
Results exported to: /tmp/jl_YLOHut0VuB.csv
Function Chaining and Composition
Using Chain.jl
using TextAssociations
using Chain
using DataFrames
using TextAnalysis
# Chain operations for cleaner code
result = @chain s begin
prep_string(_, TextNorm(strip_accents=false))
TextAnalysis.text(_)
ContingencyTable("learning"; windowsize=4, minfreq=1)
assoc_score([PMI, LogDice], _)
filter(row -> row.PMI > 2 && row.LogDice > 5, _)
sort(:PMI, rev=true)
first(5)
end
println("\nChained analysis result:")
println(result)
Chained analysis result:
0×5 DataFrame
Row │ Node Collocate Frequency PMI LogDice
│ String String Int64 Float64 Float64
─────┴────────────────────────────────────────────────
Custom Function Composition
# Compose functions for reusable pipelines
preprocess = s -> prep_string(s, TextNorm(strip_case=true, strip_punctuation=true))
analyze = (s, word) -> ContingencyTable(s, word; windowsize=5, minfreq=2)
evaluate = ct -> assoc_score([PMI, LogDice, LLR], ct)
filter_strong = df -> filter(row -> row.PMI > 3 && row.LLR > 10.83, df)
# Use composition
pipeline = s -> begin
doc = preprocess(s)
ct = analyze(text(doc), "machine")
results = evaluate(ct)
filter_strong(results)
end
final_results = pipeline(s)
println("\nPipeline results: $(nrow(final_results)) strong collocates")
Pipeline results: 0 strong collocates
Best Practices
1. Parameter Selection
# Recommended defaults
const DEFAULT_PARAMS = Dict(
:windowsize => 5, # Balanced for most applications
:minfreq => 5, # Filter noise in medium corpora
:strip_case => true, # Standard normalization
:strip_punctuation => true,
:normalize_whitespace => true
)
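One way to apply these defaults; a sketch in which the normalization fields feed a TextNorm and the rest become ContingencyTable keywords, following the signatures shown earlier:
using TextAssociations
using TextAnalysis
norm = TextNorm(strip_case=DEFAULT_PARAMS[:strip_case],
                strip_punctuation=DEFAULT_PARAMS[:strip_punctuation],
                normalize_whitespace=DEFAULT_PARAMS[:normalize_whitespace])
doc = prep_string("Raw TEXT, straight from a file...", norm)
ct = ContingencyTable(text(doc), "text";
                      windowsize=DEFAULT_PARAMS[:windowsize],
                      minfreq=DEFAULT_PARAMS[:minfreq])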
2. Metric Selection Guide
# Choose metrics based on goal
const METRIC_GUIDE = Dict(
"discovery" => [PMI, PPMI], # Find new associations
"validation" => [LLR, ChiSquare], # Test significance
"comparison" => [LogDice, PPMI], # Cross-corpus stable
"similarity" => [Dice, JaccardIdx], # Measure overlap
"comprehensive" => [PMI, LogDice, LLR, Dice] # Multiple perspectives
)
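The guide plugs straight into the multi-metric overload of assoc_score, which accepts a vector of metric types; reusing the sample text s from the earlier examples:
ct = ContingencyTable(s, "data"; windowsize=5, minfreq=1)
discovery = assoc_score(METRIC_GUIDE["discovery"], ct)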
3. Performance Tips
# For large-scale processing
function optimized_processing(corpus, nodes, metrics)
# 1. Reuse contingency tables
cache = Dict{String, ContingencyTable}()
# 2. Use scores_only when possible
# 3. Process in batches
# 4. Consider parallel processing
results = Dict()
for node in nodes
if !haskey(cache, node)
cache[node] = ContingencyTable(corpus, node; windowsize=5, minfreq=10)
end
results[node] = assoc_score(metrics, cache[node], scores_only=true)
end
return results
end
Troubleshooting
Common Issues and Solutions
Issue | Cause | Solution |
---|---|---|
Empty results | Word not in text or too rare | Lower minfreq, check preprocessing |
Memory error | Large vocabulary | Use scores_only=true, stream processing |
Slow performance | Large corpus or window | Reduce window size, increase minfreq |
Unexpected collocates | Preprocessing issues | Check strip_accents, strip_case settings |
Debug Helper
using TextAssociations
using TextAnalysis
using DataFrames
function debug_analysis(s, word, windowsize, minfreq)
println("Debug Analysis for '$word'")
println("="^40)
# Check preprocessing
doc = prep_string(s)
tokens = TextAnalysis.tokens(doc)
println("Total tokens: ", length(tokens))
println("Unique tokens: ", length(unique(tokens)))
println("Word frequency: ", count(==(lowercase(word)), tokens))
# Check contingency table
ct = ContingencyTable(text(doc), word; windowsize, minfreq)
data = cached_data(ct.con_tbl)
println("Contingency table rows: ", nrow(data))
if !isempty(data)
println("Frequency range: ", minimum(data.a), " - ", maximum(data.a))
end
# Check results
results = assoc_score(PMI, ct)
println("Final results: ", nrow(results), " collocates")
return results
end
debug_results = debug_analysis(s, "learning", 3, 1)
6×4 DataFrame
Row │ Node Collocate Frequency PMI
│ String String Int64 Float64
─────┼──────────────────────────────────────
1 │ learning a 1 -1.09861
2 │ learning computer 1 0.0
3 │ learning crucial 1 0.0
4 │ learning is 1 0.0
5 │ learning machine 1 0.0
6 │ learning science 1 0.0
See Also
- Core Types: Type definitions and structures
- Corpus Functions: Corpus-level operations
- Metrics Guide: Detailed metric descriptions
- Examples: More usage examples