Main Functions

This section provides comprehensive documentation for all main API functions in TextAssociations.jl.

Function Categories

Core Evaluation Functions

assoc_score - Primary Evaluation Function

TextAssociations.assoc_score - Function
assoc_score(metricType::Type{<:AssociationMetric}, x::AssociationDataFormat;
          scores_only::Bool=false,
          tokens::Union{Nothing,Vector{String}}=nothing,
          kwargs...)

Evaluate a metric on any association data format (CT or CCT).

  • If the metric requires tokens (e.g., LexicalGravity), pass tokens=... or implement assoc_tokens(::YourType) to supply them automatically.
  • Returns a DataFrame by default: [:Node, :Collocate, :Frequency, :<MetricName>].
  • If scores_only=true, returns only the scores Vector.
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
          x::AssociationDataFormat;
          scores_only::Bool=false,
          tokens::Union{Nothing,Vector{String}}=nothing,
          kwargs...)

Evaluate multiple metrics on CT or CCT.

  • Returns a DataFrame with one column per metric by default.
  • If scores_only=true, returns Dict{String,Vector{Float64}}.
assoc_score(metricType::Type{<:AssociationMetric},
          inputstring::AbstractString,
          node::AbstractString;
          windowsize::Int,
          minfreq::Int=5,
          scores_only::Bool=false,
          norm_config::TextNorm=TextNorm(),
          tokens::Union{Nothing,Vector{String}}=nothing,
          kwargs...)

Convenience overload to compute a metric directly from a raw string.

assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
          inputstring::AbstractString,
          node::AbstractString;
          windowsize::Int,
          minfreq::Int=5,
          scores_only::Bool=false,
          norm_config::TextNorm=TextNorm(),
          tokens::Union{Nothing,Vector{String}}=nothing,
          kwargs...)

Convenience overload to compute multiple metrics directly from raw text.

assoc_score(metricType::Type{<:AssociationMetric}, corpus::Corpus, node::AbstractString;
            windowsize::Int=5, minfreq::Int=5, kwargs...)

Evaluate a metric on a corpus - convenience method that delegates to analyze_corpus.

assoc_score(metricType::Type{<:AssociationMetric}, corpus::Corpus, 
            nodes::Vector{String}; windowsize::Int=5, minfreq::Int=5, 
            top_n::Int=100, kwargs...)

Evaluate a metric on multiple nodes in a corpus. Returns a Dict{String,DataFrame} with results for each node.

assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}}, 
            corpus::Corpus, nodes::Vector{String}; 
            windowsize::Int=5, minfreq::Int=5, top_n::Int=100, kwargs...)

Evaluate multiple metrics on multiple nodes in a corpus. Returns a Dict{String,DataFrame} with combined metric results for each node.

assoc_score(metric::Type{<:AssociationMetric}, corpus::Corpus;
            nodes::Vector{String}, windowsize::Int=5, minfreq::Int=5,
            top_n::Int=100, kwargs...)

Alternative syntax with nodes as keyword argument.


The assoc_score function is the primary interface for computing association metrics. It supports multiple signatures for different use cases.

Method Signatures

# Single metric on prepared data
assoc_score(metric::Type{<:AssociationMetric},
          data::AssociationDataFormat;
          scores_only::Bool=false,
          tokens::Union{Nothing,Vector{String}}=nothing) -> Union{DataFrame, Vector}

# Multiple metrics on prepared data
assoc_score(metrics::Vector{DataType},
          data::AssociationDataFormat;
          scores_only::Bool=false) -> Union{DataFrame, Dict}

# Direct from text (single metric)
assoc_score(metric::Type{<:AssociationMetric},
          s::AbstractString,
          node::AbstractString;
          windowsize::Int,
          minfreq::Int=5,
          scores_only::Bool=false) -> Union{DataFrame, Vector}

Parameters

Parameter       Type                      Description                                  Default
──────────────────────────────────────────────────────────────────────────────────────────────
metric/metrics  Type or Vector{DataType}  Association metric(s) to compute             Required
data            AssociationDataFormat     ContingencyTable or CorpusContingencyTable   Required
scores_only     Bool                      Return only numeric scores                   false
tokens          Vector{String}            Token list for metrics that need it          nothing
text            AbstractString            Raw text for direct evaluation               -
node            AbstractString            Target word                                  -
windowsize      Int                       Context window size                          -
minfreq         Int                       Minimum frequency threshold                  5

Return Values

  • Default (scores_only=false): Returns DataFrame with columns:

    • Node: Target word
    • Collocate: Co-occurring word
    • Frequency: Co-occurrence frequency
    • [MetricName]: Score column(s) named after metric(s)
  • Performance mode (scores_only=true):

    • Single metric: Vector{Float64} of scores
    • Multiple metrics: Dict{String, Vector{Float64}}

Examples

Basic Usage
using TextAssociations
using TextAnalysis

s = """
Data science combines mathematics, statistics, and computer science.
Machine learning is a crucial part of data science.
Data analysis helps extract insights from data.
"""

# Create contingency table
ct = ContingencyTable(s, "data"; windowsize=3, minfreq=1)

# Single metric evaluation
pmi_results = assoc_score(PMI, ct)
println("PMI Results:")
println(pmi_results)
PMI Results:
12×4 DataFrame
 Row │ Node    Collocate    Frequency  PMI
     │ String  String       Int64      Float64
─────┼──────────────────────────────────────────
   1 │ data    analysis             1  -1.79176
   2 │ data    combines             1  -1.38629
   3 │ data    crucial              1  -1.38629
   4 │ data    data                 2  -1.38629
   5 │ data    extract              1  -1.09861
   6 │ data    from                 1  -1.38629
   7 │ data    helps                1  -1.38629
   8 │ data    insights             1  -1.38629
   9 │ data    mathematics          1  -1.38629
  10 │ data    of                   1  -1.09861
  11 │ data    part                 1  -1.38629
  12 │ data    science              2  -1.09861
Multiple Metrics
# Evaluate multiple metrics simultaneously
metrics = [PMI, LogDice, LLR, Dice]
multi_results = assoc_score(metrics, ct)

println("\nColumns in results: ", names(multi_results))
println("Top result by PMI:")
println(first(sort(multi_results, :PMI, rev=true), 1))

Columns in results: ["Node", "Collocate", "Frequency", "PMI", "LogDice", "LLR", "Dice"]
Top result by PMI:
1×7 DataFrame
 Row │ Node    Collocate  Frequency  PMI       LogDice  LLR      Dice
     │ String  String     Int64      Float64   Float64  Float64  Float64
─────┼───────────────────────────────────────────────────────────────────
   1 │ data    science            2  -1.09861  13.6781  6.18896      0.8
Direct from Text
# Skip contingency table creation
results = assoc_score(PMI, s, "science", windowsize=4, minfreq=1)
println("\nDirect evaluation results:")
println(results)

Direct evaluation results:
16×4 DataFrame
 Row │ Node     Collocate    Frequency  PMI
     │ String   String       Int64      Float64
─────┼────────────────────────────────────────────
   1 │ science  a                    1  -1.79176
   2 │ science  analysis             1  -1.09861
   3 │ science  and                  1  -0.693147
   4 │ science  combines             1  -1.09861
   5 │ science  computer             1  -1.09861
   6 │ science  crucial              1  -1.09861
   7 │ science  data                 3  -1.38629
   8 │ science  extract              1  -1.09861
   9 │ science  helps                1  -1.09861
  10 │ science  is                   1  -1.09861
  11 │ science  learning             1  -1.09861
  12 │ science  machine              1  -1.09861
  13 │ science  mathematics          1  -0.693147
  14 │ science  of                   1  -1.09861
  15 │ science  part                 1  -1.09861
  16 │ science  statistics           1  -0.693147
Performance Mode
# Get only scores for better performance
scores = assoc_score(PMI, ct, scores_only=true)
println("\nScore vector: ", scores)
println("Length: ", length(scores))

# Multiple metrics with scores_only
score_dict = assoc_score([PMI, LogDice], ct, scores_only=true)
println("\nScore dictionary keys: ", keys(score_dict))

Score vector: [-1.791759469228055, -1.3862943611198906, -1.3862943611198906, -1.3862943611198904, -1.0986122886681098, -1.3862943611198906, -1.3862943611198906, -1.3862943611198906, -1.3862943611198906, -1.0986122886681098, -1.3862943611198906, -1.0986122886681096]
Length: 12

Score dictionary keys: ["LogDice", "PMI"]
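
Token-Dependent Metrics
Some metrics need the raw token stream in addition to the contingency table; the docstring above names LexicalGravity as an example and documents the tokens keyword for supplying it. A minimal sketch, assuming the tokens of the preprocessed document are a suitable input for such a metric:

# Supply the token stream explicitly for a metric that requires it
toks = TextAnalysis.tokens(prep_string(s, TextNorm()))
gravity_results = assoc_score(LexicalGravity, ct, tokens=toks)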

Advanced Usage

Custom Filtering Pipeline
using TextAssociations
using TextAnalysis
using DataFrames

# Evaluate and filter in one pipeline
function analyze_with_thresholds(text, word, thresholds)
    ct = ContingencyTable(text, word; windowsize=5, minfreq=2)
    results = assoc_score([PMI, LogDice, LLR], ct)

    # Apply multiple thresholds
    filtered = filter(row ->
        row.PMI >= thresholds[:pmi] &&
        row.LogDice >= thresholds[:logdice] &&
        row.LLR >= thresholds[:llr],
        results
    )

    return sort(filtered, :PMI, rev=true)
end

thresholds = Dict(:pmi => 2.0, :logdice => 5.0, :llr => 3.84)
filtered = analyze_with_thresholds(s, "data", thresholds)
println("Filtered results: ", nrow(filtered), " collocates")
Filtered results: 0 collocates

Text Processing Functions

prep_string - Text Preprocessing

TextAssociations.prep_string - Function
prep_string(input_path::AbstractString, config::TextNorm) -> StringDocument

Prepare and preprocess text from various sources into a StringDocument.

Arguments

  • input_path: File path, directory path, or raw text string.

Preprocessing options

Uses TextNorm configuration for all preprocessing options.

Returns

A preprocessed StringDocument object suitable for downstream corpus analysis.


Preprocesses text with extensive customization options for different languages and domains.

Parameters

Note: input_path is the only positional argument; the remaining options are fields of the TextNorm configuration passed as config.

Parameter             Type            Description                         Default
──────────────────────────────────────────────────────────────────────────────────
input_path            AbstractString  File path, directory, or raw text   Required
strip_punctuation     Bool            Remove punctuation                  true
punctuation_to_space  Bool            Replace punctuation with spaces     true
strip_whitespace      Bool            Remove all whitespace               false
normalize_whitespace  Bool            Collapse multiple spaces            true
strip_case            Bool            Convert to lowercase                true
strip_accents         Bool            Remove diacritical marks            false
unicode_form          Symbol          Unicode normalization form          :NFC
use_prepare           Bool            Apply TextAnalysis pipeline         false

Examples

Basic Preprocessing
using TextAssociations
using TextAnalysis

# Default preprocessing
s = "Hello, WORLD!!! Multiple   spaces..."
doc = prep_string(s)
println("Default: '", text(doc), "'")

# Custom preprocessing
doc_custom = prep_string(s,
    TextNorm(
        strip_case=false,           # Keep original case
        strip_punctuation=false,    # Keep punctuation
        normalize_whitespace=true   # Fix spacing only
    ))
println("Custom: '", text(doc_custom), "'")
Default: 'hello world multiple spaces '
Custom: 'Hello, WORLD!!! Multiple spaces...'
Multilingual Text
using TextAssociations
using TextAnalysis

# Greek text with diacritics
greek = "Καλημέρα! Η ανάλυση κειμένου είναι σημαντική."

# Keep diacritics (default)
doc_with = prep_string(greek, TextNorm(strip_accents=false))
println("With accents: '", text(doc_with), "'")

# Remove diacritics
doc_without = prep_string(greek, TextNorm(strip_accents=true))
println("Without accents: '", text(doc_without), "'")
With accents: 'καλημέρα η ανάλυση κειμένου είναι σημαντική '
Without accents: 'καλημερα η αναλυση κειμενου ειναι σημαντικη '
Processing Files and Directories
using TextAssociations
using TextAnalysis

# From file
# doc = prep_string("document.txt")

# From directory (concatenates all .txt files)
# doc = prep_string("corpus/")

# Example with temporary file
temp_file = tempname() * ".txt"
write(temp_file, "Sample text from file.")
doc = prep_string(temp_file)
println("From file: '", text(doc), "'")
rm(temp_file)
From file: 'sample text from file '

build_vocab - Vocabulary Creation

Creates an ordered dictionary mapping words to indices.

Parameters

  • input: Either a StringDocument or Vector{String}

Returns

  • OrderedDict{String,Int}: Word-to-index mapping

Examples

using TextAssociations

# From document
doc = prep_string("The quick brown fox jumps over the lazy dog")
vocab = build_vocab(doc)

println("Vocabulary size: ", length(vocab))
println("First 5 words:")
for (word, idx) in Iterators.take(vocab, 5)
    println("  $idx: '$word'")
end

# From word vector
words = ["apple", "banana", "cherry", "apple"]  # Duplicates removed
vocab2 = build_vocab(words)
println("\nUnique words: ", length(vocab2))
Vocabulary size: 8
First 5 words:
  1: 'the'
  2: 'quick'
  3: 'brown'
  4: 'fox'
  5: 'jumps'

Unique words: 3
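
The returned OrderedDict can be indexed like any dictionary; per the listing above, 'fox' was assigned index 4:

# Look up a word's index directly in the mapping
println("Index of 'fox': ", vocab["fox"])  # 4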

Utility Functions

available_metrics - List Available Metrics

Returns a vector of all available association metric symbols.

Example

using TextAssociations

metrics = available_metrics()
println("Total available metrics: ", length(metrics))
println("\nInformation-theoretic metrics:")
info_metrics = filter(m -> occursin("PMI", String(m)) || m == :PPMI, metrics)
println(info_metrics)

println("\nStatistical metrics:")
stat_metrics = filter(m -> m in [:LLR, :ChiSquare, :Tscore, :Zscore], metrics)
println(stat_metrics)
Total available metrics: 47

Information-theoretic metrics:
[:PMI, :PMI², :PMI³, :PPMI]

Statistical metrics:
[:LLR, :Tscore, :Zscore, :ChiSquare]

cached_data - Access Lazy Data

Extracts data from a LazyProcess, computing it if necessary.

Parameters

  • z::LazyProcess: Lazy process wrapper

Returns

  • The computed/cached result

Example

using TextAssociations

# ContingencyTable uses lazy evaluation internally
ct = ContingencyTable("sample text", "text"; windowsize=3, minfreq=1)

# First access computes the table
println("First access...")
data1 = cached_data(ct.con_tbl)

# Second access uses cache (no computation)
println("Second access...")
data2 = cached_data(ct.con_tbl)

println("Same object? ", data1 === data2)  # true - same cached object
First access...
Second access...
Same object? true

document - Access Document

Extracts the document from a LazyInput wrapper.

Parameters

  • input::LazyInput: Lazy input wrapper

Returns

  • StringDocument: The stored document

Batch Processing Functions

Processing Multiple Nodes

using TextAssociations
using DataFrames

s = """
Artificial intelligence and machine learning are transforming technology.
Deep learning, a subset of machine learning, uses neural networks.
Machine learning algorithms can learn from data without explicit programming.
"""

# Analyze multiple words
nodes = ["learning", "machine", "neural", "data"]
results = Dict{String, DataFrame}()

for node in nodes
    ct = ContingencyTable(s, node; windowsize=3, minfreq=1)
    results[node] = assoc_score(PMI, ct)
end

println("Results per node:")
for (node, df) in results
    println("  $node: $(nrow(df)) collocates, top PMI = $(round(maximum(df.PMI), digits=2))")
end
Results per node:
  neural: 4 collocates, top PMI = 0.0
  data: 6 collocates, top PMI = 0.0
  machine: 14 collocates, top PMI = -0.69
  learning: 16 collocates, top PMI = -1.1

Comparative Analysis

using TextAssociations
using DataFrames
using Statistics

# Compare different window sizes
function compare_parameters(text, word)
    params = [
        (window=2, minfreq=1),
        (window=5, minfreq=1),
        (window=10, minfreq=1)
    ]

    comparison = DataFrame()
    for p in params
        ct = ContingencyTable(text, word; windowsize=p.window, minfreq=p.minfreq)
        df = assoc_score(PMI, ct)
        df.WindowSize .= p.window
        append!(comparison, df)
    end

    return comparison
end

comparison = compare_parameters(s, "learning")
grouped = groupby(comparison, :WindowSize)
summary = combine(grouped,
    nrow => :NumCollocates,
    :PMI => mean => :AvgPMI,
    :PMI => maximum => :MaxPMI
)
println("\nWindow size comparison:")
println(summary)

Window size comparison:
3×4 DataFrame
 Row │ WindowSize  NumCollocates  AvgPMI     MaxPMI
     │ Int64       Int64          Float64    Float64
─────┼───────────────────────────────────────────────
   1 │          2              4  -0.274653      0.0
   2 │          5             10  -0.109861      0.0
   3 │         10             15  -0.192691      0.0

Performance Optimization

Memory-Efficient Processing

using TextAssociations

# Use scores_only for large-scale processing
function process_many_nodes(text, nodes)
    scores = Dict{String, Vector{Float64}}()

    for node in nodes
        ct = ContingencyTable(text, node; windowsize=5, minfreq=1)
        # Get only scores to save memory
        scores[node] = assoc_score(PMI, ct, scores_only=true)
    end

    return scores
end

nodes = ["intelligence", "artificial", "learning"]
score_dict = process_many_nodes(s, nodes)
println("\nScore vectors per node:")
for (node, scores) in score_dict
    if isempty(scores)
        println("  $node: no collocates found")
    else
        println("  $node: $(length(scores)) scores, max = $(round(maximum(scores), digits=2))")
    end
end

Score vectors per node:
  artificial: no collocates found
  learning: 10 scores, max = 0.0
  intelligence: no collocates found

Parallel Evaluation

using TextAssociations

# Function for parallel processing (conceptual)
function parallel_evaluate(strings, word, metrics)
    results = []

    # In practice, use @distributed or Threads.@threads
    for s in strings
        ct = ContingencyTable(s, word; windowsize=5, minfreq=2)
        push!(results, assoc_score(metrics, ct))
    end

    return results
end

# Example with multiple text segments
strings = [
    "Machine learning is powerful.",
    "Deep learning uses neural networks.",
    "Artificial intelligence includes machine learning."
]

results = parallel_evaluate(strings, "learning", [PMI, LogDice])
println("\nResults from $(length(results)) text segments processed")

Results from 3 text segments processed
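
Because each segment builds its own ContingencyTable, the iterations above are independent. A threaded variant is therefore a small change; this is a sketch, assuming table construction and scoring are safe to run concurrently:

using TextAssociations
using DataFrames

# Threaded version: one independent contingency table per segment
function threaded_evaluate(strings, word, metrics)
    results = Vector{DataFrame}(undef, length(strings))
    Threads.@threads for i in eachindex(strings)
        ct = ContingencyTable(strings[i], word; windowsize=5, minfreq=2)
        results[i] = assoc_score(metrics, ct)
    end
    return results
end

threaded_results = threaded_evaluate(strings, "learning", [PMI, LogDice])
println("Processed $(length(threaded_results)) segments on $(Threads.nthreads()) thread(s)")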

Error Handling and Validation

Input Validation

using TextAssociations
using DataFrames

# Handle empty or invalid inputs
function safe_evaluate(s, word, metric)
    try
        # Validate inputs
        isempty(s) && throw(ArgumentError("Text cannot be empty"))
        isempty(word) && throw(ArgumentError("Word cannot be empty"))

        ct = ContingencyTable(s, word; windowsize=5, minfreq=1)
        results = assoc_score(metric, ct)

        if isempty(results)
            println("Warning: No collocates found for '$word'")
            return DataFrame()
        end

        return results
    catch e
        println("Error: ", e)
        return DataFrame()
    end
end

# Test with various inputs
println("Valid input:")
valid = safe_evaluate(s, "learning", PMI)
println("  Found $(nrow(valid)) collocates")

println("\nEmpty word:")
empty_word = safe_evaluate(s, "", PMI)

println("\nWord not in text:")
not_found = safe_evaluate(s, "quantum", PMI)
0×0 DataFrame

Parameter Validation

# Validate parameters before processing
function validated_analysis(s, word, windowsize, minfreq)
    # Check window size
    if windowsize < 1
        throw(ArgumentError("Window size must be positive"))
    elseif windowsize > 50
        @warn "Large window size may include noise" windowsize
    end

    # Check minimum frequency
    if minfreq < 1
        throw(ArgumentError("Minimum frequency must be at least 1"))
    elseif minfreq > 100
        @warn "High minimum frequency may exclude valid collocates" minfreq
    end

    ct = ContingencyTable(s, word; windowsize, minfreq)
    return assoc_score(PMI, ct)
end

# Test validation
try
    validated_analysis(s, "learning", -1, 5)
catch e
    println("Caught error: ", e)
end

results = validated_analysis(s, "learning", 3, 1)
println("Valid analysis: $(nrow(results)) results")
Caught error: ArgumentError("Window size must be positive")
Valid analysis: 6 results

Integration Examples

Complete Analysis Pipeline

using TextAssociations
using DataFrames
using TextAnalysis

function comprehensive_analysis(s, target_word)
    # Step 1: Preprocess
    doc = prep_string(s,
        TextNorm(strip_punctuation=true,
        strip_case=true,
        normalize_whitespace=true)
    )

    # Step 2: Create contingency table
    ct = ContingencyTable(text(doc), target_word; windowsize=5, minfreq=1)

    # Step 3: Evaluate multiple metrics
    metrics = [PMI, LogDice, LLR, Dice, JaccardIdx]
    results = assoc_score(metrics, ct)

    # Step 4: Add composite score
    results.CompositeScore = (
        results.PMI / maximum(results.PMI) * 0.3 +
        results.LogDice / 14 * 0.3 +
        results.LLR / maximum(results.LLR) * 0.2 +
        results.Dice * 0.1 +
        results.JaccardIdx * 0.1
    )

    # Step 5: Sort by composite score
    sort!(results, :CompositeScore, rev=true)

    return results
end

analysis = comprehensive_analysis(s, "learning")
println("\nTop 3 collocates by composite score:")
for row in eachrow(first(analysis, 3))
    println("  $(row.Collocate): Score = $(round(row.CompositeScore, digits=3))")
end

Top 3 collocates by composite score:
  and: Score = NaN
  computer: Score = NaN
  crucial: Score = NaN
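
The NaN composite scores come from Step 4: on a sample this small the PMI and LLR maxima used as divisors can be zero, and 0/0 evaluates to NaN. A hedged recomputation on the returned frame that guards the divisors (safe_max is an illustrative helper, not part of the package):

# Fall back to 1.0 whenever a column maximum cannot be used as a divisor
safe_max(v) = (m = maximum(v); (isfinite(m) && m > 0) ? m : one(m))

analysis.CompositeScore = (
    analysis.PMI ./ safe_max(analysis.PMI) .* 0.3 .+
    analysis.LogDice ./ 14 .* 0.3 .+
    analysis.LLR ./ safe_max(analysis.LLR) .* 0.2 .+
    analysis.Dice .* 0.1 .+
    analysis.JaccardIdx .* 0.1
)
sort!(analysis, :CompositeScore, rev=true)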

Export Functions

using TextAssociations
using CSV
using DataFrames
using Dates

# Prepare results for export
ct = ContingencyTable(s, "intelligence"; windowsize=5, minfreq=1)
results = assoc_score([PMI, LogDice, LLR], ct)

# Add metadata
metadata!(results, "node", "intelligence", style=:note)
metadata!(results, "window_size", 5, style=:note)
metadata!(results, "min_freq", 1, style=:note)
metadata!(results, "timestamp", Dates.now(), style=:note)

# Export to CSV
output_file = tempname() * ".csv"
CSV.write(output_file, results)
println("Results exported to: ", output_file)

# Clean up
rm(output_file)
Results exported to: /tmp/jl_YLOHut0VuB.csv

Function Chaining and Composition

Using Chain.jl

using TextAssociations
using Chain
using DataFrames
using TextAnalysis

# Chain operations for cleaner code
result = @chain s begin
    prep_string(_, TextNorm(strip_accents=false))
    TextAnalysis.text(_)
    ContingencyTable("learning"; windowsize=4, minfreq=1)
    assoc_score([PMI, LogDice], _)
    filter(row -> row.PMI > 2 && row.LogDice > 5, _)
    sort(:PMI, rev=true)
    first(5)
end

println("\nChained analysis result:")
println(result)

Chained analysis result:
0×5 DataFrame
 Row │ Node    Collocate  Frequency  PMI      LogDice
     │ String  String     Int64      Float64  Float64
─────┴────────────────────────────────────────────────

Custom Function Composition

# Compose functions for reusable pipelines
preprocess = s -> prep_string(s, TextNorm(strip_case=true, strip_punctuation=true))
analyze = (s, word) -> ContingencyTable(s, word; windowsize=5, minfreq=2)
evaluate = ct -> assoc_score([PMI, LogDice, LLR], ct)
filter_strong = df -> filter(row -> row.PMI > 3 && row.LLR > 10.83, df)

# Use composition
pipeline = s -> begin
    doc = preprocess(s)
    ct = analyze(text(doc), "machine")
    results = evaluate(ct)
    filter_strong(results)
end

final_results = pipeline(s)
println("\nPipeline results: $(nrow(final_results)) strong collocates")

Pipeline results: 0 strong collocates

Best Practices

1. Parameter Selection

# Recommended defaults
const DEFAULT_PARAMS = Dict(
    :windowsize => 5,      # Balanced for most applications
    :minfreq => 5,         # Filter noise in medium corpora
    :strip_case => true,   # Standard normalization
    :strip_punctuation => true,
    :normalize_whitespace => true
)
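
Only the first two keys feed directly into table construction; the normalization flags belong in a TextNorm if preprocessing is done explicitly. A small sketch of how the defaults might be consumed (default_table is a local convenience wrapper, not a package function):

# Build a contingency table using the shared defaults
default_table(txt, node) = ContingencyTable(txt, node;
    windowsize=DEFAULT_PARAMS[:windowsize],
    minfreq=DEFAULT_PARAMS[:minfreq])

ct = default_table(s, "data")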

2. Metric Selection Guide

# Choose metrics based on goal
const METRIC_GUIDE = Dict(
    "discovery" => [PMI, PPMI],           # Find new associations
    "validation" => [LLR, ChiSquare],     # Test significance
    "comparison" => [LogDice, PPMI],      # Cross-corpus stable
    "similarity" => [Dice, JaccardIdx],   # Measure overlap
    "comprehensive" => [PMI, LogDice, LLR, Dice]  # Multiple perspectives
)
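
The guide pairs directly with the multi-metric form of assoc_score shown earlier; for example:

# Evaluate the metric set that matches the analysis goal
ct = ContingencyTable(s, "learning"; windowsize=5, minfreq=1)
validation_results = assoc_score(METRIC_GUIDE["validation"], ct)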

3. Performance Tips

# For large-scale processing
function optimized_processing(corpus, nodes, metrics)
    # 1. Reuse contingency tables
    cache = Dict{String, ContingencyTable}()

    # 2. Use scores_only when possible
    # 3. Process in batches
    # 4. Consider parallel processing

    results = Dict()
    for node in nodes
        if !haskey(cache, node)
            cache[node] = ContingencyTable(corpus, node; windowsize=5, minfreq=10)
        end
        results[node] = assoc_score(metrics, cache[node], scores_only=true)
    end

    return results
end

Troubleshooting

Common Issues and Solutions

Issue                  Cause                         Solution
──────────────────────────────────────────────────────────────────────────────────────
Empty results          Word not in text or too rare  Lower minfreq, check preprocessing
Memory error           Large vocabulary              Use scores_only=true, stream processing
Slow performance       Large corpus or window        Reduce window size, increase minfreq
Unexpected collocates  Preprocessing issues          Check strip_accents, strip_case settings

Debug Helper

function debug_analysis(s, word, windowsize, minfreq)
    println("Debug Analysis for '$word'")
    println("="^40)

    # Check preprocessing
    doc = prep_string(s)
    tokens = TextAnalysis.tokens(doc)
    println("Total tokens: ", length(tokens))
    println("Unique tokens: ", length(unique(tokens)))
    println("Word frequency: ", count(==(lowercase(word)), tokens))

    # Check contingency table
    ct = ContingencyTable(text(doc), word; windowsize, minfreq)
    data = cached_data(ct.con_tbl)
    println("Contingency table rows: ", nrow(data))

    if !isempty(data)
        println("Frequency range: ", minimum(data.a), " - ", maximum(data.a))
    end

    # Check results
    results = assoc_score(PMI, ct)
    println("Final results: ", nrow(results), " collocates")

    return results
end

debug_results = debug_analysis(s, "learning", 3, 1)
6×4 DataFrame
 Row │ Node      Collocate  Frequency  PMI
     │ String    String     Int64      Float64
─────┼──────────────────────────────────────────
   1 │ learning  a                  1  -1.09861
   2 │ learning  computer           1   0.0
   3 │ learning  crucial            1   0.0
   4 │ learning  is                 1   0.0
   5 │ learning  machine            1   0.0
   6 │ learning  science            1   0.0

See Also