TextAssociations

Documentation for TextAssociations.

Install

You can install TextAssociations.jl directly from its GitHub repository using Julia’s package manager. In the Julia REPL, press ] to enter Pkg mode and run:

pkg> add https://github.com/atantos/TextAssociations.jl

Once the package is registered in the Julia General registry, you will be able to install it more simply with:

pkg> add TextAssociations

See Installation for detailed instructions and troubleshooting.

TextAssociations.jl

A Julia package for word association measures, collocation analysis, and descriptive statistics at the text and corpus levels.

Overview

TextAssociations.jl is a comprehensive framework for computing word association metrics, performing collocation analysis, and producing a wide range of descriptive statistical indices at both the text and corpus levels. With 47 implemented association measures, it is designed to support research in computational linguistics, corpus linguistics, natural language processing, and the digital humanities.

Package Highlights

  • 47 association metrics across statistical, information-theoretic, similarity and epidemiological families — including PMI, LogDice, LLR, Chi-square, Odds Ratio, Lexical Gravity, and many more

  • Efficient processing of large corpora with lazy evaluation and caching

  • Corpus analysis at scale via batch and streaming modes, with optional parallelism

  • Multilingual support with proper Unicode handling and diacritic normalization

  • Advanced features:

    • Temporal association analysis and trend detection
    • Subcorpus comparisons with effect sizes and statistical testing
    • Collocation network construction and export (e.g., to Gephi)
    • KWIC concordances for contextual exploration
    • Keyword extraction (currently TF-IDF, with RAKE and TextRank planned)

Quick Start

After installation, you can immediately begin analyzing text and exploring collocations with just a few lines of code. The example below demonstrates how to create a contingency table for a target word, compute multiple association measures, and display the top collocates.

For a step-by-step explanation of what happens in each stage and detailed guidance on how to use the package effectively, see the Tutorial section of this documentation.

using TextAssociations

# Analyze collocations in text
text = """
Machine learning algorithms learn patterns from data.
Deep learning is a subset of machine learning.
Neural networks power deep learning systems.
"""

# Find collocations of "learning"
ct = ContingencyTable(text, "learning", windowsize=3, minfreq=1)

# Calculate multiple metrics
results = assoc_score([PMI, LogDice, LLR], ct)

# Display top collocations
using DataFrames
sort!(results, :PMI, rev=true)
first(results, 5)
5×6 DataFrame
 Row │ Node      Collocate  Frequency  PMI       LogDice  LLR
     │ String    String     Int64      Float64   Float64  Float64
─────┼────────────────────────────────────────────────────────────
   1 │ learning  networks           1  -1.09861   13.0        0.0
   2 │ learning  power              1  -1.09861   13.0        0.0
   3 │ learning  subset             1  -1.09861   13.0        0.0
   4 │ learning  deep               2  -1.38629   13.415      0.0
   5 │ learning  machine            2  -1.38629   13.415      0.0

Key Features

📊 Comprehensive Metric Collection

The package provides metrics from several families of measures. The examples below are representative; the full list of implemented measures, along with their formulae, is provided in the Measures section.

  • Information-theoretic: PMI, PPMI, Mutual Information variants
  • Statistical: Log-likelihood ratio, Chi-square, T-score, Z-score
  • Similarity-based: Dice, Jaccard, Cosine similarity
  • Effect size: Odds ratio, Relative risk, Cohen's d
  • Specialized: Lexical Gravity, Delta P, Minimum Sensitivity
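To illustrate how measures in these families are derived from raw counts, the sketch below computes textbook versions of PMI and the Dice coefficient. The function names `pmi_sketch` and `dice_sketch` and the example counts are hypothetical, and the package's own implementations (log base, smoothing, scaling) may differ; the Measures section remains the reference for the exact formulae.

```julia
# Textbook definitions, not the package's implementation.
# f_xy: co-occurrence frequency of node and collocate
# f_x, f_y: total frequencies of node and collocate
# n: total number of observations (e.g., tokens)

# Pointwise mutual information: log2 of observed vs. expected co-occurrence
pmi_sketch(f_xy, f_x, f_y, n) = log2((f_xy * n) / (f_x * f_y))

# Dice coefficient: harmonic mean of the two conditional probabilities
dice_sketch(f_xy, f_x, f_y) = 2 * f_xy / (f_x + f_y)

# Hypothetical counts for one word pair
f_xy, f_x, f_y, n = 10, 20, 50, 1000
println("PMI  = ", pmi_sketch(f_xy, f_x, f_y, n))
println("Dice = ", dice_sketch(f_xy, f_x, f_y))
```

Both functions take plain numbers, so they can be checked by hand against a small contingency table before relying on any particular metric.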

🚀 Performance and Scalability

  • Lazy evaluation: Computations are deferred and cached
  • Memory efficient: Stream processing for large corpora
  • Parallel processing: Built-in support for distributed computing
  • Optimized algorithms: Efficient implementations for all metrics

🔧 Flexible and Extensible

  • Multiple input formats: Raw text, files, directories, CSV, JSON
  • Customizable preprocessing: Full control over text normalization
  • Extensible design: Easy to add new metrics or modify existing ones
  • Rich output options: DataFrames, CSV, JSON, Excel export

Basic Usage

Single Document Analysis

using TextAssociations
using TextAnalysis: text
using DataFrames

text_sample = "Machine learning algorithms learn from data. Deep learning uses neural networks."

doc = prep_string(text_sample, TextNorm(
    strip_punctuation=true,
    strip_case=true
))

ct = ContingencyTable(text(doc), "learning"; windowsize=5, minfreq=1)
pmi_scores = assoc_score(PMI, ct)
println("Found $(nrow(pmi_scores)) collocates")
Found 9 collocates

Corpus-Level Analysis

using TextAssociations

# Create a temporary mini-corpus with longer texts
dir = mktempdir()

files = Dict(
    "doc1.txt" => """
    Computational linguistics increasingly intersects with innovation practice.
    Teams use data to evaluate hypotheses, prototype ideas quickly, and measure impact with reproducible pipelines.
    In modern research workflows, small models are validated against well-defined tasks before scaling, ensuring that innovation is more than a buzzword—it is a methodical, testable process.
    When AI systems are involved, documentation and transparent governance help peers replicate results and trust conclusions.
    """,

    "doc2.txt" => """
    Successful innovation rarely happens in isolation.
    It emerges from an ecosystem of universities, startups, industry labs, and public institutions that collaborate and share partial results early.
    Well-run projects cultivate collaboration rituals—design reviews, error analyses, and postmortems—so ideas move from promising theory to usable tools.
    Open exchange reduces duplication and accelerates learning across the ecosystem.
    """,

    "doc3.txt" => """
    Prototyping is the bridge between research and deployment.
    A minimal prototype clarifies the problem, surfaces risks, and reveals unknown edge cases.
    From there, teams harden the system for scalability, add observability, and evaluate ethical trade-offs such as bias, privacy, and safety.
    A principled evaluation plan is part of the prototype, not an afterthought.
    """,

    "doc4.txt" => """
    Education benefits when innovation is human-centered.
    Instructors can combine classic readings with hands-on labs that trace data through each step of the pipeline.
    Open-source examples and clear rubrics help students reason about uncertainty, interpret model behavior, and articulate the limits of automation.
    The goal is durable understanding and real-world impact, not just higher benchmark scores.
    """
)

# Write files
for (name, content) in files
    open(joinpath(dir, name), "w") do io
        write(io, strip(content))
    end
end

# Load the corpus from the real path we just created
corpus = read_corpus(dir)

# Analyze across entire corpus
results = analyze_corpus(corpus, "innovation", PMI,
    windowsize=5,
    minfreq=1
)

# Get corpus statistics
stats = corpus_stats(corpus)
println("Documents: $(stats[:num_documents])")
println("Vocabulary: $(stats[:vocabulary_size])")
Loaded 4 documents
Documents: 4
Vocabulary: 175

Advanced Features

Collocation Networks

Build networks of related terms:

network = colloc_graph(
    corpus, ["artificial", "intelligence"],
    metric=PMI, depth=2
)

Comparative Analysis

Compare associations across subcorpora:

comparison = compare_subcorpora(
    corpus, :category, "technology", PMI
)

Temporal Analysis

Track how word associations change over time:

temporal_analysis = analyze_temporal(
    corpus, ["digital", "transformation"], :year, PMI
)

Package Architecture

TextAssociations.jl
│
├─ Types & Basics
│  ├─ AssociationMetric / AssociationDataFormat
│  ├─ TextNorm (single source of truth for normalization)
│  └─ LazyProcess / LazyInput (lazy evaluation & caching)
│
├─ Utils
│  ├─ I/O & encoding (read_text_smart)
│  ├─ Text processing (normalize_node, prep_string, strip_diacritics)
│  ├─ Statistical helpers (available_metrics, log_safe)
│  └─ Text analysis helpers (token find/count utilities)
│
├─ Core Data Structures
│  ├─ ContingencyTable          # per-document co-occurrence table
│  ├─ Corpus                    # collection + vocabulary/DTM
│  └─ CorpusContingencyTable    # corpus-level aggregation (lazy)
│
├─ API (Unified)
│  └─ assoc_score(metric(s), x::AssociationDataFormat; …)
│
├─ Metrics
│  ├─ Interface + dispatch
│  └─ 47 measures across families (PMI, LLR, LogDice, χ², OR, etc.)
│
├─ Analysis Functions
│  ├─ analyze_corpus / analyze_nodes
│  ├─ corpus_stats, token_distribution, vocab_coverage
│  ├─ write_results, export/load with metadata
│  ├─ batch_process_corpus, stream_corpus_analysis
│  └─ keyterms (TF-IDF; RAKE/TextRank placeholders)
│
└─ Advanced Features
   ├─ analyze_temporal, compare_subcorpora
   ├─ colloc_graph → gephi_graph (network export)
   └─ kwic (concordance)

Documentation Guide


Performance Benchmarks

Task             Size       Time    Memory
Single document  10K words  ~50 ms  10 MB
Small corpus     100 docs   ~2 s    50 MB
Large corpus     10K docs   ~30 s   500 MB
Streaming        Unlimited  Linear  Constant

Community and Support

  • 📧 Contact: alextantos@lit.auth.gr


Contributing

We welcome contributions! See our Contributing Guide for:

  • Bug reports and feature requests
  • Code contributions
  • Documentation improvements
  • Adding new metrics

License

TextAssociations.jl is licensed under the MIT License.

Acknowledgments

This package builds upon decades of research in corpus and computational linguistics. The references that follow are illustrative rather than exhaustive, highlighting some of the key contributions that have shaped the development of association measures and corpus analysis methods.

  • Evert, S. (2008). "Corpora and collocations." Corpus Linguistics: An International Handbook
  • Church, K. W., & Hanks, P. (1990). "Word association norms, mutual information, and lexicography." Computational Linguistics
  • Pecina, P. (2010). "Lexical association measures and collocation extraction." Language Resources and Evaluation

Index

Functions

TextAssociations.analyze_corpus (Method)
analyze_corpus(corpus::Corpus, node::AbstractString, metric::Type{<:AssociationMetric};
              windowsize::Int, minfreq::Int=5) -> DataFrame

Analyze a single node word across the entire corpus using the corpus's normalization settings. Returns a DataFrame with Node, Collocate, Score, Frequency, and DocFrequency columns.

TextAssociations.analyze_nodes (Method)
analyze_nodes(corpus::Corpus, nodes::Vector{String}, metrics::Vector{DataType};
             windowsize::Int, minfreq::Int=5, top_n::Int=100,
             parallel::Bool=false) -> MultiNodeAnalysis

Analyze multiple nodes with consistent normalization. Each result DataFrame includes a Node column and metadata.

TextAssociations.analyze_temporal (Method)
analyze_temporal(corpus::Corpus,
                        nodes::Vector{String},
                        time_field::Symbol,
                        metric::Type{<:AssociationMetric};
                        time_bins::Int=10,
                        windowsize::Int=5,
                        minfreq::Int=5) -> TemporalCorpusAnalysis

Analyze how word associations change over time.

TextAssociations.assoc_score (Method)
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
          inputstring::AbstractString,
          node::AbstractString;
          windowsize::Int,
          minfreq::Int=5,
          scores_only::Bool=false,
          norm_config::TextNorm=TextNorm(),
          tokens::Union{Nothing,Vector{String}}=nothing,
          kwargs...)

Convenience overload to compute multiple metrics directly from raw text.

TextAssociations.assoc_score (Method)
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
          x::AssociationDataFormat;
          scores_only::Bool=false,
          tokens::Union{Nothing,Vector{String}}=nothing,
          kwargs...)

Evaluate multiple metrics on CT or CCT.

  • Returns a DataFrame with one column per metric by default.
  • If scores_only=true, returns Dict{String,Vector{Float64}}.

TextAssociations.assoc_score (Method)
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}}, 
            corpus::Corpus, nodes::Vector{String}; 
            windowsize::Int=5, minfreq::Int=5, top_n::Int=100, kwargs...)

Evaluate multiple metrics on multiple nodes in a corpus. Returns a Dict{String,DataFrame} with combined metric results for each node.

TextAssociations.assoc_score (Method)
assoc_score(metricType::Type{<:AssociationMetric},
          inputstring::AbstractString,
          node::AbstractString;
          windowsize::Int,
          minfreq::Int=5,
          scores_only::Bool=false,
          norm_config::TextNorm=TextNorm(),
          tokens::Union{Nothing,Vector{String}}=nothing,
          kwargs...)

Convenience overload to compute a metric directly from a raw string.

TextAssociations.assoc_score (Method)
assoc_score(metricType::Type{<:AssociationMetric}, x::AssociationDataFormat;
          scores_only::Bool=false,
          tokens::Union{Nothing,Vector{String}}=nothing,
          kwargs...)

Evaluate a metric on any association data format (CT or CCT).

  • If the metric requires tokens (e.g., LexicalGravity), pass tokens=... or implement assoc_tokens(::YourType) to supply them automatically.
  • Returns a DataFrame by default: [:Node, :Collocate, :Frequency, :<MetricName>].
  • If scores_only=true, returns only the scores Vector.

TextAssociations.assoc_score (Method)
assoc_score(metricType::Type{<:AssociationMetric}, corpus::Corpus, node::AbstractString;
            windowsize::Int=5, minfreq::Int=5, kwargs...)

Evaluate a metric on a corpus - convenience method that delegates to analyze_corpus.

TextAssociations.assoc_score (Method)
assoc_score(metricType::Type{<:AssociationMetric}, corpus::Corpus, 
            nodes::Vector{String}; windowsize::Int=5, minfreq::Int=5, 
            top_n::Int=100, kwargs...)

Evaluate a metric on multiple nodes in a corpus. Returns a Dict{String,DataFrame} with results for each node.

TextAssociations.assoc_score (Method)
assoc_score(metric::Type{<:AssociationMetric}, corpus::Corpus;
            nodes::Vector{String}, windowsize::Int=5, minfreq::Int=5,
            top_n::Int=100, kwargs...)

Alternative syntax with nodes as keyword argument.

TextAssociations.batch_process_corpus (Method)
batch_process_corpus(corpus::Corpus,
                    node_file::AbstractString,
                    output_dir::AbstractString;
                    metrics::Vector{DataType}=[PMI, LogDice],
                    windowsize::Int,
                    minfreq::Int=5,
                    batch_size::Int=100)

Process a large list of node words in batches. Results include a Node column.

TextAssociations.colloc_graph (Method)
colloc_graph(corpus::Corpus,
                        seed_words::Vector{String};
                        metric::Type{<:AssociationMetric}=PMI,
                        depth::Int=2,
                        min_score::Float64=3.0,
                        max_neighbors::Int=20) -> CollocationNetwork

Build a collocation network starting from seed words.

TextAssociations.compare_subcorpora (Method)
compare_subcorpora(corpus::Corpus,
                  split_field::Symbol,
                  node::String,
                  metric::Type{<:AssociationMetric};
                  windowsize::Int=5,
                  minfreq::Int=5) -> SubcorpusComparison

Compare word associations across different subcorpora.

TextAssociations.cont_table (Method)
cont_table(input_doc::StringDocument, target_word::AbstractString,
          windowsize::Int=5, minfreq::Int=3) -> DataFrame

Compute the contingency table for a target word in a document. Note: target_word should already be normalized before calling this function.
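For readers new to contingency tables, the sketch below shows the standard 2×2 layout that association measures are typically computed from. It is an illustration under stated assumptions: the function name `cells_2x2` and the named-tuple fields are hypothetical, and the columns of the DataFrame returned by `cont_table` may be organized differently.

```julia
# Standard 2x2 contingency cells for a (node, collocate) pair.
# Illustrative only; not the layout returned by cont_table.
function cells_2x2(f_xy::Int, f_x::Int, f_y::Int, n::Int)
    a = f_xy                   # node and collocate together
    b = f_x - f_xy             # node without the collocate
    c = f_y - f_xy             # collocate without the node
    d = n - f_x - f_y + f_xy   # neither node nor collocate
    return (a=a, b=b, c=c, d=d)
end

# Hypothetical counts: joint 10, node 20, collocate 50, total 1000
cells = cells_2x2(10, 20, 50, 1000)
println(cells)
```

The four cells always sum to the total number of observations, which is a quick sanity check on any table built this way.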

TextAssociations.eval_lexicalgravity (Method)
eval_lexicalgravity(data::AssociationDataFormat; 
                   tokens::Vector{String},
                   formula::Symbol=:original)

Compute Lexical Gravity measure based on Daudaravičius & Marcinkevičienė (2004).

Arguments

  • data: AssociationDataFormat with co-occurrence data
  • tokens: Required tokenized text (provided by assoc_score when called through API)
  • formula: Which formula to use:
    • :original - The main formula from the paper: G→(w1,w2) = log(f→×n+/f1) + log(f←×n-/f2)
    • :simplified - Simplified version: G = log₂((f²×span)/(f1×f2))
    • :pmi_weighted - PMI-style weighting: G = f(w1,w2) × log((f×N)/(f1×f2))

Original Formula from Paper

G→(w1,w2) = log(f→(w1,w2)/f(w1) × n+(w1)) + log(f←(w1,w2)/f(w2) × n-(w2))

Where:

  • f→(w1,w2) = frequency of w2 following w1 within window
  • f←(w1,w2) = frequency of w1 preceding w2 within window
  • n+(w1) = number of different word types that follow w1
  • n-(w2) = number of different word types that precede w2
  • f(w1), f(w2) = total frequencies of words
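The formula above can be evaluated directly once the counts are known. The following is a minimal sketch, not the package's `eval_lexicalgravity`; the function name `gravity_sketch`, its keyword names, and the example counts are hypothetical.

```julia
# G = log(f_fwd * n_plus / f1) + log(f_bwd * n_minus / f2)
# f_fwd, f_bwd: directional co-occurrence frequencies
# n_plus, n_minus: number of distinct types following w1 / preceding w2
# f1, f2: total frequencies of w1 and w2
function gravity_sketch(; f_fwd, f_bwd, f1, f2, n_plus, n_minus)
    return log(f_fwd * n_plus / f1) + log(f_bwd * n_minus / f2)
end

# Hypothetical counts for one ordered word pair
g = gravity_sketch(f_fwd=10, f_bwd=5, f1=20, f2=10, n_plus=4, n_minus=6)
println(g)
```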

Note

This function expects tokens to be provided. When called through assoc_score(), tokens are automatically fetched based on the NeedsTokens trait.

TextAssociations.find_following_words (Method)
find_following_words(doc::StringDocument, word::String, window::Int) -> Set{String}

Find unique words that appear within window words after each occurrence of word in the document.
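The windowing idea can be sketched on a plain token vector. This is an assumption-based illustration rather than the package's code: `following_words_sketch` is a hypothetical name, and it operates on a `Vector{String}` instead of a `StringDocument`.

```julia
# Collect the unique tokens found within `window` positions after
# each occurrence of `word`.
function following_words_sketch(tokens::Vector{String}, word::String, window::Int)
    found = Set{String}()
    for (i, tok) in enumerate(tokens)
        tok == word || continue
        last_idx = min(i + window, length(tokens))  # do not run past the end
        for j in (i + 1):last_idx
            push!(found, tokens[j])
        end
    end
    return found
end

tokens = String.(split("deep learning is a subset of machine learning"))
println(following_words_sketch(tokens, "learning", 2))
```

The mirror-image search for preceding words only changes the index range to the `window` positions before `i`.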

TextAssociations.find_prior_words (Method)
find_prior_words(doc::StringDocument, word::String, window::Int) -> Set{String}

Find unique words that appear within window words before each occurrence of word in the document.

TextAssociations.gephi_graph (Method)
gephi_graph(network::CollocationNetwork,
                       nodes_file::String,
                       edges_file::String)

Export network for visualization in Gephi or similar tools.
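This page does not specify the columns that `gephi_graph` writes. As a rough picture of what a nodes/edges export can look like, the sketch below writes two CSV files using the `Id,Label` and `Source,Target,Weight` headers that Gephi's spreadsheet import recognizes. The function `write_gephi_sketch` and the edge data are hypothetical.

```julia
# Write a nodes table and an edges table as CSV.
# Tokens containing commas or quotes would need proper CSV quoting,
# which this sketch omits.
function write_gephi_sketch(edges::Vector{Tuple{String,String,Float64}},
                            nodes_file::String, edges_file::String)
    node_ids = sort(unique(vcat(first.(edges), getindex.(edges, 2))))
    open(nodes_file, "w") do io
        println(io, "Id,Label")
        for id in node_ids
            println(io, id, ",", id)
        end
    end
    open(edges_file, "w") do io
        println(io, "Source,Target,Weight")
        for (src, dst, w) in edges
            println(io, src, ",", dst, ",", w)
        end
    end
    return nothing
end

dir = mktempdir()
edges = [("machine", "learning", 4.2), ("deep", "learning", 3.7)]
write_gephi_sketch(edges, joinpath(dir, "nodes.csv"), joinpath(dir, "edges.csv"))
print(read(joinpath(dir, "edges.csv"), String))
```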

TextAssociations.keyterms (Method)
keyterms(corpus::Corpus;
                method::Symbol=:tfidf,
                num_keywords::Int=50,
                min_doc_freq::Int=3,
                max_doc_freq_ratio::Float64=0.5) -> DataFrame

Extract keywords from a corpus using the method selected by the method keyword (default :tfidf).

TextAssociations.kwic (Method)
kwic(corpus::Corpus,
                    node::String;
                    context_size::Int=50,
                    max_lines::Int=1000) -> Concordance

Generate KWIC concordance for a node word.
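A keyword-in-context line simply pairs each hit with its left and right context. The sketch below is a simplified illustration, not the package's `kwic`: `kwic_sketch` is a hypothetical name, it works on a token vector, and it measures context in tokens, whereas the unit of `context_size` above is not stated on this page.

```julia
# Build one "left [node] right" line per occurrence of `node`.
function kwic_sketch(tokens::Vector{String}, node::String; context::Int=3)
    lines = String[]
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        left = join(tokens[max(1, i - context):i-1], " ")
        right = join(tokens[i+1:min(length(tokens), i + context)], " ")
        push!(lines, string(left, " [", tok, "] ", right))
    end
    return lines
end

tokens = String.(split("neural networks power deep learning systems"))
foreach(println, kwic_sketch(tokens, "learning"; context=2))
```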

TextAssociations.lexical_gravity_analysis (Method)
lexical_gravity_analysis(data::AssociationDataFormat; 
                        tokens::Union{Nothing,Vector{String}}=nothing)

Comprehensive analysis using all gravity formulas for comparison. Returns results from all three formulas plus directional analysis.

Note: If tokens are not provided, the function attempts to fetch them using assoc_tokens.

TextAssociations.normalize_node (Method)
normalize_node(node::AbstractString, config::TextNorm) -> String

Normalize a node word according to the given TextNorm configuration. This is the single source of truth for node normalization.

TextAssociations.prep_string (Function)
prep_string(input_path::AbstractString, config::TextNorm) -> StringDocument

Prepare and preprocess text from various sources into a StringDocument.

Arguments

  • input_path: File path, directory path, or raw text string.

Preprocessing options

Uses TextNorm configuration for all preprocessing options.

Returns

A preprocessed StringDocument object suitable for downstream corpus analysis.

TextAssociations.read_corpus_df (Method)
read_corpus_df(df::DataFrame; kwargs...) -> Corpus

Load corpus directly from a DataFrame with consistent normalization. Metadata columns are stored at the corpus level with document indices.

TextAssociations.stream_corpus_analysis (Method)
stream_corpus_analysis(file_pattern::AbstractString,
                      node::AbstractString,
                      metric::Type{<:AssociationMetric};
                      windowsize::Int,
                      chunk_size::Int=1000)

Stream-process large corpora without loading everything into memory.
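The general pattern behind chunked streaming can be sketched with Base alone. This is not how `stream_corpus_analysis` is implemented (its internals are not described here); `stream_count_sketch` is a hypothetical helper that only counts occurrences of a node while holding one chunk of lines in memory at a time.

```julia
# Count occurrences of `node` in a file, reading `chunk_size` lines at a time.
function stream_count_sketch(path::AbstractString, node::AbstractString;
                             chunk_size::Int=1000)
    total = 0
    # eachline is lazy; partition groups its lines into chunks on demand
    for chunk in Iterators.partition(eachline(path), chunk_size)
        for line in chunk
            total += count(==(node), split(lowercase(line)))
        end
    end
    return total
end

# Small temporary file standing in for a large corpus
path, io = mktemp()
write(io, "Deep learning\nmachine learning\nno match here\n")
close(io)
println(stream_count_sketch(path, "learning"; chunk_size=2))
```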

TextAssociations.strip_diacritics (Method)
strip_diacritics(s::AbstractString; target_form::Symbol = :NFC) -> String

Remove all combining diacritics (e.g., Greek tonos, dialytika) using Unicode normalization.

Internally:

  • canonically decomposes as needed,
  • strips combining marks (Mn),
  • and normalizes to target_form (default :NFC).

If you don’t care about the final form, leave target_form at the default.
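For comparison, Julia's standard library offers a similar operation through `Unicode.normalize` with `stripmark=true`. The sketch below uses only that call; `strip_marks_sketch` is a hypothetical name, and it does not reproduce the `target_form` option of `strip_diacritics`.

```julia
using Unicode

# Strip combining marks (accents, tonos, dialytika) with the standard library.
strip_marks_sketch(s::AbstractString) = Unicode.normalize(s; stripmark=true)

println(strip_marks_sketch("café naïve"))
println(strip_marks_sketch("ένα"))
```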

Example:

julia> strip_diacritics("ένα το χελιδόϊι")
"ενα το χελιδοιι"

TextAssociations.token_distribution (Method)
token_distribution(text::AbstractString) -> DataFrame

Analyze the distribution of tokens in a string and return a DataFrame with info about absolute and relative token frequencies.
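The kind of information involved can be sketched without the package. The function below is a hypothetical stand-in (`token_freqs_sketch`) that assumes lowercasing and whitespace tokenization; `token_distribution` itself returns a DataFrame and may tokenize and name its columns differently.

```julia
# Absolute and relative frequency of each whitespace-delimited, lowercased token.
function token_freqs_sketch(text::AbstractString)
    tokens = String.(split(lowercase(text)))
    counts = Dict{String,Int}()
    for t in tokens
        counts[t] = get(counts, t, 0) + 1
    end
    total = length(tokens)
    rows = [(token=t, abs_freq=c, rel_freq=c / total) for (t, c) in counts]
    # Most frequent first; ties broken alphabetically for a stable order
    return sort(rows; by=r -> (-r.abs_freq, r.token))
end

rows = token_freqs_sketch("the cat and the hat")
foreach(println, rows)
```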

TextAssociations.vocab_coverage (Method)
vocab_coverage(corpus::Corpus; 
                         percentiles=0.01:0.01:1.0) -> DataFrame

Calculate vocabulary coverage curve showing how many words are needed to cover various percentages of the corpus. Uses the corpus vocabulary for consistent calculations.
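A coverage curve answers the question "how many of the most frequent word types account for a share p of all tokens?". The sketch below shows that computation on a plain frequency vector; `types_needed_sketch` and the frequencies are hypothetical, and the DataFrame produced by `vocab_coverage` may contain additional columns.

```julia
# Number of most-frequent word types needed to cover a share `p` of all tokens.
function types_needed_sketch(freqs::Vector{Int}, p::Real)
    cum = cumsum(sort(freqs; rev=true))       # cumulative tokens covered
    return findfirst(>=(p * cum[end]), cum)   # first rank reaching the target
end

freqs = [50, 30, 10, 5, 5]   # hypothetical type frequencies, 100 tokens in total
for p in (0.5, 0.85, 1.0)
    println("coverage ", p, " -> ", types_needed_sketch(freqs, p), " types")
end
```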

TextAssociations.write_results (Method)
write_results(analysis::MultiNodeAnalysis, path::AbstractString; format::Symbol=:csv)

Export analysis results to a file. Results include a Node column.
