TextAssociations

Install

You can install TextAssociations.jl directly from its GitHub repository using Julia’s package manager. In the Julia REPL, press ] to enter Pkg mode and run:

pkg> add https://github.com/atantos/TextAssociations.jl

Once the package is registered in the Julia General registry, you will be able to install it more simply with:

pkg> add TextAssociations

See Installation for detailed instructions and troubleshooting.

TextAssociations.jl

A Julia package for word association measures, collocation analysis and descriptive statistics across text and corpus levels.

TextAssociations.jl is a comprehensive framework for computing word association metrics, performing collocation analysis, and producing a wide range of descriptive statistical indices at both the text and corpus levels. With 47 implemented association measures, it is designed to support research in computational linguistics, corpus linguistics, natural language processing, and the digital humanities.

Package Highlights

47 association metrics across statistical, information-theoretic, similarity and epidemiological families — including PMI, LogDice, LLR, Chi-square, Odds Ratio, Lexical Gravity, and many more
Efficient processing of large corpora with lazy evaluation and caching
Corpus analysis at scale via batch and streaming modes, with optional parallelism
Multilingual support with proper Unicode handling and diacritic normalization
Advanced features:
- Temporal association analysis and trend detection
- Subcorpus comparisons with effect sizes and statistical testing
- Collocation network construction and export (e.g., to Gephi)
- KWIC concordances for contextual exploration
- Keyword extraction (currently TF-IDF, with RAKE and TextRank planned)
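To make the idea of a KWIC (keyword in context) concordance concrete, here is a minimal sketch in plain Julia. The function `toy_kwic` and its `width` parameter are hypothetical names chosen for illustration; the package's own `kwic` function may tokenize, align, and return results differently.

```julia
# Return one line per occurrence of `node`, with up to `width` tokens of
# context on each side. Hypothetical helper, not the package's `kwic`.
function toy_kwic(text::AbstractString, node::AbstractString; width::Int=2)
    tokens = split(lowercase(text))          # naive whitespace tokenization
    lines = String[]
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        left = join(tokens[max(1, i - width):i-1], " ")
        right = join(tokens[i+1:min(end, i + width)], " ")
        push!(lines, string(left, " [", tok, "] ", right))
    end
    return lines
end

hits = toy_kwic("deep learning is a subset of machine learning", "learning"; width=2)
foreach(println, hits)
```

Each returned line places the node word in brackets so that left and right contexts can be scanned quickly.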

Quick Start

After installation, you can immediately begin analyzing text and exploring collocations with just a few lines of code. The example below demonstrates how to create a contingency table for a target word, compute multiple association measures, and display the top collocates.

For a step-by-step explanation of what happens in each stage and detailed guidance on how to use the package effectively, see the Tutorial section of this documentation.

using TextAssociations

# Analyze collocations in text
text = """
Machine learning algorithms learn patterns from data.
Deep learning is a subset of machine learning.
Neural networks power deep learning systems.
"""

# Find collocations of "learning"
ct = ContingencyTable(text, "learning", windowsize=3, minfreq=1)

# Calculate multiple metrics
results = assoc_score([PMI, LogDice, LLR], ct)

# Display top collocations
using DataFrames
sort!(results, :PMI, rev=true)
first(results, 5)

5×6 DataFrame
 Row │ Node      Collocate  Frequency  PMI       LogDice  LLR
     │ String    String     Int64      Float64   Float64  Float64
─────┼───────────────────────────────────────────────────────────
   1 │ learning  networks           1  -1.09861  13.0     0.0
   2 │ learning  power              1  -1.09861  13.0     0.0
   3 │ learning  subset             1  -1.09861  13.0     0.0
   4 │ learning  deep               2  -1.38629  13.415   0.0
   5 │ learning  machine            2  -1.38629  13.415   0.0
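For readers who want to see what a windowed co-occurrence count involves, the following is a rough sketch in plain Julia. The helper `window_cooccurrences` is hypothetical and is not how `ContingencyTable` is implemented: it uses a naive letters-only tokenizer and lets the window cross sentence boundaries, so its counts need not match the package's output.

```julia
# Count how often each word appears within `windowsize` tokens of `node`.
# Hypothetical helper for illustration; not part of TextAssociations.jl.
function window_cooccurrences(text::AbstractString, node::AbstractString; windowsize::Int=3)
    # Naive tokenization: lowercase, then keep runs of letters only.
    tokens = [m.match for m in eachmatch(r"\p{L}+", lowercase(text))]
    counts = Dict{String,Int}()
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        lo = max(1, i - windowsize)
        hi = min(length(tokens), i + windowsize)
        for j in lo:hi
            j == i && continue               # skip the node itself
            w = String(tokens[j])
            counts[w] = get(counts, w, 0) + 1
        end
    end
    return counts
end

text = """
Machine learning algorithms learn patterns from data.
Deep learning is a subset of machine learning.
Neural networks power deep learning systems.
"""

counts = window_cooccurrences(text, "learning"; windowsize=3)
for (word, n) in sort(collect(counts); by=last, rev=true)
    println(rpad(word, 12), n)
end
```

Counts like these are the raw material from which association measures are computed.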

Key Features

📊 Comprehensive Metric Collection

The package provides metrics from several families of measures. The examples below are representative; the full list of implemented measures, along with their formulae, is provided in the Measures section.

Information-theoretic: PMI, PPMI, Mutual Information variants
Statistical: Log-likelihood ratio, Chi-square, T-score, Z-score
Similarity-based: Dice, Jaccard, Cosine similarity
Effect size: Odds ratio, Relative risk, Cohen's d
Specialized: Lexical Gravity, Delta P, Minimum Sensitivity
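As a rough illustration of how a few of these measures relate to raw frequencies, the sketch below uses common textbook definitions (PMI with a base-2 logarithm, Dice, LogDice as 14 plus the base-2 log of Dice, and the T-score). These are assumptions made for the example; the package's own formulae, including the logarithm base, are documented in the Measures section and may differ. The input frequencies are invented.

```julia
# Textbook-style definitions computed from raw frequencies.
# a  : co-occurrence frequency of node and collocate
# fx : frequency of the node, fy : frequency of the collocate
# n  : total number of tokens in the sample
expected(fx, fy, n) = fx * fy / n
pmi(a, fx, fy, n) = log2(a / expected(fx, fy, n))
dice(a, fx, fy) = 2a / (fx + fy)
logdice(a, fx, fy) = 14 + log2(dice(a, fx, fy))
tscore(a, fx, fy, n) = (a - expected(fx, fy, n)) / sqrt(a)

# Invented frequencies, chosen so the arithmetic is easy to follow.
a, fx, fy, n = 8, 16, 32, 1024

println("Expected = ", expected(fx, fy, n))
println("PMI      = ", pmi(a, fx, fy, n))
println("Dice     = ", dice(a, fx, fy))
println("LogDice  = ", logdice(a, fx, fy))
println("T-score  = ", tscore(a, fx, fy, n))
```

With these numbers the pair co-occurs 16 times more often than chance would predict, so the base-2 PMI is 4.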

🚀 Performance and Scalability

Lazy evaluation: Computations are deferred until first needed, and their results are cached for reuse
Memory efficient: Stream processing for large corpora
Parallel processing: Built-in support for distributed computing
Optimized algorithms: Efficient implementations for all metrics
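The idea of deferring a computation and caching its result can be sketched in a few lines of plain Julia. `LazyCell` and `force!` are hypothetical names for this example; the package's `LazyProcess` and `LazyInput` types may be designed quite differently.

```julia
# A minimal lazy cell: wraps a zero-argument function and caches its result.
# Hypothetical type for illustration only.
mutable struct LazyCell{F}
    compute::F
    value::Any
    done::Bool
end
LazyCell(f) = LazyCell(f, nothing, false)

# Run the wrapped function on first access; reuse the cached value afterwards.
function force!(cell::LazyCell)
    if !cell.done
        cell.value = cell.compute()
        cell.done = true
    end
    return cell.value
end

calls = Ref(0)                    # counts how often the computation runs
cell = LazyCell() do
    calls[] += 1
    sum(abs2, 1:1_000)            # stand-in for an expensive computation
end

first_result = force!(cell)       # computes
second_result = force!(cell)      # served from the cache
println("result = ", first_result, ", computed ", calls[], " time(s)")
```

Nothing runs when the cell is constructed; the work happens once, on the first call to `force!`.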

🔧 Flexible and Extensible

Multiple input formats: Raw text, files, directories, CSV, JSON
Customizable preprocessing: Full control over text normalization
Extensible design: Easy to add new metrics or modify existing ones
Rich output options: DataFrames, CSV, JSON, Excel export
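To illustrate the kind of normalization this refers to, the sketch below folds case and strips diacritics using `Unicode.normalize` from Julia's standard library. The helper name `fold` is hypothetical, and this is not a description of what `prep_string` or `strip_diacritics` do internally.

```julia
using Unicode

# Case-fold and strip combining marks so that accented and unaccented
# spellings compare equal. Hypothetical helper for illustration.
fold(s::AbstractString) = Unicode.normalize(s; casefold=true, stripmark=true)

println(fold("Café"))
println(fold("naïve") == fold("NAIVE"))
```

Whether to strip diacritics is a linguistic decision: in some languages it merges distinct words, so it is usually best kept as an explicit option.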

Basic Usage

Single Document Analysis

using TextAssociations
using TextAnalysis: text
using DataFrames

text_sample = "Machine learning algorithms learn from data. Deep learning uses neural networks."

doc = prep_string(text_sample, TextNorm(
    strip_punctuation=true,
    strip_case=true
))

ct = ContingencyTable(text(doc), "learning"; windowsize=5, minfreq=1)
pmi_scores = assoc_score(PMI, ct)
println("Found $(nrow(pmi_scores)) collocates")

Found 9 collocates

Corpus-Level Analysis

using TextAssociations

# Create a temporary mini-corpus with longer texts
dir = mktempdir()

files = Dict(
    "doc1.txt" => """
    Computational linguistics increasingly intersects with innovation practice.
    Teams use data to evaluate hypotheses, prototype ideas quickly, and measure impact with reproducible pipelines.
    In modern research workflows, small models are validated against well-defined tasks before scaling, ensuring that innovation is more than a buzzword—it is a methodical, testable process.
    When AI systems are involved, documentation and transparent governance help peers replicate results and trust conclusions.
    """,

    "doc2.txt" => """
    Successful innovation rarely happens in isolation.
    It emerges from an ecosystem of universities, startups, industry labs, and public institutions that collaborate and share partial results early.
    Well-run projects cultivate collaboration rituals—design reviews, error analyses, and postmortems—so ideas move from promising theory to usable tools.
    Open exchange reduces duplication and accelerates learning across the ecosystem.
    """,

    "doc3.txt" => """
    Prototyping is the bridge between research and deployment.
    A minimal prototype clarifies the problem, surfaces risks, and reveals unknown edge cases.
    From there, teams harden the system for scalability, add observability, and evaluate ethical trade-offs such as bias, privacy, and safety.
    A principled evaluation plan is part of the prototype, not an afterthought.
    """,

    "doc4.txt" => """
    Education benefits when innovation is human-centered.
    Instructors can combine classic readings with hands-on labs that trace data through each step of the pipeline.
    Open-source examples and clear rubrics help students reason about uncertainty, interpret model behavior, and articulate the limits of automation.
    The goal is durable understanding and real-world impact, not just higher benchmark scores.
    """
)

# Write files
for (name, content) in files
    open(joinpath(dir, name), "w") do io
        write(io, strip(content))
    end
end

# Load the corpus from the real path we just created
corpus = read_corpus(dir)

# Analyze across entire corpus
results = analyze_corpus(corpus, "innovation", PMI,
    windowsize=5,
    minfreq=1
)

# Get corpus statistics
stats = corpus_stats(corpus)
println("Documents: $(stats[:num_documents])")
println("Vocabulary: $(stats[:vocabulary_size])")

Loaded 4 documents
Documents: 4
Vocabulary: 175
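The statistics reported above can be approximated by hand, which is a useful sanity check on small corpora. The sketch below is plain Julia with a naive letters-only tokenizer. The `:total_tokens` key is an assumption added for the example, and the package's own tokenization may yield different numbers.

```julia
# Toy corpus statistics: document count, token count, vocabulary size.
docs = ["Open exchange reduces duplication.",
        "Open-source examples help students.",
        "Teams use data to evaluate hypotheses."]

# Naive tokenizer: lowercase, then keep runs of letters only.
tokenize(s) = [String(m.match) for m in eachmatch(r"\p{L}+", lowercase(s))]

tokens_per_doc = map(tokenize, docs)
all_tokens = reduce(vcat, tokens_per_doc)
vocabulary = Set(all_tokens)

toy_stats = Dict(
    :num_documents => length(docs),
    :total_tokens => length(all_tokens),      # key name is an assumption
    :vocabulary_size => length(vocabulary),
)

println("Documents: $(toy_stats[:num_documents])")
println("Tokens: $(toy_stats[:total_tokens])")
println("Vocabulary: $(toy_stats[:vocabulary_size])")
```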

Advanced Features

Collocation Networks

Build networks of related terms:

network = colloc_graph(
    corpus, ["artificial", "intelligence"],
    metric=PMI, depth=2
)
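Conceptually, a collocation network is a weighted edge list in which each node–collocate pair becomes an edge and the association score becomes its weight. The sketch below writes such a list as CSV with a Source/Target/Weight header, a layout that Gephi's spreadsheet importer accepts. The scores are invented, and `write_edge_list` is a hypothetical helper, not the package's `gephi_graph`.

```julia
# Invented scored pairs: (node, collocate, score).
edges = [("artificial", "intelligence", 5.2),
         ("intelligence", "machine", 3.1),
         ("artificial", "neural", 2.7)]

# Write a CSV edge list with a Source/Target/Weight header.
function write_edge_list(io::IO, edges)
    println(io, "Source,Target,Weight")
    for (src, dst, w) in edges
        println(io, src, ',', dst, ',', w)
    end
end

buf = IOBuffer()
write_edge_list(buf, edges)
csv = String(take!(buf))
print(csv)
```

To write to disk instead, pass a file handle, for example inside `open("edges.csv", "w") do io ... end`.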

Comparative Analysis

Compare associations across subcorpora:

# Requires a corpus whose documents carry a :category metadata field
comparison = compare_subcorpora(
    corpus, :category, "technology", PMI
)
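One common way to quantify a difference between two groups of scores is Cohen's d with a pooled standard deviation. The sketch below is a generic implementation with invented numbers; whether and how `compare_subcorpora` computes effect sizes this way is an assumption, not something stated here.

```julia
using Statistics

# Cohen's d with a pooled standard deviation.
function cohens_d(x, y)
    nx, ny = length(x), length(y)
    pooled = sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
    return (mean(x) - mean(y)) / pooled
end

# Invented per-document scores for two subcorpora.
tech_scores = [2.0, 4.0, 6.0]
other_scores = [1.0, 3.0, 5.0]

d = cohens_d(tech_scores, other_scores)
println("Cohen's d = ", d)
```

Here the means differ by 1 and the pooled standard deviation is 2, giving d = 0.5.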

Temporal Analysis

Track how word associations change over time:

# Requires a corpus whose documents carry a :year metadata field
temporal_analysis = analyze_temporal(
    corpus, ["digital", "transformation"], :year, PMI
)
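A simple way to summarize a trend is the least-squares slope of the association score against time. The sketch below is plain Julia with invented scores. It only illustrates the idea and is not how `analyze_temporal` or `compute_association_trends` are implemented.

```julia
# Ordinary least-squares slope of y against x.
function ols_slope(x, y)
    mx, my = sum(x) / length(x), sum(y) / length(y)
    num = sum((xi - mx) * (yi - my) for (xi, yi) in zip(x, y))
    den = sum((xi - mx)^2 for xi in x)
    return num / den
end

years = [2019, 2020, 2021, 2022]
scores = [1.0, 1.5, 2.0, 2.5]     # invented association scores per year

slope = ols_slope(years, scores)
println("Change in score per year: ", slope)
```

A positive slope suggests the association strengthens over time; with only a handful of periods, it should be read as a descriptive summary rather than a test.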

Package Architecture

TextAssociations.jl
│
├─ Types & Basics
│  ├─ AssociationMetric / AssociationDataFormat
│  ├─ TextNorm (single source of truth for normalization)
│  └─ LazyProcess / LazyInput (lazy evaluation & caching)
│
├─ Utils
│  ├─ I/O & encoding (read_text_smart)
│  ├─ Text processing (normalize_node, prep_string, strip_diacritics)
│  ├─ Statistical helpers (available_metrics, log_safe)
│  └─ Text analysis helpers (token find/count utilities)
│
├─ Core Data Structures
│  ├─ ContingencyTable          # per-document co-occurrence table
│  ├─ Corpus                    # collection + vocabulary/DTM
│  └─ CorpusContingencyTable    # corpus-level aggregation (lazy)
│
├─ API (Unified)
│  └─ assoc_score(metric(s), x::AssociationDataFormat; …)
│
├─ Metrics
│  ├─ Interface + dispatch
│  └─ 47 measures across families (PMI, LLR, LogDice, χ², OR, etc.)
│
├─ Analysis Functions
│  ├─ analyze_corpus / analyze_nodes
│  ├─ corpus_stats, token_distribution, vocab_coverage
│  ├─ write_results, export/load with metadata
│  ├─ batch_process_corpus, stream_corpus_analysis
│  └─ keyterms (TF-IDF; RAKE/TextRank placeholders)
│
└─ Advanced Features
   ├─ analyze_temporal, compare_subcorpora
   ├─ colloc_graph → gephi_graph (network export)
   └─ kwic (concordance)
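The "interface + dispatch" layer in the diagram can be pictured with a small sketch based on Julia's multiple dispatch. Everything here (`ToyMetric`, `ToyPMI`, `ToyDice`, `score`) is hypothetical and is only meant to show how metric types can select an implementation and how a vector of metrics can fan out to the single-metric methods. The package's actual `assoc_score` signature and internals may differ.

```julia
# Hypothetical miniature of a metric interface built on multiple dispatch.
abstract type ToyMetric end
struct ToyPMI <: ToyMetric end
struct ToyDice <: ToyMetric end

# Each metric supplies a `score` method for its own type.
score(::Type{ToyPMI}, a, fx, fy, n) = log2(a * n / (fx * fy))
score(::Type{ToyDice}, a, fx, fy, n) = 2a / (fx + fy)

# A vector of metric types fans out to the single-metric methods.
score(metrics::AbstractVector, args...) =
    Dict(nameof(m) => score(m, args...) for m in metrics)

result = score([ToyPMI, ToyDice], 8, 16, 32, 1024)
println(result)
```

Adding a metric in this style means defining one new type and one new `score` method; no existing code needs to change.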

Documentation Guide

🚀 Getting Started

📖 User Guide

📊 Metrics

🔬 Advanced


Contributing

We welcome contributions! See our Contributing Guide for:

Bug reports and feature requests
Code contributions
Documentation improvements
Adding new metrics

License

TextAssociations.jl is licensed under the MIT License.

Acknowledgments

This package builds upon decades of research in corpus and computational linguistics. The references that follow are illustrative rather than exhaustive, highlighting some of the key contributions that have shaped the development of association measures and corpus analysis methods.

Evert, S. (2008). "Corpora and collocations." Corpus Linguistics: An International Handbook
Church, K. W., & Hanks, P. (1990). "Word association norms, mutual information, and lexicography." Computational Linguistics
Pecina, P. (2010). "Lexical association measures and collocation extraction." Language Resources and Evaluation

Index

TextAssociations.AssociationDataFormat
TextAssociations.AssociationMetric
TextAssociations.CollocationNetwork
TextAssociations.Concordance
TextAssociations.ContingencyTable
TextAssociations.Corpus
TextAssociations.CorpusContingencyTable
TextAssociations.LazyInput
TextAssociations.LazyProcess
TextAssociations.MultiNodeAnalysis
TextAssociations.SubcorpusComparison
TextAssociations.TemporalCorpusAnalysis
TextAssociations.TextNorm
TextAssociations._gravity_directional_analysis
TextAssociations._gravity_original_formula
TextAssociations._gravity_pmi_weighted
TextAssociations._gravity_simplified_formula
TextAssociations.aggregate_contingency_tables
TextAssociations.analyze_corpus
TextAssociations.analyze_nodes
TextAssociations.analyze_temporal
TextAssociations.assoc_score
TextAssociations.available_metrics
TextAssociations.batch_process_corpus
TextAssociations.build_document_term_matrix
TextAssociations.build_vocab
TextAssociations.cached_data
TextAssociations.calculate_effect_sizes
TextAssociations.colloc_graph
TextAssociations.compare_subcorpora
TextAssociations.compute_association_trends
TextAssociations.cont_table
TextAssociations.corpus_stats
TextAssociations.count_substrings
TextAssociations.count_word_frequency
TextAssociations.coverage_summary
TextAssociations.document
TextAssociations.eval_lexicalgravity
TextAssociations.extract_rake_keywords
TextAssociations.extract_textrank_keywords
TextAssociations.extract_tfidf_keywords
TextAssociations.find_following_words
TextAssociations.find_prior_words
TextAssociations.gephi_graph
TextAssociations.keyterms
TextAssociations.kwic
TextAssociations.lexical_gravity_analysis
TextAssociations.log_safe
TextAssociations.normalize_node
TextAssociations.perform_statistical_tests
TextAssociations.prep_string
TextAssociations.read_corpus
TextAssociations.read_corpus_df
TextAssociations.stream_corpus_analysis
TextAssociations.strip_diacritics
TextAssociations.token_distribution
TextAssociations.tostringvector
TextAssociations.vocab_coverage
TextAssociations.write_results

Functions

TextAssociations._gravity_directional_analysis — Method

Analyze directional preferences (left vs right) for collocations.
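As a rough picture of what a left-versus-right preference means, the sketch below counts how often a collocate falls before or after the node within a window. `left_right_counts` is a hypothetical helper in plain Julia; it does not reproduce the internal method documented here.

```julia
# Count how often `collocate` appears to the left vs. right of `node`
# within `windowsize` tokens. Hypothetical helper for illustration.
function left_right_counts(tokens, node, collocate; windowsize=3)
    left = right = 0
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        for j in max(1, i - windowsize):min(length(tokens), i + windowsize)
            tokens[j] == collocate || continue
            j < i && (left += 1)
            j > i && (right += 1)
        end
    end
    return (left=left, right=right)
end

tokens = split("deep learning is a subset of machine learning")
direction = left_right_counts(tokens, "learning", "machine"; windowsize=3)
println(direction)
```

In this sentence "machine" only ever precedes "learning", so all of its occurrences are counted on the left.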