TextAssociations
Documentation for TextAssociations.
Install
You can install TextAssociations.jl directly from its GitHub repository using Julia's package manager. In the Julia REPL, press ] to enter Pkg mode and run:
pkg> add https://github.com/atantos/TextAssociations.jl
Once the package is registered in the Julia General registry, you will be able to install it more simply with:
pkg> add TextAssociations
See Installation for detailed instructions and troubleshooting.
TextAssociations.jl
A Julia package for word association measures, collocation analysis and descriptive statistics across text and corpus levels.
Overview
TextAssociations.jl is a comprehensive framework for computing word association metrics, performing collocation analysis, and producing a wide range of descriptive statistical indices at both the text and corpus levels. With 47 implemented association measures, it is designed to support research in computational linguistics, corpus linguistics, natural language processing, and the digital humanities.
!!! tip Package Highlights
47 association metrics across statistical, information-theoretic, similarity and epidemiological families — including PMI, LogDice, LLR, Chi-square, Odds Ratio, Lexical Gravity, and many more
Efficient processing of large corpora with lazy evaluation and caching
Corpus analysis at scale via batch and streaming modes, with optional parallelism
Multilingual support with proper Unicode handling and diacritic normalization
Advanced features:
- Temporal association analysis and trend detection
- Subcorpus comparisons with effect sizes and statistical testing
- Collocation network construction and export (e.g., to Gephi)
- KWIC concordances for contextual exploration
- Keyword extraction (currently TF-IDF, with RAKE and TextRank planned)
Quick Start
After installation, you can immediately begin analyzing text and exploring collocations with just a few lines of code. The example below demonstrates how to create a contingency table for a target word, compute multiple association measures, and display the top collocates.
For a step-by-step explanation of what happens in each stage and detailed guidance on how to use the package effectively, see the Tutorial section of this documentation.
using TextAssociations
# Analyze collocations in text
text = """
Machine learning algorithms learn patterns from data.
Deep learning is a subset of machine learning.
Neural networks power deep learning systems.
"""
# Find collocations of "learning"
ct = ContingencyTable(text, "learning", windowsize=3, minfreq=1)
# Calculate multiple metrics
results = assoc_score([PMI, LogDice, LLR], ct)
# Display top collocations
using DataFrames
sort!(results, :PMI, rev=true)
first(results, 5)
Row | Node | Collocate | Frequency | PMI | LogDice | LLR |
---|---|---|---|---|---|---|
 | String | String | Int64 | Float64 | Float64 | Float64 |
1 | learning | networks | 1 | -1.09861 | 13.0 | 0.0 |
2 | learning | power | 1 | -1.09861 | 13.0 | 0.0 |
3 | learning | subset | 1 | -1.09861 | 13.0 | 0.0 |
4 | learning | deep | 2 | -1.38629 | 13.415 | 0.0 |
5 | learning | machine | 2 | -1.38629 | 13.415 | 0.0 |
Key Features
📊 Comprehensive Metric Collection
The package provides metrics from several families of measures. The examples below are representative; the full list of implemented measures, along with their formulae, is provided in the Measures section.
- Information-theoretic: PMI, PPMI, Mutual Information variants
- Statistical: Log-likelihood ratio, Chi-square, T-score, Z-score
- Similarity-based: Dice, Jaccard, Cosine similarity
- Effect size: Odds ratio, Relative risk, Cohen's d
- Specialized: Lexical Gravity, Delta P, Minimum Sensitivity
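To make the families above more concrete, here is a minimal, self-contained sketch of two of the listed measures computed from a 2×2 contingency table. The `Cells` struct and the names `pmi_sketch` and `logdice_sketch` are hypothetical and not part of the package. They follow the textbook definitions (natural-log PMI, and LogDice as 14 plus log₂ of the Dice coefficient); the package's own implementations may differ in detail, so treat the Measures section as authoritative.

```julia
# Hypothetical 2x2 contingency cells for a (node, collocate) pair:
#   a = contexts containing both words
#   b = node without the collocate
#   c = collocate without the node
#   d = neither word
struct Cells
    a::Float64
    b::Float64
    c::Float64
    d::Float64
end

total(x::Cells) = x.a + x.b + x.c + x.d

# Pointwise mutual information: log of observed over expected co-occurrence.
pmi_sketch(x::Cells) = log(x.a * total(x) / ((x.a + x.b) * (x.a + x.c)))

# LogDice: 14 + log2 of the Dice coefficient 2a / (f_node + f_collocate).
logdice_sketch(x::Cells) = 14 + log2(2 * x.a / ((x.a + x.b) + (x.a + x.c)))

cells = Cells(10, 10, 10, 70)
println("PMI     = ", pmi_sketch(cells))
println("LogDice = ", logdice_sketch(cells))
```

Both functions depend only on the four cell counts, which is why a single contingency table can feed many metrics at once.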
🚀 Performance and Scalability
- Lazy evaluation: Computations are deferred and cached
- Memory efficient: Stream processing for large corpora
- Parallel processing: Built-in support for distributed computing
- Optimized algorithms: Efficient implementations for all metrics
🔧 Flexible and Extensible
- Multiple input formats: Raw text, files, directories, CSV, JSON
- Customizable preprocessing: Full control over text normalization
- Extensible design: Easy to add new metrics or modify existing ones
- Rich output options: DataFrames, CSV, JSON, Excel export
Basic Usage
Single Document Analysis
using TextAssociations
using TextAnalysis: text
using DataFrames
text_sample = "Machine learning algorithms learn from data. Deep learning uses neural networks."
doc = prep_string(text_sample, TextNorm(
    strip_punctuation=true,
    strip_case=true
))
ct = ContingencyTable(text(doc), "learning"; windowsize=5, minfreq=1)
pmi_scores = assoc_score(PMI, ct)
println("Found $(nrow(pmi_scores)) collocates")
Found 9 collocates
Corpus-Level Analysis
using TextAssociations
# Create a temporary mini-corpus with longer texts
dir = mktempdir()
files = Dict(
"doc1.txt" => """
Computational linguistics increasingly intersects with innovation practice.
Teams use data to evaluate hypotheses, prototype ideas quickly, and measure impact with reproducible pipelines.
In modern research workflows, small models are validated against well-defined tasks before scaling, ensuring that innovation is more than a buzzword—it is a methodical, testable process.
When AI systems are involved, documentation and transparent governance help peers replicate results and trust conclusions.
""",
"doc2.txt" => """
Successful innovation rarely happens in isolation.
It emerges from an ecosystem of universities, startups, industry labs, and public institutions that collaborate and share partial results early.
Well-run projects cultivate collaboration rituals—design reviews, error analyses, and postmortems—so ideas move from promising theory to usable tools.
Open exchange reduces duplication and accelerates learning across the ecosystem.
""",
"doc3.txt" => """
Prototyping is the bridge between research and deployment.
A minimal prototype clarifies the problem, surfaces risks, and reveals unknown edge cases.
From there, teams harden the system for scalability, add observability, and evaluate ethical trade-offs such as bias, privacy, and safety.
A principled evaluation plan is part of the prototype, not an afterthought.
""",
"doc4.txt" => """
Education benefits when innovation is human-centered.
Instructors can combine classic readings with hands-on labs that trace data through each step of the pipeline.
Open-source examples and clear rubrics help students reason about uncertainty, interpret model behavior, and articulate the limits of automation.
The goal is durable understanding and real-world impact, not just higher benchmark scores.
"""
)
# Write files
for (name, content) in files
    open(joinpath(dir, name), "w") do io
        write(io, strip(content))
    end
end
# Load the corpus from the real path we just created
corpus = read_corpus(dir)
# Analyze across entire corpus
results = analyze_corpus(corpus, "innovation", PMI,
    windowsize=5,
    minfreq=1
)
# Get corpus statistics
stats = corpus_stats(corpus)
println("Documents: $(stats[:num_documents])")
println("Vocabulary: $(stats[:vocabulary_size])")
Loaded 4 documents
Documents: 4
Vocabulary: 175
Advanced Features
Collocation Networks
Build networks of related terms:
network = colloc_graph(
    corpus, ["artificial", "intelligence"],
    metric=PMI, depth=2
)
Comparative Analysis
Compare associations across subcorpora:
comparison = compare_subcorpora(
    corpus, :category, "technology", PMI
)
Temporal Analysis
Track how word associations change over time:
temporal_analysis = analyze_temporal(
    corpus, ["digital", "transformation"], :year, PMI
)
Package Architecture
TextAssociations.jl
│
├─ Types & Basics
│ ├─ AssociationMetric / AssociationDataFormat
│ ├─ TextNorm (single source of truth for normalization)
│ └─ LazyProcess / LazyInput (lazy evaluation & caching)
│
├─ Utils
│ ├─ I/O & encoding (read_text_smart)
│ ├─ Text processing (normalize_node, prep_string, strip_diacritics)
│ ├─ Statistical helpers (available_metrics, log_safe)
│ └─ Text analysis helpers (token find/count utilities)
│
├─ Core Data Structures
│ ├─ ContingencyTable # per-document co-occurrence table
│ ├─ Corpus # collection + vocabulary/DTM
│ └─ CorpusContingencyTable # corpus-level aggregation (lazy)
│
├─ API (Unified)
│ └─ assoc_score(metric(s), x::AssociationDataFormat; …)
│
├─ Metrics
│ ├─ Interface + dispatch
│ └─ 47 measures across families (PMI, LLR, LogDice, χ², OR, etc.)
│
├─ Analysis Functions
│ ├─ analyze_corpus / analyze_nodes
│ ├─ corpus_stats, token_distribution, vocab_coverage
│ ├─ write_results, export/load with metadata
│ ├─ batch_process_corpus, stream_corpus_analysis
│ └─ keyterms (TF-IDF; RAKE/TextRank placeholders)
│
└─ Advanced Features
├─ analyze_temporal, compare_subcorpora
├─ colloc_graph → gephi_graph (network export)
└─ kwic (concordance)
Documentation Guide
🚀 Getting Started
📖 User Guide
🔬 Advanced
Contributing
We welcome contributions! See our Contributing Guide for:
- Bug reports and feature requests
- Code contributions
- Documentation improvements
- Adding new metrics
License
TextAssociations.jl is licensed under the MIT License.
Acknowledgments
This package builds upon decades of research in corpus and computational linguistics. The references that follow are illustrative rather than exhaustive, highlighting some of the key contributions that have shaped the development of association measures and corpus analysis methods.
- Evert, S. (2008). "Corpora and collocations." Corpus Linguistics: An International Handbook
- Church, K. W., & Hanks, P. (1990). "Word association norms, mutual information, and lexicography." Computational Linguistics
- Pecina, P. (2010). "Lexical association measures and collocation extraction." Language Resources and Evaluation
Index
TextAssociations.AssociationDataFormat
TextAssociations.AssociationMetric
TextAssociations.CollocationNetwork
TextAssociations.Concordance
TextAssociations.ContingencyTable
TextAssociations.Corpus
TextAssociations.CorpusContingencyTable
TextAssociations.LazyInput
TextAssociations.LazyProcess
TextAssociations.MultiNodeAnalysis
TextAssociations.SubcorpusComparison
TextAssociations.TemporalCorpusAnalysis
TextAssociations.TextNorm
TextAssociations._gravity_directional_analysis
TextAssociations._gravity_original_formula
TextAssociations._gravity_pmi_weighted
TextAssociations._gravity_simplified_formula
TextAssociations.aggregate_contingency_tables
TextAssociations.analyze_corpus
TextAssociations.analyze_nodes
TextAssociations.analyze_temporal
TextAssociations.assoc_score
TextAssociations.assoc_score
TextAssociations.available_metrics
TextAssociations.batch_process_corpus
TextAssociations.build_document_term_matrix
TextAssociations.build_vocab
TextAssociations.cached_data
TextAssociations.calculate_effect_sizes
TextAssociations.colloc_graph
TextAssociations.compare_subcorpora
TextAssociations.compute_association_trends
TextAssociations.cont_table
TextAssociations.corpus_stats
TextAssociations.count_substrings
TextAssociations.count_word_frequency
TextAssociations.coverage_summary
TextAssociations.document
TextAssociations.eval_lexicalgravity
TextAssociations.extract_rake_keywords
TextAssociations.extract_textrank_keywords
TextAssociations.extract_tfidf_keywords
TextAssociations.find_following_words
TextAssociations.find_prior_words
TextAssociations.gephi_graph
TextAssociations.keyterms
TextAssociations.kwic
TextAssociations.lexical_gravity_analysis
TextAssociations.log_safe
TextAssociations.normalize_node
TextAssociations.perform_statistical_tests
TextAssociations.prep_string
TextAssociations.read_corpus
TextAssociations.read_corpus_df
TextAssociations.stream_corpus_analysis
TextAssociations.strip_diacritics
TextAssociations.token_distribution
TextAssociations.tostringvector
TextAssociations.vocab_coverage
TextAssociations.write_results
Functions
TextAssociations._gravity_directional_analysis — Method
Analyze directional preferences (left vs right) for collocations.
TextAssociations._gravity_original_formula — Method
The original Daudaravičius & Marcinkevičienė formula. This is the main contribution of their paper.
TextAssociations._gravity_pmi_weighted — Method
PMI-weighted gravity (alternative formulation). G = f(w1,w2) × PMI(w1,w2)
TextAssociations._gravity_simplified_formula — Method
Simplified gravity formula often used in implementations. G = log₂((f(w1,w2)² × span) / (f(w1) × f(w2)))
TextAssociations.aggregate_contingency_tables — Method
aggregate_contingency_tables(tables::Vector{ContingencyTable}, minfreq::Int) -> DataFrame
Aggregate multiple contingency tables into a single table.
TextAssociations.analyze_corpus — Method
analyze_corpus(corpus::Corpus, node::AbstractString, metric::Type{<:AssociationMetric};
windowsize::Int, minfreq::Int=5) -> DataFrame
Analyze a single node word across the entire corpus using corpus's normalization. Returns DataFrame with Node, Collocate, Score, Frequency, and DocFrequency columns.
TextAssociations.analyze_nodes — Method
analyze_nodes(corpus::Corpus, nodes::Vector{String}, metrics::Vector{DataType};
windowsize::Int, minfreq::Int=5, top_n::Int=100,
parallel::Bool=false) -> MultiNodeAnalysis
Analyze multiple nodes with consistent normalization. Each result DataFrame now includes the Node column and metadata.
TextAssociations.analyze_temporal — Method
analyze_temporal(corpus::Corpus,
nodes::Vector{String},
time_field::Symbol,
metric::Type{<:AssociationMetric};
time_bins::Int=10,
windowsize::Int=5,
minfreq::Int=5) -> TemporalCorpusAnalysis
Analyze how word associations change over time.
TextAssociations.assoc_score — Method
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
inputstring::AbstractString,
node::AbstractString;
windowsize::Int,
minfreq::Int=5;
scores_only::Bool=false,
norm_config::TextNorm=TextNorm(),
tokens::Union{Nothing,Vector{String}}=nothing,
kwargs...)
Convenience overload to compute multiple metrics directly from raw text.
TextAssociations.assoc_score — Method
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
x::AssociationDataFormat;
scores_only::Bool=false,
tokens::Union{Nothing,Vector{String}}=nothing,
kwargs...)
Evaluate multiple metrics on CT or CCT.
- Returns a DataFrame with one column per metric by default.
- If scores_only=true, returns Dict{String,Vector{Float64}}.
TextAssociations.assoc_score — Method
assoc_score(metrics::AbstractVector{<:Type{<:AssociationMetric}},
corpus::Corpus, nodes::Vector{String};
windowsize::Int=5, minfreq::Int=5, top_n::Int=100, kwargs...)
Evaluate multiple metrics on multiple nodes in a corpus. Returns a Dict{String,DataFrame} with combined metric results for each node.
TextAssociations.assoc_score — Method
assoc_score(metricType::Type{<:AssociationMetric},
inputstring::AbstractString,
node::AbstractString;
windowsize::Int,
minfreq::Int=5;
scores_only::Bool=false,
norm_config::TextNorm=TextNorm(),
tokens::Union{Nothing,Vector{String}}=nothing,
kwargs...)
Convenience overload to compute a metric directly from a raw string.
TextAssociations.assoc_score — Method
assoc_score(metricType::Type{<:AssociationMetric}, x::AssociationDataFormat;
scores_only::Bool=false,
tokens::Union{Nothing,Vector{String}}=nothing,
kwargs...)
Evaluate a metric on any association data format (CT or CCT).
- If the metric requires tokens (e.g., LexicalGravity), pass tokens=... or implement assoc_tokens(::YourType) to supply them automatically.
- Returns a DataFrame by default: [:Node, :Collocate, :Frequency, :<MetricName>].
- If scores_only=true, returns only the scores Vector.
TextAssociations.assoc_score — Method
assoc_score(metricType::Type{<:AssociationMetric}, corpus::Corpus, node::AbstractString;
windowsize::Int=5, minfreq::Int=5, kwargs...)
Evaluate a metric on a corpus - convenience method that delegates to analyze_corpus.
TextAssociations.assoc_score — Method
assoc_score(metricType::Type{<:AssociationMetric}, corpus::Corpus,
nodes::Vector{String}; windowsize::Int=5, minfreq::Int=5,
top_n::Int=100, kwargs...)
Evaluate a metric on multiple nodes in a corpus. Returns a Dict{String,DataFrame} with results for each node.
TextAssociations.assoc_score — Method
assoc_score(metric::Type{<:AssociationMetric}, corpus::Corpus;
nodes::Vector{String}, windowsize::Int=5, minfreq::Int=5,
top_n::Int=100, kwargs...)
Alternative syntax with nodes as keyword argument.
TextAssociations.available_metrics — Method
available_metrics() -> Vector{DataType}
Returns a list of all supported association metrics.
TextAssociations.batch_process_corpus — Method
batch_process_corpus(corpus::Corpus,
node_file::AbstractString,
output_dir::AbstractString;
metrics::Vector{DataType}=[PMI, LogDice],
windowsize::Int,
minfreq::Int=5,
batch_size::Int=100)
Process a large list of node words in batches. Results include Node column.
TextAssociations.build_document_term_matrix — Method
build_document_term_matrix(documents, vocabulary) -> SparseMatrixCSC
Build a document-term matrix from documents.
TextAssociations.build_vocab — Method
build_vocab(input::Union{StringDocument,Vector{String}}) -> OrderedDict
Create vocabulary dictionary from text input.
TextAssociations.cached_data — Method
cached_data(z::LazyProcess{T,R}) -> R
Extract data from a LazyProcess, computing it if necessary.
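The compute-on-first-access behavior described here can be illustrated with a small stand-alone sketch. `LazyValue` and `force!` are hypothetical names chosen for this example; they are not the package's `LazyProcess` or `cached_data`, whose internals are not shown in this documentation.

```julia
# Hypothetical lazy container: stores a zero-argument function and
# remembers its result after the first evaluation.
mutable struct LazyValue{F}
    compute::F     # function that produces the value
    cached::Any    # result once computed, `nothing` before
    done::Bool     # whether `compute` has already run
end

LazyValue(f) = LazyValue(f, nothing, false)

# Return the value, running `compute` only on the first call.
function force!(z::LazyValue)
    if !z.done
        z.cached = z.compute()
        z.done = true
    end
    return z.cached
end

calls = Ref(0)
z = LazyValue(() -> (calls[] += 1; sum(1:10)))
force!(z)
force!(z)
println("value = ", force!(z), ", computed ", calls[], " time(s)")
```

Using a separate `done` flag rather than testing `cached === nothing` keeps the sketch correct even when the computed value is itself `nothing`.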
TextAssociations.calculate_effect_sizes — Method
calculate_effect_sizes(results, metric) -> DataFrame
Calculate effect sizes for differences between subcorpora.
TextAssociations.colloc_graph — Method
colloc_graph(corpus::Corpus,
seed_words::Vector{String};
metric::Type{<:AssociationMetric}=PMI,
depth::Int=2,
min_score::Float64=3.0,
max_neighbors::Int=20) -> CollocationNetwork
Build a collocation network starting from seed words.
TextAssociations.compare_subcorpora — Method
compare_subcorpora(corpus::Corpus,
split_field::Symbol,
node::String,
metric::Type{<:AssociationMetric};
windowsize::Int=5,
minfreq::Int=5) -> SubcorpusComparison
Compare word associations across different subcorpora.
TextAssociations.compute_association_trends — Method
compute_association_trends(results_by_period, nodes, metric) -> DataFrame
Compute trend statistics for associations over time.
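The documentation does not say which trend statistics are computed. As one plausible illustration, the sketch below fits an ordinary least-squares slope to a collocate's scores across time periods. The function name `trend_slope` and the sample data are assumptions made for this example only.

```julia
using Statistics

# Least-squares slope of association scores regressed on time periods.
# A positive slope suggests the association strengthens over time.
function trend_slope(periods::AbstractVector{<:Real}, scores::AbstractVector{<:Real})
    length(periods) == length(scores) ||
        throw(ArgumentError("periods and scores must have the same length"))
    return cov(periods, scores) / var(periods)
end

years  = [2019, 2020, 2021, 2022]
scores = [1.0, 1.5, 2.0, 2.5]   # made-up scores for one collocate
println("slope per year = ", trend_slope(years, scores))
```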
TextAssociations.cont_table — Method
cont_table(input_doc::StringDocument, target_word::AbstractString,
windowsize::Int=5, minfreq::Int=3) -> DataFrame
Compute the contingency table for a target word in a document. Note: target_word should already be normalized before calling this function.
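The raw material for such a table is a count of which words fall inside a window around each occurrence of the target. The following sketch shows that counting step only, on an already tokenized and normalized text. `window_counts` is a hypothetical helper written for this example, not the package's implementation, and it assumes a symmetric window.

```julia
# Count how often each word occurs within `windowsize` tokens
# (left or right) of every occurrence of `node`.
function window_counts(tokens::Vector{String}, node::String, windowsize::Int)
    counts = Dict{String,Int}()
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        lo = max(1, i - windowsize)
        hi = min(length(tokens), i + windowsize)
        for j in lo:hi
            j == i && continue   # skip the node occurrence itself
            counts[tokens[j]] = get(counts, tokens[j], 0) + 1
        end
    end
    return counts
end

tokens = String.(split("deep learning is a subset of machine learning"))
for (word, n) in sort(collect(window_counts(tokens, "learning", 1)))
    println(word, " => ", n)
end
```

Because matching is an exact string comparison, the target word has to be normalized the same way as the tokens, which is the point of the note above.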
TextAssociations.corpus_stats — Method
corpus_stats(corpus::Corpus;
include_token_distribution::Bool=true) -> Dict
Get comprehensive statistics about the corpus.
TextAssociations.count_substrings — Method
count_substrings(text::String, substring::String) -> Int
Count occurrences of a substring in text.
TextAssociations.count_substrings — Method
count_substrings(text::String, substrings::Vector{String}) -> Dict{String,Int}
Count occurrences of multiple substrings in text.
TextAssociations.count_word_frequency — Method
count_word_frequency(doc::StringDocument, word::String) -> Int
Count the frequency of a word in the document.
TextAssociations.coverage_summary — Method
coverage_summary(stats::Dict)
Pretty print the vocabulary coverage statistics.
TextAssociations.document — Method
document(input::LazyInput) -> StringDocument
Extract the document from a LazyInput wrapper.
TextAssociations.eval_lexicalgravity — Method
eval_lexicalgravity(data::AssociationDataFormat;
tokens::Vector{String},
formula::Symbol=:original)
Compute Lexical Gravity measure based on Daudaravičius & Marcinkevičienė (2004).
Arguments
data
: AssociationDataFormat with co-occurrence datatokens
: Required tokenized text (provided by assoc_score when called through API)formula
: Which formula to use::original
- The main formula from the paper: G→(w1,w2) = log(f→×n+/f1) + log(f←×n-/f2):simplified
- Simplified version: G = log₂((f²×span)/(f1×f2)):pmi_weighted
- PMI-style weighting: G = f(w1,w2) × log((f×N)/(f1×f2))
Original Formula from Paper
G→(w1,w2) = log(f→(w1,w2)/f(w1) × n+(w1)) + log(f←(w1,w2)/f(w2) × n-(w2))
Where:
- f→(w1,w2) = frequency of w2 following w1 within window
- f←(w1,w2) = frequency of w1 preceding w2 within window
- n+(w1) = number of different word types that follow w1
- n-(w2) = number of different word types that precede w2
- f(w1), f(w2) = total frequencies of words
Note
This function expects tokens to be provided. When called through assoc_score(), tokens are automatically fetched based on the NeedsTokens trait.
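As a worked illustration of the two simpler variants listed under Arguments, the sketch below evaluates the `:simplified` and `:pmi_weighted` formulas directly from raw counts. The function names and the example counts are hypothetical; only the formulas themselves are taken from the text above, and the package's handling of edge cases such as zero counts is not reproduced here.

```julia
# f    = co-occurrence frequency of (w1, w2)
# f1   = frequency of w1, f2 = frequency of w2
# span = window size, N = corpus size in tokens

# :simplified variant: G = log2((f^2 * span) / (f1 * f2))
gravity_simplified(f, f1, f2, span) = log2((f^2 * span) / (f1 * f2))

# :pmi_weighted variant: G = f * log((f * N) / (f1 * f2))
gravity_pmi_weighted(f, f1, f2, N) = f * log((f * N) / (f1 * f2))

f, f1, f2 = 4, 8, 8
println("simplified   = ", gravity_simplified(f, f1, f2, 4))
println("pmi_weighted = ", gravity_pmi_weighted(f, f1, f2, 1000))
```

With these counts the simplified score is exactly zero, since f² × span equals f1 × f2; larger co-occurrence counts relative to the marginal frequencies push it above zero.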
TextAssociations.extract_rake_keywords — Method
extract_rake_keywords(corpus, num_keywords) -> DataFrame
Extract keywords using RAKE algorithm (placeholder).
TextAssociations.extract_textrank_keywords — Method
extract_textrank_keywords(corpus, num_keywords) -> DataFrame
Extract keywords using TextRank algorithm (placeholder).
TextAssociations.extract_tfidf_keywords — Method
extract_tfidf_keywords(corpus, num_keywords, min_doc_freq, max_doc_freq_ratio) -> DataFrame
Extract keywords using TF-IDF scoring.
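For readers unfamiliar with the scoring, here is a minimal sketch of one common TF-IDF variant (raw term frequency times log(N/df), summed over documents). `tfidf_scores` is a hypothetical function for illustration; the weighting and the document-frequency filters actually used by the package are not specified on this page.

```julia
# Sum of TF-IDF over all documents, for pre-tokenized input.
function tfidf_scores(docs::Vector{Vector{String}})
    ndocs = length(docs)
    # Document frequency: number of documents containing each term.
    df = Dict{String,Int}()
    for doc in docs, term in unique(doc)
        df[term] = get(df, term, 0) + 1
    end
    scores = Dict{String,Float64}()
    for doc in docs, term in doc
        # Each occurrence contributes one unit of term frequency times IDF.
        scores[term] = get(scores, term, 0.0) + log(ndocs / df[term])
    end
    return scores
end

docs = [["innovation", "data"], ["innovation", "prototype", "prototype"]]
for (term, s) in sort(collect(tfidf_scores(docs)); by = last, rev = true)
    println(rpad(term, 12), round(s; digits = 3))
end
```

A term that appears in every document, like "innovation" here, gets an IDF of zero and drops to the bottom of the ranking.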
TextAssociations.find_following_words — Method
find_following_words(doc::StringDocument, word::String, window::Int) -> Set{String}
Find unique words that appear within window words after each occurrence of word in the document.
TextAssociations.find_prior_words — Method
find_prior_words(doc::StringDocument, word::String, window::Int) -> Set{String}
Find unique words that appear within window words before each occurrence of word in the document.
TextAssociations.gephi_graph — Method
gephi_graph(network::CollocationNetwork,
nodes_file::String,
edges_file::String)
Export network for visualization in Gephi or similar tools.
TextAssociations.keyterms — Method
keyterms(corpus::Corpus;
method::Symbol=:tfidf,
num_keywords::Int=50,
min_doc_freq::Int=3,
max_doc_freq_ratio::Float64=0.5) -> DataFrame
Extract keywords from corpus using various methods.
TextAssociations.kwic — Method
kwic(corpus::Corpus,
node::String;
context_size::Int=50,
max_lines::Int=1000) -> Concordance
Generate KWIC concordance for a node word.
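The idea of a keyword-in-context line can be sketched in a few lines. The helper `kwic_lines` below is hypothetical and measures context in tokens; the package's `context_size` may well be measured differently (for example in characters), and its `Concordance` type is not reproduced here.

```julia
# Return (left context, node, right context) for every occurrence of `node`.
function kwic_lines(tokens::Vector{String}, node::String; context::Int = 3)
    lines = Tuple{String,String,String}[]
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        left  = join(tokens[max(1, i - context):i-1], " ")
        right = join(tokens[i+1:min(length(tokens), i + context)], " ")
        push!(lines, (left, tok, right))
    end
    return lines
end

tokens = String.(split("neural networks power deep learning systems"))
for (left, node, right) in kwic_lines(tokens, "deep"; context = 2)
    println(lpad(left, 20), " [", node, "] ", right)
end
```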
TextAssociations.lexical_gravity_analysis — Method
lexical_gravity_analysis(data::AssociationDataFormat;
tokens::Union{Nothing,Vector{String}}=nothing)
Comprehensive analysis using all gravity formulas for comparison. Returns results from all three formulas plus directional analysis.
Note: If tokens are not provided, will attempt to fetch them using assoc_tokens.
TextAssociations.log_safe — Method
Safe logarithm that handles zero and negative values.
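How `log_safe` treats zero and negative inputs is not documented here. One common approach, shown purely as an assumption, is to clamp the argument to a small positive floor before taking the logarithm, so that the result is always finite. `safe_log` and its `floor` keyword are hypothetical names for this sketch.

```julia
# Hypothetical safe logarithm: inputs at or below zero are clamped to a
# tiny positive floor, so the result is finite instead of -Inf or an error.
safe_log(x::Real; floor::Float64 = eps()) = log(max(float(x), floor))

println(safe_log(1.0))    # ordinary case, unaffected by the floor
println(safe_log(0.0))    # large negative but finite
println(safe_log(-3.0))   # same as safe_log(0.0) under this assumption
```

This matters for association measures because contingency cells are frequently zero, and a plain `log(0)` would propagate `-Inf` or `NaN` through the score.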
TextAssociations.normalize_node — Method
normalize_node(node::AbstractString, config::TextNorm) -> String
Normalize a node word according to the given TextNorm configuration. This is the single source of truth for node normalization.
TextAssociations.perform_statistical_tests — Method
perform_statistical_tests(results, metric) -> DataFrame
Perform statistical tests between subcorpora.
TextAssociations.prep_string — Function
prep_string(input_path::AbstractString, config::TextNorm) -> StringDocument
Prepare and preprocess text from various sources into a StringDocument.
Arguments
- input_path: File path, directory path, or raw text string.
Preprocessing options
Uses TextNorm configuration for all preprocessing options.
Returns
A preprocessed StringDocument object suitable for downstream corpus analysis.
TextAssociations.read_corpus — Method
read_corpus(path::AbstractString; kwargs...) -> Corpus
Load a corpus from various sources with consistent normalization.
TextAssociations.read_corpus_df — Method
read_corpus_df(df::DataFrame; kwargs...) -> Corpus
Load corpus directly from a DataFrame with consistent normalization. Metadata columns are stored at the corpus level with document indices.
TextAssociations.stream_corpus_analysis — Method
stream_corpus_analysis(file_pattern::AbstractString,
node::AbstractString,
metric::Type{<:AssociationMetric};
windowsize::Int,
chunk_size::Int=1000)
Stream-process large corpora without loading everything into memory.
TextAssociations.strip_diacritics — Method
strip_diacritics(s::AbstractString; target_form::Symbol = :NFC) -> String
Remove all combining diacritics (e.g., Greek tonos, dialytika) using Unicode normalization.
Internally:
- canonically decomposes as needed,
- strips combining marks (Mn),
- and normalizes to target_form (default :NFC).
If you don't care about the final form, leave target_form at the default.
Example:
julia> strip_diacritics("ένα το χελιδόϊι")
"ενα το χελιδοιι"
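The decompose, strip, and recompose steps listed above can be approximated with Julia's standard library alone. The sketch below uses `Unicode.normalize` with `stripmark=true`; the wrapper name `strip_marks` is hypothetical, and this is not necessarily how the package implements `strip_diacritics` (in particular, it does not expose a `target_form` option).

```julia
using Unicode

# Strip combining marks using the standard library. With the default
# `compose=true`, the result comes back in composed form.
strip_marks(s::AbstractString) = Unicode.normalize(s, stripmark=true)

for word in ["ένα", "χελιδόϊι", "café"]
    println(word, " -> ", strip_marks(word))
end
```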
TextAssociations.token_distribution — Method
token_distribution(text::AbstractString) -> DataFrame
Analyze the distribution of tokens in a string and return a DataFrame with info about absolute and relative token frequencies.
TextAssociations.token_distribution — Method
token_distribution(corpus::Corpus) -> DataFrame
Analyze the distribution of tokens in the corpus.
TextAssociations.tostringvector — Method
Convert input to string vector for vocabulary creation.
TextAssociations.vocab_coverage — Method
vocab_coverage(corpus::Corpus;
percentiles=0.01:0.01:1.0) -> DataFrame
Calculate vocabulary coverage curve showing how many words are needed to cover various percentages of the corpus. Uses the corpus vocabulary for consistent calculations.
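A coverage curve answers the question "how many of the most frequent word types account for p percent of all tokens?". The sketch below computes a single point on that curve from a vector of type frequencies. `types_needed` and the sample frequencies are hypothetical and only illustrate the calculation; the package returns a DataFrame over a range of percentiles.

```julia
# Number of most-frequent word types needed to cover a share `p`
# (between 0 and 1) of all tokens.
function types_needed(freqs::Vector{Int}, p::Float64)
    sorted = sort(freqs; rev = true)
    target = p * sum(sorted)
    running = 0
    for (k, f) in enumerate(sorted)
        running += f
        running >= target && return k
    end
    return length(sorted)
end

freqs = [50, 30, 10, 5, 5]   # made-up frequencies for five word types
for p in (0.5, 0.75, 1.0)
    println(Int(100p), "% of tokens covered by ", types_needed(freqs, p), " type(s)")
end
```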
TextAssociations.write_results — Method
write_results(analysis::MultiNodeAnalysis, path::AbstractString; format::Symbol=:csv)
Export analysis results to file. Results now include Node column.