Core Types

This section documents all core types in TextAssociations.jl. These types form the foundation of the package's functionality.

Overview

The type system is organized into several categories:

TextAssociations Types
├── Data Structures
│   ├── ContingencyTable       # Word co-occurrence data
│   ├── Corpus                 # Document collection
│   └── CorpusContingencyTable # Aggregated corpus data
├── Analysis Results
│   ├── MultiNodeAnalysis      # Multiple word analysis
│   ├── TemporalCorpusAnalysis # Time-based analysis
│   ├── SubcorpusComparison    # Comparative analysis
│   ├── CollocationNetwork     # Network representation
│   └── Concordance            # KWIC concordance
├── Abstract Types
│   ├── AssociationMetric      # Base for all metrics
│   └── AssociationDataFormat  # Base for data formats
└── Utility Types
    ├── LazyProcess            # Lazy evaluation wrapper
    └── LazyInput              # Lazy input wrapper

Primary Data Structures

ContingencyTable

TextAssociations.ContingencyTable — Type
ContingencyTable <: AssociationDataFormat

Represents a contingency table for word co-occurrence analysis.

Fields

  • con_tbl: Lazy-loaded contingency table data
  • node: Target word (normalized)
  • windowsize: Context window size
  • minfreq: Minimum frequency threshold
  • input_ref: Reference to the processed input document
  • norm_config: Text normalization configuration
TextAssociations.TextNorm — Type
TextNorm(; strip_case=true,
           strip_accents=false,
           unicode_form=:NFC,
           strip_punctuation=true,
           punctuation_to_space=true,
           normalize_whitespace=true,
           strip_whitespace=false,
           use_prepare=false)

Configuration for text normalization used by prep_string and corpus loaders.

Fields

  • strip_case::Bool — Lowercase the text when true.
  • strip_accents::Bool — Remove combining diacritics (e.g., Greek tonos, diaeresis).
  • unicode_form::Symbol — Unicode normalization form (:NFC, :NFD, :NFKC, :NFKD).
  • strip_punctuation::Bool — If true, remove punctuation; punctuation_to_space decides whether it is replaced with a space or deleted.
  • punctuation_to_space::Bool — When stripping punctuation, replace it with a single space (if true) or remove it (if false).
  • normalize_whitespace::Bool — Collapse consecutive whitespace to a single space.
  • strip_whitespace::Bool — Trim leading/trailing whitespace.
  • use_prepare::Bool — Internal flag to route through a more aggressive prepare path.

Constructors

  • TextNorm() — defaults above.
  • TextNorm(d::Dict) and TextNorm(nt::NamedTuple) — convenience constructors; keys must match field names.
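
For example (a small sketch; whether the Dict form expects Symbol or String keys is an assumption here):

```julia
# Convenience constructors; Symbol keys assumed for the Dict form
cfg_from_dict = TextNorm(Dict(:strip_accents => true, :strip_case => true))
cfg_from_nt   = TextNorm((strip_accents = true, strip_whitespace = true))
```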

Notes

In prep_string, normalization typically proceeds as: Unicode normalization → punctuation handling → whitespace normalization, then (if enabled) case folding and accent stripping.

Examples

```julia
julia> cfg = TextNorm(strip_accents=true, strip_whitespace=true);

julia> doc = prep_string(" Καφέ, naïve résumé! ", cfg);

julia> text(doc)
"καφε naive resume"
```

The ContingencyTable is the fundamental data structure for word co-occurrence analysis. It stores information about how often words co-occur within a specified window.

Constructor

ContingencyTable(inputstring::AbstractString,
                 node::AbstractString;
                 windowsize::Int,
                 minfreq::Int=5,
                 norm_config::TextNorm=TextNorm())

Parameters

  • inputstring: Text to analyze (raw text, a file path, or a directory)
  • node: Target word to analyze
  • windowsize: Number of words to consider on each side of the node
  • minfreq: Minimum frequency threshold (default: 5)
  • norm_config: Text normalization configuration (default: TextNorm())

Fields

  • con_tbl::LazyProcess{T,DataFrame}: Lazy-loaded contingency data
  • node::AbstractString: The target word
  • windowsize::Int: Context window size
  • minfreq::Int64: Minimum frequency threshold
  • input_ref::LazyInput: Reference to processed input
  • norm_config::TextNorm: Text normalization configuration

Example Usage

using TextAssociations
using DataFrames

sample_text = """
The field of data science combines statistical analysis with machine learning.
Data scientists use various tools for data visualization and data mining.
Modern data science relies heavily on big data technologies.
"""

# Create a contingency table for "data"
ct = ContingencyTable(sample_text, "data"; windowsize=3, minfreq=1)

# The table is computed lazily when first accessed
results = assoc_score(PMI, ct)
println("Found $(nrow(results)) collocates for 'data'")
Found 24 collocates for 'data'

Contingency Table Structure

The internal contingency table contains the following values for each word pair:

Cell | Description             | Formula
a    | Co-occurrence frequency | f(node, collocate)
b    | Node without collocate  | f(node) - a
c    | Collocate without node  | f(collocate) - a
d    | Neither occurs          | N - a - b - c
N    | Total observations      | Total positions
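
A small arithmetic sketch of how the cells relate (made-up counts, not package output):

```julia
# Illustrative cell arithmetic for one node-collocate pair
N = 1_000                 # total observations (window positions)
f_node, f_coll = 50, 80   # marginal frequencies of node and collocate
a = 12                    # co-occurrence frequency f(node, collocate)
b = f_node - a            # node without the collocate
c = f_coll - a            # collocate without the node
d = N - a - b - c         # neither occurs
@assert a + b + c + d == N
```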

Corpus

Represents a collection of documents for corpus-level analysis.

Constructor

Corpus(documents::Vector{StringDocument};
       build_dtm::Bool=false,
       metadata::Dict{String,Any}=Dict())

Fields

  • documents::Vector{StringDocument}: Collection of documents
  • metadata::Dict{String,Any}: Document metadata
  • vocabulary::OrderedDict{String,Int}: Word-to-index mapping
  • doc_term_matrix::Union{Nothing,SparseMatrixCSC}: Optional document-term matrix

Example Usage

using TextAssociations
using TextAnalysis: StringDocument  # avoid bringing TextAnalysis.Corpus into scope

# Create corpus from documents
docs = [
    StringDocument("Artificial intelligence is transforming technology."),
    StringDocument("Machine learning is a subset of artificial intelligence."),
    StringDocument("Deep learning uses neural networks.")
]

corpus = TextAssociations.Corpus(docs, metadata=Dict{String,Any}("source" => "AI texts"))

println("Corpus Statistics:")
println("  Documents: ", length(corpus.documents))
println("  Vocabulary size: ", length(corpus.vocabulary))
println("  Metadata: ", collect(keys(corpus.metadata)))
Corpus Statistics:
  Documents: 3
  Vocabulary size: 16
  Metadata: ["source"]

CorpusContingencyTable

Aggregates contingency tables across an entire corpus for comprehensive analysis.

Constructor

CorpusContingencyTable(corpus::Corpus,
                       node::AbstractString;
                       windowsize::Int,
                       minfreq::Int64=5)

Fields

  • tables::Vector{ContingencyTable}: Individual document tables
  • aggregated_table::LazyProcess: Lazily computed aggregate
  • node::AbstractString: Target word
  • windowsize::Int: Context window
  • minfreq::Int64: Minimum frequency threshold
  • corpus_ref::Corpus: Reference to source corpus
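
A brief usage sketch (assuming assoc_score accepts a CorpusContingencyTable, which is an AssociationDataFormat; `corpus` is the corpus built in the example above):

```julia
# Aggregate co-occurrence counts for "learning" across the whole corpus
cct = CorpusContingencyTable(corpus, "learning"; windowsize=3, minfreq=1)

# Score the aggregated table just like a single-document table
corpus_results = assoc_score(PMI, cct)
```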

Analysis Result Types

MultiNodeAnalysis

Stores results from analyzing multiple node words across a corpus.

Fields

  • nodes::Vector{String}: Analyzed words
  • results::Dict{String,DataFrame}: Results per node
  • corpus_ref::Corpus: Source corpus
  • parameters::Dict{Symbol,Any}: Analysis parameters

Example

# Example (illustrative)
# analysis = MultiNodeAnalysis(
#     ["learning", "intelligence"],
#     Dict("learning" => DataFrame(), "intelligence" => DataFrame()),
#     corpus,
#     Dict(:windowsize => 5, :metric => PMI)
# )

TemporalCorpusAnalysis

Results from analyzing word associations over time periods.

Fields

  • time_periods::Vector{String}: Period labels
  • results_by_period::Dict{String,MultiNodeAnalysis}: Period-specific results
  • trend_analysis::DataFrame: Trend statistics
  • corpus_ref::Corpus: Source corpus
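
Example

An illustrative construction in the same spirit as the MultiNodeAnalysis sketch above; all values are placeholders:

```julia
# Example (illustrative)
# period_result = MultiNodeAnalysis(["learning"], Dict("learning" => DataFrame()),
#                                   corpus, Dict(:windowsize => 5, :metric => PMI))
# temporal = TemporalCorpusAnalysis(
#     ["2010s", "2020s"],
#     Dict("2010s" => period_result, "2020s" => period_result),
#     DataFrame(),   # trend_analysis
#     corpus
# )
```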

SubcorpusComparison

Results from comparing word associations between different subcorpora.

Fields

  • subcorpora::Dict{String,Corpus}: Subcorpus divisions
  • node::String: Analyzed word
  • results::Dict{String,DataFrame}: Results per subcorpus
  • statistical_tests::DataFrame: Statistical comparisons
  • effect_sizes::DataFrame: Effect size calculations
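
Example

An illustrative construction with placeholder values (in practice this object is typically produced by the package's comparison functions rather than built by hand):

```julia
# Example (illustrative)
# comparison = SubcorpusComparison(
#     Dict("news" => corpus, "fiction" => corpus),             # subcorpora
#     "learning",                                               # node
#     Dict("news" => DataFrame(), "fiction" => DataFrame()),    # results
#     DataFrame(),                                              # statistical_tests
#     DataFrame()                                               # effect_sizes
# )
```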

CollocationNetwork

Network representation of word collocations for visualization and analysis.

Fields

  • nodes::Vector{String}: Network nodes (words)
  • edges::DataFrame: Edge data with columns [Source, Target, Weight, Metric]
  • node_metrics::DataFrame: Per-node metrics
  • parameters::Dict{Symbol,Any}: Network construction parameters

Example

# Example (illustrative)
# using DataFrames
# nodes = ["machine", "learning", "deep", "neural"]
# edges = DataFrame(
#     Source = ["machine", "machine", "deep"],
#     Target = ["learning", "deep", "neural"],
#     Weight = [8.5, 6.2, 7.8],
#     Metric = ["PMI", "PMI", "PMI"]
# )
# node_metrics = DataFrame(
#     Node = nodes,
#     Degree = [2, 1, 2, 1],
#     AvgScore = [7.35, 8.5, 7.0, 7.8]
# )
# network = CollocationNetwork(
#     nodes, edges, node_metrics,
#     Dict(:metric => PMI, :depth => 2)
# )

Concordance

KWIC (Key Word In Context) concordance representation.

Fields

  • node::String: Target word
  • lines::DataFrame: Concordance lines with columns [LeftContext, Node, RightContext, DocId, Position]
  • statistics::Dict{Symbol,Any}: Occurrence statistics
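
Example

An illustrative concordance with a single line; column names follow the field description above:

```julia
# Example (illustrative)
# using DataFrames
# lines = DataFrame(
#     LeftContext  = ["modern data science relies heavily on"],
#     Node         = ["big"],
#     RightContext = ["data technologies"],
#     DocId        = [1],
#     Position     = [8]
# )
# conc = Concordance("big", lines, Dict{Symbol,Any}(:occurrences => 1))
```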

Abstract Types

AssociationMetric

Abstract supertype for all association metrics. All specific metrics (PMI, Dice, LLR, etc.) inherit from this type.

Type Hierarchy

abstract type AssociationMetric <: SemiMetric end

# Metric subtypes (examples)
abstract type PMI <: AssociationMetric end
abstract type Dice <: AssociationMetric end
abstract type LLR <: AssociationMetric end
# ... more metrics

AssociationDataFormat

Abstract supertype for data formats used in association computations.

Subtypes

  • ContingencyTable: Single document analysis
  • CorpusContingencyTable: Corpus-level analysis

Utility Types

LazyProcess

TextAssociations.LazyProcess — Type
LazyProcess{T,R}

Lazy evaluation wrapper for deferred computations. Stores a function that computes a result when first needed and caches it.


Enables lazy evaluation with caching for expensive computations.

Type Parameters

  • T: Function type
  • R: Result type

Fields

  • f::T: Function to compute result
  • cached_result::Union{Nothing,R}: Cached result
  • cached_process::Bool: Whether result is cached

Example

using TextAssociations
using DataFrames

# Return a DataFrame so it matches LazyProcess{..., DataFrame}
expensive_df() = DataFrame(x = 1:3, y = [10, 20, 30])

lp = LazyProcess(expensive_df)   # default R = DataFrame

# First call computes the result
result1 = cached_data(lp)

# Second call uses cache
result2 = cached_data(lp)

println("Results equal: ", result1 == result2)
Results equal: true

LazyInput

TextAssociations.LazyInput — Type
LazyInput

Wrapper for lazily storing and accessing the processed input document. This is used by metrics like Lexical Gravity that need access to the original text beyond just the contingency table.


Wrapper for lazily storing and accessing processed input documents.

Fields

  • loader::LazyProcess{F,StringDocument}: Lazy document loader

Type Traits and Extensions

Metric Traits

Some metrics require additional information beyond the contingency table:

# Trait to indicate token requirement
NeedsTokens(::Type{LexicalGravity}) = Val(true)
NeedsTokens(::Type{PMI}) = Val(false)
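
A sketch of how such a trait can drive dispatch; the helper names below are hypothetical, not part of the package API:

```julia
# Hypothetical helpers (not package API) illustrating trait-based dispatch
score_metric(M::Type{<:AssociationMetric}, data) = score_metric(M, data, NeedsTokens(M))

function score_metric(M, data, ::Val{false})
    # compute from the contingency table alone
end

function score_metric(M, data, ::Val{true})
    # additionally pull the processed tokens (e.g. via data.input_ref) before computing
end
```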

Custom Types

You can extend the type system with custom metrics:

# Define custom metric
abstract type MyCustomMetric <: AssociationMetric end

# Implement evaluation function
function eval_mycustommetric(data::AssociationDataFormat)
    # Your implementation
end
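
A slightly fuller sketch of what the evaluation function might look like; it assumes the lazily computed table is reachable via cached_data(data.con_tbl) and exposes columns named a, b, and c, which is an assumption about internals rather than documented API:

```julia
using DataFrames

# Hypothetical custom metric (Jaccard-style): a / (a + b + c)
abstract type MyJaccard <: AssociationMetric end

function eval_myjaccard(data::ContingencyTable)
    tbl = cached_data(data.con_tbl)   # column names a, b, c are assumed here
    tbl.a ./ (tbl.a .+ tbl.b .+ tbl.c)
end
```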

Type Conversions and Utilities

Common Conversions

# Convert string to StringDocument
doc = StringDocument("your text")

# Convert to ContingencyTable
ct = ContingencyTable(text(doc), "word"; windowsize=5, minfreq=1)

# Extract DataFrame from results
df = assoc_score(PMI, ct)

Type Checking

# Check if type is an association metric
isa(PMI, Type{<:AssociationMetric})  # true

# Check if data format is valid
isa(ct, AssociationDataFormat)  # true

Performance Considerations

Memory Usage

Type                   | Typical Memory Usage    | Notes
ContingencyTable       | O(vocab_size)           | Lazy loading reduces initial memory
Corpus                 | O(n_docs × avg_length)  | Use streaming for large corpora
CorpusContingencyTable | O(vocab_size × n_docs)  | Aggregated lazily
CollocationNetwork     | O(nodes + edges)        | Scales with network size

Optimization Tips

  1. Use lazy evaluation: Data is computed only when needed.
  2. Reuse contingency tables: Avoid recreating them for multiple metrics (see the sketch after this list).
  3. Stream large corpora: Use stream_corpus_analysis() for memory efficiency.
  4. Cache results: LazyProcess automatically caches computations.
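
Tip 2 in practice (a small sketch; LLR is used alongside PMI purely as a second example metric):

```julia
using TextAssociations

txt = "data science combines data analysis with data engineering"
ct = ContingencyTable(txt, "data"; windowsize=3, minfreq=1)

pmi_scores = assoc_score(PMI, ct)  # contingency data is computed once here
llr_scores = assoc_score(LLR, ct)  # the second metric reuses the cached table
```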

See Also

  • Main Functions — coming soon
  • Corpus Functions — coming soon
  • Metric Functions — coming soon
  • Examples — coming soon