Core Types
This section documents all core types in TextAssociations.jl. These types form the foundation of the package's functionality.
Overview
The type system is organized into several categories:
```
TextAssociations Types
├── Data Structures
│   ├── ContingencyTable         # Word co-occurrence data
│   ├── Corpus                   # Document collection
│   └── CorpusContingencyTable   # Aggregated corpus data
├── Analysis Results
│   ├── MultiNodeAnalysis        # Multiple word analysis
│   ├── TemporalCorpusAnalysis   # Time-based analysis
│   ├── SubcorpusComparison      # Comparative analysis
│   ├── CollocationNetwork       # Network representation
│   └── Concordance              # KWIC concordance
├── Abstract Types
│   ├── AssociationMetric        # Base for all metrics
│   └── AssociationDataFormat    # Base for data formats
└── Utility Types
    ├── LazyProcess              # Lazy evaluation wrapper
    └── LazyInput                # Lazy input wrapper
```
Primary Data Structures
ContingencyTable
TextAssociations.ContingencyTable — Type

```julia
ContingencyTable <: AssociationDataFormat
```
Represents a contingency table for word co-occurrence analysis.
Fields
- `con_tbl`: Lazy-loaded contingency table data
- `node`: Target word (normalized)
- `windowsize`: Context window size
- `minfreq`: Minimum frequency threshold
- `input_ref`: Reference to the processed input document
- `norm_config`: Text normalization configuration
TextAssociations.TextNorm — Type

```julia
TextNorm(; strip_case=true,
           strip_accents=false,
           unicode_form=:NFC,
           strip_punctuation=true,
           punctuation_to_space=true,
           normalize_whitespace=true,
           strip_whitespace=false,
           use_prepare=false)
```

Configuration for text normalization used by `prep_string` and corpus loaders.
Fields
- `strip_case::Bool` — Lowercase the text when `true`.
- `strip_accents::Bool` — Remove combining diacritics (e.g., Greek tonos, diaeresis).
- `unicode_form::Symbol` — Unicode normalization form (`:NFC`, `:NFD`, `:NFKC`, `:NFKD`).
- `strip_punctuation::Bool` — If `true`, remove punctuation; combined with `punctuation_to_space` to decide whether punctuation is replaced with a space or deleted.
- `punctuation_to_space::Bool` — When stripping punctuation, replace it with a single space (if `true`) or remove it (if `false`).
- `normalize_whitespace::Bool` — Collapse consecutive whitespace to a single space.
- `strip_whitespace::Bool` — Trim leading/trailing whitespace.
- `use_prepare::Bool` — Internal flag to route through a more aggressive prepare path.
Constructors
- `TextNorm()` — defaults above.
- `TextNorm(d::Dict)` and `TextNorm(nt::NamedTuple)` — convenience constructors; keys must match field names.
Notes
In `prep_string`, normalization typically proceeds as: Unicode normalization → punctuation handling → whitespace normalization, then (if enabled) case folding and accent stripping.
Examples
```julia
julia> cfg = TextNorm(strip_accents=true, strip_whitespace=true);

julia> doc = prep_string("  Καφέ, naïve résumé! ", cfg);

julia> text(doc)
"καφε naive resume"
```
The `ContingencyTable` is the fundamental data structure for word co-occurrence analysis. It stores information about how often words co-occur within a specified window.
Constructor
```julia
function ContingencyTable(inputstring::AbstractString,
                          node::AbstractString;
                          windowsize::Int,
                          minfreq::Int=5,
                          norm_config::TextNorm=TextNorm())
```
Parameters
- `inputstring`: Text to analyze (can be raw text, a file path, or a directory)
- `node`: Target word to analyze
- `windowsize`: Number of words to consider on each side
- `minfreq`: Minimum frequency threshold (default: 5)
- `norm_config`: Text normalization configuration (default: `TextNorm()`)
Fields
- `con_tbl::LazyProcess{T,DataFrame}`: Lazy-loaded contingency data
- `node::AbstractString`: The target word
- `windowsize::Int`: Context window size
- `minfreq::Int64`: Minimum frequency threshold
- `input_ref::LazyInput`: Reference to processed input
Example Usage
```julia
using TextAssociations
using DataFrames

sample_text = """
The field of data science combines statistical analysis with machine learning.
Data scientists use various tools for data visualization and data mining.
Modern data science relies heavily on big data technologies.
"""

# Create a contingency table for "data" (text and node are positional; the rest are keywords)
ct = ContingencyTable(sample_text, "data"; windowsize=3, minfreq=1)

# The table is computed lazily when first accessed
results = assoc_score(PMI, ct)
println("Found $(nrow(results)) collocates for 'data'")
```

```
Found 24 collocates for 'data'
```
Contingency Table Structure
The internal contingency table contains the following values for each word pair:
| Cell | Description | Formula |
|---|---|---|
| a | Co-occurrence frequency | f(node, collocate) |
| b | Node without collocate | f(node) - a |
| c | Collocate without node | f(collocate) - a |
| d | Neither occurs | N - a - b - c |
| N | Total observations | Total positions |
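To make the cell definitions concrete, here is a minimal sketch (not the package's internal code) that computes PMI straight from the four cells, using the standard formula PMI = log2(a·N / ((a+b)(a+c))):

```julia
# Minimal sketch (not TextAssociations internals): PMI from the four cells.
# a, b, c, d follow the table above; N = a + b + c + d.
function pmi_from_cells(a, b, c, d)
    N = a + b + c + d
    expected = (a + b) * (a + c) / N   # expected co-occurrence under independence
    log2(a / expected)
end

pmi_from_cells(10, 40, 20, 930)  # ≈ 2.74 with f(node) = 50, f(collocate) = 30, N = 1000
```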
Corpus
TextAssociations.Corpus — Type

```julia
Corpus <: AssociationDataFormat
```

Represents a collection of documents for corpus-level analysis.
Constructor
```julia
Corpus(documents::Vector{StringDocument};
       build_dtm::Bool=false,
       metadata::Dict{String,Any}=Dict())
```
Fields
- `documents::Vector{StringDocument}`: Collection of documents
- `metadata::Dict{String,Any}`: Document metadata
- `vocabulary::OrderedDict{String,Int}`: Word-to-index mapping
- `doc_term_matrix::Union{Nothing,SparseMatrixCSC}`: Optional document-term matrix
Example Usage
```julia
using TextAssociations
using TextAnalysis: StringDocument  # avoid bringing TextAnalysis.Corpus into scope

# Create a corpus from documents
docs = [
    StringDocument("Artificial intelligence is transforming technology."),
    StringDocument("Machine learning is a subset of artificial intelligence."),
    StringDocument("Deep learning uses neural networks.")
]

corpus = TextAssociations.Corpus(docs, metadata=Dict{String,Any}("source" => "AI texts"))

println("Corpus Statistics:")
println("  Documents: ", length(corpus.documents))
println("  Vocabulary size: ", length(corpus.vocabulary))
println("  Metadata: ", collect(keys(corpus.metadata)))
```

```
Corpus Statistics:
  Documents: 3
  Vocabulary size: 16
  Metadata: ["source"]
```
CorpusContingencyTable
TextAssociations.CorpusContingencyTable — Type

```julia
CorpusContingencyTable
```

Aggregates contingency tables across an entire corpus for comprehensive analysis, using the corpus's normalization configuration.
Constructor
```julia
CorpusContingencyTable(corpus::Corpus,
                       node::AbstractString;
                       windowsize::Int,
                       minfreq::Int64=5)
```
Fields
- `tables::Vector{ContingencyTable}`: Individual document tables
- `aggregated_table::LazyProcess`: Lazily computed aggregate
- `node::AbstractString`: Target word
- `windowsize::Int`: Context window
- `minfreq::Int64`: Minimum frequency threshold
- `corpus_ref::Corpus`: Reference to source corpus
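The docstring gives no example, so here is a minimal usage sketch, assuming `assoc_score` accepts a `CorpusContingencyTable` the same way it accepts a `ContingencyTable` (both are `AssociationDataFormat` subtypes, see below):

```julia
# Sketch: corpus-level collocation scores for one node word
using TextAssociations
using TextAnalysis: StringDocument

docs = [
    StringDocument("Data science uses data pipelines."),
    StringDocument("Big data needs scalable data storage.")
]
corpus = TextAssociations.Corpus(docs)

cct = CorpusContingencyTable(corpus, "data"; windowsize=3, minfreq=1)
results = assoc_score(PMI, cct)  # aggregated lazily across all documents
```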
Analysis Result Types
MultiNodeAnalysis
TextAssociations.MultiNodeAnalysis — Type

```julia
MultiNodeAnalysis
```

Stores results from analyzing multiple node words across a corpus.
Fields
- `nodes::Vector{String}`: Analyzed words
- `results::Dict{String,DataFrame}`: Results per node
- `corpus_ref::Corpus`: Source corpus
- `parameters::Dict{Symbol,Any}`: Analysis parameters
Example
```julia
# Example (illustrative): assumes `corpus` and `using DataFrames` from the examples above
analysis = MultiNodeAnalysis(
    ["learning", "intelligence"],
    Dict("learning" => DataFrame(), "intelligence" => DataFrame()),
    corpus,
    Dict(:windowsize => 5, :metric => PMI)
)
```
TemporalCorpusAnalysis
TextAssociations.TemporalCorpusAnalysis — Type

```julia
TemporalCorpusAnalysis
```

Stores results from analyzing word associations over time periods.
Fields
- `time_periods::Vector{String}`: Period labels
- `results_by_period::Dict{String,MultiNodeAnalysis}`: Period-specific results
- `trend_analysis::DataFrame`: Trend statistics
- `corpus_ref::Corpus`: Source corpus
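Following the illustrative pattern used for the other result types, a hand-built instance might look like this; the positional constructor (field order as above) and all values are placeholders:

```julia
# Example (illustrative): placeholder values, positional construction assumed
using DataFrames

periods = ["2000s", "2010s"]
t_analysis = TemporalCorpusAnalysis(
    periods,
    Dict{String,MultiNodeAnalysis}(),                 # results_by_period (left empty here)
    DataFrame(Period = periods, Trend = [0.1, 0.3]),  # trend_analysis
    corpus                                            # corpus_ref from the Corpus example
)
```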
SubcorpusComparison
TextAssociations.SubcorpusComparison — Type

```julia
SubcorpusComparison
```

Stores results from comparing word associations between different subcorpora.
Fields
- `subcorpora::Dict{String,Corpus}`: Subcorpus divisions
- `node::String`: Analyzed word
- `results::Dict{String,DataFrame}`: Results per subcorpus
- `statistical_tests::DataFrame`: Statistical comparisons
- `effect_sizes::DataFrame`: Effect size calculations
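Again purely illustrative, assuming a positional constructor in field order:

```julia
# Example (illustrative): placeholder subcorpora and statistics
using DataFrames

comparison = SubcorpusComparison(
    Dict("fiction" => corpus, "news" => corpus),  # placeholder subcorpora
    "data",
    Dict("fiction" => DataFrame(), "news" => DataFrame()),
    DataFrame(Test = ["log-likelihood"], PValue = [0.05]),
    DataFrame(Measure = ["Cohen's d"], Value = [0.4])
)
```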
CollocationNetwork
TextAssociations.CollocationNetwork — Type

```julia
CollocationNetwork
```

Network representation of word collocations for visualization and analysis.
Fields
- `nodes::Vector{String}`: Network nodes (words)
- `edges::DataFrame`: Edge data with columns [Source, Target, Weight, Metric]
- `node_metrics::DataFrame`: Per-node metrics
- `parameters::Dict{Symbol,Any}`: Network construction parameters
Example
```julia
# Example (illustrative): build a small network by hand
using DataFrames

nodes = ["machine", "learning", "deep", "neural"]
edges = DataFrame(
    Source = ["machine", "machine", "deep"],
    Target = ["learning", "deep", "neural"],
    Weight = [8.5, 6.2, 7.8],
    Metric = ["PMI", "PMI", "PMI"]
)
node_metrics = DataFrame(
    Node = nodes,
    Degree = [2, 1, 2, 1],
    AvgScore = [7.35, 8.5, 7.0, 7.8]
)
network = CollocationNetwork(
    nodes, edges, node_metrics,
    Dict(:metric => PMI, :depth => 2)
)
```
Concordance
TextAssociations.Concordance — Type

```julia
Concordance
```

KWIC (Key Word In Context) concordance representation.
Fields
- `node::String`: Target word
- `lines::DataFrame`: Concordance lines with columns [LeftContext, Node, RightContext, DocId, Position]
- `statistics::Dict{Symbol,Any}`: Occurrence statistics
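To show the documented column layout of `lines`, here is an illustrative hand-built instance (positional constructor assumed):

```julia
# Example (illustrative): KWIC lines using the documented columns
using DataFrames

kwic_lines = DataFrame(
    LeftContext = ["use various tools for", "relies heavily on big"],
    Node = ["data", "data"],
    RightContext = ["visualization and", "technologies."],
    DocId = [1, 1],
    Position = [5, 7]
)
conc = Concordance("data", kwic_lines, Dict{Symbol,Any}(:occurrences => 2))
```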
Abstract Types
AssociationMetric
TextAssociations.AssociationMetric — Type

Abstract supertype for all association metrics. All specific metrics (PMI, Dice, LLR, etc.) inherit from this type.
Type Hierarchy
```julia
abstract type AssociationMetric <: SemiMetric end

# Example metric subtypes (type tags used for dispatch)
abstract type PMI <: AssociationMetric end
abstract type Dice <: AssociationMetric end
abstract type LLR <: AssociationMetric end
# ... more metrics
```
AssociationDataFormat
TextAssociations.AssociationDataFormat — Type

Abstract supertype for data formats used in association computations.
Subtypes
- `ContingencyTable`: Single-document analysis
- `CorpusContingencyTable`: Corpus-level analysis
Utility Types
LazyProcess
TextAssociations.LazyProcess — Type

```julia
LazyProcess{T,R}
```

Lazy evaluation wrapper for deferred computations. Stores a function that computes a result when first needed and caches it for subsequent calls.
Type Parameters
- `T`: Function type
- `R`: Result type
Fields
- `f::T`: Function to compute the result
- `cached_result::Union{Nothing,R}`: Cached result
- `cached_process::Bool`: Whether the result is cached
Example
```julia
using TextAssociations
using DataFrames

# Return a DataFrame so it matches LazyProcess{...,DataFrame}
expensive_df() = DataFrame(x = 1:3, y = [10, 20, 30])
lp = LazyProcess(expensive_df)  # default R = DataFrame

# First call computes the result
result1 = cached_data(lp)

# Second call uses the cache
result2 = cached_data(lp)

println("Results equal: ", result1 == result2)
```

```
Results equal: true
```
LazyInput
TextAssociations.LazyInput — Type

```julia
LazyInput
```

Wrapper for lazily storing and accessing the processed input document. This is used by metrics like Lexical Gravity that need access to the original text beyond just the contingency table.
Fields
- `loader::LazyProcess{F,StringDocument}`: Lazy document loader
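Since `loader` is itself a `LazyProcess`, the wrapped `StringDocument` can presumably be fetched with `cached_data`; a sketch, assuming the field access path shown here is stable API:

```julia
# Sketch (field access path assumed): reach the processed document behind a table
ct = ContingencyTable("some sample text with words", "words"; windowsize=2, minfreq=1)
doc = cached_data(ct.input_ref.loader)  # loads once, then returns the cached StringDocument
```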
Type Traits and Extensions
Metric Traits
Some metrics require additional information beyond the contingency table:
```julia
# Trait indicating whether a metric needs the raw token stream
NeedsTokens(::Type{LexicalGravity}) = Val(true)
NeedsTokens(::Type{PMI}) = Val(false)
```
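A sketch of how such a trait can steer dispatch; the `compute` helper is hypothetical, not package API:

```julia
# Illustrative trait dispatch (the `compute` helper is hypothetical)
compute(metric, data) = compute(NeedsTokens(metric), metric, data)
compute(::Val{true},  metric, data) = "fetch tokens via input_ref, then score"
compute(::Val{false}, metric, data) = "score from the contingency table alone"
```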
Custom Types
You can extend the type system with custom metrics:
```julia
# Define a custom metric
abstract type MyCustomMetric <: AssociationMetric end

# Implement the evaluation function
function eval_mycustommetric(data::AssociationDataFormat)
    # Your implementation
end
```
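As a concrete sketch, the body below computes a simple co-occurrence share from the contingency DataFrame; the column names `a`, `b`, `c` and the `cached_data(ct.con_tbl)` access path are assumptions based on the structures documented above:

```julia
# Sketch of a complete custom metric (column names are assumed, not verified)
abstract type MyCustomMetric <: AssociationMetric end

function eval_mycustommetric(ct::ContingencyTable)
    tbl = cached_data(ct.con_tbl)        # lazily materialize the DataFrame
    tbl.a ./ (tbl.a .+ tbl.b .+ tbl.c)   # co-occurrence share per collocate
end
```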
Type Conversions and Utilities
Common Conversions
```julia
using TextAssociations
using TextAnalysis: StringDocument, text

# Wrap a raw string in a StringDocument
doc = StringDocument("your text")

# Build a ContingencyTable from the document's text
ct = ContingencyTable(text(doc), "word"; windowsize=5, minfreq=1)

# Extract a DataFrame of scores
df = assoc_score(PMI, ct)
```
Type Checking
```julia
# Check whether a type is an association metric
isa(PMI, Type{<:AssociationMetric})  # true

# Check whether a value is a valid data format
isa(ct, AssociationDataFormat)       # true
```
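Relatedly, the metric types defined under `AssociationMetric` can be enumerated at runtime with `subtypes` from the InteractiveUtils standard library:

```julia
using InteractiveUtils: subtypes

# Print every direct subtype of AssociationMetric (PMI, Dice, LLR, ...)
for m in subtypes(AssociationMetric)
    println(m)
end
```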
Performance Considerations
Memory Usage
| Type | Typical Memory Usage | Notes |
|---|---|---|
| ContingencyTable | O(vocab_size) | Lazy loading reduces initial memory |
| Corpus | O(n_docs × avg_length) | Use streaming for large corpora |
| CorpusContingencyTable | O(vocab_size × n_docs) | Aggregated lazily |
| CollocationNetwork | O(nodes + edges) | Scales with network size |
Optimization Tips
- Use lazy evaluation: Data is computed only when needed.
- Reuse contingency tables: Avoid recreating them for multiple metrics (see the sketch below).
- Stream large corpora: Use `stream_corpus_analysis()` for memory efficiency.
- Cache results: `LazyProcess` automatically caches computations.
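The sketch below illustrates the reuse tip: because the table is built lazily and cached, scoring several metrics against the same `ContingencyTable` pays the construction cost only once (`sample_text` is reused from the ContingencyTable example above):

```julia
# Build once, score many: the table is computed on first use and then cached
ct = ContingencyTable(sample_text, "data"; windowsize=3, minfreq=1)
scores = Dict(m => assoc_score(m, ct) for m in (PMI, Dice, LLR))
```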
See Also
- Main Functions — coming soon
- Corpus Functions — coming soon
- Metric Functions — coming soon
- Examples — coming soon