API Overview

🧭 How TextAssociations.jl Evaluates a Metric

TextAssociations.jl follows a transparent, two-layer model for computing word-association scores. No matter how you start—raw text, corpus, or contingency table—everything flows through a single unified pipeline.


1️⃣ Input levels

You can start from any of these representations:

| Input | Example | What happens internally |
| --- | --- | --- |
| Raw text | "It is a truth universally acknowledged..." | A ContingencyTable is built around the target node with the specified windowsize, minfreq, and normalization settings. |
| Corpus object | corpus = read_corpus("data_austen"; preprocess=true, norm_config=norm) | For each node, a CorpusContingencyTable (CCT) is built by merging co-occurrence counts across all documents. |
| Prebuilt contingency data | ct = ContingencyTable(corpus, "might"; windowsize=5) | Used directly, with no text parsing. Ideal for reusing tables or parallel scoring. |

All of these objects implement the AssociationDataFormat interface, which gives the scorer a consistent view of the data.
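
For example, all three entry points below end at the same scorer. This is a minimal sketch: the raw-text and corpus calls mirror the quick examples later on this page, and passing a prebuilt ContingencyTable straight to assoc_score follows from the core signature shown in the next section.

```julia
using TextAssociations

# 1. Raw text: a ContingencyTable is built behind the scenes
assoc_score(PMI, "It is a truth universally acknowledged...", "truth";
            windowsize=5, minfreq=1)

# 2. Corpus: a CorpusContingencyTable is built for the node
corpus = read_corpus("data_austen"; preprocess=true)
assoc_score(PMI, corpus, "truth"; windowsize=5)

# 3. Prebuilt table: used directly, no re-parsing
ct = ContingencyTable(corpus, "truth"; windowsize=5)
assoc_score(PMI, ct)
```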


2️⃣ Delegation chain

```
Raw Text / Corpus
      ↓
  ContingencyTable (CT)
      ↓
CorpusContingencyTable (CCT)
      ↓
assoc_score(::Type{<:AssociationMetric}, x::AssociationDataFormat; kwargs...)
      ↓
Metric evaluator (eval_pmi, eval_llr, eval_bayesllr, eval_logdice, …)
      ↓
Result as DataFrame  (or Vector if scores_only=true)
```

Each overload of assoc_score simply delegates:

  • It constructs the appropriate CT or CCT.
  • Then it calls the core assoc_score(::Type{T}, x::AssociationDataFormat; ...).
  • That function resolves the correct evaluator function, for example eval_pmi or eval_llr (a sketch of one such overload follows this list).
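
The sketch below is illustrative only, not the package source; in particular, the exact ContingencyTable constructor for raw text is an assumption (the docs above only show it built from a corpus).

```julia
# Illustrative sketch (not the actual package source): a raw-text overload
# builds the contingency table, then hands off to the core method.
function assoc_score(::Type{T}, text::AbstractString, node::String;
                     windowsize::Int=5, minfreq::Int=1,
                     kwargs...) where {T<:AssociationMetric}
    # Assumed constructor form; see ContingencyTable(corpus, node; ...) above
    ct = ContingencyTable(text, node; windowsize=windowsize, minfreq=minfreq)
    # The core method resolves the evaluator (eval_pmi, eval_llr, ...) for T
    return assoc_score(T, ct; kwargs...)
end
```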

3️⃣ Node analyzers: analyze_node and analyze_nodes

While assoc_score focuses on scoring collocates for one node and (optionally) multiple metrics, the analyzers are higher-level convenience functions that bundle common steps and summaries.

What they do (conceptually)

  • Build the appropriate contingency table (CT/CCT) for each node.
  • Run one or more association metrics (internally calling assoc_score).
  • Optionally apply sorting/selection (e.g., top-N).
  • Return a tidy, user-facing result (single DataFrame or a Dict of DataFrames), with the same metadata conventions ("status", "message", "node", "windowsize", "metrics").

When to use them

  • Use assoc_score when you want tight control over which metric(s) run and you’ll handle post-processing yourself.
  • Use analyze_node / analyze_nodes when you want the common “analyze this word / these words” workflow in one call, including sorting and selecting top collocates.

Typical signatures (illustrative)

```julia
# Single node, one or many metrics
analyze_node(::Type{<:AssociationMetric}, x::AssociationDataFormat; kwargs...)                   # -> DataFrame
analyze_node(::AbstractVector{<:Type{<:AssociationMetric}}, x::AssociationDataFormat; kwargs...) # -> DataFrame

# Multiple nodes (returns a Dict)
analyze_nodes(::Type{<:AssociationMetric}, x::AssociationDataFormat, nodes::Vector{String}; kwargs...)                   # -> Dict{String,DataFrame}
analyze_nodes(::AbstractVector{<:Type{<:AssociationMetric}}, x::AssociationDataFormat, nodes::Vector{String}; kwargs...) # -> Dict{String,DataFrame}
```
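
In practice that means calls like these. This is a hedged sketch: ct is the table built earlier, the keywords follow the layers described in the next section, and the :PMI value for sort_by is an assumption about the result column name.

```julia
# Single node, single metric, keep the top 20 collocates
df = analyze_node(PMI, ct; top_n=20)

# Single node, several metrics at once
df2 = analyze_node([PMI, LogDice], ct; sort_by=:PMI)

# Several nodes over a corpus: one DataFrame per node
tables = analyze_nodes([PMI, LLR], corpus, ["love", "time"];
                       windowsize=5, top_n=10)
tables["love"]  # the DataFrame for the node "love"
```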

4️⃣ Keyword layers

| Layer | Keywords | Description |
| --- | --- | --- |
| Table construction | windowsize, minfreq, norm_config | Control how co-occurrence contexts are collected. |
| Metric evaluation | λ, base, direction, tokens, scores_only | Affect how the metric is calculated. |
| Output control | scores_only, top_n, sort_by | Shape or filter the result, but not the math itself. |

All keywords are safely forwarded to the correct layer. Metrics that need tokens (e.g. LexicalGravity) are handled automatically through the NeedsTokens trait.
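
For instance, a single call can mix keywords from different layers and each is routed to the right place (a sketch; the values are arbitrary):

```julia
assoc_score(PMI, corpus, "world";
            windowsize=5, minfreq=2,  # table-construction layer
            scores_only=false)        # output-control layer
```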


5️⃣ Return values & metadata

| Return type | Trigger | Description |
| --- | --- | --- |
| DataFrame (default) | scores_only=false | Columns: Node, Collocate, Frequency, <MetricName> |
| Vector{Float64} | scores_only=true | Raw score values aligned with the collocate order |
| Dict{String,DataFrame} | Multi-node call | One table per node, optionally trimmed to top_n |
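
A quick sketch of the first two shapes (the Dict shape appears in the analyze_nodes example above):

```julia
df = assoc_score(PMI, corpus, "love"; windowsize=4)                        # DataFrame
scores = assoc_score(PMI, corpus, "love"; windowsize=4, scores_only=true)  # Vector{Float64}
```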

Each table carries embedded metadata:

| Metadata key | Meaning |
| --- | --- |
| "status" | "ok", "empty", or "error" |
| "message" | Context or diagnostic (e.g. "Node not found.") |
| "node" | The target word |
| "windowsize" | Context window used |
| "metrics" | Metrics evaluated (for multi-metric tables) |

Inspect with:

```julia
using DataFrames

metadata(df)            # Dict of all table-level metadata
metadata(df, "status")  # a single key, e.g. "ok"
```

6️⃣ Mental model summary

Everything becomes a Contingency Table.

Whether you begin with raw text, a corpus, or a precomputed table, the scorer always sees a uniform AssociationDataFormat object. From there, the metric evaluator handles the rest—robustly, transparently, reproducibly.


7️⃣ Quick examples

```julia
# From raw text
assoc_score(PMI, "It is a truth...", "truth"; windowsize=5, minfreq=2)

# From a corpus
assoc_score(LogDice, corpus, "love"; windowsize=4)

# Multiple metrics
assoc_score([PMI, LLR, LogDice], corpus, "world"; windowsize=5, minfreq=2)

# Token-requiring metric (Lexical Gravity)
assoc_score(LexicalGravity, corpus, "beautiful"; windowsize=5, tokens=mytokens)
```

All return standardized DataFrames with attached metadata.


8️⃣ Design philosophy

  • Transparent: identical scoring logic regardless of input type.
  • Composable: CT/CCT objects can be reused or serialized.
  • Safe: empty or failed evaluations never crash—always return a diagnostic table.
  • Extensible: new metrics only need an eval_* function and, if required, a NeedsTokens(::Type{YourMetric}) = Val(true) specialization (see the sketch below).
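
As a hedged illustration of that last point, wiring in a new metric might look like this. The eval_mymetric signature is an assumption based on the naming convention above, not a confirmed API.

```julia
# Sketch of extending the package (exact signatures assumed, not confirmed)
struct MyMetric <: AssociationMetric end

# Only needed if the metric requires the raw token stream
NeedsTokens(::Type{MyMetric}) = Val(true)

# Evaluator following the eval_* naming convention
function eval_mymetric(data::AssociationDataFormat; kwargs...)
    # ... compute one score per collocate from the contingency counts ...
end
```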

Once you understand this flow, you understand the entire package. Every advanced feature—comparison, graphing, temporal analysis—builds on this backbone.