API Overview
🧭 How TextAssociations.jl Evaluates a Metric
TextAssociations.jl follows a transparent, two-layer model for computing word-association scores. No matter how you start—raw text, corpus, or contingency table—everything flows through a single unified pipeline.
1️⃣ Input levels
You can start from any of these representations:
| Input | Example | What happens internally |
|---|---|---|
| Raw text | `"It is a truth universally acknowledged..."` | A `ContingencyTable` is built around the target node with the specified `windowsize`, `minfreq`, and normalization settings. |
| Corpus object | `corpus = read_corpus("data_austen"; preprocess=true, norm_config=norm)` | For each node, a `CorpusContingencyTable` (CCT) is built, merging co-occurrence counts across all documents. |
| Prebuilt contingency data | `ct = ContingencyTable(corpus, "might"; windowsize=5)` | Used directly, with no text parsing. Ideal for reusing tables or parallel scoring. |

All of these objects implement the `AssociationDataFormat` interface, which gives the scorer a consistent view of the data.
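For example, the same node can be scored from any of the three levels; a minimal sketch reusing the constructors above (`data_austen` is a placeholder path):

```julia
using TextAssociations

# From raw text: a ContingencyTable is built internally.
assoc_score(PMI, "It is a truth universally acknowledged...", "truth";
            windowsize=5, minfreq=1)

# From a corpus: co-occurrence counts are merged across documents (CCT).
corpus = read_corpus("data_austen"; preprocess=true)
assoc_score(PMI, corpus, "truth"; windowsize=5)

# From a prebuilt table: used directly, no text parsing.
ct = ContingencyTable(corpus, "truth"; windowsize=5)
assoc_score(PMI, ct)
```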
2️⃣ Delegation chain
```
Raw Text / Corpus
        ↓
ContingencyTable (CT)
        ↓
CorpusContingencyTable (CCT)
        ↓
assoc_score(::Type{<:AssociationMetric}, x::AssociationDataFormat; kwargs...)
        ↓
Metric evaluator (eval_pmi, eval_llr, eval_bayesllr, eval_logdice, …)
        ↓
Result as DataFrame (or Vector if scores_only=true)
```

Each overload of `assoc_score` simply delegates:
- It constructs the appropriate CT or CCT.
- It then calls the core `assoc_score(::Type{T}, x::AssociationDataFormat; ...)`.
- That core method resolves the correct evaluator function, for example `eval_pmi` or `eval_llr`.
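A practical consequence of this delegation: build the table once and let several metrics hit the core method directly, as in this sketch:

```julia
# One CT, several metrics -- each call resolves straight to its evaluator
# (eval_pmi, eval_llr) without re-parsing any text.
ct = ContingencyTable(corpus, "might"; windowsize=5)
pmi_scores = assoc_score(PMI, ct)
llr_scores = assoc_score(LLR, ct)
```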
3️⃣ Node analyzers: `analyze_node` and `analyze_nodes`
While `assoc_score` focuses on scoring collocates for one node and (optionally) multiple metrics, the analyzers are higher-level convenience functions that bundle common steps and summaries.
What they do (conceptually)
- Build the appropriate contingency table (CT/CCT) for each node.
- Run one or more association metrics (internally calling `assoc_score`).
- Optionally apply sorting/selection (e.g., top-N).
- Return a tidy, user-facing result (a single DataFrame or a Dict of DataFrames), with the same metadata conventions (`"status"`, `"message"`, `"node"`, `"windowsize"`, `"metrics"`).
When to use them
- Use `assoc_score` when you want tight control over which metric(s) run and you’ll handle post-processing yourself.
- Use `analyze_node`/`analyze_nodes` when you want the common “analyze this word / these words” workflow in one call, including sorting and selecting top collocates. (See the comparison sketch below.)
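To make the contrast concrete, here is an illustrative sketch; the analyzer keyword `top_n` follows the keyword-layer table in the next section, and the exact signatures are shown below:

```julia
using DataFrames

# Low-level: score, then post-process yourself.
ct = ContingencyTable(corpus, "love"; windowsize=4)
df = assoc_score(PMI, ct)
sort!(df, :PMI, rev=true)   # the metric column is named after the metric
top10 = first(df, 10)

# High-level: scoring, sorting, and top-N selection in one call.
df = analyze_node(PMI, ct; top_n=10)
```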
Typical signatures (illustrative)
```julia
# Single node, one or many metrics
analyze_node(::Type{<:AssociationMetric}, x::AssociationDataFormat; kwargs...) -> DataFrame
analyze_node(::AbstractVector{<:Type{<:AssociationMetric}}, x::AssociationDataFormat; kwargs...) -> DataFrame

# Multiple nodes (returns a Dict)
analyze_nodes(::Type{<:AssociationMetric}, x::AssociationDataFormat, nodes::Vector{String}; kwargs...) -> Dict{String,DataFrame}
analyze_nodes(::AbstractVector{<:Type{<:AssociationMetric}}, x::AssociationDataFormat, nodes::Vector{String}; kwargs...) -> Dict{String,DataFrame}
```

4️⃣ Keyword layers
| Layer | Keywords | Description |
|---|---|---|
| Table construction | `windowsize`, `minfreq`, `norm_config` | Control how co-occurrence contexts are collected. |
| Metric evaluation | `λ`, `base`, `direction`, `tokens` | Affect how the metric is calculated. |
| Output control | `scores_only`, `top_n`, `sort_by` | Shape or filter the result, but not the math itself. |

All keywords are safely forwarded to the correct layer. Metrics that need `tokens` (e.g. `LexicalGravity`) are handled automatically through the `NeedsTokens` trait.
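For instance, one call can mix keywords from different layers, and each is routed to the right stage (a sketch):

```julia
# Table construction (windowsize, minfreq) and output control (scores_only)
# in a single call; scores_only=true yields a bare Vector{Float64}.
scores = assoc_score(PMI, corpus, "world"; windowsize=5, minfreq=2,
                     scores_only=true)
```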
5️⃣ Return values & metadata
| Return type | Trigger | Description |
|---|---|---|
| `DataFrame` (default) | `scores_only=false` | Columns: `Node`, `Collocate`, `Frequency`, `<MetricName>` |
| `Vector{Float64}` | `scores_only=true` | Raw score values aligned with the collocate order |
| `Dict{String,DataFrame}` | Multi-node call | One table per node, optionally trimmed to `top_n` |
Each table carries embedded metadata:
| Metadata key | Meaning |
|---|---|
| `"status"` | `"ok"`, `"empty"`, or `"error"` |
| `"message"` | Context or diagnostic (e.g. “Node not found.”) |
| `"node"` | The target word |
| `"windowsize"` | Context window used |
| `"metrics"` | Metrics evaluated (for multi-metric tables) |
Inspect with:

```julia
using DataFrames
metadata(df)
```
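Because the metadata travels with the table, downstream code can branch on it; a sketch using the standard DataFrames.jl accessor `metadata(df, key)`:

```julia
using DataFrames

df = assoc_score(PMI, corpus, "love"; windowsize=4)
# Skip nodes whose evaluation came back empty or failed.
if metadata(df, "status") != "ok"
    @warn "No usable result: $(metadata(df, "message"))"
end
```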
6️⃣ Mental model summary
Everything becomes a Contingency Table.
Whether you begin with raw text, a corpus, or a precomputed table, the scorer always sees a uniform AssociationDataFormat object. From there, the metric evaluator handles the rest—robustly, transparently, reproducibly.
7️⃣ Quick examples
```julia
# From raw text
assoc_score(PMI, "It is a truth...", "truth"; windowsize=5, minfreq=2)

# From a corpus
assoc_score(LogDice, corpus, "love"; windowsize=4)

# Multiple metrics
assoc_score([PMI, LLR, LogDice], corpus, "world"; windowsize=5, minfreq=2)

# Token-requiring metric (Lexical Gravity)
assoc_score(LexicalGravity, corpus, "beautiful"; windowsize=5, tokens=mytokens)
```

All return standardized DataFrames with attached metadata.
8️⃣ Design philosophy
- Transparent: identical scoring logic regardless of input type.
- Composable: CT/CCT objects can be reused or serialized.
- Safe: empty or failed evaluations never crash—always return a diagnostic table.
- Extensible: new metrics only need an `eval_*` function and, if required, a `NeedsTokens(::Type{YourMetric}) = Val(true)` specialization (see the sketch below).
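For orientation, the extension pattern looks roughly like the sketch below. Every name here (`MyScore`, `eval_myscore`) is hypothetical, and the exact evaluator signature should be checked against the package’s developer documentation:

```julia
# Hypothetical metric type; real metrics in the package may be defined differently.
struct MyScore <: AssociationMetric end

# Evaluator that assoc_score resolves to by the eval_* naming convention;
# the body is elided -- compute scores from the counts carried by x.
function eval_myscore(x::AssociationDataFormat)
    # ...
end

# Opt in to raw-token access only if the metric needs it (cf. LexicalGravity).
NeedsTokens(::Type{MyScore}) = Val(true)
```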
Once you understand this flow, you understand the entire package. Every advanced feature—comparison, graphing, temporal analysis—builds on this backbone.