Working with Corpora
This guide covers corpus-level analysis, from loading documents to advanced corpus operations.
Loading Corpora
From Text Files
using TextAssociations
using DataFrames
# Create example files
temp_dir = mktempdir()
# Create sample text files
texts = [
"Machine learning transforms data into insights. It parses corpora at scale, aligns term frequencies with semantic relationships, and highlights the associations that warrant closer study. Each iteration refines the embeddings, letting nuanced usage patterns surface from raw text streams. Within minutes, what began as unstructured prose becomes an interpretable map of concepts, trends, and contextual signals your analysts can act on. Continual learning cycles tie human feedback to automated scoring, ensuring each model update remains accountable across the network. In parallel, learning analytics expose which network nodes contribute novel context, letting curators rebalance data pipelines. As the knowledge graph expands, network observers watch federated network edges synchronize in real time, while downstream services tap network dashboards that keep learning teams aligned.",
"Deep learning uses neural networks extensively. Layers of nonlinear transformations enable the models to capture complex patterns, extracting latent representations that conventional features miss. When trained on rich corpora, the networks adapt to domain-specific nuances, delivering higher accuracy in tasks like entity recognition and sentiment analysis. With appropriate regularization and interpretability tools, practitioners can translate these dense embeddings into actionable insights while maintaining trust in the system’s predictions. Adaptive learning pipelines coordinate semantic parsers with a distributed network of annotators. Each network ingests curated corpora, while a secondary network synchronizes contextual metadata across regional clusters. During offline learning, analysts probe the recommendation network for bias signals, then trigger online learning updates that reweight features without stalling the monitoring network.",
"Data science combines statistics and programming. Analysts unify probabilistic models with code-driven automation to surface trends that raw tables conceal. This fusion accelerates experimentation, supports reproducible pipelines, and turns exploratory questions into measurable, actionable metrics across evolving datasets. Integrated learning cohorts audit each network channel to confirm data provenance. A governance network curates feature stores while a delivery network pushes dashboards to product teams. Scenario-based learning guides analysts as a simulation network harmonizes with a resilience network that shields critical pipelines. Hands-on learning labs document their findings for future audits."
]
for (i, text) in enumerate(texts)
write(joinpath(temp_dir, "doc$i.txt"), text)
end
# Load from directory
corpus = read_corpus(temp_dir;
norm_config=TextNorm(strip_case=true, strip_punctuation=true),
min_doc_length=5,
max_doc_length=1000
)
println("Loaded $(length(corpus.documents)) documents")
println("Vocabulary size: $(length(corpus.vocabulary))")
# Clean up
rm(temp_dir, recursive=true)
Loaded 3 documents
Vocabulary size: 231
From DataFrames
using TextAssociations, DataFrames
# Create a DataFrame with text and metadata
df = DataFrame(
text = [
"Artificial intelligence revolutionizes technology.",
"Machine learning enables pattern recognition.",
"Deep learning mimics human neural networks."
],
category = ["AI", "ML", "DL"],
year = [2023, 2023, 2024],
importance = ["high", "high", "medium"]
)
# Load corpus from DataFrame
corpus = read_corpus_df(df;
text_column=:text,
metadata_columns=[:category, :year, :importance],
norm_config=TextNorm()
)
println("Corpus from DataFrame:")
println(" Documents: $(length(corpus.documents))")
println(" Metadata fields: $(keys(corpus.metadata))")
Corpus from DataFrame:
Documents: 3
Documents with metadata: ["doc_3", "doc_1", "doc_2"]
Corpus Statistics
Basic Statistics
using TextAssociations
using TextAnalysis: StringDocument
# Create a sample corpus
texts = [
"Natural language processing enables computers to understand human language.",
"Machine learning algorithms learn patterns from data automatically.",
"Deep neural networks consist of multiple hidden layers.",
"Artificial intelligence includes machine learning and deep learning."
]
docs = [StringDocument(t) for t in texts]
corpus = Corpus(docs)
# Get comprehensive statistics
stats = corpus_stats(corpus; include_token_distribution=true)
println("Corpus Statistics:")
println(" Documents: $(stats[:num_documents])")
println(" Total tokens: $(stats[:total_tokens])")
println(" Unique tokens: $(stats[:unique_tokens])")
println(" Type-token ratio: $(round(stats[:type_token_ratio], digits=4))")
println(" Hapax legomena: $(stats[:hapax_legomena])")
println("\nVocabulary coverage:")
println(" 50% coverage: $(stats[:words_for_50_percent_coverage]) words")
println(" 90% coverage: $(stats[:words_for_90_percent_coverage]) words")
# Display coverage summary
coverage_summary(stats)
Corpus Statistics:
Documents: 4
Total tokens: 37
Unique tokens: 31
Type-token ratio: 0.8378
Hapax legomena: 28
Vocabulary coverage:
50% coverage: 13 words
90% coverage: 28 words
=== Vocabulary Coverage Summary ===
Total tokens: 37
Vocabulary size: 31
Coverage Statistics:
──────────────────────────────────────────────────
25% of corpus: 4 words ( 12.90% of vocabulary)
50% of corpus: 13 words ( 41.94% of vocabulary)
75% of corpus: 22 words ( 70.97% of vocabulary)
90% of corpus: 28 words ( 90.32% of vocabulary)
95% of corpus: 30 words ( 96.77% of vocabulary)
99% of corpus: 31 words (100.00% of vocabulary)
100% of corpus: 31 words (100.00% of vocabulary)
Lexical Diversity:
──────────────────────────────────────────────────
Type-Token Ratio: 0.8378
Hapax Legomena: 28 (90.32% of vocabulary)
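The coverage figures mean, for example, that the 13 most frequent word types account for half of all tokens in the corpus. The sketch below reproduces such figures by hand from raw token counts; exact numbers may differ slightly from corpus_stats depending on tokenization and normalization.
using TextAnalysis: tokens
# Count raw token frequencies across the corpus (illustrative only;
# corpus_stats already reports these figures after normalization)
counts = Dict{String, Int}()
for doc in corpus.documents, t in tokens(doc)
    counts[t] = get(counts, t, 0) + 1
end
freqs = sort(collect(values(counts)), rev=true)     # frequencies, most common first
cum = cumsum(freqs)
words_for_50 = findfirst(>=(0.5 * last(cum)), cum)  # types needed to cover 50% of tokens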
Token Distribution
using TextAssociations
# Analyze token distribution
dist = token_distribution(corpus)
println("\nTop 10 most frequent tokens:")
for row in eachrow(first(dist, 10))
println(" $(row.Token): $(row.Frequency) (TF-IDF: $(round(row.TFIDF, digits=2)))")
end
Top 10 most frequent tokens:
network: 15 (TF-IDF: 0.0)
learning: 11 (TF-IDF: 0.0)
the: 9 (TF-IDF: 3.65)
that: 6 (TF-IDF: 0.0)
and: 6 (TF-IDF: 0.0)
a: 6 (TF-IDF: 2.43)
to: 6 (TF-IDF: 0.0)
with: 5 (TF-IDF: 0.0)
pipelines: 4 (TF-IDF: 0.0)
while: 4 (TF-IDF: 0.0)
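Function words dominate the raw frequency counts, so it is often useful to filter them out before inspecting the distribution. The sketch below uses an illustrative stopword list (not part of TextAssociations) and standard DataFrames row filtering.
# Drop common English function words from the distribution (illustrative list)
stopwords = Set(["the", "a", "an", "and", "to", "that", "of", "in", "with", "as", "while"])
content_dist = filter(row -> !(lowercase(String(row.Token)) in stopwords), dist)
first(content_dist, 10)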
Corpus-Level Analysis
Single-Node Analysis
using TextAssociations
# Analyze a single word across the corpus
results = analyze_corpus(
corpus,
"learning",
PMI;
windowsize=5,
minfreq=1
)
println("Top collocates of 'learning' across corpus:")
for row in eachrow(first(results, 5))
println(" $(row.Collocate): Score=$(round(row.Score, digits=2)), DocFreq=$(row.DocFrequency)")
end
Top collocates of 'learning' across corpus:
based: Score=-1.1, DocFreq=1
datasets: Score=-1.1, DocFreq=1
product: Score=-1.1, DocFreq=1
shields: Score=-1.1, DocFreq=1
document: Score=-1.1, DocFreq=1
Multi-Node Analysis
using TextAssociations
# Analyze multiple nodes
nodes = ["machine", "learning", "neural"]
metrics = [PMI, LogDice, LLR]
analysis = analyze_nodes(
corpus,
nodes,
metrics;
windowsize=5,
minfreq=1,
top_n=10
)
# Access results for each node
for node in analysis.nodes
node_results = analysis.results[node]
if !isempty(node_results)
println("\nTop collocates for '$node':")
for row in eachrow(first(node_results, 3))
println(" $(row.Collocate): PMI=$(round(row.PMI, digits=2))")
end
end
end
Top collocates for 'neural':
of: PMI=0.0
uses: PMI=0.0
networks: PMI=0.0
Top collocates for 'machine':
transforms: PMI=0.0
into: PMI=0.0
learning: PMI=0.0
Top collocates for 'learning':
based: PMI=-1.1
datasets: PMI=-1.1
product: PMI=-1.1
Advanced Corpus Operations
Temporal Analysis
using TextAssociations, Dates, DataFrames
# Create DataFrame with temporal metadata
df = DataFrame(
text = [
"Early AI used rule-based systems.",
"Machine learning emerged as dominant approach.",
"Deep learning revolutionized the field.",
"Transformers changed natural language processing."
],
year = [1980, 1990, 2010, 2020]
)
# Use read_corpus_df to properly store metadata
temporal_corpus = read_corpus_df(df;
text_column=:text,
metadata_columns=[:year]
)
# Analyze temporal trends
temporal_analysis = analyze_temporal(
temporal_corpus,
["AI", "learning"],
:year,
PMI;
time_bins=2,
windowsize=5,
minfreq=1
)
println("Temporal Analysis Results:")
println("Time periods: ", temporal_analysis.time_periods)
if !isempty(temporal_analysis.trend_analysis)
println("\nTrend analysis:")
for row in eachrow(first(temporal_analysis.trend_analysis, 5))
println(" $(row.Node) + $(row.Collocate): correlation=$(round(row.Correlation, digits=2))")
end
end
Temporal Analysis Results:
Time periods: ["1980.0-2000.0", "2000.0-2020.0"]
Subcorpus Comparison
using TextAssociations, DataFrames
# Create corpus with categories
df = DataFrame(
text = [
"Scientific research requires rigorous methodology before the analysis is conducted.",
"Business analysis focuses on market trends.",
"Scientific experiments test hypotheses systematically and analyze the resulting data.",
"Business strategy and analysis drives organizational success."
],
field = ["Science", "Business", "Science", "Business"]
)
categorized_corpus = read_corpus_df(df;
text_column=:text,
metadata_columns=[:field]
)
# Compare subcorpora
comparison = compare_subcorpora(
categorized_corpus,
:field,
"analysis",
PMI;
windowsize=5,
minfreq=1
)
println("Subcorpus Comparison:")
for (subcorpus_name, results) in comparison.results
if !isempty(results)
println("\n$subcorpus_name subcorpus:")
for row in eachrow(first(results, 2))
println(" $(row.Collocate): Score=$(round(row.Score, digits=2))")
end
end
end
Subcorpus Comparison:
Business subcorpus:
market: Score=0.0
drives: Score=0.0
Science subcorpus:
before: Score=0.0
requires: Score=0.0
Keyword Extraction
using TextAssociations
# Extract keywords using TF-IDF
keywords = keyterms(
corpus;
method=:tfidf,
num_keywords=10,
min_doc_freq=1,
max_doc_freq_ratio=0.8
)
println("\nTop Keywords (TF-IDF):")
for row in eachrow(keywords)
println(" $(row.Keyword): TFIDF=$(round(row.TFIDF, digits=2)), DocFreq=$(row.DocFreq)")
end
Top Keywords (TF-IDF):
the: TFIDF=1.22, DocFreq=2
a: TFIDF=0.81, DocFreq=2
letting: TFIDF=0.73, DocFreq=1
networks: TFIDF=0.73, DocFreq=1
features: TFIDF=0.73, DocFreq=1
data: TFIDF=0.54, DocFreq=2
in: TFIDF=0.54, DocFreq=2
corpora: TFIDF=0.41, DocFreq=2
as: TFIDF=0.41, DocFreq=2
of: TFIDF=0.41, DocFreq=2
Building Collocation Networks
using TextAssociations
# Build collocation network
network = colloc_graph(
corpus,
["learning", "network"]; # Seed words
metric=PMI,
depth=1,
min_score=-10.0,
max_neighbors=5,
windowsize=5,
minfreq=1
)
println("\nCollocation Network:")
println(" Nodes: $(length(network.nodes))")
println(" Edges: $(nrow(network.edges))")
if !isempty(network.edges)
println("\nStrongest connections:")
for row in eachrow(first(sort(network.edges, :Weight, rev=true), 5))
println(" $(row.Source) → $(row.Target): $(round(row.Weight, digits=2))")
end
end
Collocation Network:
Nodes: 12
Edges: 10
Strongest connections:
learning → based: -1.1
learning → datasets: -1.1
learning → product: -1.1
learning → shields: -1.1
learning → document: -1.1
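Because network.edges is a plain DataFrame, the edge list can be written out for external graph tools such as Gephi or Cytoscape. The file name below is only an example.
using CSV
# Export the weighted edge list for visualization elsewhere
CSV.write("collocation_edges.csv", network.edges)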
Memory-Efficient Processing
Batch Processing
using TextAssociations
function batch_analyze_corpus(corpus::Corpus, nodes::Vector{String}, batch_size::Int=10)
all_results = Dict{String, DataFrame}()
for batch_start in 1:batch_size:length(nodes)
batch_end = min(batch_start + batch_size - 1, length(nodes))
batch_nodes = nodes[batch_start:batch_end]
println("Processing batch: nodes $batch_start-$batch_end")
# Analyze batch
batch_analysis = analyze_nodes(
corpus, batch_nodes, [PMI];
windowsize=5, minfreq=1
)
# Store results
for (node, results) in batch_analysis.results
all_results[node] = results
end
# Force garbage collection between batches
GC.gc()
end
return all_results
end
# Example with many nodes
many_nodes = ["machine", "learning", "deep", "neural", "network",
"algorithm", "data", "pattern"]
batch_results = batch_analyze_corpus(corpus, many_nodes, 3)
println("\nBatch processing complete: $(length(batch_results)) nodes analyzed")
Processing batch: nodes 1-3
Processing batch: nodes 4-6
Processing batch: nodes 7-8
Batch processing complete: 8 nodes analyzed
Streaming Analysis
using TextAssociations
using Glob: glob  # file-pattern matching used in the streaming loop
function stream_analyze(file_pattern::String, node::String)
aggregated_scores = Dict{String, Float64}()
doc_count = 0
# Process files one at a time
for file in glob(file_pattern)
# Read single file
text = read(file, String)
# Analyze
ct = ContingencyTable(text, node; windowsize=5, minfreq=1)
results = assoc_score(PMI, ct; scores_only=false)
# Aggregate results
for row in eachrow(results)
collocate = String(row.Collocate)
score = row.PMI
# Incremental mean over all documents seen so far (a collocate absent from earlier documents counts as 0 there)
current = get(aggregated_scores, collocate, 0.0)
aggregated_scores[collocate] = (current * doc_count + score) / (doc_count + 1)
end
doc_count += 1
end
return aggregated_scores, doc_count
end
println("Streaming analysis function defined for large corpora")
Streaming analysis function defined for large corpora
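To run the function, point the pattern at a directory of plain-text files; the call below is a hypothetical example and requires matching files on disk.
# scores, n_docs = stream_analyze("data/articles/*.txt", "learning")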
Corpus Filtering and Sampling
Document Filtering
using TextAssociations
using TextAnalysis: StringDocument, tokens
docs = [
StringDocument("Machine learning algorithms learn from data."),
StringDocument("Deep learning uses neural networks."),
StringDocument("AI includes machine learning.")
]
corpus = Corpus(docs)
function filter_corpus(corpus::Corpus, min_length::Int, max_length::Int)
filtered_docs = StringDocument{String}[] # Specify type
for doc in corpus.documents
doc_length = length(tokens(doc))
if min_length <= doc_length <= max_length
push!(filtered_docs, doc)
end
end
return Corpus(filtered_docs, norm_config=corpus.norm_config)
end
filtered = filter_corpus(corpus, 5, 15)
println("Filtered: $(length(filtered.documents)) documents")
Filtered: 3 documents
Vocabulary Filtering
using TextAssociations
using TextAnalysis: tokens
using OrderedCollections
function filter_vocabulary(corpus::Corpus, min_freq::Int, max_freq_ratio::Float64)
# Count token frequencies
token_counts = Dict{String, Int}()
for doc in corpus.documents
for token in tokens(doc)
token_counts[token] = get(token_counts, token, 0) + 1
end
end
# Filter vocabulary
total_docs = length(corpus.documents)
max_freq = total_docs * max_freq_ratio
filtered_vocab = OrderedDict{String, Int}()
idx = 0
for (token, count) in token_counts
if min_freq <= count <= max_freq
idx += 1
filtered_vocab[token] = idx
end
end
println("Vocabulary filtered: $(length(corpus.vocabulary)) → $(length(filtered_vocab))")
return filtered_vocab
end
filtered_vocab = filter_vocabulary(corpus, 1, 0.8)
OrderedCollections.OrderedDict{String, Int64} with 12 entries:
"data" => 1
"algorithms" => 2
"learn" => 3
"networks" => 4
"includes" => 5
"neural" => 6
"AI" => 7
"machine" => 8
"uses" => 9
"Machine" => 10
"Deep" => 11
"from" => 12
Export and Persistence
Saving Results
using TextAssociations, CSV, Dates
# Analyze and save results
results = analyze_corpus(corpus, "learning", PMI; windowsize=3, minfreq=2)
# Save to CSV
temp_file = tempname() * ".csv"
CSV.write(temp_file, results)
println("Results saved to temporary file")
# Save with metadata
results_with_meta = copy(results)
metadata!(results_with_meta, "corpus_size", length(corpus.documents), style=:note)
metadata!(results_with_meta, "analysis_date", Dates.today(), style=:note)
# Clean up
rm(temp_file)
Results saved to temporary file
Multi-format Export
using TextAssociations, CSV, DataFrames
function export_analysis(analysis::MultiNodeAnalysis, base_path::String)
# Export as CSV
write_results(analysis, base_path * ".csv"; format=:csv)
# Export as JSON
write_results(analysis, base_path * ".json"; format=:json)
# Export summary
summary = DataFrame(
Node = analysis.nodes,
NumCollocates = [nrow(analysis.results[n]) for n in analysis.nodes],
WindowSize = analysis.parameters[:windowsize],
MinFreq = analysis.parameters[:minfreq]
)
CSV.write(base_path * "_summary.csv", summary)
println("Exported to multiple formats")
end
# Example (would create files)
# export_analysis(analysis, "corpus_analysis")
export_analysis (generic function with 1 method)
Performance Optimization
Corpus Size Guidelines
Corpus Size | Recommended Approach | Memory Usage | Processing Time |
---|---|---|---|
< 100 docs | Load all in memory | ~10MB | < 1s |
100-1000 docs | Standard processing | ~100MB | < 10s |
1000-10000 docs | Batch processing | ~500MB | < 1min |
> 10000 docs | Streaming | Constant | Linear |
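The guidelines above can be encoded as a small helper for pipelines that need to pick an approach automatically. The function below is a hypothetical sketch, not part of TextAssociations, and its thresholds are rules of thumb rather than hard limits.
# Hypothetical helper mirroring the guideline table above
function suggest_strategy(n_docs::Int)
    n_docs < 100    && return :load_in_memory   # < 100 docs: load everything
    n_docs <= 1000  && return :standard         # standard processing
    n_docs <= 10000 && return :batched          # batch processing
    return :streaming                           # stream documents one at a time
end
suggest_strategy(2500)  # => :batched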
Optimization Tips
# 1. Pre-filter vocabulary
const MIN_WORD_LENGTH = 2
const MAX_WORD_LENGTH = 20
# 2. Use appropriate data structures
const USE_SPARSE_MATRIX = true # For large vocabularies
# 3. Optimize window sizes
const OPTIMAL_WINDOW = Dict(
:syntactic => 2,
:semantic => 5,
:topical => 10
)
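For example, the window presets above can be passed straight into analyze_corpus. The snippet below assumes a corpus from the earlier sections is already loaded.
# Semantic-scale collocations of "learning" using the preset window size
results = analyze_corpus(corpus, "learning", PMI;
                         windowsize=OPTIMAL_WINDOW[:semantic], minfreq=1)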
Troubleshooting
Common Issues
using TextAssociations
using TextAnalysis: tokens
using Statistics
function diagnose_corpus(corpus::Corpus)
println("Corpus Diagnostics:")
println("="^40)
# Check document distribution
doc_lengths = [length(tokens(doc)) for doc in corpus.documents]
println("Document lengths:")
println(" Min: $(minimum(doc_lengths))")
println(" Max: $(maximum(doc_lengths))")
println(" Mean: $(round(mean(doc_lengths), digits=1))")
# Check vocabulary
println("\nVocabulary:")
println(" Size: $(length(corpus.vocabulary))")
# Check for issues
if minimum(doc_lengths) < 5
println("\n⚠ Warning: Very short documents detected")
end
if maximum(doc_lengths) > 10000
println("\n⚠ Warning: Very long documents may slow processing")
end
if length(corpus.vocabulary) > 100000
println("\n⚠ Warning: Large vocabulary may require more memory")
end
end
diagnose_corpus(corpus)
Corpus Diagnostics:
========================================
Document lengths:
Min: 5
Max: 7
Mean: 6.0
Vocabulary:
Size: 14
Next Steps
- Explore Temporal Analysis for time-based patterns
- Learn about Network Analysis for visualization
- See Performance guide for large-scale processing