# Bag-of-Words & TF-IDF
Once text has been preprocessed with `NLPPipeline`, the next step is converting it into numerical features that machine learning models can consume. The two foundational approaches are Bag-of-Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF). Puffin provides `build_bow`, `build_tfidf`, and the more flexible `DocumentTermMatrix` class for this purpose.
## Bag-of-Words
Bag-of-Words represents each document as a vector of term counts. It ignores word order and grammar, capturing only which words appear and how often.
```python
from puffin.nlp import build_bow

documents = [
    "Stock prices increased on strong earnings.",
    "Earnings beat expectations, stock rallied.",
    "Market volatility increased after announcement.",
]

# Build BOW matrix
matrix, features = build_bow(documents, max_features=100)

print(f"Matrix shape: {matrix.shape}")
# Matrix shape: (3, 13)  # 3 documents, one column per unique term

print(f"Features: {features[:10]}")
# Features: ['after', 'announcement', 'beat', 'earnings', 'expectations', ...]
```
The `max_features` parameter limits the vocabulary to the N most frequent terms. This prevents the feature matrix from becoming unwieldy when processing large corpora. Start with a reasonable limit (500-5000) and tune based on model performance.
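One practical way to tune the cap is a quick sweep: rebuild the features at several vocabulary sizes and compare cross-validated scores. A minimal sketch, where `train_documents` and `labels` are hypothetical stand-ins for your labeled data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

from puffin.nlp import build_bow

# Sweep vocabulary sizes and compare 5-fold cross-validated accuracy
for cap in (500, 1000, 2000, 5000):
    matrix, _ = build_bow(train_documents, max_features=cap)
    score = cross_val_score(LogisticRegression(max_iter=1000), matrix, labels, cv=5).mean()
    print(f"max_features={cap}: mean CV accuracy {score:.3f}")
```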
### When to Use BOW
BOW works well when:
- You need a simple, interpretable baseline
- Document length is relatively consistent
- You have a large training set relative to vocabulary size
- Word frequency is the primary signal (not word importance)
BOW struggles when:
- Rare but important terms get drowned out by common words
- You need to distinguish between documents that share many common terms but differ in key words
## TF-IDF with N-grams
TF-IDF addresses the main weakness of BOW by weighting terms by their importance. Terms that appear frequently in a single document but rarely across the corpus receive higher weights, while terms that appear everywhere (like “the” or “company”) get down-weighted.
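For reference, the standard smoothed formulation (the variant scikit-learn uses; whether `build_tfidf` matches the smoothing exactly is an implementation detail) weights term $t$ in document $d$ as:

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \left( \ln \frac{1 + N}{1 + \text{df}(t)} + 1 \right)
$$

where $\text{tf}(t, d)$ is the count of $t$ in $d$, $N$ is the total number of documents, and $\text{df}(t)$ is the number of documents containing $t$. Each row is then typically L2-normalized so that document length does not dominate the weights.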
```python
from puffin.nlp import build_tfidf

# Build TF-IDF with unigrams and bigrams
matrix, features = build_tfidf(
    documents,
    max_features=100,
    ngram_range=(1, 2),  # Include both single words and word pairs
)

print(features[:15])
# ['after', 'announcement', 'beat', 'beat expectations',
#  'earnings', 'earnings beat', 'increased', 'market', ...]
```
### N-gram Ranges
The `ngram_range` parameter controls which word combinations to include:
| Range | Description | Example Features |
|---|---|---|
| `(1, 1)` | Unigrams only | “earnings”, “beat”, “strong” |
| `(1, 2)` | Unigrams + bigrams | “earnings”, “earnings beat”, “strong earnings” |
| `(1, 3)` | Up to trigrams | “earnings”, “earnings beat”, “earnings beat expectations” |
Bigrams capture valuable two-word phrases like “earnings beat,” “price target,” and “revenue growth” that carry more specific meaning than individual words. However, higher n-grams dramatically increase the feature space. For financial text, `(1, 2)` is usually the best trade-off.
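You can see this growth directly by counting the vocabulary each range produces. A quick sketch on the three sample documents above (the counts shown are illustrative and assume no stop-word filtering; exact numbers depend on how `build_tfidf` tokenizes):

```python
from puffin.nlp import build_tfidf

# Compare vocabulary size across n-gram ranges; the cap is set high
# enough that it never binds on this tiny corpus
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    _, features = build_tfidf(documents, max_features=10_000, ngram_range=ngram_range)
    print(ngram_range, len(features))
# (1, 1) 13
# (1, 2) 26
# (1, 3) 36
```

On a real corpus of thousands of articles the blow-up is far steeper, which is why `(1, 2)` paired with a `max_features` cap is the usual compromise.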
## DocumentTermMatrix Class

For more control over vectorization, use the `DocumentTermMatrix` class. It provides a scikit-learn-compatible interface with fit/transform semantics, making it easy to integrate into ML pipelines.
```python
from puffin.nlp import DocumentTermMatrix

# Initialize with TF-IDF
dtm = DocumentTermMatrix(
    method="tfidf",
    max_features=1000,
    ngram_range=(1, 2),
    min_df=2,    # Term must appear in at least 2 docs
    max_df=0.8,  # Term can't appear in more than 80% of docs
)

# Fit and transform
matrix = dtm.fit_transform(documents)

# Transform new documents with the already-fitted vocabulary
new_docs = ["Stock volatility remains elevated."]
new_matrix = dtm.transform(new_docs)

# Get top terms across all documents
top_terms = dtm.get_top_terms(n=10)
for term, weight in top_terms:
    print(f"{term}: {weight:.3f}")

# Get top terms for a specific document
doc_terms = dtm.get_top_terms(n=5, doc_idx=0)
```
### Key Parameters
| Parameter | Description | Recommended Value |
|---|---|---|
| `method` | Vectorization method (`"bow"` or `"tfidf"`) | `"tfidf"` for most tasks |
| `max_features` | Maximum vocabulary size | 1000-5000 |
| `ngram_range` | Tuple of `(min_n, max_n)` for n-grams | `(1, 2)` |
| `min_df` | Minimum document frequency (int or float) | 2-5 or 0.01 |
| `max_df` | Maximum document frequency (float) | 0.8-0.95 |
The `min_df` and `max_df` parameters are critical for filtering noise. Setting `min_df=2` removes terms that appear in only one document (likely typos or irrelevant), while `max_df=0.8` removes terms that appear in more than 80% of documents (too common to be informative).
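Both parameters accept either an absolute document count (an int) or a proportion of the corpus (a float), following scikit-learn's vectorizer convention. A short illustration, assuming `DocumentTermMatrix` forwards these values unchanged:

```python
from puffin.nlp import DocumentTermMatrix

# min_df as an absolute count, max_df as a proportion (mixing is fine)
dtm_counts = DocumentTermMatrix(method="tfidf", min_df=2, max_df=0.8)

# Both as proportions: keep terms in >= 1% and <= 95% of documents
dtm_props = DocumentTermMatrix(method="tfidf", min_df=0.01, max_df=0.95)
```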
### Fit vs Transform

The `DocumentTermMatrix` follows scikit-learn conventions:
- `fit_transform(docs)`: Learn the vocabulary from the training documents and transform them in one step
- `transform(docs)`: Transform new documents using the already-learned vocabulary
- `get_top_terms(n, doc_idx)`: Inspect the most important terms overall or for a specific document
```python
# Training phase
train_matrix = dtm.fit_transform(train_documents)

# Production phase -- use the same vocabulary
new_matrix = dtm.transform(incoming_articles)
```
Always call `fit_transform` on your training set and `transform` on new data. Calling `fit_transform` on test or production data would create a different vocabulary, making the features incompatible with your trained model.
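In a deployed system this means persisting the fitted vectorizer alongside the model, so the serving path reuses the training vocabulary. A minimal sketch with joblib, assuming `DocumentTermMatrix` is picklable like its scikit-learn counterparts:

```python
import joblib

# At training time: save the fitted vectorizer next to the model
joblib.dump(dtm, "dtm.joblib")

# At serving time: reload and transform with the original vocabulary
dtm = joblib.load("dtm.joblib")
new_matrix = dtm.transform(incoming_articles)
```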
## Choosing Between BOW and TF-IDF
For most financial NLP tasks, TF-IDF is the better default:
| Criterion | BOW | TF-IDF |
|---|---|---|
| Simplicity | Simpler | Slightly more complex |
| Term weighting | Equal weight | Weighted by importance |
| Rare term handling | Under-weighted | Properly elevated |
| Common term handling | Over-weighted | Properly suppressed |
| Financial text | Baseline only | Recommended default |
In practice, TF-IDF with `(1, 2)` n-grams and sensible `min_df`/`max_df` filtering provides a strong baseline for financial text classification and sentiment scoring.
## Integration with Downstream Models

The feature matrices produced by these vectorizers feed directly into classifiers like `NewsClassifier` or custom scikit-learn models:
```python
from sklearn.linear_model import LogisticRegression

from puffin.nlp import DocumentTermMatrix

# Vectorize
dtm = DocumentTermMatrix(method="tfidf", max_features=2000, ngram_range=(1, 2))
X_train = dtm.fit_transform(train_texts)
X_test = dtm.transform(test_texts)

# Use with any scikit-learn classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)
predictions = model.predict(X_test)
```
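From here, standard scikit-learn evaluation applies directly (`test_labels` is assumed to exist alongside `test_texts`):

```python
from sklearn.metrics import accuracy_score, classification_report

# Score the TF-IDF + logistic regression baseline on the held-out set
print(f"Accuracy: {accuracy_score(test_labels, predictions):.3f}")
print(classification_report(test_labels, predictions))
```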
## Source Code

Browse the implementation: `puffin/nlp/`
## Next Steps
Continue to Sentiment Classification to learn how to classify news as bullish, bearish, or neutral, and how to build lexicon-based sentiment scoring for financial text.