NLP Pipeline & Tokenization
Text preprocessing is the foundation of any NLP-for-trading system. Raw news articles, earnings transcripts, and SEC filings must be cleaned, tokenized, and annotated before they can produce useful trading signals. The NLPPipeline class handles all of this using spaCy, with automatic fallback if spaCy is unavailable.
Basic Usage
from puffin.nlp import NLPPipeline
# Initialize pipeline
pipeline = NLPPipeline()
# Process text
text = "Apple Inc. reported Q3 earnings of $19.4B, up 8% YoY."
doc = pipeline.process(text)
# Access linguistic features
print(doc.tokens)
# ['Apple', 'Inc.', 'reported', 'Q3', 'earnings', ...]
print(doc.lemmas)
# ['apple', 'inc.', 'report', 'q3', 'earnings', ...]
print(doc.entities)
# [('Apple Inc.', 'ORG'), ('Q3', 'DATE'), ('$19.4B', 'MONEY'), ('8%', 'PERCENT')]
print(doc.sentences)
# ['Apple Inc. reported Q3 earnings of $19.4B, up 8% YoY.']
The process() method returns a document object with tokens, lemmas, entities, and sentences attributes. Lemmatization reduces words to their base form (e.g., “reported” becomes “report”), which helps group related terms together for downstream analysis.
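The automatic spaCy fallback mentioned above can follow the standard optional-dependency pattern. This is an illustrative sketch only — the regex tokenizer and model name here are assumptions, not the library's actual implementation:

```python
import re

def simple_tokenize(text):
    """Naive regex fallback: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

try:
    import spacy  # full pipeline when available
    _nlp = spacy.load("en_core_web_sm")

    def tokenize(text):
        return [token.text for token in _nlp(text)]
except Exception:  # ImportError, or the model is not downloaded
    tokenize = simple_tokenize

print(simple_tokenize("Apple reported strong earnings."))
# ['Apple', 'reported', 'strong', 'earnings', '.']
```

The fallback trades linguistic accuracy (no lemmas, no entities) for zero dependencies, so downstream code should check which path is active before relying on entity output.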
NLP Pipeline Flow
flowchart LR
A[Raw Text] --> B[Tokenization]
B --> C[Lemmatization]
C --> D[NER]
D --> E[Financial Terms]
D --> F["Entities<br/>(ORG, MONEY, %)"]
E --> G["Domain Keywords"]
C --> H["Lemmas"]
F --> I[Downstream Models]
G --> I
H --> I
classDef input fill:#1a3a5c,stroke:#0d2137,color:#e8e0d4
classDef process fill:#2d5016,stroke:#1a3a1a,color:#e8e0d4
classDef feature fill:#6b2d5b,stroke:#4a1e3f,color:#e8e0d4
classDef output fill:#8b4513,stroke:#5c2e0d,color:#e8e0d4
class A input
class B,C,D,E process
class F,G,H feature
class I output
Named Entity Recognition for Finance
Named Entity Recognition (NER) identifies structured information in unstructured text. For financial analysis, the most relevant entity types are organizations, monetary values, percentages, and dates.
# Focus on ORG, MONEY, PERCENT, DATE entities
entities = pipeline.extract_entities(text)
for entity_text, entity_label in entities:
    print(f"{entity_label}: {entity_text}")
# ORG: Apple Inc.
# DATE: Q3
# MONEY: $19.4B
# PERCENT: 8%
Why NER Matters for Trading
Entity extraction enables several trading-relevant workflows:
- Company attribution: Link sentiment scores to the correct ticker symbol
- Monetary extraction: Pull out revenue figures, earnings per share, and price targets
- Date extraction: Identify forward-looking dates for guidance and earnings forecasts
- Relationship mapping: Build knowledge graphs of companies mentioned together
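As a minimal sketch of the company-attribution workflow, ORG entities can be resolved against a name-to-ticker map. The ORG_TO_TICKER dictionary below is a hypothetical stand-in; production systems resolve names against a securities master or use fuzzy matching:

```python
# Hypothetical name-to-ticker map for illustration only.
ORG_TO_TICKER = {
    "Apple Inc.": "AAPL",
    "Apple": "AAPL",
    "Microsoft": "MSFT",
}

def attribute_tickers(entities):
    """Map ORG entities to ticker symbols, skipping unknown names."""
    return sorted({
        text for text, label in entities
        if label == "ORG" and text in ORG_TO_TICKER
        for text in [ORG_TO_TICKER[text]]
    })

entities = [("Apple Inc.", "ORG"), ("$19.4B", "MONEY"), ("8%", "PERCENT")]
print(attribute_tickers(entities))  # ['AAPL']
```

Deduplicating through a set matters because a single article often names the same company several times in different surface forms.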
spaCy’s default NER model works well for general entities but may miss financial-specific patterns like ticker symbols (AAPL) or non-standard monetary formats (19.4B). For production use, consider fine-tuning the NER model on financial text.
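Short of fine-tuning, one lightweight mitigation is to augment NER output with regular expressions for cashtags, exchange-prefixed tickers, and abbreviated monetary values. The patterns below are illustrative, not exhaustive:

```python
import re

# Illustrative patterns; real news and filings need more robust handling.
CASHTAG_RE = re.compile(r"\$([A-Z]{1,5})\b")                    # $TSLA
EXCHANGE_RE = re.compile(r"\((?:NYSE|NASDAQ):\s*([A-Z]{1,5})\)")  # (NASDAQ: AAPL)
ABBREV_MONEY_RE = re.compile(r"\$\d+(?:\.\d+)?\s?[KMBT]\b")      # $19.4B

def extract_tickers(text):
    """Collect ticker candidates from cashtags and exchange prefixes."""
    return CASHTAG_RE.findall(text) + EXCHANGE_RE.findall(text)

text = "Apple (NASDAQ: AAPL) and $TSLA rallied; revenue hit $19.4B."
print(extract_tickers(text))               # ['TSLA', 'AAPL']
print(ABBREV_MONEY_RE.findall(text))       # ['$19.4B']
```

Regex candidates should be validated against a known-ticker list, since all-caps words like "CEO" or "YoY" will otherwise produce false positives.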
Extract Financial Terms
The extract_financial_terms method identifies trading and finance keywords that carry signal value. This is distinct from NER – it captures domain-specific vocabulary rather than named entities.
text = "The stock rallied on strong earnings, with analysts raising price targets."
terms = pipeline.extract_financial_terms(text)
print(terms)
# ['stock', 'rallied', 'earnings', 'analysts']
Financial term extraction is useful for:
- Feature engineering: Count domain-relevant terms per document as ML features
- Filtering: Discard articles with no financial terms (off-topic noise)
- Weighting: Upweight documents with high financial term density
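A single density measure can serve both the filtering and weighting use cases above; this sketch assumes token and term lists like those returned by the pipeline, and the threshold value is arbitrary:

```python
def term_density(tokens, terms):
    """Fraction of a document's tokens that are financial terms."""
    return len(terms) / len(tokens) if tokens else 0.0

tokens = ["the", "stock", "rallied", "on", "strong", "earnings"]
terms = ["stock", "rallied", "earnings"]

density = term_density(tokens, terms)
print(density)  # 0.5

# Filtering: drop off-topic documents below an arbitrary threshold
is_relevant = density >= 0.1
```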
Batch Processing
When processing large collections of documents (e.g., a day’s worth of news articles), use batch_process for efficiency. It processes multiple documents in a single pass, leveraging spaCy’s internal batching.
texts = [
    "Tesla shares gained 5% on delivery beat.",
    "Microsoft cloud revenue grew 20% YoY.",
    "Amazon acquired robotics startup.",
]
docs = pipeline.batch_process(texts)
for doc in docs:
    print(f"Tokens: {len(doc.tokens)}, Entities: {len(doc.entities)}")
Processing at Scale
For large-scale processing (thousands of articles), consider chunking your input:
import itertools
def chunked(iterable, size):
    """Split iterable into chunks of given size."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            break
        yield chunk
# Process 10,000 articles in batches of 500
all_docs = []
for batch in chunked(all_articles, 500):
    batch_docs = pipeline.batch_process(batch)
    all_docs.extend(batch_docs)
Preprocessing Best Practices for Financial Text
Financial-Specific Stopwords
Standard NLP stopword lists remove common words, but financial text requires a custom approach:
# Use financial stopwords
financial_stopwords = {
    'inc', 'corp', 'ltd', 'co', 'llc',  # Company suffixes
    'said', 'according', 'reported',    # Common verbs
}
# Keep domain terms that general stopwords remove
keep_words = {
    'up', 'down', 'above', 'below',   # Direction words
    'more', 'less', 'most', 'least',  # Comparison
}
Do not blindly apply general-purpose stopword lists to financial text. Words like “up,” “down,” “above,” and “below” are critical directional signals that standard stopword lists would remove.
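Putting the two sets together: subtract the keep-list from a general-purpose stopword list, then add the financial stopwords. The standard_stopwords set below is a tiny illustrative subset; in practice, start from spaCy's or NLTK's list:

```python
# Tiny illustrative subset of a general-purpose stopword list.
standard_stopwords = {"the", "a", "on", "up", "down", "said"}

financial_stopwords = {"inc", "corp", "ltd", "said", "reported"}
keep_words = {"up", "down", "above", "below"}

# Remove the keep-list first, then add the domain-specific stopwords.
effective_stopwords = (standard_stopwords - keep_words) | financial_stopwords

lemmas = ["apple", "inc", "stock", "up", "on", "strong", "earnings", "said"]
filtered = [w for w in lemmas if w not in effective_stopwords]
print(filtered)  # ['apple', 'stock', 'up', 'strong', 'earnings']
```

Note that "up" survives filtering even though it appears in the general list — exactly the directional signal the warning above is about.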
Entity-Specific Analysis
Analyze sentiment per company by combining NER with downstream scoring:
import numpy as np
def entity_sentiment(articles, pipeline, sentiment):
    """Average sentiment score per ORG entity across a set of articles."""
    entity_scores = {}
    for article in articles:
        # Extract organizations
        entities = pipeline.extract_entities(article)
        orgs = [e[0] for e in entities if e[1] == 'ORG']
        # Compute sentiment
        score = sentiment.score(article)
        # Attribute to each mentioned org
        for org in orgs:
            entity_scores.setdefault(org, []).append(score)
    # Average per entity
    return {
        org: np.mean(scores)
        for org, scores in entity_scores.items()
    }
Source Code
Browse the implementation: puffin/nlp/
Next Steps
Continue to Bag-of-Words & TF-IDF to learn how to convert preprocessed text into numerical feature vectors suitable for machine learning.