Word2Vec & GloVe
Static word embedding methods learn fixed vector representations for each word in a vocabulary. Word2Vec uses local context windows while GloVe leverages global co-occurrence statistics. Both produce dense, low-dimensional vectors that capture semantic relationships between words.
Word2Vec
Word2Vec learns word embeddings by training a shallow neural network to predict words from their context (CBOW) or context from words (Skip-gram).
Skip-Gram Model
The Skip-gram model predicts context words given a target word. It represents rare words well and performs strongly even on smaller training corpora.
Training objective: Maximize probability of context words given target word:
\[P(w_{context} | w_{target}) = \frac{\exp(v_{context}^T v_{target})}{\sum_{w \in V} \exp(v_w^T v_{target})}\]
Skip-gram tends to outperform CBOW on smaller datasets and with rare words, making it a good default choice for financial corpora where domain-specific terms may appear infrequently.
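The softmax above can be made concrete with a small numerical sketch. The toy vocabulary, random vectors, and the skipgram_probability helper below are purely illustrative; real Word2Vec keeps separate input and output embeddings and replaces the full softmax with negative sampling or hierarchical softmax.
import numpy as np
# Illustrative only: toy embedding matrix with one row per vocabulary word
rng = np.random.default_rng(0)
vocab = ['market', 'volatility', 'rose', 'sharply', 'today']
V = {w: i for i, w in enumerate(vocab)}
embeddings = rng.normal(size=(len(vocab), 8))   # 8-dimensional toy vectors
def skipgram_probability(context, target):
    """P(context | target): softmax over dot products with every vocabulary word."""
    v_target = embeddings[V[target]]
    scores = embeddings @ v_target              # dot product with each word vector
    probs = np.exp(scores - scores.max())       # subtract max for numerical stability
    probs /= probs.sum()
    return probs[V[context]]
print(skipgram_probability('volatility', 'market'))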
CBOW Model
Continuous Bag of Words (CBOW) predicts a target word from its context words. It is faster to train and works well for frequent words.
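For contrast, the sketch below trains a CBOW model directly with Gensim rather than through the Word2VecTrainer wrapper used in the next section. The toy sentences are hypothetical, and the keyword names assume Gensim 4.x; the only switch between the two architectures is the sg flag.
from gensim.models import Word2Vec
# Hypothetical toy corpus; in practice, pass the tokenized documents built below
sentences = [
    ['quarterly', 'earnings', 'beat', 'expectations'],
    ['earnings', 'missed', 'expectations', 'sharply'],
]
# sg=0 selects CBOW; sg=1 would select Skip-gram
cbow_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0, epochs=20)
print(cbow_model.wv['earnings'].shape)  # (50,)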
Training Word2Vec with Gensim
The Word2VecTrainer class wraps Gensim’s Word2Vec implementation with convenience methods for financial text analysis.
from puffin.nlp.embeddings import Word2VecTrainer
import pandas as pd
# Load and preprocess financial news
data = pd.read_csv('financial_news.csv')
# Tokenize documents
documents = [text.lower().split() for text in data['text']]
# Train Word2Vec
trainer = Word2VecTrainer()
model = trainer.train(
    documents,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window size
    min_count=5,      # Ignore rare words
    sg=1,             # 1=Skip-gram, 0=CBOW
    workers=4,        # Parallel processing
    epochs=10
)
# Get word vector
market_vec = trainer.word_vector('market')
print(f"Vector shape: {market_vec.shape}")
# Find similar words
similar = trainer.similar_words('volatility', topn=10)
for word, similarity in similar:
print(f"{word}: {similarity:.3f}")
# Output:
# uncertainty: 0.852
# fluctuation: 0.831
# instability: 0.809
# ...
Document Embeddings
Average word vectors to create document-level embeddings:
# Get document vector
doc = ['market', 'volatility', 'increased', 'significantly']
doc_vec = trainer.document_vector(doc)
# Compare documents
doc1 = ['bull', 'market', 'rising', 'prices']
doc2 = ['bear', 'market', 'falling', 'prices']
vec1 = trainer.document_vector(doc1)
vec2 = trainer.document_vector(doc2)
from scipy.spatial.distance import cosine
similarity = 1 - cosine(vec1, vec2)
print(f"Document similarity: {similarity:.3f}")
Averaging word vectors to produce document embeddings is a simple baseline but can lose important information about word order and emphasis. For more robust document representations, see the Doc2Vec section.
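One plausible way such averaging could be implemented over a Gensim KeyedVectors object is sketched below; average_document_vector is a hypothetical helper for illustration, not the puffin implementation.
import numpy as np
def average_document_vector(tokens, keyed_vectors):
    """Mean of the in-vocabulary word vectors; out-of-vocabulary tokens are skipped."""
    vectors = [keyed_vectors[w] for w in tokens if w in keyed_vectors]
    if not vectors:
        return np.zeros(keyed_vectors.vector_size)
    return np.mean(vectors, axis=0)
# e.g. average_document_vector(doc, model.wv), assuming the trained model exposes
# Gensim's .wv attribute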
Word Analogies
Solve analogies using vector arithmetic:
# king - man + woman = queen
# bull market - bull + bear = bear market
result = trainer.analogy(
    positive=['market', 'bear'],   # bull market - bull + bear
    negative=['bull'],
    topn=5
)
print(result)
# [('declining', 0.632), ...]
Word analogies demonstrate that embeddings capture meaningful linear relationships. In financial contexts, this enables discovery of related concepts (e.g., finding that “inflation” relates to “rates” similarly to how “growth” relates to “earnings”).
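As a concrete, hypothetical query for the relationship mentioned above, one could ask which word relates to "growth" as "rates" relates to "inflation"; the result depends entirely on the training corpus.
# "inflation is to rates as growth is to ?"  =>  rates - inflation + growth
candidates = trainer.analogy(
    positive=['rates', 'growth'],
    negative=['inflation'],
    topn=5
)
print(candidates)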
GloVe: Global Vectors
GloVe learns embeddings by fitting word and context vectors to the logarithm of a global word co-occurrence matrix (a form of weighted matrix factorization), capturing corpus-wide statistics rather than relying on local context windows alone.
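Concretely, GloVe minimizes a weighted least-squares objective over all co-occurring word pairs:
\[J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2\]
where \(X_{ij}\) counts how often word \(j\) appears in the context of word \(i\), \(w_i\) and \(\tilde{w}_j\) are the word and context vectors, \(b_i\) and \(\tilde{b}_j\) are bias terms, and \(f\) is a weighting function that caps the influence of very frequent pairs and excludes pairs that never co-occur.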
Loading Pre-trained GloVe
The GloVeLoader class provides methods for loading pre-trained GloVe vectors and using them for financial text analysis.
from puffin.nlp.embeddings import GloVeLoader
# Download GloVe from: https://nlp.stanford.edu/projects/glove/
loader = GloVeLoader()
loader.load('glove.6B.100d.txt')
# Get word vector
vec = loader.word_vector('stock')
# Find similar words
similar = loader.similar_words('earnings', topn=5)
# Document embedding
doc = ['quarterly', 'earnings', 'exceeded', 'expectations']
doc_vec = loader.document_vector(doc)
Pre-trained GloVe vectors are available in several sizes: 50d, 100d, 200d, and 300d. For financial applications, 100d typically offers a good balance between quality and computational cost, while 300d gives higher-quality embeddings at greater memory and compute cost.
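If the GloVeLoader wrapper is not available, the same files can also be read with Gensim's KeyedVectors. This sketch assumes Gensim 4.x, where load_word2vec_format accepts no_header=True for GloVe's headerless text format.
from gensim.models import KeyedVectors
# GloVe text files lack the word2vec header line, hence no_header=True (Gensim >= 4.0)
glove = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)
vec = glove['stock']                          # 100-dimensional vector
print(glove.most_similar('earnings', topn=5))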
GloVe vs Word2Vec
Understanding when to choose each method is important for building effective NLP pipelines.
GloVe advantages:
- Captures global co-occurrence statistics
- More efficient on large corpora
- Better performance on word analogy tasks
Word2Vec advantages:
- Better at capturing local context
- Online learning (can update with new data)
- Better for rare words (Skip-gram)
| Criterion | Word2Vec | GloVe |
|---|---|---|
| Training data | Local context windows | Global co-occurrence matrix |
| Incremental updates | Yes (online learning) | No (requires retraining) |
| Rare words | Better (Skip-gram) | Weaker |
| Training speed | Fast | Fast (matrix factorization) |
| Memory usage | Lower | Higher (stores full matrix) |
For financial applications where the corpus changes frequently (e.g., daily news), Word2Vec’s online learning capability is valuable. For static corpora like historical filings, GloVe’s global statistics may produce higher quality embeddings.
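A rough sketch of such an incremental update with Gensim is shown below. It assumes the object returned by trainer.train() is (or exposes) a gensim.models.Word2Vec instance, and todays_news is a hypothetical DataFrame of new articles.
# Tokenize the day's new articles (todays_news is hypothetical)
new_documents = [text.lower().split() for text in todays_news['text']]
# Add any previously unseen vocabulary, then continue training in place
model.build_vocab(new_documents, update=True)
model.train(
    new_documents,
    total_examples=len(new_documents),
    epochs=model.epochs,
)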
Source Code
Browse the implementation: puffin/nlp/