Word Embeddings

Word embeddings are dense vector representations of words that capture semantic relationships and contextual meaning. Unlike traditional one-hot encodings, embeddings place semantically similar words close together in a continuous vector space, making them powerful tools for financial text analysis.

Introduction to Word Embeddings

Word embeddings map words from a discrete vocabulary to dense, continuous vectors whose dimensionality (typically a few hundred) is far smaller than the vocabulary-sized one-hot encoding. Words with similar meanings or contexts have similar vector representations, enabling mathematical operations on language.

Key Properties:

  • Semantic similarity: Related words have similar vectors
  • Compositionality: Vector arithmetic captures relationships (e.g., “king” - “man” + “woman” ≈ “queen”; see the sketch after this list)
  • Dimensionality reduction: From sparse one-hot vectors to dense representations
  • Transfer learning: Pre-trained embeddings can be used across tasks
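
The following sketch illustrates these properties using gensim's pre-trained GloVe vectors; the gensim library, the glove-wiki-gigaword-50 model, and the example words are illustrative choices for this sketch, not requirements of the chapter.

```python
# Minimal sketch (assumes gensim is installed and the model can be downloaded).
import gensim.downloader as api

# Load small pre-trained GloVe vectors (~66 MB download on first use).
vectors = api.load("glove-wiki-gigaword-50")

# Semantic similarity: nearest neighbours of a financial term.
print(vectors.most_similar("stock", topn=3))

# Compositionality: "king" - "man" + "woman" lands near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Dense representation: a 50-dimensional vector instead of a vocabulary-sized one-hot.
print(vectors["bank"].shape)  # (50,)
```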

Embedding Methods Pipeline

The following diagram illustrates the hierarchy of embedding methods covered in this chapter, from classical static embeddings through document-level representations to modern contextualized transformers.

flowchart TB
    A[Raw Financial Text] --> B[Preprocessing & Tokenization]
    B --> C{Embedding Method}

    C --> D[Word2Vec]
    C --> E[GloVe]
    C --> F[Doc2Vec]
    C --> G[Transformers]

    D --> D1[Skip-gram]
    D --> D2[CBOW]
    E --> E1[Co-occurrence Matrix]
    F --> F1[PV-DM]
    F --> F2[PV-DBOW]
    G --> G1[BERT / DistilBERT]
    G --> G2[FinBERT]

    D1 --> H[Word Vectors]
    D2 --> H
    E1 --> H
    F1 --> I[Document Vectors]
    F2 --> I
    G1 --> J[Contextualized Embeddings]
    G2 --> J

    H --> K[Downstream Tasks]
    I --> K
    J --> K

    classDef input fill:#1a3a5c,stroke:#0d2137,color:#e8e0d4
    classDef method fill:#2d5016,stroke:#1a3a1a,color:#e8e0d4
    classDef variant fill:#5c3d1a,stroke:#3d2810,color:#e8e0d4
    classDef output fill:#6b2d5b,stroke:#4a1e3f,color:#e8e0d4
    classDef task fill:#8b4513,stroke:#5c2e0d,color:#e8e0d4

    class A,B input
    class C,D,E,F,G method
    class D1,D2,E1,F1,F2,G1,G2 variant
    class H,I,J output
    class K task

Chapter Overview

This chapter covers three families of embedding techniques, progressing from classical to modern approaches:

Word2Vec & GloVe

Static word embedding methods that learn fixed vector representations for each word in the vocabulary. Word2Vec uses local context windows (Skip-gram and CBOW architectures), while GloVe leverages global co-occurrence statistics. These methods form the foundation for understanding how language can be represented numerically.

Word2Vec and GloVe produce a single vector per word regardless of context. The word “bank” gets the same vector whether it refers to a financial institution or a river bank.
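
As a rough sketch of how such static embeddings are trained, the snippet below fits gensim Word2Vec models in both Skip-gram and CBOW modes on a toy corpus; the corpus and hyperparameters are illustrative assumptions, not the chapter's actual setup.

```python
# Minimal sketch, assuming gensim is installed; corpus and settings are illustrative.
from gensim.models import Word2Vec

# Toy corpus of pre-tokenized financial sentences.
corpus = [
    ["the", "central", "bank", "raised", "interest", "rates"],
    ["the", "bank", "reported", "strong", "quarterly", "earnings"],
    ["investors", "sold", "shares", "after", "the", "earnings", "call"],
]

# Skip-gram (sg=1) predicts context words from the target word;
# CBOW (sg=0) predicts the target word from its context.
skipgram = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
cbow = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0, epochs=50)

# Each word gets exactly one static vector, regardless of context.
print(skipgram.wv["bank"].shape)                   # (50,)
print(skipgram.wv.most_similar("earnings", topn=2))
print(cbow.wv.similarity("bank", "earnings"))
```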

Doc2Vec & SEC Filings

Document-level embeddings that extend Word2Vec to represent entire documents as vectors. Doc2Vec (Paragraph Vectors) learns direct document representations via PV-DM and PV-DBOW architectures. This section also covers SEC filing analysis using the SECFilingAnalyzer class, including filing comparison and language change detection over time.
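
The SECFilingAnalyzer workflow is detailed in that section; as a hedged sketch of the underlying mechanics, gensim's Doc2Vec can learn and compare document vectors roughly as follows (the tags, toy documents, and parameters are illustrative assumptions).

```python
# Rough sketch of Doc2Vec with gensim; documents and parameters are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document is tagged so the model learns one vector per document.
docs = [
    TaggedDocument(words=["revenue", "increased", "due", "to", "higher", "demand"],
                   tags=["filing_2022"]),
    TaggedDocument(words=["revenue", "declined", "amid", "supply", "chain", "disruptions"],
                   tags=["filing_2023"]),
]

# dm=1 selects PV-DM (distributed memory); dm=0 would select PV-DBOW.
model = Doc2Vec(docs, vector_size=50, window=3, min_count=1, dm=1, epochs=100)

# Compare the two filings' learned vectors to quantify language change over time.
print(model.dv.similarity("filing_2022", "filing_2023"))

# Infer a vector for a new, unseen document.
new_vec = model.infer_vector(["revenue", "grew", "on", "strong", "demand"])
print(new_vec.shape)  # (50,)
```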

Transformer Embeddings

Contextualized embeddings from transformer models like BERT and FinBERT. Unlike static embeddings, transformers produce different representations for the same word depending on its surrounding context. This section covers semantic search, financial document clustering, and practical applications including news similarity detection, topic-based portfolio selection, and earnings call sentiment tracking.
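
A minimal sketch of this contextual behaviour, using Hugging Face transformers with DistilBERT (an illustrative model choice; the chapter's FinBERT examples follow the same pattern), is shown below: the same word “bank” receives different vectors in different sentences.

```python
# Sketch assuming the transformers and torch packages; model choice is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextualized embedding of the token 'bank' in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_finance = bank_vector("The bank raised its lending rates.")
v_river = bank_vector("They walked along the river bank.")

# Unlike static embeddings, the same word gets different vectors in different contexts.
print(torch.cosine_similarity(v_finance, v_river, dim=0).item())
```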

Summary

Word embeddings provide powerful representations for financial text analysis:

  • Word2Vec: Learn embeddings from local context (Skip-gram, CBOW)
  • GloVe: Global co-occurrence statistics
  • Doc2Vec: Direct document embeddings
  • BERT/FinBERT: Contextualized embeddings for state-of-the-art performance

Applications include:

  • SEC filing analysis and change detection
  • News similarity and deduplication
  • Theme-based portfolio selection
  • Sentiment tracking across documents

Notebook: Run the examples interactively in nlp_trading.ipynb

Related Chapters

  • Part 13: NLP for Trading – NLP fundamentals cover the tokenization and preprocessing pipeline that feeds into embedding models
  • Part 14: Topic Modeling – Topic modeling provides an alternative unsupervised approach to capturing document-level meaning
  • Part 22: AI-Assisted Trading – AI-assisted trading uses transformer embeddings and LLMs for advanced sentiment and signal generation
  • Part 16: Deep Learning – Deep learning architectures underpin modern embedding methods like BERT and FinBERT

Source Code

Browse the implementation: puffin/nlp/
