NER-Powered Semantic Search Engine

Building a semantic search system enhanced with Named Entity Recognition using Pinecone, BERT-NER, and sentence transformers on 50,000 Medium articles.

March 11, 2023 · 3 min read · By Kshitiz Regmi

Standard semantic search is powerful — but it struggles when users query for specific entities like people, organizations, or places. NER-powered semantic search solves this by combining dense vector retrieval with named entity filtering, ensuring results not only feel relevant but actually mention the queried entities.

This tutorial demonstrates how to build such a system using 50,000 Medium articles, Pinecone, and two Hugging Face models.

System Architecture

The system has three core components:

  1. Named Entity Recognition: dslim/bert-base-NER (identifies PER, ORG, LOC, and MISC entities)
  2. Sentence Embeddings: multi-qa-MiniLM-L6-cos-v1 (384-dim dense vectors optimized for semantic search)
  3. Pinecone: vector database storing embeddings + entity metadata for filtered retrieval

Step 1: Setup

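# Shell: pinecone.init() below belongs to the 2.x client, so pin the version
pip install "pinecone-client<3" sentence-transformers transformers

# Python
import pinecone
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Load models
encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Init Pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="us-east1-gcp")

# Create the index on first run; its dimension must match the encoder (384)
if "ner-semantic-search" not in pinecone.list_indexes():
    pinecone.create_index("ner-semantic-search", dimension=384, metric="cosine")
index = pinecone.Index("ner-semantic-search")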

Step 2: Entity Extraction

dslim/bert-base-NER is a BERT-base model fine-tuned on CoNLL-2003. It recognizes four entity types (PER, ORG, LOC, MISC), which the function below groups into one list per type:

def extract_entities(text: str) -> dict:
    entities = ner_pipeline(text)
    return {
        "persons":       [e["word"] for e in entities if e["entity_group"] == "PER"],
        "organizations": [e["word"] for e in entities if e["entity_group"] == "ORG"],
        "locations":     [e["word"] for e in entities if e["entity_group"] == "LOC"],
        "misc":          [e["word"] for e in entities if e["entity_group"] == "MISC"],
    }

# Example
extract_entities("Elon Musk announced SpaceX's Starship launch from Texas.")
# {
#   "persons": ["Elon Musk"],
#   "organizations": ["SpaceX"],
#   "locations": ["Texas"],
#   "misc": ["Starship"]
# }
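
In practice the raw pipeline output can repeat the same entity several times per article, and aggregated spans occasionally come back as leftover subword pieces starting with "##". A minimal sketch of a cleanup step, assuming entity dicts in the shape the pipeline returns; the helper name `group_entities`, the `min_score` threshold, and the sample data are illustrative and not part of the original code:

```python
# Maps the model's CoNLL-2003 labels to the metadata keys used in this post.
LABEL_TO_KEY = {
    "PER": "persons",
    "ORG": "organizations",
    "LOC": "locations",
    "MISC": "misc",
}


def group_entities(raw_entities, min_score=0.0):
    """Group pipeline-style entity dicts by type, dropping duplicates and fragments."""
    grouped = {key: [] for key in LABEL_TO_KEY.values()}
    for ent in raw_entities:
        key = LABEL_TO_KEY.get(ent["entity_group"])
        word = ent["word"].strip()
        # Skip unknown labels, low-confidence spans, and leftover subword pieces.
        if key is None or ent.get("score", 1.0) < min_score or word.startswith("##"):
            continue
        if word and word not in grouped[key]:
            grouped[key].append(word)  # keeps first-seen order
    return grouped


# Hand-written sample in the pipeline's output shape (not real model output).
sample = [
    {"entity_group": "PER", "word": "Elon Musk", "score": 0.99},
    {"entity_group": "ORG", "word": "SpaceX", "score": 0.98},
    {"entity_group": "PER", "word": "Elon Musk", "score": 0.97},
    {"entity_group": "ORG", "word": "##X", "score": 0.51},
    {"entity_group": "LOC", "word": "Texas", "score": 0.40},
]
result = group_entities(sample, min_score=0.5)
print(result)
```

Deduplicating before indexing keeps the metadata small, and a score threshold trades recall for fewer spurious filter values. The right cutoff depends on the corpus.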

Step 3: Indexing Documents

For each article, we generate an embedding and store the extracted entities as Pinecone metadata:

def index_document(doc_id: str, text: str, metadata: dict):
    entities = extract_entities(text)
    embedding = encoder.encode(text).tolist()
    
    pinecone_metadata = {
        **metadata,          # title, url, publication_date, etc.
        **entities,          # persons, organizations, locations, misc
    }
    
    index.upsert([(doc_id, embedding, pinecone_metadata)])

# Index all 50k articles
for article in articles:
    index_document(
        doc_id=article["id"],
        text=article["content"],
        metadata={"title": article["title"], "url": article["url"]}
    )
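
Upserting one vector per request means 50,000 network round trips. A common alternative is to send records in batches. The sketch below shows only the batching logic, with a stand-in for `index.upsert` so it runs without a Pinecone account; `batched`, `upsert_in_batches`, and the batch size of 100 are assumptions, not part of the original pipeline:

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upsert_in_batches(records, upsert_fn, batch_size=100):
    """Pass (id, vector, metadata) tuples to upsert_fn one batch at a time.

    Returns the number of upsert calls made.
    """
    calls = 0
    for batch in batched(records, batch_size):
        upsert_fn(batch)
        calls += 1
    return calls


# Stand-in for index.upsert: collect the batches in a list instead.
received = []
fake_records = [(f"doc-{i}", [0.0] * 384, {"title": f"Article {i}"}) for i in range(250)]
num_calls = upsert_in_batches(fake_records, received.append, batch_size=100)
print(num_calls, [len(batch) for batch in received])
```

With a real index, `upsert_fn` would be `index.upsert`. The embeddings can be batched the same way, since `encoder.encode` also accepts a list of texts.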

Step 4: NER-Enhanced Query

At query time, extract entities from the user's query and use them as Pinecone metadata filters:

def search(query: str, top_k: int = 5) -> list:
    query_entities = extract_entities(query)
    query_embedding = encoder.encode(query).tolist()
    
    # Build metadata filter from query entities
    filter_dict = {}
    if query_entities["persons"]:
        filter_dict["persons"] = {"$in": query_entities["persons"]}
    if query_entities["organizations"]:
        filter_dict["organizations"] = {"$in": query_entities["organizations"]}
    if query_entities["locations"]:
        filter_dict["locations"] = {"$in": query_entities["locations"]}
    
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        filter=filter_dict if filter_dict else None,
        include_metadata=True
    )
    return results["matches"]
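
A strict entity filter can return fewer than top_k results, or none at all, for example when the NER model extracted a slightly different surface form at indexing time. One possible mitigation is to top up a short result list with unfiltered matches. The following is a sketch under stated assumptions: `build_entity_filter`, `search_with_fallback`, and the `query_fn` callable are hypothetical names, and `fake_query` only imitates a filtered index lookup so the example runs offline:

```python
FILTER_KEYS = ("persons", "organizations", "locations")


def build_entity_filter(query_entities):
    """Return a Pinecone-style metadata filter, or None if no entities were found."""
    clauses = {
        key: {"$in": query_entities[key]}
        for key in FILTER_KEYS
        if query_entities.get(key)
    }
    return clauses or None


def search_with_fallback(query_fn, query_embedding, query_entities, top_k=5):
    """Run the filtered query first; top up with unfiltered matches if it is short."""
    entity_filter = build_entity_filter(query_entities)
    matches = list(query_fn(vector=query_embedding, top_k=top_k, filter=entity_filter))
    if entity_filter is not None and len(matches) < top_k:
        seen = {m["id"] for m in matches}
        extra = query_fn(vector=query_embedding, top_k=top_k, filter=None)
        matches += [m for m in extra if m["id"] not in seen][: top_k - len(matches)]
    return matches


# Stand-in for the index: three documents, already in similarity order.
_DOCS = [
    {"id": "a", "metadata": {"persons": []}},
    {"id": "b", "metadata": {"persons": ["Elon Musk"]}},
    {"id": "c", "metadata": {"persons": []}},
]


def fake_query(vector, top_k, filter):
    """Imitate a filtered lookup: every filter key must share a value with the doc."""
    def keep(doc):
        if not filter:
            return True
        return all(
            set(doc["metadata"].get(key, [])) & set(cond["$in"])
            for key, cond in filter.items()
        )
    return [d for d in _DOCS if keep(d)][:top_k]


query_entities = {"persons": ["Elon Musk"], "organizations": [], "locations": []}
hits = search_with_fallback(fake_query, [0.0] * 384, query_entities, top_k=2)
print([h["id"] for h in hits])
```

Entity-matched documents always rank ahead of the fallback results, so the filter's guarantee holds for the top of the list. Whether to fall back at all is a product decision: returning fewer results may be preferable to returning documents that do not mention the entity.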

Why NER Improves Retrieval

Consider the query: "Google's approach to large language models"

| Query | Standard Semantic Search | NER-Enhanced Search |
| --- | --- | --- |
| "Google's approach to large language models" | Tech/AI articles (semantically similar) | Articles explicitly mentioning Google |
| "Elon Musk's companies" | Entrepreneur/startup articles | Articles featuring Elon Musk by name |
| "Events in Paris" | Event articles | Articles about Paris specifically |

Pure semantic search finds documents with similar meaning — but if you name a specific entity, you want documents that actually reference that entity, not just articles about the same topic.
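
The difference can be reproduced with a toy example. The sketch below ranks three made-up documents by cosine similarity, first without and then with an entity filter. The 3-dimensional vectors and document ids are invented for illustration and stand in for real 384-dimensional embeddings:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy corpus: the first two documents are about LLMs, the third is not.
docs = [
    {"id": "openai-llms", "vector": [0.9, 0.1, 0.0], "organizations": ["OpenAI"]},
    {"id": "google-llms", "vector": [0.8, 0.2, 0.1], "organizations": ["Google"]},
    {"id": "google-maps", "vector": [0.1, 0.9, 0.2], "organizations": ["Google"]},
]
query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded query about LLMs


def rank(docs, query_vector, required_org=None):
    """Rank documents by similarity, optionally keeping only those naming required_org."""
    candidates = [
        d for d in docs
        if required_org is None or required_org in d["organizations"]
    ]
    candidates.sort(key=lambda d: cosine(d["vector"], query_vector), reverse=True)
    return [d["id"] for d in candidates]


semantic_only = rank(docs, query_vector)
ner_filtered = rank(docs, query_vector, required_org="Google")
print(semantic_only)
print(ner_filtered)
```

Without the filter, the article that never mentions Google ranks first because it is closest in topic. With the filter, only the Google articles remain, and similarity still decides their order.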

Key Models

dslim/bert-base-NER

  • BERT-base fine-tuned on CoNLL-2003
  • 4 entity types: PER, ORG, LOC, MISC
  • Fast inference, production-ready

multi-qa-MiniLM-L6-cos-v1

  • Optimized for semantic search tasks
  • 384-dimensional embeddings
  • 6-layer, efficient — ~50ms per batch on CPU

Results

On the 50,000 Medium article corpus, NER-enhanced search improved result relevance for entity-specific queries. Because the entity filter is a hard constraint, a query containing named entities returns only documents whose extracted entities include the queried ones, which removes the "semantically relevant but factually unrelated" results common in pure vector search.

Extending This

  • Hybrid search (BM25 + dense): add keyword matching for exact terms alongside vector similarity
  • Entity linking: normalize variants such as "Elon", "Musk", and "Elon Musk" to the same entity so the exact-match filter does not miss them
  • Re-ranking: apply a cross-encoder on top-k candidates for a final relevance score
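
As a starting point for the entity linking idea, a hand-maintained alias table can map surface forms to one canonical name before indexing and before building the query filter. This is a minimal sketch, not a full entity linker; the `ALIASES` table and the `canonicalize` helper are hypothetical:

```python
# Lowercased surface form -> canonical name. Maintained by hand in this sketch.
ALIASES = {
    "elon": "Elon Musk",
    "musk": "Elon Musk",
    "elon musk": "Elon Musk",
}


def canonicalize(names, aliases=ALIASES):
    """Map each name to its canonical form and drop duplicates, keeping order."""
    canonical = []
    for name in names:
        cleaned = name.strip()
        resolved = aliases.get(cleaned.lower(), cleaned)
        if resolved not in canonical:
            canonical.append(resolved)
    return canonical


linked = canonicalize(["Elon", "Musk", "Elon Musk", "Jeff Bezos"])
print(linked)
```

Applying the same function to both the stored metadata and the query entities keeps the two sides consistent. A lookup table ignores context, so an ambiguous first name could be mapped to the wrong person; a real entity linker would resolve that from the surrounding text.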