NER-Powered Semantic Search Engine
Building a semantic search system enhanced with Named Entity Recognition using Pinecone, BERT-NER, and sentence transformers on 50,000 Medium articles.
March 11, 2023 · 3 min read · By Kshitiz Regmi
Standard semantic search is powerful — but it struggles when users query for specific entities like people, organizations, or places. NER-powered semantic search solves this by combining dense vector retrieval with named entity filtering, ensuring results not only feel relevant but actually mention the queried entities.
This tutorial demonstrates how to build such a system using 50,000 Medium articles, Pinecone, and two Hugging Face models.
System Architecture
The system has three core components:
- Named Entity Recognition: dslim/bert-base-NER (identifies PER, ORG, LOC, MISC entities)
- Sentence Embeddings: multi-qa-MiniLM-L6-cos-v1 (384-dim dense vectors optimized for semantic search)
- Pinecone: vector database storing embeddings + entity metadata for filtered retrieval
Step 1: Setup
# pinecone.init() used below belongs to the pinecone-client 2.x API
pip install "pinecone-client<3" sentence-transformers transformers
import pinecone
from sentence_transformers import SentenceTransformer
from transformers import pipeline
# Load models
encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
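# Init Pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="us-east1-gcp")
# Assumed one-time setup: create the index if missing
# (384 dims to match the encoder, cosine metric)
if "ner-semantic-search" not in pinecone.list_indexes():
    pinecone.create_index("ner-semantic-search", dimension=384, metric="cosine")
index = pinecone.Index("ner-semantic-search")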
Step 2: Entity Extraction
dslim/bert-base-NER is a BERT model fine-tuned on CoNLL-2003 that recognizes four entity types (PER, ORG, LOC, MISC):
def extract_entities(text: str) -> dict:
    entities = ner_pipeline(text)
    return {
        "persons": [e["word"] for e in entities if e["entity_group"] == "PER"],
        "organizations": [e["word"] for e in entities if e["entity_group"] == "ORG"],
        "locations": [e["word"] for e in entities if e["entity_group"] == "LOC"],
        "misc": [e["word"] for e in entities if e["entity_group"] == "MISC"],
    }

# Example
extract_entities("Elon Musk announced SpaceX's Starship launch from Texas.")
# {
#     "persons": ["Elon Musk"],
#     "organizations": ["SpaceX"],
#     "locations": ["Texas"],
#     "misc": ["Starship"]
# }
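Raw pipeline output can contain duplicate mentions, low-confidence spans, and the occasional subword fragment (a word starting with "##"). The following is a minimal sketch of a clean-up step, assuming the spans are dicts with `entity_group`, `word`, and `score` keys, which is the shape the aggregated pipeline returns. The helper name `group_entities` and the `min_score` threshold of 0.8 are illustrative choices, not part of the original system.

```python
# Maps the CoNLL-2003 tags to the metadata keys used in this article.
LABEL_TO_KEY = {
    "PER": "persons",
    "ORG": "organizations",
    "LOC": "locations",
    "MISC": "misc",
}


def group_entities(ner_output, min_score=0.8):
    """Group NER spans by type, dropping weak, fragmentary, and duplicate spans.

    `min_score` is an illustrative threshold, not a value from the article.
    """
    grouped = {key: [] for key in LABEL_TO_KEY.values()}
    seen = set()
    for span in ner_output:
        key = LABEL_TO_KEY.get(span["entity_group"])
        word = span["word"].strip()
        if key is None or not word or word.startswith("##"):
            continue  # unknown label, empty string, or subword fragment
        if span.get("score", 1.0) < min_score:
            continue  # low-confidence prediction
        if (key, word) in seen:
            continue  # already recorded for this document
        seen.add((key, word))
        grouped[key].append(word)
    return grouped


# Hand-written spans in the pipeline's output shape (not real model output).
sample = [
    {"entity_group": "PER", "word": "Elon Musk", "score": 0.99},
    {"entity_group": "ORG", "word": "SpaceX", "score": 0.98},
    {"entity_group": "ORG", "word": "SpaceX", "score": 0.97},
    {"entity_group": "LOC", "word": "##xas", "score": 0.95},
    {"entity_group": "MISC", "word": "Starship", "score": 0.41},
]
cleaned = group_entities(sample)
```

Deduplicating before indexing keeps the metadata lists short, and the same function can be applied to query entities so both sides are cleaned the same way.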
Step 3: Indexing Documents
For each article, we generate an embedding and store the extracted entities as Pinecone metadata:
def index_document(doc_id: str, text: str, metadata: dict):
    entities = extract_entities(text)
    embedding = encoder.encode(text).tolist()
    pinecone_metadata = {
        **metadata,  # title, url, publication_date, etc.
        **entities,  # persons, organizations, locations, misc
    }
    index.upsert([(doc_id, embedding, pinecone_metadata)])

# Index all 50k articles
# `articles` is assumed to be an iterable of dicts with "id", "title", "url", and "content" keys
for article in articles:
    index_document(
        doc_id=article["id"],
        text=article["content"],
        metadata={"title": article["title"], "url": article["url"]},
    )
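The loop above sends one upsert request per article, which adds up over 50,000 documents. A possible refinement is to group records into batches before sending them. The sketch below assumes only that the index object exposes `upsert(list_of_tuples)`, as the Pinecone index does here. `batched`, `upsert_in_batches`, `FakeIndex`, and the batch size of 100 are hypothetical names and values chosen for illustration.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of at most `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upsert_in_batches(index, records, batch_size=100):
    """Send (id, vector, metadata) tuples to `index.upsert` one batch at a time.

    `index` is any object with an `upsert(list_of_tuples)` method.
    Returns the number of requests made.
    """
    requests = 0
    for batch in batched(records, batch_size):
        index.upsert(batch)
        requests += 1
    return requests


class FakeIndex:
    """In-memory stand-in so the sketch runs without a Pinecone account."""

    def __init__(self):
        self.stored = []

    def upsert(self, vectors):
        self.stored.extend(vectors)


fake = FakeIndex()
records = [(f"doc-{i}", [0.0] * 384, {"title": f"Article {i}"}) for i in range(250)]
request_count = upsert_in_batches(fake, records, batch_size=100)
```

With the real index, `records` would be a generator that yields one `(doc_id, embedding, pinecone_metadata)` tuple per article, so the whole corpus never has to sit in memory at once.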
Step 4: NER-Enhanced Query
At query time, extract entities from the user's query and use them as Pinecone metadata filters:
def search(query: str, top_k: int = 5) -> list:
    query_entities = extract_entities(query)
    query_embedding = encoder.encode(query).tolist()

    # Build metadata filter from query entities
    filter_dict = {}
    if query_entities["persons"]:
        filter_dict["persons"] = {"$in": query_entities["persons"]}
    if query_entities["organizations"]:
        filter_dict["organizations"] = {"$in": query_entities["organizations"]}
    if query_entities["locations"]:
        filter_dict["locations"] = {"$in": query_entities["locations"]}

    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        filter=filter_dict if filter_dict else None,
        include_metadata=True,
    )
    return results["matches"]
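The filter-building part of `search` can be pulled into a small helper, which makes it easy to check without a live index. This is a minimal sketch: `build_entity_filter` and its `keys` parameter are names introduced here for illustration, and the output mirrors the dictionary that `search` builds inline.

```python
def build_entity_filter(query_entities, keys=("persons", "organizations", "locations")):
    """Return a Pinecone-style metadata filter, or None if no entities were found."""
    filter_dict = {
        key: {"$in": query_entities[key]}
        for key in keys
        if query_entities.get(key)  # skip missing or empty entity lists
    }
    return filter_dict or None


with_entity = build_entity_filter(
    {"persons": [], "organizations": ["Google"], "locations": [], "misc": []}
)
without_entity = build_entity_filter(
    {"persons": [], "organizations": [], "locations": [], "misc": []}
)
```

When several keys are present, Pinecone's filter syntax combines them with an implicit AND, so a query naming both a person and a location only matches articles tagged with both. A query with no recognized entities yields `None` and falls back to plain semantic search.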
Why NER Improves Retrieval
Consider how the two approaches handle entity-specific queries such as "Google's approach to large language models":
| Query | Standard Semantic Search | NER-Enhanced Search |
|---|---|---|
| "Google's approach to large language models" | Tech/AI articles (semantically similar) | Articles explicitly mentioning Google |
| "Elon Musk's companies" | Entrepreneur/startup articles | Articles featuring Elon Musk by name |
| "Events in Paris" | Event articles | Articles about Paris specifically |
Pure semantic search finds documents with similar meaning — but if you name a specific entity, you want documents that actually reference that entity, not just articles about the same topic.
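The difference can be reproduced on a toy scale without any models. The sketch below ranks four hand-made documents by cosine similarity, first without a filter and then keeping only documents tagged with a required organization. The 3-dimensional vectors, document ids, and the `rank` helper are all invented for illustration; they stand in for real 384-dimensional embeddings and a Pinecone query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank(query_vec, docs, required_orgs=None, top_k=2):
    """Rank docs by cosine similarity, optionally keeping only docs that
    mention at least one of `required_orgs` (mimicking an `$in` filter)."""
    candidates = [
        d for d in docs
        if not required_orgs or set(required_orgs) & set(d["organizations"])
    ]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]


# Toy vectors chosen by hand; the first axis loosely means "about LLMs".
docs = [
    {"id": "openai-llm", "vector": [0.9, 0.1, 0.0], "organizations": ["OpenAI"]},
    {"id": "meta-llm", "vector": [0.8, 0.2, 0.0], "organizations": ["Meta"]},
    {"id": "google-llm", "vector": [0.7, 0.3, 0.1], "organizations": ["Google"]},
    {"id": "google-maps", "vector": [0.1, 0.2, 0.9], "organizations": ["Google"]},
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded query

plain = rank(query_vec, docs)                              # ['openai-llm', 'meta-llm']
filtered = rank(query_vec, docs, required_orgs=["Google"])  # ['google-llm', 'google-maps']
```

The unfiltered ranking returns the two closest LLM articles, neither of which mentions Google. The filtered ranking guarantees the entity but will pull in a weaker semantic match ("google-maps") when few tagged documents exist.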
Key Models
dslim/bert-base-NER
- BERT-base fine-tuned on CoNLL-2003
- 4 entity types: PER, ORG, LOC, MISC
- Fast inference, production-ready
multi-qa-MiniLM-L6-cos-v1
- Optimized for semantic search tasks
- 384-dimensional embeddings
- 6-layer architecture, efficient on CPU (about 50 ms per batch; batch size and hardware not stated)
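Latency figures like the one above depend heavily on batch size and hardware, so it is worth measuring on your own machine. Here is a small timing sketch using only the standard library. `time_batches` and `fake_encode` are hypothetical helpers, and the stand-in encoder exists only so the snippet runs without downloading a model.

```python
import time


def time_batches(encode_fn, texts, batch_size=32, repeats=3):
    """Return the best-of-`repeats` mean seconds per batch for `encode_fn`."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    if not batches:
        raise ValueError("texts must not be empty")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for batch in batches:
            encode_fn(batch)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(batches))
    return best


def fake_encode(batch):
    """Stand-in encoder returning one 384-dim vector per input text."""
    return [[float(len(text))] * 384 for text in batch]


texts = [f"example sentence number {i}" for i in range(100)]
seconds_per_batch = time_batches(fake_encode, texts, batch_size=32)
```

To time the real model, pass `encoder.encode` in place of `fake_encode`; it accepts a list of strings, so no other change should be needed.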
Results
On the 50,000 Medium article corpus, NER-enhanced search significantly improved result relevance for entity-specific queries. Because the entity filter is a hard constraint, queries containing named entities return only documents tagged with those entities, which eliminates the "semantically relevant but factually unrelated" problem common in pure vector search.
Extending This
- Hybrid search (BM25 + dense): add keyword matching for exact terms alongside vector similarity
- Entity linking: normalize "Elon" and "Musk" to the same entity
- Re-ranking: apply a cross-encoder on top-k candidates for a final relevance score
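As a starting point for the entity-linking idea, surface forms can be mapped to a canonical name with a lookup table before they are stored or used in a filter. This is a deliberately simple sketch: the `ALIASES` table and `link_entities` helper are hypothetical, and a real system would more likely derive aliases from a knowledge base than from a hand-written dictionary.

```python
# Hypothetical alias table, keyed by lowercased surface form.
ALIASES = {
    "elon": "Elon Musk",
    "musk": "Elon Musk",
    "elon musk": "Elon Musk",
}


def link_entities(names, aliases=ALIASES):
    """Map surface forms to canonical names, keeping order and dropping duplicates."""
    linked = []
    for name in names:
        cleaned = name.strip()
        canonical = aliases.get(cleaned.lower(), cleaned)  # unknown names pass through
        if canonical not in linked:
            linked.append(canonical)
    return linked


linked = link_entities(["Elon", "Musk", "SpaceX", "Elon Musk"])
```

The same mapping has to run at both indexing and query time. Otherwise a query for "Musk" would be filtered against "Elon Musk" in the metadata and match nothing.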