
Named entity recognition (NER) explained

A compliance analyst reads a wire-transfer memo: "Please release funds to Acme Holdings Ltd per contract with Jane Doe signed in Singapore." Three facts matter (payee organization, signatory person, jurisdiction), and all three are buried in free text. Named entity recognition (NER) is the NLP task that finds and classifies such spans: contiguous character or token ranges labeled as person, organization, location, date, product SKU, medical code, or any custom type your schema defines. Unlike text classification, which assigns one label to an entire document, NER is sequence labeling: every token (or subword) gets a tag, and adjacent tags are stitched into entities. NER powers search filters, CRM auto-fill, knowledge-graph population, regulatory redaction, and routing rules in support queues. This guide covers tagging schemes (BIO/BILOU), classical and neural models, spaCy and transformer pipelines, entity linking, span-F1 evaluation, a Harbor Support ticket-triage worked example, an approach decision table, pitfalls, and a production checklist. It is a companion to the guides on NLP fundamentals and knowledge graphs.

What NER extracts and how schemas are designed

Named entities are mentions of real-world objects referred to by name or identifier. Classic newswire schemas use PER (person), ORG, LOC, and sometimes MISC. Domain schemas add PRODUCT, LAW, GPE (geo-political entity), MONEY, DATE, ACCOUNT_ID, or clinical codes. Design choices drive everything downstream:

  • Granularity — Is "New York" a LOC or a GPE? Is "iPhone 15 Pro" PRODUCT or nested ORG + PRODUCT? Document the boundary rules.
  • Nesting and overlap — "University of California, Berkeley" may contain both org and loc. Flat NER cannot represent overlaps; use layered models, span-based parsers, or graph schemas.
  • Custom vs general — Off-the-shelf English models miss internal project codenames. Fine-tune or add gazetteer features for domain tokens.

Output is usually a list of tuples: (start, end, type, text) with character offsets so downstream systems can highlight, redact, or link without re-tokenizing.
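As a minimal sketch of that output shape, the snippet below stores entities as (start, end, type, text) records and uses the offsets to redact the memo from the introduction. The names Entity, make_entity, and redact are illustrative and not taken from any library; the end offset is assumed to be exclusive.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset (an assumption of this sketch)
    type: str
    text: str


def make_entity(source: str, text: str, etype: str) -> Entity:
    """Locate the first occurrence of `text` and record its offsets."""
    start = source.index(text)
    return Entity(start, start + len(text), etype, text)


def redact(source: str, entities: list[Entity]) -> str:
    """Replace each entity span with a [TYPE] placeholder."""
    out = source
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        if source[ent.start:ent.end] != ent.text:
            raise ValueError(f"offsets out of sync for {ent.text!r}")
        out = out[:ent.start] + f"[{ent.type}]" + out[ent.end:]
    return out


memo = ("Please release funds to Acme Holdings Ltd per contract "
        "with Jane Doe signed in Singapore.")
entities = [
    make_entity(memo, "Acme Holdings Ltd", "ORG"),
    make_entity(memo, "Jane Doe", "PER"),
    make_entity(memo, "Singapore", "LOC"),
]
print(redact(memo, entities))
```

Because the records carry offsets into the original string, the redaction step never has to tokenize the memo again.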

Tagging schemes: BIO, BILOU, and span-based alternatives

Sequence labelers assign one tag per token. Multi-token entities need a scheme that encodes boundaries:

BIO (IOB) tagging

B-TYPE begins an entity, I-TYPE continues it, and O marks tokens outside any entity. Example: the tokens [Jane, Doe, at, Acme, Corp] are tagged [B-PER, I-PER, O, B-ORG, I-ORG]. BIO is the de facto standard in CoNLL shared tasks and most training libraries.
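To make the "stitching" concrete, here is a small sketch of a decoder that turns a BIO tag sequence into token-level spans. The function name bio_to_spans is illustrative, and treating a stray I- tag as the start of a new entity is one lenient convention among several.

```python
def bio_to_spans(tags: list[str]) -> list[tuple[int, int, str]]:
    """Return (start_token, end_token_exclusive, type) for each entity."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel O flushes the last entity
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            spans.append((start, i, current))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            # A stray I- tag with no open entity of that type starts a new one.
            start, current = i, etype
    return spans


tokens = ["Jane", "Doe", "at", "Acme", "Corp"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG"]
print(bio_to_spans(tags))  # [(0, 2, 'PER'), (3, 5, 'ORG')]
```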

BILOU (BIOES)

Adds L-TYPE (last token) and U-TYPE (unit, a single-token entity) to BIO; BIOES is the same scheme with E (end) and S (single) in place of L and U. Single-token entities become unambiguous without heuristics, and some CRF and neural models learn faster with explicit boundary signals.
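The conversion from BIO to BILOU is mechanical, which the following sketch illustrates. It assumes the input is well-formed BIO (every I- tag follows a B- or I- tag of the same type); bio_to_bilou is an illustrative name.

```python
def bio_to_bilou(tags: list[str]) -> list[str]:
    """Convert well-formed BIO tags to BILOU tags."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        ends_here = nxt != f"I-{etype}"  # entity stops unless the next tag continues it
        if prefix == "B":
            out.append(("U-" if ends_here else "B-") + etype)
        else:  # prefix == "I"
            out.append(("L-" if ends_here else "I-") + etype)
    return out


bio = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
print(bio_to_bilou(bio))
```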

Span-based and generative NER

Instead of per-token tags, models predict start/end indices or emit structured JSON. Transformer span classifiers score all candidate spans; LLMs can extract entities via constrained generation. These handle overlapping entities better than flat BIO, at higher compute cost.
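A rough sketch of the span-based idea: enumerate every candidate span up to a maximum width, score each (span, type) pair, and keep all that clear a threshold, overlaps included. The scores below are made-up stand-ins for a classifier's output, and both function names are illustrative.

```python
def candidate_spans(n_tokens: int, max_width: int) -> list[tuple[int, int]]:
    """All (start, end_exclusive) token spans up to max_width tokens long."""
    return [
        (start, end)
        for start in range(n_tokens)
        for end in range(start + 1, min(start + max_width, n_tokens) + 1)
    ]


def select_spans(scores: dict[tuple[int, int, str], float],
                 threshold: float = 0.5) -> list[tuple[int, int, str]]:
    """Keep every labeled span at or above threshold; overlaps are allowed."""
    return sorted(span for span, score in scores.items() if score >= threshold)


tokens = ["University", "of", "California", ",", "Berkeley"]
spans = candidate_spans(len(tokens), max_width=5)
print(len(spans))  # 15 candidates for 5 tokens

# Hypothetical classifier scores for a few of the candidates.
scores = {(0, 5, "ORG"): 0.93, (2, 3, "LOC"): 0.81,
          (4, 5, "LOC"): 0.77, (0, 2, "ORG"): 0.12}
print(select_spans(scores))
```

The candidate count grows roughly with the number of tokens times max_width, which is where the extra compute cost comes from.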

Model families: from regex to fine-tuned transformers

Rules and gazetteers

Regular expressions plus curated dictionaries (company names, airport codes, drug lists) deliver high precision on known patterns with near-zero training data. Combine with a confidence gate: if the gazetteer misses, fall through to ML.
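A minimal sketch of that rules-first pattern follows. The order-ID regex, the two-entry provider list, and the ml_fallback callable are all assumptions for illustration; here the "gate" is simply whether the rules found anything.

```python
import re
from typing import Callable, Optional

Span = tuple[int, int, str, str]  # (start, end, type, text)

ORDER_ID = re.compile(r"#\d{4,}")   # hypothetical order-ID format
PROVIDERS = ("Stripe", "PayPal")    # tiny illustrative gazetteer


def rule_spans(text: str) -> list[Span]:
    """High-precision matches from the regex and the gazetteer."""
    spans = [(m.start(), m.end(), "ORDER_ID", m.group())
             for m in ORDER_ID.finditer(text)]
    for name in PROVIDERS:
        pattern = rf"\b{re.escape(name)}\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), "PAYMENT_PROVIDER", m.group()))
    return sorted(spans)


def extract(text: str,
            ml_fallback: Optional[Callable[[str], list[Span]]] = None) -> list[Span]:
    """Use rule matches when there are any; otherwise fall through to ML."""
    spans = rule_spans(text)
    if spans or ml_fallback is None:
        return spans
    return ml_fallback(text)


print(extract("Refund for order #8842 on Stripe"))
```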

CRF and classical ML

Conditional Random Fields model tag transitions directly: an I-ORG immediately after B-PER receives low probability even without a neural encoder. Features include word shape, prefixes and suffixes, part-of-speech tags, and dictionary membership. CRFs remain viable on CPU for stable domains with modest vocabulary drift.
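The hand-built features can be sketched as one dictionary per token, the general shape that CRF toolkits such as sklearn-crfsuite accept. Part-of-speech tags are left out because they would need a tagger; the function names and the exact feature set are illustrative choices.

```python
import re


def word_shape(token: str) -> str:
    """Map characters to X/x/d and collapse repeats, e.g. 'Acme' -> 'Xx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)   # 'X' is uppercase, so it is untouched
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens: list[str], i: int, gazetteer: set[str]) -> dict:
    """Features for token i, including a one-token context window."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "prefix3": tok[:3],
        "suffix3": tok[-3:],
        "in_gazetteer": tok.lower() in gazetteer,
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }


tokens = ["Jane", "Doe", "at", "Acme", "Corp"]
features = [token_features(tokens, i, {"acme"}) for i in range(len(tokens))]
print(features[3])
```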

spaCy and CNN/LSTM encoders

spaCy pipelines embed tokens with hashed embeddings and a CNN encoder or with a transformer, apply a transition-based NER component, and ship pretrained English pipelines ranging from en_core_web_sm to the transformer-based en_core_web_trf. They are fast to deploy and give a good baseline on clean newswire and web text.

BERT-style token classification

Fine-tune a pretrained encoder (BERT, RoBERTa, DeBERTa, domain models like BioClinicalBERT) with a linear head on each subword. Align subword pieces back to word boundaries with the standard "first subword gets the label" rule. This dominates benchmark leaderboards and handles context disambiguation — "Apple" as fruit vs company from surrounding tokens.
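The "first subword gets the label" rule can be sketched without any model code. The word_ids list below mirrors the shape that Hugging Face fast tokenizers return from word_ids() (None for special tokens, otherwise the index of the source word); the tokenization shown is illustrative, and align_labels is a hypothetical helper name.

```python
from typing import Optional

IGNORE = -100  # label index that PyTorch's cross-entropy loss skips by default


def align_labels(word_labels: list[int],
                 word_ids: list[Optional[int]]) -> list[int]:
    """Give each word's label to its first subword and mask the rest."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:          # special tokens such as [CLS] and [SEP]
            aligned.append(IGNORE)
        elif wid != previous:    # first subword of a word carries the label
            aligned.append(word_labels[wid])
        else:                    # later subwords are masked out of the loss
            aligned.append(IGNORE)
        previous = wid
    return aligned


# Words: Jane Doe at Acme, with label ids 1=B-PER, 2=I-PER, 0=O, 3=B-ORG.
word_labels = [1, 2, 0, 3]
# Illustrative subwords: [CLS] jane doe at ac ##me [SEP]
word_ids = [None, 0, 1, 2, 3, 3, None]
print(align_labels(word_labels, word_ids))  # [-100, 1, 2, 0, 3, -100, -100]
```

At inference time the same mapping runs in reverse: read the prediction on each word's first subword and ignore the others.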

LLM extraction

Prompt a large model to return JSON entity lists. Strong zero-shot on novel types without annotation, but latency, cost, and format reliability lag dedicated NER models. Use for bootstrap labeling or low-volume analyst tools; cache and validate outputs against a schema.
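Validation of the model's JSON might look like the sketch below. It assumes the prompt asks for a JSON list of objects with "text" and "type" keys, which is a design choice of this example and not a fixed format; entities whose text does not occur in the source are dropped as likely hallucinations.

```python
import json

ALLOWED_TYPES = {"PER", "ORG", "LOC"}  # illustrative schema


def validate_llm_entities(raw: str, source: str) -> list[dict]:
    """Parse raw model output and keep only schema-valid, grounded entities."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, list):
        return []
    valid = []
    for item in payload:
        if not isinstance(item, dict):
            continue
        text, etype = item.get("text"), item.get("type")
        if not isinstance(text, str) or not isinstance(etype, str):
            continue
        if not text or etype not in ALLOWED_TYPES:
            continue
        start = source.find(text)  # first occurrence only, a known limitation
        if start == -1:            # the string is not in the source text
            continue
        valid.append({"start": start, "end": start + len(text),
                      "type": etype, "text": text})
    return valid


source = "Please release funds to Acme Holdings Ltd per contract with Jane Doe."
raw = json.dumps([
    {"text": "Acme Holdings Ltd", "type": "ORG"},
    {"text": "Globex Inc", "type": "ORG"},    # not in the source
    {"text": "Jane Doe", "type": "WIZARD"},   # type outside the schema
])
print(validate_llm_entities(raw, source))
```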

Entity linking and knowledge graph integration

Recognizing "Paris" is insufficient when your database needs entity ID wd:Q90 (city) vs wd:Q457129 (Texas town). Entity linking (a.k.a. entity disambiguation) maps spans to canonical records in Wikidata, an internal CRM, or a product catalog. Pipelines: candidate generation (string match, embedding nearest neighbor) then a ranker using context embeddings. Linked entities feed knowledge graphs for search, fraud graph traversal, and RAG retrieval filters. Track linking accuracy separately from mention detection — a perfect span with wrong ID is a silent data bug.
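The two-stage pipeline can be sketched with a toy knowledge base. The entity IDs are the ones quoted above; the context keywords are invented for illustration, and plain keyword overlap stands in for the context-embedding ranker a real system would use.

```python
import re
from typing import Optional

# Toy knowledge base: IDs from the text above, context keywords invented.
KB = {
    "wd:Q90": {"name": "Paris", "context": {"france", "capital", "seine", "louvre"}},
    "wd:Q457129": {"name": "Paris", "context": {"texas", "lamar", "county", "usa"}},
}


def link(mention: str, sentence: str, kb: dict = KB, min_score: int = 1) -> Optional[str]:
    """Candidate generation by exact name match, then rank by context overlap."""
    candidates = [eid for eid, rec in kb.items()
                  if rec["name"].lower() == mention.lower()]
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    best_id, best_score = None, 0
    for eid in candidates:
        score = len(words & kb[eid]["context"])
        if score > best_score:
            best_id, best_score = eid, score
    # Abstain when no candidate has enough supporting context.
    return best_id if best_score >= min_score else None


print(link("Paris", "We flew to Paris to see the Louvre in France"))
print(link("Paris", "The rodeo in Paris, Texas starts Friday"))
print(link("Paris", "Paris called again"))
```

Returning None when the context is too thin keeps an uncertain link from silently becoming a wrong ID.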

Evaluation: span-level precision, recall, and F1

Token-level accuracy is misleading: most tokens are O, so a model that predicts O everywhere scores well while finding nothing useful. Standard practice is exact span match on (start, end, type): both boundaries and the type must match. Partial-overlap metrics exist but reward sloppy boundaries. Report per-type F1 because models often excel on PER but fail on rare PRODUCT tags. Split by document source, not by random sentences, to detect domain shift. For imbalanced types, macro-F1 or a weighted business cost (missing an SSN vs. raising a false alarm) is more informative than headline micro-F1.
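A compact sketch of strict span scoring, assuming gold and predicted entities are sets of (start, end, type) tuples for one document (a multi-document version would add a document ID to each tuple). The function names are illustrative.

```python
Span = tuple[int, int, str]  # (start, end, type)


def span_prf(gold: set[Span], pred: set[Span]) -> dict[str, dict[str, float]]:
    """Per-type precision, recall, and F1 under exact span match."""
    report = {}
    for etype in sorted({s[2] for s in gold | pred}):
        g = {s for s in gold if s[2] == etype}
        p = {s for s in pred if s[2] == etype}
        tp = len(g & p)  # a hit needs both boundaries and the type to match
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[etype] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def macro_f1(report: dict[str, dict[str, float]]) -> float:
    """Unweighted mean of per-type F1."""
    return sum(r["f1"] for r in report.values()) / len(report) if report else 0.0


gold = {(0, 2, "PER"), (3, 5, "ORG"), (7, 8, "ORG")}
pred = {(0, 2, "PER"), (3, 4, "ORG"), (7, 8, "ORG")}  # one ORG boundary is off
report = span_prf(gold, pred)
print(report["ORG"]["f1"], macro_f1(report))  # 0.5 0.75
```

The truncated ORG span counts as both a false positive and a false negative, which is why strict matching punishes boundary errors twice.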

Worked example: Harbor Support entity-aware routing

Harbor Support receives 12,000 tickets weekly. Routing by keywords alone sent requests like "Refund for order #8842 on Stripe" to the wrong queue whenever customers phrased them without the word "refund." The team added a lightweight NER layer:

  1. Schema: ORDER_ID, PAYMENT_PROVIDER, PRODUCT, EMAIL, PER (agent or customer name when signed).
  2. Training data: 4,200 tickets annotated in Label Studio; regex pre-annotations for order IDs sped labeling 3x.
  3. Model: Fine-tuned distilbert-base-uncased token classifier on BIO tags; spaCy en_core_web_sm kept for PER/ORG fallback on low-confidence spans (< 0.75 max token prob).
  4. Routing rules: Any ORDER_ID + billing keyword routes to Commerce; PAYMENT_PROVIDER in {Stripe, PayPal} escalates to Payments L2; linked PRODUCT IDs enrich the CRM sidebar.

Span F1 on holdout: ORDER_ID 0.96, PRODUCT 0.88, PAYMENT_PROVIDER 0.91. End-to-end routing accuracy improved from 71% to 84% with 18 ms median latency on CPU (ONNX export). False positives on order-like numbers in log attachments dropped after they excluded code blocks in preprocessing.
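The routing rules in step 4 can be sketched as a small function over the extracted entities. The queue names, the billing keyword list, and the choice to let the Payments L2 escalation take precedence over the Commerce rule are all assumptions of this example; the source does not specify them.

```python
import re

BILLING_KEYWORDS = {"refund", "charge", "charged", "invoice", "billing"}  # hypothetical
L2_PROVIDERS = {"stripe", "paypal"}


def route(text: str, entities: list[dict]) -> str:
    """Pick a queue from entity types plus keywords (assumed precedence)."""
    providers = {e["text"].lower() for e in entities
                 if e["type"] == "PAYMENT_PROVIDER"}
    if providers & L2_PROVIDERS:
        return "payments_l2"
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_order = any(e["type"] == "ORDER_ID" for e in entities)
    if has_order and words & BILLING_KEYWORDS:
        return "commerce"
    return "general"


order = {"type": "ORDER_ID", "text": "#8842"}
stripe = {"type": "PAYMENT_PROVIDER", "text": "Stripe"}
print(route("Refund for order #8842 on Stripe", [order, stripe]))
print(route("I was charged twice for order #8842", [order]))
print(route("Where is my order #8842?", [order]))
```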

Approach decision table

| Approach | Best when | Watch out for |
| --- | --- | --- |
| Regex + gazetteer | Fixed codes, IDs, regulated vocabularies; zero training budget | Recall collapses on paraphrases and typos |
| CRF + hand features | Stable domain, CPU-only, interpretable transitions | Weak on long-range context and social text |
| spaCy pretrained | Quick English baseline, PER/ORG/LOC on clean prose | Custom types need fine-tuning or a replacement NER component |
| Fine-tuned transformer | Custom schema, ambiguous context, benchmark-leading F1 | Annotation cost; subword alignment bugs |
| LLM JSON extraction | Rapid schema experiments, few-shot new types | Cost, latency, hallucinated entities |
| Span-based / overlap-aware | Nested entities (loc inside org) | More complex training and inference |

Common pitfalls

  • Train/serve tokenization mismatch — re-tokenizing breaks character offsets; persist offset maps through the pipeline.
  • Label inconsistency — annotators disagree on whether "Monday" is DATE; a 20-label guideline doc saves weeks.
  • Data leakage from templates — tickets share boilerplate; random splits inflate F1; split by account or time.
  • Ignoring subword fragmentation — mis-aligned BIO tags on WordPiece tokens create illegal I- starts.
  • Evaluating on token accuracy — use strict span F1 per type.
  • No confidence threshold — low-confidence spans should abstain or route to human review.
  • Skipping entity linking tests — "Amazon" linked to the rainforest reserve wrecks fulfillment APIs.
  • PII in logs — NER output is sensitive; redact before analytics stores.
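For the template-leakage pitfall, one possible approach is a deterministic split keyed on the account, so every ticket from an account lands on the same side. This is a sketch under assumptions: the "account" field name and the 20% test fraction are illustrative.

```python
import hashlib


def assign_split(account_id: str, test_fraction: float = 0.2) -> str:
    """Hash the account ID into [0, 1) and compare against the test fraction."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


tickets = [{"id": i, "account": f"acct-{i % 7}"} for i in range(50)]
for ticket in tickets:
    ticket["split"] = assign_split(ticket["account"])

train_accounts = {t["account"] for t in tickets if t["split"] == "train"}
test_accounts = {t["account"] for t in tickets if t["split"] == "test"}
print(train_accounts & test_accounts)  # set(): no account appears on both sides
```

Hashing keeps the assignment stable as new tickets arrive, unlike a random shuffle that must be re-run and stored.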

Production checklist

  • Publish entity schema with boundary examples and negative cases.
  • Choose BIO vs BILOU vs span-based to match overlap requirements.
  • Baseline with regex/gazetteer for high-precision IDs before ML.
  • Annotate 500–2,000 diverse examples per rare type before fine-tuning.
  • Fine-tune transformer with early stopping; export ONNX if CPU latency matters.
  • Validate strict span F1 per type on held-out documents, not sentences.
  • Add confidence gating and human-review queue for regulated entity types.
  • Wire entity linking to canonical IDs with fallback when score < threshold.
  • Monitor type distribution drift and new OOV tokens weekly.
  • Version model, schema, and gazetteer together; replay golden tickets on deploy.
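As one way to approach the drift-monitoring item, the sketch below compares this week's distribution of predicted entity types against a baseline using total variation distance. The metric choice and any alert threshold are assumptions; the checklist does not prescribe either.

```python
from collections import Counter


def type_distribution(types: list[str]) -> dict[str, float]:
    """Relative frequency of each predicted entity type."""
    counts = Counter(types)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two distributions (0 = same, 1 = disjoint)."""
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))


baseline = type_distribution(["ORDER_ID"] * 6 + ["PRODUCT"] * 4)
this_week = type_distribution(["ORDER_ID"] * 3 + ["PRODUCT"] * 3 + ["EMAIL"] * 4)
drift = total_variation(baseline, this_week)
print(round(drift, 3))  # 0.4
```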

Key takeaways

  • NER labels spans, not whole documents — sequence tagging with BIO/BILOU is the standard encoding.
  • Schema design is half the project — granularity, overlap, and guidelines determine usable output.
  • Transformers win on ambiguity but rules and gazetteers still earn their place on IDs and codes.
  • Measure exact span F1 per type — token accuracy and micro-F1 hide failure modes.
  • Linking turns mentions into data — graphs, routing, and RAG need canonical entity IDs.
