Linguistic Features · spaCy Usage Documentation

spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.

Overview

Added on

17 March 2026

Subject & domain

computer-science-advanced · natural-language-processing-nlp

School year

Class 1 (brugklas, first year of secondary school) to Class 4

Page type

Article

Introduction

spaCy Linguistic Features Overview

spaCy processes raw text into Doc objects containing rich linguistic annotations. To reduce memory usage, spaCy encodes strings as hash values: an attribute such as Token.pos returns the integer ID, while the same name with a trailing underscore (e.g., Token.pos_) returns the readable string.
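The ID-versus-string distinction can be sketched as follows. This is a minimal illustration, assuming spaCy v3 is installed; it uses a blank English pipeline and a hand-annotated Doc so that no trained model is required, and the POS tags are supplied manually rather than predicted.

```python
import spacy
from spacy.tokens import Doc

# A blank English pipeline has a vocabulary but no trained components,
# so no model download is needed.
nlp = spacy.blank("en")

# Assumption: the POS tags are written by hand for illustration;
# a trained pipeline would predict them.
doc = Doc(nlp.vocab, words=["I", "love", "coffee"], pos=["PRON", "VERB", "NOUN"])
token = doc[2]

pos_id = token.pos     # integer ID
pos_name = token.pos_  # readable string for the same value

# The StringStore maps in both directions: string -> hash and hash -> string.
coffee_hash = nlp.vocab.strings.add("coffee")
coffee_text = nlp.vocab.strings[coffee_hash]

print(pos_id, pos_name)
print(coffee_hash, coffee_text)
```

The same pattern applies to other attributes, for example Token.tag / Token.tag_ and Token.lemma / Token.lemma_.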

  • Part-of-Speech (POS) Tagging: A trained pipeline component predicts each token's tag from its context. Fine-grained tags (Token.tag) carry detailed, often language-specific distinctions, while coarse-grained tags (Token.pos) use the general Universal POS categories.
  • Morphological Analysis:
    • Inflectional Morphology: Modifying a root (lemma) with prefixes/suffixes to change grammatical function without changing the POS.
    • Morphologizer: A statistical component that assigns morphological features (Token.morph) and coarse-grained POS tags (Token.pos).
    • Rule-based approach: Used for languages with relatively simple morphological systems, such as English. Rules map the token text and fine-grained tag to a coarse-grained tag and morphological features.
  • Lemmatization: The process of reducing words to their base form (lemma), available as Token.lemma_. spaCy offers three methods:
    • Lookup: Maps surface forms to lemmas via tables (requires spacy-lookups-data).
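    • Rule-based: Uses language-specific rules and exception files (e.g., WordNet for English). It relies on POS tags and morphological features assigned earlier in the pipeline.
    • Trainable (EditTreeLemmatizer): Learns form-to-lemma transformations from a corpus with lemma annotations. This removes the need for hand-written rules and in many cases gives higher accuracy than lookup or rule-based lemmatization.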
  • Dependency Parsing:
    • Analyzes syntactic relationships between words, represented as a tree of "heads" and "children." Each token has exactly one head (Token.head), and Token.dep_ labels the relation between the two.
    • Noun Chunks: Identifies "base noun phrases" (a noun plus its descriptors) via Doc.noun_chunks.
    • Navigation: Provides an API to traverse the dependency tree, check for annotations (doc.has_annotation("DEP")), and extract syntactic relations.
  • Data Management: Lemmatization tables and lookup data are distributed via the spacy-lookups-data package.
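The attributes named in the list above can be inspected on a single Doc. The sketch below assumes spaCy v3 and builds the annotations by hand (tags, lemmas, heads and dependency labels are illustrative values written for this example, not model output), so it runs without a trained pipeline. With a trained pipeline such as one loaded via spacy.load, the same attributes would be filled in by the tagger, morphologizer, lemmatizer and parser.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # provides the English vocab and noun-chunk rules

words = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
# Hand-written annotations (assumption: chosen to resemble typical parser output).
doc = Doc(
    nlp.vocab,
    words=words,
    tags=["JJ", "NNS", "VBP", "NN", "NN", "IN", "NNS"],         # fine-grained
    pos=["ADJ", "NOUN", "VERB", "NOUN", "NOUN", "ADP", "NOUN"],  # coarse-grained
    lemmas=["autonomous", "car", "shift", "insurance", "liability", "toward", "manufacturer"],
    heads=[1, 2, 2, 4, 2, 2, 5],  # absolute index of each token's head
    deps=["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"],
)

# Morphological features can be set from a string in Universal Dependencies FEATS format.
doc[1].set_morph("Number=Plur")
doc[2].set_morph("Tense=Pres|VerbForm=Fin")

# POS tags, lemma and morphology of one token.
cars = doc[1]
print(cars.text, cars.tag_, cars.pos_, cars.lemma_, cars.morph.get("Number"))

# Dependency navigation: head, relation label and children.
has_parse = doc.has_annotation("DEP")
root = [t for t in doc if t.dep_ == "ROOT"][0]
root_children = [child.text for child in root.children]
for token in doc:
    print(token.text, token.dep_, token.head.text)

# Base noun phrases derived from the dependency parse.
chunks = [chunk.text for chunk in doc.noun_chunks]
print(chunks)
```

Building the Doc by hand keeps the example independent of any particular model version; the attribute access and tree navigation are the same on parser output.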
