Linguistic Features · spaCy Usage Documentation

spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.

At a glance

Added on

March 17, 2026

Subject and field

computer-science-advanced · natural-language-processing-nlp

Grade levels

Grade 9–Grade 12

Page type

Article

Introduction

spaCy Linguistic Features Overview

spaCy processes raw text into Doc objects containing rich linguistic annotations. To reduce memory use, spaCy encodes strings as hash values, so an attribute such as Token.pos returns an integer ID; the same attribute with a trailing underscore (e.g., Token.pos_) returns the readable string.
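
The integer/string pairing can be illustrated with a minimal sketch. It assumes only that spaCy v3 is installed; the annotations are set by hand on a blank pipeline (an assumption made here so that no trained model is needed), whereas a trained pipeline would normally predict them.

```python
import spacy
from spacy.tokens import Doc

# A blank pipeline provides a vocab and tokenizer but no trained components.
nlp = spacy.blank("en")

# Hand-written POS values, for illustration only.
doc = Doc(
    nlp.vocab,
    words=["Apple", "is", "growing"],
    pos=["PROPN", "AUX", "VERB"],
)
token = doc[0]

# Without the underscore: an integer. With the underscore: the string.
print(token.orth, token.orth_)
print(token.pos, token.pos_)

# The vocab's StringStore maps between the two in both directions.
text_from_hash = nlp.vocab.strings[token.orth]
hash_from_text = nlp.vocab.strings["Apple"]
```

Here `text_from_hash` is the string "Apple" and `hash_from_text` equals `token.orth`, which is why the integer form can be stored compactly and converted back only when needed.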

  • Part-of-Speech (POS) Tagging: A trained statistical model predicts each token's tag from its context. Fine-grained tags (Token.tag) carry detailed, often language-specific distinctions, while coarse-grained tags (Token.pos) give general categories from the Universal POS tag set.
  • Morphological Analysis:
    • Inflectional Morphology: A root form (lemma) is modified with prefixes or suffixes that mark its grammatical function without changing its part of speech (e.g., read → reading).
    • Morphologizer: A trainable statistical component that assigns morphological features (Token.morph) and coarse-grained POS tags (Token.pos).
    • Rule-based approach: For languages with relatively simple morphological systems, such as English, the token text and fine-grained tag are mapped to a coarse-grained tag and morphological features.
  • Lemmatization: The process of reducing an inflected word to its base form (lemma). spaCy offers three methods:
    • Lookup: Maps surface forms to lemmas via tables (requires spacy-lookups-data).
    • Rule-based: Uses language-specific rules and exception files (e.g., WordNet for English) based on POS/morphology.
    • Trainable (EditTreeLemmatizer): Learns form-to-lemma transformations from a training corpus, so no language-specific rules are needed; in many cases it achieves higher accuracy than lookup or rule-based lemmatization.
  • Dependency Parsing:
    • Analyzes syntactic relationships between words, represented as a tree of "heads" and "children."
    • Noun Chunks: Doc.noun_chunks iterates over "base noun phrases", flat phrases with a noun as their head (a noun plus the words describing it).
    • Navigation: Provides an API to traverse the dependency tree, check for annotations (doc.has_annotation("DEP")), and extract syntactic relations.
  • Data Management: Lemmatization tables and lookup data are distributed via the spacy-lookups-data package.
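
To make the tag and morphology attributes from the outline concrete, here is a small sketch of reading Token.tag_, Token.pos_ and Token.morph. The tags and feature strings are written by hand (an assumption for illustration, so no trained model is required); in a real pipeline the tagger and morphologizer would predict them.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # vocab only; no trained tagger or morphologizer

# Hand-written annotations, for illustration only.
doc = Doc(
    nlp.vocab,
    words=["I", "was", "reading", "the", "paper"],
    tags=["PRP", "VBD", "VBG", "DT", "NN"],          # fine-grained
    pos=["PRON", "AUX", "VERB", "DET", "NOUN"],      # coarse-grained
    morphs=[
        "Case=Nom|Number=Sing|Person=1|PronType=Prs",
        "Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin",
        "Aspect=Prog|Tense=Pres|VerbForm=Part",
        "Definite=Def|PronType=Art",
        "Number=Sing",
    ],
)

for token in doc:
    print(token.text, token.tag_, token.pos_, token.morph)

pronoun = doc[0]
pron_type = pronoun.morph.get("PronType")   # list of values for one feature
features = pronoun.morph.to_dict()          # all features as a dict

# spacy.explain returns a short description for many tags and labels.
print(spacy.explain("VBG"))
```

Token.morph is a MorphAnalysis object, so individual features can be queried with get() rather than by parsing the feature string.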
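
The lookup method of lemmatization can be sketched as follows. The two-entry table is a toy stand-in invented for this example; in normal use the "lemma_lookup" table would come from the spacy-lookups-data package rather than being written by hand.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")

# Add the lemmatizer in lookup mode (the mode is chosen via the config).
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Toy table (assumption for illustration): surface form -> lemma.
lookups = Lookups()
lookups.add_table("lemma_lookup", {"dogs": "dog", "ran": "run"})

# Passing lookups explicitly avoids loading tables from spacy-lookups-data.
lemmatizer.initialize(lookups=lookups)

doc = nlp("dogs ran quickly")
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```

Forms that are missing from the table, such as "quickly" here, fall back to the token text, which shows the main limitation of a pure lookup: it ignores both context and part of speech.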
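
The dependency-tree navigation and noun chunks described in the outline can be tried on a hand-annotated Doc. The heads, labels and POS values below are written manually (an assumption for illustration, so no trained parser is needed); heads are given as absolute token indices.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["Autonomous", "cars", "shift", "insurance", "liability",
         "toward", "manufacturers"]
doc = Doc(
    nlp.vocab,
    words=words,
    pos=["ADJ", "NOUN", "VERB", "NOUN", "NOUN", "ADP", "NOUN"],
    heads=[1, 2, 2, 4, 2, 2, 5],  # index of each token's head
    deps=["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"],
)

# Check that dependency annotation is present before relying on it.
has_parse = doc.has_annotation("DEP")

# Each token has exactly one head; the root is labeled ROOT.
root = [token for token in doc if token.dep_ == "ROOT"][0]
root_children = [child.text for child in root.children]

# Walk the tree: every token, its relation, and its head.
for token in doc:
    print(token.text, token.dep_, token.head.text)

# A subtree contains a token and everything below it.
prep_subtree = [t.text for t in doc[5].subtree]

# Base noun phrases, computed from the POS tags and the parse.
chunks = [chunk.text for chunk in doc.noun_chunks]
print(chunks)
```

Under these annotations the root is "shift", its children are "cars", "liability" and "toward", and the noun chunks are "Autonomous cars", "insurance liability" and "manufacturers".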
