Linguistic Features · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.
At a glance

Added on
17 March 2026
Subject and domain
computer-science-advanced · natural-language-processing-nlp
Grade levels
Grade 9 (3e) to Grade 12 (Terminale)
Page type
Article
Introduction
spaCy Linguistic Features Overview
spaCy processes raw text into Doc objects that carry linguistic annotations. To reduce memory use, spaCy stores string attributes as integer hash values, so an attribute such as Token.lemma returns an integer; appending an underscore (e.g., .lemma_ or .pos_) returns the readable string representation.
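The integer-versus-string distinction can be seen in a minimal sketch. To keep it self-contained, the annotations below are set by hand on a blank English pipeline (an assumption made for illustration); in normal use a trained pipeline would predict them. The sentence and its labels are hypothetical.

```python
import spacy
from spacy.tokens import Doc

# A blank English pipeline supplies a vocabulary without downloading a model.
nlp = spacy.blank("en")

# Hand-set annotations (illustrative assumption, not model output).
doc = Doc(
    nlp.vocab,
    words=["She", "was", "reading"],
    pos=["PRON", "AUX", "VERB"],
    tags=["PRP", "VBD", "VBG"],
    lemmas=["she", "be", "read"],
    morphs=[
        "Case=Nom|Number=Sing|Person=3",
        "Number=Sing|Tense=Past",
        "Aspect=Prog|VerbForm=Part",
    ],
)

for token in doc:
    # Attributes without an underscore are integer IDs;
    # the underscore variants are the readable strings.
    print(token.text, token.pos, token.pos_, token.tag, token.tag_, token.lemma_)

reading = doc[2]
# The vocabulary's StringStore converts between IDs and strings in both directions.
tag_roundtrip = nlp.vocab.strings[reading.tag]
# Morphological features are read from Token.morph; get() returns a list of values.
tense = doc[1].morph.get("Tense")
```

Comparing the integer attributes is cheap, which is why the underscore forms are mainly useful for display and debugging.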
- Part-of-Speech (POS) Tagging: A trained statistical model predicts each token's tag from its context. Fine-grained tags (Token.tag) carry detailed, language-specific information, while coarse-grained tags (Token.pos) give general categories such as noun or verb.
- Morphological Analysis:
  - Inflectional Morphology: Modifying a root form (lemma) with prefixes or suffixes that specify its grammatical function without changing its POS.
  - Morphologizer: A statistical component that assigns morphological features (Token.morph) and coarse-grained POS tags.
  - Rule-based approach: Used for languages with relatively simple morphological systems, such as English; it maps the token text and fine-grained tags to coarse-grained tags and morphological features.
- Lemmatization: Reduces each word to its base form (lemma). spaCy offers three methods:
  - Lookup: Maps surface forms to lemmas via a table, without reference to POS or context (requires spacy-lookups-data).
  - Rule-based: Applies language-specific rules and exception files (e.g., derived from WordNet for English), guided by the previously assigned POS and morphology.
  - Trainable (EditTreeLemmatizer): Learns form-to-lemma transformations from a corpus with lemma annotations, and in many cases achieves higher accuracy than lookup and rule-based methods.
- Dependency Parsing:
  - Analyzes syntactic relationships between words, represented as a tree of "heads" and "children."
  - Noun Chunks: Identifies "base noun phrases" (a noun plus the words describing it) via Doc.noun_chunks.
  - Navigation: Provides an API to traverse the dependency tree, check for annotations (doc.has_annotation("DEP")), and extract syntactic relations.
- Data Management: Lemmatization tables and lookup data are distributed via the spacy-lookups-data package.
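The lookup method listed under Lemmatization can be sketched as follows. This is a minimal example rather than a recommended setup: the three-entry table is hypothetical and stands in for the much larger tables shipped in spacy-lookups-data, so the sketch runs without that package installed.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")

# "lookup" mode maps each surface form to a lemma through a table,
# without consulting part-of-speech or context.
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Hypothetical toy table; real tables come from spacy-lookups-data.
lookups = Lookups()
lookups.add_table(
    "lemma_lookup",
    {"was": "be", "reading": "read", "papers": "paper"},
)
lemmatizer.initialize(lookups=lookups)

doc = nlp("I was reading the papers")
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```

Forms that are missing from the table fall back to the token text, which is why a lookup lemmatizer is only as good as its table. The rule-based and trainable modes trade that simplicity for the use of POS, morphology, or learned transformations.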
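The dependency-parsing items above (heads and children, Doc.noun_chunks, and doc.has_annotation("DEP")) can be illustrated on a small hand-annotated Doc. The sentence, tags, and arcs below are assumptions chosen for the sketch, not the output of a trained parser; with a trained pipeline the same attributes would be filled in automatically.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built parse: heads are absolute token indices, and the root points to itself.
doc = Doc(
    nlp.vocab,
    words=["Autonomous", "cars", "shift", "insurance", "liability"],
    pos=["ADJ", "NOUN", "VERB", "NOUN", "NOUN"],
    heads=[1, 2, 2, 4, 2],
    deps=["amod", "nsubj", "ROOT", "compound", "dobj"],
)

# A Doc produced by the tokenizer alone carries no dependency annotation.
parsed = doc.has_annotation("DEP")
unparsed = nlp("Autonomous cars").has_annotation("DEP")

# Each token has exactly one head; children are the tokens attached to it.
arcs = [(token.text, token.dep_, token.head.text) for token in doc]
root = next(token for token in doc if token.head.i == token.i)
root_children = [child.text for child in root.children]

# Base noun phrases, derived from the POS tags and dependency labels.
chunks = [chunk.text for chunk in doc.noun_chunks]

# Extract a syntactic relation: nominal subjects attached to a verb.
subjects = [t.text for t in doc if t.dep_ == "nsubj" and t.head.pos_ == "VERB"]

print(arcs)
print(chunks)
```

Checking has_annotation("DEP") before iterating over noun chunks avoids an error on documents that were never parsed.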