Linguistic Features · spaCy Usage Documentation

spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.

Overview

Added on

March 17, 2026

Subjects and fields

computer-science-advanced · natural-language-processing-nlp

Grade range

Grade 9 (high school year 1) – Grade 12 (high school year 4)

Page type

Article

Summary

spaCy Linguistic Features Overview

spaCy processes raw text into Doc objects containing rich linguistic annotations. To reduce memory usage, spaCy encodes string attributes as hash values; appending an underscore to the attribute name (e.g., .pos_) returns the readable string representation.
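
The difference between the integer attribute and its underscore counterpart can be seen in a minimal sketch. It assumes spaCy v3 is installed and uses a blank English pipeline with hand-written tags, so no trained model is needed; the two-word sentence and its annotations are illustrative assumptions, not model output.

```python
import spacy
from spacy.tokens import Doc

# Blank pipeline: provides a vocabulary only, no trained components.
nlp = spacy.blank("en")

# Annotations are supplied by hand for illustration;
# a trained pipeline would predict them from context.
doc = Doc(
    nlp.vocab,
    words=["She", "runs"],
    pos=["PRON", "VERB"],   # coarse-grained tags
    tags=["PRP", "VBZ"],    # fine-grained tags
)

token = doc[1]
print(token.pos, token.pos_)  # integer ID, then readable string
print(token.tag, token.tag_)  # hash value, then readable string

# The hash resolves back to the string through the shared StringStore.
resolved = nlp.vocab.strings[token.tag]
print(resolved)
```

The integer forms are what spaCy stores internally; the underscore forms are looked up on demand, so both always refer to the same annotation.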

  • Part-of-Speech (POS) Tagging: A trained statistical model predicts each token's tag from its context. Fine-grained tags (Token.tag) carry detailed, language-specific information, while coarse-grained tags (Token.pos) give general word categories.
  • Morphological Analysis:
    • Inflectional Morphology: The process of modifying a word's root form (lemma) with prefixes or suffixes that specify its grammatical function without changing its part of speech.
    • Morphologizer: A statistical component that assigns morphological features (Token.morph) and coarse-grained POS tags (Token.pos).
    • Rule-based approach: Used for languages with relatively simple morphological systems; it maps the token text and fine-grained tags to coarse-grained tags and morphological features.
  • Lemmatization: The process of reducing words to their base form (lemma). spaCy offers three methods:
    • Lookup: Maps surface forms to lemmas via tables (requires spacy-lookups-data).
    • Rule-based: Uses language-specific rules and exception files (e.g., WordNet for English) based on POS/morphology.
    • Trainable (EditTreeLemmatizer): Learns form-to-lemma transformations from a training corpus, which removes the need for hand-written rules and in many cases achieves higher accuracy than lookup and rule-based methods.
  • Dependency Parsing:
    • Analyzes syntactic relationships between words, represented as a tree of "heads" and "children."
    • Noun Chunks: Identifies "base noun phrases" (a noun plus its descriptors) via Doc.noun_chunks.
    • Navigation: Provides an API to traverse the dependency tree, check for annotations (doc.has_annotation("DEP")), and extract syntactic relations.
  • Data Management: Lemmatization tables and lookup data are distributed via the spacy-lookups-data package.
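
The annotations listed above can be read from a single Doc, as in the following sketch. It assumes spaCy v3 and, to avoid depending on a downloaded model, builds the Doc by hand: the sentence, tags, lemmas, morphological features, heads and dependency labels are all illustrative assumptions that a trained pipeline would normally predict.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: supplies the vocabulary and the English
# noun-chunk rules, but no trained components.
nlp = spacy.blank("en")

# Hand-written annotations (assumptions for illustration only).
doc = Doc(
    nlp.vocab,
    words=["Autonomous", "cars", "shift", "insurance", "liability"],
    pos=["ADJ", "NOUN", "VERB", "NOUN", "NOUN"],
    lemmas=["autonomous", "car", "shift", "insurance", "liability"],
    morphs=["Degree=Pos", "Number=Plur", "Tense=Pres|VerbForm=Fin",
            "Number=Sing", "Number=Sing"],
    heads=[1, 2, 2, 4, 2],  # absolute index of each head; the root points to itself
    deps=["amod", "nsubj", "ROOT", "compound", "dobj"],
)

# Check that dependency annotation is present before relying on it.
print(doc.has_annotation("DEP"))

# One row per token: text, lemma, coarse tag, morphology, relation, head.
for token in doc:
    print(token.text, token.lemma_, token.pos_, str(token.morph),
          token.dep_, token.head.text)

# Navigate the tree: the root is the token that is its own head.
root = [t for t in doc if t.head == t][0]
children = [child.text for child in root.children]

# Extract a simple syntactic relation: each subject with its governing verb.
subject_verb = [(t.text, t.head.text) for t in doc if t.dep_ == "nsubj"]

# Base noun phrases, derived from the POS tags and dependency labels.
chunks = [chunk.text for chunk in doc.noun_chunks]
print(root.text, children, subject_verb, chunks)
```

With a trained pipeline loaded through spacy.load, the same attribute and navigation calls apply unchanged; only the manual Doc construction would be replaced by calling the pipeline on raw text.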
