Linguistic Features · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.
Overview

Indexed on
March 17, 2026
Subject and field
computer-science-advanced · natural-language-processing-nlp
Grade range
Grade 9 to Grade 12 (high school years 1 to 4)
Page type
Article
Introduction
spaCy Linguistic Features Overview
spaCy processes raw text into Doc objects containing rich linguistic annotations. To optimize memory, spaCy stores attributes as hash values; appending an underscore (e.g., .pos_) retrieves the readable string representation.
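The hash-versus-string distinction can be illustrated with a minimal sketch. It assumes spaCy v3 is installed and builds a tiny Doc by hand, so the POS labels are hand-assigned for illustration rather than predicted by a trained pipeline.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a two-token Doc manually; no trained model is required.
# The POS values are illustrative assumptions, not model output.
doc = Doc(Vocab(), words=["She", "runs"], pos=["PRON", "VERB"])
token = doc[1]

pos_id = token.pos      # integer ID stored on the token
pos_label = token.pos_  # readable string form of the same attribute

# The vocab's StringStore maps in both directions between a hash and its string.
text_hash = token.orth
text_from_hash = doc.vocab.strings[text_hash]
hash_from_text = doc.vocab.strings["runs"]
```

Here pos_id is an integer while pos_label is the string "VERB", and looking up text_hash in the StringStore gives back the token text.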
- Part-of-Speech (POS) Tagging: Uses statistical models to predict tags based on context. Fine-grained tags (Token.tag) provide detailed morphological info, while coarse-grained tags (Token.pos) provide general categories.
- Morphological Analysis:
  - Inflectional Morphology: Modifying a root (lemma) with prefixes/suffixes to change grammatical function without changing the POS.
  - Morphologizer: A statistical component that assigns morphological features (Token.morph) and coarse-grained POS tags.
  - Rule-based approach: Used for languages with simpler morphological systems, mapping fine-grained tags to coarse-grained tags and features.
- Lemmatization: The process of reducing words to their base form (lemma). spaCy offers three methods:
  - Lookup: Maps surface forms to lemmas via tables (requires spacy-lookups-data).
  - Rule-based: Uses language-specific rules and exception files (e.g., WordNet for English) based on POS/morphology.
  - Trainable (EditTreeLemmatizer): Learns form-to-lemma transformations from a corpus with lemma annotations, which in many cases can achieve higher accuracy than lookup or rule-based methods.
- Dependency Parsing:
  - Analyzes syntactic relationships between words, represented as a tree of "heads" and "children."
  - Noun Chunks: Identifies "base noun phrases" (a noun plus the words describing it) via Doc.noun_chunks.
  - Navigation: Provides an API to traverse the dependency tree, check for annotations (doc.has_annotation("DEP")), and extract syntactic relations.
- Data Management: Lemmatization tables and lookup data are distributed via the spacy-lookups-data package.
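The attributes listed above can be explored with a short sketch. It assumes spaCy v3 and uses a blank English pipeline with a hand-annotated Doc, so every tag, lemma, head, and dependency label below is written by hand for illustration; a trained pipeline would predict these values instead.

```python
import spacy
from spacy.tokens import Doc

# A blank English pipeline supplies the vocab and the English noun-chunk
# rules without downloading a trained model.
nlp = spacy.blank("en")

words = ["Autonomous", "cars", "shift", "insurance", "liability",
         "toward", "manufacturers"]

# Hand-written annotations (assumptions for this example, not model output).
doc = Doc(
    nlp.vocab,
    words=words,
    tags=["JJ", "NNS", "VBP", "NN", "NN", "IN", "NNS"],          # fine-grained
    pos=["ADJ", "NOUN", "VERB", "NOUN", "NOUN", "ADP", "NOUN"],  # coarse-grained
    lemmas=["autonomous", "car", "shift", "insurance", "liability",
            "toward", "manufacturer"],
    heads=[1, 2, 2, 4, 2, 2, 5],  # absolute index of each token's head
    deps=["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"],
)
doc[1].set_morph("Number=Plur")  # morphological features for "cars"

# Token-level annotations: tag, POS, lemma, and one morphological feature.
cars = doc[1]
token_view = (cars.tag_, cars.pos_, cars.lemma_, cars.morph.get("Number"))

# Dependency navigation: the root is the token that is its own head.
root = [token for token in doc if token.head == token][0]
children = [child.text for child in root.children]

# Noun chunks are derived from the POS and dependency annotations.
chunks = [chunk.text for chunk in doc.noun_chunks]
parsed = doc.has_annotation("DEP")
```

With these hand-set annotations, the root is "shift", its children are "cars", "liability", and "toward", and the noun chunks are "Autonomous cars", "insurance liability", and "manufacturers". Replacing the manual Doc with the output of a trained pipeline leaves the navigation code unchanged.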