Want to learn more? Take the full course at https://learn.datacamp.com/courses/ad... at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
Hi, I'm Ines! I'm one of the core developers of spaCy, a popular library for advanced Natural Language Processing in Python.
In this video, we'll take a look at the most important concepts of spaCy and how to get started.
At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".
For example, to create an English nlp object, you can import the English language class from spacy dot lang dot en and instantiate it. You can use the nlp object like a function to analyze text.
It contains all the different components in the pipeline.
It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages that are available in spacy dot lang.
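As a minimal sketch of what this looks like in code, assuming spaCy is installed: the English class below creates a blank pipeline, so no trained model needs to be downloaded.

```python
# Import the English language class from spacy.lang.en
from spacy.lang.en import English

# Instantiate it to create the nlp object
nlp = English()

# The nlp object knows its language and the components in its pipeline.
# A blank English pipeline has only the tokenizer, so no components are listed.
print(nlp.lang)
print(nlp.pipe_names)
```

Other languages work the same way, with a different class from spacy.lang.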
When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.
By the way, the Doc behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index. But more on that later!
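Here is a small sketch of creating a Doc and iterating over it. The example text "Hello world!" is just an illustration, and the blank English pipeline is an assumption to keep the snippet self-contained.

```python
from spacy.lang.en import English

nlp = English()

# Calling nlp on a string of text creates a Doc object
doc = nlp("Hello world!")

# Iterate over the tokens in the Doc
for token in doc:
    print(token.text)

# The original text is preserved on the Doc, including whitespace
print(doc.text)
```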
Token objects represent the tokens in a document – for example, a word or a punctuation character.
To get a token at a specific position, you can index into the Doc.
Token objects also provide various attributes that let you access more information about the tokens. For example, the dot text attribute returns the verbatim token text.
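Indexing into a Doc could look like the sketch below, again using an illustrative "Hello world!" text and a blank English pipeline.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

# Index into the Doc to get a single Token (positions start at 0)
token = doc[1]

# Get the verbatim token text via the .text attribute
print(token.text)
```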
A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.
To create a Span, you can use Python's slice notation. For example, 1 colon 3 will create a slice starting from the token at position 1, up to – but not including! – the token at position 3.
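A short sketch of slicing a Doc into a Span, using the same illustrative text. The slice 1:3 covers the tokens at positions 1 and 2.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

# A slice of the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

# The Span is a view: it points back to the Doc it was sliced from
print(span.doc is doc)
```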
Here you can see some of the available token attributes:
"i" is the index of the token within the parent document.
"text" returns the token text.
"is alpha", "is punct" and "like num" return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation, or whether it resembles a number. For example, "like num" is true for a token "10" (one, zero) or for the word "ten" (T, E, N).
These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.
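To make these attributes concrete, here is a hedged sketch that collects them for each token. The sentence "It costs $5." is just an example text, and the blank English pipeline is assumed so that no model download is needed.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("It costs $5.")

# Collect the lexical attributes for every token in the Doc
index = [token.i for token in doc]
text = [token.text for token in doc]
is_alpha = [token.is_alpha for token in doc]
is_punct = [token.is_punct for token in doc]
like_num = [token.like_num for token in doc]

print("Index:   ", index)
print("Text:    ", text)
print("is_alpha:", is_alpha)
print("is_punct:", is_punct)
print("like_num:", like_num)

# like_num also recognizes spelled-out numbers such as "ten"
ten = nlp("ten")[0]
print(ten.text, ten.like_num)
```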
Let's see this in action and process your first text with spaCy.