Apache Spark natural language processing library


This is an excellent community blog post from the engineering team at John Snow Labs, explaining their contribution to an open-source Natural Language Processing (NLP) library for Apache Spark. Apache Spark is a general-purpose cluster computing framework with native support for distributed SQL, streaming, graph processing, and machine learning.

The NLP Library is written in Scala with no dependencies on other NLP or ML libraries. It natively extends the Spark ML Pipeline API. It comes out of the box with:

  • Tokenizer, Normalizer, Stemmer, Lemmatizer
  • Entity Extractor, Date Extractor, Part of Speech Tagger, Spell Checker
  • Named Entity Recognition, Sentence Boundary Detection, Sentiment Analysis
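
To give a feel for what "pipeline of stages" means here, the following is a minimal plain-Python sketch of the idea: each stage reads one column and writes a new one, and stages are applied in order. It is only a conceptual illustration. The names `Stage`, `run_pipeline`, `tokenizer`, and `normalizer` are hypothetical, and the code uses neither Spark nor the library's actual API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Stage:
    """Hypothetical stand-in for a pipeline stage: reads one column, writes another."""
    input_col: str
    output_col: str
    fn: Callable[[object], object]

    def transform(self, rows: List[Row]) -> List[Row]:
        # Build new rows with the output column added; input rows are not mutated.
        return [{**row, self.output_col: self.fn(row[self.input_col])} for row in rows]


def run_pipeline(stages: List[Stage], rows: List[Row]) -> List[Row]:
    # Apply each stage in order, feeding its output rows to the next stage.
    for stage in stages:
        rows = stage.transform(rows)
    return rows


# Toy stages (assumptions for illustration, far simpler than real annotators).
tokenizer = Stage("text", "tokens", lambda text: re.findall(r"[A-Za-z']+", text))
normalizer = Stage("tokens", "normalized", lambda tokens: [t.lower() for t in tokens])

rows = run_pipeline(
    [tokenizer, normalizer],
    [{"text": "Spark makes NLP Pipelines simple."}],
)
print(rows[0]["normalized"])  # ['spark', 'makes', 'nlp', 'pipelines', 'simple']
```

Because every stage only adds a column, later stages can consume the output of any earlier one, which is the property that lets NLP steps and general machine learning steps share a single pipeline.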

In addition, given the tight integration with Spark ML, the rest of Spark ML is available right away when building your NLP pipelines. The post is detailed and technical, with accompanying charts and schemas. Great article!


Tags big-data data-science