This is a hands-on workshop on extracting and using semantic topics from large collections of natural-language texts.
By the end, participants will have built an application for efficiently processing, indexing and querying the entire English Wikipedia, using wondrous Python tools.
The workshop assumes knowledge of intermediate Python concepts (classes, generators, iterators).
- 20 min: motivation & dataset: the English Wikipedia
- 30 min: NLP: tokenization, lemmatization (textblob); see the first sketch after this agenda
- 60 min: topic modeling (gensim); see the second sketch below
- 60 min: document indexing, querying, parallelization (gensim); see the third sketch below
- 10 min: cushion/extra: "Wikipedia similarity" web app (flask)
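For the NLP block, here is a minimal sketch of what tokenization and lemmatization look like with TextBlob. The sample sentence is made up for illustration; in the workshop the same calls run over Wikipedia article text instead.

```python
# Minimal TextBlob tokenization/lemmatization sketch (toy sentence, not Wikipedia).
# TextBlob needs its NLTK corpora first: python -m textblob.download_corpora
from textblob import TextBlob

blob = TextBlob("The quick brown foxes were jumping over the lazy dogs.")
print(blob.words)                                  # tokenized words
print([word.lemmatize() for word in blob.words])   # lemmas (noun lemmatization by default)
```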
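For the topic-modeling block, a sketch of the basic gensim workflow on a toy, in-memory corpus of pre-tokenized texts (the texts and the number of topics are arbitrary examples). The workshop streams Wikipedia articles from disk rather than keeping them in a Python list, but the dictionary / bag-of-words / model steps are the same.

```python
# Minimal gensim topic-modeling sketch on a toy corpus of pre-tokenized texts.
from gensim import corpora, models

texts = [["human", "computer", "interaction"],
         ["graph", "trees", "computer"],
         ["graph", "minors", "trees"]]

dictionary = corpora.Dictionary(texts)                   # token -> integer id mapping
corpus = [dictionary.doc2bow(text) for text in texts]    # bag-of-words vectors

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)  # train LDA
print(lda.print_topics())                                # inspect the learned topics
```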
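For the indexing and querying block, a sketch of gensim similarity queries against the same kind of toy corpus. The in-memory MatrixSimilarity index shown here is for illustration only; at Wikipedia scale the workshop covers gensim's disk-backed similarity indexes and multicore/distributed training instead.

```python
# Minimal gensim similarity-query sketch; reuses the toy dictionary/corpus pattern above.
from gensim import corpora, models, similarities

texts = [["human", "computer", "interaction"],
         ["graph", "trees", "computer"],
         ["graph", "minors", "trees"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corpus)                        # weight terms by TF-IDF
index = similarities.MatrixSimilarity(tfidf[corpus])     # dense in-memory index (toy only)

query = dictionary.doc2bow("human computer graph".split())
sims = index[tfidf[query]]                               # cosine similarity to each document
print(sorted(enumerate(sims), key=lambda pair: -pair[1]))
```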
Install beforehand: IPython, NumPy, SciPy, TextBlob and Gensim, plus optionally Flask and Angular for the web app (all free, open-source software).
Linux, OS X and Windows are all fine.