Spring 2020 elective: CS 496: Special topics (Introduction to data science)

Description: An introduction to data science from a computer science perspective. Includes an introduction to the programming language used in the course, numerical computation, essential probability and Bayesian statistics concepts, structured and unstructured text and data processing, data and information visualization, machine learning using examples such as Naive Bayes, regression, decision trees, clustering, mixture models and topic modeling.

Prerequisites: Completion of at least two 300-level courses in computer science.

Textbook: Python Data Science Handbook: Essential tools for working with data. Jake VanderPlas. O'Reilly. ISBN-13: 978-1491912058.

Programming language: Python is a popular general-purpose language and is especially widely used in data science, since it has many libraries and support tools, including interfaces to non-Python libraries. The course uses Python 3.x and covers some distinctive differences encountered when moving from Python 2.x to Python 3.x, as support for Python 2.x is to be discontinued in early 2020.
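
As a rough illustration of the kind of Python 2.x to Python 3.x differences mentioned above, the sketch below shows a few well-known ones. The selection is an assumption made for illustration and is not taken from the course materials; the variable names are hypothetical.

```python
# A few well-known Python 3 behaviors that differ from Python 2.
# Illustrative selection only; not a list from the course itself.

# 1. print is a function rather than a statement.
print("hello", "world", sep=", ")

# 2. "/" on two integers is true division; "//" is floor division.
true_div = 7 / 2     # Python 2 would have truncated this to an integer
floor_div = 7 // 2

# 3. Text (str) and binary data (bytes) are distinct types,
#    and conversion between them is explicit.
text = "naïve"
data = text.encode("utf-8")
round_trip = data.decode("utf-8")

# 4. range() produces a lazy sequence instead of building a list.
lazy = range(3)
as_list = list(lazy)

print(true_div, floor_div, type(data).__name__, as_list)
```

Running the block under Python 3 shows each behavior directly; the same source would either fail or behave differently under Python 2.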

Data science topics: Data science brings together many different areas, including computer science, mathematics, statistics, machine learning, data/information visualization, information technology and business. The general outline of the course topics is as follows.

Overview and introduction to data science
Python, IPython and Jupyter as computational environments for data scientists
Numerical computation using NumPy and SciPy
Essential concepts of probability and Bayesian statistics (rather than the frequentist statistics taught in most statistics courses, which are not as useful in data science)
Data manipulation and processing using Pandas
Structured and unstructured text/data processing and feature extraction
Topic modeling using Gensim
Data and information visualization using Matplotlib
Machine learning algorithms using Scikit-Learn, including:

Naive Bayes classification, linear and logistic regression
Support vector machines, decision trees and random forests
K-means clustering, Gaussian mixture models

Professor: Dr. Snyder (PhD, computer science, applied programming language theory) has recently spent ten years working in industry for various companies on complex structured and unstructured text, data and program analysis. This work included real estate (various data feeds, AWS), intellectual property forensics (cluster computing, non-trivial file comparisons, etc.) and patent application writing related to topic modeling and market prediction. His visualization work includes the financial printing industry and visualizations using PostScript/Ghostscript, SVG and Python with PIL and Matplotlib. Other related projects included sentiment analysis (of German comments), use of large data sets (Google patent database, Enron email database, USGS satellite mapping image data, etc.) and automatically inferring and categorizing characteristics of tabular data so that common probability distributions (binomial, Poisson, etc.) could be used to generate similar data for testing.

For more information, contact Dr. Snyder, rsnyder9@ycp.edu, KEC 115, ycp.powersoftwo.org.