SameDiff is a beginner tool for text mining. It is a web-based tool that compares two texts and generates lists of the words they share and the words unique to each. It also uses the cosine similarity algorithm to rate how similar the documents are.
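For readers curious about what such a comparison involves, the following is a minimal Python sketch, not SameDiff's actual code. It assumes simple lowercase word tokenization and raw word counts as the document vectors; the function names `tokenize` and `compare_texts` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def compare_texts(text_a, text_b):
    """Return common words, words unique to each text, and cosine similarity."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    # Word lists: shared vocabulary and vocabulary unique to each text.
    common = sorted(set(counts_a) & set(counts_b))
    only_a = sorted(set(counts_a) - set(counts_b))
    only_b = sorted(set(counts_b) - set(counts_a))

    # Cosine similarity: dot product of the count vectors divided by
    # the product of their lengths. Only shared words contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in common)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    similarity = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    return common, only_a, only_b, similarity


if __name__ == "__main__":
    common, only_a, only_b, score = compare_texts("the cat sat", "the cat ran")
    print(common, only_a, only_b, round(score, 3))
```

A score near 1 means the two texts use words in very similar proportions, and a score of 0 means they share no words at all.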
Google Ngram Viewer is a beginner tool for text mining. It is a web-based tool where users can search for a word or phrase across the Google Books corpus. The tool returns a chart of occurrences of the word or phrase over time.
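The idea of tracking a word over time can be tried on a small scale. The sketch below is not how the Ngram Viewer itself is implemented; it assumes a tiny hypothetical corpus stored as a dictionary of year to texts, splits words on whitespace, and divides each year's count by that year's total word count. The function name `yearly_frequency` is made up for this example.

```python
def yearly_frequency(corpus, word):
    """Return {year: share of that year's words equal to `word`}.

    `corpus` is a hypothetical structure: a dict mapping a year to a
    list of text strings published in that year.
    """
    target = word.lower()
    result = {}
    for year in sorted(corpus):
        # Flatten all texts for the year into one list of lowercase words.
        tokens = [t for text in corpus[year] for t in text.lower().split()]
        result[year] = tokens.count(target) / len(tokens) if tokens else 0.0
    return result


if __name__ == "__main__":
    toy_corpus = {
        1900: ["the whale swam"],
        1950: ["the whale and the whale"],
    }
    for year, share in yearly_frequency(toy_corpus, "whale").items():
        print(year, round(share, 3))
```

Plotting the resulting values by year would give a chart similar in spirit to the one the tool returns.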
MALLET (Machine Learning for Language Toolkit) is a tool for intermediate to advanced text mining. It is a command-line tool for natural language processing, document classification, clustering, topic modeling, and information extraction.
The Georgetown University Library is a member of HathiTrust, a partnership of academic and research institutions, offering a collection of millions of titles digitized from libraries around the world. Georgetown users can log into HathiTrust's digital library to download millions of titles for free.
The University of Oxford Text Archive (OTA) is a repository of digital literary and linguistic resources for research and teaching. Users can download texts as XML or plain text files, or view them in Voyant through the direct link provided in each entry of the OTA catalog.