Text and Data Mining

A guide for UWA staff and students on text and data mining

Open source resources

The following open data sources are available on the web and permit text and data mining.

| Data Source | Description | Further information and access |
| --- | --- | --- |
| arXiv | An open-access repository of electronic preprints in the fields of mathematics, physics, astronomy, electrical engineering and computer science. | arXiv provides bulk access to metadata and abstracts, as well as the arXiv API. Visit the arXiv Bulk Data Access page for more information. |
| CORE | A collection of open access research papers. | Provides access to raw data for text mining. |
| Europe PMC | An open science platform providing access to life science publications and preprints. | Visit the Europe PMC Developer resources page to get access to the RESTful API and bulk download tools. |
| Google Books | Contains an index of full-text books digitised by Google. | |
| HathiTrust Digital Library | Contains over 17 million digitised resources for scholarly research. | Visit Data Availability and APIs for information on bulk download options. |
| PLOS | Open access publisher in the fields of science and medicine. | PLOS provides several options for accessing its data via its Text and Data Mining page. See the API Display Policy for terms and conditions. |
| Project Gutenberg | Contains a library of free ebooks. | See the Project Gutenberg License and Permissions pages for information on what you can do with the data. |
| Wikidata | Contains free multilingual open data that can be read and edited by both humans and machines (Wikidata.org). | See the Wikidata: Data access page for bulk access to Wikidata content using the API. |
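
As an illustration of how one of these sources might be queried, the following is a minimal Python sketch that retrieves abstracts from the arXiv API. It assumes the public query endpoint at `http://export.arxiv.org/api/query`, which returns results as an Atom feed. The function names (`build_query_url`, `parse_feed`, `fetch_abstracts`) are hypothetical and chosen for this example only. Check the arXiv API terms of use and rate limits before running bulk queries.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed public endpoint of the arXiv API; results come back as an Atom feed.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by Atom feeds


def build_query_url(search_query, start=0, max_results=5):
    """Build a query URL (hypothetical helper, not part of any library)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_feed(xml_data):
    """Extract id, title and abstract from each entry of an Atom feed."""
    root = ET.fromstring(xml_data)
    records = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        summary = entry.findtext(ATOM + "summary", default="")
        records.append({
            "id": entry.findtext(ATOM + "id", default="").strip(),
            # Collapse the line breaks and indentation inside the feed text.
            "title": " ".join(title.split()),
            "abstract": " ".join(summary.split()),
        })
    return records


def fetch_abstracts(search_query, max_results=5):
    """Send one live request to the API and return the parsed records."""
    url = build_query_url(search_query, max_results=max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_feed(response.read())
```

Calling `fetch_abstracts("all:electron", max_results=3)` would send a single live request and return a list of dictionaries ready for further text analysis. Parsing is kept separate from fetching so that saved feeds can be processed without contacting the server again.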

 

CONTENT LICENCE

Except for logos, Canva designs, AI-generated images, or where otherwise indicated, content in this guide is licensed under a Creative Commons Attribution-ShareAlike 4.0 International Licence.