Philipp Meschenmoser, Norman Meuschke, Manuel Hotz, and Bela Gipp have published "Scraping Scientific Web Repositories: Challenges and Solutions for Automated Content Extraction" in D-Lib Magazine.
Here's an excerpt:
Many researchers are interested in accessing the underlying scientometric raw data to increase the transparency of these systems. In this paper, we discuss the challenges and present strategies to programmatically access such data in scientific Web repositories. We demonstrate the strategies as part of an open source tool (MIT license) that allows research performance comparisons based on Google Scholar data.