Kevin M. Ford has self-archived his M.S. thesis, "The Application of File Identification, Validation, and Characterization Tools in Digital Curation," in IDEALS.
Here's an excerpt:
File format identification, characterization, and validation are considered essential processes for digital preservation and, by extension, long-term data curation. These actions are performed on data objects by humans or computers, in an attempt to identify the type of a given file, derive characterizing information that is specific to the file, and validate that the given file conforms to its type specification. The present research reviews the literature surrounding these digital preservation activities, including their theoretical basis and the publications that accompanied the formal release of tools and services designed in response to their theoretical foundation. It also reports the results from extensive tests designed to evaluate the coverage of some of the software tools developed to perform file format identification, characterization, and validation actions. Tests of these tools demonstrate that more work is needed – particularly in terms of scalable solutions – to address the expanse of digital data to be preserved and curated. The breadth of file types these tools are anticipated to handle is so great as to call into question whether a scalable solution is feasible, and, more broadly, whether such efforts will offer a meaningful return on investment. Also, these tools, which serve to provide a type of baseline reading of a file in a repository, can be easily tricked. It is possible to generate files with nothing more than a proper file extension and correct magic number and have the tools "positively" identify the file. This is not the same as a file that conforms to its specification, and one that could be considered valid. The ability to manipulate the results returned by these tools raises issues of identity, trust, security and risk.
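To make the excerpt's last point concrete, here is a minimal Python sketch of the gap between identification and validation. It is an illustration only and does not reproduce the logic of any tool evaluated in the thesis. PNG is chosen here purely as a convenient example format, and the function names (identify_by_magic, validate_structure, minimal_png) are hypothetical. The "validation" shown is deliberately partial: it checks chunk framing and checksums, not full conformance to the format specification.

```python
import struct
import zlib

# The 8-byte PNG signature ("magic number").
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def identify_by_magic(data: bytes) -> bool:
    """Naive identification: look only at the leading signature bytes."""
    return data.startswith(PNG_MAGIC)


def iter_chunks(data: bytes):
    """Yield (chunk_type, payload, crc_ok) for each chunk after the signature.

    Raises ValueError if a chunk header or body runs past the end of the data.
    """
    pos = len(PNG_MAGIC)
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4  # header + payload + 4-byte CRC
        if end > len(data):
            raise ValueError("truncated chunk body")
        payload = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        actual_crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
        yield ctype, payload, stored_crc == actual_crc
        pos = end


def validate_structure(data: bytes) -> bool:
    """Partial structural check: signature, IHDR first, good CRCs, IEND last."""
    if not identify_by_magic(data):
        return False
    try:
        chunks = list(iter_chunks(data))
    except ValueError:
        return False
    if not chunks or not all(crc_ok for _, _, crc_ok in chunks):
        return False
    first_type, first_payload, _ = chunks[0]
    return (
        first_type == b"IHDR"
        and len(first_payload) == 13
        and chunks[-1][0] == b"IEND"
    )


def make_chunk(ctype: bytes, payload: bytes) -> bytes:
    """Frame a payload as a PNG chunk: length, type, payload, CRC."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


def minimal_png() -> bytes:
    """Build a 1x1, 8-bit grayscale PNG for comparison."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # one filter byte plus one pixel
    return (
        PNG_MAGIC
        + make_chunk(b"IHDR", ihdr)
        + make_chunk(b"IDAT", idat)
        + make_chunk(b"IEND", b"")
    )


# A "file" that is nothing more than a correct magic number plus junk.
spoofed = PNG_MAGIC + b"not really an image"
```

With these definitions, identify_by_magic(spoofed) returns True while validate_structure(spoofed) returns False, whereas the output of minimal_png() passes both. A repository that records only the first result would accept the spoofed file as a PNG, which is the trust problem the excerpt describes.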