but no one tool can wrangle arbitrary data.


However, search engines and associated Apache Software Foundation projects such as Nutch and Tika are purpose-built to ingest thousands of formats for search consumption. The interesting engineering test would be to use Solr/Tika/Nutch/Akka to land data in JSON, or in another format that data science tools can consume.
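
As a rough illustration of what "landing data in JSON" might look like, here is a minimal Python sketch that sends each file to a Tika Server instance and writes one JSON record per line. It assumes a Tika Server is already running locally on its default port (9998) and uses its `/rmeta/text` endpoint; the function names `extract_with_tika` and `land_as_jsonl`, and the record layout with `source` and `fields` keys, are hypothetical choices for this example, not part of any of the projects named above.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable, Iterable

# Assumption: a Tika Server instance is listening here (9998 is its default port).
TIKA_RMETA_URL = "http://localhost:9998/rmeta/text"


def extract_with_tika(path: Path, url: str = TIKA_RMETA_URL) -> dict:
    """Send one file to Tika Server and return the fields of the top-level document."""
    request = urllib.request.Request(
        url,
        data=path.read_bytes(),
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # /rmeta returns a JSON list: the first entry describes the file itself,
        # later entries describe embedded resources such as attachments.
        documents = json.loads(response.read().decode("utf-8"))
    return documents[0] if documents else {}


def land_as_jsonl(
    paths: Iterable[Path],
    out_path: Path,
    extract: Callable[[Path], dict] = extract_with_tika,
) -> int:
    """Write one JSON object per input file to out_path; return the record count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            # Hypothetical record layout: where the data came from, plus
            # whatever fields the extractor produced.
            record = {"source": str(path), "fields": extract(Path(path))}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A call such as `land_as_jsonl(Path("corpus").rglob("*.pdf"), Path("corpus.jsonl"))` would then produce a JSON Lines file, which tools like pandas can load with `pandas.read_json("corpus.jsonl", lines=True)`. The extractor is passed in as a parameter so that a crawler, Solr export, or other source could be swapped in without changing the output format.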