I work as a QA Engineer at a "stealth mode" startup building a network storage appliance. I am looking for "real world" datasets to load into our appliance so we can profile the product's performance and scalability under different schema models populated with real-world distributions of data. I envision two significantly different datasets. One is a "flat file" schema, such as historical or logging data from Web Server Access and Error logs. The other would be a relational (preferably star schema) database, such as a reservation or inventory-control database.
The data doesn't need to be current, and it can be scrubbed to remove "real" data. It won't be used outside the QA lab. Again, the goal is to test how the product behaves when data that lives in the outside world is loaded into it.
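On the scrubbing point: one approach that preserves the real-world shape of access-log data is deterministic pseudonymization, where each client IP always maps to the same fake IP, so per-client frequency distributions (hot clients vs. one-off visitors) survive the scrub. Below is a minimal sketch, assuming Common Log Format input; the function names and salt are hypothetical, not from any particular tool.

```python
import hashlib
import re

# Matches the leading client IP of a Common Log Format line.
IP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\b")

def pseudonymize_ip(ip: str, salt: str = "qa-lab") -> str:
    # Hash the IP with a salt, then map the first 4 digest bytes back
    # into dotted-quad form so downstream parsers still accept the line.
    # Same input IP -> same output IP, so distributions are preserved.
    digest = hashlib.sha256((salt + ip).encode()).digest()
    return ".".join(str(b) for b in digest[:4])

def scrub_line(line: str) -> str:
    # Swap only the leading client IP; timestamps, request paths, and
    # status codes are left intact so the schema and data shape of the
    # log are unchanged for load testing.
    return IP_RE.sub(lambda m: pseudonymize_ip(m.group(1)), line)
```

Run over a log file line by line, this keeps the dataset's volume and access patterns realistic while removing the identifying values.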
Ultimately, I am looking for 2 to 10 terabytes of composite data by the end of the project.