Best Practices for New Language-Based Datasets

I'm planning to create a dataset of government documents that were previously published on paper (and from a published selection drawn from archives at that).

These would be things like proclamations, telegrams, receipts, etc.

I'm doing this as practice and it's a first attempt, so some basic questions:

Is JSON preferred, or some other format?

For any annotations, what would be the best practice? Keep a single “clean” dataset with no notes, or keep one “clean” version and a second one with annotations?
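To make the "clean" versus annotated question concrete, here is a rough sketch of one possible layout, not a recommendation from any standard: the clean text is stored once, one JSON object per line, and the notes live separately as stand-off annotations that point into the text by character offsets. Every field name (`id`, `type`, `text`, `doc_id`, `start`, `end`, `note`), both helper functions, and the sample telegram text are made up for illustration.

```python
import json

# Hypothetical clean record; the field names and the text are invented.
clean_record = {
    "id": "doc-0001",
    "type": "telegram",
    "text": "Arrived safely. Send supplies at once.",
}

# Stand-off annotations: they refer to the clean text by character
# offsets, so the clean text itself is never modified.
annotations = [
    {"doc_id": "doc-0001", "start": 16, "end": 29,
     "note": "request for supplies"},
]


def to_jsonl(records):
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def annotated_view(record, notes):
    """Build an annotated copy of a record without touching the original."""
    spans = [
        dict(n, quote=record["text"][n["start"]:n["end"]])
        for n in notes
        if n["doc_id"] == record["id"]
    ]
    return dict(record, annotations=spans)


view = annotated_view(clean_record, annotations)
print(to_jsonl([clean_record]))
print(view["annotations"][0]["quote"])
```

With this kind of split, the annotated version can be regenerated from the clean file plus the annotation file whenever needed, so there is only one copy of the text to keep correct.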

The data would be used for language and historical research.

submitted by /u/Books_Of_Jeremiah