Thinking Of Open-Sourcing A 250k-Table Dataset, Would This Be Valuable?

I’ve been working on a company for about 3 years with my co-founder. Our original goal was to build an intelligent document processing tool: we had tried building a research co-pilot and found the available document processing services were poor. We got somewhat carried away and built a data-engine pipeline that reads in any LaTeX, cleans it, and converts it to an intermediate representation where we can apply any augmentation (color, alignment, spacing). However, this has been a massive undertaking (~200k lines of Python), and so far we have focused mostly on tables (the full-document path is written, but it’s not refined or production-ready).

Burned out and needing to get something into the real world, we trained an image-to-Word/Excel/LaTeX converter using an architecture similar to Nougat. It outperformed basically every table extraction model we’ve seen (and we’ve studied them all), robustness aside, but launching something that only extracts tables is not really a commercial product (it lacks focus). So hardly anyone used it.

We looked into other use cases for the technology, but kept finding that commercial viability required handling the full document and meaningfully higher robustness. Furthermore, we’re at our best when we focus on one thing and do it perfectly, and training a model, launching a website, and doing marketing all split our focus. Not to mention there is a lot of (well-funded) competition in the space and we’re just a team of two.

Then we got to thinking: what if we sold our data? We have a pipeline that lets us create virtually any table (eventually any document) from any kind of source data, which can also be augmented via an LLM. Because everything passes through a representation we control, we can apply programmatic augmentations of any kind to those tables and then emit any ground-truth output format (Word, JSON, LaTeX, HTML, …). In other words, we have complete control and can generate whatever data someone needs to improve their model; a rough sketch of the idea is below.
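To make the "controlled intermediate representation, many ground-truth formats" idea concrete, here is a minimal sketch of what such a design could look like. This is not our actual pipeline; every name here (TableIR, Cell, apply_augmentations, to_html, to_latex) is a hypothetical illustration.

```python
# Hypothetical sketch of a table IR with augmentations and multiple renderers.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cell:
    text: str
    align: str = "left"          # augmentable: "left" | "center" | "right"
    color: Optional[str] = None  # augmentable cell color


@dataclass
class TableIR:
    rows: list  # list[list[Cell]]


def apply_augmentations(table: TableIR, augs: list) -> TableIR:
    # Each augmentation is a function TableIR -> TableIR (alignment, spacing, color, ...).
    for aug in augs:
        table = aug(table)
    return table


def to_html(table: TableIR) -> str:
    # Render the same IR as HTML ground truth.
    body = "\n".join(
        "<tr>" + "".join(f'<td style="text-align:{c.align}">{c.text}</td>' for c in row) + "</tr>"
        for row in table.rows
    )
    return f"<table>\n{body}\n</table>"


def to_latex(table: TableIR) -> str:
    # Render the same IR as LaTeX ground truth.
    cols = "".join(c.align[0] for c in table.rows[0])  # l / c / r column spec
    body = " \\\\\n".join(" & ".join(c.text for c in row) for row in table.rows)
    return f"\\begin{{tabular}}{{{cols}}}\n{body} \\\\\n\\end{{tabular}}"


# One IR instance, two ground-truth formats derived from the same source of truth.
ir = TableIR(rows=[
    [Cell("Name"), Cell("Score", align="right")],
    [Cell("A"), Cell("0.91", align="right")],
])
print(to_html(ir))
print(to_latex(ir))
```

The point of the sketch is just the shape of the design: augmentations mutate one controlled representation, and every output format is rendered from it, so the annotations can never drift out of sync with the images.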

So we were thinking of dropping 250k tables plus a benchmark built on our synthetic data (with real-world validation) to demonstrate the capability, and hopefully attract companies with custom requirements who would pay us to generate the data their models lack. We could also probe the weaknesses of existing models, much like a security researcher, and then offer our data as the fix.

What do you think? Is dropping 250k highly diverse, perfectly annotated tables (with multiple ground-truth formats) a good idea? Would that be valuable to people, and could it gain traction?

We’re trying to move quickly (next month or two), so publishing a paper or going to a conference probably isn’t the best move.

submitted by /u/Says_Watt
