Hi,
I’m trying to create a dataset of exam questions from the A-level Edexcel Physics question papers.
Here’s a sample paper, for example.
Ideally, I’d like to extract all the text, the equations (properly formatted) and the images (mainly graphs and diagrams) just by uploading the file, but as far as I know this isn’t feasible.
What I’m doing right now is using PyPDF to extract the text alone, ignoring possible errors in how the equations come out (which puts me in a difficult position when there are more complex equations involved than just straightforward one-line formulas). I’m then manually cleaning it up, using regular expressions where I can to simplify the process. After that, I plan on manually ‘snipping’ the images out and putting all of this into a MySQL database.
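To make the workflow concrete, a minimal sketch of the extract-then-clean step might look like the following. It assumes the `pypdf` package is installed, and the function names and both regex patterns are hypothetical: they guess that each question starts with a number at the beginning of a line and ends with its marks in brackets, which a real paper may not follow exactly.

```python
import re


def extract_pdf_text(path):
    """Return the text of every page joined by newlines (needs `pip install pypdf`)."""
    # Imported lazily so the cleaning code below runs even without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical patterns: a question number at the start of a line followed by a
# capital letter, and a marks total such as "(2)" at the very end of the question.
QUESTION_START = re.compile(r"^(\d{1,2})\s+(?=[A-Z])", re.MULTILINE)
MARKS = re.compile(r"\((\d+)\)\s*$")


def split_questions(raw_text):
    """Split raw extracted text into a list of question dictionaries."""
    # Collapse runs of spaces and tabs left behind by the PDF layout.
    text = re.sub(r"[ \t]+", " ", raw_text)
    starts = list(QUESTION_START.finditer(text))
    questions = []
    for i, match in enumerate(starts):
        # A question runs until the next question number, or to the end of the text.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        body = text[match.end():end].strip()
        marks = MARKS.search(body)
        questions.append({
            "number": int(match.group(1)),
            "text": body,
            "marks": int(marks.group(1)) if marks else None,
        })
    return questions
```

Something like `split_questions(extract_pdf_text("paper.pdf"))` would then give one dictionary per question. Equations would still come through as flattened text, so this does nothing for the formatting problem.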
The project I’m working on rn is a question suggestion system based on content and question difficulty and I’m using a very specific subset of questions, as I mentioned earlier, just because I’m not too committed atm to tediously creating a dataset. I’m not even sure if storing this in MySQL is a good idea and I’ve personally never worked on any ML projects that don’t involve .csv files or aren’t image datasets, so I am pretty lost on this.
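For the storage question, one possible layout is sketched below. It uses Python's built-in `sqlite3` purely as a stand-in for MySQL so that it runs without a server, and every table and column name (`question`, `question_image`, `topic`, `difficulty` and so on) is an assumption for illustration, not a settled design. Images are stored as file paths rather than inside the database.

```python
import sqlite3


def create_schema(conn):
    """Create one table for questions and one for their snipped images."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS question (
            id         INTEGER PRIMARY KEY,
            paper      TEXT NOT NULL,     -- which paper the question came from
            number     INTEGER NOT NULL,  -- question number within the paper
            text       TEXT NOT NULL,     -- cleaned question text
            topic      TEXT,              -- content label for the suggestion system
            difficulty INTEGER,           -- difficulty rating, scale left open
            marks      INTEGER
        );
        CREATE TABLE IF NOT EXISTS question_image (
            id          INTEGER PRIMARY KEY,
            question_id INTEGER NOT NULL REFERENCES question(id),
            file_path   TEXT NOT NULL     -- path to the image file on disk
        );
    """)


def add_question(conn, paper, number, text, topic=None, difficulty=None,
                 marks=None, image_paths=()):
    """Insert one question plus any image paths and return the new question id."""
    cur = conn.execute(
        "INSERT INTO question (paper, number, text, topic, difficulty, marks) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (paper, number, text, topic, difficulty, marks),
    )
    question_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO question_image (question_id, file_path) VALUES (?, ?)",
        [(question_id, path) for path in image_paths],
    )
    conn.commit()
    return question_id
```

Since the data is basically one row per question, the `question` table could just as easily be exported to a .csv file for the ML side.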
Any advice would be super highly appreciated! Wish you a great day 🙂
submitted by /u/cakeandflowers2202