Working on a dataset for an LLM project and trying to extract structured info from a bunch of web sources. Got the scraping part mostly down, but maintaining the parsing is killing me. Every source has a slightly different layout, and things break constantly. How do you guys handle this when building training sets?
submitted by /u/Gwapong_Klapish
[link] [comments]