Hi everyone, I need help finding a dataset of images annotated with human actions [such as sitting-in-chair, working-on-laptop, etc.]. I found a model capable of generating such tags on Hugging Face here, but I was unable to locate its source dataset.
Just for context: I am trying to create a fine-tuned ViT model that incorporates as broad a set of visual tags as possible. My plan is to optimize this model for edge devices [using quantization-aware training + TFLite model conversion] and open-source the weights. Eventually, I am hoping it can be used for a broad range of visual search/tagging/QnA tasks. Currently, I am training the model on the top 2500 Danbooru tags + the MIT SUN indoor-scene tags.
An online demo of the model can be found here. If anyone has suggestions for other datasets/tags to add, or would like to help with the training effort, please drop a line. I would really appreciate it.
[Disclosure: I am not affiliated in any way with any of the Hugging Face/arXiv/mit.edu links I posted here. The online demo is maintained by me, but there are no ads or anything else on it that earns me money.]
submitted by /u/DisintegratingBo