Simeon Emanuilov PRO

s-emanuilov

AI & ML interests

Software Engineer & Ph.D. candidate | Specializing in ML/DL system development & applying AI to solve real-world business problems.

Recent Activity

posted an update about 7 hours ago


liked a dataset about 12 hours ago

HumanLLMs/Human-Like-DPO-Dataset

liked a model about 12 hours ago

HumanLLMs/Human-Like-LLama3-8B-Instruct



Posts 2

Post


New paper from Salesforce AI Research. The authors found that jointly training on continual pre-training (CPT) and instruction tuning (IT) data with a 50/50 split achieves better results than sequential training. Their 8B-parameter model outperformed larger 70B models on financial tasks.

Down-sampling CPT data to match IT data size improved performance on CFA Challenge exams from 34.44% to 55.56%, while preserving general knowledge: scores on general benchmarks like AI2-ARC and MMLU were comparable or better.
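To make the down-sampling idea concrete, here is a minimal sketch of how a 50/50 mix could be built. It is not the authors' code: the function name `mix_cpt_and_it`, the toy data, and the choice to balance by number of examples (rather than by tokens) are all assumptions for illustration.

```python
import random


def mix_cpt_and_it(cpt_examples, it_examples, seed=0):
    """Down-sample CPT data to the IT data size and shuffle both into one pool.

    Hypothetical helper: balances by example count, which is an assumption;
    the paper may balance by tokens instead.
    """
    n = len(it_examples)
    if len(cpt_examples) < n:
        raise ValueError("CPT set is smaller than IT set; nothing to down-sample")

    rng = random.Random(seed)
    # Randomly keep only as many CPT examples as there are IT examples.
    cpt_sample = rng.sample(cpt_examples, n)

    # Tag each example with its source so the mix can be inspected later.
    mixed = [("cpt", x) for x in cpt_sample] + [("it", x) for x in it_examples]
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for real corpora (assumed, not from the paper).
cpt = [f"raw financial text {i}" for i in range(1000)]
it = [f"instruction/response pair {i}" for i in range(100)]

mixed = mix_cpt_and_it(cpt, it)
counts = {"cpt": 0, "it": 0}
for source, _ in mixed:
    counts[source] += 1
print(counts)  # {'cpt': 100, 'it': 100}
```

The shuffled pool is then used for a single joint training run, as opposed to running CPT first and IT afterwards.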

Technical implementation involved two-stage training: Group 1 utilized 3.84B tokens from web and basic texts, followed by Group 2, which used 1.66B tokens from domain-specific books. Their preference alignment method used generative reward models to identify and correct reasoning errors rather than just rating full solutions.

Evaluation on 91,872 samples across 31 tasks showed their Llama-Fin model achieving 91.13% accuracy on sentiment analysis (FPB) and 95.32% on FiQA SA, exceeding GPT-4's performance of 82.16% and 68.51%, respectively, on these benchmarks.

It could be useful for many financial companies looking to build AI pipelines.

Interesting read, but neither the model nor the GitHub repo is accessible yet. The key insight for AI builders is that a small model can outperform much bigger ones.

https://arxiv.org/abs/2501.04961

Post


Hey HF community! 👋

Excited to share Monkt - a tool I built to solve the eternal headache of processing documents for ML/AI pipelines.

What it does: Converts PDF, Word, PowerPoint, and Excel files, web pages, or raw HTML into clean Markdown or structured JSON.

Great for:
✔ LLM training dataset preparation;
✔ Knowledge base construction;
✔ Research paper processing;
✔ Technical documentation management.

It has API access for integration into ML pipelines.

Check it out at https://monkt.com/ if you want to save time on document processing infrastructure.

Looking forward to your feedback!