Description
SeTABERTa is a new multilingual language model pretrained from scratch on several Open Access text repositories: EU legislation, research articles, EU public documents and US patents. About two thirds of the training data is English; the remainder covers the EU24 languages. The model was trained on the JRC Big Data Platform and can be fine-tuned for other tasks. It is available on Hugging Face at https://huggingface.co/vidaud/SeTABERTa-mlm-v1 and can be loaded with the Hugging Face transformers library.
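Loading the model with the transformers library might look like the minimal sketch below. The repository id is taken from the URL above. The choice of AutoModelForMaskedLM is an assumption based on the "mlm" suffix in the model name, and the helper names repo_id_from_url and load_setaberta are hypothetical, introduced only for illustration.

```python
from urllib.parse import urlparse

# Model page URL as given in the description above.
MODEL_URL = "https://huggingface.co/vidaud/SeTABERTa-mlm-v1"


def repo_id_from_url(url: str) -> str:
    """Return the 'owner/name' repository id from a Hugging Face model URL."""
    return urlparse(url).path.strip("/")


MODEL_ID = repo_id_from_url(MODEL_URL)


def load_setaberta(model_id: str = MODEL_ID):
    """Download and return (tokenizer, model) for the given repository id.

    Assumption: the checkpoint exposes a masked-language-modelling head,
    as the "mlm" suffix in its name suggests.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model
```

Calling load_setaberta() requires network access and the transformers package. For fine-tuning on a downstream task, a different Auto class (for example AutoModelForSequenceClassification) would likely be the starting point instead.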
Contributors
- European Commission, Joint Research Centre
- https://ec.europa.eu/info/departments/joint-research-centre
How to cite
European Commission, Joint Research Centre (JRC) (2024): Semantic Text Analyser BERT-like language model for formal language understanding. European Commission, Joint Research Centre (JRC) [Dataset] PID: http://data.europa.eu/89h/addd10f9-8325-4e49-8588-6cb681c162a5
Additional information
- Published by: European Commission, Joint Research Centre
- Created date: 2024-02-01
- Modified date: 2024-02-08
- Issued date: 2024-02-01
- Data theme(s): Science and technology
- Update frequency: unknown
- Identifier: http://data.europa.eu/89h/addd10f9-8325-4e49-8588-6cb681c162a5