Semantic Text Analyser BERT-like language model for formal language understanding

Description

SeTABERTa is a new multilingual language model pretrained from scratch on several Open Access text repositories: EU legislation, research articles, EU public documents and US patents. Two thirds of the training data is English; the remaining third covers the EU24 languages. The model was trained on the JRC Big Data Platform and can be fine-tuned for other tasks. It is available on Hugging Face at https://huggingface.co/vidaud/SeTABERTa-mlm-v1 and can be loaded with the Hugging Face transformers library.
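
As an illustration of loading the model with the transformers library, the following is a minimal sketch rather than an official usage recipe. It assumes that the checkpoint exposes a masked-language-model head (suggested by "mlm" in the repository name) and that its architecture is resolvable by the generic Auto classes; neither is confirmed in this record. The helper names load_setaberta and top_mask_predictions and the example sentence are hypothetical.

```python
# Repository name taken from the Hugging Face URL given in the description.
MODEL_ID = "vidaud/SeTABERTa-mlm-v1"


def load_setaberta(model_id: str = MODEL_ID):
    """Fetch the tokenizer and masked-LM weights (from the Hub or the local cache)."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Assumption: the checkpoint is compatible with the generic Auto classes.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model


def top_mask_predictions(text_template: str, k: int = 5):
    """Fill the single '{mask}' placeholder in text_template; return (token, score) pairs."""
    from transformers import pipeline

    tokenizer, model = load_setaberta()
    fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    # Use the tokenizer's own mask token instead of hard-coding "<mask>" or "[MASK]".
    text = text_template.format(mask=tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(text, top_k=k)]


if __name__ == "__main__":
    # Hypothetical example sentence in a formal, legislative register.
    sentence = "Member States shall adopt the {mask} necessary to comply with this Directive."
    for token, score in top_mask_predictions(sentence):
        print(f"{token!r}\t{score:.3f}")
```

Running the script downloads the weights on first use, so it needs network access and the transformers and torch packages. For fine-tuning on another task, the same model identifier could in principle be passed to a task-specific Auto class, subject to the same assumptions.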

Contact

Email: vidas.daudaravicius (at) ec.europa.eu

Contributors

European Commission, Joint Research Centre

https://ec.europa.eu/info/departments/joint-research-centre

How to cite

European Commission, Joint Research Centre (JRC) (2024): Semantic Text Analyser BERT-like language model for formal language understanding. European Commission, Joint Research Centre (JRC) [Dataset] PID: http://data.europa.eu/89h/addd10f9-8325-4e49-8588-6cb681c162a5

Keywords

Language model

Data access

Semantic Text Analyser BERT-like language model for formal language understanding

Use conditions: European Commission reuse notice
Access conditions: No limitations

Additional information

Published by: European Commission, Joint Research Centre
Created date: 2024-02-01
Modified date: 2024-02-08
Issued date: 2024-02-01
Data theme(s): Science and technology
Update frequency: unknown
Identifier: http://data.europa.eu/89h/addd10f9-8325-4e49-8588-6cb681c162a5