Corpora Round 1
Corpus | Size (compressed) | Time span | English documents | French documents | German documents | Greek documents | Italian documents | Spanish documents | Swedish documents | Ukrainian documents | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
EU Press Corner | 7.2 MB | June 2020 | 335 | 276 | 266 | 120 | 123 | 115 | 122 | 0 | europresscorner-202006-xml.zip |
EUR-Lex | 23.3 MB | June 2020 | 352 | 345 | 345 | 344 | 342 | 342 | 343 | 0 | eurlex-202006-xml.zip |
Global Voices | 13.6 MB | June 2020 | 571 | 446 | 51 | 328 | 539 | 595 | 5 | 66 | global-voices-20200611-xml.zip |
MEDISYS | 2,036.0 MB | December 2019 to April 2020 | 1,450,251 | 325,178 | 272,645 | 146,763 | 661,514 | 832,369 | 37,615 | 15,395 | medisys-201912-xml_ir.zip, medisys-202001-xml_ir.zip, medisys-202002-xml_ir.zip, medisys-202003-p1-7-xml_ir.zip, medisys-202004-xml_ir.zip |
Wikipedia | 13.7 MB | June 2020 | 731 | 357 | 364 | 103 | 271 | 342 | 111 | 121 | wikipedia-20200611-xml.zip |
Total documents by language | | | 1,452,240 | 326,599 | 273,761 | 147,658 | 662,789 | 833,763 | 38,196 | 15,582 | |
Corpora Round 2
Corpus | Size (compressed) | Time span | English documents | French documents | German documents | Greek documents | Italian documents | Spanish documents | Swedish documents | Ukrainian documents | Arabic documents | Download |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MEDISYS | 46.9 GB | April 2020 to September 2020 | 3,542,790 | 667,339 | 334,824 | 285,658 | 1,067,155 | 2,272,703 | 66,412 | 50,775 | 638,581 | en_medisys_2020_round2.zip, fr_medisys_2020_round2.zip, de_medisys_2020_round2.zip, el_medisys_2020_round2.zip, it_medisys_2020_round2.zip, es_medisys_2020_round2.zip, sv_medisys_2020_round2.zip, uk_medisys_2020_round2.zip, ar_2020_round2.zip |
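
After downloading one of the archives listed above, it can be useful to check its contents before unpacking it, especially for the larger MEDISYS files. The following is a minimal sketch using only Python's standard `zipfile` module. The function name `count_members_by_suffix` is hypothetical, and the internal layout of the archives is not described here, so the assumption that each document is stored as its own `.xml` member may not hold for every corpus.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_members_by_suffix(archive):
    """Count the files inside a ZIP archive, grouped by lowercase extension.

    `archive` may be a path to a downloaded .zip file or an open binary
    file-like object. Nothing is extracted to disk; only the archive's
    table of contents is read.
    """
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries are not documents
            # Member names inside a ZIP always use forward slashes.
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix] += 1
    return counts
```

For example, calling `count_members_by_suffix("wikipedia-20200611-xml.zip")` on a local copy returns a `Counter` keyed by extension. If the one-file-per-document assumption holds, the `.xml` count should be comparable to the sum of that corpus's per-language document counts in the table; if it does not match, the archive may bundle several documents per file.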