Dear Teklia team,
I successfully retrained a new model with PyLaia (on the command line) for our handwritten documents. I regenerated the training data with the atr-data-generator tool using, as you suggested, the “polygon” option for the image.extraction_mode parameter.
In the logs, I can see the CER went from 0.22 (previous training) to 0.25. It is better but still too weak. Should I pay attention to any other relevant information in the training logs? There is a lot of information, but it is not clear what I should look for to move forward.
As you suggested, I now plan to fine-tune the Belfort project model with our data. I am still using the default parameters for training.
Do you have any specific recommendations on hyperparameter tuning at this stage?
Thanks in advance,
Carmen (from EHESS)
Hi Carmen,
Thanks for getting in touch!
The CER measures the character error rate, so the lower the better. A 25% CER seems really high for handwriting recognition. You should aim for 5-10% on the validation set.
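For reference, the CER is the character-level Levenshtein (edit) distance between the predicted and reference transcriptions, divided by the reference length. A minimal sketch of the metric itself (not PyLaia's own implementation):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution (or match)
        prev = curr
    return prev[n] / m if m else 0.0
```

So cer("abcd", "abxd") is 0.25: one substitution over four reference characters.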
I wouldn’t necessarily recommend changing the hyperparameters for now. Our team generally uses PyLaia’s default parameters, and we manage to get satisfactory results on most datasets.
If PyLaia does not learn, you might want to check your dataset (number of lines, annotation quality, segmentation quality, etc). If your dataset is small (< 1000 training lines), fine-tuning the Belfort model is a good idea.
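As a quick sanity check of a dataset, something along these lines counts training lines and symbol frequencies (a sketch, assuming PyLaia's text-table format of one `<image_id> <space-separated symbols>` per line; the function name is illustrative). Rare symbols and very small line counts are common causes of high CER:

```python
from collections import Counter
from pathlib import Path

def dataset_stats(txt_table: str):
    """Count lines and symbol frequencies in a PyLaia text table.

    Assumes each non-empty line is "<image_id> <transcription>", with the
    transcription tokenised into space-separated symbols.
    """
    n_lines = 0
    counts = Counter()
    for line in Path(txt_table).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        n_lines += 1
        _, _, transcription = line.partition(" ")
        counts.update(transcription.split())
    return n_lines, counts
```

Symbols seen only a handful of times in train.txt are very hard for the model to learn.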
Finally, I recommend using Weights & Biases to log and display metrics during training. You won’t have to read the logs anymore, as everything will be summarized in the interface.
Best,
Solène
Dear Solène, dear Teklia team,
I ran the experiment to fine-tune the Belfort model on the Eugène Wilhelm training data. The last part of the log file indicates:
.... /459 [00:22<00:00, 20.17it/s, loss=126.8780, cer=91.7%, wer=100.0%] Monitored metric va_cer did not improve in the last 80 records. Best score: 0.896. Signaling Trainer to stop. Epoch 81, global step 37637: va_cer was not in top 3
This is odd: I obtained 89% CER, which is a lot, right?
Also, it seems the old version of PyLaia (version 1.1.1) I am using does not include the Weights & Biases option; I get an unknown-parameter error.
I could not install a more recent version compatible with my GPU settings.
Thanks in advance for your help,
best regards,
Carmen
Dear Carmen,
Yes, 89% CER is very high (only 10% of characters are correct!).
Fine-tuning the Belfort model should improve performance (e.g. reduce the CER) compared to the model trained from scratch. Especially since the Belfort dataset is pretty close to the Eugène Wilhelm dataset (French, 19-20th century, handwritten).
Another experiment you could try is to merge the Belfort & Eugène Wilhelm datasets, and train PyLaia (from scratch) on this larger dataset.
It would be helpful if you could share with us:
- your PyLaia configuration
- a sample of your data (train.txt, val.txt, and a few images)
- the full log file
Best,
Solène
Dear Solène,
Thanks a lot for taking a look at this and for your suggestion; I will try that as my next experiment.
I’d be very grateful if you could take a look indeed at my config file and sample annotated data, maybe I made a mistake.
As I can only attach images to this message, I send you the files by email.
thanks again,
Carmen
Dear Teklia team,
The formatting of the datasets from Eugène Wilhelm and Belfort is quite different, see below. How can I merge them? Is there a specific tool?
Maybe I would need access to the PyLaia-formatted Belfort dataset via Arkindex? I suppose this format should be similar to the EW one?
Thanks in advance for your help,
best regards,
Files in the Belfort dataset via HuggingFace:
total 1,7G
201M test.parquet
1,4G train.parquet
166M validation.parquet
Files in the EW dataset:
ls /.../EW/env_atr_gen/data/
config.yaml images 'param-Training dataset-2024-12-03_11-02-44.json' syms.txt test_no_space.txt tokens.txt train_no_space.txt val_ids.txt val.txt
corpus_lm.txt lexicon.txt syms.count test_ids.txt test.txt train_ids.txt train.txt val_no_space.txt
Dear @cvbrandoe
You can use the new (unmerged) command to download a dataset from HuggingFace directly into the PyLaia format.
Clone the repository and check out the huggingface-dataset branch to use pylaia-htr-dataset-hf-download:
pylaia-htr-dataset-hf-download \
  -d Teklia/Belfort-line \
  -o /tmp/laia
This downloads Teklia/Belfort-line · Datasets at Hugging Face and saves it in /tmp/laia.
Then you will have to merge the two datasets. To do that you must:
- merge all txt files except syms.txt and corpus_lm.txt,
- merge syms.txt by taking the union of both files (this file is the set of characters present in the train and validation sets),
- add the paths to both image directories when calling pylaia-htr-train-ctc.
Please keep us updated on your progress/issues.
Yoann Schneider
Thanks Yoann for this new module, this is great news.
I successfully installed the command and downloaded the Belfort dataset in PyLaia format; it worked.
As you mentioned, I followed the merge instructions for the two input folders (EW and Belfort).
However, there is only a single corpus_lm.txt file, in the EW training data; I cannot find one in the newly created Belfort folder, so I used the EW one in the resulting folder.
When I executed the pylaia-htr-train-ctc command again, I got the error below. I am using absolute paths in my configuration file (see below). Should I do things differently? I found the command parameter --common.model_filename but I am not sure what I should provide as its argument value.
pylaia-htr-train-ctc --config config/config_train_model_4e_exp.yaml
Global seed set to 74565
[2025-02-24 16:31:23,030 CRITICAL laia] Uncaught exception:
Traceback (most recent call last):
File "/home/geomatique/pylaia-env/bin/pylaia-htr-train-ctc", line 8, in <module>
sys.exit(main())
File "/home/geomatique/pylaia-env/lib/python3.10/site-packages/laia/scripts/htr/train_ctc.py", line 253, in main
run(**args)
File "/home/geomatique/pylaia-env/lib/python3.10/site-packages/laia/scripts/htr/train_ctc.py", line 73, in run
model is not None
AssertionError: Could not find the model. Have you run pylaia-htr-create-model?
Content of config YAML file:
syms: /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/syms.txt
img_dirs:
  - /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/EW_Training_data_polygon_mode_latest/images/
  - /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/belfort_pylaia_format_new/images/
tr_txt_table: /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/train.txt
va_txt_table: /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/val.txt
common:
  experiment_dirname: /home/geomatique/EW/Training_experiments/experiment_belfort_EW_4exp
logging:
  filepath: pylaia_training_24022025.log
scheduler:
  active: true
train:
  augment_training: true
  early_stopping_patience: 80
trainer:
  auto_select_gpus: true
  gpus: 1
  max_epochs: 600
Thanks a lot for your help.
Hi @cvbrandoe
You don’t really need the corpus_lm.txt file for now, so we’ll skip that part.
The error you get can be explained by the content of the path at common.experiment_dirname. There should be a model folder inside it.
Can you show us your current setup?
Can you show us your current setup?
Yoann Schneider
Hi @Yoann.Schneider
Thanks for this!
I supposed corpus_lm.txt was not necessary.
Nevertheless, I don’t have a model folder in any of my folders.
My common.experiment_dirname is empty before I execute the pylaia-htr-train-ctc command, as in my previous experiments.
I understand this is the folder where the trained model is written by the pylaia-htr-train-ctc command?
I also tried to run pylaia-htr-create-model, but I don’t have the arguments right, and the documentation seems to be private in your GitLab.
This is the content of the two folders (EW and Belfort):
(pylaia-env) (base) geomatique@geomatique-Precision-3660:~/EW$ ls Training_datasets/2_belfort_training_data_pylaia_teklia/belfort_pylaia_format_new/
dev_ids.txt dev.txt syms_stats.txt test_ids.txt test.txt train_text.txt validation_ids.txt validation.txt val_text.txt
dev_text.txt images syms.txt test_text.txt train_ids.txt train.txt validation_text.txt val_ids.txt val.txt
(pylaia-env) (base) geomatique@geomatique-Precision-3660:~/EW$ ls Training_datasets/2_belfort_training_data_pylaia_teklia/EW_Training_data_polygon_mode_latest/
config.yaml images 'param-Training dataset-2024-12-03_11-02-44.json' syms.txt test_no_space.txt tokens.txt train_no_space.txt val_ids.txt val.txt
corpus_lm.txt lexicon.txt syms.count test_ids.txt test.txt train_ids.txt train.txt val_no_space.txt
This is the content of the folder after their fusion:
(pylaia-env) (base) geomatique@geomatique-Precision-3660:~/EW$ ls Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/
corpus_lm.txt dev_text.txt lexicon.txt syms.txt test_no_space.txt test.txt train_ids.txt train_text.txt validation_ids.txt validation.txt val_no_space.txt val.txt
dev_ids.txt dev.txt syms_stats.txt test_ids.txt test_text.txt tokens.txt train_no_space.txt train.txt validation_text.txt val_ids.txt val_text.txt
Maybe it would be easier if we could make a quick call to check parameters, next Monday if you have some time?
Thanks in advance,
best regards,
Carmen
Hi Carmen,
The documentation is public and available at: https://atr.pages.teklia.com/pylaia/pylaia/usage/initialization/
Let’s schedule a call on Monday, I’ll email you to find a time that works for both of us!
Best,
Solène
Hi Solène,
Thanks again for your help last Monday.
Training finished successfully (it took less than 2 days); the CER is 0.1497.
That is better than the first experiment (0.25). Here (Teklia/pylaia-belfort · Hugging Face) you mentioned the Belfort model achieved around 0.10…
This seems like good news for EW.
Do you think we can further improve the model? What would you advise us to do from here?
I will send you the log by email.
thanks again!
best,
Carmen
Hi Carmen,
Great news! However, your current score is computed on both Belfort and EW. You should compute the CER/WER on each test set (Belfort vs EW) to see how the model performs on both datasets.
Once this is done, you can try to improve performance on EW by adding a language model trained on the EW training set (see the documentation). You can then run another evaluation and compare the CER/WER with your first model.
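As a first step, a language model needs a plain-text corpus built from the EW training transcriptions (presumably what your existing corpus_lm.txt contains). A hedged sketch, assuming the `<image_id> <transcription>` table format; the resulting file can then be fed to an n-gram toolkit such as KenLM:

```python
from pathlib import Path

def build_lm_corpus(train_txt: str, out_corpus: str) -> None:
    """Strip the image ids from a PyLaia text table, keeping one
    transcription per line, as a corpus for n-gram LM training."""
    out = []
    for line in Path(train_txt).read_text(encoding="utf-8").splitlines():
        if line.strip():
            out.append(line.partition(" ")[2])
    Path(out_corpus).write_text("\n".join(out) + "\n", encoding="utf-8")
```

Keep the corpus restricted to the training split, so that the evaluation on the EW test set stays honest.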
Hope this helps!
Best,
Solène
Hi Solène,
Yes, you are right…
Thanks for the hint about using a language model.
I’ll try to find some time to do this in the next months. By any chance, do you have any pointers on how to do it? Should I use the PyLaia tool set, or what would you recommend? Any suggestions are welcome.
thanks again,
best,
Carmen