Improving model for Eugène Wilhelm transcripts

Dear Teklia team,

I successfully retrained a new model with PyLaia (on the command line) for our handwritten documents. I regenerated the training data with the atr-data-generator tool using, as you suggested, the option “polygon” for the parameter image.extraction_mode.
In the logs, I can see the CER went from 0.22 (previous training) to 0.25. It is better but still too weak. Should I pay attention to any other relevant information in the training logs? There is a lot of information, but it is not clear what I should look for to move forward.

As you suggested, I now plan to fine-tune the Belfort project model with our data. I am still using the default parameters for training.
Do you have any specific recommendations on hyperparameter tuning at this stage?

Thanks in advance,
Carmen (from EHESS)

Hi Carmen,

Thanks for getting in touch!

The CER measures the character error rate, so the lower the better. A 25% CER seems really high for handwriting recognition. You should aim for 5-10% on the validation set.
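
For reference, the CER is the edit distance between the prediction and the ground truth, divided by the length of the ground truth:

CER = (substitutions + insertions + deletions) / number of characters in the reference

so a CER of 0.25 means that roughly one character in four is wrong.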

I wouldn’t necessarily recommend changing the hyperparameters for now. Our team generally uses PyLaia’s default parameters, and we manage to get satisfactory results on most datasets.

If PyLaia does not learn, you might want to check your dataset (number of lines, annotation quality, segmentation quality, etc.). If your dataset is small (< 1000 training lines), fine-tuning the Belfort model is a good idea.

Finally, I recommend using Weights & Biases to log and display metrics during training. You won’t have to read the logs anymore, as everything will be summarized in the interface.

Best,
Solène


Dear Solène, dear Teklia team,

I ran the experiment to fine-tune the Belfort model on the Eugène Wilhelm training data. The last part of the log file reads:

.... /459 [00:22<00:00, 20.17it/s, loss=126.8780, cer=91.7%, wer=100.0%] Monitored metric va_cer did not improve in the last 80 records. Best score: 0.896. Signaling Trainer to stop. Epoch 81, global step 37637: va_cer was not in top 3

This seems odd; I obtained 89% CER, which is a lot, right?

Also, it seems the old version of PyLaia (1.1.1) I am using does not include the Weights & Biases option: I get an “unknown parameter” error.
I could not install a more recent version compatible with my GPU setup.

Thanks in advance for your help,
best regards,
Carmen

Dear Carmen,

Yes, 89% CER is very high (only 10% of characters are correct!).

Fine-tuning the Belfort model should improve performance (e.g. reduce the CER) compared to the model trained from scratch. Especially since the Belfort dataset is pretty close to the Eugène Wilhelm dataset (French, 19-20th century, handwritten).

Another experiment you could try is to merge the Belfort & Eugène Wilhelm datasets, and train PyLaia (from scratch) on this larger dataset.

It would be helpful if you could share with us:

  • your PyLaia configuration
  • a sample of your data (train.txt, val.txt, and a few images)
  • the full log file

Best,
Solène

Dear Solène,
Thanks a lot for taking a look at this, and for your suggestion; I will try that as the next experiment.
I’d indeed be very grateful if you could take a look at my config file and a sample of annotated data; maybe I made a mistake somewhere.
As I can only attach images to this message, I will send you the files by email.
Thanks again,
Carmen

Dear Teklia team,
The formatting of the Eugène Wilhelm and Belfort datasets is quite different, see below. How can I merge them? Is there a specific tool?

Maybe I would need access to the PyLaia-formatted Belfort dataset via Arkindex? I suppose that format should be similar to the EW one?

Thanks in advance for your help,
best regards,

Files in the Belfort dataset on Hugging Face:

total 1,7G
201M test.parquet
1,4G  train.parquet
166M validation.parquet

Files in the EW dataset:

ls /.../EW/env_atr_gen/data/
 config.yaml     images       'param-Training dataset-2024-12-03_11-02-44.json'   syms.txt       test_no_space.txt   tokens.txt      train_no_space.txt   val_ids.txt        val.txt
 corpus_lm.txt   lexicon.txt   syms.count                                         test_ids.txt   test.txt            train_ids.txt   train.txt            val_no_space.txt

Dear @cvbrandoe,

You can use the new (not yet merged) command to download a dataset from Hugging Face directly into the PyLaia format.
Clone the repository and check out the huggingface-dataset branch to use pylaia-htr-dataset-hf-download:

pylaia-htr-dataset-hf-download \
   -d Teklia/Belfort-line \
   -o /tmp/laia

to download the Teklia/Belfort-line dataset from Hugging Face and save it in /tmp/laia.
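
In case the clone-and-checkout step is unclear, it could look like this (a sketch; the repository URL is an assumption on my part, adjust it to wherever you clone PyLaia from):

git clone https://gitlab.teklia.com/atr/pylaia.git
cd pylaia
git checkout huggingface-dataset
pip install .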

Then you will have to merge the two datasets. To do that you must:

  • merge all txt files except syms.txt and corpus_lm.txt,
  • merge syms.txt by taking the union of both files (this file is the set of characters present in the train and validation sets; see the sketch below),
  • add the paths to both image directories when calling pylaia-htr-train-ctc.
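
As an illustration, the merge could be done along these lines (a minimal sketch with placeholder paths; it assumes syms.txt holds one “symbol index” pair per line with <ctc> at index 0, so the union has to be renumbered):

EW=/path/to/EW_training_data
BELFORT=/path/to/belfort_pylaia_format
OUT=/path/to/fusion_ew_belfort
mkdir -p "$OUT"

# Concatenate the line tables (one "image_id transcription" per line);
# repeat for the other txt files (test.txt, *_ids.txt, ...) as needed
for f in train.txt val.txt; do
   cat "$EW/$f" "$BELFORT/$f" > "$OUT/$f"
done

# Rebuild syms.txt as the union of both symbol sets, keeping <ctc>
# at index 0 and renumbering the remaining symbols from 1
{ echo "<ctc> 0"
  cut -d' ' -f1 "$EW/syms.txt" "$BELFORT/syms.txt" \
     | grep -vx '<ctc>' | sort -u | awk '{print $0, NR}'
} > "$OUT/syms.txt"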

Please keep us updated on your progress/issues.


Yoann Schneider

Thanks Yoann for this new module, this is great news.

I successfully installed the command and downloaded the Belfort dataset in PyLaia format; it worked.

As you mentioned, I followed the merge instructions for the two input folders (EW and Belfort).
However, only a single corpus_lm.txt file exists, in the EW training data; I cannot find one in the newly created Belfort folder, so I used the EW one in the resulting folder.
When I execute the pylaia-htr-train-ctc command again, I get the error below. I am using absolute paths in my configuration file (see below). Should I do things differently? I found the command parameter --common.model_filename, but I am not sure what I should provide as its argument value.

pylaia-htr-train-ctc --config config/config_train_model_4e_exp.yaml
Global seed set to 74565
[2025-02-24 16:31:23,030 CRITICAL laia] Uncaught exception:
Traceback (most recent call last):
  File "/home/geomatique/pylaia-env/bin/pylaia-htr-train-ctc", line 8, in <module>
    sys.exit(main())
  File "/home/geomatique/pylaia-env/lib/python3.10/site-packages/laia/scripts/htr/train_ctc.py", line 253, in main
    run(**args)
  File "/home/geomatique/pylaia-env/lib/python3.10/site-packages/laia/scripts/htr/train_ctc.py", line 73, in run
    model is not None
AssertionError: Could not find the model. Have you run pylaia-htr-create-model?

Content of the YAML config file:

syms: /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/syms.txt
img_dirs:
  - /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/EW_Training_data_polygon_mode_latest/images/
  - /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/belfort_pylaia_format_new/images/
tr_txt_table: /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/train.txt
va_txt_table: /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/val.txt
common:
  experiment_dirname: /home/geomatique/EW/Training_experiments/experiment_belfort_EW_4exp
logging:
  filepath: pylaia_training_24022025.log
scheduler:
  active: true
train:
  augment_training: true
  early_stopping_patience: 80
trainer:
  auto_select_gpus: true
  gpus: 1
  max_epochs: 600

Thanks a lot for your help.

Hi @cvbrandoe,

You don’t really need the corpus_lm.txt file for now, so we’ll skip that part.

The error you get is explained by the content of the path at common.experiment_dirname: PyLaia expects to find the model file created by pylaia-htr-create-model in there.
Can you show us your current setup?


Yoann Schneider

Hi @Yoann.Schneider,
Thanks for this!
I supposed corpus_lm.txt was not necessary.
Nevertheless, I don’t have a model file in any of my folders.
My common.experiment_dirname folder is empty before I execute the pylaia-htr-train-ctc command, as it was in my previous experiments.
Do I understand correctly that this is the folder where the trained model is written by the pylaia-htr-train-ctc command?
I also tried to run pylaia-htr-create-model, but I can’t get the arguments right, and the documentation seems to be private on your GitLab.

This is the content of the two folders (EW and Belfort):

(pylaia-env) (base) geomatique@geomatique-Precision-3660:~/EW$ ls Training_datasets/2_belfort_training_data_pylaia_teklia/belfort_pylaia_format_new/
dev_ids.txt   dev.txt  syms_stats.txt  test_ids.txt   test.txt       train_text.txt  validation_ids.txt   validation.txt  val_text.txt
dev_text.txt  images   syms.txt        test_text.txt  train_ids.txt  train.txt       validation_text.txt  val_ids.txt     val.txt

(pylaia-env) (base) geomatique@geomatique-Precision-3660:~/EW$ ls Training_datasets/2_belfort_training_data_pylaia_teklia/EW_Training_data_polygon_mode_latest/
 config.yaml     images       'param-Training dataset-2024-12-03_11-02-44.json'   syms.txt       test_no_space.txt   tokens.txt      train_no_space.txt   val_ids.txt        val.txt
 corpus_lm.txt   lexicon.txt   syms.count                                         test_ids.txt   test.txt            train_ids.txt   train.txt            val_no_space.txt

This is the content of the folder after the merge:

(pylaia-env) (base) geomatique@geomatique-Precision-3660:~/EW$ ls Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/
corpus_lm.txt  dev_text.txt  lexicon.txt     syms.txt      test_no_space.txt  test.txt    train_ids.txt       train_text.txt  validation_ids.txt   validation.txt  val_no_space.txt  val.txt
dev_ids.txt    dev.txt       syms_stats.txt  test_ids.txt  test_text.txt      tokens.txt  train_no_space.txt  train.txt       validation_text.txt  val_ids.txt     val_text.txt

Maybe it would be easier if we had a quick call to check the parameters, next Monday if you have some time?

Thanks in advance,
best regards,
Carmen

Hi Carmen,

The documentation is public and available at https://atr.pages.teklia.com/pylaia/pylaia/usage/initialization/
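
For what it’s worth, a minimal invocation could look like this (a sketch only; the config file name is made up, and the exact options for your version are listed on that documentation page):

pylaia-htr-create-model \
   --config config/config_create_model.yaml

with a YAML file along the lines of:

syms: /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/syms.txt
common:
  experiment_dirname: /home/geomatique/EW/Training_experiments/experiment_belfort_EW_4exp

Once the model file exists, pylaia-htr-train-ctc should be able to find it.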
Let’s schedule a call on Monday, I’ll email you to find a time that works for both of us!

Best,
Solène

Hi Solène,
Thanks again for your help last Monday.
Training finished successfully (it took less than 2 days); the CER is 0.1497.
It is better than the first experiment (0.25), although on the Teklia/pylaia-belfort model card on Hugging Face you mentioned the Belfort model achieved around 0.10…
This seems like good news for EW.
Do you think we can further improve the model? What would you advise us to do from here?
I will send you the log by email.
Thanks again!
Best,
Carmen

Hi Carmen,

Great news! However, your current score is computed on both Belfort and EW. You should compute the CER/WER on each test set (Belfort vs EW) to see how the model performs on both datasets.
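
For example, you could decode each test set separately and score it against its own reference file. A rough sketch (the config file name is made up; check the decoding section of the documentation for the exact options in your version):

pylaia-htr-decode-ctc \
   --config config/config_decode_ew.yaml > predictions_ew.txt

with a YAML file pointing at the EW test list only:

syms: /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/fusion_ew_belfort/syms.txt
img_list: /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/EW_Training_data_polygon_mode_latest/test_ids.txt
img_dirs:
  - /home/geomatique/EW/Training_datasets/2_belfort_training_data_pylaia_teklia/EW_Training_data_polygon_mode_latest/images/
common:
  experiment_dirname: /home/geomatique/EW/Training_experiments/experiment_belfort_EW_4exp

Then do the same with the Belfort test list, and compare the two CER/WER values.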

Once this is done, you can try to improve performance on EW by adding a language model trained on the EW training set (see the documentation). You can then run another evaluation and compare the CER/WER with your first model.

Hope this helps!

Best,
Solène

Hi Solène,
Yes, you are right…
Thanks for the hint about using a language model.
I’ll try to find some time to do this in the coming months. By any chance, do you have any pointers on how to do it? Should I use the PyLaia toolset, or what would you recommend? Any suggestions are welcome.
Thanks again,
Best,
Carmen