OCR Quality #21

Open
opened 2025-05-19 08:36:00 +00:00 by aonrud · 0 comments
Owner

Determine the best output that can reasonably be obtained. Consider post-processing with an NLP library.

OCR:

  • Tesseract tends to output an initial line of garbage. See if this can be mitigated.
  • Confirm that the language selection should be "eng+gle" (assuming that selecting more languages raises the error rate, since Tesseract is choosing among more candidates?)
  • Can column detection be improved? Some issues are evident.
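
One possible mitigation for the leading garbage line, sketched in Python under stated assumptions: the OCR text is already available as a string, and a "garbage" line is one where few of the non-space characters are letters. The helper names (`alpha_ratio`, `drop_leading_garbage`) and the 0.6 cutoff are hypothetical and would need tuning against real scans. It does not address language selection or column detection.

```python
def alpha_ratio(line: str) -> float:
    """Fraction of non-space characters that are letters.

    str.isalpha() is Unicode-aware, so accented Irish vowels count as letters.
    """
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def drop_leading_garbage(text: str, min_alpha_ratio: float = 0.6) -> str:
    """Drop the first non-blank line if it looks like OCR noise.

    Only the first non-blank line is examined; if it passes the check,
    the text is returned unchanged. The threshold is an assumed value.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # skip blank lines before the first real line
        if alpha_ratio(line) < min_alpha_ratio:
            return "\n".join(lines[i + 1:])
        break
    return text
```

A short first line that is legitimately mostly digits or punctuation (a date or issue number, say) would also be dropped by this check, so it is worth reviewing what it removes on a sample before relying on it.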

Post-processing (if not configurable in the processor):

  • Whitespace and hyphenation - at a minimum, normalise line breaks to paragraph flow.
  • Quality assessment - should there be a signal-to-noise threshold (e.g. drop the data when the garbage rate is high)?
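
Both post-processing ideas can be sketched with the Python standard library. This is a rough illustration, not a settled design: `reflow_paragraphs`, `garbage_rate`, `keep_page` and the 0.4 threshold are hypothetical, and "garbage" is assumed to mean a token that is neither word-like nor a plain number.

```python
import re

# A hyphen at end of line followed by a word character on the next line.
_HYPHEN_BREAK = re.compile(r"(\w)-[ \t]*\n[ \t]*(\w)")
# Letters (Unicode-aware), optionally joined by apostrophes/hyphens, or digits only.
_WORDLIKE = re.compile(r"^[^\W\d_]+(?:['’-][^\W\d_]+)*$|^\d+$")
_PUNCT = ".,;:!?\"'()[]“”‘’"


def reflow_paragraphs(text: str) -> str:
    """Join hyphenated line breaks and flow each paragraph onto one line.

    Paragraphs are assumed to be separated by one or more blank lines.
    """
    text = text.replace("\r\n", "\n")
    text = _HYPHEN_BREAK.sub(r"\1\2", text)
    paragraphs = re.split(r"\n\s*\n", text)
    flowed = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in flowed if p)


def garbage_rate(text: str) -> float:
    """Fraction of whitespace-separated tokens that do not look like words."""
    tokens = text.split()
    if not tokens:
        return 1.0  # treat empty output as all noise
    bad = sum(1 for tok in tokens if not _WORDLIKE.match(tok.strip(_PUNCT)))
    return bad / len(tokens)


def keep_page(text: str, max_garbage_rate: float = 0.4) -> bool:
    """Decide whether OCR output is clean enough to keep (assumed threshold)."""
    return garbage_rate(text) <= max_garbage_rate
```

Joining on every end-of-line hyphen also merges genuine compounds ("well-" / "known" becomes "wellknown") and Irish prefixed forms such as "t-" followed by a line break, so a dictionary check before joining may be needed.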
Reference: Irish-Left-Archive/ILAv2#21