OCR Quality #21


Open

opened 2025-05-19 08:36:00 +00:00 by aonrud · 0 comments

aonrud commented

2025-05-19 08:36:00 +00:00

Owner

Determine the best output that can reasonably be obtained. Consider post-processing with an NLP library.

OCR:

* Tesseract tends to output an initial line of garbage. See if this can be mitigated.
* Confirm that the language selection should be "eng+gle" (assuming that selecting more languages means a higher error rate, as it's choosing from more languages?)
* Can column detection be improved? Some issues are evident.
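
A rough sketch of how the three OCR points could be tried out, assuming the `pytesseract` wrapper and Pillow are used (the issue does not say which binding the processor calls). The helper names and the `min_alpha_ratio` value of 0.6 are hypothetical and would need tuning against real pages. `--psm` is Tesseract's page segmentation mode option; comparing a few modes (e.g. 1, 3, 4) on the problem pages is one way to probe the column issues, not a known fix.

```python
def looks_like_garbage(line: str, min_alpha_ratio: float = 0.6) -> bool:
    """Heuristic (assumption): a line is garbage when too few of its
    non-space characters are letters. Blank lines count as garbage."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return True
    alpha = sum(c.isalpha() for c in chars)  # isalpha() is True for á, é, etc.
    return alpha / len(chars) < min_alpha_ratio


def drop_leading_garbage(text: str) -> str:
    """Remove garbage lines from the start of the OCR output only."""
    lines = text.splitlines()
    while lines and looks_like_garbage(lines[0]):
        lines.pop(0)
    return "\n".join(lines)


def ocr_page(image_path: str, lang: str = "eng+gle", psm: int = 3) -> str:
    """Run Tesseract on one page image and strip leading garbage.

    Imports are local so the text helpers above work without Tesseract.
    """
    import pytesseract
    from PIL import Image

    raw = pytesseract.image_to_string(
        Image.open(image_path), lang=lang, config=f"--psm {psm}"
    )
    return drop_leading_garbage(raw)
```

Running `ocr_page` on the same scans with `lang="eng"`, `lang="gle"` and `lang="eng+gle"` and comparing the results would give some evidence for or against the "more languages, more errors" assumption.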

Post-processing (if not configurable in the processor):

* Whitespace and hyphenation: at least normalise the line breaks to paragraph flow.
* Quality assessment: should there be a signal-to-noise threshold (e.g. a high garbage rate means the data is dropped)?
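
A minimal sketch of both post-processing steps using only the standard library, as a baseline before bringing in an NLP library. All function names and the `max_noise` threshold of 0.3 are hypothetical. One caveat: the de-hyphenation rule joins every word split across a line break, so a genuine hyphen that happens to fall at a line end (e.g. "post-" / "processing", or Irish forms such as "t-") is lost as well.

```python
import re
import string


def normalise_paragraphs(text: str) -> str:
    """Rejoin hyphenated line breaks and reflow each paragraph onto one line."""
    # "hyphen-\nation" -> "hyphenation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; all other whitespace collapses to a space.
    paragraphs = re.split(r"\n\s*\n", text)
    flowed = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in flowed if p)


def is_wordlike(token: str) -> bool:
    """A token is word-like if, after trimming punctuation, it is a number
    or consists of letters, apostrophes and hyphens."""
    core = token.strip(string.punctuation + "“”‘’")
    if not core:
        return False
    if core.isdigit():
        return True
    return all(c.isalpha() or c in "'’-" for c in core)


def noise_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are not word-like."""
    tokens = text.split()
    if not tokens:
        return 1.0
    noisy = sum(1 for t in tokens if not is_wordlike(t))
    return noisy / len(tokens)


def keep_text(text: str, max_noise: float = 0.3) -> bool:
    """Decide whether OCR output is clean enough to store (threshold is a guess)."""
    return noise_ratio(text) <= max_noise
```

A character-level ratio is a crude signal; a dictionary or language-model check from an NLP library would likely separate real words from plausible-looking noise better.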


aonrud added a new dependency

2025-07-19 08:21:57 +00:00

#18 Text content and OCR

aonrud referenced this issue

2025-07-19 08:23:03 +00:00

Text content and OCR #18

aonrud removed a dependency

2025-07-19 08:23:12 +00:00

#18 Text content and OCR