We are pleased to announce that, after almost two years of work, the digitisation of the Budapest Commodity and Stock Exchange Price Quotation Journals, published between 1894 and 1913, was completed in October 2018.
The beginnings
The idea of processing historical stock market data first occurred to Márton Radnai, CEO of Ramasoft Data Services and Information Technology Ltd., almost 10 years ago, when he learned that daily US stock market data was available going back almost 200 years. That was when he conceived of doing the same for Hungary. At the time, however, neither market nor public funding could be found for the project.
Over the past 10 years, however, the cost of digitising old journals has fallen significantly, and Arcanum Database Ltd. digitised one of the stock market price sources, the Budapest Gazette, at its own expense and made it available in the Arcanum Digital Library. So two years ago, Márton Radnai decided to start the project with his own funding.
Finding sources
The project started with a search for available data sources. Until the First World War, the prices of the old stock exchange were published in three journals: the Budapest Commodity and Stock Exchange Price Quotation Journal, published almost without interruption between 1864 and 1948; the Budapest Gazette (the official state gazette of the time); and Pester Lloyd. Of the three, the Budapest Gazette provided only partial data (closing prices only, no prices for individual deals) and Pester Lloyd was in German. After the First World War, the data were published for a while only in the Quotation Journal, and later again in the Budapest Gazette, but only in a heavily abridged form. It became clear that the best solution would be to digitise the Quotation Journal.
Print runs of the Quotation Journal were small, so very few copies have survived: only a few Hungarian libraries hold them, and some volumes are available in the Austrian National Library. Of these, the Budapest Collection of the Szabó Ervin Library in Budapest (where the years 1873-1913 were available) agreed to make its holdings available for digitisation. The digitisation was carried out by Arcanum Database Ltd. Negotiations between the three partners were finalised by April 2017, and digitisation began.
Photography
Because the print quality of the 1873 to 1893 volumes was not yet good enough to allow later computer processing of the data, and because the data content of the price lists for this period was essentially identical to that of the already digitised Budapest Gazette, the project sponsor Ramasoft decided to digitise only the 1894 to 1913 period in order to keep the project cost-effective.
Two techniques were used. The 1894-1904 volumes were photographed with photographic equipment, while the later years, in which the price sheet doubled in size, were captured with a so-called map scanner.
Pre-processing of images
After the images were captured, Arcanum handed them over to Ramasoft. To improve the quality of the subsequent optical character recognition, Ramasoft developed special software that performed two functions. First, it automatically straightened the price tables and corrected their trapezoidal distortion, so that the rows of each table were exactly horizontal, which is a prerequisite for efficient character recognition. Second, it transformed the tables to the same pixel position for every day, so that a single template could be set in the optical recognition software for processing them.
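The kind of geometric correction described above can be illustrated with a minimal sketch. It is not Ramasoft's software, whose internals are not described here. It assumes that the four corners of a price table have already been located on the photograph and computes the projective transform that maps them onto a fixed template rectangle, which both removes the trapezoidal distortion and puts every day's table at the same pixel position. The function names and the corner coordinates are hypothetical.

```python
def solve_linear(a, b):
    """Solve the linear system a * x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("corner points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 8 coefficients of the projective map taking src[i] to dst[i].

    src and dst are lists of four (x, y) points each.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def warp_point(h, x, y):
    """Apply the projective map h to a single pixel coordinate."""
    w = h[6] * x + h[7] * y + 1
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical table corners as found on a skewed photograph
# (top-left, top-right, bottom-right, bottom-left) ...
photo_corners = [(12, 8), (205, 15), (198, 310), (5, 300)]
# ... and the fixed template rectangle every page is mapped onto.
template_corners = [(0, 0), (200, 0), (200, 300), (0, 300)]
h = homography(photo_corners, template_corners)
```

A full implementation would resample every pixel of the page through this map. Using one fixed template rectangle for all pages is what allows a single recognition template to be reused from day to day.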
However, the character recognition result was still not satisfactory, as the recognition software could not separate the rows of the tables. Therefore, in a second step, dividing lines were drawn onto the table images. The tables could then be recognised with sufficient quality.
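One simple way to place such dividing lines, shown here only as a sketch and not necessarily the method Ramasoft used, is to look for blank pixel rows between lines of text and draw a separator in the middle of each gap. The example assumes the straightened page is available as a binary image, represented as a list of rows in which 1 means ink and 0 means paper. The function names are hypothetical.

```python
def find_row_gaps(image, max_ink=0):
    """Return (start, end) pairs of blank pixel-row runs lying between text rows.

    A pixel row counts as blank when it holds at most max_ink ink pixels.
    Blank margins above the first and below the last text row are ignored.
    """
    gaps = []
    start = None
    seen_text = False
    for y, row in enumerate(image):
        if sum(row) <= max_ink:
            if start is None:
                start = y  # a blank run begins here
        else:
            if start is not None and seen_text:
                gaps.append((start, y - 1))  # blank run enclosed by text
            start = None
            seen_text = True
    return gaps


def draw_separators(image):
    """Return a copy of the image with a full-width line in each gap."""
    out = [row[:] for row in image]
    width = len(image[0])
    for start, end in find_row_gaps(image):
        out[(start + end) // 2] = [1] * width  # line in the middle of the gap
    return out
```

This only works once the table rows are exactly horizontal, which is why the straightening step has to come first. Scans with speckle noise would need a small positive `max_ink` tolerance.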
Optical character recognition
For optical character recognition, Ramasoft used the ABBYY FineReader application. Recognition was done in two phases. In the first phase, the straightened and trapezoid-corrected images were converted into a so-called two-layer PDF (the page image plus a text layer), which allows the pages to be read and searched later.
In the second phase, the version with the drawn table lines was run through character recognition and the data were exported to Microsoft Excel.
Data verification and correction
The raw Excel output was then organised into a database. To do this, we built a master list of securities and a dictionary that matched the names of the securities as printed in the journal with the names in the master list. A major difficulty was that in many cases securities were named by an abbreviation, which made this matching difficult. In addition, the price data had to be cleaned and formatted (for example, the quotations of the time did not have decimal separators).
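The matching and cleaning step can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the alias dictionary entries and security identifiers are invented placeholders, and the rule that the last two digits of a quotation are the fractional part is an assumption made only for the example, since the real convention is not stated here.

```python
import re
from decimal import Decimal

# Hypothetical dictionary: normalised printed name -> identifier in the master list.
ALIASES = {
    "sample savings bank": "SAMPLE_SAVINGS_BANK",
    "sample sav bank": "SAMPLE_SAVINGS_BANK",
    "example railway": "EXAMPLE_RAILWAY",
    "ex rlwy": "EXAMPLE_RAILWAY",
}


def normalize_name(printed):
    """Lower-case a printed name, drop punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", printed.lower())
    return " ".join(cleaned.split())


def lookup_security(printed):
    """Return the master-list identifier, or None so the name can be reviewed by hand."""
    return ALIASES.get(normalize_name(printed))


def parse_price(raw, decimals=2):
    """Turn a digit string without a decimal separator into a Decimal.

    Assumption for illustration only: the last `decimals` digits are the
    fractional part. Stray non-digit characters from OCR are discarded.
    """
    digits = re.sub(r"\D", "", raw)
    if not digits:
        return None
    return Decimal(digits) / (Decimal(10) ** decimals)
```

Returning `None` for unknown names rather than guessing keeps unmatched abbreviations visible, so they can be added to the dictionary manually.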
From the matched and cleaned data, another Excel file was created. Automatic control rules were run on this file, and it was also given to proofreaders, who compared it with the original documents and corrected the remaining errors manually.
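The actual control rules are not listed here, so the following is only a sketch of what such automatic checks might look like. The record layout (lowest, highest and closing price per security and day), the rule set and the 20% jump threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Quote:
    """Hypothetical record layout for one security on one trading day."""
    security: str
    date: str  # ISO date, e.g. "1900-01-02"
    low: Decimal
    high: Decimal
    close: Decimal


def check_quote(quote, previous_close=None, max_jump=Decimal("0.20")):
    """Return a list of rule violations; an empty list means nothing was flagged."""
    problems = []
    if min(quote.low, quote.high, quote.close) <= 0:
        problems.append("non-positive price")
    if quote.low > quote.high:
        problems.append("low above high")
    if not (quote.low <= quote.close <= quote.high):
        problems.append("close outside low-high range")
    if previous_close is not None and previous_close > 0:
        # A misread digit typically shows up as an implausibly large daily move.
        change = abs(quote.close - previous_close) / previous_close
        if change > max_jump:
            problems.append("large jump from previous close")
    return problems
```

Rules like these cannot prove a value correct. They only narrow down which cells a proofreader should look at first.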
Exporting data to a database
The last step was to export the corrected data to an SQL database.
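As an illustration of this final step, the sketch below loads corrected rows into a database using SQLite from the Python standard library. The database product actually used is not named here, and the table layout, column names and function name are assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: one closing price per security per day.
SCHEMA = """
CREATE TABLE IF NOT EXISTS quotes (
    security_id TEXT NOT NULL,
    quote_date  TEXT NOT NULL,  -- ISO date
    close_price TEXT NOT NULL,  -- stored as text to keep the exact decimals
    PRIMARY KEY (security_id, quote_date)
)
"""


def export_quotes(conn, rows):
    """Insert (security_id, quote_date, close_price) rows and return the row count.

    A row with an existing (security_id, quote_date) key replaces the old one,
    so re-running the export after a correction does not create duplicates.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(SCHEMA)
        conn.executemany(
            "INSERT OR REPLACE INTO quotes "
            "(security_id, quote_date, close_price) VALUES (?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
```

For example, `export_quotes(sqlite3.connect(":memory:"), rows)` loads the rows into a temporary in-memory database, which is convenient for trying out the schema before writing to a real file.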