Method for digitizing documents as seen in the videos above
The first step in digitizing the documents was to convert all photos into high-contrast black-and-white scans, which increases the machine-readability of the contained text. To represent the process visually, the Android application CamScanner was used. However, since mobile resolutions do not translate well to the big screen, and since the screen-capture software ecosystem is more mature on desktop operating systems, the process was carried out on a Windows computer running an Android emulator. In essence, an Android application was used on the computer while the process was screen-captured in high quality. The process involved manually marking the edges of each document and having the software separate it from its background and color-correct it accordingly.
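Although the conversion in the videos was performed by hand in CamScanner, a comparable high-contrast result can be approximated in a few lines of NodeJS. The sketch below is illustrative only: it assumes the sharp image library (which was not part of the original workflow) and hypothetical file paths.

    // Approximate a high-contrast black-and-white scan with the "sharp" library.
    // Illustrative sketch, not the tool used in the videos.
    const sharp = require('sharp');

    async function toHighContrastScan(inputPath, outputPath) {
      await sharp(inputPath)
        .greyscale()     // discard colour information
        .normalise()     // stretch the contrast across the full range
        .threshold(160)  // binarise: pixels above the cutoff become white
        .toFile(outputPath);
    }

    toHighContrastScan('photos/document-01.jpg', 'scans/document-01.png')
      .catch(console.error);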
After this, a NodeJS script was put together to extract the text of these high-contrast scans locally. The point was to eliminate the need for cloud processing so that the documents stay private during analysis. To perform the OCR (optical character recognition), the popular OCR framework Tesseract was used. The script first looked at the contents of a folder entitled “data” located alongside the code. It then listed all the images and randomized their order to add some unpredictable spontaneity to the process, after which it processed the images one by one, detecting lines, words, and individual letters. Since the algorithm only detects characters and has no knowledge of the correct spelling of words, some phrases may be rendered as nonsense. To mitigate these small errors in the detection of individual letters, the detected words were run through a spell-checker that corrects misinterpreted characters. The spell-checked words were then joined together to reconstruct the paragraphs as they appear in the original document. The final results were then saved to a text file with the same name as the corresponding original document.
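A minimal sketch of such a script follows, assuming the tesseract.js bindings for Tesseract. The “data” folder and the naming of the output files come from the description above, while the file filter and the shuffle are illustrative details; the spell-check pass is only marked as a comment, since any dictionary library could slot in there.

    // Sketch of the local OCR pipeline: list ./data, shuffle, recognise, save.
    const fs = require('fs');
    const path = require('path');
    const Tesseract = require('tesseract.js');

    const dataDir = path.join(__dirname, 'data');

    // List the scans and randomize their order for some spontaneity.
    const images = fs.readdirSync(dataDir)
      .filter((f) => /\.(png|jpe?g)$/i.test(f))
      .sort(() => Math.random() - 0.5); // naive shuffle, sufficient here

    async function run() {
      for (const image of images) {
        const { data } = await Tesseract.recognize(path.join(dataDir, image), 'eng');
        // Tesseract detects lines, words, and symbols; data.text joins the
        // detected words back into paragraphs. A spell-check pass over the
        // words would go here, before saving.
        const outName = image.replace(/\.[^.]+$/, '.txt');
        fs.writeFileSync(path.join(dataDir, outName), data.text);
        console.log(`Processed ${image} -> ${outName}`);
      }
    }

    run().catch(console.error);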
Experiment results
After hours of analysis, the script produced the intended results, with some exceptions. Since the documents are typewritten (analog), they have inconsistent word and character spacing, and the script sometimes split words or interpreted mid-word line-breaks as two distinct words. These inaccuracies are simple to detect and fix in a word processor, and we experimented with this; in the end, delivering the imperfect 'testimony' of the algorithm's image processing was the most interesting option, as this became, in a sense, a nonhuman testimony.
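To illustrate how trivial such a fix would be, the line below undoes the most common artefact, a word hyphenated across a line break. This is a hypothetical cleanup step shown for illustration only; it was deliberately not applied to the final work.

    // Rejoin words that were hyphenated across a line break, e.g. "docu-\nment".
    const rejoinSplitWords = (text) => text.replace(/(\w)-\s*\n\s*(\w)/g, '$1$2');

    console.log(rejoinSplitWords('declassified docu-\nment')); // "declassified document"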
To represent the process of character recognition visually, the script continuously announces its progress and displays the resulting text before saving it into a text file and moving on to the next document.
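In tesseract.js, this continuous announcement maps onto the logger callback, which reports a status string and a progress fraction for each stage of recognition; a brief sketch, with a hypothetical file name:

    // Announce OCR progress continuously, then display the recognised text.
    const Tesseract = require('tesseract.js');

    Tesseract.recognize('data/document-01.png', 'eng', {
      logger: (m) => console.log(`${m.status}: ${Math.round((m.progress || 0) * 100)}%`),
    }).then(({ data }) => {
      console.log(data.text); // show the result before saving and moving on
    });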
The scenes in Venezuela from Testimony X-2 (2018) recount a 1960s narrative of geopolitical violence, while the camera work, preserved in the editing without corrections, discloses a constantly shifting sensuous environment of textures and lives. The work attunes to what is nascent and hopeful, to unfoldings that emancipate the story of death and disappearance. As we hear the voice of a man recount, from recently declassified documents, the implication of the Venezuelan and US governments in state violence, the images attune, in a haptic description, to lively processes that provide a counterpoint and subvert narratives of victimization. The moving image intends to bring attention to something that has remained dynamic and vibrant, something that seems to evoke a poetics of testimonies from human and nonhuman witnesses to a made-to-disappear history.
The recording captures images where different things meet, suggesting that “new forms of life” (Thrift 2007) could together build further meaning: a sinuous path with rocks and tall grass, the background mountains, a chicken passing between a courtyard and a door, the moving waves of the river, a shovel between the ground and the water. The sense of moving through the maze of the investigation is mediated by the relations between me, the camera, the environments, and the human and nonhuman witnesses.
Experiment objectives
Because the organization of critical information has been a difficult task, I wanted to explore how a nonhuman witness, in this case a computer algorithm, might interpret and reclassify the information. In order to better explore the content of the declassified documents obtained at the Washington, DC archives, the photographs taken of them first needed to be machine-readable. The primary objective of this process was to turn the text of the documents into a searchable format, making the documents more accessible for fact-finding. The secondary objective, however, was to be able to manipulate the text: to extract pieces of it, to synthesize new text from it, or to creatively rework the content of the documents in other ways.