Inception

The problem

Hoarding is a disorder characterized by difficulty in parting with possessions. It was once considered a symptom of obsessive-compulsive disorder, a mental and behavioral disorder that affects between 2% and 4% of the general population. In 2013, hoarding disorder was classified as a separate condition in the Diagnostic and Statistical Manual of Mental Disorders. Digital hoarding is just “a new version of an old psychological challenge,” Dr. Maidenberg says. With digital hoarding, however, the act of saving the file becomes an uncontrollable urge. [source]

Hoarding of digital documents leads to forgetting what is in the hoard.

Hoarding in shared directories causes security and privacy issues.

Digital theft often occurs with access to digital hoards within organizations.

Use after theft can only be stopped through mass encryption of shared directories.

Big Data Hoarding

Most web sites, applications (IoT, mobile, streaming and desktop), plus many operating systems, remotely hoard data.

Governments hoard through backbones and backdoors.

The rate of big data hoarding has been accelerating for years.

It cannot be stopped because of the wealth and control associated with big data and AI.

Most big data cannot be processed due to constraints on data mining, which makes the data close to useless as it ages.

Our solution

Inception rapidly transforms documents and raw data into other forms.

Inception is an automated data pre-processor that can greatly speed up data conversion.

A child could create the processing scripts used to transform data into a desired form.

Inception automates transforming Microsoft Office documents into words that are quickly recorded in a memory-based database.

After completion of processing, results are inserted into a relational database to allow searching of words and phrases to locate documents.

Evolution

Inception was created and enhanced over many years.

The process of transforming digital data into new forms is as old as the computer industry.

  1. Magazine editorial submissions into formats used to review, edit and typeset.
  2. Application used by Chrysler and Dylex to select and view word processing, spreadsheet and presentation documents.
    • Decoded file headers to determine file type, then decoded each file to present in a form that appealed to users.
  3. Large amounts of mainframe data from a magneto-optical drive to automate word and phrase indexing and presentation in English and French.
  4. Legacy DB2 and Informix RDBMS data before import into an Oracle RDBMS for the Siebel CRM.
  5. Thousands of Word and PDF documents into HTML and XML formats for technical and developer support sites.
  6. Messy XML to clean and DTD-validated XML for linguistic translation.
  7. Automated transformation of email real estate leads using 61 processing scripts to create SQL scripts for automated import into a MySQL database.
    • Signature analysis for 13 lead sources with many sub-types to select a processing script.
  8. Automated transformation of Microsoft Office Word docx, Excel xlsx and PowerPoint pptx documents to text.
    • Multi-threaded automation provided processing of complex directory chains along with individual documents.
    • Resulting UTF-8 text is broken into words with stop-words removed.
    • Pertinent information, including document size, timestamps, and word sequence and groups, is later written to a database.
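
The word-splitting step in item 8 can be illustrated with a minimal Python sketch. It is not Inception's C/C++ implementation: the stop-word list and the function name extract_words are assumptions made for the example, and the source does not say which stop-words Inception removes.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def extract_words(text):
    """Split text into lowercase words and drop stop-words.

    Each surviving word is returned with its position in the original
    word sequence, so word order can be recorded later.
    """
    words = re.findall(r"\w+", text.lower())
    return [(position, word)
            for position, word in enumerate(words)
            if word not in STOP_WORDS]
```

Keeping the original position alongside each word is one way to preserve the word sequence after stop-words are removed, which phrase searches would need.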

Refactored

Inception was refined to achieve the highest processing speed.

    Inception was refactored and largely rewritten in a selective mixture of C and C++.

    We could have quickly rewritten Inception in higher-level languages but chose to take the difficult route of coding in C and C++.

    We have a great deal of experience developing software in many higher-level languages.

    Refactoring Inception in any of the higher-level languages would have saved us a great deal of time.

    We developed in C to directly load and process files in memory because it is fastest and provides greatest control.

    We did not use C++ memory handling because it makes indirect use of memory which is slower than direct access.

    We did not use C++ vectors and other STL functions because they internally control memory allocation, expansion and deallocation which reduces processing speed.

    The speed difference between direct and indirect memory access becomes highly noticeable when processing millions of documents.

    We utilized the C++ object-oriented framework to make it easier to maintain Inception.

    We created Inception for speed and control rather than cobble together a series of open source applications.

    While making use of open source applications accelerates creation of products and services, it often causes more problems than it solves.

    The languages used to create open source applications are often poorly chosen.

    Reviewing the code bases of many open source products usually shows inefficiencies, flawed architecture, design and implementation, and dismal commenting.

    The Inception code base is well constructed, well commented and fairly easy to understand.

    We can enhance Inception to automate processing of other document types as needed in future.

The scripting command set has been refreshed and expanded.

A Linux API is provided through use of shared and static libraries to enable custom software to make use of Inception.

Async process threading and automated directory processing have been added.

A memory-based NoSQL engine was created and used to retain processed documents until all data can be retained in a relational database.

We make use of the memory-based NoSQL engine to accelerate the speed of document processing.
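
The idea of holding processed words in memory and writing them to a relational database afterwards can be sketched as follows. This is an illustration only: the class name MemoryIndex, the table name word_index and the use of SQLite are assumptions, and Inception's own NoSQL engine is written in C/C++ with a design the source does not describe.

```python
import sqlite3
from collections import defaultdict


class MemoryIndex:
    """In-memory word store that is flushed to a relational database later."""

    def __init__(self):
        # word -> list of (document path, word position)
        self.postings = defaultdict(list)

    def add_document(self, path, words):
        for position, word in enumerate(words):
            self.postings[word].append((path, position))

    def flush(self, connection):
        """Write everything held in memory to the database in one batch."""
        connection.execute(
            "CREATE TABLE IF NOT EXISTS word_index "
            "(word TEXT, path TEXT, position INTEGER)")
        rows = [(word, path, position)
                for word, entries in self.postings.items()
                for path, position in entries]
        connection.executemany(
            "INSERT INTO word_index VALUES (?, ?, ?)", rows)
        connection.commit()
        self.postings.clear()
        return len(rows)


def find_documents(connection, word):
    """Return the paths of all documents that contain the word."""
    cursor = connection.execute(
        "SELECT DISTINCT path FROM word_index WHERE word = ? ORDER BY path",
        (word,))
    return [row[0] for row in cursor]
```

Batching the inserts at the end keeps database writes out of the per-document processing path, which is the speed benefit the text describes.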

Automation

Inception automates document transformation.

Files and directory chains that are transferred to an Inception input queue are automatically processed in FIFO sequence, then written to an output queue.

Automatic detection of file type, then document type, results in correct processing.
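
A two-stage check of this kind might look like the Python sketch below. It assumes the Office Open XML case only: docx, xlsx and pptx files are ZIP archives, so the file header is checked first and the document type is then read from the archive contents. The function name detect_document_type is hypothetical, and Inception's own detection logic is not described in the source.

```python
import io
import zipfile

# Local file header signature found at the start of a non-empty ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"

# Main part that identifies each Office Open XML document type.
DOCUMENT_PARTS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def detect_document_type(data):
    """Return 'docx', 'xlsx', 'pptx', or None if the bytes are unsupported."""
    # Stage 1: file type, decided from the header bytes.
    if not data.startswith(ZIP_MAGIC):
        return None
    # Stage 2: document type, decided from the archive member names.
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None
    for part, document_type in DOCUMENT_PARTS.items():
        if part in names:
            return document_type
    return None
```

Checking content rather than the file extension means a renamed or mislabeled file is still routed to the correct processing script, or rejected.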

High per-document processing speed reduces the total number of threads required.

Shared features of Kryptera

Files and directory chains transferred into an input queue are fully written before they are moved to thread processing space.

Detecting open file handles associated with a file or directory chain is critical for automating document processing.

Without validation, files and directories could be moved while still being written.

This problem was discovered while developing the Kryptera HSM.

Tracking down a solution required R&D that was specific to Linux.

A variable delay is placed after a file or directory chain is completely written.

The delay is only needed if cached writes occur after file writes are complete.
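
One Linux-specific way to check for open file handles is to scan the descriptor links under /proc, as in the sketch below. This is an assumption about how such a check could be done, not a description of the Kryptera or Inception code. The function name has_open_handles is hypothetical, and the scan only sees processes whose /proc entries the caller is allowed to read.

```python
import os


def has_open_handles(path):
    """Return True if a visible process holds an open descriptor on path.

    For a directory, descriptors on any file below it also count.
    Linux only: relies on the /proc/<pid>/fd symbolic links.
    """
    target = os.path.realpath(path)
    prefix = target + os.sep
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        fd_dir = os.path.join("/proc", pid, "fd")
        try:
            descriptors = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or is owned by another user
        for fd in descriptors:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while scanning
            if link == target or link.startswith(prefix):
                return True
    return False
```

A queue watcher could call this repeatedly and move the file or directory chain only after it returns False and the variable delay has elapsed.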


Operating Platform

Inception builds and runs on Debian and related trees such as Ubuntu and Devuan, plus Red Hat, CentOS and Fedora.

Inception will run as an initd or systemd service, a standalone service-like application, or a standalone application to create processing scripts.

Inception shared and static libraries can be used by custom software to automate document and data transformation.

We chose to develop Inception to operate under Linux rather than Windows to ensure the highest processing speed and stability.

Configuration

Inception service startup flags.

./InceptionApplication -folder_input input/ -folder_output output/ -folder_process process/ -folder_script script/ -folder_error error/ -folder_log log/

This starts Inception as a standalone application which can also be used as a systemd service. -start_daemon can be passed to start Inception as a background initd daemon (service). The design allows shared directories to be used for input and output queues, and allows use of higher-speed storage for thread-processing space.

Script Creation

Inception Script Creation Flags:

./InceptionApplication -test_source path/source_file -test_target path/target_file -test_script path/script_file

The flags are used to test creation or refinement of a processing script.

Load the source, script and target files into a text editor.

Change the script then start command line processing to update the target for review.

Scripting

The key feature of Inception lies with the user-defined processing scripts used to automate transformation of documents into desired formats.

The Inception scripting language contains many commands separated into groups:
  1. Beautify
    • BeautifyXML [1 of 2]
      • Format code, indented with tabs.
      • Syntax: BeautifyXML
      • Example: BeautifyXML
    • BeautifyXML [2 of 2]
      • Format code, indented with tabs and then remove the space between <End> and </End> and <A*> and </A> tags.
      • Syntax: BeautifyXML|FIX_END|
      • Example: BeautifyXML|FIX_END|
  2. Change
    • ChangeTag
      • Replace chosen field (1 is tag1 else tag2) with replacement after locating tag1 then tag2 in sequence.
      • Syntax: ChangeTag|1 for tag1|tag1|tag2|replacement
      • Example: ChangeTag|1|<div>|<a|<div class="foo">
    • ChangeWrappedString
      • Find signature, scan for opening and closing double quotes, then replace the text between the quotes with replacement.
      • Syntax: ChangeWrappedString|signature|replacement
      • Example: ChangeWrappedString|Good|Evil
  3. Clean
    • CleanXML
      • Remove tabs, carriage returns, line feeds and hidden chars when writing final output. Note that CleanXML has a higher priority than BeautifyXML: CleanXML is applied even if both CleanXML and BeautifyXML are declared in the same script file.
      • Syntax: CleanXML
      • Example: CleanXML
  4. Conceal
    • ConcealBlankTags
      • Hide passed tag sequence if there is nothing between the tags.
      • Syntax: ConcealBlankTags|tag_open|tag_close|
      • Example: ConcealBlankTags|<Note_Heading>|</Note_Heading>|
    • ConcealSpecialTags
      • Hide tags and the data between in a way that finds tags that contain carriage returns/line feeds between elements.
      • Syntax: ConcealSpecialTags|tag_open|tag_contains|tag_close|insert_to_start_of_buffer
      • Example: ConcealSpecialTags|<TITLE>||</TITLE>|
      • Example: ConcealSpecialTags|<A |ID=|</A>|
  5. Confirm
    • ConfirmField
      • Validates a field to ensure it exists, is set as a name => value pair in PHP array format, with leading and trailing characters verified to ensure it is correct format to be loaded and processed within another PHP script.
      • Syntax: ConfirmField|String|
      • Example: ConfirmField|"message"|
  6. Correct
    • CorrectQP
      • Repair quoted-printable 7-bit email encoding.
      • Syntax: CorrectQP
      • Example: CorrectQP
  7. Decode
    • DecodeBase64
      • Decodes the first base64 attachment found in the email file buffer and places it at the end of the buffer where it leads with "^{_BASE64_START_}^" ending at "^{_BODY_END_}^"
      • Syntax: DecodeBase64
      • Example: DecodeBase64
  8. Eliminate
    • EliminateBinary
      • Force deletion of the passed binary data where the second string of three is one, two or three 8-bit binary values represented in hexadecimal format. Each hex value must take the form of ‘0xFF’ where ‘0x’ is the hex prefix and ‘FF’ can be any hex value from ‘00’ to ‘FF’. Up to three hex values can be passed in the format of ‘0xFF0xFF0xFF’.
      • Syntax: EliminateBinary|tag_open|hex_binary|tag_close|
      • Example: EliminateBinary|<data>|0xA90xB80xC7|</data>|
    • EliminateBytes
      • Delete all occurrences of 8-bit binary values represented in hexadecimal format. Each hex value must take the form of ‘XX’ where ‘XX’ is any hex value from ‘00’ to ‘FF’.
      • Syntax: EliminateBytes|pairs_of_hex_values|
      • Example: EliminateBytes|0D0A|
    • EliminateContent
      • Delete the first block of data that starts with from and ends with to, where to is retained if 0 is passed.
      • Syntax: EliminateContent|from|to|0|
      • Example: EliminateContent|</td>|^{_BODY_END_}^|0|
    • EliminateContentAll
      • Delete all data that starts with from and ends with to, where to is retained if 0 is passed.
      • Syntax: EliminateContentAll|from|to|0|
      • Example: EliminateContentAll|</td>|^{_BODY_END_}^|0|
    • EliminateLFs
      • Removes line terminators.
      • Syntax: EliminateLFs
      • Example: EliminateLFs
    • EliminateString
      • Delete all occurrences of passed string from buffer.
      • Syntax: EliminateString|string_to_delete|
      • Example: EliminateString|Yadda|
    • EliminateTag
      • Delete data that starts with tag_open and, if no closing string is passed, ends with the next '>'.
      • Syntax: EliminateTag|tag_open|
      • Example: EliminateTag|<Dash1List>|
    • EliminateTag2
      • Locate exact match to tag_open, scan for exact match to tag_next_to_delete, and then delete tag_next_to_delete. Note that this is potentially dangerous in that tag_open and tag_next_to_delete could be separated in context and result in invalid data deletion. It also has limited use in that it could leave a mess behind of deleted opening tags with left over closing tags.
      • Syntax: EliminateTag2|tag_open|tag_next_to_delete|
      • Example: EliminateTag2|<Body>|<Bold>|
  9. Preserve
    • PreserveMemory
      • Save file memory buffer to a file. This command is useful when creating a new processing script because it can save the file buffer at any stage of processing.
      • Syntax: PreserveMemory
      • Example: PreserveMemory
  10. Provisional
    • ProvisionalUpdate
      • If tag_find not found then insert it before tag_after.
      • Syntax: ProvisionalUpdate|tag_find|tag_after|
      • Example: ProvisionalUpdate|</Module_Divider>|<Section>|
  11. Put
    • PutBetweenTags
      • Locate exact match to tag_open, scan for exact match to tag_next, and then insert tag_to_insert_between_open_and_next between tag_open and tag_next.
      • Syntax: PutBetweenTags|tag_open|tag_next|tag_to_insert_between_open_and_next|
      • Example: PutBetweenTags|<Table>|<ROW>|<Table_Body>|
    • PutBinaryPostfix
      • Append passed string with Adobe InDesign-specific binary line feed data. See the notes in PutBinaryPrefix.
      • Syntax: PutBinaryPostfix|tag|
      • Example: PutBinaryPostfix|</Table>|
    • PutBinaryPrefix
      • Prepend passed string with Adobe InDesign-specific binary line feed data. Note that the binary data is embedded to force InDesign to drop line feeds after tag closure. This is only needed when the InDesign tag formatting does not specifically call for a line feed to be dropped after a tag closes. It would be best to avoid using PutBinaryPrefix and PutBinaryPostfix by handling all line feeds through tag formatting within InDesign.
      • Syntax: PutBinaryPrefix|tag|
      • Example: PutBinaryPrefix|</Title>|
    • PutPostfix
      • Insert a string at the end of the file memory buffer.
      • Syntax: PutPostfix|String|
      • Example: PutPostfix|EOF|
    • PutPrefix
      • Insert a string at the start of the file memory buffer.
      • Syntax: PutPrefix|String|
      • Example: PutPrefix|<php $array = [||
  12. Remove
    • RemoveBetween
      • Remove data between START and END.
      • Syntax: RemoveBetween|Start|End|
      • Example: RemoveBetween|<t|>|
    • RemoveWithout
      • If Find is not found in the memory buffer then replace all memory buffer content with Replace.
      • Syntax: RemoveWithout|Find|Replace|
      • Example: RemoveWithout|<a:t>|>|
    • RemoveWrapper
      • Purge <tag_open> and </tag_open> if located on tag_level and followed by tag_after at tag_level+1.
      • Syntax: RemoveWrapper|tag_level|tag_open|tag_after|
      • Example: RemoveWrapper|3|<Dash1>|<Dash1>|
  13. Repair
    • RepairDoubleSection
      • Purge opening and closing <Section> blocks from blocks that start with <Section></Section>.
      • Syntax: RepairDoubleSection
      • Example: RepairDoubleSection
    • RepairSymbols
      • Translate Microsoft Word and Windows symbol chars to escaped values.
      • Syntax: RepairSymbols
      • Example: RepairSymbols
  14. Set
    • SetClosingTag
      • Locates tag_open and tag_close when they are both positioned at the same tag level, and then replaces tag_close with new_tag_close.
      • Syntax: SetClosingTag|tag_open|tag_close|new_tag_close|
      • Example: SetClosingTag|<Cell_Number1_First>|</Cell_Number1_Next>|</Cell_Number1_First>|
    • SetFieldDelimiter
      • Set field delimiter within curly brackets
      • Syntax: SetFieldDelimiter{delchar}
      • Example: SetFieldDelimiter{|}
  15. Swap
    • SwapAtNestedLevel
      • Substitute tag_open located at tag_level with new_tag_open, and then replace the matching closing tag with new_tag_close. Note that if before is passed then it has to exist before tag_open for the changes to be made.
      • Syntax: SwapAtNestedLevel|tag_level|before|tag_open|new_tag_open|new_tag_close|
      • Example: SwapAtNestedLevel|2||<ExerciseNumber>|<Number1_Next>|</Number1_Next>|
      • Example: SwapAtNestedLevel|9|<Table_Cell>|<Body>|<Cell_Body>|</Cell_Body>|
      • Example: SwapAtNestedLevel|1||<A ID=|||
    • SwapNested
      • Complex search and substitution for data nested from two to three tag levels.
      • Syntax: SwapNested|sig_tag_root|sig_nested_1_tag|sig_nested_2_tag|sig_tag_close|replace_open|replace_close|
      • Example: SwapNested|<c props=|font-family:Arial; font-weight:bold||</c>|<Bold>|</Bold>|
      • Example: SwapNested|<c props=|font-size:12pt; font-family:Times New Roman||</c>|||
    • SwapNext
      • Change the first occurrence of from to to, searching from the start of the file buffer.
      • Syntax: SwapNext|from|to|
      • Example: SwapNext|center;">|"rental_type" => "|
    • SwapOutward
      • Search for primary opening and closing tags. If found, search backward and forward for secondary tags. If found, perform substitution. Why? Because some tags are so generic that SwapNested fails.
      • Syntax: SwapOutward|tag_open|tag_close|previous_tag_open|previous_tag_close|replace_open|replace_close|0=Do not extract text, 1=Extract text|
      • Example: SwapOutward|<image |/>|<p style="Normal"|</p>|<Image href="Images/IMAGE.gif">|</Image>|0|
    • SwapStrings
      • Change all occurrences of from to to.
      • Syntax: SwapStrings|from|to|
      • Example: SwapStrings|=F0=9F=99=8F| |
    • SwapTags
      • Swap sig_tag_open and sig_tag_close with replace_open and replace_close, keeping the data between.
      • Syntax: SwapTags|sig_tag_open|sig_tag_close|replace_open|replace_close|
      • Example: SwapTags|<c props="lang:en-US; font-size:24pt; font-family:Arial">|</c>||
      • Example: SwapTags|<p style="Normal"|</p>|<BodyText>|</BodyText>|
  16. Transfer
    • TransferBlock
      • Locate <tag_open> and </tag_open> located at tag_level_from.
        • If before is populated then determine if it precedes <tag_open> one level before.
        • Do not make any changes if it does not.
        • Extract the data between <tag_open> and </tag_open>,
          hide <tag_open>data</tag_open>,
          move down till Tag Level == tag_level_to
          then start a new block using the passed parameters
          making sure to include the extracted data.
        • Note: tag_open does not have to have a leading '<'.
      • Syntax: TransferBlock|tag_level_from|tag_level_to|before|tag_open|replace_open|replace_close|
      • Example: TransferBlock|6|1|<Letter1>|<Table>|<Section><Table>|</Table></Section>|
  17. Transform
    • TransformLFs
      • Change Linefeed (0x0A) and Carriage Return (0x0D) ASCII values to "^LF^" and "^CR^"
      • Syntax: TransformLFs
      • Example: TransformLFs
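
The command syntax listed above (a command name followed by delimiter-separated fields) can be parsed with a few lines of Python. The sketch below is a guess at the parsing rules based only on the syntax and example lines shown: it assumes one command per line, "|" as the default delimiter, SetFieldDelimiter taking effect for the lines after it, and a single trailing delimiter closing the last field. The function name parse_script is hypothetical.

```python
def parse_script(text):
    """Parse script text into a list of (command, fields) tuples."""
    delimiter = "|"  # assumed default field delimiter
    commands = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("SetFieldDelimiter{") and line.endswith("}"):
            # Switch delimiter for the lines that follow; ignore an empty one.
            delimiter = line[len("SetFieldDelimiter{"):-1] or delimiter
            continue
        # Lines are not stripped: leading spaces inside fields are significant.
        name, *fields = line.split(delimiter)
        if fields and fields[-1] == "":
            fields.pop()  # one trailing delimiter closes the last field
        commands.append((name, fields))
    return commands
```

Fields are kept exactly as written because several of the documented examples, such as the " _TE_" marker, rely on a leading space.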

Microsoft Office Scripting

A five-line script using three commands transforms Word XML documents to text:
EliminateContent|<?xml|<w:t>|1|
SwapStrings|</w:t>| _TE_|
SwapStrings|<w:t>|_TB_|
EliminateContentAll|_TE_|_TB_|1|
EliminateContent| _TE_|</w:document>|1|
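
To show what the five lines above do to a buffer, the following Python sketch emulates the three commands. It rests on an interpretation of the command descriptions: a final field of 1 is assumed to delete the to string as well (the descriptions only state that 0 retains it), EliminateContent is assumed to act on the first match, and EliminateContentAll on every match. The function names are hypothetical and this is not Inception's implementation.

```python
def eliminate_content(buf, start, end, remove_end):
    """Delete the first span from start up to end (end included if remove_end)."""
    i = buf.find(start)
    if i < 0:
        return buf
    j = buf.find(end, i + len(start))
    if j < 0:
        return buf
    stop = j + len(end) if remove_end else j
    return buf[:i] + buf[stop:]


def eliminate_content_all(buf, start, end, remove_end):
    """Delete every span from start up to end, scanning left to right."""
    pos = 0
    while True:
        i = buf.find(start, pos)
        if i < 0:
            return buf
        j = buf.find(end, i + len(start))
        if j < 0:
            return buf
        stop = j + len(end) if remove_end else j
        buf = buf[:i] + buf[stop:]
        pos = i


def swap_strings(buf, old, new):
    """Replace all occurrences of old with new."""
    return buf.replace(old, new)


def word_xml_to_text(xml):
    """Apply the five script lines for Word XML in order."""
    buf = eliminate_content(xml, "<?xml", "<w:t>", True)
    buf = swap_strings(buf, "</w:t>", " _TE_")
    buf = swap_strings(buf, "<w:t>", "_TB_")
    buf = eliminate_content_all(buf, "_TE_", "_TB_", True)
    return eliminate_content(buf, " _TE_", "</w:document>", True)
```

Under these assumptions, the space placed before each _TE_ marker is what separates the text runs in the output, and the last command strips the final marker together with the closing markup.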

A six-line script using four commands transforms Excel XML documents to text:
RemoveBetween|<t|>|
EliminateContent|<?xml|<t>|1|
SwapStrings|</t>| _TE_|
SwapStrings|<t>|_TB_|
EliminateContentAll|_TE_|_TB_|1|
EliminateContent| _TE_|</sst>|1|

A six-line script using four commands transforms each PowerPoint XML slide to text:
RemoveWithout|<a:t>|<?xml><a:t></a:t></p:sld>|
EliminateContent|<?xml|<a:t>|1|
SwapStrings|</a:t>| _TE_|
SwapStrings|<a:t>|_TB_|
EliminateContentAll|_TE_|_TB_|1|
EliminateContent| _TE_|</p:sld>|1|

Application Programming Interface (API)

Five API functions are provided that enable custom software to automate document transformation.

The functions differ on data passed to and from the API.

API naming is based on the International Civil Aviation Organization (ICAO) alphabet.

Virtualized

Inception can be deployed in a virtualized environment such as a Docker container.

Docker containers are easy to create, configure and use.

Docker supports Volumes which enable use of Inception input and output queues.

Kubernetes can be used to control one or more Docker containers.

Processing Speed

Test system: Intel Core i3-4150 (2 physical cores, 4 hardware threads) @ 3.50 GHz / 800 MHz,
1 TB 7200 RPM SATA HDD, limit of 8 async threads.

Directory Processing Steps

  1. Detect directory in the input queue.
  2. Wait until the directory chain has been fully written.
  3. Move directory chain to thread processing storage.
  4. Walk through the directory chain to create and write lists of files and sub-directories.
  5. Cycle through a list of directories:
    • Create a mirror of the directory chain in processing storage.
  6. Cycle through a list of files in the directory chain.
    • Check the file header and ignore the file if it is not supported.
    • Get next available thread identifier.
    • Decompress file to thread processing space.
    • Get document type.
      • If not supported then log, clean up, release thread id, and ignore.
    • Start a process specific to the document type:
      • Start an async thread:
        • Load document into memory, process into text, write to mirror chain.
  7. On completion, move directory mirror to the output queue, clean up, then exit.
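
Steps 4 to 6 above can be sketched in Python as follows. This is a simplified illustration, not the C/C++ implementation: it selects files by extension instead of checking file headers, skips decompression, and takes the conversion itself as a caller-supplied function. The names process_directory, transform and SUPPORTED_EXTENSIONS are assumptions, and the default thread limit of 8 mirrors the limit quoted under Processing Speed.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Simplification: the real check is on the file header, not the extension.
SUPPORTED_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def process_directory(source_root, mirror_root, transform, max_threads=8):
    """Mirror a directory chain and convert each supported file to text.

    transform is a caller-supplied function taking the file bytes and
    returning text. Returns the sorted relative paths that were processed.
    """
    # Step 4: walk the chain and build lists of directories and files.
    directories, files = [], []
    for current, _subdirs, names in os.walk(source_root):
        relative = os.path.relpath(current, source_root)
        directories.append(relative)
        files.extend(os.path.normpath(os.path.join(relative, name))
                     for name in names)

    # Step 5: create a mirror of the directory chain.
    for relative in directories:
        mirror_dir = os.path.normpath(os.path.join(mirror_root, relative))
        os.makedirs(mirror_dir, exist_ok=True)

    def convert(relative_path):
        # Load the document into memory, process it, write to the mirror.
        with open(os.path.join(source_root, relative_path), "rb") as handle:
            text = transform(handle.read())
        target = os.path.join(mirror_root, relative_path + ".txt")
        with open(target, "w", encoding="utf-8") as out:
            out.write(text)
        return relative_path

    # Step 6: ignore unsupported files, process the rest on a bounded pool.
    supported = [path for path in files
                 if os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS]
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return sorted(pool.map(convert, supported))
```

Building the mirror before any file is processed means worker threads never race to create the same output directory.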

Our Competition

OpenRefine

Trifacta Wrangler

Drake

TIBCO Clarity

WinPure

Data Ladder

Quadient DataCleaner

Cloudingo

Reifier

IBM InfoSphere QualityStage

Inception is designed to fully automate conversion of documents into desired formats.

This includes automated processing of directory chains.

Inception runs unattended, 24/7, without manual intervention.

None of the competing products provide similar features.

All of them are complex to set up and use.

Summary

Inception solves data hoarding issues through rapid automation of document transformation.

Transformed Office documents are parsed into separate words, which are later imported into a database for indexing.

Automated processing of shared directory chains ensures a faster return on investment for clients.


Contact us today.