Inception (noun) an event that is a beginning; a first part or stage of subsequent events

Two problems addressed by Inception

Data harvesting will keep increasing for years to come.

  Harvested data is mostly stored in non-relational databases in cloud space.
  Most harvested data goes unused because data mining software is either unavailable or too slow.
  Private harvested data is increasingly stolen, used and sold without permission.

Backing up files can result in a digital hoarding problem.

  Information within hoarded documents is often forgotten.
  Shared directories containing hoarded documents cause a security and privacy nightmare.
  Theft occurs when access is provided to digital hoards within organizations.
  Theft is for personal, corporate and government gain; some theft serves to increase personal digital hoards.

About Big Data Hoarding

Our solution

About automated processing of Microsoft Office documents

Inception and database services execute in the background

  1. The Inception service.
    • Automatically detects and transforms Word, Excel and PowerPoint documents into full text, then words by sequence and group.
    • Documents and associated words are recorded in a memory-based database.
    • When directory processing completes, the Inception memory-based database is committed to local storage.
  2. The Database microservice.
    • Is started by passing it an encrypted file containing database-related information plus preferences.
    • Automatically loads the Inception database into memory.
    • Each memory-based database is written to a PostgreSQL relational database along with full document text.
      • Options to enforce database security and privacy can be set within the encrypted file passed to start the service.
        1. Encrypt full text before it is committed to the database.
          • Legitimate use is permitted, while use of the text after a database breach is prevented.
        2. Words plus user-defined salt are hashed before being committed to the database.
          • Legitimate use is permitted, while use of the words after a database breach is prevented.
  3. Private Intranet software is used to perform searches of words and phrases to locate documents for review.
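
The salted word hashing option described above can be illustrated with a minimal Python sketch. It is not the Inception implementation: the choice of HMAC-SHA-256, the lower-casing of words, and the names hash_word, build_index and search are all assumptions made for illustration. It shows why a legitimate search still works (the query word is hashed with the same salt) while the stored hashes reveal no readable words.

```python
import hashlib
import hmac

def hash_word(word: str, salt: bytes) -> str:
    """Return a hex digest of one word, keyed with the user-defined salt."""
    # Assumption: HMAC-SHA-256; the source only states "words plus salt are hashed".
    return hmac.new(salt, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()

def build_index(doc_words: dict, salt: bytes) -> dict:
    """Map each hashed word to the set of documents that contain it."""
    index = {}
    for document, words in doc_words.items():
        for word in words:
            index.setdefault(hash_word(word, salt), set()).add(document)
    return index

def search(index: dict, word: str, salt: bytes) -> set:
    """Hash the query word with the same salt and look it up."""
    return index.get(hash_word(word, salt), set())

# Hypothetical salt and documents, for illustration only.
SALT = b"example-user-defined-salt"
INDEX = build_index(
    {"report.docx": ["budget", "forecast"], "memo.docx": ["budget"]},
    SALT,
)
```

A caller holding the salt can run search(INDEX, "budget", SALT) to find both documents; without the salt, the same lookup finds nothing.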

Evolution

The process of transforming digital data into new forms is as old as the computer industry.

Inception was created, recreated, rewritten and perpetually enhanced over many years.

  1. Earliest use was to transform magazine editorial submissions into formats used to review, edit and typeset.
  2. We created a commercial application that allowed selection and viewing of word processing, spreadsheet and presentation documents.
    • The application decoded file headers to determine file type, then transformed each file to present to users.
  3. Large amounts of mainframe data stored on magneto-optical media were read directly using low-level code under Windows.
    • The English and French data was automatically retained as full text plus transformed into words and phrases.
    • We created an advanced indexing system used to quickly locate the words and phrases from slow media.
    • This allowed users to rapidly search, view, export and print English and French text used during document translation.
  4. Legacy DB2 and Informix RDBMS database exports were transformed for import into an Oracle RDBMS for use through the Siebel CRM.
  5. Thousands of Word and PDF documents were automatically transformed into HTML and XML formats for technical and developer support sites.
  6. Automated transformation of bloated, unconventional XML to clean, DTD-validated XML for linguistic translation.
  7. Automated transformation of email real estate leads using many processing scripts to create SQL scripts for automated import into a MySQL database.
    • Signature analysis was used for several lead sources with many sub-types to select a processing script.
  8. Automated transformation of Microsoft Office Word docx, Excel xlsx and PowerPoint pptx documents.
    • Multi-threaded automation provides processing of complex directory chains along with individual documents.
    • Resulting UTF-8 text is broken into words with stop-words removed.
    • Pertinent information that includes document size, timestamps, word sequence and groups, is retained in a memory-based database.
    • The memory-based database is written to local storage after processing completes.
    • A database service handles writing the results to a relational database.
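
The word-splitting step in item 8 can be sketched in Python as follows. This is an illustration under stated assumptions, not the Inception code: the stop-word list is a tiny hypothetical sample, words are lower-cased, and the sequence number is taken to be the word's position in the original text before stop-words are removed.

```python
import re

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})

def split_words(text: str) -> list:
    """Return (sequence, word) pairs with stop-words removed."""
    kept = []
    # \w+ matches Unicode letters and digits, so accented words stay whole.
    for position, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
        if word not in STOP_WORDS:
            kept.append((position, word))
    return kept
```

Keeping the original position alongside each word is one way to support phrase searches later, since adjacent words can be recognised by their sequence numbers.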

Refactored

Inception was refined to achieve the highest processing speed.

    Inception was refactored and largely rewritten in a selective mixture of C and C++.

    We have a great deal of experience developing software in several low to high-level languages.

    We could have quickly rewritten Inception in higher-level languages.

    We chose to take the difficult route of coding in C and C++.

    Refactoring Inception in any of the higher-level languages would have saved us a great deal of time and cost.

    We could have made use of open-source applications to save us even more time and cost.

    Use of open-source applications, which are usually written in higher-level languages, could:

    • Reduce processing speed.
    • Eliminate key features such as directory processing.
    • Incur stability, security and privacy risks.
    • Cause maintenance and liability problems.

    We chose to develop key portions of Inception in C to directly load and process files in memory because it is fastest and provides the greatest control.

    We did not use C++ memory handling because it makes indirect use of memory, which is slower than direct access.

    Direct memory access is faster in situations where CPU caching is ineffective due to multi-threaded processing.

    • Each processing thread repeatedly changes its own memory buffers in different sections of the buffers.
    • Thread memory is discarded once processing ends.

    We did not use C++ vectors and other associated functions.

    • Each function controls memory allocation, expansion and deallocation.
    • This is unpredictable and uncontrollable behaviour that will reduce processing speed.

    The speed difference between direct and indirect memory access becomes noticeable when processing thousands to millions of documents.

    We created Inception for speed, stability and user-control.

    We utilized the C++ object-oriented framework to make it easier to isolate functionality and maintain Inception.

    The Inception code base is well constructed, well commented and documented, and fairly easy to understand.

    We can enhance Inception to automate processing of other document types as needed in future.


Automation

Inception automates document transformation.

It does so from within custom software that utilizes the Inception API.

It does so when used as a Linux service that automates document processing.

Shared features of Kryptera

Files and directory chains transferred into an input queue must be fully written before use.

This feature is critical for document processing automation.

Without validation, files and directories could be moved or loaded into memory while still being written.

This problem was discovered while developing the Kryptera HSM.

Tracing down a solution required low-level R&D that was specific to Linux.

A variable delay is placed after a file or directory chain is completely written.

The delay is only needed if cached writes occur after file writes are complete.
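
The idea of waiting until a file is fully written can be sketched in Python. This is a simple polling heuristic, not the Linux-specific low-level method described above: it assumes a file is complete once its size and modification time have stopped changing for a settle period. The function name wait_until_stable and all parameter names and defaults are hypothetical.

```python
import os
import time

def wait_until_stable(path, settle_seconds=2.0, poll_seconds=0.5, timeout_seconds=60.0):
    """Return True once size and mtime of path are unchanged for settle_seconds.

    Returns False if the file never settles (or never appears) before the timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    last_seen = None      # (size, mtime) from the previous poll
    stable_since = None   # monotonic time when last_seen was first observed
    while time.monotonic() < deadline:
        try:
            info = os.stat(path)
        except FileNotFoundError:
            last_seen, stable_since = None, None
            time.sleep(poll_seconds)
            continue
        current = (info.st_size, info.st_mtime_ns)
        now = time.monotonic()
        if current != last_seen:
            # The file changed (or was seen for the first time): restart the clock.
            last_seen, stable_since = current, now
        elif now - stable_since >= settle_seconds:
            return True
        time.sleep(poll_seconds)
    return False
```

The settle period plays the role of the variable delay: a longer value is safer when cached writes may land after the writer has finished, at the cost of slower pickup from the input queue.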


Operating Platform

Inception builds and runs on Debian and related trees such as Ubuntu and Devuan, plus Red Hat, CentOS and Fedora.

It would be fairly simple to refine Inception to build and execute under Unix.

Inception will run as an initd or systemd service, a standalone service-like application, or a standalone application to create processing scripts.

Inception shared and static libraries can be used by custom software to automate document and data transformation.

We chose to develop Inception to operate under Linux rather than Windows to ensure the highest processing speed and stability.

Windows and its background applications and services consume memory, CPU cores, file storage and network bandwidth, which would reduce Inception processing speed.

Refining Inception to build and execute under Windows will occur in the near future.

Configuration

Inception service startup flags.
./ServiceInception -folder_input input/ -folder_output output/ -folder_process process/ -folder_script script/ -folder_error error/ -folder_log log/

This starts Inception as a standalone application, which can also be used as a systemd service.
Pass -start_daemon to start Inception as a background initd daemon (service).
The design allows shared directories to be used for input and output queues, and allows higher-speed storage to be used for thread-processing space.


PostgreSQL database microservice startup flags.
./ServicePostgreSQL -folder_input input/ -folder_log log/ -encrypted_auth PATH_FILENAME

Pass -start_daemon to start the service as a background initd daemon. Set -folder_input to what -folder_output is set to for Inception.

A utility is provided to encrypt database-related information and preferences.

Script Creation

Inception Script Creation Flags.

./ServiceInception -test_source path/source_file
    -test_target path/target_file
    -test_script path/script_file

The flags are used to test creation or refinement of a processing script.

Load the source, script and target files into a text editor.

Change the script then start command line processing to update the target for review.

Scripting

A key Inception feature is user-defined processing scripts that are used to automate document transformation into desired formats.

The Inception scripting language contains many commands that can be separated into groups:
  1. Beautify
    • BeautifyXML [1 of 2]
      • Format code, indented with tabs.
      • Syntax: BeautifyXML
      • Example: BeautifyXML
    • BeautifyXML [2 of 2]
      • Format code, indented with tabs and then remove the space between <End> and </End> and <A*> and </A> tags.
      • Syntax: BeautifyXML|FIX_END|
      • Example: BeautifyXML|FIX_END|
  2. Change
    • ChangeTag
      • Replace chosen field (1 is tag1 else tag2) with replacement after locating tag1 then tag2 in sequence.
      • Syntax: ChangeTag|1 for tag1|tag1|tag2|replacement
      • Example: ChangeTag|1|<div><a|<div class="foo">
    • ChangeWrappedString
      • Find signature, scan for opening and closing double quotes, then replace the text between the quotes with replacement.
      • Syntax: ChangeWrappedString|signature|replacement
      • Example: ChangeWrappedString|Good|Evil
  3. Clean
    • CleanXML
      • Remove tabs, carriage returns, line feeds and hidden chars when writing final output. Note that StripXML has a higher priority than FormatXML in that StripXML will be used even if FormatXML and StripXML are declared in this file.
      • Syntax: CleanXML
      • Example: CleanXML
  4. Conceal
    • ConcealBlankTags
      • Hide passed tag sequence if there is nothing between the tags.
      • Syntax: ConcealBlankTags|tag_open|tag_close|
      • Example: ConcealBlankTags|<Note_Heading>|</Note_Heading>|
    • ConcealSpecialTags
      • Hide tags and the data between in a way that finds tags that contain carriage returns/line feeds between elements.
      • Syntax: ConcealSpecialTags|tag_open|tag_contains|tag_close|insert_to_start_of_buffer
      • Example: ConcealSpecialTags|<TITLE>||</TITLE>|
      • Example: HideTagsSpecial|<A |ID=|</A>|
  5. Confirm
    • ConfirmField
      • Validates a field to ensure it exists, is set as a name => value pair in PHP array format, with leading and trailing characters verified to ensure it is correct format to be loaded and processed within another PHP script.
      • Syntax: ConfirmField|String|
      • Example: ConfirmField|"message"|
  6. Correct
    • CorrectQP
      • Repair quoted-printable 7-bit email encoding.
      • Syntax: CorrectQP
      • Example: CorrectQP
  7. Decode
    • DecodeBase64
      • Decodes the first base64 attachment found in the email file buffer and places it at the end of the buffer where it leads with "^{_BASE64_START_}^" ending at "^{_BODY_END_}^"
      • Syntax: DecodeBase64
      • Example: DecodeBase64
  8. Eliminate
    • EliminateBinary
      • Force deletion of the passed binary data where the second string of three is one, two or three 8-bit binary values represented in hexadecimal format. Each hex value must take the form of ‘0xFF’ where ‘0x’ is the hex prefix and ‘FF’ can be any hex value from ‘00’ to ‘FF’. Up to three hex values can be passed in the format of ‘0xFF0xFF0xFF’.
      • Syntax: EliminateBinary|tag_open|hex_binary|tag_close|
      • Example: EliminateBinary|<data>|0xA90xB80xC7|</data>|
    • EliminateBytes
      • Delete all occurrences of 8-bit binary values represented in hexadecimal format. Each hex value must take the form of ‘XX’ where ‘XX’ is any hex value from ‘00’ to ‘FF’.
      • Syntax: EliminateBytes|pairs_of_hex_values|
      • Example: EliminateBytes|0D0A|
    • EliminateContent
      • Delete data that starts with from and ends with to, where to is retained if 0 is passed.
      • Syntax: EliminateContent|from|to|0|
      • Example: EliminateContent|</td>|^{_BODY_END_}^|0|
    • EliminateContentAll
      • Delete all data that starts with from and ends with to, where to is retained if 0 is passed.
      • Syntax: EliminateContentAll|from|to|0
      • Example: EliminateContentAll|</td>|^{_BODY_END_}^|0|
    • EliminateLFs
      • Removes line terminators.
      • Syntax: EliminateLFs
      • Example: EliminateLFs
    • EliminateString
      • Delete all occurrences of passed string from buffer.
      • Syntax: EliminateString|string_to_delete|
      • Example: EliminateString|Yadda|
    • EliminateTag
      • Delete data that starts with tag_open and, if not passed, ends with '>'.
      • Syntax: EliminateTag|tag_open|
      • Example: EliminateTag|<Dash1List>
    • EliminateTag2
      • Locate exact match to tag_open, scan for exact match to tag_next_to_delete, and then delete tag_next_to_delete. Note that this is potentially dangerous in that tag_open and tag_next_to_delete could be separated in context and result in invalid data deletion. It also has limited use in that it could leave a mess behind of deleted opening tags with left over closing tags.
      • Syntax: EliminateTag2|tag_open|tag_next_to_delete|
      • Example: EliminateTag2|<Body>|<Bold>|
  9. Preserve
    • PreserveMemory
      • Save file memory buffer to a file. This command is useful when creating a new processing script because it can save the file buffer at any stage of processing.
      • Syntax: PreserveMemory
      • Example: PreserveMemory
  10. Provisional
    • ProvisionalUpdate
      • If tag_find not found then insert it before tag_after.
      • Syntax: ProvisionalUpdate|tag_find|tag_after|
      • Example: ProvisionalUpdate|</Module_Divider>|<Section>|
  11. Put
    • PutBetweenTags
      • Locate exact match to tag_open, scan for exact match to tag_next, and then insert tag_to_insert_between_open_and_next between tag_open and tag_next.
      • Syntax: PutBetweenTags|tag_open|tag_next|tag_to_insert_between_open_and_next|
      • Example: PutBetweenTags|<Table>|<ROW>|<Table_Body>|
    • PutBinaryPostfix
      • Append passed string with Adobe InDesign-specific binary line feed data. See the notes in InsertBinaryPrefix.
      • Syntax: PutBinaryPostfix|tag|
      • Example: PutBinaryPostfix|</Table>|
    • PutBinaryPrefix
      • Prepend passed string with Adobe InDesign-specific binary line feed data. Note that the binary data is embedded to force InDesign to drop line feeds after tag closure. This is only needed when the InDesign tag formatting does not specifically call for a line feed to be dropped after a tag closes. It would be best to avoid using InsertBinaryPrefix and InsertBinaryPostfix by handling all line feeds through tag formatting within InDesign.
      • Syntax: PutBinaryPrefix|tag|
      • Example: PutBinaryPrefix|</Title>|
    • PutPostfix
      • Insert a string at the end of the file memory buffer.
      • Syntax: PutPostfix|String|
      • Example: PutPostfix|EOF|
    • PutPrefix
      • Insert a string at the start of the file memory buffer.
      • Syntax: PutPrefix|String|
      • Example: PutPrefix|<php $array = [||
  12. Remove
    • RemoveBetween
      • Remove data between START and END.
      • Syntax: RemoveBetween|Start|End|
      • Example: RemoveBetween|<t|>|
    • RemoveWithout
      • If Find is not found in the memory buffer then replace all memory buffer content with Replace.
      • Syntax: RemoveWithout|Find|Replace|
      • Example: RemoveWithout|<a:t>|>|
    • RemoveWrapper
      • Purge <tag_open> and </tag_open> if located on tag_level and followed by tag_after at tag_level+1.
      • Syntax: RemoveWrapper|tag_level|tag_open|tag_after|
      • Example: RemoveWrapper|3|<Dash1>|<Dash1>|
  13. Repair
    • RepairDoubleSection
      • Purge opening and closing <Section> blocks from blocks that start with <Section></Section>.
      • Syntax: RepairDoubleSection
      • Example: RepairDoubleSection
    • RepairSymbols
      • Translate Microsoft Word and Windows symbol chars to escaped values.
      • Syntax: RepairSymbols
      • Example: RepairSymbols
  14. Set
    • SetClosingTag
      • Locates tag_open and tag_close when they are both positioned at the same tag level, and then replaces tag_close with new_tag_close.
      • Syntax: SetClosingTag|tag_open|tag_close|new_tag_close|
      • Example: SetClosingTag|<Cell_Number1_First>|</Cell_Number1_Next>|</Cell_Number1_First>|
    • SetFieldDelimiter
      • Set field delimiter within curly brackets.
      • Syntax: SetFieldDelimiter{delchar}
      • Example: SetFieldDelimiter{|}
  15. Swap
    • SwapAtNestedLevel
      • Substitute tag_open located at tag_level with new_tag_open and then replace the matching closing tag with new_tag_close. Note that if before is passed then it must exist before tag_open for the changes to be made.
      • Syntax: SwapAtNestedLevel|tag_level|before|tag_open|new_tag_open|new_tag_close|
      • Example: SwapAtNestedLevel|2||<ExerciseNumber>|<Number1_Next>|</Number1_Next>|
      • Example: SwapAtNestedLevel|9|<Table_Cell>|<Body>|<Cell_Body>|</Cell_Body>|
      • Example: SwapAtNestedLevel|1||<A ID=|||
    • SwapNested
      • Complex search and substitution for data nested from two to three tag levels.
      • Syntax: SwapNested|sig_tag_root|sig_nested_1_tag|sig_nested_2_tag|sig_tag_close|replace_open|replace_close|
      • Example: SwapNested|<c props=|font-family:Arial; font-weight:bold||</c>|<Bold>|</Bold>|
      • Example: SwapNested|<c props=|font-size:12pt; font-family:Times New Roman||</c>|||
    • SwapNext
      • Change the first occurrence of from to to, searching from the start of the file buffer.
      • Syntax: SwapNext|from|to|
      • Example: SwapNext|center;">|"rental_type" => "|
    • SwapOutward
      • Search for primary opening and closing tags. If found, search backward and forward for secondary tags. If found, perform substitution. Why? Because some tags are so generic that SwapNested fails.
      • Syntax: SwapOutward|tag_open|tag_close|previous_tag_open|previous_tag_close|replace_open|replace_close|0=Do not extract text, 1=Extract text|
      • Example: SwapOutward|<image |/>|<p style="Normal"|</p>|<Image href="Images/IMAGE.gif">|</Image>|0|
    • SwapStrings
      • Change all occurrences of from to to.
      • Syntax: SwapStrings|from|to|
      • Example: SwapStrings|=F0=9F=99=8F| |
    • SwapTags
      • Swap sig_tag_open and sig_tag_close with replace_open and replace_close, keeping the data between.
      • Syntax: SwapTags|sig_tag_open|sig_tag_close|replace_open|replace_close|
      • Example: SwapTags|<c props="lang:en-US; font-size:24pt; font-family:Arial">|</c>||
      • Example: SwapTags|<p style="Normal"|</p>|<BodyText>|</BodyText>|
  16. Transfer
    • TransferBlock
      • Locate <tag_open> and </tag_open> located at tag_level_from.
        • If before is populated then determine if it precedes <tag_open> one level before.
        • Do not make any changes if it does not.
        • Extract the data between <tag_open> and </tag_open>,
          hide <tag_open>data</tag_open>,
          move down till Tag Level == tag_level_to
          then start a new block using the passed parameters
          making sure to include the extracted data.
        • Note : tag_open does not have to have a leading '<'.
      • Syntax: TransferBlock|tag_level_from|tag_level_to|before|tag_open|replace_open|replace_close|
      • Example: TransferBlock|6|1|<Letter1>|<Table>|<Section><Table>|</Table></Section>|
  17. Transform
    • TransformLFs
      • Change Linefeed (0x0A) and Carriage Return (0x0D) ASCII values to "^LF^" and "^CR^".
      • Syntax: TransformLFs
      • Example: TransformLFs
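
To make the command syntax above concrete, here is a small Python sketch of how a few of the simpler commands might be interpreted against a file memory buffer. It is an illustration only, not the Inception engine: the function names are hypothetical, only four commands are covered, the default "|" delimiter is assumed, and EliminateBytes is assumed to delete every occurrence of each listed byte value.

```python
def parse_line(line: str, delimiter: str = "|"):
    """Split a script line into the command name and its fields."""
    command, *fields = line.split(delimiter)
    return command.strip(), fields

def apply_command(buffer: bytes, line: str) -> bytes:
    """Apply one script line to the buffer and return the new buffer."""
    command, fields = parse_line(line)
    if command == "SwapStrings":
        # Change all occurrences of from to to.
        return buffer.replace(fields[0].encode(), fields[1].encode())
    if command == "EliminateString":
        # Delete all occurrences of the passed string.
        return buffer.replace(fields[0].encode(), b"")
    if command == "EliminateBytes":
        # Assumption: each hex pair names one byte value to delete everywhere.
        return buffer.translate(None, bytes.fromhex(fields[0]))
    if command == "TransformLFs":
        # Replace LF (0x0A) and CR (0x0D) with visible markers.
        return buffer.replace(b"\n", b"^LF^").replace(b"\r", b"^CR^")
    raise ValueError(f"unsupported command: {command}")

def run_script(buffer: bytes, script_lines) -> bytes:
    """Apply each non-blank script line in order."""
    for line in script_lines:
        if line.strip():
            buffer = apply_command(buffer, line)
    return buffer
```

For example, run_script(data, ["EliminateString|Yadda|", "TransformLFs"]) would first strip the string and then mark the line terminators, mirroring how commands are applied in sequence within a processing script.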

Our scripting language can be extended in future to better process any form of data plus other document formats.

Microsoft Office Scripting

We added functionality to automate transformation of Microsoft Office documents into words that are inserted into a relational database.
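
As a rough illustration of the first step, the Python sketch below pulls paragraph text out of a Word docx file using only the standard library. It relies on the fact that a docx file is a ZIP archive whose main text lives in word/document.xml; everything else (the function names, the decision to join paragraphs with line feeds, ignoring headers, footnotes and other parts) is an assumption for the example and not a description of how Inception does it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_to_text(data: bytes) -> str:
    """Return the paragraph text of the main document part, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Text runs are stored in <w:t> elements inside each paragraph.
        runs = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)

def make_sample_docx(paragraphs) -> bytes:
    """Build a stripped-down archive for testing (not a complete Word file)."""
    body = "".join(f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return out.getvalue()
```

The extracted text could then be split into words and written to the database by the steps described earlier.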

Application Programming Interface (API)

Five API functions are provided that enable custom software to automate document transformation.

The functions differ on the data that is passed to and from the API.

API naming is based on the International Civil Aviation Organization (ICAO) alphabet.

Virtualized

Inception can be deployed in a virtualized environment such as a Docker container.

Docker containers are easy to create, configure and use.

Docker supports Volumes which enable use of Inception input and output queues.

Kubernetes can be used to control 1-n Docker containers.

Processing Speed

Intel Core i3-4150, 4 logical CPU cores @ 3.50 GHz / 800 MHz,
1 TB 7200 RPM SATA HDD, 8 async threads limit.

Documents transformed into text.

Documents transformed into text, split into words, retained in memory, with memory tables saved as binary files.

PostgreSQL database microservice loading saved tables into memory, then writing results to a relational database.

Current and Future Enhancement.

Some of our Competition

  Product                         Category
  Alteryx                         Document Processing, Data Mining
  ShinyDocs                       Document Processing
  OpenRefine                      Data Mining
  Trifacta Wrangler               Data Mining
  Drake                           Data Mining
  TIBCO Clarity                   Data Mining
  Winpure                         Data Mining
  Data Ladder                     Data Mining
  Quadient Data Cleaner           Data Mining
  Cloudingo                       Data Mining
  Reifier                         Data Mining
  IBM Infosphere Quality Stage    Data Mining

Inception is designed to fully automate conversion of documents into desired formats.

Summary

Inception solves data hoarding issues through rapid automation of document transformation.

Transformed Office documents are parsed into separate words, which are then imported into a database for indexing.

Automated processing of shared directory chains ensures a faster return on investment for clients.


Contact us today.