Inception (noun) an event that is a beginning; a first part or stage of subsequent events
Two problems addressed by Inception
Data harvesting will keep increasing for years to come.
Harvested data is mostly stored in non-relational databases in cloud space.
Most harvested data goes unused due to a lack of data mining software and the limited execution speed of existing mining software.
Private harvested data is increasingly stolen, used and sold without permission.
Backing up files can result in a digital hoarding problem.
Information within hoarded documents is often forgotten.
Shared directories containing hoarded documents create a security and privacy nightmare.
Theft occurs when access is provided to digital hoards within organizations.
Theft occurs for personal, corporate and government gain; some theft serves only to grow personal digital hoards.
About Big Data Hoarding
- Most web sites, applications and operating systems hoard private data without asking, and transport data to remote networks for storage and use.
- Governments hoard through backbones, and backdoors into operating systems, applications, and ISPs.
- Many countries pass backbone data to other aligned countries to hoard before it is passed back into the country of origin.
- This vast movement of data bypasses each country's surveillance laws protecting its own people and organizations.
- The rate of big data hoarding has been accelerating for years.
- This form of hoarding cannot be stopped because of the wealth and control associated with big data and AI.
- Most big data has not, and will not, be processed due to bottlenecks associated with data mining.
- This makes the data close to useless as it ages without use.
- This risky spread of unlimited use, and insecure storage, of private data is completely unacceptable.
Our solution
- The Inception service and API rapidly transforms raw data and documents into other desired forms.
- Inception is an automated data pre-processor that can greatly speed up data transformation.
- A student could create processing scripts used to transform data into desired forms.
- There is no need to craft extensive data-mining software that is written in interpreted languages.
- Local private data can remain local, private and secure after transformation.
- The Inception service has been customized to automate transformation of Microsoft Office documents.
About automated processing of Microsoft Office documents
Inception and database services execute in the background
- The Inception service:
  - Automatically detects and transforms Word, Excel and PowerPoint documents into full text, then into words by sequence and group.
  - Documents and associated words are recorded in a memory-based database.
  - When directory processing completes, the Inception memory-based database is committed to local storage.
- The Database microservice:
  - Is started by passing an encrypted file containing database-related information plus preferences.
  - Automatically loads the Inception database into memory.
  - Writes each memory-based database to a PostgreSQL relational database along with the full document text.
  - Options can be set within the encrypted startup file to enforce database security and privacy:
    - Encrypt full text before it is committed to the database. Legitimate use is permitted while use of the text after a database breach is halted.
    - Hash words plus a user-defined salt before they are committed to the database. Legitimate use is permitted while use of the words after a database breach is halted.
- Private intranet software is used to perform searches of words and phrases to locate documents for review.
Evolution
The process of transforming digital data into new forms is as old as the computer industry.
Inception was created, recreated, rewritten and perpetually enhanced over many years.
- Earliest use was to transform magazine editorial submissions into formats used to review, edit and typeset.
- We created a commercial application that allowed selection and viewing of word processing, spreadsheet and presentation documents.
  - The application decoded file headers to determine file type, then transformed each file for presentation to users.
- Large amounts of mainframe data stored on magneto-optical media were directly read using low-level code under Windows.
  - The English and French data was automatically retained as full text plus transformed into words and phrases.
  - We created an advanced indexing system used to quickly locate the words and phrases on slow media.
  - This allowed users to rapidly search, view, export and print English and French text used during document translation.
- Legacy DB2 and Informix RDBMS database exports were transformed for import into an Oracle RDBMS for use through the Siebel CRM.
- Thousands of Word and PDF documents were automatically transformed into HTML and XML formats for technical and developer support sites.
- Automated transformation of bloated, unconventional XML to clean, DTD-validated XML for linguistic translation.
- Automated transformation of email real estate leads using many processing scripts to create SQL scripts for automated import into a MySQL database.
  - Signature analysis was used for several lead sources with many sub-types to select a processing script.
- Automated transformation of Microsoft Office Word docx, Excel xlsx and PowerPoint pptx documents.
  - Multi-threaded automation provides processing of complex directory chains along with individual documents.
  - Resulting UTF-8 text is broken into words with stop-words removed.
  - Pertinent information, including document size, timestamps, and word sequence and groups, is retained in a memory-based database.
  - The memory-based database is written to local storage after processing completes.
  - A database service handles writing the results to a relational database.
Refactored
Inception was refined to achieve the highest processing speed.
Inception was refactored and largely rewritten in a selective mixture of C and C++.
We have a great deal of experience developing software in several low to high-level languages.
We could have quickly rewritten Inception in higher-level languages.
We chose to take the difficult route of coding in C and C++.
Refactoring Inception in any of the higher-level languages would have saved us a great deal of time and cost.
We could have made use of open-source applications to save us even more time and cost.
Use of open-source applications that are usually written in higher-level languages, could:
- Reduce processing speed.
- Eliminate key-features such as directory processing.
- Incur stability, security and privacy risks.
- Cause maintenance and liability problems.
We chose to develop key portions of Inception in C to directly load and process files in memory, because direct access is fastest and provides the greatest control.
We did not use C++ memory handling because it makes indirect use of memory, which is slower than direct access.
Direct memory access is faster in situations where CPU caching is ineffective due to multi-threaded processing.
- Each processing thread repeatedly changes its own memory buffers in different sections of the buffers.
- Thread memory is discarded once processing ends.
We did not use C++ vectors and other associated functions.
- Each such function controls memory allocation, expansion and deallocation.
- This is unpredictable and uncontrollable behaviour that will reduce processing speed.
The speed difference between direct and indirect memory access becomes noticeable when processing thousands to millions of documents.
We created Inception for speed, stability and user-control.
We utilized the C++ object-oriented framework to make it easier to isolate functionality and maintain Inception.
The Inception code base is well constructed, well commented and documented, and fairly easy to understand.
We can enhance Inception to automate processing of other document types as needed in future.
- The scripting command set has been expanded.
- An Application Programming Interface (API) is provided through use of Linux shared and static libraries.
- Use of the API enables custom software to transform data into new forms by making use of Inception.
- API functionality is being added to encrypt, and subsequently decrypt, data returned by the API.
- Adjustable async process threading has been added.
- Automated directory processing has been added.
- A memory-based NoSQL engine was created and used to retain processed documents and associated words.
- Use of the memory-based database accelerates document processing speed.
- The memory-based database is written to local storage after directory processing has been completed.
- A database service reloads the memory-based database and securely writes it to a relational database.
Automation
Inception automates document transformation.
It does so from within custom software that utilizes the Inception API.
It does so when used as a Linux service that automates document processing.
- Files and directory chains are transferred to an Inception input queue.
- Each is automatically processed in FIFO sequence.
- Automatic detection of file type, then document type, results in correct processing.
- The high processing speed reduces the total number of threads required for processing.
Shared features of Kryptera
Files and directory chains transferred into an input queue must be fully written before use.
This feature is critical for document processing automation.
Without validation, files and directories could be moved or loaded into memory while still being written.
This problem was discovered while developing the Kryptera HSM.
Tracking down a solution required Linux-specific, low-level R&D.
A variable delay is applied after a file or directory chain is completely written.
The delay is only needed if cached writes occur after the file writes are complete.
Operating Platform
Inception builds and runs on Debian and related trees such as Ubuntu and Devuan, plus Red Hat, CentOS and Fedora.
It would be fairly simple to refine Inception to build and execute under Unix.
Inception will run as an initd or systemd service, a standalone service-like application, or a standalone application to create processing scripts.
Inception shared and static libraries can be used by custom software to automate document and data transformation.
We chose to develop Inception to operate under Linux rather than Windows to ensure the highest processing speed and stability.
Windows itself, plus its applications and services, consumes memory, CPU cores, file storage and network bandwidth, which would reduce Inception's processing speed.
Refining Inception to build and execute under Windows will occur in the near future.
Configuration
Inception service startup flags.
./ServiceInception -folder_input input/ -folder_output output/ -folder_process process/ -folder_script script/ -folder_error error/ -folder_log log/
This starts Inception as a standalone application which can also be used as a systemd service.
-start_daemon is passed to start Inception as a background initd daemon (service).
The design allows shared directories to be used for input and output queues, and allows use of higher-speed storage for thread-processing space.
PostgreSQL database microservice startup flags.
./ServicePostgreSQL -folder_input input/ -folder_log log/ -encrypted_auth PATH_FILENAME
Pass -start_daemon to start the service as a background initd daemon. Set -folder_input to what -folder_output is set to for Inception.
A utility is provided to encrypt database-related information and preferences.
Script Creation
Inception Script Creation Flags.
./ServiceInception -test_source path/source_file -test_target path/target_file -test_script path/script_file
The flags are used to test creation or refinement of a processing script.
Load the source, script and target files into a text editor.
Edit the script, then start command line processing to update the target for review.
Scripting
A key Inception feature is user-defined processing scripts that are used to automate document transformation into desired formats.
The Inception scripting language contains many commands that can be separated into groups:
- Beautify
  - BeautifyXML [1 of 2]
    - Format code, indented with tabs.
    - Syntax: BeautifyXML
    - Example: BeautifyXML
  - BeautifyXML [2 of 2]
    - Format code, indented with tabs, then remove the space between <End> and </End> tags and between <A*> and </A> tags.
    - Syntax: BeautifyXML|FIX_END|
    - Example: BeautifyXML|FIX_END|
- Change
  - ChangeTag
    - Replace the chosen field (1 selects tag1, otherwise tag2) with replacement after locating tag1 then tag2 in sequence.
    - Syntax: ChangeTag|1 for tag1|tag1|tag2|replacement
    - Example: ChangeTag|1|<div><a|<div class="foo">
  - ChangeWrappedString
    - Find signature, scan for the opening and closing double quotes, and replace the text between the quotes with replacement.
    - Syntax: ChangeWrappedString|signature|replacement
    - Example: ChangeWrappedString|Good|Evil
- Clean
  - CleanXML
    - Remove tabs, carriage returns, line feeds and hidden characters when writing final output. Note that StripXML has a higher priority than FormatXML: StripXML will be used even if both FormatXML and StripXML are declared in this file.
    - Syntax: CleanXML
    - Example: CleanXML
- Conceal
  - ConcealBlankTags
    - Hide the passed tag sequence if there is nothing between the tags.
    - Syntax: ConcealBlankTags|tag_open|tag_close|
    - Example: ConcealBlankTags|<Note_Heading>|</Note_Heading>|
  - ConcealSpecialTags
    - Hide tags and the data between them, in a way that finds tags containing carriage returns/line feeds between elements.
    - Syntax: ConcealSpecialTags|tag_open|tag_contains|tag_close|insert_to_start_of_buffer
    - Example: ConcealSpecialTags|<TITLE>||</TITLE>|
    - Example: ConcealSpecialTags|<A |ID=|</A>|
- Confirm
  - ConfirmField
    - Validate a field to ensure it exists and is set as a name => value pair in PHP array format, with leading and trailing characters verified so it is in the correct format to be loaded and processed within another PHP script.
    - Syntax: ConfirmField|String|
    - Example: ConfirmField|"message"|
- Correct
  - CorrectQP
    - Repair quoted-printable 7-bit email encoding.
    - Syntax: CorrectQP
    - Example: CorrectQP
- Decode
  - DecodeBase64
    - Decode the first base64 attachment found in the email file buffer and place it at the end of the buffer, where it leads with "^{_BASE64_START_}^" and ends at "^{_BODY_END_}^".
    - Syntax: DecodeBase64
    - Example: DecodeBase64
- Eliminate
  - EliminateBinary
    - Force deletion of the passed binary data, where the second string of three is one, two or three 8-bit binary values represented in hexadecimal format. Each hex value must take the form '0xFF', where '0x' is the hex prefix and 'FF' can be any hex value from '00' to 'FF'. Up to three hex values can be passed in the format '0xFF0xFF0xFF'.
    - Syntax: EliminateBinary|tag_open|hex_binary|tag_close|
    - Example: EliminateBinary|<data>|0xA90xB80xC7|</data>|
  - EliminateBytes
    - Delete all occurrences of 8-bit binary values represented in hexadecimal format. Each hex value must take the form 'XX', where 'XX' is any hex value from '00' to 'FF'.
    - Syntax: EliminateBytes|pairs_of_hex_values|
    - Example: EliminateBytes|0D0A|
  - EliminateContent
    - Delete data that starts with from and ends with to, where to is retained if 0 is passed.
    - Syntax: EliminateContent|from|to|0|
    - Example: EliminateContent|</td>|^{_BODY_END_}^|0|
  - EliminateContentAll
    - Delete all data that starts with from and ends with to, where to is retained if 0 is passed.
    - Syntax: EliminateContentAll|from|to|0|
    - Example: EliminateContentAll|</td>|^{_BODY_END_}^|0|
  - EliminateLFs
    - Remove line terminators.
    - Syntax: EliminateLFs
    - Example: EliminateLFs
  - EliminateString
    - Delete all occurrences of the passed string from the buffer.
    - Syntax: EliminateString|string_to_delete|
    - Example: EliminateString|Yadda|
  - EliminateTag
    - Delete data that starts with tag_open and ends with the next '>'.
    - Syntax: EliminateTag|tag_open|
    - Example: EliminateTag|<Dash1List>
  - EliminateTag2
    - Locate an exact match to tag_open, scan for an exact match to tag_next_to_delete, and then delete tag_next_to_delete. Note that this is potentially dangerous in that tag_open and tag_next_to_delete could be separated in context and result in invalid data deletion. It also has limited use in that it could leave behind a mess of deleted opening tags with leftover closing tags.
    - Syntax: EliminateTag2|tag_open|tag_next_to_delete|
    - Example: EliminateTag2|<Body>|<Bold>|
- Preserve
  - PreserveMemory
    - Save the file memory buffer to a file. This command is useful when creating a new processing script because it can save the file buffer at any stage of processing.
    - Syntax: PreserveMemory
    - Example: PreserveMemory
- Provisional
  - ProvisionalUpdate
    - If tag_find is not found, insert it before tag_after.
    - Syntax: ProvisionalUpdate|tag_find|tag_after|
    - Example: ProvisionalUpdate|</Module_Divider>|<Section>|
- Put
  - PutBetweenTags
    - Locate an exact match to tag_open, scan for an exact match to tag_next, and then insert tag_to_insert_between_open_and_next between tag_open and tag_next.
    - Syntax: PutBetweenTags|tag_open|tag_next|tag_to_insert_between_open_and_next|
    - Example: PutBetweenTags|<Table>|<ROW>|<Table_Body>|
  - PutBinaryPostfix
    - Append the passed string with Adobe InDesign-specific binary line feed data. See the notes in PutBinaryPrefix.
    - Syntax: PutBinaryPostfix|tag|
    - Example: PutBinaryPostfix|</Table>|
  - PutBinaryPrefix
    - Prepend the passed string with Adobe InDesign-specific binary line feed data. Note that the binary data is embedded to force InDesign to drop line feeds after tag closure. This is only needed when the InDesign tag formatting does not specifically call for a line feed to be dropped after a tag closes. It would be best to avoid using PutBinaryPrefix and PutBinaryPostfix by handling all line feeds through tag formatting within InDesign.
    - Syntax: PutBinaryPrefix|tag|
    - Example: PutBinaryPrefix|</Title>|
  - PutPostfix
    - Insert a string at the end of the file memory buffer.
    - Syntax: PutPostfix|String|
    - Example: PutPostfix|EOF|
  - PutPrefix
    - Insert a string at the start of the file memory buffer.
    - Syntax: PutPrefix|String|
    - Example: PutPrefix|<php $array = [||
- Remove
  - RemoveBetween
    - Remove data between Start and End.
    - Syntax: RemoveBetween|Start|End|
    - Example: RemoveBetween|<t|>|
  - RemoveWithout
    - If Find is not found in the memory buffer, replace all memory buffer content with Replace.
    - Syntax: RemoveWithout|Find|Replace|
    - Example: RemoveWithout|<a:t>|>|
  - RemoveWrapper
    - Purge <tag_open> and </tag_open> if located on tag_level and followed by tag_after at tag_level+1.
    - Syntax: RemoveWrapper|tag_level|tag_open|tag_after|
    - Example: RemoveWrapper|3|<Dash1>|<Dash1>|
- Repair
  - RepairDoubleSection
    - Purge opening and closing <Section> blocks from blocks that start with <Section></Section>.
    - Syntax: RepairDoubleSection
    - Example: RepairDoubleSection
  - RepairSymbols
    - Translate Microsoft Word and Windows symbol characters to escaped values.
    - Syntax: RepairSymbols
    - Example: RepairSymbols
- Set
  - SetClosingTag
    - Locate tag_open and tag_close when they are both positioned at the same tag level, and then replace tag_close with new_tag_close.
    - Syntax: SetClosingTag|tag_open|tag_close|new_tag_close|
    - Example: SetClosingTag|<Cell_Number1_First>|</Cell_Number1_Next>|</Cell_Number1_First>|
  - SetFieldDelimiter
    - Set the field delimiter, passed within curly brackets.
    - Syntax: SetFieldDelimiter{delchar}
    - Example: SetFieldDelimiter{|}
- Swap
  - SwapAtNestedLevel
    - Substitute tag_open located at tag_level with new_tag_open, and then replace the matching closing tag with new_tag_close. Note that if before is passed, it must exist before tag_open for the changes to be made.
    - Syntax: SwapAtNestedLevel|tag_level|before|tag_open|new_tag_open|new_tag_close|
    - Example: SwapAtNestedLevel|2||<ExerciseNumber>|<Number1_Next>|</Number1_Next>|
    - Example: SwapAtNestedLevel|9|<Table_Cell>|<Body>|<Cell_Body>|</Cell_Body>|
    - Example: SwapAtNestedLevel|1||<A ID=|||
  - SwapNested
    - Complex search and substitution for data nested from two to three tag levels.
    - Syntax: SwapNested|sig_tag_root|sig_nested_1_tag|sig_nested_2_tag|sig_tag_close|replace_open|replace_close|
    - Example: SwapNested|<c props=|font-family:Arial; font-weight:bold||</c>|<Bold>|</Bold>|
    - Example: SwapNested|<c props=|font-size:12pt; font-family:Times New Roman||</c>|||
  - SwapNext
    - Change the first occurrence of from to to, scanning from the start of the file buffer.
    - Syntax: SwapNext|from|to|
    - Example: SwapNext|center;">|"rental_type" => "|
  - SwapOutward
    - Search for the primary opening and closing tags. If found, search backward and forward for the secondary tags. If found, perform the substitution. Why? Because some tags are so generic that SwapNested fails.
    - Syntax: SwapOutward|tag_open|tag_close|previous_tag_open|previous_tag_close|replace_open|replace_close|0=Do not extract text, 1=Extract text|
    - Example: SwapOutward|<image |/>|<p style="Normal"|</p>|<Image href="Images/IMAGE.gif">|</Image>|0|
  - SwapStrings
    - Change all occurrences of from to to.
    - Syntax: SwapStrings|from|to|
    - Example: SwapStrings|=F0=9F=99=8F| |
  - SwapTags
    - Swap sig_tag_open and sig_tag_close with replace_open and replace_close, keeping the data between them.
    - Syntax: SwapTags|sig_tag_open|sig_tag_close|replace_open|replace_close|
    - Example: SwapTags|<c props="lang:en-US; font-size:24pt; font-family:Arial">|</c>||
    - Example: SwapTags|<p style="Normal"|</p>|<BodyText>|</BodyText>|
- Transfer
  - TransferBlock
    - Locate <tag_open> and </tag_open> located at tag_level_from. If before is populated, determine whether it precedes <tag_open> one level before; do not make any changes if it does not. Extract the data between <tag_open> and </tag_open>, hide <tag_open>data</tag_open>, move down until the tag level equals tag_level_to, then start a new block using the passed parameters, making sure to include the extracted data. Note: tag_open does not have to have a leading '<'.
    - Syntax: TransferBlock|tag_level_from|tag_level_to|before|tag_open|replace_open|replace_close|
    - Example: TransferBlock|6|1|<Letter1>|<Table>|<Section><Table>|</Table></Section>|
- Transform
  - TransformLFs
    - Change Line Feed (0x0A) and Carriage Return (0x0D) ASCII values to "^LF^" and "^CR^".
    - Syntax: TransformLFs
    - Example: TransformLFs
Our scripting language can be extended in the future to better process other forms of data and additional document formats.
Microsoft Office Scripting
We added functionality to automate transformation of Microsoft Office documents into words that are inserted into a relational database.
- A five-line script that uses only three commands transforms Word XML documents into text.
- A six-line script that uses only four commands transforms Excel XML documents into text.
- A six-line script that uses only four commands transforms PowerPoint XML slides into text.
Application Programming Interface (API)
Five API functions are provided that enable custom software to automate document transformation.
The functions differ in the data that is passed to and from the API.
API naming is based on the International Civil Aviation Organization (ICAO) alphabet.
Alpha
/**
* \brief Load source and processing script, write target, deallocate source and script memory, caller has target file.
*
* Buffers pSourcePath & pProcessingScript, calls Delta, writes pTargetPath,
* deallocates pSourceBuffer & pProcessingScriptBuffer, does not return pSourceBuffer.
*
* \param pSourcePath
* \param pTargetPath
* \param pProcessingScript
* \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
* \return FAILURE_OKAY on success else various FAILURE_* on error
*/
int32_t Alpha( char *pSourcePath, char *pTargetPath, char *pProcessingScript, uint32_t *pSizeSourceBuffer );
Bravo
/**
* \brief Load processing script, write target, deallocate script memory, caller has target file and deallocates pSourceBuffer.
*
* Buffers pProcessingScript, calls Delta, writes pTargetPath,
* deallocates pProcessingScriptBuffer, does not return pSourceBuffer.
*
* \param pSourceBuffer
* \param pTargetPath
* \param pProcessingScript
* \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
* \return FAILURE_OKAY on success else various FAILURE_* on error
*/
int32_t Bravo( uint8_t *pSourceBuffer, char *pTargetPath, char *pProcessingScript, uint32_t *pSizeSourceBuffer );
Charlie
/**
* \brief Load processing script, deallocate script, returns target (pSourceBuffer), caller writes target
* and deallocates pSourceBuffer.
*
* Buffers pProcessingScript, calls Delta, deallocates pProcessingScriptBuffer, returns altered pSourceBuffer.
*
* \param pSourceBuffer
* \param pProcessingScript
* \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
* \return FAILURE_OKAY on success else various FAILURE_* on error
*/
int32_t Charlie( uint8_t *pSourceBuffer, char *pProcessingScript, uint32_t *pSizeSourceBuffer );
Delta
/**
* \brief Returns target (pSourceBuffer), caller has target (pSourceBuffer) and deallocates pSourceBuffer
* and pProcessingScriptBuffer.
*
 * Delta is the core function called by all others.
*
* \param pSourceBuffer
* \param pProcessingScriptBuffer
* \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
 * \param pMaxSourceBuffer size of pSourceBuffer when malloc'd
* \return FAILURE_OKAY on success else various FAILURE_* on error
*/
int32_t Delta( uint8_t *pSourceBuffer, uint8_t *pProcessingScriptBuffer, uint32_t *pSizeSourceBuffer, uint64_t pMaxSourceBuffer );
Echo
/**
 * \brief Processing script is already populated; load source, write target, deallocate source memory, caller has target file.
*
* Buffers pSourcePath & pProcessingScript, calls Delta, writes pTargetPath,
* deallocates pSourceBuffer & pProcessingScriptBuffer, does not return pSourceBuffer.
*
* \param pSourcePath
* \param pTargetPath
* \param pProcessingScriptBuffer
* \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
* \return FAILURE_OKAY on success else various FAILURE_* on error
 */
int32_t Echo( char *pSourcePath, char *pTargetPath, uint8_t *pProcessingScriptBuffer, uint32_t *pSizeSourceBuffer );
Virtualized
Inception can be deployed in containers using a platform such as Docker.
Docker containers are easy to create, configure and use.
Docker supports volumes, which enable use of the Inception input and output queues.
Kubernetes can be used to orchestrate any number of Docker containers.
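The volume-based queues can be wired up as in the following deployment sketch; the image name, host paths and container paths are all hypothetical, and only the queue flags from the Configuration section are shown.

```shell
# Hypothetical deployment: bind host directories into the container as
# Docker volumes so the Inception input and output queues are shared
# with the host (image name and paths are illustrative only).
docker run -d \
  -v /srv/inception/input:/inception/input \
  -v /srv/inception/output:/inception/output \
  inception-service \
  ./ServiceInception -folder_input /inception/input/ -folder_output /inception/output/
```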
Processing Speed
Intel Core i3-4150 (4 logical CPU cores @ 3.50 GHz / 800 MHz),
1 TB 7200 RPM SATA HDD, 8 async thread limit.
Documents transformed into text.
- 408 Word .docx files, 113.8 MB, 9.754233 seconds.
- 484 Excel .xlsx files, 108.2 MB, 99.723453 seconds.
- 132 PowerPoint .pptx files, 145.34 MB, 8.116166 seconds.
Documents transformed into text, split into words, retained in memory, with memory tables saved as binary files.
- 408 Word .docx files, 113.8 MB, 11.513733 seconds, 147,769 words.
- 484 Excel .xlsx files, 108.2 MB, 102.881834 seconds, 1,235,563 words.
- 132 PowerPoint .pptx files, 145.34 MB, 8.150428 seconds, 6,267 words.
PostgreSQL database microservice loading saved tables into memory, then writing results to a relational database.
- 408 Word .docx files, 113.8 MB, 65.207372 seconds, 147,769 words.
- 484 Excel .xlsx files, 108.2 MB, 524.543025 seconds, 1,235,563 words.
- 132 PowerPoint .pptx files, 145.34 MB, 15.441017 seconds, 6,267 words.
Current and Future Enhancements
- Inception achieves the highest speed of transforming documents into words.
- Writing document information and associated words to a relational database greatly increases directory processing time.
- Database activity has therefore been separated into its own microservice.
- Inception maintains its highest processing speed while writing results to the database proceeds at a slower pace.
- The database service will be enhanced in future to support additional database systems without impact on the Inception service.
- Inception will be enhanced in future to process additional document types, including PDF and LibreOffice documents.
Some of our Competition
Document Processing
- Unlock Data from PDFs and Images.
- Extract data encoded in system-generated PDFs with PDF to Text, and leverage Google Tesseract’s powerful OCR (Optical Character Recognition) capabilities to extract image content from documents (e.g. invoices, business cards, forms, etc.) with Image to Text
- Develop image classification models and apply those models to new images with Image Recognition
- Automatically extract entity names from text using Named Entity Recognition (NER)
Data Mining
- Because Data-Driven Decisions Matter Now More Than Ever
- Empower everyone to generate the insights you need, when you need them.
- Trusted by the World’s Most Influential Brands
- Our customers understand the power of data and analytics.
- That’s why they chose Alteryx to help empower their employees to make transformational outcomes.
Document Processing
- Their product processes documents within shared directories into text, then words, stored in a database.
- Support is provided for Microsoft Office Word, Excel and PowerPoint documents, and PDF documents.
Data Mining
- Clean and transform data from one format to another.
- Free, open source, written in Java, works using a browser.
- Poorly constructed user interface, complex to use.
- A book, “Using OpenRefine”, was written for it, plus complex documentation exists online.
Data Mining
- Venture started by makers of Data Wrangler
- Interactive tool for data cleaning and transformation.
- Less formatting time and larger focus on analysing data.
- Machine learning algorithms suggest common transformations and aggregations.
- Free.
Data Mining
- Simple to use, extensible, text based data workflow management.
- Has data processing steps defined along with their inputs and output.
- Can automatically resolve dependencies, calculate the command to execute in order.
- Organizes command execution around data and its dependencies.
Data Mining
- Data cleaning software-as-a-service.
- Can validate data, perform deduplication, cleanse addresses.
- Helps quickly identify trends and make smarter decisions.
- Standardises raw data from disparate sources.
- Provides good quality data for accurate analysis.
Data Mining
- Clean large amount of data, fuzzy matching, remove duplicates, correct and standardize.
- Support of databases, spreadsheets, CRMs and text files.
Data Mining
- DataMatch and DataMatch Enterprise
- Advanced fuzzy matching algorithms for up to 100 million records.
Data Mining
- Data profiling engine for analyzing data quality.
- Finds missing values, patterns, character sets and other characteristics in a data set.
- Detects duplicates using fuzzy logic.
- Build custom cleansing rules and compose into scenarios to target databases.
Data Mining
- Salesforce data cleansing tool
- Eliminates duplicates, cleans records, and maintains data quality.
- Data is updated in bulk, and imported files are cleansed before accessing Salesforce.
- Automation capabilities regularly scan data for errors.
- Features deleting unnecessary and stale records, update records in bulk, automate on a schedule.
Data Mining
- Utilises Spark for distributed entity resolution, deduplication and record linkage.
- It uses machine learning algorithms to provide entity resolution and fuzzy data matching.
Data Mining
- Database cleansing and management.
- Builds consistent views of customers, vendors, products, locations, etc.
- For big data, business intelligence, data warehousing, master data management, etc.
Inception is designed to fully automate conversion of documents into desired formats.
- This includes automated processing of complex directory chains.
- Inception processes documents at the highest possible speed.
- Inception will perform in a straightforward manner 24/7.
- None of the competitive products provide similar features.
- All products are complex in construct and use.
- No competitive product makes use of a processing command set that allows users to create custom scripts to transform their data.
- Some of the competitive products utilize open source products to process documents into text.
- Use of open-source applications that are usually written in higher-level languages, could:
- Reduce processing speed.
- Eliminate key-features such as directory processing.
- Incur stability, security and privacy risks.
- Cause maintenance and liability problems.
Summary
Inception solves data hoarding issues through rapid automation of document transformation.
Transformed Office documents are parsed into words, which are then separately imported into a database for indexing.
Automated processing of shared directory chains ensures a faster return on investment for clients.
Contact us today.