Data harvesting will keep increasing for years to come.
Harvested data is mostly stored in non-relational databases in cloud space.
Most harvested data goes unused because data-mining software is lacking and the mining software that does exist executes slowly.
Backing up files can result in a digital hoarding problem.
Information within hoarded documents is often forgotten.
Shared directories containing hoarded documents cause a security and privacy nightmare.
Theft occurs when access is provided to digital hoards within organizations.
This includes theft for personal gain and theft to increase personal digital hoards.
Most web sites, applications and operating systems hoard local data on remote networks.
Governments hoard data through network backbones and through backdoors into operating systems, applications, and ISPs.
Many countries pass backbone data to other aligned countries to hoard before it is passed back into the country of origin.
This vast movement of data bypasses the laws that govern surveillance of people and organizations within each country.
The rate of big data hoarding has been accelerating for years.
This form of hoarding cannot be stopped because of the wealth and control associated with big data and AI.
Most big data has not been, and will not be, processed due to bottlenecks associated with data mining.
This makes the data close to useless as it ages without use.
Inception rapidly transforms raw data and documents into other desired forms.
Inception is an automated data pre-processor that can greatly speed up data transformation.
A student could create processing scripts used to transform data into desired forms.
There is no need to craft extensive data-mining software that is written in interpreted languages.
Inception was also customized to automate transformation of Microsoft Office documents.
Inception and database services execute in the background.
After the information is written to the database, it is easy to perform searches of words and phrases to locate documents for review.
The process of transforming digital data into new forms is as old as the computer industry.
Inception was created, recreated, rewritten and perpetually enhanced over many years.
Inception was refactored and largely rewritten in a selective mixture of C and C++.
We have a great deal of experience developing software in languages ranging from low level to high level.
We could have quickly rewritten Inception in higher-level languages.
We chose to take the difficult route of coding in C and C++.
Refactoring Inception in any of the higher-level languages would have saved us a great deal of time and cost.
We could have made use of open-source applications to save us even more time and cost.
Use of open-source applications, which are usually written in higher-level languages, could:
Inception was refined to achieve the highest processing speed.
We chose to develop key portions of Inception in C to directly load and process files in memory because this approach is the fastest and provides the greatest control.
We did not use C++ memory handling because it makes indirect use of memory, which is slower than direct access.
Direct memory access is faster in situations where CPU caching is ineffective due to multi-threaded processing.
We did not use C++ vectors and other associated functions.
The speed difference between direct and indirect memory access becomes noticeable when processing thousands to millions of documents.
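As a rough illustration of what loading a file directly into memory can look like in C, the sketch below reads a whole file into a single heap buffer that later processing can address by pointer. It is not taken from the Inception code base; the function name `load_file` and the extra terminating byte are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Inception.
 * Reads an entire file into one malloc'd buffer. On success it returns the
 * buffer (the caller frees it) and stores the byte count in *size_out.
 * On any failure it returns NULL. */
uint8_t *load_file(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Find the file size by seeking to the end. */
    if (fseek(fp, 0, SEEK_END) != 0) {
        fclose(fp);
        return NULL;
    }
    long end = ftell(fp);
    if (end < 0) {
        fclose(fp);
        return NULL;
    }
    rewind(fp);

    size_t size = (size_t)end;
    /* One extra byte holds a NUL so string routines can scan the buffer. */
    uint8_t *buf = malloc(size + 1);
    if (buf == NULL) {
        fclose(fp);
        return NULL;
    }

    if (fread(buf, 1, size, fp) != size) {
        free(buf);
        fclose(fp);
        return NULL;
    }
    fclose(fp);

    buf[size] = '\0';
    *size_out = size;
    return buf;
}
```

A caller would typically write `size_t n; uint8_t *doc = load_file(path, &n);`, process the `n` bytes in place, and call `free(doc)` when finished.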
We created Inception for speed, stability and user-control.
We utilized the C++ object-oriented framework to make it easier to isolate functionality and maintain Inception.
The Inception code base is well constructed, well commented and documented, and fairly easy to understand.
We can enhance Inception to automate processing of other document types as needed in the future.
Inception automates document transformation.
It does so from within custom software that utilizes the Inception API.
It does so when used as a Linux service that automates document processing.
Files and directory chains transferred into an input queue must be fully written before use.
This feature is critical for document processing automation.
Without validation, files and directories could be moved or loaded into memory while still being written.
This problem was discovered while developing the Kryptera HSM.
Tracking down a solution required low-level R&D that was specific to Linux.
A variable delay is placed after a file or directory chain is completely written.
The delay is only needed if cached writes occur after file writes are complete.
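The source does not describe how Inception decides that a file is completely written. One common heuristic, shown here only as a sketch, is to sample a file's size and modification time twice, separated by a delay, and to treat the file as settled only if neither value changed. The function name `file_is_stable` and the delay parameter are hypothetical, and a writer that pauses for longer than the delay would defeat this check.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper, not Inception's actual mechanism.
 * Returns 1 if the file's size and modification time are unchanged after
 * waiting delay_ms milliseconds, 0 if either changed, and -1 if the file
 * cannot be examined. */
int file_is_stable(const char *path, long delay_ms)
{
    assert(path != NULL && delay_ms >= 0);

    struct stat before;
    struct stat after;
    struct timespec delay;

    if (stat(path, &before) != 0)
        return -1;

    /* Wait before taking the second sample. */
    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;
    nanosleep(&delay, NULL);

    if (stat(path, &after) != 0)
        return -1;

    /* st_mtime has one-second resolution, so the size check does most of
     * the work for short delays. */
    return before.st_size == after.st_size &&
           before.st_mtime == after.st_mtime;
}
```

A queue scanner built on this idea would skip any entry for which `file_is_stable(path, delay_ms)` does not return 1 and retry it on the next pass.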
Inception builds and runs on Debian and related trees such as Ubuntu and Devuan, plus Red Hat, CentOS and Fedora.
It would be fairly simple to refine Inception to build and execute under Unix.
Inception will run as an initd or systemd service, a standalone service-like application, or a standalone application to create processing scripts.
Inception shared and static libraries can be used by custom software to automate document and data transformation.
We chose to develop Inception to operate under Linux rather than Windows to ensure the highest processing speed and stability.
Windows itself, along with its applications and services, consumes memory, CPU cores, file storage and network bandwidth, which would reduce Inception processing speed.
Refining Inception to build and execute under Windows will occur in the near future.
Inception service startup flags.
./ServiceInception -folder_input input/ -folder_output output/ -folder_process process/ -folder_script script/ -folder_error error/ -folder_log log/
This starts Inception as a standalone application which can also be used as a systemd service.
-start_daemon is passed to start Inception as a background initd daemon (service). The design allows shared directories to be used for input and output queues, and allows use of higher-speed storage for thread-processing space.
PostgreSQL database microservice startup flags.
./ServicePostgreSQL -folder_input input/ -folder_log log/
Pass -start_daemon to start the service as a background initd daemon. Set -folder_input to the same directory that Inception uses for -folder_output.
Inception Script Creation Flags.
./ServiceInception -test_source path/source_file -test_target path/target_file -test_script path/script_file
The flags are used to test creation or refinement of a processing script.
Load the source, script and target files into a text editor.
Change the script, then rerun the command line to regenerate the target for review.
A key Inception feature is user-defined processing scripts used to automate document transformation into desired formats.
The Inception scripting language contains many commands that can be separated into groups:
hide <tag_open>data</tag_open>,
move down till Tag Level == tag_level_to
then start a new block using the passed parameters
making sure to include the extracted data.
Our scripting language can be extended in future to better process big data and other document formats.
We added functionality to automate transformation of Microsoft Office documents into words that are directly added to a relational database.
A five-line script that uses only three commands transforms Word XML documents into text:
EliminateContent|<?xml|<w:t>|1|
SwapStrings|</w:t>| _TE_|
SwapStrings|<w:t>|_TB_|
EliminateContentAll|_TE_|_TB_|1|
EliminateContent| _TE_|</w:document>|1|
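For readers who think in C rather than in the scripting language, the sketch below produces a comparable result by hand. It assumes that the net effect of the Word script is to keep only the text held inside `<w:t>` elements, and it is not how Inception itself is implemented. The helper name `extract_tag_text` is hypothetical, and elements written with attributes, such as `<w:t xml:space="preserve">`, are not matched by this simple version.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the Inception API.
 * Keeps only the text found between open_tag and close_tag, separating
 * fragments with one space. Works in place on a NUL-terminated buffer and
 * returns the length of the resulting text. */
size_t extract_tag_text(char *buf, const char *open_tag, const char *close_tag)
{
    assert(buf != NULL && open_tag != NULL && close_tag != NULL);

    size_t open_len = strlen(open_tag);
    size_t close_len = strlen(close_tag);
    assert(open_len > 0 && close_len > 0);

    char *read = buf;   /* scan position in the original content */
    char *write = buf;  /* end of the compacted output, always behind read */

    while ((read = strstr(read, open_tag)) != NULL) {
        char *text = read + open_len;
        char *end = strstr(text, close_tag);
        if (end == NULL)
            break;      /* unterminated element: stop scanning */

        size_t n = (size_t)(end - text);
        if (write != buf)
            *write++ = ' ';
        memmove(write, text, n);
        write += n;
        read = end + close_len;
    }

    *write = '\0';
    return (size_t)(write - buf);
}
```

Because the output is never longer than the input, the text is compacted inside the original buffer and no second allocation is needed.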
A six-line script that uses only four commands transforms Excel XML documents into text:
RemoveBetween|<t|>|
EliminateContent|<?xml|<t>|1|
SwapStrings|</t>| _TE_|
SwapStrings|<t>|_TB_|
EliminateContentAll|_TE_|_TB_|1|
EliminateContent| _TE_|</sst>|1|
A six-line script that uses only four commands transforms PowerPoint XML slides into text:
RemoveWithout|<a:t>|<?xml><a:t></a:t></p:sld>|
EliminateContent|<?xml|<a:t>|1|
SwapStrings|</a:t>| _TE_|
SwapStrings|<a:t>|_TB_|
EliminateContentAll|_TE_|_TB_|1|
EliminateContent| _TE_|</p:sld>|1|
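Once a document has been reduced to plain text, the remaining step described above is to separate that text into words for loading into a database. The sketch below shows one simple way to do this in C, treating any run of alphanumeric characters as a word. The source does not state how Inception defines a word, so this rule, the function name `split_words` and the callback type `word_fn` are all assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical callback type: receives each word, its length, and a
 * caller-supplied context pointer. The word is not NUL-terminated. */
typedef void (*word_fn)(const char *word, size_t len, void *ctx);

/* Hypothetical helper, not part of the Inception API.
 * Reports every run of alphanumeric characters in text to on_word (which
 * may be NULL) and returns the number of words found. */
size_t split_words(const char *text, word_fn on_word, void *ctx)
{
    assert(text != NULL);

    size_t count = 0;
    const char *p = text;

    while (*p != '\0') {
        /* Skip separators such as spaces and punctuation. */
        while (*p != '\0' && !isalnum((unsigned char)*p))
            p++;

        /* Consume one word. */
        const char *start = p;
        while (*p != '\0' && isalnum((unsigned char)*p))
            p++;

        if (p > start) {
            if (on_word != NULL)
                on_word(start, (size_t)(p - start), ctx);
            count++;
        }
    }
    return count;
}
```

A callback could, for example, print each word with `printf("%.*s\n", (int)len, word)` or append it to an in-memory table before a database load.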
Five API functions are provided that enable custom software to automate document transformation.
The functions differ in the data that is passed to and from the API.
API naming is based on the International Civil Aviation Organization (ICAO) alphabet.
/**
 * \brief Load source and processing script, write target, deallocate source and script memory, caller has target file.
 *
 * Buffers pSourcePath & pProcessingScript, calls Delta, writes pTargetPath,
 * deallocates pSourceBuffer & pProcessingScriptBuffer, does not return pSourceBuffer.
 *
 * \param pSourcePath
 * \param pTargetPath
 * \param pProcessingScript
 * \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
 * \return FAILURE_OKAY on success else various FAILURE_* on error
 */
int32_t Alpha( char *pSourcePath, char *pTargetPath, char *pProcessingScript, uint32_t *pSizeSourceBuffer );
/**
 * \brief Load processing script, write target, deallocate script memory, caller has target file and deallocates pSourceBuffer.
 *
 * Buffers pProcessingScript, calls Delta, writes pTargetPath,
 * deallocates pProcessingScriptBuffer, does not return pSourceBuffer.
 *
 * \param pSourceBuffer
 * \param pTargetPath
 * \param pProcessingScript
 * \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
 * \return FAILURE_OKAY on success else various FAILURE_* on error
 */
int32_t Bravo( uint8_t *pSourceBuffer, char *pTargetPath, char *pProcessingScript, uint32_t *pSizeSourceBuffer );
/**
 * \brief Load processing script, deallocate script, returns target (pSourceBuffer), caller writes target
 * and deallocates pSourceBuffer.
 *
 * Buffers pProcessingScript, calls Delta, deallocates pProcessingScriptBuffer, returns altered pSourceBuffer.
 *
 * \param pSourceBuffer
 * \param pProcessingScript
 * \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
 * \return FAILURE_OKAY on success else various FAILURE_* on error
 */
int32_t Charlie( uint8_t *pSourceBuffer, char *pProcessingScript, uint32_t *pSizeSourceBuffer );
/**
 * \brief Returns target (pSourceBuffer), caller has target (pSourceBuffer) and deallocates pSourceBuffer
 * and pProcessingScriptBuffer.
 *
 * Delta is the class called by all others.
 *
 * \param pSourceBuffer
 * \param pProcessingScriptBuffer
 * \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
 * \param pMaxSourceBuffer: size of pSourceBuffer when malloc'd
 * \return FAILURE_OKAY on success else various FAILURE_* on error
 */
int32_t Delta( uint8_t *pSourceBuffer, uint8_t *pProcessingScriptBuffer, uint32_t *pSizeSourceBuffer, uint64_t pMaxSourceBuffer );
/**
 * \brief processing script is populated, load source, write target, deallocate source memory, caller has target file.
 *
 * Buffers pSourcePath & pProcessingScript, calls Delta, writes pTargetPath,
 * deallocates pSourceBuffer & pProcessingScriptBuffer, does not return pSourceBuffer.
 *
 * \param pSourcePath
 * \param pTargetPath
 * \param pProcessingScriptBuffer
 * \param pSizeSourceBuffer pointer to the size of the pSourceBuffer
 * \return FAILURE_OKAY on success else various FAILURE_* on error
 */
int32_t Echo( char *pSourcePath, char *pTargetPath, uint8_t *pProcessingScriptBuffer, uint32_t *pSizeSourceBuffer );
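The calling pattern implied by the Charlie documentation (the caller supplies a heap buffer, receives the transformed content in the same buffer, and frees it afterwards) might look like the sketch below. Several things here are assumptions rather than facts from the source: the numeric value of FAILURE_OKAY, the stand-in body of Charlie (included only so the sketch compiles without the Inception library), the wrapper name `transform_text`, and the idea that the buffer needs no spare capacity beyond the text itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FAILURE_OKAY 0  /* ASSUMPTION: the real value is not given in the source */

/* Stand-in with the documented Charlie signature so this sketch compiles.
 * It leaves the buffer untouched; the real function applies the script. */
int32_t Charlie( uint8_t *pSourceBuffer, char *pProcessingScript, uint32_t *pSizeSourceBuffer )
{
    (void)pSourceBuffer;
    (void)pProcessingScript;
    (void)pSizeSourceBuffer;
    return FAILURE_OKAY;
}

/* Hypothetical wrapper: copies text into a heap buffer, passes it to
 * Charlie, and returns the buffer with its reported size in *size_out.
 * The caller frees the result. Returns NULL on failure. */
uint8_t *transform_text(const char *text, char *script_path, uint32_t *size_out)
{
    assert(text != NULL && script_path != NULL && size_out != NULL);

    uint32_t size = (uint32_t)strlen(text);
    uint8_t *buffer = malloc((size_t)size + 1);
    if (buffer == NULL)
        return NULL;
    memcpy(buffer, text, (size_t)size + 1);

    if (Charlie(buffer, script_path, &size) != FAILURE_OKAY) {
        /* Assumed: the caller still owns the buffer when the call fails. */
        free(buffer);
        return NULL;
    }

    *size_out = size;   /* size as reported back through the pointer */
    return buffer;
}
```

In real use the Charlie declaration would come from the Inception library headers instead of the stand-in, and the script path would point at an actual processing script.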
Inception can be deployed in a virtualized environment such as a Docker container.
Docker containers are easy to create, configure and use.
Docker supports Volumes which enable use of Inception input and output queues.
Kubernetes can be used to orchestrate one or more Docker containers.
Intel Core i3-4150 4 CPU cores @ 3.50 GHz / 800 MHz,
1 TB 7200 RPM SATA HDD, 8 async threads limit.
Documents transformed into text.
Documents transformed into text, split into words, retained in memory, with memory tables saved as binary files.
PostgreSQL database microservice loading saved tables into memory, then writing results to a relational database.
Current and Future Enhancement.
Inception is designed to fully automate conversion of documents into desired formats.
This includes automated processing of directory chains.
Inception is designed to run unattended, in a straightforward manner, 24/7.
No competing product provides similar features.
All of them are complex in construction and use.
Inception solves data hoarding issues through rapid automation of document transformation.
Transformed Office documents are parsed into individual words, which are then imported into a database for indexing.
Automated processing of shared directory chains ensures a faster return on investment for clients.
Copyright © Kryptera