In this article, we will look at AWS Data Pipeline and its main characteristics.
The principle of pipelined data processing
A data processing pipeline is, in essence, a set of transformations that must be applied to the input data. The work is complicated by the fact that information usually arrives at the pipeline's input in an unverified, unstructured form, while consumers want to see it in an easy-to-understand form.
Pipelined processing improves resource utilization for a given set of processes, each of which uses those resources in a predetermined order. A good example of a pipeline organization is an assembly line in manufacturing, where a product passes through all stages in sequence, right up to the finished item. The advantage of this method is that every product moves along the same path through the same set of resources, and as soon as a resource is released by one product, it can immediately be used by the next one, without waiting for the previous product to reach the end of the line. If the line carries similar but not identical products, it is a sequential pipeline; if all products are identical, it is a vector pipeline.
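To make the idea concrete, here is a minimal, generic sketch of pipelined processing in plain Python; it is not tied to AWS Data Pipeline, and the stage functions and record format are made up for the example. Each stage consumes items from the previous stage as soon as they are ready, instead of waiting for the whole batch to finish.

```python
# A generic pipeline of generator stages: raw input -> parsed records -> enriched rows.
# Each stage starts working on an item as soon as the previous stage yields it.
def read(lines):
    for line in lines:                 # stage 1: unverified, unstructured input
        yield line.strip()

def parse(records):
    for record in records:             # stage 2: give the data structure
        name, value = record.split(",")
        yield {"name": name, "value": float(value)}

def enrich(rows):
    for row in rows:                   # stage 3: shape it for the consumer
        row["label"] = f'{row["name"]}={row["value"]:.2f}'
        yield row

raw = ["a,1.5\n", "b,2.25\n"]
for row in enrich(parse(read(raw))):
    print(row)
```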
The instruction-processing unit of the simplest processor includes four stages (a rough simulation follows the list):
- fetching an instruction from memory;
- decoding it;
- calculating the address and fetching the operand;
- execution.
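As a rough illustration (not taken from the original text), the sketch below simulates how these four stages overlap in time: while one instruction is executing, the next is already fetching its operand, the one after it is decoding, and so on. The stage and instruction names are placeholders.

```python
# Simulate a simple 4-stage instruction pipeline and show which instruction
# occupies which stage on each clock cycle.
STAGES = ["fetch", "decode", "operand", "execute"]

def pipeline_schedule(instructions):
    """Yield (cycle, {stage: instruction}) for every clock cycle."""
    n = len(instructions)
    total_cycles = n + len(STAGES) - 1
    for cycle in range(total_cycles):
        occupancy = {}
        for stage_index, stage in enumerate(STAGES):
            instr_index = cycle - stage_index
            if 0 <= instr_index < n:
                occupancy[stage] = instructions[instr_index]
        yield cycle + 1, occupancy

for cycle, occupancy in pipeline_schedule(["i1", "i2", "i3", "i4"]):
    print(f"cycle {cycle}: {occupancy}")
```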
The purpose of AWS Data Pipeline
Amazon Web Services has released a service that works together with its other services; specifically, it helps transfer data between:
- S3
- MySQL RDS / External MySQL servers
- DynamoDB
AWS Data Pipeline allows the user to copy and migrate data from SQL and DynamoDB tables to S3 and vice versa.
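The service can be driven not only from the console but also programmatically. Below is a minimal sketch using the boto3 `datapipeline` client; the region, pipeline name, and uniqueId are illustrative assumptions, and valid AWS credentials are required.

```python
# Minimal sketch: create an (empty) pipeline and list the pipelines in the account.
# Assumes boto3 is installed and AWS credentials are configured; names are placeholders.
import boto3

client = boto3.client("datapipeline", region_name="us-east-1")

# Create a pipeline shell; data nodes and activities are added later
# (through the console or put_pipeline_definition).
created = client.create_pipeline(
    name="example-dynamodb-to-s3",
    uniqueId="example-dynamodb-to-s3-001",
)
print("Created pipeline:", created["pipelineId"])

# List the pipelines visible to this account and region.
for pipeline in client.list_pipelines()["pipelineIdList"]:
    print(pipeline["id"], pipeline["name"])
```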
The multi-stage nature of image analysis tasks itself dictates the choice of a data pipeline architecture for the underlying data processing infrastructure. A data pipeline allows you to (a small sketch follows the list):
- Combine filters: image filters from a set can be chained together, forming a processing pipeline in which a sequence of operations is applied to the original images.
- Explore the effect of changing parameters: once the filters are chained, it is easy to change the parameters of any filter in the chain and examine the effect of those changes on the resulting image.
- Stream through memory: large images can be processed by holding and modifying only a few image blocks at any given time, which makes it possible to handle images that cannot fit in RAM at all.
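Here is a minimal sketch of this filter-chaining idea in plain Python, not tied to any particular imaging library; the filters, block format, and sample data are made up for the example. Filters are plain functions applied block by block, so parameters are easy to change and the whole image never has to sit in RAM.

```python
# Chain simple "filters" and stream image blocks through them one at a time.
from typing import Callable, Iterable, List

Block = List[List[float]]          # a tile of pixel values
Filter = Callable[[Block], Block]  # a filter transforms one block

def invert(block: Block) -> Block:
    return [[1.0 - px for px in row] for row in block]

def threshold(level: float) -> Filter:
    def apply(block: Block) -> Block:
        return [[1.0 if px >= level else 0.0 for px in row] for row in block]
    return apply

def run_chain(blocks: Iterable[Block], chain: List[Filter]) -> Iterable[Block]:
    """Stream blocks through the filter chain, one block at a time."""
    for block in blocks:
        for f in chain:
            block = f(block)
        yield block

# Changing a parameter (the threshold level) only means rebuilding one filter.
chain = [invert, threshold(0.5)]
sample_blocks = [[[0.2, 0.8], [0.4, 0.6]]]
print(list(run_chain(sample_blocks, chain)))
```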
By default, Data Pipeline provides several templates:
- Export from DynamoDB to S3
- Import from S3 to DynamoDB
- Copy from S3 to RDS
- Copy from RDS to S3
- Analyze files in S3
- Migrate from non-RDS MySQL to S3
The process is easily customized through a graphical interface: you drag and drop elements, set parameters, and so on.
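The same customization can also be done without the GUI, by uploading a pipeline definition through the API. Below is a minimal sketch using boto3's `put_pipeline_definition` and `activate_pipeline`; the object names, the `ShellCommandActivity` example, the region, and the default IAM roles (`DataPipelineDefaultRole` / `DataPipelineDefaultResourceRole`) are assumptions for illustration rather than values taken from the templates above.

```python
# Sketch: define a trivial scheduled pipeline in code instead of the GUI.
# Assumes boto3, configured AWS credentials, and the default Data Pipeline IAM roles.
import boto3

client = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = client.create_pipeline(
    name="example-shell-pipeline",
    uniqueId="example-shell-pipeline-001",
)["pipelineId"]

pipeline_objects = [
    {   # Default settings inherited by the other objects.
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ],
    },
    {   # Run once a day, starting when the pipeline is activated.
        "id": "DailySchedule",
        "name": "DailySchedule",
        "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
        ],
    },
    {   # A placeholder activity; a real pipeline would copy or transform data here.
        "id": "EchoActivity",
        "name": "EchoActivity",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo pipeline-run"},
            {"key": "runsOn", "refValue": "WorkerInstance"},
        ],
    },
    {   # The EC2 instance that the activity runs on.
        "id": "WorkerInstance",
        "name": "WorkerInstance",
        "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "instanceType", "stringValue": "t1.micro"},
            {"key": "terminateAfter", "stringValue": "30 Minutes"},
        ],
    },
]

# Upload the definition, and activate the pipeline only if validation passed.
result = client.put_pipeline_definition(
    pipelineId=pipeline_id, pipelineObjects=pipeline_objects
)
if not result["errored"]:
    client.activate_pipeline(pipelineId=pipeline_id)
```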