PySpark Join DataFrames


PySpark provides multiple ways to combine dataframes, e.g. join, merge, union, and the SQL interface. In this article, we will take a look at how the PySpark join function is similar to a SQL join, where two or more tables or dataframes can be combined based on conditions.

Let's take a look at some of the join operations supported by PySpark with examples. First, we'll create two dataframes from Python dictionaries; we will be using these two dataframes throughout this article.
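The article's original listing did not survive extraction, so here is a minimal sketch of the kind of setup it describes; the column names and values ('name', 'dept', 'salary', 'abc', ...) are illustrative assumptions, not the article's originals:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-examples").getOrCreate()

# Hypothetical data, built from Python dictionaries as the article describes.
df1 = spark.createDataFrame([{'name': 'abc', 'dept': 'sales'},
                             {'name': 'def', 'dept': 'hr'}])
df2 = spark.createDataFrame([{'name': 'abc', 'salary': 1000},
                             {'name': 'xyz', 'salary': 2000}])
df1.show()
df2.show()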

The following kinds of joins are explained in this article. 

  • Inner Join.

  • Outer Join.

  • Left Join.

  • Right Join.

  • Left Semi Join.

  • Left Anti Join.

  • Inner Join with advanced conditions.


Let's take a detailed look at each of them.

Inner Join

The inner join selects matching records from both of the dataframes. The match is performed on the column(s) specified in the on parameter. In this example, rows from both dataframes are joined when the join column has the same value, e.g. 'abc'.
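A hedged sketch of such an inner join, using the illustrative df1/df2 created above:

df_inner = df1.join(df2, on='name', how='inner')
df_inner.show()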

Outer Join

An outer join combines data from both dataframes, irrespective of whether the on columns match or not. If there is a match, one combined row is created; if there is no match, the missing columns for that row are filled with null.

Left Join

A left join takes all the data from the left dataframe (i.e. df1 in this example) and performs matches on the on column. If a match is found, values are filled in from the matching row; if not, the unavailable values are filled with null.

Right Join

This is the same as the left join operation, but performed with respect to the right-side dataframe, i.e. df2 in this example.
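For reference, these three variants differ only in the how argument (again a sketch using the illustrative df1/df2):

df1.join(df2, on='name', how='outer').show()
df1.join(df2, on='name', how='left').show()
df1.join(df2, on='name', how='right').show()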

Left Semi Join

This is like an inner join, except that only the left dataframe's columns and values are selected.

Left Anti Join

This join is like df1-df2, as it selects all rows from df1 that are not present in df2.   
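A sketch of both the semi and anti variants; 'leftsemi' and 'leftanti' are the how values PySpark accepts:

df1.join(df2, on='name', how='leftsemi').show()
df1.join(df2, on='name', how='leftanti').show()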

Inner Join With Advanced Conditions

In addition, PySpark lets you pass an arbitrary condition expression instead of a simple column list in the on parameter. For example, if you want to join range-based Geo Location data, you may want to match on latitude/longitude ranges.
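The article's example code was not preserved; here is a sketch of what such a range condition could look like, assuming the spark session from the earlier sketch and hypothetical lat/lon column names:

points_df = spark.createDataFrame(
    [("a", 40.7, -73.9), ("b", 34.0, -118.2)], ["point_id", "lat", "lon"])
regions_df = spark.createDataFrame(
    [("nyc", 40.4, 41.0, -74.3, -73.6)],
    ["region", "lat_min", "lat_max", "lon_min", "lon_max"])

# Join each point to every region whose bounding box contains it.
joined = points_df.join(
    regions_df,
    (points_df.lat >= regions_df.lat_min) & (points_df.lat <= regions_df.lat_max) &
    (points_df.lon >= regions_df.lon_min) & (points_df.lon <= regions_df.lon_max),
    how="inner")
joined.show()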

I hope this article helps you understand some functionalities that PySpark joins provide.  


Source: https://dzone.com/articles/pyspark-join-explained-with-examples

Join and Aggregate PySpark DataFrames

We've had quite a journey exploring the magical world of PySpark together. After covering DataFrame transformations, structured streams, and RDDs, there are only so many things left to cross off the list before we've gone too deep.

To round things out for this series, we're going to take a look back at some powerful DataFrame operations we missed. In particular, we'll be focusing on operations that modify DataFrames as a whole, such as joins and aggregations.

Joining DataFrames in PySpark

I'm going to assume you're already familiar with the concept of SQL-like joins. To demonstrate these in PySpark, I'll create two simple DataFrames: a customers DataFrame and an orders DataFrame:
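The code listing itself didn't survive extraction; reconstructed from the join results shown further down, the two DataFrames look roughly like this (a sketch, assuming an existing spark session):

customers = [
    (1, "bob", 3462543658686),
    (2, "rob", 9087567565439),
    (3, "tim", 5436586999467),
    (4, "tom", 8349756853250),
]
customersDF = spark.createDataFrame(customers, ["id", "name", "credit_card_number"])

orders = [
    (1, "ketchup", "bob", 1.20),
    (2, "rutabaga", "bob", 3.35),
    (3, "fake vegan meat", "rob", 13.99),
    (4, "cheesey poofs", "tim", 3.99),
    (5, "ice cream", "tim", 4.95),
    (6, "protein powder", "tom", 49.95),
]
ordersDF = spark.createDataFrame(orders, ["id", "product_name", "customer", "price"])

customersDF.show()
ordersDF.show()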

Here's how they look:

The DataFrames we just created.

Now we have two cliche tables to work with.

Before we join these two tables it's important to realize that table joins in Spark are relatively "expensive" operations, which is to say that they utilize a fair amount of time and system resources.

Inner Joins

Without specifying the type of join we'd like to execute, PySpark will default to an inner join. Joins are possible by calling the join() method on a DataFrame:
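A sketch of the call the next two paragraphs describe:

innerJoinDF = customersDF.join(ordersDF, customersDF.name == ordersDF.customer)
innerJoinDF.show()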

The first argument join() accepts is the "right" DataFrame that we'll be joining on to the DataFrame we're calling the function on.

Next, we specify the "on" condition of our join. In our example, we're telling our join to compare the "name" column of customersDF to the "customer" column of ordersDF. Here's how it turned out:

id  name  credit_card_number  id  product_name     customer  price
2   rob   9087567565439       3   fake vegan meat  rob       13.99
3   tim   5436586999467       4   cheesey poofs    tim       3.99
3   tim   5436586999467       5   ice cream        tim       4.95
4   tom   8349756853250       6   protein powder   tom       49.95
1   bob   3462543658686       1   ketchup          bob       1.2
1   bob   3462543658686       2   rutabaga         bob       3.35

Right, Left, and Outer Joins

We can pass the keyword argument "how" into join(), which specifies the type of join we'd like to execute. how accepts inner, outer, left, and right, as you might imagine. how also accepts a few redundant aliases like leftouter (same as left).
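For instance, a left outer join with the same condition (a sketch):

leftJoinDF = customersDF.join(ordersDF, customersDF.name == ordersDF.customer, how="left")
leftJoinDF.show()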

Cross Joins

The last type of join we can execute is a cross join, also known as a cartesian join. Cross joins are a bit different from the other types of joins, thus cross joins get their very own DataFrame method, crossJoin():
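A minimal sketch:

crossJoinDF = customersDF.crossJoin(ordersDF)
crossJoinDF.show()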

Cross joins create a new row in DataFrame #1 per record in DataFrame #2:

Anatomy of a cross join.

Aggregating Data

Spark allows us to perform powerful aggregate functions on our data, similar to what you're probably already used to in either SQL or Pandas. The data I'll be aggregating is a dataset of NYC motor vehicle collisions because I'm a sad and twisted human being:

[Sample rows from the NYC motor vehicle collisions dataset. Columns include: borough, contributing_factor_vehicle_1, contributing_factor_vehicle_2, cross_street_name, timestamp, latitude, longitude, location, number_of_cyclist_injured, number_of_cyclist_killed, number_of_motorist_injured, number_of_motorist_killed, number_of_pedestrians_injured, number_of_pedestrians_killed, number_of_persons_injured, number_of_persons_killed, off_street_name, on_street_name, unique_key, vehicle_type_code1, vehicle_type_code2, zip_code.]

We're going to become familiar with two functions here: agg() and groupBy(). These are typically used in tandem, but agg() can be used on a dataset without groupBy():
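A sketch of aggregating without grouping, assuming the collisions data has been loaded into a DataFrame named df:

from pyspark.sql import functions as F

df.agg(F.sum("number_of_persons_injured").alias("total_injuries"),
       F.max("number_of_persons_injured").alias("worst_single_accident")).show()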

Aggregating without performing groupBy() typically isn't entirely useful:

Let's derive some deeper meaning from our data by combining groupBy() with agg().

Using groupBy()

Let's see which boroughs lead the way in terms of the number of accidents:
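The original listing was stripped; a sketch of the query:

df.groupBy("borough").count().orderBy("count", ascending=False).show()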

The results:

Queens leads the way with 241 accidents from our sample size! Get your shit together, Queens. Let's see which borough is the deadliest:
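Again a sketch, this time summing the killed counts per borough:

from pyspark.sql import functions as F

df.groupBy("borough") \
  .agg(F.sum("number_of_persons_killed").alias("deaths")) \
  .orderBy("deaths", ascending=False) \
  .show()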

Here we go:

Well... alright then.

Grouping By Multiple Columns

Often times we'll want to group by multiple columns to see more complex breakdowns. Here we group by both borough and "main contributing factor":
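A sketch consistent with the results table below (the 'injuries' alias matches the column shown there):

from pyspark.sql import functions as F

df.groupBy("borough", "contributing_factor_vehicle_1") \
  .agg(F.sum("number_of_persons_injured").alias("injuries")) \
  .orderBy("borough", "injuries", ascending=False) \
  .show(30)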

This will show us the most common type of accidents per borough:

borough         contributing_factor_vehicle_1                          injuries
STATEN ISLAND   Driver Inattention/Distraction                         4
STATEN ISLAND   Unspecified                                            4
STATEN ISLAND   Failure to Yield Right-of-Way                          3
QUEENS          Driver Inattention/Distraction                         21
QUEENS          Failure to Yield Right-of-Way                          10
QUEENS          Unspecified                                            8
QUEENS          Traffic Control Disregarded                            5
MANHATTAN       Driver Inattention/Distraction                         19
MANHATTAN       Unspecified                                            17
MANHATTAN       Pedestrian/Bicyclist/Other Pedestrian Error/Confusion  5
MANHATTAN       Failure to Yield Right-of-Way                          4
MANHATTAN       Passing or Lane Usage Improper                         3
BROOKLYN        Driver Inattention/Distraction                         15
BROOKLYN        Failure to Yield Right-of-Way                          6
BROOKLYN        Unspecified                                            5
BROOKLYN        Oversized Vehicle                                      4
BROOKLYN        Driver Inexperience                                    3
BRONX           Unspecified                                            16
BRONX           Driver Inattention/Distraction                         6
BRONX           Driver Inexperience                                    6
BRONX           Unsafe Lane Changing                                   3
BRONX           Pedestrian/Bicyclist/Other Pedestrian Error/Confusion  3
BRONX           Passing or Lane Usage Improper                         3
BRONX           Traffic Control Disregarded                            3

So far we've aggregated by using the count() and sum() functions. As you might imagine, we could also aggregate by using the min(), max(), and avg() functions. There's one additional function worth special mention as well, called corr().

Determining Column Correlation

If you're the scientific type, you're going to love aggregating using corr(). corr() determines whether two columns have any correlation between them, and outputs a number representing the strength of that correlation:
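A sketch of corr() against two of the collision count columns:

from pyspark.sql import functions as F

df.agg(F.corr("number_of_cyclist_injured", "number_of_motorist_injured")
        .alias("correlation")).show()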

Example output:

Databricks Visualizations on Aggregations

If you're following along in a Databricks notebook, there are a ton of cool visualizations that come standard with the display() command to complement any aggregations we perform. These are especially useful when trying to understand the distribution of the aggregate functions we perform.

I went ahead and pieced together a breakdown of people injured in accidents below. We're splitting our results by borough, and then seeing the distribution of people injured between cyclists and motorists:

Creating a visualization in Databricks.

While customizing a bar plot, "keys" determines the values across the x-axis. I'm measuring by a number of "values" here, which is to say that multiple measurements across the y-axis will be shown.

This particular chart lends itself well to a stacked bar chart, which we create by specifying bar chart as our display type and then specifying stacked in the additional options. Databricks allows for all sorts of additional cool visualizations like geographical charts, scatter plots, and way more.

Happy Trails

We've been through a lot on this PySpark journey together. As much as I'd love to keep you here forever, every good parent knows when it's time for their children to leave the nest and fly on their own. I'll leave you with some advice my parents gave me: go get a job and get out of my god-damn house.

Source: https://hackersandslackers.com/join-aggregate-pyspark-dataframes/

Introduction to Pyspark join types

This article is written in order to visualize different join types, as a cheat sheet, so that all types of joins are listed in one place with examples and without stupid circles. Aaah, circles!

I am tired of these explanations of joins with intersections of sets and circles. They seem clear and understandable, but in fact they are at least inaccurate and often plain wrong. Let's see why, and mention a couple of nuances of the joins.


Firstly, a dataframe/table is not a set. In a set, all elements are unique; there should be no duplicates. In tables, in general, this is not the case. Secondly, the term "intersection" is confusing to people.


Let's describe the data we will be working with:

Two dataframes: superheroes and their races. They're connected through an id column.
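The article's listings did not survive extraction; a hypothetical sketch consistent with the text (Iron Man and Deadpool appear in both tables, while Kryptonian has no matching hero), assuming an existing spark session:

heroes = spark.createDataFrame(
    [(1, "Iron Man"), (2, "Deadpool"), (3, "Batman")], ["id", "name"])
races = spark.createDataFrame(
    [(1, "Human"), (2, "Mutant"), (4, "Kryptonian")], ["id", "race"])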

Cross join

As the saying goes, the cross product of big data and big data is an out-of-memory exception. [Holden’s "High-Performance Spark"]

Let's start with the cross join.

This join simply combines each row of the first table with each row of the second table. For example, if we have n rows in one table and m rows in another, the result contains n × m rows. So imagine that a small table of 1,000 customers combined with a product table of 1,000 records will produce 1,000,000 records! Try to avoid this with large tables in production.

Also, to get around the error when running a query with an implicit cross join, we have to set spark.sql.crossJoin.enabled to true in our Spark session, but please do not do that. For this type of connection it is better to use the dedicated Spark API method crossJoin(). The reason is that you explicitly tell Spark to use this dangerous method in a particular situation, knowing what you are doing. And you'll be safe in the other case, where you accidentally forget to add the on argument (which, for Spark, means a cross join) - Spark will warn you about it with an exception.
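A sketch of the two options, using the hypothetical heroes/races tables above (the config key and method are standard Spark APIs):

# Preferred: be explicit about the cartesian product.
heroes.crossJoin(races).show()

# Implicit cross join -- join() without a condition -- only works with this
# setting enabled (in older Spark versions an exception is raised otherwise):
spark.conf.set("spark.sql.crossJoin.enabled", "true")
heroes.join(races).show()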

Inner join

Inner join

This is the default join type in Spark. The inner join essentially removes anything that is not common to both tables. It returns all data that has a match under the join condition (the predicate in the on argument) from both sides of the table. This means that if one of the tables is empty, the result will also be empty.

So in the example above, only Iron Man and Deadpool have entries in both tables, so the inner join only returns these rows.

But if there are duplicates in the source data, those duplicates will appear in the result as well:

Yes, the circles don't show that.

Left join / Left outer join

Left join

We looked at the inner join, which returns only those combinations of left/right table rows for which the predicate in the on expression is true. An outer join allows us to also include in the result rows of one table for which no matching rows were found in the other table.

In a left join, all rows of the left table remain unchanged, regardless of whether there is a match in the right table or not. When a match is found in the right table, its values are returned; otherwise null is returned.

You can use left or left_outer and the results are exactly the same. It is just an alias in Spark. It seems like this is a convenience for people coming from different SQL-flavor backgrounds.
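A sketch with the hypothetical heroes/races tables from above:

heroes.join(races, on="id", how="left").show()
heroes.join(races, on="id", how="left_outer").show()  # identical result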

Right join / Right outer join

Right join

The right outer join performs the same task as the left outer join, but for the right table. Thus, it returns all the rows of the right table as a result.

Here the right side of the join is the race table, therefore all data from the race table is returned. Yeah, we don't see any Kryptonians among the heroes.

Also, the right join and the right outer join yield the same result.

Full outer join

Full outer join

We use a full outer join to keep records from both tables, together with null values in the corresponding left/right columns where no match is found. This is a fairly rare case, but it is usually used in situations where you do not want to lose data from either table.

Full outer join can be considered as a combination of inner join + left join + right join.

Left semi-join

Left semi-join

Only the data from the left side that has a match on the right side will be returned, based on the condition in on. Unlike the left and inner joins, in which the right-hand table's columns are also present in the result, here the right-hand table's data is omitted from the output.

Left anti join

Left anti join

As the name suggests, it does the exact opposite of the left semi-join. It simply returns data from the left table that has no match in the right table. The result will only include the columns from the left table.

Some notes

  1. If you have not applied an alias to a DataFrame, you will get an error after creating a joined DataFrame with duplicate column names: when two columns are named the same, accessing one of the duplicate-named columns returns an error, which basically means that Spark doesn't know which column you chose.

  2. In the second parameter, you use the & (ampersand) symbol for AND and the | (pipe) symbol for OR between column conditions, as in the sketch below.
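A small sketch of combining conditions with & (again using the hypothetical heroes/races tables):

heroes.join(races, (heroes.id == races.id) & (races.race != "Kryptonian"), "inner").show()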

Source: https://luminousmen.com/post/introduction-to-pyspark-join-types

PySpark join is used to combine two DataFrames, and by chaining these you can join multiple DataFrames; it supports all basic join types available in traditional SQL, like INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN. PySpark joins are wide transformations that involve data shuffling across the network.

PySpark SQL joins come with more optimization by default (thanks to DataFrames); however, there are still some performance issues to consider while using them.

In this PySpark SQL Join tutorial, you will learn different Join syntaxes and using different Join types on two or more DataFrames and Datasets using examples.

1. PySpark Join Syntax

PySpark SQL join has the below syntax and it can be accessed directly from a DataFrame.
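The signature looks roughly like this (recent PySpark versions):

DataFrame.join(other, on=None, how=None)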

The join() operation takes the parameters below and returns a DataFrame.

  • param other: Right side of the join
  • param on: a string for the join column name
  • param how: default inner. Must be one of inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, and left_anti.

You can also write a join expression by adding where() and filter() methods on the DataFrame, and you can join on multiple columns.
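A hedged sketch of both patterns; the empDF/deptDF names refer to the example DataFrames introduced below, and branch_id is purely hypothetical:

# Join condition plus an extra where()/filter() on the joined result:
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id) \
     .where(deptDF.dept_name == "Finance") \
     .show(truncate=False)

# Joining on multiple columns (commented out because branch_id does not
# exist in the sketch data below; it is only for illustration):
# empDF.join(deptDF, (empDF.emp_dept_id == deptDF.dept_id) &
#                    (empDF.branch_id == deptDF.branch_id))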

2. PySpark Join Types

Below are the different Join Types PySpark supports.

Join String                          Equivalent SQL Join
inner                                INNER JOIN
outer, full, fullouter, full_outer   FULL OUTER JOIN
left, leftouter, left_outer          LEFT JOIN
right, rightouter, right_outer       RIGHT JOIN
cross
anti, leftanti, left_anti
semi, leftsemi, left_semi

Before we jump into the PySpark join examples, first, let's create emp and dept DataFrames. Here, the emp_id column is unique on the emp dataset, dept_id is unique on the dept dataset, and emp_dept_id from emp has a reference to dept_id on the dept dataset.
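The original listing was stripped; here is a simplified sketch with the same key columns (the rows differ from the article's, but an unmatched emp_dept_id of 50 and an unmatched dept_id of 30 are kept so the results discussed below still make sense, and superior_emp_id is included for the self-join section):

emp = [(1, "Smith", -1, 10),
       (2, "Rose", 1, 20),
       (3, "Williams", 1, 10),
       (4, "Jones", 2, 40),
       (5, "Brown", 2, 50)]
empDF = spark.createDataFrame(emp, ["emp_id", "name", "superior_emp_id", "emp_dept_id"])

dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]
deptDF = spark.createDataFrame(dept, ["dept_name", "dept_id"])

empDF.show()
deptDF.show()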

This prints the "emp" and "dept" DataFrames to the console. Refer to the complete example below for how the SparkSession object is created.

3. PySpark Inner Join DataFrame

Inner join is the default join in PySpark and it's the most commonly used. It joins two datasets on key columns; rows whose keys don't match are dropped from both datasets (emp & dept).
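A sketch using the DataFrames above:

empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "inner").show(truncate=False)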

When we apply an inner join to our datasets, it drops emp_dept_id 50 from emp and dept_id 30 from dept. Below is the result of the above join expression.

4. PySpark Full Outer Join

a.k.a. full, fullouter join: it returns all rows from both datasets; where the join expression doesn't match, it returns null on the respective record's columns.
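A sketch:

empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "outer").show(truncate=False)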

From our emp dataset, emp_dept_id with value 50 doesn't have a record on dept, hence the dept columns are null, and dept_id 30 doesn't have a record in emp, hence you see nulls on the emp columns. Below is the result of the above join expression.

5. PySpark Left Outer Join

a.k.a. leftouter join: it returns all rows from the left dataset regardless of whether a match is found on the right dataset. When the join expression doesn't match, it assigns null for that record and drops records from the right where no match is found.

From our dataset, emp_dept_id 50 doesn't have a record on the dept dataset, hence this record contains null in the dept columns (dept_name & dept_id), and dept_id 30 from the dept dataset is dropped from the results. Below is the result of the above join expression.

6. Right Outer Join

a.k.a. rightouter join is the opposite of the left join: it returns all rows from the right dataset regardless of whether a match is found on the left dataset. When the join expression doesn't match, it assigns null for that record and drops records from the left where no match is found.

From our example, the right dataset's dept_id 30 doesn't have a match on the left dataset emp, hence this record contains null in the emp columns, and emp_dept_id 50 is dropped because no match was found on the left. Below is the result of the above join expression.

7. Left Semi Join

leftsemi join is similar to an inner join, the difference being that a leftsemi join returns all columns from the left dataset and ignores all columns from the right dataset. In other words, this join returns columns only from the left dataset for records that have a match in the right dataset on the join expression; records that don't match the join expression are ignored from both the left and right datasets.
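A sketch:

empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "leftsemi").show(truncate=False)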

The same result can be achieved by using select on the result of an inner join; however, using this join type would be more efficient.

Below is the result of the above join expression.

8. Left Anti Join

leftanti join does the exact opposite of leftsemi: a leftanti join returns only the columns from the left dataset for non-matched records.
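A sketch:

empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "leftanti").show(truncate=False)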

This yields the output below.

9. PySpark Self Join

Joins are not complete without a self join. Though there is no self-join type available, we can use any of the above-explained join types to join a DataFrame to itself. The example below uses a self join.

Here, we are joining the emp dataset with itself to find out the superior's emp_id and name for all employees.
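A sketch, using the superior_emp_id column from the emp sketch above:

from pyspark.sql.functions import col

empDF.alias("emp1").join(
        empDF.alias("emp2"),
        col("emp1.superior_emp_id") == col("emp2.emp_id"),
        "inner") \
    .select(col("emp1.emp_id"), col("emp1.name"),
            col("emp2.emp_id").alias("superior_emp_id"),
            col("emp2.name").alias("superior_emp_name")) \
    .show(truncate=False)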

10. Using SQL Expression

Since PySpark SQL supports native SQL syntax, we can also write join operations after creating temporary tables on the DataFrames and use those tables with spark.sql().
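A sketch:

empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

spark.sql("SELECT e.*, d.* FROM EMP e "
          "INNER JOIN DEPT d ON e.emp_dept_id = d.dept_id").show(truncate=False)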

11. PySpark SQL Join on multiple DataFrames

When you need to join more than two tables, you either use a SQL expression after creating temporary views on the DataFrames, or you use the result of one join operation to join with another DataFrame, chaining them. For example:
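A sketch of chaining, with a hypothetical third DataFrame addressDF keyed by emp_id:

addressDF = spark.createDataFrame(
    [(1, "NYC"), (2, "LA"), (3, "SF")], ["emp_id", "city"])

empDF.join(addressDF, empDF.emp_id == addressDF.emp_id) \
     .join(deptDF, empDF.emp_dept_id == deptDF.dept_id) \
     .show(truncate=False)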

12. PySpark SQL Join Complete Example

Examples explained here are available at the GitHub project for reference.

Conclusion

In this PySpark SQL join tutorial, you have learned that two or more DataFrames can be joined using the join() function of the DataFrame, along with the join type syntax, usage, and examples in PySpark (Spark with Python). I would also recommend reading through Optimizing SQL Joins to understand the performance impact of joins.

Happy Learning !!

Source: https://sparkbyexamples.com/pyspark/pyspark-join-explained-with-examples/

PySpark Joins by Example

Summary: PySpark DataFrames have a join method which takes three parameters: the DataFrame on the right side of the join, which fields are being joined on, and what type of join to use (inner, outer, left_outer, right_outer, leftsemi). You call the join method from the left-side DataFrame object, e.g. TableA.join(TableB, ...).

One of the challenges of working with PySpark (the Python shell of Apache Spark) is that it's Python and Pandas, but with some subtle differences. For example, you can't just call .lower() on a string column to create a lowercase version of it; instead you use the lower() function from pyspark.sql.functions, and that just doesn't feel right. So I often have to reference the documentation just to get my head straight.
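A quick sketch of that difference (the data here is just illustrative):

from pyspark.sql.functions import lower, col

df = spark.createDataFrame([('Pirate', 1)], ['name', 'id'])
df = df.withColumn('name_lower', lower(col('name')))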

This pyspark tutorial is my attempt at cementing how joins work in Pyspark once and for all. I’ll be using the example data from Coding Horror’s explanation of SQL joins. For the official documentation, see here. Let’s get started!

Setting up the Data in Pyspark

valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['name','id'])

valuesB = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
TableB = spark.createDataFrame(valuesB,['name','id'])

TableA.show()
TableB.show()

Pyspark Join Data with Two Tables (A and B)

In order to create a DataFrame in PySpark, you can use a list of structured tuples. In this case, we create TableA with a 'name' and 'id' column. The createDataFrame() method takes two parameters: a list of tuples and a list of column names.

The show() command displays the contents of the DataFrame. The image above has been altered to put the two tables side by side and display a title above the tables.

The last piece we need is to create an alias for these tables. The alias, like in SQL, allows you to distinguish where each column is coming from. The columns are named the same, so how can you know if 'name' is referencing TableA or TableB? The alias provides a short name for referencing fields and for referencing the fields after the creation of the joined table.

ta = TableA.alias('ta')
tb = TableB.alias('tb')

Now we can refer to the DataFrames as ta or tb. Let's move on to the actual joins!

Pyspark Inner Join Example

inner_join = ta.join(tb, ta.name == tb.name)
inner_join.show()

Pyspark Inner Join Example Results

An inner join is the default join type used. The fully qualified code might look like ta.join(tb, ta.name == tb.name, how='inner'). Ultimately, this translates to the following SQL statement:

SELECT ta.*, tb.* FROM ta INNER JOIN tb ON ta.name = tb.name

Now if you want to reference those columns in a later step, you'll have to use the col() function and include the alias. For example, to filter the TableA id column to any row that is greater than two:
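A sketch of that filter:

from pyspark.sql.functions import col
inner_join.filter(col('ta.id') > 2).show()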

Pyspark Left Join Example

left_join = ta.join(tb, ta.name == tb.name, how='left')  # Could also use 'left_outer'
left_join.show()

Pyspark Left Join Example

Notice that TableA is the left-hand side of the query. You are calling join on the ta DataFrame. So it's just like in SQL, where the FROM table is the left-hand side of the join. You can also think of it as reading from left to right, so TableA is the left-most table being referenced.

You can use ‘left’ or ‘left_outer’ and the results are exactly the same. It seems like this is a convenience for people coming from different SQL flavor backgrounds.

Notice how the results now include ‘null’ values. In the example below, you can use those nulls to filter for these values.

Pyspark Left Join and Filter Example

left_join = ta.join(tb, ta.name == tb.name, how='left')  # Could also use 'left_outer'
left_join.filter(col('tb.name').isNull()).show()

Pyspark Left Join and Filter Example

Using the isNull() or isNotNull() methods, you can filter a column with respect to the null values inside of it. As in SQL, this is very handy if you want to get the records found in the left side but not found in the right side of a join.

Pyspark Right Join Example

right_join = ta.join(tb, ta.name == tb.name, how='right')  # Could also use 'right_outer'
right_join.show()

Pyspark Right Join Example

Again, the code is read from left to right so table A is the left side and table B is the right side. If you want to select all records from table B and return data from table A when it matches, you choose ‘right’ or ‘right_outer’ in the last parameter. As in the example above, you could combine this with the isNull to identify records found in the right table but not found in the left table.

Pyspark Full Outer Join Example

full_outer_join = ta.join(tb, ta.name == tb.name, how='full')  # Could also use 'full_outer'
full_outer_join.show()

Pyspark Full Outer Join Example

Finally, we get to the full outer join. This shows all records from the left table and all the records from the right table and nulls where the two do not match.

PysPark SQL Joins Gotchas and Misc

If you’re paying attention, you’ll notice a couple issues that makes using Pyspark SQL joins a little annoying when coming from a SQL background.

1. Alias References: If you do not apply an alias to the dataframe, you’ll receive an error after you create your joined dataframe. With two columns named the same thing, referencing one of the duplicate named columns returns an error that essentially says it doesn’t know which one you selected. In SQL Server and other languages, the SQL engine wouldn’t let that query go through or it would automatically append a prefix or suffix to that field name.

2. Cross Join: You can also perform a cartesian product using the crossJoin() method. This is useful if you're looking to repeat every row in table A for every row in table B.

3. The ‘leftsemi’ option: I didn’t cover this option above (since Jeff Atwood didn’t either). but if you care only for the left columns and just want to pull in the records that match in both table A and table B, you can choose ‘leftsemi’. This StackOverflow post gives a good explanation.

4. Joining on Multiple Columns: In the second parameter, you use the & (ampersand) symbol for AND and the | (pipe) symbol for OR between columns.

And that’s it! I hope you learned something about Pyspark joins! If you feel like going old school, check out my post on Pyspark RDD Examples. But DataFrames are the wave of the future in the Spark world so keep pushing your Pyspark SQL knowledge!

Source: https://www.learnbymarketing.com/1100/pyspark-joins-by-example/

Merge two DataFrames in PySpark

In this article, we will learn how to merge multiple data frames row-wise in PySpark. Outside of chaining unions, this is the only way to do it for DataFrames. The module used is pyspark:

Spark (an open-source big-data processing engine by Apache) is a cluster computing system. It is faster compared to other cluster computing systems (such as Hadoop). It provides high-level APIs in Python, Scala, and Java. Parallel jobs are easy to write in Spark. We will cover PySpark (Python + Apache Spark) because this will make the learning curve flatter. To install Spark on a Linux system, follow this. To run Spark in a multi-cluster system, follow this.


To do our task, we define a function called unionAll() that is applied recursively over all the input dataframes, unioning them one by one. To union, we use the pyspark module:



  • DataFrame union() – the union() method of DataFrame is used to combine two DataFrames with an equivalent structure/schema. If the schemas aren't equivalent, it returns an error.
  • DataFrame unionAll() – unionAll() has been deprecated since Spark 2.0.0 and replaced with union().

Note: In other SQL dialects, UNION eliminates duplicates but UNION ALL combines two datasets including duplicate records. In Spark, however, both behave the same way, and you can use the DataFrame dropDuplicates() function to get rid of duplicate rows.

 At the last call, it returns the required resultant dataframe. The following code represents the logic behind our solution to the given problem.
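The listing itself was stripped from the page; the standard pattern the text describes looks like this:

import functools
from pyspark.sql import DataFrame

def unionAll(*dfs):
    # Repeatedly union pairs of DataFrames until a single DataFrame remains.
    # DataFrame.unionAll is a deprecated alias of DataFrame.union.
    return functools.reduce(DataFrame.unionAll, dfs)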


What happens is that it takes all the DataFrames you passed as parameters and reduces them using unionAll (the reduce here comes from Python's functools, not Spark's RDD reduce, although they work similarly), which eventually collapses them into one DataFrame.

If instead of DataFrames you have ordinary RDDs, you can pass a list of them to the union function of your SparkContext.

Examples:

Sometimes, when the dataframes to combine do not have the same order of columns, it is better to use df2.select(df1.columns) to ensure both dataframes have the same column order before the union.
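A sketch of that column-alignment step (assuming the dataframes share the same column names, just in a different order):

import functools
from pyspark.sql import DataFrame

def unionAll_by_order(dfs):
    # Reorder every DataFrame's columns to match the first one before unioning.
    first = dfs[0]
    aligned = [df.select(first.columns) for df in dfs]
    return functools.reduce(DataFrame.unionAll, aligned)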


The reduce(fun, seq) function is used to apply a particular function, passed as its argument, to all of the list elements in the sequence passed along. This function is defined in the functools module.



Now, let’s understand the whole process with the help of some examples.

Example 1: 

In this example, we create three dataframes with columns 'a' and 'b' containing some random values, pass all three dataframes to our above-created method unionAll(), get the resultant dataframe as output, and show the result.
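A sketch consistent with that description (the values are made up):

import functools
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

def unionAll(*dfs):
    return functools.reduce(DataFrame.unionAll, dfs)

df1 = spark.createDataFrame([[1, 1], [2, 2]], ['a', 'b'])
df2 = spark.createDataFrame([[3, 3], [4, 4]], ['a', 'b'])
df3 = spark.createDataFrame([[5, 5], [6, 6]], ['a', 'b'])

unionAll(df1, df2, df3).show()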


Output:

Example 2: 

In this example, we create dataframes with columns 'a' and 'b' of some random values and pass all three dataframes to our newly created method unionAll(), in which we are not focusing on the names of the columns. We just union each input dataframe with the next one, get the resultant dataframe as output, and show the result.


Output:




Source: https://www.geeksforgeeks.org/merge-two-dataframes-in-pyspark/
